- Information about the dataset, sampling rate, pipeline stages, etc. is entered in config.yaml.
- This code requires Python 3 and gcc to be installed on the system.
- Install the libraries required to run the RLA code:
sudo apt install libxml2-dev libboost-regex-dev libmpich-dev libboost-log-dev
- We generate a random sample from the dataset. (The sampling rate is set in config.yaml.)
- We use the RLA_CL code, written in C++, to generate the processed data. (RLA_CL implements a record linkage algorithm published in 2016.)
- The processed data file contains the dataset in a format suitable for association analysis. It holds many transactions (generated by string comparison of record pairs); duplicates among them lead to mismatches, so we prune them to improve the speed of the pipeline. (See the sketch after this list for how a transaction is formed.)
- The attribute selection algorithm, based on association rule mining, is run on the processed dataset.
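The exact transaction format is defined by RLA_CL, but as a minimal sketch (the records, values, and layout below are invented for illustration, not taken from the repo), a transaction can be thought of as the vector of per-attribute edit distances for one record pair:

```python
# Hypothetical illustration: deriving one "transaction" from a record pair.
# Each aligned attribute of the pair is compared with Levenshtein distance;
# the resulting vector of distances is one transaction.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two records from different input files (id column excluded), attribute-aligned.
record_a = ["john", "smith", "1982", "boston"]
record_b = ["jon",  "smith", "1982", "bostan"]

transaction = [levenshtein(x, y) for x, y in zip(record_a, record_b)]
print(transaction)  # [1, 0, 0, 1] -> per-attribute edit distances
```

Identical attribute values yield 0, while small typos yield 1 or 2, which is why THRESHOLD values of 1 or 2 work well in the configuration below.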
- Create a virtual environment to manage the dependencies (https://docs.python.org/3/library/venv.html).
- Run the following command to install the Python package dependencies:
python3 -m pip install -r requirements.txt
- Place the dataset files (at least two are required) into the data subdirectory.
- Update the config.yaml file by setting the input_files attribute to the names of the input dataset files. (A full example config sketch follows this list.)
- Update the header attribute with the names of the attributes in your dataset. (Make sure the header row is not present in the dataset files themselves.)
- Set the id_column attribute to the column that holds the unique identifier in your dataset.
- Update THRESHOLD with a value suitable for your dataset; this helps the program link records correctly even when errors such as typos are present. The value is usually 1 or 2, and it must be a positive integer since Levenshtein distance is used. (For example, with THRESHOLD set to 1, "jon" still matches "john".)
- If you are going to create a sample, update the SAMPLE_RATE attribute with the sampling percentage.
- Since k-mer based blocking is used during sample generation, BLOCKING_ATTRIBUTE identifies the index of the attribute to use for blocking. (Blocking speeds up the computation of pairs, so choosing the right attribute affects speed.)
- Populate COMPARISON_ATTRIBUTE based on the number of attributes in your dataset. For example, if your dataset has 5 attributes and the 0th attribute is the unique identifier, then
[1 2 3 4]
is the correct value for the comparison attribute. (All attributes are used so that every one of them is included in the attribute selection algorithm.)
- If your dataset is small (fewer than 50,000 records in total), you might not want to generate a sample. In that case, disable sample_generation inside the run attribute by setting it to false.
- When not sampling, update the sample_output attribute with the path to the input dataset (the same value as input_files).
- DO NOT touch sample_output if you are using sampling.
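For reference, here is a sketch of what a filled-in config.yaml might look like for a hypothetical 5-attribute dataset. The attribute names come from the steps above, but the file names, header names, and exact nesting and value formats are assumptions; match them to your actual copy of config.yaml:

```yaml
input_files: [data/voters_a.csv, data/voters_b.csv]  # at least two dataset files (hypothetical names)
header: [id, first_name, last_name, year, city]      # files themselves contain no header row
id_column: 0                # 0th attribute is the unique identifier
THRESHOLD: 1                # positive integer (Levenshtein distance cutoff)
SAMPLE_RATE: 10             # sampling percentage, used when sample_generation is enabled
BLOCKING_ATTRIBUTE: 2       # index of the attribute used for k-mer blocking
COMPARISON_ATTRIBUTE: [1 2 3 4]  # all non-id attributes, as in the example above
sample_output: data/sample.csv   # leave untouched when sampling
run:
  sample_generation: true   # stage flags are discussed below
```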
- Run
python3 setup.py build_ext --inplace
inside the src folder. This compiles the Cython code.
- To run the program, run the file
main.py
inside the src folder.
- It is advisable to create the sample (Stage 1) first by setting sample_generation to True inside the run attribute while keeping every other attribute inside run set to False. This creates the sample and stores it in the appropriate location.
- The other stages of the pipeline can then be run together by setting their sub-attributes inside run to True while keeping sample_generation set to False. (This is required so that the program doesn't crash; see the phase sketch below.)
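As a sketch of the two phases (only sample_generation is named in this README; the other stage flag names below are hypothetical placeholders for whatever stages your config.yaml lists under run):

```yaml
# Phase 1: generate the sample only
run:
  sample_generation: true
  data_processing: false      # hypothetical flag name
  attribute_selection: false  # hypothetical flag name
```

```yaml
# Phase 2: run the remaining stages together
run:
  sample_generation: false
  data_processing: true       # hypothetical flag name
  attribute_selection: true   # hypothetical flag name
```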
- Once all stages have completed, the selected attributes can be found in rules_out.csv inside the data folder.
- The headers of the output are the attributes of the input dataset, written as integer indices [0,1,2,3], followed by M, lift, leverage, and conviction, where M indicates the value of Y in the association rule (X --> Y).
- The other cells hold values such as 0, 1, or 2, which are the edit distances for the respective attributes.
- The output is sorted with the best rules at the top (those with high values for lift, conviction, and leverage).
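Purely to illustrate the layout (the header line matches the description above, but every number in the data row is made up, and leaving cells blank for attributes outside a rule is an assumption), the top of rules_out.csv might look like:

```
0,1,2,3,M,lift,leverage,conviction
1,0,,0,1,4.2,0.031,2.7
```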