- Information about the dataset, sampling rate, pipeline stages, etc. is entered in config.yaml.
- This code requires Python 3 and gcc to be installed on the system.
- Install the libraries required to run the RLA code:
sudo apt install libxml2-dev libboost-regex-dev libmpich-dev libboost-log-dev
- We generate a random sample from the dataset. (The sampling rate is set in config.yaml.)
- We use the RLA_CL code, written in C++, to generate the processed data. (RLA_CL implements a record linkage algorithm published in 2016.)
- The processed data file contains the dataset in a format suitable for association analysis. It holds many transactions (generated by string comparison of record pairs); duplicates among them lead to mismatches, so we prune them to improve the speed of the pipeline. (See the sketch after this list for how a transaction is formed.)
- The attribute selection algorithm, based on association rule mining, is run on the processed dataset.
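The exact transaction format is defined by RLA_CL, but as a minimal sketch (the records, values, and layout below are invented for illustration, not taken from the repo), a transaction can be thought of as the vector of per-attribute edit distances for one record pair:

```python
# Hypothetical illustration: deriving one "transaction" from a record pair.
# Each aligned attribute of the pair is compared with Levenshtein distance;
# the resulting vector of distances is one transaction.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two records from different input files (id column excluded), attribute-aligned.
record_a = ["john", "smith", "1982", "boston"]
record_b = ["jon",  "smith", "1982", "bostan"]

transaction = [levenshtein(x, y) for x, y in zip(record_a, record_b)]
print(transaction)  # [1, 0, 0, 1] -> per-attribute edit distances
```

Identical attribute values yield 0, while small typos yield 1 or 2, which is why THRESHOLD values of 1 or 2 work well in the configuration below.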
- Create a virtual environment to manage the dependencies (https://docs.python.org/3/library/venv.html).
- Run the following command to install the Python package dependencies:
python3 -m pip install -r requirements.txt
- Place the dataset files (at least two are required) into the data subdirectory.
- Update the config.yaml file by setting the input_files attribute to the names of the input dataset files. (A full example config sketch follows this list.)
- Update the header attribute with the names of the attributes in your dataset. (Make sure the header row is not present in the dataset files themselves.)
- Set the id_column attribute to the column that holds the unique identifier in your dataset.
- Update THRESHOLD with a value suitable for your dataset; this helps the program link records correctly even when errors such as typos are present. The value is usually 1 or 2, and it must be a positive integer since Levenshtein distance is used. (For example, with THRESHOLD set to 1, "jon" still matches "john".)
- If you are going to create a sample, update the SAMPLE_RATE attribute with the sampling percentage.
- Since k-mer based blocking is used during sample generation, BLOCKING_ATTRIBUTE identifies the index of the attribute to use for blocking. (Blocking speeds up the computation of pairs, so choosing the right attribute affects speed.)
- Populate COMPARISON_ATTRIBUTE based on the number of attributes in your dataset. For example, if your dataset has 5 attributes and the 0th attribute is the unique identifier, then
[1 2 3 4]
is the correct value for the comparison attribute. (All attributes are used so that every one of them is included in the attribute selection algorithm.)
- If your dataset is small (fewer than 50,000 records in total), you might not want to generate a sample. In that case, disable sample_generation inside the run attribute by setting it to false.
- When not sampling, update the sample_output attribute with the path to the input dataset (the same value as input_files).
- DO NOT touch sample_output if you are using sampling.
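For reference, here is a sketch of what a filled-in config.yaml might look like for a hypothetical 5-attribute dataset. The attribute names come from the steps above, but the file names, header names, and exact nesting and value formats are assumptions; match them to your actual copy of config.yaml:

```yaml
input_files: [data/voters_a.csv, data/voters_b.csv]  # at least two dataset files (hypothetical names)
header: [id, first_name, last_name, year, city]      # files themselves contain no header row
id_column: 0                # 0th attribute is the unique identifier
THRESHOLD: 1                # positive integer (Levenshtein distance cutoff)
SAMPLE_RATE: 10             # sampling percentage, used when sample_generation is enabled
BLOCKING_ATTRIBUTE: 2       # index of the attribute used for k-mer blocking
COMPARISON_ATTRIBUTE: [1 2 3 4]  # all non-id attributes, as in the example above
sample_output: data/sample.csv   # leave untouched when sampling
run:
  sample_generation: true   # stage flags are discussed below
```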
- Run
python3 setup.py build_ext --inplace
inside the src folder. This compiles the Cython code.
- To run the program, run the file
main.py
inside the src folder.
- It is advisable to create the sample (Stage 1) first by setting sample_generation to True inside the run attribute while keeping every other attribute inside run set to False. This creates the sample and stores it in the appropriate location.
- The other stages of the pipeline can then be run together by setting their sub-attributes inside run to True while keeping sample_generation set to False. (This is required so that the program doesn't crash; see the phase sketch below.)
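As a sketch of the two phases (only sample_generation is named in this README; the other stage flag names below are hypothetical placeholders for whatever stages your config.yaml lists under run):

```yaml
# Phase 1: generate the sample only
run:
  sample_generation: true
  data_processing: false      # hypothetical flag name
  attribute_selection: false  # hypothetical flag name
```

```yaml
# Phase 2: run the remaining stages together
run:
  sample_generation: false
  data_processing: true       # hypothetical flag name
  attribute_selection: true   # hypothetical flag name
```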
- Once all stages have completed, the selected attributes can be found in rules_out.csv inside the data folder.
- The headers of the output are the attributes of the input dataset, written as integer indices [0,1,2,3], followed by M, lift, leverage, and conviction, where M indicates the value of Y in the association rule (X --> Y).
- The other cells hold values such as 0, 1, or 2, which are the edit distances for the respective attributes.
- The output is sorted with the best rules at the top (those with high values for lift, conviction, and leverage).
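Purely to illustrate the layout (the header line matches the description above, but every number in the data row is made up, and leaving cells blank for attributes outside a rule is an assumption), the top of rules_out.csv might look like:

```
0,1,2,3,M,lift,leverage,conviction
1,0,,0,1,4.2,0.031,2.7
```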