This is the code repository for Feature-PI.
Prepare the python environment by executing the scripts below:
$ cd Feature-PI
$ pip install -r requirements.txtFor Mac users, you might want to install libomp as required by LightGBM:
brew install libompUse the following command to download the whole Feature-PI data from HuggingFace.
$ cd Feature-PI
$ mkdir data
$ export HF_HUB_DOWNLOAD_TIMEOUT=300
$ huggingface-cli download yinlou/Feature-PI --repo-type dataset --local-dir dataThis should download the whole data repo from HuggingFace. The data repo is very large (60GB+) and contains lots of .csv and .json files. It is expected to take couple of hours to finish downloading.
The permutation feature importance scores are pre-computed using random seed 1 to 5 and can be found in each folder (representing a dataset) as permutation_importance_s[1-5].json.
To manually generate labels for binary classification datasets with random seed 2, use the following command under Feature-PI directory:
$ ./scripts/generate_labels_binary_classification.sh data/training/binary_classification 2This will create in each folder under data/training/binary_classification a json file named permutation_importance_s2.json and it should be the same as permutation_importance_v2.json. It contains 15 trials of permutation feature importance and is used as labels for learning to estimate feature importance for that dataset.
Similarly, for regression datasets, you can use the following command:
$ ./scripts/generate_labels_regression.sh data/training/regression 1This command creates in each folder under data/training/regression a json file named permutation_importance_s1.json. It contains 15 trials of permutation feature importance and is used as labels for learning to estimate feature importance for that dataset.
Suppose we want to evaluate PI-ensemble for "[UCI]Arrhythmia in the benchmark set, we could use the following command:
python benchmark/eval.py -d ./ -f "[UCI]Arrhythmia" -t binary_classification -e pi_ensembleThis will go into each cross validation fold under "[UCI]Arrhythmia and generate a file called pi_ensemble_eval_result.json, which contains AUC (for binary classification problems) on 5%, 10%, 15%, 20%, and 100% of feature subset, ranked according to feature importance.