GDBS: A hybrid feature selection method for small-sample, high-dimensional datasets
Python code guide and documentation: GDBS
Please refer to https://github.com/hnematzadeh/EMC-DWES/tree/main/Datasets for the datasets used.
- Import Libraries and Prepare Data
Key Libraries: NumPy, Pandas, Scikit-learn, and XGBoost (for data processing, modeling, and evaluation).
Data Loading:
Reads data from Excel/CSV files (alternative dataset lines are commented out).
Example: df = pd.read_excel('/datasets/colon.xlsx', header=None)
Preprocessing:
Separates features (X) and labels (y).
Standardizes features using StandardScaler().
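A minimal loading-and-preprocessing sketch. The label column position is an assumption (here, the last column); adjust it to match the layout of the dataset you load:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load one dataset; uncomment the line for the dataset you want to use
df = pd.read_excel('/datasets/colon.xlsx', header=None)

# Assumption: features occupy all columns except the last, which holds labels
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Standardize features to zero mean and unit variance
X = StandardScaler().fit_transform(X)
```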
- Gini Impurity Calculation
Goal: Rank features based on their ability to separate classes.
Function gini_based_impurity(arr):
Computes Gini impurity for a sorted feature array.
Lower impurity = Higher feature importance.
Process:
Sorts data by each feature and calculates its Gini impurity.
Selects the top 10% of features (Xn) with the lowest impurity.
Output: A subset of high-impact features.
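A sketch of this ranking step. The exact body of gini_based_impurity is not reproduced here; the version below is one plausible reading that scores a feature by the lowest weighted Gini impurity over all binary split points of the labels sorted by that feature:
```python
import numpy as np

def gini_based_impurity(arr):
    """Lowest weighted Gini impurity over all binary splits of a label
    array sorted by a feature's values (assumed interpretation)."""
    def gini(seg):
        _, counts = np.unique(seg, return_counts=True)
        p = counts / len(seg)
        return 1.0 - np.sum(p ** 2)

    n = len(arr)
    return min(
        (s / n) * gini(arr[:s]) + ((n - s) / n) * gini(arr[s:])
        for s in range(1, n)
    )

limit = 0.10  # keep the top 10% of features (lowest impurity)
impurities = np.array([gini_based_impurity(y[np.argsort(X[:, j])])
                       for j in range(X.shape[1])])
top_idx = np.argsort(impurities)[:max(1, int(limit * X.shape[1]))]
Xn = X[:, top_idx]
```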
- Feature Discretization via K-Means Clustering
Goal: Convert continuous features to discrete values for improved model performance.
Steps:
• For each selected feature:
- Reshapes the feature into a 2-D array for clustering.
- Tests cluster counts from the number of classes (max(y)) to ceil(sqrt(n_samples)/2).
- Uses Normalized Mutual Information (NMI) to evaluate the alignment between cluster labels and the true class labels.
- Retains the cluster configuration with the highest NMI.
• Aggregates the discretized features into a new matrix (clusters); see the sketch below.
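A sketch of the discretization loop, continuing from Xn and y above. The cluster-count range and the NMI criterion follow the steps listed; treating max(y) as the class count assumes labels are numbered from 1:
```python
from math import ceil, sqrt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

n_samples = Xn.shape[0]
p = ceil(sqrt(n_samples) / 2)   # maximum cluster count
k_min = int(np.max(y))          # assumes labels 1..C, so max(y) = class count

discretized_cols = []
for j in range(Xn.shape[1]):
    col = Xn[:, j].reshape(-1, 1)            # KMeans expects a 2-D array
    best_nmi, best_labels = -1.0, None
    for k in range(k_min, p + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=42).fit_predict(col)
        nmi = normalized_mutual_info_score(y, labels)
        if nmi > best_nmi:                   # keep the best-aligned clustering
            best_nmi, best_labels = nmi, labels
    discretized_cols.append(best_labels)

clusters = np.column_stack(discretized_cols)  # matrix of discretized features
```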
- Bootstrapped Feature Subset Selection (FSS)
Goal: Further refine features using bootstrap sampling and forward selection.
Steps:
• Bootstrap Sampling: Generates 5 bootstrapped datasets.
• Forward Selection:
- Starts with the best single feature (highest cross-validation accuracy).
- Iteratively adds features that maximize accuracy.
- Uses a Decision Tree classifier for evaluation.
• Aggregate Results:
- Features selected in more than 3 bootstrap samples are retained; see the sketch below.
Output: Final feature subset (final_X).
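A sketch of the bootstrapped FSS step, continuing from clusters and y above. The 5-sample and more-than-3-votes thresholds come from the description; the vote-counting helper is mine:
```python
from collections import Counter
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

num_datasets = 5
n = clusters.shape[0]
rng = np.random.RandomState(42)
votes = Counter()

for _ in range(num_datasets):
    idx = rng.choice(n, size=n, replace=True)      # bootstrap sample
    Xb, yb = clusters[idx], y[idx]

    selected, remaining, best_acc = [], list(range(Xb.shape[1])), -1.0
    improved = True
    while improved and remaining:
        improved, best_feat = False, None
        for f in remaining:                        # try each candidate feature
            acc = cross_val_score(DecisionTreeClassifier(random_state=42),
                                  Xb[:, selected + [f]], yb, cv=5).mean()
            if acc > best_acc:
                best_acc, best_feat, improved = acc, f, True
        if improved:
            selected.append(best_feat)
            remaining.remove(best_feat)

    votes.update(selected)

# Retain features chosen in more than 3 of the 5 bootstrap runs
top_selected_Indices = sorted(f for f, c in votes.items() if c > 3)
final_X = clusters[:, top_selected_Indices]
```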
- Model Evaluation
- Classifiers used:
Decision Tree, Naive Bayes, Random Forest, SVM (RBF kernel), MLP, KNN, XGBoost.
- Datasets Evaluated:
Full dataset (all features).
Gini-selected features (Xn).
Discretized features (top_NMI_matrix).
Final GDBS-selected features (final_X).
- Metrics Reported:
Accuracy, MCC, Precision, Recall, F1 Score, Balanced Accuracy.
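A sketch of the evaluation loop over all classifier/feature-set combinations. The scorer strings are standard scikit-learn names; labels are re-encoded to start at 0 because XGBoost requires it (an implementation detail, not part of GDBS):
```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'MLP': MLPClassifier(max_iter=1000, random_state=42),
    'KNN': KNeighborsClassifier(),
    'XGBoost': XGBClassifier(random_state=42),
}

scoring = {
    'Accuracy': 'accuracy',
    'MCC': make_scorer(matthews_corrcoef),
    'Precision': 'precision_weighted',
    'Recall': 'recall_weighted',
    'F1': 'f1_weighted',
    'Balanced Accuracy': 'balanced_accuracy',
}

feature_sets = {'Full': X, 'Gini (Xn)': Xn,
                'Discretized (top_NMI_matrix)': clusters,
                'GDBS (final_X)': final_X}

y_enc = LabelEncoder().fit_transform(y)   # 0-based labels for XGBoost
rows = []
for ds_name, data in feature_sets.items():
    for clf_name, clf in classifiers.items():
        cv = cross_validate(clf, data, y_enc, cv=5, scoring=scoring)
        rows.append({'Dataset': ds_name, 'Classifier': clf_name,
                     **{m: cv[f'test_{m}'].mean() for m in scoring}})

results_df = pd.DataFrame(rows)
```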
- Save Results
Results Summary:
- Metrics for all combinations of algorithms and datasets are stored in results_df.xlsx.
Key Outputs:
Indices of top discretized features (top_NMI_Indices).
Final selected features (top_selected_Indices).
Execution times for Gini, discretization, and FSS steps.
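For example, the summary table and the final indices can be written out like this (variable names follow the sketches above):
```python
# Save per-algorithm, per-dataset metrics to Excel
results_df.to_excel('results_df.xlsx', index=False)

# Key index outputs (top_idx comes from the Gini step)
print('Gini-selected feature indices:', top_idx.tolist())
print('Final selected feature indices (top_selected_Indices):',
      top_selected_Indices)
```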
- Key Notes
Designed for small-sample, high-dimensional data (e.g., genomic datasets).
Flexibility: Adjust parameters like limit (top feature percentage), p (max clusters), or num_datasets (bootstrap samples).
Reproducibility: Uses random_state=42 for deterministic results.
Suppressed Warnings: Non-critical warnings are ignored to ensure cleaner output.
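For reference, these knobs map to the following variables in the sketches above, with the defaults this guide describes:
```python
from math import ceil, sqrt

limit = 0.10                     # fraction of features kept by the Gini ranking
p = ceil(sqrt(n_samples) / 2)    # maximum K-means cluster count per feature
num_datasets = 5                 # number of bootstrap samples in the FSS step
random_state = 42                # seed used throughout for reproducibility
```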
- To run the code:
Ensure dataset paths are correct.
Install required libraries (scikit-learn, pandas, numpy, xgboost, etc.).
Uncomment the desired dataset line (e.g., /datasets/colon.xlsx).