This project implements a machine learning pipeline for training and evaluating classifiers (Random Forest and Multi-Layer Perceptron) on tabular data, with additional functionality for smoothing predictions and generating visualization plots. The script processes CSV files, trains models, evaluates their performance, and outputs predictions alongside visualizations for test samples.
The code is designed to:
- Load and preprocess training data from multiple CSV files.
- Train a Random Forest Classifier (or optionally a Multi-Layer Perceptron) on the data.
- Evaluate the model using a classification report.
- Process test samples, smooth predictions, and generate scatter plot visualizations comparing predicted and true labels against a depth variable.
The pipeline is configurable via a config dictionary, allowing customization of file paths, feature names, and hyperparameters.
- Python 3.x
- Libraries:
  - `scikit-learn` (for machine learning models and metrics)
  - `pandas` (for data manipulation)
  - `numpy` (for numerical operations)
  - `matplotlib` (for plotting)
  - `os` (standard library, for file system operations)
  - `glob` (standard library, for file pattern matching)
  - `csv` (standard library, for CSV file handling)
Install the required libraries using pip:

    pip install scikit-learn pandas numpy matplotlib

- Training Data: CSV files in the `./training/` directory.
- Test Data: CSV files in the `./testing/` directory.
- Output: Predictions and plots saved to the `./output/` directory.
- Main Script: The provided Python script (e.g., `pipeline.py`).
- Prepare Data:
  - Place training CSV files in the `./training/` directory.
  - Place test CSV files in the `./testing/` directory.
  - Ensure all CSV files have columns matching the `feature_names`, `label_name`, and `depth_name` specified in the `config` dictionary.
- Directory Structure:

      project/
      ├── training/     # Directory with training CSV files
      ├── testing/      # Directory with test CSV files
      ├── output/       # Directory for output predictions and plots (created automatically)
      └── pipeline.py   # The main script

- Run the Script:
  - Execute the script in a Python environment: `python pipeline.py`
The config dictionary at the bottom of the script defines the pipeline settings:
- `train_path`: Path to the consolidated training CSV file (`./train.csv`).
- `test_path`: Directory containing test CSV files (`./testing`).
- `output_path`: Directory for saving predictions and plots (`./output`).
- `feature_names`: List of feature columns in the CSV files (e.g., `['qc', 'fs', 'u2', 'Level']`).
- `label_name`: Target column for classification (e.g., `'Label'`).
- `depth_name`: Column representing depth or a similar variable (e.g., `'Level'`).
- `test_size`: Fraction of data for testing (e.g., `0.2`).
- `window_size`: Smoothing window size for predictions (e.g., `8`).
- `mapping`: Dictionary mapping label strings to integers (e.g., `'MD': 0, 'ALL-c (pal)': 1, ...`).
Modify the config dictionary as needed for your dataset.
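As a rough illustration, a `config` dictionary built from the example values listed above might look like the sketch below. The `mapping` here contains only the two entries this README spells out; the remaining classes are not listed and have to come from your own dataset. The `inverse_mapping` helper is an assumption added for illustration, not something the script is known to define.

```python
# Illustrative config; values mirror the examples given in this README.
config = {
    "train_path": "./train.csv",   # consolidated training CSV
    "test_path": "./testing",      # directory of test CSV files
    "output_path": "./output",     # predictions and plots are written here
    "feature_names": ["qc", "fs", "u2", "Level"],
    "label_name": "Label",
    "depth_name": "Level",
    "test_size": 0.2,              # fraction of data held out for testing
    "window_size": 8,              # smoothing window size
    # Only the first two entries are given in this README; add the
    # remaining classes of your dataset here.
    "mapping": {"MD": 0, "ALL-c (pal)": 1},
}

# Hypothetical helper: integer code -> label string, e.g. for plot legends.
inverse_mapping = {code: name for name, code in config["mapping"].items()}
```

Keeping every path and column name in one dictionary means the rest of the pipeline can be pointed at a different dataset without touching the function bodies.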
- Data Loading:
  - Combines all CSV files in `./training/` into a single deduplicated DataFrame, saved as `./train.csv`.
  - Splits the data into training and test sets using `train_test_split`.
- Model Training:
  - Trains a Random Forest Classifier with parameters specified in `rf_param` (e.g., `n_estimators=100`, `max_depth=100`).
  - Optionally trains a Multi-Layer Perceptron (MLP) if uncommented, with parameters in `mlp_param`.
- Evaluation:
  - Prints feature importance (for Random Forest) and a classification report comparing predictions to true labels.
- Test Sample Processing:
  - Loads test CSV files from `./testing/`.
  - Predicts labels, smooths them using a custom `smooth` function (based on a sliding window mode filter), and saves predictions to CSV files in `./output/`.
- Visualization:
  - Generates scatter plots comparing predicted and true labels against depth, saved as PNG files in `./output/`.
  - Uses a color map to represent different classes, with a legend based on the `mapping` dictionary.
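The data loading, training, and evaluation steps above could be sketched roughly as follows. This is not the script's actual code: the helper name `load_training_frame`, the synthetic data, the point at which `fillna(0)` is applied, the label encoding via `mapping`, and the fixed `random_state` are all assumptions made so the example runs on its own in a temporary directory.

```python
import glob
import os
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def load_training_frame(training_dir, train_path):
    """Concatenate every CSV in training_dir, drop duplicate rows, save the result."""
    files = sorted(glob.glob(os.path.join(training_dir, "*.csv")))
    frames = [pd.read_csv(f) for f in files]
    data = pd.concat(frames, ignore_index=True).drop_duplicates()
    data = data.fillna(0)  # assumption: missing values are filled at load time
    data.to_csv(train_path, index=False)
    return data


features = ["qc", "fs", "u2", "Level"]
mapping = {"MD": 0, "ALL-c (pal)": 1}

with tempfile.TemporaryDirectory() as tmp:
    training_dir = os.path.join(tmp, "training")
    os.makedirs(training_dir)

    # Synthetic stand-in for real training files (40 distinct rows).
    rows = pd.DataFrame({
        "qc": [float(i % 10) for i in range(40)],
        "fs": [float(i % 7) for i in range(40)],
        "u2": [float(i % 5) for i in range(40)],
        "Level": [-0.5 * i for i in range(40)],
        "Label": ["MD" if i < 20 else "ALL-c (pal)" for i in range(40)],
    })
    # The two files overlap on rows 15..24, so deduplication matters.
    rows.iloc[:25].to_csv(os.path.join(training_dir, "a.csv"), index=False)
    rows.iloc[15:].to_csv(os.path.join(training_dir, "b.csv"), index=False)

    data = load_training_frame(training_dir, os.path.join(tmp, "train.csv"))

X = data[features]
y = data["Label"].map(mapping)  # assumption: labels are encoded with `mapping`
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, max_depth=100, random_state=0)
clf.fit(X_train, y_train)

importances = dict(zip(features, clf.feature_importances_))
report = classification_report(y_test, clf.predict(X_test))
print(importances)
print(report)
```

Dropping duplicates before the split keeps identical rows from landing in both the training and the held-out set, which would otherwise inflate the reported scores.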
- `get_dataset(config)`: Loads and splits training data.
- `get_test_samples(config)`: Loads test samples from CSV files.
- `train_rf(data, param)`: Trains and evaluates a Random Forest Classifier.
- `train_mlp(data, param)`: Trains and evaluates an MLP Classifier (optional).
- `smooth(series, w)`: Smooths a series of predictions using a mode-based filter with a specified window size.
- `plot(clf, config, test_sample)`: Generates and saves a visualization for a test sample.
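To make the mode-based smoothing concrete, here is a minimal pure-Python sketch of a sliding window mode filter with the same `smooth(series, w)` signature. It is an illustration of the idea, not the script's implementation: the centered window, the shrinking window at the edges, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def smooth(series, w):
    """Replace each label with the most common label in a window around it.

    The window is centered on each position and spans w // 2 samples on
    either side; it shrinks near the ends of the series.
    """
    labels = list(series)
    half = w // 2
    smoothed = []
    for i, current in enumerate(labels):
        window = labels[max(0, i - half): i + half + 1]
        counts = Counter(window)
        best = max(counts.values())
        if counts[current] == best:
            # On a tie, keep the original label rather than flipping it.
            smoothed.append(current)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


noisy = [0, 0, 0, 1, 0, 0, 2, 2, 2, 2, 0, 2, 2]
print(smooth(noisy, 4))  # isolated labels at positions 3 and 10 are replaced
```

Isolated predictions that disagree with their neighbors are overwritten, while a genuine transition between two long runs of labels is preserved.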
- Model Selection: Uncomment the MLP training lines and adjust `mlp_param` to use an MLP instead of Random Forest.
- Hyperparameters: Modify `rf_param` or `mlp_param` to tune model performance.
- Smoothing: Adjust `window_size` in `config` to control the smoothing effect.
- Features and Labels: Update `feature_names`, `label_name`, and `mapping` in `config` to match your dataset.
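The script switches models by commenting and uncommenting lines. If you prefer a switch in code, one possible alternative is sketched below. The `build_classifier` helper is hypothetical, and the `mlp_param` values are placeholders, since this README does not list the contents of `mlp_param`; only the `rf_param` values come from the example above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rf_param = {"n_estimators": 100, "max_depth": 100}
# Placeholder values; the real mlp_param is not documented in this README.
mlp_param = {"hidden_layer_sizes": (64, 32), "max_iter": 500}


def build_classifier(kind):
    """Return an unfitted classifier for kind 'rf' or 'mlp' (hypothetical helper)."""
    if kind == "rf":
        return RandomForestClassifier(**rf_param)
    if kind == "mlp":
        return MLPClassifier(**mlp_param)
    raise ValueError(f"unknown model kind: {kind!r}")


clf = build_classifier("rf")
```

Both estimators expose the same `fit` and `predict` methods, so the rest of the pipeline does not need to change when the model is swapped. Feature importances are only available for the Random Forest.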
- CSV Files: Predictions saved as `prediction_<filename>.csv` in `./output/` with columns for depth, predicted labels, and true labels.
- PNG Files: Plots saved as `<filename>.png` in `./output/`, showing predicted vs. true labels along the depth axis.
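A plot of this kind could be produced along the following lines. This is a sketch under assumptions, not the script's `plot` function: the helper name `plot_labels`, the two-panel layout, the `tab10` color map, and the sample data are all chosen for illustration, and the file is written to a temporary directory rather than `./output/`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D


def plot_labels(depth, predicted, true, mapping, png_path):
    """Scatter predicted and true class codes against depth and save a PNG."""
    names = {code: name for name, code in mapping.items()}
    cmap = plt.get_cmap("tab10")  # one color per integer class code

    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 8))
    panels = zip(axes, (predicted, true), ("Predicted", "True"))
    for ax, labels, title in panels:
        colors = [cmap(code) for code in labels]
        ax.scatter(labels, depth, c=colors, s=10)
        ax.set_title(title)
        ax.set_xlabel("Class")
    axes[0].set_ylabel("Depth")

    # Legend entries come from the label mapping, one marker per class.
    handles = [
        Line2D([], [], marker="o", linestyle="", color=cmap(code), label=name)
        for code, name in sorted(names.items())
    ]
    fig.legend(handles=handles, loc="lower center")
    fig.savefig(png_path)
    plt.close(fig)


with tempfile.TemporaryDirectory() as tmp:
    png_path = os.path.join(tmp, "sample.png")
    plot_labels(
        depth=[-0.5 * i for i in range(6)],
        predicted=[0, 0, 0, 1, 1, 1],
        true=[0, 0, 1, 1, 1, 1],
        mapping={"MD": 0, "ALL-c (pal)": 1},
        png_path=png_path,
    )
    png_written = os.path.exists(png_path)
```

Sharing the depth axis between the two panels makes it easy to see at which depths the predicted class departs from the true one.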
- The script assumes all depth values in a test sample have the same sign (positive or negative).
- Missing values in the data are filled with `0` using `fillna(0)`.
- The smoothing function uses a mode-based approach, which may not work well with small window sizes or noisy data.
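Because the script assumes that depth values within one test sample share a sign, it can be worth checking a sample before running the pipeline. The helper below is a hypothetical pre-flight check and is not part of the script; it treats zero as compatible with either sign.

```python
def depths_share_sign(depths):
    """Return True unless the values mix positive and negative depths."""
    has_positive = any(d > 0 for d in depths)
    has_negative = any(d < 0 for d in depths)
    return not (has_positive and has_negative)


print(depths_share_sign([0.0, -0.5, -1.0]))  # all at or below zero
print(depths_share_sign([-0.5, 0.5]))        # mixed signs
```

A sample that fails this check likely needs its depth column corrected before the plots can be trusted.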
Assuming your training and test CSV files are ready:

    python pipeline.py

Check the `./output/` directory for results.