
drakegeo/sensor_failure_analysis


Sensor Failure Analysis

Semi-supervised classification of equipment breakdown events using a small labeled set to guide label propagation across a large unlabeled dataset.

Problem

  • 1 600 breakdown events recorded across 20 sensors
  • Only 40 events are labeled (3 failure types); the remaining 1 560 are unlabeled
  • Goal: classify all events into Failure 1, Failure 2, or Failure 3

Approach

1. Feature selection — point-biserial correlation

With only 40 labeled points, training a model on all 20 sensors risks noise dominating signal. For each sensor we compute the point-biserial correlation against each failure-type indicator (one binary column per class). Sensors whose maximum absolute correlation across all three classes exceeds a threshold of 0.30 are kept.

This reduces the feature space to 4 sensors (Sensor 0, Sensor 2, Sensor 8, Sensor 9) that carry the clearest discriminative signal. The threshold was chosen via a sweep over the labeled data: with a threshold below 0.30, noisy sensors are included and the LOO F1 drops.
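The selection rule described above can be sketched as follows. This is a minimal illustration, not the repository's `feature_selection.py`: the function name `select_sensors` and the synthetic data are assumptions, and it relies on the fact that the point-biserial correlation equals Pearson's r when one of the two variables is binary.

```python
import numpy as np


def select_sensors(X, y, threshold=0.30):
    """Return indices of sensors whose max |point-biserial r| over classes exceeds threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # scores[c, j] = |r| between the indicator of class c and sensor j
    scores = np.zeros((len(classes), X.shape[1]))
    for ci, c in enumerate(classes):
        indicator = (y == c).astype(float)  # one binary column per class
        for j in range(X.shape[1]):
            # Pearson's r against a binary variable is the point-biserial r
            scores[ci, j] = abs(np.corrcoef(indicator, X[:, j])[0, 1])
    max_abs = scores.max(axis=0)
    return np.flatnonzero(max_abs > threshold), max_abs


# Synthetic stand-in for the 40 labeled events (hypothetical data, 6 sensors).
rng = np.random.default_rng(0)
y = np.repeat([1, 2, 3], [10, 10, 20])
X = rng.normal(size=(40, 6))
X[:, 0] += 3.0 * (y == 1)  # sensor 0 separates Failure 1
X[:, 2] -= 3.0 * (y == 2)  # sensor 2 separates Failure 2

selected, max_abs = select_sensors(X, y, threshold=0.30)
print("selected sensors:", selected.tolist())
```

With only 40 rows, a pure-noise sensor can cross 0.30 by chance, which is one reason to validate the threshold against the LOO score rather than fix it a priori.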

2. Semi-supervised model — Label Spreading

LabelSpreading (scikit-learn) propagates the 40 known labels across the full 1 600-point graph using an RBF kernel. The graph connects every point to every other point weighted by feature similarity; labels flow from labeled nodes to unlabeled nodes iteratively.

Key hyperparameters (tuned via LOO cross-validation):

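| Parameter | Value | Meaning |
|---|---|---|
| kernel | rbf | Similarity metric between points |
| gamma | 1.0 | RBF kernel coefficient in exp(-gamma * d^2), where d is the Euclidean distance between two points; larger values give a narrower kernel |
| alpha | 0.4 | Clamping factor in (0, 1): values near 0 keep the initial labels, values near 1 let neighbours override them |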
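A minimal sketch of fitting scikit-learn's `LabelSpreading` with the hyperparameters listed above. The data here is synthetic and hypothetical (three 2-D clusters instead of the four selected sensors), and the features are assumed to be already scaled. In scikit-learn, unlabeled points are marked with `-1`.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in data (hypothetical): three 2-D clusters of 50 events each.
rng = np.random.default_rng(0)
true = np.repeat([1, 2, 3], 50)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = centers[true - 1] + rng.normal(scale=0.5, size=(150, 2))

# Keep only two known labels per class; everything else is unlabeled (-1).
y = np.full(150, -1)
labeled_idx = np.array([0, 1, 50, 51, 100, 101])
y[labeled_idx] = true[labeled_idx]

model = LabelSpreading(kernel="rbf", gamma=1.0, alpha=0.4)
model.fit(X, y)

propagated = model.transduction_  # one label per row, labeled and unlabeled alike
accuracy = (propagated == true).mean()
print(f"agreement with the hidden ground truth: {accuracy:.2f}")
```

`transduction_` holds the label assigned to every point in the graph, while `label_distributions_` gives the per-class scores if a confidence cut-off is needed.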

3. Evaluation — Leave-One-Out cross-validation

Standard train/test splits are not viable with only 40 labeled points. Instead, we use LOO-CV: for each of the 40 labeled events we temporarily hide its label, refit the model on the remaining 39 labeled + 1 560 unlabeled points, and record the prediction. This gives an honest estimate of generalization without wasting any labeled data.
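The LOO loop can be sketched as below. This is an illustration under assumptions, not the code in `train.py`: the helper name `loo_predictions` and the synthetic data are hypothetical, and the held-out event is assumed to stay in the graph as an unlabeled node so that its prediction is read from `transduction_`.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.semi_supervised import LabelSpreading


def loo_predictions(X, y, **params):
    """Hide each known label in turn, refit, and return the transduced labels."""
    labeled_idx = np.flatnonzero(y != -1)
    preds = np.empty(len(labeled_idx), dtype=y.dtype)
    for k, i in enumerate(labeled_idx):
        y_hidden = y.copy()
        y_hidden[i] = -1  # the held-out event stays in the graph, but unlabeled
        model = LabelSpreading(**params).fit(X, y_hidden)
        preds[k] = model.transduction_[i]
    return labeled_idx, preds


# Synthetic stand-in data (hypothetical): three 2-D clusters of 50 events each.
rng = np.random.default_rng(0)
true = np.repeat([1, 2, 3], 50)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = centers[true - 1] + rng.normal(scale=0.5, size=(150, 2))

y = np.full(150, -1)
known = np.array([0, 1, 2, 3, 50, 51, 52, 53, 100, 101, 102, 103])
y[known] = true[known]

labeled_idx, preds = loo_predictions(X, y, kernel="rbf", gamma=1.0, alpha=0.4)
macro_f1 = f1_score(y[labeled_idx], preds, average="macro")
print(f"LOO macro F1: {macro_f1:.3f}")
```

Each fold refits the full graph, so the cost grows with the number of labeled points times the cost of one fit; with 40 labels that is 40 fits.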

LOO results (macro F1 = 0.686):

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Failure 1 | 1.00 | 0.60 | 0.75 | 10 |
| Failure 2 | 0.80 | 0.40 | 0.53 | 10 |
| Failure 3 | 0.66 | 0.95 | 0.78 | 20 |
| Macro | 0.82 | 0.65 | 0.69 | 40 |

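Failure 2 has the weakest recall (0.40): it is the hardest class to distinguish from the others given the available labeled examples. Failure 3 shows the opposite pattern, high recall (0.95) with lower precision (0.66), which indicates that most misclassified Failure 1 and Failure 2 events are assigned to Failure 3, the largest labeled class.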
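As a quick consistency check, the per-class and macro F1 values can be recomputed from the precision and recall columns of the LOO table. The inputs below are the rounded figures from the table, so the result lands near 0.69; the reported 0.686 presumably comes from the unrounded values.

```python
# (precision, recall) per class, copied from the LOO results table
per_class = {
    "Failure 1": (1.00, 0.60),
    "Failure 2": (0.80, 0.40),
    "Failure 3": (0.66, 0.95),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {name: f1(p, r) for name, (p, r) in per_class.items()}
# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.2f}")
print(f"Macro F1 = {macro_f1:.2f}")
```

Because the macro average weights the three classes equally, the weak Failure 2 score pulls the headline figure down even though Failure 2 is only a quarter of the labeled set.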

4. Final model

After evaluation, the model is retrained on all 1 600 points (40 labeled + 1 560 unlabeled) to propagate labels across the full dataset.

Propagated distribution:

| Label | Count |
|---|---|
| Failure 1 | 447 |
| Failure 2 | 442 |
| Failure 3 | 711 |

Project structure

sensor_failure_analysis/
├── data/
│   └── data_sensors.csv          # 1 600 events × 20 sensors
├── pipelines/
│   ├── config.py                 # paths, thresholds, hyperparameters
│   └── main_pipeline.py          # end-to-end orchestration
├── src/
│   ├── data_processing/
│   │   ├── loader.py             # CSV loading, label parsing, scaling
│   │   └── feature_selection.py  # correlation matrix, sensor selection
│   └── model_training/
│       ├── train.py              # LabelSpreading fit + LOO-CV
│       └── evaluate.py           # metrics, MLflow logging
└── experiments/                  # exploratory notebooks and plots

How to run

Install dependencies:

poetry install

Run the pipeline:

poetry run python pipelines/main_pipeline.py

View experiment results in MLflow:

poetry run mlflow ui --backend-store-uri sqlite:///mlflow.db

Then open http://localhost:5000.

Design decisions

| Decision | Alternative considered | Reason chosen |
|---|---|---|
| Label Spreading | k-means / PCA clustering | Unsupervised methods ignore the 40 labels; the semi-supervised approach uses them directly |
| Point-biserial correlation for feature selection | PCA, mutual information | Interpretable, stable with small labeled sets, directly measures per-class discriminability |
| LOO-CV | k-fold CV | With only 40 labeled points, LOO maximises the training data per fold and gives a nearly unbiased estimate |
| StandardScaler on selected sensors only | Scale all 20 sensors | Avoids scaling noise sensors that were discarded |
