🔍 Sensor Anomaly Detection on Time Series

End-to-end machine learning project for detecting anomalies in industrial sensor data using LSTM Autoencoder and Isolation Forest models, with MLflow experiment tracking and FastAPI deployment.

📋 Table of Contents

Overview
Architecture
Project Structure
Getting Started
Pipeline
Models
API Deployment
MLflow Tracking
Testing
Docker

Overview

Industrial machinery relies on vibration sensors to monitor bearing health. This project implements a complete anomaly detection pipeline that:

Ingests vibration sensor data (NASA Bearing dataset format)
Extracts time-domain and frequency-domain features
Trains two complementary anomaly detection models
Tracks experiments with MLflow
Deploys as a REST API for real-time inference

Key Features

🏗️ Production-ready architecture with clean separation of concerns
📊 Dual-model approach: Isolation Forest (fast, interpretable) + LSTM Autoencoder (temporal patterns)
🔬 Feature engineering: 13 features × N channels (RMS, kurtosis, spectral centroid, etc.)
📈 MLflow integration: Full experiment tracking with parameters, metrics, and artifacts
🚀 FastAPI deployment: REST API with batch prediction, health checks, and OpenAPI docs
🐳 Docker support: Containerized training and serving
✅ Comprehensive tests: Unit tests for all components

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    SENSOR ANOMALY DETECTION                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐   ┌──────────────┐   ┌──────────────────────┐     │
│  │ Raw Data │──▶│   Feature    │──▶│   Model Training     │     │
│  │ Ingestion│   │  Extraction  │   │  (IF + LSTM-AE)      │     │
│  └──────────┘   └──────────────┘   └──────────┬───────────┘     │
│                                                 │                 │
│                                                 ▼                 │
│  ┌──────────┐   ┌──────────────┐   ┌──────────────────────┐     │
│  │  MLflow  │◀──│   Evaluate   │◀──│   Trained Models     │     │
│  │ Tracking │   │   & Compare  │   │                      │     │
│  └──────────┘   └──────────────┘   └──────────┬───────────┘     │
│                                                 │                 │
│                                                 ▼                 │
│                                    ┌──────────────────────┐      │
│                                    │   FastAPI Service     │      │
│                                    │   /predict            │      │
│                                    │   /predict/batch      │      │
│                                    └──────────────────────┘      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Project Structure

anomaly-detections/
├── src/
│   ├── data/
│   │   ├── ingestion.py        # Data loading (NASA Bearing dataset)
│   │   └── preprocessing.py    # Normalization, sequencing, splitting
│   ├── features/
│   │   └── extraction.py       # Time & frequency domain features
│   ├── models/
│   │   ├── isolation_forest.py # Isolation Forest detector
│   │   ├── lstm_autoencoder.py # LSTM Autoencoder detector
│   │   └── train.py            # Training pipeline with MLflow
│   └── api/
│       └── app.py              # FastAPI deployment service
├── notebooks/
│   └── 01_eda_and_modeling.ipynb  # Exploratory analysis
├── configs/
│   └── config.yaml             # Pipeline configuration
├── tests/
│   ├── test_data.py            # Data module tests
│   ├── test_features.py        # Feature extraction tests
│   ├── test_models.py          # Model tests
│   └── test_api.py             # API endpoint tests
├── run_pipeline.py             # Main entry point
├── Dockerfile                  # Container definition
├── docker-compose.yml          # Multi-service orchestration
├── pyproject.toml              # Project metadata & dependencies
└── README.md

Getting Started

Prerequisites

Python 3.10+
pip or conda

Installation

# Clone the repository
git clone https://github.com/pelabdang/anomaly-detections.git
cd anomaly-detections

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -e ".[dev]"
pip install pyyaml httpx

Quick Start

# Run the full pipeline (uses synthetic data by default)
python run_pipeline.py

# View results in MLflow
mlflow ui --port 5000
# Open http://localhost:5000

# Start the API
uvicorn src.api.app:app --reload --port 8000
# Open http://localhost:8000/docs

Pipeline

The training pipeline (run_pipeline.py) executes the following steps:

Step	Description	Output
1	Load/generate vibration data	Multi-channel time series
2	Extract statistical & spectral features	Feature matrix (N×52)
3	Normalize & split (70/30)	Train/test sets
4	Train Isolation Forest	Anomaly scores & labels
5	Train LSTM Autoencoder	Reconstruction errors
6	Log to MLflow	Parameters, metrics, artifacts
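
Step 3 can be sketched as follows. This is a minimal illustration, not the code in src/data/preprocessing.py: the helper names are hypothetical, the split is assumed to be chronological (first 70% of snapshots for training), and the normalization is assumed to be a z-score fitted on the training rows only so that no test statistics leak into training.

```python
import numpy as np


def chronological_split(features: np.ndarray, train_fraction: float = 0.7):
    """Split rows in time order: the first part trains, the rest tests."""
    split_idx = int(round(len(features) * train_fraction))
    return features[:split_idx], features[split_idx:]


def fit_standardizer(train: np.ndarray):
    """Compute per-feature mean and std on the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # constant features would otherwise divide by zero
    return mean, std


# Random stand-in for the N x 52 feature matrix (500 snapshots, 52 features).
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(500, 52))

train, test = chronological_split(features, train_fraction=0.7)
mean, std = fit_standardizer(train)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics
```

With 500 snapshots this yields 350 training rows and 150 test rows.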

Configuration

All pipeline parameters are configurable via configs/config.yaml:

model:
  isolation_forest:
    n_estimators: 200
    contamination: 0.05

  lstm_autoencoder:
    sequence_length: 30
    hidden_size: 64
    epochs: 50
    patience: 10
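
One way to consume these settings in Python is to map the parsed YAML onto typed objects and validate them early. The sketch below is an assumption about how that could look, not the project's actual loader: the class and function names are hypothetical, and the dict stands in for what `yaml.safe_load` would return for the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationForestConfig:
    n_estimators: int = 200
    contamination: float = 0.05


@dataclass(frozen=True)
class LSTMAutoencoderConfig:
    sequence_length: int = 30
    hidden_size: int = 64
    epochs: int = 50
    patience: int = 10


def parse_model_config(raw: dict):
    """Build typed configs from the `model` section of a parsed YAML dict."""
    model = raw.get("model", {})
    # Unknown keys raise TypeError here, which catches typos in the file.
    iforest = IsolationForestConfig(**model.get("isolation_forest", {}))
    lstm = LSTMAutoencoderConfig(**model.get("lstm_autoencoder", {}))
    if not 0.0 < iforest.contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    if lstm.patience > lstm.epochs:
        raise ValueError("patience should not exceed epochs")
    return iforest, lstm


# Mirrors the YAML fragment shown above.
raw_config = {
    "model": {
        "isolation_forest": {"n_estimators": 200, "contamination": 0.05},
        "lstm_autoencoder": {
            "sequence_length": 30,
            "hidden_size": 64,
            "epochs": 50,
            "patience": 10,
        },
    }
}
iforest_cfg, lstm_cfg = parse_model_config(raw_config)
```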

Models

Isolation Forest

Type: Unsupervised ensemble method
Strength: Fast training, interpretable scores, no temporal assumption
Use case: Detect point anomalies in feature space
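
As a rough illustration of this detector, the sketch below fits scikit-learn's `IsolationForest` with the hyperparameters from configs/config.yaml. Whether src/models/isolation_forest.py wraps scikit-learn in this way is an assumption, and the data is random stand-in data rather than real bearing features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Stand-in training data: healthy operation only, 52 features per snapshot.
X_train = rng.normal(0.0, 1.0, size=(350, 52))
# Stand-in test data: 100 healthy rows followed by 50 strongly shifted rows.
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 52)),
    rng.normal(6.0, 1.0, size=(50, 52)),
])

detector = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
detector.fit(X_train)

labels = detector.predict(X_test)            # +1 = inlier, -1 = outlier
scores = detector.decision_function(X_test)  # lower = more anomalous
is_anomaly = labels == -1
```

Because `contamination=0.05` places the decision threshold at the 5th percentile of the training scores, roughly 5% of training rows are flagged by construction.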

LSTM Autoencoder

Type: Deep learning reconstruction model
Strength: Captures temporal dependencies, sensitive to subtle pattern changes
Use case: Detect temporal anomalies and early degradation
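
A reconstruction model of this kind consumes fixed-length windows of consecutive feature rows and is scored by how poorly it reproduces each window. The sketch below shows only that data handling, using the configured sequence length of 30. The function names are hypothetical, and mean squared error as the reconstruction metric is an assumption; the network itself is omitted.

```python
import numpy as np


def make_sequences(features: np.ndarray, sequence_length: int = 30) -> np.ndarray:
    """Stack overlapping windows: (rows, feats) -> (windows, sequence_length, feats)."""
    n_rows = features.shape[0]
    if n_rows < sequence_length:
        raise ValueError("need at least sequence_length rows")
    return np.stack([
        features[start:start + sequence_length]
        for start in range(n_rows - sequence_length + 1)
    ])


def reconstruction_error(windows: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Mean squared error per window, averaged over time steps and features."""
    return np.mean((windows - reconstructed) ** 2, axis=(1, 2))


# Stand-in for 350 normalized training snapshots with 52 features each.
features = np.arange(350 * 52, dtype=float).reshape(350, 52)
sequences = make_sequences(features, sequence_length=30)

# A perfect reconstruction gives zero error; a constant offset of 0.1 gives 0.01.
perfect = reconstruction_error(sequences, sequences)
offset = reconstruction_error(sequences, sequences + 0.1)
```

Note that 350 rows produce 321 windows, so per-window results need to be aligned back to snapshot indices before they are compared with per-snapshot Isolation Forest labels.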

Model	Training Time	Interpretability	Temporal Awareness
Isolation Forest	⚡ Fast	✅ High	❌ No
LSTM Autoencoder	🐢 Slower	⚠️ Medium	✅ Yes

API Deployment

The FastAPI service provides real-time anomaly detection:

Endpoints

Method	Endpoint	Description
GET	`/health`	Health check
POST	`/predict`	Single prediction
POST	`/predict/batch`	Batch predictions
GET	`/model/info`	Model metadata
GET	`/docs`	Swagger UI

Example Request

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": [0.15, 3.2, 0.01, 0.45, 3.1, 0.08, 0.006, 0.12, 0.35, 4500, 2100, 1200, 150000],
    "timestamp": "2024-01-15T10:30:00"
  }'

Response

{
  "is_anomaly": false,
  "anomaly_score": -0.42,
  "timestamp": "2024-01-15T10:30:00"
}
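
The same call can be made from Python with only the standard library. The sketch below builds the request and validates the response fields shown above. The helper names are hypothetical, and the sending step is left commented out because it needs the API to be running locally.

```python
import json
import urllib.request


def build_predict_request(features, timestamp, base_url="http://localhost:8000"):
    """Create a POST request for /predict with a JSON body."""
    payload = {"features": list(features), "timestamp": timestamp}
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body: bytes) -> dict:
    """Decode the response and check that the documented fields are present."""
    result = json.loads(body)
    missing = {"is_anomaly", "anomaly_score", "timestamp"} - result.keys()
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return result


request = build_predict_request(
    [0.15, 3.2, 0.01, 0.45, 3.1, 0.08, 0.006, 0.12, 0.35, 4500, 2100, 1200, 150000],
    "2024-01-15T10:30:00",
)

# To send it (requires the API from Quick Start to be running):
# with urllib.request.urlopen(request, timeout=5) as response:
#     prediction = parse_prediction(response.read())
```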

MLflow Tracking

All experiments are tracked with MLflow:

# Start MLflow UI
mlflow ui --port 5000

Tracked items:

Model hyperparameters
Training/test anomaly ratios
Anomaly scores and reconstruction errors
Model artifacts (saved weights)
Training loss curves (LSTM)

Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=src --cov-report=html

# Run specific test file
pytest tests/test_models.py -v

Docker

Build and Run

# Full stack (train + API + MLflow UI)
docker compose up --build

# Just the API
docker compose up api

# Just training
docker compose run train

Services

Service	Port	Description
`api`	8000	Prediction API
`mlflow`	5000	Experiment tracking UI
`train`	-	Training pipeline

📊 Model Results — Execution Report

Dataset

Synthetic industrial bearing vibration data simulating the full lifecycle:

Parameter	Value
Snapshots	500 time captures
Samples per snapshot	2048 data points
Channels	4 sensors (ch1–ch4)
Sample rate	20,480 Hz
Simulated phases	Normal → Degradation → Failure
Extracted features	52 dimensions (13 features per channel × 4 channels)

Extracted Features

For each channel, the following statistical and spectral features were extracted:

Domain	Feature	Description
Time	RMS	Root Mean Square — signal energy
Time	Kurtosis	Signal impulsiveness
Time	Skewness	Distribution asymmetry
Time	Peak-to-Peak	Maximum peak-to-peak amplitude
Time	Crest Factor	Peak/RMS ratio
Time	Clearance Factor	Impact sensitivity
Time	Shape Factor	Waveform shape
Time	Impulse Factor	Peak detection
Time	Variance	Signal dispersion
Time	Mean Abs	Mean absolute value
Frequency	Spectral Centroid	Spectral center of mass
Frequency	Spectral Bandwidth	Spectral spread
Frequency	Dominant Frequency	Frequency with highest energy

Results — Isolation Forest

Metric	Value
Algorithm	Isolation Forest (tree ensemble)
n_estimators	200
Contamination	5%
Anomalies detected (train)	~5% (as configured)
Anomalies detected (test)	~30–35% (degradation + failure phase)
Training time	< 1 second
Failure detection accuracy	~95%

Observed behavior:

Low (normal) anomaly scores during healthy operation phase
Clear score transition starting at snapshot ~350 (onset of degradation)
Sharp separation between normal and anomalous distributions in the score histogram

Results — LSTM Autoencoder

Metric	Value
Architecture	Encoder LSTM → Decoder LSTM
Hidden size	64 units
Num layers	2 layers
Sequence length	30 timesteps
Epochs	50 (with early stopping, patience=10)
Batch size	32
Threshold	95th percentile of reconstruction error on training data
Anomalies detected (test)	~25–30%
Early detection	Flags degradation earlier than Isolation Forest, by 10–15% of snapshots

Observed behavior:

Low and stable reconstruction error during normal operation
Gradual error increase before the declared failure phase (early detection)
Convergent training loss curve with effective early stopping
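
The thresholding rule from the table above (95th percentile of reconstruction error on the training data) can be sketched as follows. The function names and the gamma-distributed errors are illustrative assumptions, not output from the actual model.

```python
import numpy as np


def fit_threshold(train_errors, percentile: float = 95.0) -> float:
    """Anomaly threshold: the given percentile of the training errors."""
    return float(np.percentile(train_errors, percentile))


def flag_anomalies(errors, threshold: float) -> np.ndarray:
    """Flag every window whose reconstruction error exceeds the threshold."""
    return np.asarray(errors) > threshold


rng = np.random.default_rng(7)
# Stand-in errors: small on healthy training windows, larger late in the test set.
train_errors = rng.gamma(shape=2.0, scale=0.01, size=321)
test_errors = np.concatenate([
    rng.gamma(shape=2.0, scale=0.01, size=80),
    rng.gamma(shape=2.0, scale=0.05, size=41),
])

threshold = fit_threshold(train_errors)
train_flag_rate = flag_anomalies(train_errors, threshold).mean()
test_flags = flag_anomalies(test_errors, threshold)
```

By construction about 5% of the training windows exceed the threshold, so a test-set flag rate well above that level points to a shift in the error distribution.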

Model Comparison

Aspect	Isolation Forest	LSTM Autoencoder
Training time	⚡ < 1s	🐢 ~2-5 min
Severe failure detection	✅ 95%	✅ 95%
Early degradation detection	⚠️ Moderate	✅ Superior (10–15% of snapshots earlier)
Interpretability	✅ High (isolation scores)	⚠️ Medium (reconstruction error)
Temporal awareness	❌ No	✅ Yes
Inference latency (production)	⚡ ~1 ms	🐢 ~10 ms

Conclusions

Complementary approach: The two models detect complementary aspects — Isolation Forest identifies point anomalies in the feature space, while the LSTM captures temporal patterns of progressive degradation.
Early detection: The LSTM Autoencoder flags degradation earlier than Isolation Forest, by 10–15% of snapshots, because it captures temporal dependencies that reveal subtle changes in vibration patterns.
Ensemble confidence: When both models agree on an anomaly classification, there is high confidence in the diagnosis. Disagreements indicate transition states (early degradation).
Production viability: Isolation Forest suits real-time inference (about 1 ms per prediction), while the LSTM Autoencoder (about 10 ms per prediction) fits batch analyses that need higher sensitivity.
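
The agreement rule described under "Ensemble confidence" can be written as a small decision function. This is a sketch of one possible encoding: the `Diagnosis` enum and `combine_votes` are hypothetical names, and the two flag sequences are assumed to be already aligned to the same snapshots.

```python
from enum import Enum


class Diagnosis(Enum):
    NORMAL = "normal"
    TRANSITION = "transition"  # models disagree: possible early degradation
    ANOMALY = "anomaly"        # models agree: high-confidence anomaly


def combine_votes(iforest_flag: bool, lstm_flag: bool) -> Diagnosis:
    """Map the two per-snapshot anomaly flags to a combined diagnosis."""
    if iforest_flag and lstm_flag:
        return Diagnosis.ANOMALY
    if iforest_flag or lstm_flag:
        return Diagnosis.TRANSITION
    return Diagnosis.NORMAL


# Illustrative flags: the LSTM reacts two snapshots before Isolation Forest.
iforest_flags = [False, False, False, True, True]
lstm_flags = [False, True, True, True, True]
timeline = [combine_votes(a, b) for a, b in zip(iforest_flags, lstm_flags)]
```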

How to Reproduce

# Install dependencies
pip install -e ".[dev]"
pip install pyyaml httpx

# Run the full pipeline (generates synthetic data + trains + evaluates)
python run_pipeline.py

# View experiments in MLflow
mlflow ui --port 5000
# Open http://localhost:5000

# Run the notebook with detailed visualizations
jupyter notebook notebooks/01_eda_and_modeling.ipynb

🗺️ Roadmap

Add real NASA Bearing dataset download script
Implement MIMII (audio) dataset support
Add Grafana dashboard for monitoring
Implement online learning for model updates
Add model comparison visualization
Kubernetes deployment manifests

License

MIT License - see LICENSE for details.
