An intelligent monitoring system that learns your infrastructure's behavior and detects anomalies automatically
A collaborative project by Chanu716 and Charmi Seera
Imagine your server's CPU usage gradually creeping from 60% to 75% over a week. Traditional monitoring? Silent. It's under the 80% threshold. But that subtle change might indicate a memory leak that'll crash your system next Tuesday.
Or picture this: Your API response times spike to 200ms during lunch hour - completely normal for your e-commerce site. But your static threshold triggers an alert anyway. After the third false alarm this week, you start ignoring notifications.
We built SAIMon because monitoring shouldn't work this way.
Instead of setting arbitrary thresholds, what if your monitoring system could learn what's normal for your infrastructure? What if it understood that 85% CPU during deployments is fine, but 60% at 3 AM is suspicious?
SAIMon observes your system's behavior over time and builds a baseline understanding. When metrics deviate from learned patterns - even within "acceptable" ranges - it flags them. You get context-aware alerts: what changed, by how much, and confidence scores.
No guessing thresholds. No alert fatigue. Just intelligent detection that adapts to your system's personality.
```
┌─────────────────────────────────────────────────────────────────────┐
│                        Your Infrastructure                          │
│                  (Servers, Databases, Services)                     │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │ System Metrics
                                   ▼
                          ┌─────────────────┐
                          │   Prometheus    │ ◀─── Node Exporter
                          │ (Metrics Store) │      (Collects CPU, Memory, etc.)
                          └────────┬────────┘
                                   │
                     ┌─────────────┴─────────────┐
                     │                           │
                     ▼                           ▼
           ┌──────────────────┐        ┌──────────────────┐
           │    ML Engine     │        │     Grafana      │
           │                  │        │   (Dashboards)   │
           │  • Z-Score       │        └──────────────────┘
           │  • Isolation     │
           │    Forest        │
           │  • Training      │
           │  • Detection     │
           └────────┬─────────┘
                    │ Detected Anomalies
                    ▼
           ┌──────────────────┐
           │     FastAPI      │
           │    (REST API)    │
           └────────┬─────────┘
                    │
                    ▼
           ┌──────────────────┐
           │    PostgreSQL    │
           │    (Database)    │
           └──────────────────┘
```
Components:
- Prometheus: Collects and stores time-series metrics every 15 seconds
- ML Engine: Python service that trains models and detects anomalies
- FastAPI: REST API for querying anomalies and managing models
- PostgreSQL: Stores detected anomalies, model metadata, and metric information
- Grafana: Visualization dashboards and real-time monitoring
- Redis: Caching layer for performance optimization
We use two complementary techniques to catch different types of anomalies:
Statistical Baseline (Z-Score): Calculates how far metrics deviate from historical averages. Great for catching sudden spikes or drops. If your CPU suddenly jumps to 95% when it normally hovers around 40%, the Z-score will be high.
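As a rough sketch (not the repo's exact code), the check boils down to a few lines:

```python
import numpy as np

def zscore_anomaly(value: float, history: np.ndarray, threshold: float = 3.0):
    """Flag `value` if it sits more than `threshold` standard deviations
    from the historical mean. Illustrative sketch only."""
    mean, std = history.mean(), history.std()
    if std == 0:  # flat history: nothing to compare against
        return False, 0.0
    z = abs(value - mean) / std
    return z > threshold, z

# Example: CPU normally hovers around 40%, suddenly reads 95%
history = np.random.normal(loc=40, scale=5, size=1000)
is_anomaly, z = zscore_anomaly(95.0, history)
print(is_anomaly, round(z, 1))  # True, z well above 3
```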
Pattern Recognition (Isolation Forest): Builds an ensemble of decision trees to identify unusual patterns. This catches subtle anomalies that pure statistics miss - like gradual degradation or unusual combinations of metrics.
Both run every 5 minutes on recent data. When either model flags something unusual, you get an anomaly record with severity scoring.
Models retrain automatically every night using the past week's data. This keeps them current as your infrastructure evolves. New deployment increased baseline memory usage? The models adapt. Traffic patterns shifted? They learn the new normal.
Minimum requirement: 1000 data points per metric (about 4 hours at 15-second intervals). If you just added a new metric, SAIMon waits until there's enough history to train reliably.
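A minimal sketch of that nightly pass, assuming hypothetical `fetch_history` and `train_models` helpers (the real module names differ):

```python
from datetime import datetime, timedelta

MIN_POINTS = 1000                    # ~4 hours of data at 15-second intervals
TRAINING_WINDOW = timedelta(days=7)  # retrain on the past week

def nightly_retrain(metrics, fetch_history, train_models):
    """Illustrative sketch: retrain every metric that has enough history.
    `fetch_history` and `train_models` stand in for the real collector/trainer."""
    since = datetime.utcnow() - TRAINING_WINDOW
    for metric in metrics:
        history = fetch_history(metric, since)
        if len(history) < MIN_POINTS:
            # New metric: wait until there is enough data to train reliably
            continue
        train_models(metric, history)  # fits both Z-Score and Isolation Forest
```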
Every 5 minutes, SAIMon (see the sketch below this list):
- Fetches the latest metrics from Prometheus
- Runs them through both models
- Calculates anomaly scores (0 to 1, where higher = more anomalous)
- Classifies severity (low/medium/high/critical)
- Stores everything in PostgreSQL for analysis
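Roughly, one cycle looks like the sketch below; the Prometheus query helper and the severity cut-offs are illustrative assumptions, not the project's exact values:

```python
import requests
from datetime import datetime, timedelta

PROMETHEUS = "http://localhost:9090"

def fetch_recent(query: str, minutes: int = 5, step: str = "15s"):
    """Pull the last few minutes of a metric via Prometheus' query_range API."""
    end = datetime.utcnow()
    start = end - timedelta(minutes=minutes)
    resp = requests.get(f"{PROMETHEUS}/api/v1/query_range", params={
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    }, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def classify_severity(score: float) -> str:
    """Map a 0-1 anomaly score onto severity buckets (thresholds are assumptions)."""
    if score >= 0.9: return "critical"
    if score >= 0.7: return "high"
    if score >= 0.5: return "medium"
    return "low"

# Usage sketch: fetch, score with the trained models, classify, store
values = fetch_recent("node_memory_MemAvailable_bytes")
# score = detector.score(values)                # from the trained Z-Score / Isolation Forest models
# severity = classify_severity(score)           # then persist the anomaly record to PostgreSQL
```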
Real-time Anomaly Detection: Checks your metrics every 5 minutes, flags unusual patterns
Automatic Learning: Models retrain daily on the past week's data - no manual tuning needed
Multiple Detection Methods: Z-Score catches sudden changes, Isolation Forest finds subtle patterns
Full History: Every anomaly stored with context - what happened, expected vs actual values, confidence scores
REST API: Query anomalies programmatically, integrate with your existing tools
Visual Dashboards: Grafana integration for real-time monitoring and historical analysis
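For instance, pulling recent high-severity anomalies into a script takes a few lines with `requests` (the endpoint and query parameters appear in the API examples further down):

```python
import requests

# Ask the SAIMon API for the ten most recent high-severity anomalies
resp = requests.get(
    "http://localhost:8000/api/v1/anomalies",
    params={"severity": "high", "limit": 10},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # anomaly records with values, scores, severity, and timestamps
```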
Since deploying SAIMon on our infrastructure:
- 750+ anomalies detected across CPU, memory, disk, and network metrics
- 2 models trained and actively monitoring (statistical + ML-based)
- 6 different metrics being tracked continuously
- Sub-5-minute detection from metric collection to alert
- Zero manual threshold tuning required
The system runs 24/7, automatically adapting as our infrastructure changes. We've caught several issues before they became user-facing - including a gradual memory leak that traditional monitoring completely missed.
Backend: Python 3.11, FastAPI for REST API, SQLAlchemy ORM
ML Stack: Scikit-learn (Isolation Forest), NumPy, Pandas, SciPy
Data Storage: PostgreSQL (anomalies, models, metrics), Redis (caching)
Monitoring: Prometheus (metric collection), Grafana (visualization)
Infrastructure: Docker Compose (7 containerized services)
Microservices Design: Each component (ML engine, API, database, monitoring) runs independently in Docker containers. Services communicate via HTTP APIs and database connections.
Async Processing: ML training runs on a schedule (nightly), while anomaly detection operates continuously every 5 minutes. This keeps the system responsive while maintaining up-to-date models.
Time-Series Optimization: PostgreSQL indexes on timestamp fields, Prometheus for efficient metric queries, Redis for caching frequently-accessed data.
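The caching follows the usual read-through pattern; a hedged sketch (key name and TTL are illustrative, not taken from the repo):

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_query(key: str, ttl_seconds: int, compute):
    """Return a cached result if present; otherwise compute it and cache it.
    Illustrative pattern for fronting expensive Prometheus/PostgreSQL queries."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = compute()
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value

# Example: cache anomaly statistics for 60 seconds under an assumed key
# stats = cached_query("anomaly:stats", 60, lambda: expensive_stats_query())
```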
- Docker & Docker Compose (v20.10+)
- Python 3.11+
- Git
- 4GB+ RAM for all services

- Clone the repository

  ```bash
  git clone https://github.com/Chanu716/SAIMon.git
  cd SAIMon
  ```

- Start all services (7 containers)

  ```bash
  docker-compose up -d
  ```

- Verify services are running

  ```bash
  docker-compose ps
  # All 7 services should show "Up"
  ```

- Run setup validation

  ```bash
  pip install -r requirements.txt
  python scripts/test_setup.py
  # Expected: 5/5 tests passed ✅
  ```
| Service | URL | Purpose |
|---|---|---|
| Prometheus | http://localhost:9090 | Raw metrics & queries |
| Grafana | http://localhost:3000 | Visualization dashboards (admin/admin) |
| SAIMon API | http://localhost:8000 | REST API endpoints |
| API Docs | http://localhost:8000/docs | Interactive Swagger UI |
| PostgreSQL | localhost:5432 | Database (saimon/saimon) |
```bash
# View detected anomalies
curl http://localhost:8000/api/v1/anomalies?limit=10

# Check trained ML models
curl http://localhost:8000/api/v1/models

# View system health
curl http://localhost:8000/health

# Get anomaly statistics
curl http://localhost:8000/api/v1/anomalies/stats
```

- Open http://localhost:3000 (login: admin/admin)
- Add Prometheus data source:
  - URL: http://prometheus:9090
  - Click "Save & Test"
- Import Node Exporter dashboard:
  - Dashboard ID: 1860
  - Select Prometheus as data source
- Import SAIMon anomaly dashboard:
  - Upload grafana/dashboards/saimon-anomalies.json
```
SAIMon/
├── services/
│   ├── api/                          # FastAPI REST API
│   │   ├── main.py                   # Application entry point
│   │   ├── models.py                 # SQLAlchemy ORM models
│   │   ├── database.py               # Database connection
│   │   ├── routers/                  # API endpoints
│   │   │   ├── anomalies.py          # GET/POST anomalies
│   │   │   ├── models.py             # ML model management
│   │   │   ├── metrics.py            # Metric queries
│   │   │   ├── alerts.py             # Alert configuration
│   │   │   └── health.py             # Health checks
│   │   └── requirements.txt          # Python dependencies
│   │
│   └── ml_engine/                    # Machine Learning Service
│       ├── main.py                   # Scheduling & orchestration
│       ├── anomaly_detector.py       # ML algorithms implementation
│       ├── data_collector.py         # Prometheus data fetching
│       ├── config.py                 # Configuration loader
│       └── requirements.txt          # ML dependencies
│
├── config/
│   ├── prometheus.yml                # Prometheus scrape config
│   ├── ml_config.yml                 # ML hyperparameters
│   ├── db/init.sql                   # Database schema
│   └── grafana/                      # Grafana provisioning
│
├── grafana/dashboards/               # Custom dashboards
│   └── saimon-anomalies.json         # ML anomaly dashboard
│
├── notebooks/                        # Jupyter notebooks
│   └── 01_anomaly_detection_experiments.ipynb
│
├── scripts/                          # Utility scripts
│   ├── test_setup.py                 # Validation script
│   └── generate_test_metrics.py      # Test data generator
│
├── docs/                             # Comprehensive documentation
│   ├── VIEWING_DATA.md               # Grafana & Prometheus guide
│   ├── COMPLETE_OVERVIEW.md          # System overview
│   ├── QUICKSTART.md                 # Quick setup guide
│   └── TROUBLESHOOTING.md            # Common issues
│
├── docker-compose.yml                # Multi-container orchestration
├── requirements.txt                  # Root Python dependencies
├── .gitignore                        # Git ignore patterns
└── README.md                         # This file
```
Want to see how it actually works? Here are some key pieces:
The Z-Score model is straightforward - calculate mean and standard deviation from historical data:
```python
def _train_zscore(self, metric_name: str, features: np.ndarray):
    mean = np.mean(features[:, 0])
    std = np.std(features[:, 0])
    threshold = 3.0  # 3 standard deviations = 99.7% of normal data
    model_data = {'mean': mean, 'std': std, 'threshold': threshold}
    self._save_model(metric_name, model_data)
```

Isolation Forest is more complex - it builds 100 decision trees and measures how quickly each point gets isolated:
```python
def _train_isolation_forest(self, metric_name: str, features: np.ndarray):
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)
    model = IsolationForest(
        contamination=0.1,   # Expect ~10% anomalies
        n_estimators=100,    # 100 trees in the forest
        max_samples=256      # Sample size per tree
    )
    model.fit(features_scaled)
```

The API lets you query anomalies with filters and pagination:
```python
@router.get("/anomalies")
async def list_anomalies(
    severity: Optional[str] = None,
    metric_name: Optional[str] = None,
    start_time: Optional[datetime] = None,
    skip: int = 0,
    limit: int = 10,
    db: Session = Depends(get_db)
):
    query = db.query(Anomaly)
    if severity:
        query = query.filter(Anomaly.severity == severity)
    return query.offset(skip).limit(limit).all()
```

Anomalies are stored with full context for later analysis:
```sql
CREATE TABLE anomalies (
    id UUID PRIMARY KEY,
    metric_id UUID REFERENCES metrics(id),
    timestamp TIMESTAMP NOT NULL,
    value DOUBLE PRECISION,
    expected_value DOUBLE PRECISION,
    anomaly_score DOUBLE PRECISION,
    severity VARCHAR(20)
);

-- PostgreSQL declares indexes separately from the table definition
CREATE INDEX idx_anomalies_timestamp ON anomalies (timestamp DESC);  -- Fast time-range queries
CREATE INDEX idx_anomalies_metric_id ON anomalies (metric_id);       -- Fast metric filtering
CREATE INDEX idx_anomalies_severity  ON anomalies (severity);        -- Fast severity filtering
```

All code is in the repo if you want to dig deeper!
We included a validation script to verify everything's working:
```python
import requests

def test_api_health():
    """Make sure the API is up"""
    response = requests.get("http://localhost:8000/health")
    assert response.status_code == 200

def test_models_trained():
    """Check that models exist and are accessible"""
    response = requests.get("http://localhost:8000/api/v1/models")
    data = response.json()
    assert data["total"] >= 2  # Should have Z-Score + Isolation Forest

def test_anomalies_detected():
    """Verify anomalies are being saved"""
    response = requests.get("http://localhost:8000/api/v1/anomalies?limit=1")
    data = response.json()
    assert data["total"] > 0
```

Run it: `python scripts/test_setup.py`
Some numbers from our deployment:
- Detection: Less than 1 second from metric scrape to anomaly flag
- Training: About 30 seconds per metric (with 1000+ historical points)
- API: ~50ms average response time
- Storage: Growing at ~1MB/day (6 metrics + 750 anomalies)
- Memory: ML engine uses ~200MB RAM
- Throughput: Processing 5,760 data points per metric daily
We're exploring several directions to make SAIMon even more capable:
Better Patterns: LSTM autoencoders to catch time-based sequences (like "CPU always spikes 2 hours after deployment")
Smarter Alerts: Direct Slack/email integration so you get notified immediately, not just logged
Multi-metric Correlation: Detecting when CPU + memory + disk all trend weird simultaneously (current version checks metrics independently)
Explainability: Adding SHAP analysis so when an anomaly fires, you can see which features triggered it
Seasonal Patterns: Using Facebook Prophet for systems with daily/weekly cycles (like e-commerce traffic)
If you're interested in contributing to any of these, we'd love the help!
Comprehensive guides available in docs/:
- VIEWING_DATA.md: Complete Grafana & Prometheus tutorial
- COMPLETE_OVERVIEW.md: System architecture deep dive
- QUICKSTART.md: 5-minute setup guide
- TROUBLESHOOTING.md: Common issues & solutions
We welcome contributions! Whether you're interested in:
- 🐛 Bug fixes: Improve stability and error handling
- ✨ New features: Add algorithms, metrics, or integrations
- 📝 Documentation: Enhance guides or add tutorials
- 🧪 Testing: Expand test coverage or add benchmarks
Collaboration Process:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built collaboratively by Chanu716 and Charmi Seera.
We started this project to solve a real problem with traditional monitoring systems. Along the way, we learned a lot about time-series analysis, ensemble methods, and building production ML systems. If you're working on similar problems or want to discuss the approach, feel free to reach out!