A State-of-the-Art Hybrid Machine Learning Pipeline for Real-Time Network Traffic Classification and Zero-Day Anomaly Detection.
Modern network infrastructures are constantly exposed to sophisticated cyber threats. Traditional Signature-based Intrusion Detection Systems (SIDS) fail against novel, zero-day attacks, while Anomaly-based Intrusion Detection Systems (AIDS) tend to suffer from high false-alarm rates.
This project implements an AI-Powered Hybrid Intrusion Detection System (IDS) that harmonizes supervised classification and unsupervised anomaly detection:
- Supervised Classification (XGBoost): Classifies incoming traffic features against learned signatures of known threats (e.g., DDoS, Brute Force, Port Scan) with high precision and low latency.
- Unsupervised Anomaly Detection (PyTorch Autoencoder): Learns to reconstruct normal traffic patterns. Samples whose reconstruction error (Mean Squared Error) exceeds a threshold calibrated on benign traffic are flagged as previously unseen or zero-day anomalies.
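The reconstruction-error check from the second bullet can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's PyTorch code: `reconstruction_mse` and `is_anomalous` are hypothetical helper names, the reconstruction is passed in as a list instead of coming from a trained autoencoder, and the threshold is assumed to have been calibrated elsewhere on benign traffic.

```python
from typing import Sequence


def reconstruction_mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between a feature vector and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("feature vectors must have the same length")
    if not original:
        raise ValueError("feature vectors must not be empty")
    squared = [(x - y) ** 2 for x, y in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def is_anomalous(mse: float, threshold: float) -> bool:
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return mse > threshold


# Hypothetical scaled features; a real autoencoder would produce `recon`.
features = [0.0, 1.0, 2.0, 3.0]
recon = [0.0, 1.0, 2.0, 1.0]
error = reconstruction_mse(features, recon)
print(error, is_anomalous(error, threshold=0.5))
```

Only the last feature is reconstructed badly here, yet that single deviation is enough to push the averaged error over the assumed threshold of 0.5.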
graph TD
%% Define Styles
classDef inputStyle fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,font-weight:bold;
classDef preprocessStyle fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,font-weight:bold;
classDef supervisedStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px,font-weight:bold;
classDef unsupervisedStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,font-weight:bold;
classDef stStyle fill:#eceff1,stroke:#455a64,stroke-width:2px,font-weight:bold;
classDef outputStyle fill:#ffebee,stroke:#c62828,stroke-width:2px,font-weight:bold;
%% Nodes
A["Raw Network Traffic <br>(CIC-IDS2017 CSV)"]:::inputStyle
B["Data Preprocessing <br>(SimpleImputer & StandardScaler)"]:::preprocessStyle
%% Split Parallel Paths
subgraph Supervised Pipeline
C1["XGBoost Classifier"]:::supervisedStyle
C2["Signature Detection"]:::supervisedStyle
C3["Predicts 'Known Attacks'"]:::supervisedStyle
end
subgraph Unsupervised Pipeline
D1["PyTorch Autoencoder"]:::unsupervisedStyle
D2["Reconstruction Error (MSE)"]:::unsupervisedStyle
D3["Predicts 'Zero-Day Anomalies'"]:::unsupervisedStyle
end
E["Streamlit Hybrid Dashboard <br>(Decision Consolidation)"]:::stStyle
F["Final Alert: BENIGN or THREAT"]:::outputStyle
%% Connections
A --> B
B -->|"Parallel Flow"| C1
B -->|"Parallel Flow"| D1
C1 --> C2 --> C3
D1 --> D2 --> D3
C3 --> E
D3 --> E
E --> F
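One plausible reading of the consolidation step at the end of the diagram is a logical OR of the two engines. The sketch below assumes that rule. The `consolidate` function, its argument names, and the use of `"BENIGN"` as the classifier's benign label are illustrative assumptions, not the dashboard's actual code.

```python
def consolidate(xgb_label: str, mse: float, threshold: float) -> str:
    """Combine both engines into a single alert (assumed OR rule).

    xgb_label: class name predicted by the supervised model.
    mse: autoencoder reconstruction error for the same sample.
    threshold: anomaly cut-off calibrated on benign traffic.
    """
    known_attack = xgb_label != "BENIGN"
    zero_day_suspect = mse > threshold
    return "THREAT" if (known_attack or zero_day_suspect) else "BENIGN"


print(consolidate("BENIGN", 0.05, 0.18))  # both engines quiet -> BENIGN
print(consolidate("DDoS", 0.05, 0.18))    # known attack class -> THREAT
print(consolidate("BENIGN", 0.60, 0.18))  # anomaly only -> THREAT
```

An OR rule favors recall: either engine alone can raise an alert, at the cost of inheriting the false positives of both.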
Below is the directory structure showing the clean isolation of data pipelines, model weights, notebook experiments, and modular source code:
├── data/
│   ├── raw/
│   │   └── dataset.csv             # Original raw network traffic captures
│   └── processed/
│       └── clean_traffic.csv       # Normalized and engineered baseline traffic
├── models/
│   ├── autoencoder.pth             # Trained PyTorch Autoencoder weights
│   ├── xgb_model.pkl               # Trained XGBoost Booster object
│   ├── preprocessor.pkl            # Fitted RobustScaler/MinMaxScaler artifact
│   └── label_encoder.pkl           # LabelEncoder mapping for known classes
├── notebooks/
│   └── exploration.ipynb           # Jupyter notebook containing EDA & modeling experiments
├── src/
│   ├── data_pipeline.py            # Preprocessing, normalization, and split scripts
│   ├── train_supervised.py         # Supervised XGBoost training routines
│   ├── train_anomaly.py            # PyTorch Autoencoder training execution
│   └── evaluate.py                 # Performance testing & validation framework
├── app.py                          # Hardware-aware Streamlit web dashboard
├── Dockerfile                      # Production container specification
├── requirements.txt                # Project dependency specification
├── .gitignore                      # Excluded files and dataset boundaries
└── README.md                       # Project documentation
- Dual-Engine Hybrid Architecture: Synergizes supervised classification for swift known threat blocking and deep unsupervised autoencoders for zero-day threat discovery.
- Hardware-Aware Adaptive Loading: Dynamically auto-detects CUDA hardware for accelerated PyTorch tensor computing on Nvidia GPUs while gracefully falling back to CPU mode in containerized cloud environments (e.g., Hugging Face Spaces).
- Interactive Analytics Dashboard: Real-time evaluation dashboard powered by Streamlit, allowing manual single-row feature entry or batch CSV uploads with immediate comparison against baseline MSE metrics.
- Enterprise-Grade Containerization: Fully containerized using Docker, isolating environmental dependencies and facilitating seamless on-premise or cloud deployments.
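The hardware-aware loading mentioned in the feature list can be approximated as follows. This is a sketch under assumptions: `select_device` and `load_state_dict` are hypothetical helper names, and the weights file is assumed to hold a plain state dict. `torch.cuda.is_available()` and the `map_location` argument of `torch.load` are standard PyTorch calls.

```python
def select_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to accelerate.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_state_dict(path: str, device: str):
    """Load saved weights onto the chosen device (requires PyTorch)."""
    import torch

    # map_location remaps tensors saved on a GPU so they load on a CPU-only host.
    return torch.load(path, map_location=device)


device = select_device()
print(f"Running inference on: {device}")
```

With PyTorch installed, `load_state_dict("models/autoencoder.pth", device)` would return the stored weights, which could then be passed to the autoencoder module's own `load_state_dict` method.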
Ensure you have Python 3.10+ and pip installed on your system.
git clone https://github.com/your-username/ids-hybrid-system.git
cd ids-hybrid-system
- On Windows:
python -m venv venv
.\venv\Scripts\activate
- On Linux/macOS:
python3 -m venv venv
source venv/bin/activate
Install the required packages, including PyTorch, XGBoost, Scikit-Learn, and Streamlit:
pip install -r requirements.txt
Once dependencies are installed and training artifacts are generated in the models/ directory, launch the application:
streamlit run app.py
Open your browser and navigate to http://localhost:8501 to interact with the visual dashboard.
To build and run the application as a portable, production-ready container:
docker build -t hybrid-ids:latest .
docker run -p 8501:8501 hybrid-ids:latest
Access the application at http://localhost:8501.
This project is prepared for dual-deployment and is currently hosted live on Hugging Face Spaces. The application runs on a CPU-only hardware allocation on the cloud, leveraging our hardware-aware model loader to ensure stability under low-resource environments.
Live Hugging Face Spaces App: https://spandan228-ids-dashboard.hf.space
The hybrid system was evaluated using standard evaluation metrics on the reference benchmark testing partition (
The supervised XGBoost classifier provides fast, high-accuracy detection of known attack classes.
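| Metric | Calculation Formula | Score (Known Attacks) |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | 99.99% (45,148 / 45,149) |
| Precision | $\frac{TP}{TP + FP}$ | 100.00% |
| Recall | $\frac{TP}{TP + FN}$ | 99.99% |
| F1-Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | 99.99% |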
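
The four scores in the table follow from confusion-matrix counts. The sketch below shows the standard definitions for a single positive class. The function name and the example counts are hypothetical and are not taken from the project's evaluation run; for a multi-class model these values would be computed per class and then averaged.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=10, tn=890, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```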
The unsupervised autoencoder acts as a fallback, detecting zero-day or unknown attacks by measuring how far a sample's reconstruction error deviates from the benign baseline.
| Metric / Baseline Parameter | Value | Description |
|---|---|---|
| Baseline Normal MSE | 0.063459 | Average reconstruction error for benign traffic |
| Average Malicious MSE | 0.613365 | Average reconstruction error for attack traffic |
| Error Multiplier | 9.67x | Reconstruction error ratio (average malicious MSE / baseline normal MSE, $0.613365 / 0.063459$) |
| Anomaly Decision Threshold | 0.180529 | 95th percentile of normal benign traffic error |
| Zero-Day Detection Rate | 63.82% | Attacks flagged without any signature matching |
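
The threshold and detection-rate rows can be reproduced in principle with a percentile calculation over benign reconstruction errors. The sketch below uses only the standard library. The helper names and the sample error lists are hypothetical, and the interpolation method (`inclusive`) is an assumption, since the source does not say how its 95th percentile was computed.

```python
import statistics
from typing import Sequence


def percentile_threshold(benign_errors: Sequence[float], percentile: int = 95) -> float:
    """Return the requested percentile of benign reconstruction errors."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) yields 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(benign_errors, n=100, method="inclusive")
    return cut_points[percentile - 1]


def detection_rate(attack_errors: Sequence[float], threshold: float) -> float:
    """Fraction of attack samples whose error exceeds the threshold."""
    flagged = sum(1 for err in attack_errors if err > threshold)
    return flagged / len(attack_errors)


# Hypothetical error values, not the project's measurements.
benign = [i / 100 for i in range(101)]      # 0.00, 0.01, ..., 1.00
attacks = [0.50, 0.90, 0.96, 1.20, 2.00]
threshold = percentile_threshold(benign)
print(threshold, detection_rate(attacks, threshold))

# The error multiplier in the table is the ratio of the two average MSE values.
ratio = 0.613365 / 0.063459
print(round(ratio, 2))
```

Setting the threshold at the 95th percentile of benign error means roughly 5% of benign samples will exceed it by construction, which is the trade-off behind the detection rate reported in the table.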
The hybrid system is optimized for high-throughput, low-latency enterprise environments:
- XGBoost Inference: $\approx 0.08\text{ ms}$ per packet.
- Autoencoder Inference: $\approx 0.20\text{ ms}$ per packet (on CPU).
- Total Pipeline Latency: $0.28\text{ ms}$ per packet (approx. $280\ \mu\text{s}$).
- Throughput Capacity: Able to process ~3,570 packets per second (PPS) under a CPU-bound single-thread regime, and up to ~15,000 PPS with multi-threaded batching on GPU-accelerated local deployments.
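
The single-thread throughput figure follows directly from the per-packet latencies. The short calculation below assumes the two models run back to back on one thread with no other overhead; `throughput_pps` is a hypothetical helper name.

```python
from typing import Sequence


def throughput_pps(stage_latencies_ms: Sequence[float]) -> float:
    """Packets per second when stages run sequentially on one thread."""
    total_ms = sum(stage_latencies_ms)
    if total_ms <= 0:
        raise ValueError("total latency must be positive")
    return 1000.0 / total_ms


# Latencies quoted above: XGBoost 0.08 ms, autoencoder 0.20 ms (CPU).
pps = throughput_pps([0.08, 0.20])
print(f"{pps:.0f} packets per second")
```

The result is about 3,571 packets per second, in line with the ~3,570 PPS quoted for the CPU-bound case.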