Binary classification of network connections as normal or anomaly using the KDD Cup 99 dataset.
The data comes from KDD Cup 1999, one of the most widely used datasets for network intrusion detection research.
| Property | Value |
|---|---|
| Source | Kaggle — sampadab17/network-intrusion-detection |
| Train samples | 25,192 |
| Test samples | 22,544 |
| Features | 41 |
| Target | class: normal / anomaly |
| Class balance | 53.4% normal / 46.6% anomaly |
Features cover three categories: basic connection properties (`protocol_type`, `src_bytes`, `flag`), content-level signals (`num_failed_logins`, `root_shell`), and traffic statistics (`count`, `serror_rate`, `same_srv_rate`).
cybersecurity/
├── cybersecurity.ipynb # Main notebook (English)
├── cybersecurity_ru.ipynb # Russian version
├── load_data.py # Script to download dataset from Kaggle
├── requirements.txt
├── README.md
└── data/
    ├── Train_data.csv
    └── Test_data.csv
1. Clone or download the repository.
2. Install dependencies: `pip install -r requirements.txt`
3. Download the dataset: `python load_data.py`. This requires a Kaggle account and a `kaggle.json` API key placed in `~/.kaggle/`.
4. Run the notebook: `jupyter notebook cybersecurity.ipynb`, or open it in VS Code with the Jupyter extension.
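Step 3 fails if the API key is missing, so it can help to check for it first. The following is a minimal sketch, not part of `load_data.py`; the helper name `find_kaggle_credentials` and its `home` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Optional


def find_kaggle_credentials(home: Optional[Path] = None) -> Path:
    """Return the path to kaggle.json, or raise if the file is missing."""
    # Default to the current user's home directory, as in ~/.kaggle/.
    home = Path.home() if home is None else home
    cred = home / ".kaggle" / "kaggle.json"
    if not cred.is_file():
        raise FileNotFoundError(
            f"Expected a Kaggle API key at {cred}; "
            "download one from your Kaggle account settings."
        )
    return cred
```

Calling `find_kaggle_credentials()` before running the download script gives a clearer error message than a failed API request.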
Three classifiers are trained and compared:
| Model | Notes |
|---|---|
| Logistic Regression | Linear baseline; scaled features |
| Decision Tree | max_depth=10 to prevent overfitting |
| Random Forest | 100 trees; best performance |
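The comparison in the table can be sketched with scikit-learn roughly as follows. This is an illustration under assumptions, not the notebook's code: it uses synthetic data from `make_classification` as a stand-in for the preprocessed features, and the split ratio, `random_state`, and `max_iter` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded 41-feature matrix (assumption:
# categorical columns have already been converted to numbers).
X, y = make_classification(
    n_samples=2000, n_features=41, n_informative=10, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scaling lives inside the pipeline so it is fit on training data only.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "roc_auc": roc_auc_score(y_val, proba),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"roc_auc={results[name]['roc_auc']:.3f}")
```

Scores on this synthetic data will not match the numbers reported for the real dataset; the sketch only shows how the three models are set up and evaluated side by side.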
Random Forest achieves the best scores, although all three models land within rounding distance of each other:
| Model | Accuracy | ROC-AUC |
|---|---|---|
| Logistic Regression | ~0.99 | ~0.99 |
| Decision Tree | ~0.99 | ~0.99 |
| Random Forest | ~0.99 | ~1.00 |
Exact values are printed when you run the notebook.
Top features identified by Random Forest: `src_bytes`, `dst_host_same_srv_rate`, `flag`, `count`, `dst_host_serror_rate`
- Connections with `flag` values `S0`/`REJ` are almost exclusively attacks (failed or refused handshakes, typical of port scanning)
- ICMP protocol is predominantly anomalous in this dataset
- Anomalous connections tend to have either very low `src_bytes` (probing) or very high `src_bytes` (exploitation)
- High `count` values indicate repeated scanning behavior
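Observations like the per-flag and per-protocol ones above come from grouping the labeled rows by a categorical column and measuring the share of anomalies in each group. A minimal pandas sketch follows; the rows are toy values made up for illustration (not taken from `Train_data.csv`), and the helper name `anomaly_rate_by` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; real values come from Train_data.csv.
df = pd.DataFrame({
    "flag": ["SF", "SF", "S0", "REJ", "S0", "SF"],
    "protocol_type": ["tcp", "udp", "tcp", "tcp", "tcp", "icmp"],
    "class": ["normal", "normal", "anomaly", "anomaly", "anomaly", "anomaly"],
})


def anomaly_rate_by(frame, column, target="class"):
    """Share of rows labeled 'anomaly' within each value of `column`."""
    is_anomaly = frame[target].eq("anomaly")
    # The mean of a boolean series is the fraction of True values per group.
    return is_anomaly.groupby(frame[column]).mean().sort_values(ascending=False)


print(anomaly_rate_by(df, "flag"))
print(anomaly_rate_by(df, "protocol_type"))
```

The same call works on the real training frame once it is loaded, for example with `pd.read_csv("data/Train_data.csv")`.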
- Deploy Random Forest as a real-time IDS: integrate it into network monitoring to auto-flag suspicious connections
- Alert on high-risk patterns: flag S0/REJ flags, high `count` per source IP, and ICMP spikes
- Use probability scores, not just binary labels: tune the classification threshold based on the organization's risk tolerance
- Retrain periodically: attack patterns evolve, so retrain every 2–3 months on fresh traffic data
- Translate EDA insights into firewall rules: block or rate-limit ICMP and alert on connection floods, which is actionable without ML infrastructure
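One way to tune the classification threshold from probability scores is to pick the strictest threshold that still catches a required share of known attacks. The sketch below is one possible approach, assuming labels are encoded as 1 for anomaly and 0 for normal; the function name `threshold_for_recall` and the `min_recall` target are hypothetical, and the numbers are toy values.

```python
import numpy as np


def threshold_for_recall(y_true, scores, min_recall=0.95):
    """Highest threshold whose recall on anomalies is at least min_recall.

    y_true: 1 for anomaly, 0 for normal.
    scores: predicted anomaly probabilities, e.g. from predict_proba.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    positives = (y_true == 1).sum()
    if positives == 0:
        raise ValueError("need at least one anomaly to measure recall")
    # Walk candidate thresholds from strictest to most lenient.
    for t in np.sort(np.unique(scores))[::-1]:
        caught = ((scores >= t) & (y_true == 1)).sum()
        if caught / positives >= min_recall:
            return float(t)
    # Only reached if min_recall is above 1; fall back to flagging everything.
    return float(scores.min())


# Toy validation labels and scores, made up for illustration.
y_val = [0, 0, 1, 1, 1, 0, 1]
p_val = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.60]
t = threshold_for_recall(y_val, p_val, min_recall=0.75)
flagged = np.asarray(p_val) >= t  # connections that would raise an alert
```

A stricter recall target lowers the threshold and flags more connections, so the choice trades missed attacks against alert volume.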
| Notebook | Language | Description |
|---|---|---|
| `cybersecurity.ipynb` | English | Main analysis with comments and explanations |
| `cybersecurity_ru.ipynb` | Russian | Same analysis fully in Russian |