Network Intrusion Detection — ML Analysis

Binary classification of network connections as `normal` or `anomaly` using the KDD Cup 99 dataset.

Dataset

KDD Cup 1999 — one of the most widely used datasets for network intrusion detection research.

| Property | Value |
|---|---|
| Source | Kaggle: sampadab17/network-intrusion-detection |
| Train samples | 25,192 |
| Test samples | 22,544 |
| Features | 41 |
| Target | `class`: `normal` / `anomaly` |
| Class balance | 53.4% normal / 46.6% anomaly |

Features cover three categories: basic connection properties (protocol_type, src_bytes, flag), content-level signals (num_failed_logins, root_shell), and traffic statistics (count, serror_rate, same_srv_rate).
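The class balance in the table above can be rechecked directly from the training file. The following is a minimal, standard-library-only sketch; it assumes `data/Train_data.csv` has a header row that includes the `class` column, and `load_rows` and `class_balance` are hypothetical helper names that do not come from the notebook.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def class_balance(rows, target="class"):
    """Return {label: percentage of rows}, rounded to one decimal place."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Usage (path assumed from the project layout in this README):
#   rows = load_rows("data/Train_data.csv")
#   print(len(rows), class_balance(rows))
```

If the result differs noticeably from the figures in the table, the downloaded file is probably not the version this README describes.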

Project Structure

cybersecurity/
├── cybersecurity.ipynb       # Main notebook (English)
├── cybersecurity_ru.ipynb    # Russian version
├── load_data.py              # Script to download dataset from Kaggle
├── requirements.txt
├── README.md
└── data/
    ├── Train_data.csv
    └── Test_data.csv

Setup

1. Clone / download the repository

2. Install dependencies

pip install -r requirements.txt

3. Download the dataset

python load_data.py

This step requires a Kaggle account and a `kaggle.json` API key file placed in `~/.kaggle/`.

4. Run the notebook

jupyter notebook cybersecurity.ipynb

Or open in VS Code with the Jupyter extension.
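The download in step 3 fails if the Kaggle API key file is missing or malformed, so it can help to check it first. The sketch below is one way to do that, assuming the usual token file layout with `username` and `key` fields; `check_kaggle_credentials` is a hypothetical helper and is not part of `load_data.py`.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems with the Kaggle API key file (empty list if none found)."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    # The token file is expected to hold a non-empty "username" and "key".
    return [f"missing field: {name}" for name in ("username", "key") if not data.get(name)]


# Usage:
#   problems = check_kaggle_credentials()
#   print(problems or "kaggle.json looks fine")
```

This only checks that the file exists and has the expected fields; it does not verify that the key is accepted by Kaggle.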

Models

Three classifiers are trained and compared:

| Model | Notes |
|---|---|
| Logistic Regression | Linear baseline; trained on scaled features |
| Decision Tree | `max_depth=10` to limit overfitting |
| Random Forest | 100 trees; best performance |
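The comparison in the table could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions, not the notebook's code: the data is a synthetic stand-in from `make_classification` with 41 numeric features, the split sizes and `random_state` values are arbitrary, and the real dataset would additionally need its categorical columns (such as protocol_type and flag) encoded before training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 41 numeric features, binary target.
X, y = make_classification(
    n_samples=2000, n_features=41, n_informative=10, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Scaling lives inside the pipeline so it is fit on training data only.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    scores[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "roc_auc": roc_auc_score(y_val, proba),
    }
    print(f"{name}: accuracy={scores[name]['accuracy']:.3f}, "
          f"roc_auc={scores[name]['roc_auc']:.3f}")
```

The numbers printed for synthetic data say nothing about performance on the intrusion dataset; the block only illustrates the model settings listed in the table.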

Results

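Random Forest achieves the best scores. At the rounded precision shown here, all three models reach about 0.99 accuracy and Random Forest stands out only on ROC-AUC: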

| Model | Accuracy | ROC-AUC |
|---|---|---|
| Logistic Regression | ~0.99 | ~0.99 |
| Decision Tree | ~0.99 | ~0.99 |
| Random Forest | ~0.99 | ~1.00 |

Exact values are printed when you run the notebook.
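To make the two metrics concrete, here is a small pure-Python sketch of what each one measures. Accuracy depends on a fixed decision threshold (0.5 is assumed here), while ROC-AUC depends only on how the scores rank anomalies against normal connections, which is why two models can tie on one metric and differ on the other. The labels and scores are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = anomaly, 0 = normal.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

The pairwise loop is quadratic and only suitable for small examples; library implementations compute the same quantity from sorted scores.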

Top features identified by Random Forest: src_bytes, dst_host_same_srv_rate, flag, count, dst_host_serror_rate

Key Findings

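Connections whose `flag` value is S0 or REJ are almost exclusively attacks. These values are connection states rather than raw TCP flags: S0 is a connection attempt with no reply and REJ is a rejected attempt, both typical of port scanning.
ICMP traffic is predominantly anomalous in this dataset.
Anomalous connections tend to have either very low src_bytes (probing) or very high src_bytes (exploitation).
High count values (many recent connections to the same host) indicate repeated scanning behavior.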
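Findings like the first two above come from comparing the anomaly rate across values of a categorical column. A minimal sketch of that calculation follows; `anomaly_rate_by` is a hypothetical helper, and the sample rows are invented purely to show the shape of the output, not taken from the dataset.

```python
from collections import defaultdict


def anomaly_rate_by(rows, column, target="class", positive="anomaly"):
    """Return {value: (anomaly_fraction, n_rows)} for each distinct value of `column`."""
    totals = defaultdict(int)
    anomalies = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        anomalies[row[column]] += row[target] == positive  # bool adds as 0 or 1
    return {value: (anomalies[value] / n, n) for value, n in totals.items()}


# Tiny invented sample, only to demonstrate the output format.
sample = [
    {"flag": "SF", "class": "normal"},
    {"flag": "SF", "class": "normal"},
    {"flag": "SF", "class": "anomaly"},
    {"flag": "S0", "class": "anomaly"},
    {"flag": "S0", "class": "anomaly"},
    {"flag": "REJ", "class": "anomaly"},
]

for value, (rate, n) in sorted(anomaly_rate_by(sample, "flag").items()):
    print(f"{value}: {rate:.2f} anomalous over {n} rows")
```

Reporting the row count next to each rate matters, because a rate of 1.00 over a handful of rows is much weaker evidence than the same rate over thousands. The same function applies to protocol_type by passing a different column name.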

Business Recommendations

Deploy Random Forest as a real-time IDS: integrate it into network monitoring to auto-flag suspicious connections.
Alert on high-risk patterns: S0/REJ connection states, high count per source IP, and ICMP spikes.
Use probability scores, not just binary labels: tune the classification threshold to the organization's risk tolerance.
Retrain periodically: attack patterns evolve, so retrain every 2–3 months on fresh traffic data.
Translate EDA insights into firewall rules: block or rate-limit ICMP and alert on connection floods. These steps are actionable without ML infrastructure.
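The threshold recommendation above can be sketched as follows: given validation labels and predicted anomaly probabilities, choose the lowest threshold whose false positive rate stays within a tolerated budget, which keeps recall as high as that budget allows. The function names, the candidate grid, and the data are all illustrative assumptions.

```python
def rates_at_threshold(y_true, probs, threshold):
    """Return (recall, false_positive_rate) when probs >= threshold are flagged as anomalies."""
    tp = sum(1 for t, p in zip(y_true, probs) if t == 1 and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, probs) if t == 0 and p >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def pick_threshold(y_true, probs, max_fpr, candidates=None):
    """Lowest candidate threshold whose false positive rate is within max_fpr, else None."""
    candidates = candidates or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    for threshold in sorted(candidates):
        _, fpr = rates_at_threshold(y_true, probs, threshold)
        if fpr <= max_fpr:
            return threshold
    return None


# Made-up validation labels (1 = anomaly) and model probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.45, 0.60, 0.40, 0.70, 0.85, 0.95]

for threshold in (0.3, 0.5, 0.7):
    recall, fpr = rates_at_threshold(y_true, probs, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, fpr={fpr:.2f}")

print(pick_threshold(y_true, probs, max_fpr=0.25))  # 0.5
```

An organization that tolerates more false alarms would pass a larger `max_fpr` and get a lower threshold, catching more anomalies at the cost of more alerts to review.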

Notebooks

| Notebook | Language | Description |
|---|---|---|
| `cybersecurity.ipynb` | English | Main analysis with comments and explanations |
| `cybersecurity_ru.ipynb` | Russian | Same analysis fully in Russian |
