nebula387/cybersecurity

Network Intrusion Detection — ML Analysis

Binary classification of network connections as normal or anomalous, using the KDD Cup 99 dataset.


Dataset

KDD Cup 1999 — one of the most widely used datasets for network intrusion detection research.

| Property | Value |
|---|---|
| Source | Kaggle (sampadab17/network-intrusion-detection) |
| Train samples | 25,192 |
| Test samples | 22,544 |
| Features | 41 |
| Target | class: normal / anomaly |
| Class balance | 53.4% normal / 46.6% anomaly |

Features cover three categories: basic connection properties (protocol_type, src_bytes, flag), content-level signals (num_failed_logins, root_shell), and traffic statistics (count, serror_rate, same_srv_rate).
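
As a rough illustration of the table and feature groups above, the sketch below checks the class balance of a CSV using only the standard library. It assumes the target column is named class (as the table suggests); the FEATURE_GROUPS mapping and the four sample rows are made up for the example and are not taken from the notebook.

```python
import csv
import io
from collections import Counter

# Feature groups as described above; only the example columns are listed.
FEATURE_GROUPS = {
    "basic": ["protocol_type", "src_bytes", "flag"],
    "content": ["num_failed_logins", "root_shell"],
    "traffic": ["count", "serror_rate", "same_srv_rate"],
}


def class_balance(csv_file, target="class"):
    """Return {label: share of rows} for the target column."""
    counts = Counter(row[target] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Tiny made-up sample standing in for data/Train_data.csv.
SAMPLE = io.StringIO(
    "protocol_type,src_bytes,flag,class\n"
    "tcp,491,SF,normal\n"
    "tcp,0,S0,anomaly\n"
    "udp,146,SF,normal\n"
    "icmp,1032,SF,anomaly\n"
)

balance = class_balance(SAMPLE)
print(balance)
```

On the real training file, calling class_balance on open("data/Train_data.csv", newline="") should give shares close to the 53.4% / 46.6% split listed in the table, provided the column name assumption holds.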


Project Structure

cybersecurity/
├── cybersecurity.ipynb       # Main notebook (English)
├── cybersecurity_ru.ipynb    # Russian version
├── load_data.py              # Script to download dataset from Kaggle
├── requirements.txt
├── README.md
└── data/
    ├── Train_data.csv
    └── Test_data.csv

Setup

1. Clone / download the repository

2. Install dependencies

pip install -r requirements.txt

3. Download the dataset

python load_data.py

Requires a Kaggle account and a kaggle.json API key placed in ~/.kaggle/.

4. Run the notebook

jupyter notebook cybersecurity.ipynb

Or open in VS Code with the Jupyter extension.
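
Step 3 fails if the Kaggle API key is missing, so a quick check before running load_data.py can save a confusing traceback. The helper below is a hypothetical sketch, not part of the repository: it only verifies that kaggle.json exists in the given directory and contains the username and key fields that Kaggle API tokens carry.

```python
import json
import tempfile
from pathlib import Path


def find_kaggle_credentials(config_dir=None):
    """Return the parsed kaggle.json, or raise with a readable message."""
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    path = config_dir / "kaggle.json"
    if not path.is_file():
        raise FileNotFoundError(f"Expected API key at {path}")
    creds = json.loads(path.read_text())
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"{path} is missing fields: {sorted(missing)}")
    return creds


# Demo against a throwaway directory so no real credentials are touched.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "kaggle.json").write_text(
        json.dumps({"username": "demo", "key": "not-a-real-key"})
    )
    demo = find_kaggle_credentials(tmp)
    print(demo["username"])
```

Calling find_kaggle_credentials() with no argument checks the default ~/.kaggle/ location mentioned in step 3.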


Models

Three classifiers are trained and compared:

| Model | Notes |
|---|---|
| Logistic Regression | Linear baseline; scaled features |
| Decision Tree | max_depth=10 to prevent overfitting |
| Random Forest | 100 trees; best performance |
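
The comparison above could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions, not the notebook's code: the data is synthetic (make_classification with 41 columns standing in for the encoded features), and the split ratio, random_state values, and max_iter setting are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 41 encoded features (assumption for the demo).
X, y = make_classification(
    n_samples=2000, n_features=41, n_informative=10, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scaling lives inside the pipeline so it is fit on training data only.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "roc_auc": roc_auc_score(y_val, proba),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f} "
          f"roc_auc={results[name]['roc_auc']:.3f}")
```

Scores on this synthetic data say nothing about the real dataset; the point is only the structure of the comparison.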

Results

Random Forest achieves the best scores on both metrics, though all three models are close:

| Model | Accuracy | ROC-AUC |
|---|---|---|
| Logistic Regression | ~0.99 | ~0.99 |
| Decision Tree | ~0.99 | ~0.99 |
| Random Forest | ~0.99 | ~1.00 |

Exact values are printed when you run the notebook.

Top features identified by Random Forest: src_bytes, dst_host_same_srv_rate, flag, count, dst_host_serror_rate
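
A ranking like the one above is typically read from a fitted forest's feature_importances_ attribute. The sketch below shows one way to do that; the helper top_features, the four column names, and the labeling rule are invented for the demo and are not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_features(model, feature_names, k=5):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Made-up data: the label depends strongly on column 0, weakly on column 1.
rng = np.random.default_rng(0)
names = ["src_bytes", "count", "noise_a", "noise_b"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = top_features(forest, names, k=2)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so sklearn.inspection.permutation_importance on held-out data is a reasonable cross-check.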


Key Findings

  • Connections whose flag value (connection status) is S0 or REJ are almost exclusively attacks: these are unanswered or rejected connection attempts, typical of port scanning
  • ICMP protocol is predominantly anomalous in this dataset
  • Anomalous connections tend to have either very low src_bytes (probing) or very high src_bytes (exploitation)
  • High count values indicate repeated scanning behavior
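
Findings like the first two bullets come from comparing the anomaly rate across values of a categorical column. A minimal standard-library sketch of that calculation is below; the function name anomaly_rate_by and the six rows are made up for illustration, and the column names flag, protocol_type, and class are assumed to match the CSV.

```python
from collections import defaultdict


def anomaly_rate_by(rows, column, target="class", positive="anomaly"):
    """Share of rows labeled `positive` for each distinct value of `column`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        hits[row[column]] += row[target] == positive  # True counts as 1
    return {value: hits[value] / totals[value] for value in totals}


# Made-up rows; the real data would come from data/Train_data.csv.
rows = [
    {"flag": "S0", "protocol_type": "tcp", "class": "anomaly"},
    {"flag": "S0", "protocol_type": "tcp", "class": "anomaly"},
    {"flag": "REJ", "protocol_type": "tcp", "class": "anomaly"},
    {"flag": "SF", "protocol_type": "tcp", "class": "normal"},
    {"flag": "SF", "protocol_type": "udp", "class": "normal"},
    {"flag": "SF", "protocol_type": "icmp", "class": "anomaly"},
]

rates = anomaly_rate_by(rows, "flag")
print(rates)
```

Passing "protocol_type" instead of "flag" gives the per-protocol view behind the ICMP observation.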

Business Recommendations

  1. Deploy Random Forest as a real-time IDS — integrate into network monitoring to auto-flag suspicious connections
  2. Alert on high-risk patterns — flag S0/REJ flags, high count per source IP, ICMP spikes
  3. Use probability scores, not just binary labels — tune the classification threshold based on the organization's risk tolerance
  4. Retrain periodically — attack patterns evolve; retrain every 2–3 months on fresh traffic data
  5. Translate EDA insights into firewall rules: block or rate-limit ICMP and alert on connection floods. These rules are actionable without ML infrastructure
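
Recommendation 3 can be made concrete with a small threshold search. The sketch below picks the highest threshold that still meets a required recall, which is one possible way to encode risk tolerance, not the notebook's method. The function names, scores, and labels are hypothetical; in practice the scores would be the anomaly-class column of predict_proba on a validation set.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are flagged as anomalies."""
    pairs = list(zip(scores, labels))
    tp = sum(s >= threshold and y == 1 for s, y in pairs)
    fp = sum(s >= threshold and y == 0 for s, y in pairs)
    fn = sum(s < threshold and y == 1 for s, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def highest_threshold_with_recall(scores, labels, min_recall):
    """Highest observed score usable as a threshold with recall >= min_recall."""
    for threshold in sorted(set(scores), reverse=True):
        if precision_recall_at(scores, labels, threshold)[1] >= min_recall:
            return threshold
    return None  # no threshold reaches the requested recall


# Made-up validation scores (1 = anomaly).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

strict = highest_threshold_with_recall(scores, labels, min_recall=1.0)
relaxed = highest_threshold_with_recall(scores, labels, min_recall=0.6)
print(strict, precision_recall_at(scores, labels, strict))
print(relaxed, precision_recall_at(scores, labels, relaxed))
```

A team that cannot afford missed attacks would pick the stricter recall target and accept more false alarms; a team drowning in alerts might choose the opposite.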

Notebooks

| Notebook | Language | Description |
|---|---|---|
| cybersecurity.ipynb | English | Main analysis with comments and explanations |
| cybersecurity_ru.ipynb | Russian | Same analysis fully in Russian |
