This repository contains a minimal, reproducible experiment for evaluating stream deduplication strategies on the CIC-IDS2017 CSV feature dataset before a fixed Random Forest anomaly detector.
The experiment implemented here is:
CIC-IDS2017 CSV files -> cleaning -> optional duplicate injection -> fingerprinting -> deduplication -> RF anomaly detection
The goal is to measure how different deduplication strategies affect event volume, benign/attack drops, throughput, and fixed RF detection metrics.
This repository is developed and primarily documented for macOS/Linux environments using Unix-like shells. Windows PowerShell setup guidance is included for convenience, but Windows has not been fully tested in this project.
Keep one copy of this README for all platforms. After the virtual environment is activated, the Python module commands are the same across platforms.
After placing the CIC-IDS2017 CSV files in data/, the minimum reproduction
sequence is:
macOS / Linux (Unix-like shell):
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
python3 -m src.train_rf
python3 -m src.runnerWindows (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
python -m src.train_rf
python -m src.runnerThere are no standalone executable binaries in this repository. The project is run through the Python module commands documented below.
This repository evaluates only the cleaning/deduplication + RF anomaly detection branch.
Snort is not part of this experiment. It is not used by the code, runner, metrics, dataset preparation, or README commands in this repository. The experiments here should not be interpreted as measuring Snort performance or Snort acceleration.
The downstream RF detector is intentionally kept fixed while deduplication is varied. This keeps deduplication as the main changing variable.
.
|-- README.md
|-- requirements.txt
|-- data/ # Place CIC-IDS2017 CSV files here
|-- artifacts/ # RF model artifacts written/read here
| |-- rf_model.joblib
| `-- rf_features.joblib
`-- src/
|-- load_clean.py # CIC-IDS2017 CSV loading and cleaning
|-- fingerprint.py # Event fingerprint variants
|-- duplicate_injector.py # Controlled exact replay duplicate injection
|-- microbatch.py # Micro-batch iterator
|-- metrics.py # Runtime aggregation helpers
|-- no_dedupe.py # NoDedupe baseline
|-- exact_map.py # Bounded exact-map deduplication
|-- bloom.py # Bloom-only approximate deduplication
|-- bloom_exact.py # Bloom front filter + ExactMap confirmation
|-- train_rf.py # Offline Random Forest training
|-- runner.py # Main experiment runner
|-- ids/
| `-- rf_detector.py # Saved RF inference wrapper
`-- experiments/
`-- bloom_fairness.py # Isolated Bloom parameter sweep side experiment
The code was developed with Python 3.14.2. Use Python 3.11 or newer.
From a fresh clone:
macOS / Linux (Unix-like shell):
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txtWindows (PowerShell):
python --version
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txtAll commands below assume they are run from the repository root after the virtual environment has been activated.
Place the CIC-IDS2017 CSV files in the repository data/ directory:
data/
|-- Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
|-- Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
|-- Friday-WorkingHours-Morning.pcap_ISCX.csv
|-- Monday-WorkingHours.pcap_ISCX.csv
|-- Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
|-- Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
|-- Tuesday-WorkingHours.pcap_ISCX.csv
`-- Wednesday-workingHours.pcap_ISCX.csv
The loader reads every *.csv file in data/ in sorted filename order. With
the expected eight CIC-IDS2017 CSV files present, the cleaned merged dataset is
approximately 2.83 million rows.
The loader keeps only the columns needed for fingerprinting and RF inference, normalizes column names, parses timestamps, removes invalid rows, and creates a binary label:
binary_label = 0 for BENIGN
binary_label = 1 for all non-BENIGN labels
The main runner uses saved RF artifacts from artifacts/:
artifacts/rf_model.joblib
artifacts/rf_features.joblib
If these files are missing, train the RF model first:
python3 -m src.train_rfThis script:
- loads and cleans the CIC-IDS2017 CSV files from
data/ - trains a
RandomForestClassifier - prints baseline precision, recall, f1, and a classification report
- writes the model and feature list into
artifacts/
Note: the current RF training script uses a random row split. This is sufficient for the current deduplication experiment scaffold, but a stricter day/file holdout split should be used for final evaluation claims.
The fixed default fingerprint for the current experiments is:
packet_counts
This fingerprint combines the basic flow identity fields with:
Total Fwd Packets
Total Backward Packets
It was selected after fingerprint sensitivity testing because it was a better practical tradeoff than:
basic, which was too coarse and over-collapsed attack trafficduration, which was too strict and removed almost nothing
Other fingerprint modes remain available for sensitivity checks, but normal
deduplication experiments should keep packet_counts fixed.
The main runner supports four deduplication modes:
no_dedupe
exact_map
bloom
bloom_exact
no_dedupe
Passes every event through unchanged. This is the no-deduplication baseline.
exact_map
Uses a bounded recent set of fingerprints. If a fingerprint was seen recently, the event is dropped. This is exact within the configured recent state window.
bloom
Uses a Bloom filter as an approximate deduplicator. Because Bloom filters can produce false positives, this mode may drop some non-duplicate records.
bloom_exact
Uses Bloom as a front filter and ExactMap for confirmation. Bloom identifies "maybe seen" fingerprints, then ExactMap confirms before dropping. This avoids Bloom false-positive drops while preserving a Bloom-filtered path for comparison.
Run the default baseline. This command runs the exact_map deduplication
baseline with the fixed packet_counts fingerprint:
python3 -m src.runnerDefault settings:
fingerprint mode: packet_counts
dedupe mode: exact_map
batch size: 5000
ExactMap max recent state: 50000
Bloom bits: 50000000
Bloom hashes: 4
duplicate rate: 0
Compare all deduplication modes with the fixed packet_counts fingerprint:
python3 -m src.runner --dedupe-mode allRun one dedupe mode explicitly:
python3 -m src.runner --dedupe-mode no_dedupe
python3 -m src.runner --dedupe-mode exact_map
python3 -m src.runner --dedupe-mode bloom
python3 -m src.runner --dedupe-mode bloom_exactRun Bloom with explicit parameters:
python3 -m src.runner --dedupe-mode bloom --bloom-bits 50000000 --bloom-hashes 4Run Bloom+ExactMap with explicit parameters:
python3 -m src.runner --dedupe-mode bloom_exact --bloom-bits 50000000 --bloom-hashes 4Run the fingerprint sensitivity sweep only if needed:
python3 -m src.runner --fingerprint-mode allThe main runner can inject controlled exact replay duplicates before fingerprinting and deduplication.
Supported duplicate rates:
0, 5, 10, 20 percent
Example comparison under 10 percent exact replay pressure:
python3 -m src.runner --dedupe-mode all --duplicate-rate 10Run the full duplicate-rate sweep:
for rate in 0 5 10 20; do
python3 -m src.runner --dedupe-mode all --duplicate-rate "$rate"
doneInjection details:
- exact replay only
- deterministic by default with
--duplicate-seed 42 - duplicate rows are inserted immediately after sampled source rows
- injection happens after loading/cleaning and before fingerprinting
The side experiment in src/experiments/bloom_fairness.py exists only for Bloom
parameter fairness checks. It does not change the main runner or main experiment
defaults.
Run the full default Bloom fairness sweep:
python3 -m src.experiments.bloom_fairnessDefault sweep:
bloom_bits: 10000000, 25000000, 50000000, 100000000
bloom_hashes: 2, 3, 4, 5
Run with controlled duplicate pressure:
python3 -m src.experiments.bloom_fairness --duplicate-rate 10For faster sanity checks, cap the number of cleaned rows before duplicate injection:
python3 -m src.experiments.bloom_fairness \
--max-rows 50000 \
--duplicate-rate 10 \
--bloom-bits 10000000 25000000 50000000 \
--bloom-hashes 2 3 4The side experiment always uses the fixed packet_counts fingerprint and the
same saved RF detector as the main runner.
The scripts print progress and summary tables to stdout.
src.train_rf prints:
- total loaded rows
- class distribution
- train/test row counts
- RF precision, recall, and f1
- classification report
- saved artifact paths
src.runner prints:
- loaded files and cleaned row counts
- merged row count and class distribution
- duplicate injection settings
- fingerprint mode
- dedupe mode
- total input/output/dropped rows
- throughput and average batch time
- final dedupe state size
- dropped benign and attack rows
- RF precision, recall, and f1
- comparison table when multiple modes are run
src.experiments.bloom_fairness prints:
- loaded row counts
- optional row limit information
- duplicate injection settings
- fixed fingerprint mode
- ExactMap reference results
- Bloom and Bloom+ExactMap sweep results
- Bloom memory estimate in bytes
ExactMapis the default deduplication method in the main experiment.BloomandBloom+ExactMapare included as comparative deduplication methods.- The isolated Bloom fairness experiment is provided for parameter sensitivity checks and does not change the main runner defaults.
- Controlled duplicate injection is included for exact replay-pressure evaluation.
- Observations from this repository are implementation-specific and should be reported with the dataset split, fingerprint mode, duplicate rate, Bloom parameters, and ExactMap state size.
ModuleNotFoundError for packages such as pandas, sklearn, or joblib
Activate the virtual environment and install dependencies:
source .venv/bin/activate
python3 -m pip install -r requirements.txtModuleNotFoundError: No module named 'src'
Run scripts as modules from the repository root:
python3 -m src.runner
python3 -m src.train_rfDo not run python3 src/runner.py from inside another directory.
ValueError: No CSV files found in: data
Create data/ in the repository root and place the CIC-IDS2017 CSV files there.
The runner expects data/*.csv.
Missing RF artifact files
If artifacts/rf_model.joblib or artifacts/rf_features.joblib is missing, run:
python3 -m src.train_rfEncoding warning for the WebAttacks CSV
The loader first tries UTF-8 and then falls back to latin1 when needed. Seeing a warning like this is expected for the WebAttacks file:
[warn] utf-8 failed for data/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv, retrying with latin1
Long runtime
The full CIC-IDS2017 dataset has approximately 2.83 million cleaned rows. Bloom
fairness sweeps can be expensive because each parameter combination reruns the
dedupe and RF path. Use --max-rows in the side experiment for quick sanity
checks.
Before submitting or reviewing results:
- Confirm Python version:
python3 --version - Create and activate
.venv - Install
requirements.txt - Place the eight CIC-IDS2017 CSV files in
data/ - Train RF artifacts if they are not already present:
python3 -m src.train_rf - Keep
packet_countsas the fixed fingerprint for dedupe comparisons - Record dedupe mode, duplicate rate, Bloom bits, Bloom hashes, and ExactMap
max_recent - Run commands from the repository root
- Save stdout tables for reported experimental results