NIDaaS Deduplication Experiment

This repository contains a minimal, reproducible experiment for evaluating stream deduplication strategies applied to the CIC-IDS2017 CSV feature dataset ahead of a fixed Random Forest (RF) anomaly detector.

The experiment implemented here is:

CIC-IDS2017 CSV files -> cleaning -> optional duplicate injection -> fingerprinting -> deduplication -> RF anomaly detection

The goal is to measure how different deduplication strategies affect event volume, benign/attack drops, throughput, and fixed RF detection metrics.

Platform Notes

This repository is developed and primarily documented for macOS/Linux environments using Unix-like shells. Windows PowerShell setup guidance is included for convenience, but Windows has not been fully tested in this project.

This single README covers all platforms. After the virtual environment is activated, the Python module commands are the same everywhere, except that Windows uses python where macOS/Linux uses python3.

Quick Reproduction

After placing the CIC-IDS2017 CSV files in data/, the minimum reproduction sequence is:

macOS / Linux (Unix-like shell):

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
python3 -m src.train_rf
python3 -m src.runner

Windows (PowerShell):

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
python -m src.train_rf
python -m src.runner

There are no standalone executable binaries in this repository. The project is run through the Python module commands documented below.

Experiment Scope

This repository evaluates only the cleaning/deduplication + RF anomaly detection branch.

Snort is not part of this experiment. It is not used by the code, runner, metrics, dataset preparation, or README commands in this repository. The experiments here should not be interpreted as measuring Snort performance or Snort acceleration.

The downstream RF detector is intentionally kept fixed while deduplication is varied. This keeps deduplication as the main changing variable.

Repository Structure

.
|-- README.md
|-- requirements.txt
|-- data/                         # Place CIC-IDS2017 CSV files here
|-- artifacts/                    # RF model artifacts written/read here
|   |-- rf_model.joblib
|   `-- rf_features.joblib
`-- src/
    |-- load_clean.py             # CIC-IDS2017 CSV loading and cleaning
    |-- fingerprint.py            # Event fingerprint variants
    |-- duplicate_injector.py     # Controlled exact replay duplicate injection
    |-- microbatch.py             # Micro-batch iterator
    |-- metrics.py                # Runtime aggregation helpers
    |-- no_dedupe.py              # NoDedupe baseline
    |-- exact_map.py              # Bounded exact-map deduplication
    |-- bloom.py                  # Bloom-only approximate deduplication
    |-- bloom_exact.py            # Bloom front filter + ExactMap confirmation
    |-- train_rf.py               # Offline Random Forest training
    |-- runner.py                 # Main experiment runner
    |-- ids/
    |   `-- rf_detector.py        # Saved RF inference wrapper
    `-- experiments/
        `-- bloom_fairness.py     # Isolated Bloom parameter sweep side experiment

Environment Setup

The code was developed with Python 3.14.2. Use Python 3.11 or newer.

From a fresh clone:

macOS / Linux (Unix-like shell):

python3 --version
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

Windows (PowerShell):

python --version
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

All commands below assume they are run from the repository root after the virtual environment has been activated.

Dataset Placement

Place the CIC-IDS2017 CSV files in the repository data/ directory:

data/
|-- Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
|-- Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
|-- Friday-WorkingHours-Morning.pcap_ISCX.csv
|-- Monday-WorkingHours.pcap_ISCX.csv
|-- Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
|-- Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
|-- Tuesday-WorkingHours.pcap_ISCX.csv
`-- Wednesday-workingHours.pcap_ISCX.csv

The loader reads every *.csv file in data/ in sorted filename order. With the expected eight CIC-IDS2017 CSV files present, the cleaned merged dataset is approximately 2.83 million rows.

The loader keeps only the columns needed for fingerprinting and RF inference, normalizes column names, parses timestamps, removes invalid rows, and creates a binary label:

binary_label = 0 for BENIGN
binary_label = 1 for all non-BENIGN labels
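The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the code in src/load_clean.py; the helper name to_binary_label and the whitespace/case normalization are assumptions made for the example.

```python
def to_binary_label(label: str) -> int:
    """Map a CIC-IDS2017 label string to 0 (BENIGN) or 1 (anything else)."""
    # Assumption: labels may carry stray whitespace or mixed case.
    return 0 if label.strip().upper() == "BENIGN" else 1


labels = ["BENIGN", "DDoS", " PortScan ", "benign"]
binary = [to_binary_label(value) for value in labels]
print(binary)
```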

RF Model Training

The main runner uses saved RF artifacts from artifacts/:

artifacts/rf_model.joblib
artifacts/rf_features.joblib

If these files are missing, train the RF model first:

python3 -m src.train_rf

This script:

loads and cleans the CIC-IDS2017 CSV files from data/
trains a RandomForestClassifier
prints baseline precision, recall, f1, and a classification report
writes the model and feature list into artifacts/

Note: the current RF training script uses a random row split. This is sufficient for the current deduplication experiment scaffold, but a stricter day/file holdout split should be used for final evaluation claims.
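As a rough illustration of the stricter holdout mentioned in the note, the sketch below splits rows by the file they came from rather than by random row. It assumes each cleaned row carries a hypothetical source_file field; the current training script may not record one.

```python
def split_by_source_file(rows, holdout_files):
    """Send every row from a held-out file to test, all others to train."""
    train, test = [], []
    for row in rows:
        # source_file is a hypothetical column naming the originating CSV.
        if row["source_file"] in holdout_files:
            test.append(row)
        else:
            train.append(row)
    return train, test


rows = [
    {"source_file": "Monday-WorkingHours.pcap_ISCX.csv", "binary_label": 0},
    {"source_file": "Tuesday-WorkingHours.pcap_ISCX.csv", "binary_label": 1},
    {"source_file": "Friday-WorkingHours-Morning.pcap_ISCX.csv", "binary_label": 0},
]
holdout = {"Friday-WorkingHours-Morning.pcap_ISCX.csv"}
train, test = split_by_source_file(rows, holdout)
print(len(train), len(test))
```

With a file-level split, no capture day contributes rows to both sides, which a random row split cannot guarantee.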

Fingerprint Selection

The fixed default fingerprint for the current experiments is:

packet_counts

This fingerprint combines the basic flow identity fields with:

Total Fwd Packets
Total Backward Packets

It was selected after fingerprint sensitivity testing because it was a better practical tradeoff than:

basic, which was too coarse and over-collapsed attack traffic
duration, which was too strict and removed almost nothing

Other fingerprint modes remain available for sensitivity checks, but normal deduplication experiments should keep packet_counts fixed.
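To make the difference between basic and packet_counts concrete, here is a hedged sketch of how such fingerprints could be built. The identity column names and the use of SHA-1 over a joined string are assumptions for illustration; src/fingerprint.py may use different fields or a different hash.

```python
import hashlib

# Assumption: these columns stand in for the "basic flow identity fields".
IDENTITY_FIELDS = ("Source IP", "Source Port", "Destination IP",
                   "Destination Port", "Protocol")
PACKET_COUNT_FIELDS = ("Total Fwd Packets", "Total Backward Packets")


def fingerprint(row: dict, mode: str = "packet_counts") -> str:
    """Hash the selected fields of one flow record into a hex digest."""
    if mode == "basic":
        fields = IDENTITY_FIELDS
    elif mode == "packet_counts":
        fields = IDENTITY_FIELDS + PACKET_COUNT_FIELDS
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    key = "|".join(str(row[name]) for name in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


flow_a = {
    "Source IP": "10.0.0.1", "Source Port": 1234,
    "Destination IP": "10.0.0.2", "Destination Port": 80, "Protocol": 6,
    "Total Fwd Packets": 3, "Total Backward Packets": 2,
}
# Same endpoints, different packet count.
flow_b = {**flow_a, "Total Fwd Packets": 30}

print(fingerprint(flow_a, "basic") == fingerprint(flow_b, "basic"))
print(fingerprint(flow_a) == fingerprint(flow_b))
```

In this sketch the two flows collapse under basic but stay distinct under packet_counts, which is the coarse-versus-practical tradeoff described above.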

Deduplication Modes

The main runner supports four deduplication modes:

no_dedupe
exact_map
bloom
bloom_exact

no_dedupe

Passes every event through unchanged. This is the no-deduplication baseline.

exact_map

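Uses a bounded set of recently seen fingerprints. If a fingerprint is in that set, the event is dropped. This is exact within the configured recent-state window: it never drops a non-duplicate, but a duplicate whose original has already left the window passes through.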
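A bounded recent set of this kind might look like the sketch below. The class name and the first-in, first-out eviction policy are assumptions; src/exact_map.py may evict differently.

```python
from collections import OrderedDict


class BoundedExactSet:
    """Remember at most max_recent fingerprints, evicting the oldest first."""

    def __init__(self, max_recent: int = 50000):
        self.max_recent = max_recent
        self._seen = OrderedDict()

    def is_duplicate(self, fp: str) -> bool:
        """Return True if fp is in the window; otherwise record it."""
        if fp in self._seen:
            return True
        self._seen[fp] = None
        if len(self._seen) > self.max_recent:
            self._seen.popitem(last=False)  # drop the oldest fingerprint
        return False


window = BoundedExactSet(max_recent=2)
stream = ["a", "b", "a", "c", "d", "a"]
kept = [fp for fp in stream if not window.is_duplicate(fp)]
print(kept)
```

With a window of two, the second "a" is dropped while its original is still remembered, but the final "a" is kept because the original has been evicted by then.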

bloom

Uses a Bloom filter as an approximate deduplicator. Because Bloom filters can produce false positives, this mode may drop some non-duplicate records.
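The false-positive behaviour can be seen in a small self-contained sketch. The class below is an illustrative Bloom filter using double hashing over a SHA-256 digest; it is an assumption for this example and not necessarily how src/bloom.py derives its bit positions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with num_hashes probe positions per item."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self._bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def bloom_dedupe(fingerprints, bloom):
    """Drop any fingerprint the filter reports as possibly seen."""
    kept = []
    for fp in fingerprints:
        if fp in bloom:
            continue  # true duplicate or a false positive
        bloom.add(fp)
        kept.append(fp)
    return kept


print(bloom_dedupe(["a", "b", "a"], BloomFilter(10000, 4)))

# A deliberately undersized filter saturates and drops distinct items.
distinct = [f"fp{i}" for i in range(100)]
tiny_kept = bloom_dedupe(distinct, BloomFilter(8, 2))
print(len(tiny_kept), "of", len(distinct), "distinct items kept")
```

Each kept item sets at least one new bit, so the 8-bit filter can keep at most 8 of the 100 distinct items; everything after that is a false-positive drop.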

bloom_exact

Uses Bloom as a front filter and ExactMap for confirmation. Bloom identifies "maybe seen" fingerprints, then ExactMap confirms before dropping. This avoids Bloom false-positive drops while preserving a Bloom-filtered path for comparison.
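The two-stage check can be sketched as follows. TinyBloom and bloom_exact_dedupe are illustrative names, and the exact stage here is a simple first-in, first-out window; the real src/bloom_exact.py may be organized differently.

```python
import hashlib
from collections import OrderedDict


class TinyBloom:
    """Compact Bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_exact_dedupe(fingerprints, bloom, max_recent):
    """Drop only when Bloom says "maybe seen" AND the exact window agrees."""
    recent = OrderedDict()
    kept = []
    for fp in fingerprints:
        if fp in bloom and fp in recent:
            continue  # confirmed duplicate
        bloom.add(fp)
        recent[fp] = None
        if len(recent) > max_recent:
            recent.popitem(last=False)  # evict the oldest fingerprint
        kept.append(fp)
    return kept


distinct = [f"fp{i}" for i in range(100)]
stream = distinct + distinct  # every item replayed once
kept = bloom_exact_dedupe(stream, TinyBloom(8, 2), max_recent=1000)
print(len(kept))
```

Even with a saturated 8-bit filter, no distinct item is lost in this sketch, because a Bloom hit alone never causes a drop.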

Main Experiment Commands

Run the default configuration. This command runs exact_map deduplication with the fixed packet_counts fingerprint:

python3 -m src.runner

Default settings:

fingerprint mode: packet_counts
dedupe mode: exact_map
batch size: 5000
ExactMap max recent state: 50000
Bloom bits: 50000000
Bloom hashes: 4
duplicate rate: 0
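The batch size setting determines how many events are handed to the deduplicator at a time. The sketch below shows one way micro-batching and the throughput figures could be computed; iter_microbatches and run_batches are hypothetical helpers, not the functions in src/microbatch.py or src/metrics.py.

```python
import time
from itertools import islice


def iter_microbatches(events, batch_size=5000):
    """Yield consecutive lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_batches(events, process, batch_size=5000):
    """Apply process to each batch and aggregate row counts and timings."""
    total_in = total_out = 0
    batch_times = []
    for batch in iter_microbatches(events, batch_size):
        start = time.perf_counter()
        output = process(batch)
        batch_times.append(time.perf_counter() - start)
        total_in += len(batch)
        total_out += len(output)
    elapsed = sum(batch_times)
    return {
        "input_rows": total_in,
        "output_rows": total_out,
        "dropped_rows": total_in - total_out,
        "batches": len(batch_times),
        "avg_batch_seconds": elapsed / len(batch_times) if batch_times else 0.0,
        "rows_per_second": total_in / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in "deduplicator" that keeps even numbers only.
summary = run_batches(range(12001), lambda b: [e for e in b if e % 2 == 0])
print(summary["batches"], summary["input_rows"], summary["dropped_rows"])
```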

Compare all deduplication modes with the fixed packet_counts fingerprint:

python3 -m src.runner --dedupe-mode all

Run one dedupe mode explicitly:

python3 -m src.runner --dedupe-mode no_dedupe
python3 -m src.runner --dedupe-mode exact_map
python3 -m src.runner --dedupe-mode bloom
python3 -m src.runner --dedupe-mode bloom_exact

Run Bloom with explicit parameters:

python3 -m src.runner --dedupe-mode bloom --bloom-bits 50000000 --bloom-hashes 4

Run Bloom+ExactMap with explicit parameters:

python3 -m src.runner --dedupe-mode bloom_exact --bloom-bits 50000000 --bloom-hashes 4

Run the fingerprint sensitivity sweep only if needed:

python3 -m src.runner --fingerprint-mode all

Controlled Duplicate Injection

The main runner can inject controlled exact replay duplicates before fingerprinting and deduplication.

Supported duplicate rates:

0, 5, 10, 20 percent

Example comparison under 10 percent exact replay pressure:

python3 -m src.runner --dedupe-mode all --duplicate-rate 10

Run the full duplicate-rate sweep:

for rate in 0 5 10 20; do
  python3 -m src.runner --dedupe-mode all --duplicate-rate "$rate"
done

Injection details:

exact replay only
deterministic by default with --duplicate-seed 42
duplicate rows are inserted immediately after sampled source rows
injection happens after loading/cleaning and before fingerprinting
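The injection details above can be approximated with the sketch below. It assumes the duplicate rate is a percentage of the original row count and that source rows are sampled without replacement; src/duplicate_injector.py may define the rate differently.

```python
import random


def inject_exact_replays(rows, duplicate_rate_percent, seed=42):
    """Insert an exact copy right after each randomly sampled source row."""
    if not 0 <= duplicate_rate_percent <= 100:
        raise ValueError("duplicate_rate_percent must be between 0 and 100")
    rng = random.Random(seed)  # fixed seed keeps the injection deterministic
    n_duplicates = round(len(rows) * duplicate_rate_percent / 100)
    chosen = set(rng.sample(range(len(rows)), n_duplicates))
    output = []
    for index, row in enumerate(rows):
        output.append(row)
        if index in chosen:
            output.append(dict(row))  # exact replay, adjacent to its source
    return output


rows = [{"id": i} for i in range(100)]
injected = inject_exact_replays(rows, duplicate_rate_percent=10)
print(len(rows), "->", len(injected))
```

Running the function twice with the same seed yields the same output, which is what makes comparisons across dedupe modes repeatable.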

Isolated Bloom Fairness Experiment

The side experiment in src/experiments/bloom_fairness.py exists only for Bloom parameter fairness checks. It does not change the main runner or main experiment defaults.

Run the full default Bloom fairness sweep:

python3 -m src.experiments.bloom_fairness

Default sweep:

bloom_bits:   10000000, 25000000, 50000000, 100000000
bloom_hashes: 2, 3, 4, 5
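To get a feel for this grid, the sketch below computes the memory footprint and the standard approximate false-positive rate, (1 - e^(-kn/m))^k, for each combination. The item count n is an assumption: it uses the approximate cleaned row count (2.83 million) as an upper bound on distinct fingerprints, so the printed rates are pessimistic estimates, not measured results.

```python
import math


def bloom_memory_bytes(num_bits: int) -> int:
    """Bytes needed to store num_bits bits."""
    return (num_bits + 7) // 8


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation of the Bloom false-positive probability."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


ASSUMED_ITEMS = 2_830_000  # assumption: upper bound on distinct fingerprints

for bits in (10_000_000, 25_000_000, 50_000_000, 100_000_000):
    for hashes in (2, 3, 4, 5):
        rate = bloom_false_positive_rate(bits, hashes, ASSUMED_ITEMS)
        print(f"bits={bits:>11,} hashes={hashes} "
              f"memory={bloom_memory_bytes(bits):>10,} B  est_fpr={rate:.5f}")
```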

Run with controlled duplicate pressure:

python3 -m src.experiments.bloom_fairness --duplicate-rate 10

For faster sanity checks, cap the number of cleaned rows before duplicate injection:

python3 -m src.experiments.bloom_fairness \
  --max-rows 50000 \
  --duplicate-rate 10 \
  --bloom-bits 10000000 25000000 50000000 \
  --bloom-hashes 2 3 4

The side experiment always uses the fixed packet_counts fingerprint and the same saved RF detector as the main runner.

Expected Outputs

The scripts print progress and summary tables to stdout.

src.train_rf prints:

total loaded rows
class distribution
train/test row counts
RF precision, recall, and f1
classification report
saved artifact paths

src.runner prints:

loaded files and cleaned row counts
merged row count and class distribution
duplicate injection settings
fingerprint mode
dedupe mode
total input/output/dropped rows
throughput and average batch time
final dedupe state size
dropped benign and attack rows
RF precision, recall, and f1
comparison table when multiple modes are run
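For readers who want to check the reported numbers by hand, the sketch below shows how precision, recall, f1, and the benign/attack drop counts follow from labels and predictions. The function names are illustrative and the attack class (1) is treated as positive; the repository may compute these with scikit-learn instead.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 with the attack class (1) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def count_drops(labels, kept_mask):
    """Count dropped rows per class; kept_mask[i] is False for dropped rows."""
    dropped = [y for y, kept in zip(labels, kept_mask) if not kept]
    return {"dropped_benign": dropped.count(0), "dropped_attack": dropped.count(1)}


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
drops = count_drops(y_true, [True, False, True, False, True, True])
print(metrics, drops)
```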

src.experiments.bloom_fairness prints:

loaded row counts
optional row limit information
duplicate injection settings
fixed fingerprint mode
ExactMap reference results
Bloom and Bloom+ExactMap sweep results
Bloom memory estimate in bytes

Current Implementation Notes

ExactMap is the default deduplication method in the main experiment.
Bloom and Bloom+ExactMap are included as comparative deduplication methods.
The isolated Bloom fairness experiment is provided for parameter sensitivity checks and does not change the main runner defaults.
Controlled duplicate injection is included for exact replay-pressure evaluation.
Observations from this repository are implementation-specific and should be reported with the dataset split, fingerprint mode, duplicate rate, Bloom parameters, and ExactMap state size.

Troubleshooting

ModuleNotFoundError for packages such as pandas, sklearn, or joblib

Activate the virtual environment and install dependencies (on Windows PowerShell, use .\.venv\Scripts\Activate.ps1 and python instead of python3):

source .venv/bin/activate
python3 -m pip install -r requirements.txt

ModuleNotFoundError: No module named 'src'

Run scripts as modules from the repository root:

python3 -m src.runner
python3 -m src.train_rf

Do not run python3 src/runner.py directly, and do not run the module commands from another directory; either can cause this import error.

ValueError: No CSV files found in: data

Create data/ in the repository root and place the CIC-IDS2017 CSV files there. The runner expects data/*.csv.

Missing RF artifact files

If artifacts/rf_model.joblib or artifacts/rf_features.joblib is missing, run:

python3 -m src.train_rf

Encoding warning for the WebAttacks CSV

The loader first tries UTF-8 and then falls back to latin1 when needed. Seeing a warning like this is expected for the WebAttacks file:

[warn] utf-8 failed for data/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv, retrying with latin1
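The fallback behaviour can be reproduced with the standard library alone. The sketch below is an assumption about the general pattern, not the loader's actual code (which may well use pandas); read_csv_rows is a hypothetical helper.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Read CSV rows as dicts, trying UTF-8 first and then latin1."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle)), "utf-8"
    except UnicodeDecodeError:
        print(f"[warn] utf-8 failed for {path}, retrying with latin1")
        with open(path, newline="", encoding="latin1") as handle:
            return list(csv.DictReader(handle)), "latin1"


# Build a small file containing a byte (0x96) that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"Label\nWeb Attack \x96 XSS\n")
    tmp_path = tmp.name

rows, used_encoding = read_csv_rows(tmp_path)
os.remove(tmp_path)
print(used_encoding, len(rows))
```

Because latin1 maps every byte to a character, the second attempt cannot fail on decoding, which is why it works as a last-resort fallback.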

Long runtime

The full CIC-IDS2017 dataset has approximately 2.83 million cleaned rows. Bloom fairness sweeps can be expensive because each parameter combination reruns the dedupe and RF path. Use --max-rows in the side experiment for quick sanity checks.

Reproducibility Checklist

Before submitting or reviewing results:

Confirm Python version: python3 --version
Create and activate .venv
Install requirements.txt
Place the eight CIC-IDS2017 CSV files in data/
Train RF artifacts if they are not already present: python3 -m src.train_rf
Keep packet_counts as the fixed fingerprint for dedupe comparisons
Record dedupe mode, duplicate rate, Bloom bits, Bloom hashes, and ExactMap max_recent
Run commands from the repository root
Save stdout tables for reported experimental results
