Generative Network Traffic Augmentation for Federated Learning
Intrusion Detection System on CIC-IDS 2017 | 8 federated nodes | Grid'5000 ecotype Nantes
GeNTAC-FL is a federated learning framework for network intrusion detection that addresses the non-IID data problem across 8 heterogeneous nodes. Each node holds a different day of the CIC-IDS 2017 dataset, creating extreme class imbalance: some nodes hold over 100,000 attack samples, while others hold fewer than 40 (Node 4 has only 36 Infiltration flows).
The core contribution is a chain of four components (G1–G4), each proven necessary by ablation:
- G1 — Conditional GAN (cGAN): a centralised generator on the FL server produces class-targeted synthetic attack flows for minority nodes each round, using a stratified seed pool built from real attack samples collected across all nodes
- G2 — Domain Clamping: 4-rule corrected clamping on binary protocol fields (PSH Flag, FIN Flag, URG Flag, Destination Port) prevents the GAN from generating physically impossible values in scaled space
- G3 — Hard Negative Mining (HNM): synthetic samples that the current global model already classifies correctly are discarded each round, forcing the GAN to focus exclusively on decision boundary regions. HNM rate reaches 97–99% by round 50
- G4 — Group-Aware Aggregation: two-stage aggregation separates rich nodes (N0, N2, N6, N7) from minority nodes (N1, N3, N4, N5), merging the two group aggregates with a tunable alpha=0.55 to prevent minority knowledge from being diluted by sample-count weighting
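The G3 filter above can be sketched as a per-round rejection step (a minimal sketch; the function name and the exact interface to the global model are assumptions, only the keep-misclassified rule comes from the description):

```python
import torch

def hard_negative_filter(model, synth_x, synth_y):
    """Discard synthetic samples the current global model already classifies
    correctly, keeping only 'hard' ones near the decision boundary (G3)."""
    model.eval()
    with torch.no_grad():
        preds = model(synth_x).argmax(dim=1)
    hard = preds != synth_y            # misclassified -> kept for training
    return synth_x[hard], synth_y[hard]
```

As the global model improves, the mask rejects more of each synthetic batch, which is why the reported HNM rate climbs to 97–99% by round 50.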
On top of the FL loop, Adaptive Threshold Calibration (ATC) runs on the server every 5 rounds. It computes the optimal decision threshold per minority node using real attack seeds from the seed cache and pure benign samples from Node 0, then broadcasts the calibrated thresholds back to clients for the next round. A minimum threshold of t=0.05 prevents total collapse for nodes with no clean decision boundary.
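The per-node search inside ATC can be sketched as a grid search over candidate thresholds, scored on the calibration set of real attack seeds plus Node 0 benign samples (a sketch; the grid and scoring metric are assumptions, only the t_min=0.05 floor comes from the text):

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(attack_scores, labels, t_min=0.05):
    """Grid-search the decision threshold maximising F1 on a calibration set
    (real attack seeds from the seed cache + Node 0 benign samples)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.01, 0.99, 99):
        f1 = f1_score(labels, (attack_scores >= t).astype(int), zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return max(best_t, t_min)   # floor prevents total collapse
```

For a node with no clean decision boundary the raw optimum can sit arbitrarily close to 0; the `max(..., t_min)` clamp is what keeps the broadcast threshold usable.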
| Exp | Setup | Global Recall | Global F1 |
|---|---|---|---|
| exp1_improved | FedAvg 4 nodes (upper bound) | 0.719 | 0.693 |
| exp2_fedprox | FedProx 8 nodes baseline | 0.549 | 0.547 |
| exp3_improved | FedAvg + unconditional GAN | 0.480 | 0.505 |
| exp4_hnm_fixed | FedProx + GAN + HNM + Focal Loss | 0.645 | 0.578 |
| exp5_fedprox_smote | FedProx + SMOTE + HNM | 0.675 | 0.582 |
| exp6b_cgan_recon01 | FedProx + cGAN (recon=0.1) + HNM | 0.655 | 0.574 |
| exp7_cgan_clamp | + Feature Clamping (14 rules) | 0.679 | 0.586 |
| exp8_group_rebal | + Group-Aware Aggregation (alpha=0.55) | 0.730 | 0.594 |
| exp9_corrected_clamp | + Corrected 4-rule Clamping | 0.750 | 0.598 |
| exp10_atc | + Adaptive Threshold Calibration (original) | 0.795 | 0.549 |
| exp10b_atc_fixed | + Node 0 benign anchor + t_min=0.05 | 0.763 | 0.562 |
| Node | CSV File | Attack Family | Role | Samples | Attack samples |
|---|---|---|---|---|---|
| N0 | Monday | BENIGN only | Anchor | 529,443 | 0 |
| N1 | Tuesday | FTP-Patator | cGAN minority | 445,617 | 13,832 |
| N2 | Wednesday | DoS / DDoS | Rich | 691,384 | 251,723 |
| N3 | Thursday AM | Web Attack | cGAN minority | 170,220 | 2,180 |
| N4 | Thursday PM | Infiltration | cGAN minority | 288,391 | 36 |
| N5 | Friday AM | Botnet | cGAN minority | 190,902 | 1,956 |
| N6 | Friday PM | PortScan | Seed | 286,060 | 158,804 |
| N7 | Friday PM | DDoS | Rich | 225,709 | 128,025 |
```
gentac-fl/
├── fl/
│   ├── server.py                        # Base server (exp1-2)
│   ├── client.py                        # Base client
│   ├── cgan_augment.py                  # CentralCGANAugmenter — cGAN inference on server
│   ├── gan_augment.py                   # Unconditional GAN augmenter (exp3-4)
│   └── smote_augment.py                 # SMOTE augmenter (exp5)
├── models/
│   └── traffic_ids.py                   # TrafficIDS MLP classifier architecture
├── models_cgan/                         # cGAN weights — not tracked (too large)
│   ├── cadvgan_generator_recon01.pth    # alpha_recon=0.1, ESR 56.90%
│   └── cgan_meta.joblib                 # label_to_idx, input_dim, num_classes
├── preprocessing/
│   └── preprocess.py                    # clean_dataframe, apply_preprocessing_artifacts
├── artifacts/                           # Fitted preprocessing artifacts
│   ├── scaler.joblib
│   ├── var_selector.joblib
│   ├── final_features.joblib
│   ├── cols_to_drop.joblib
│   └── constant_cols.joblib
├── notebooks/
│   ├── exp1_improved.ipynb              # FedAvg 4 nodes baseline
│   ├── exp2_improved.ipynb              # FedProx 8 nodes
│   ├── exp3_improved.ipynb              # Unconditional GAN
│   ├── exp4_fedprox_gan.ipynb           # FedProx + GAN + HNM + Focal
│   ├── exp5_fedprox_smote.ipynb         # SMOTE comparison
│   ├── exp6_cgan_fedprox.ipynb          # cGAN introduction
│   ├── exp7_cgan_clamp.ipynb            # Feature clamping
│   ├── exp8_group_rebal.ipynb           # Group-Aware Aggregation
│   ├── exp9_gentac_fl.ipynb             # Corrected 4-rule clamping
│   ├── exp10_atc.ipynb                  # Adaptive Threshold Calibration
│   └── ltc_threshold_calibration.ipynb  # LTC analysis + stress test
├── results/
│   ├── all_results.json                 # Aggregated metrics across all experiments
│   ├── gentac_fl_exp10_comparison.png
│   ├── exp*/
│   │   ├── fl_results.json              # Per-round metrics (accuracy, recall, F1)
│   │   └── threshold_log.json           # ATC threshold evolution (exp10/10b only)
│   └── ltc_calibration/
│       ├── global_model_exp9_r50.pth
│       ├── global_model_exp10b_r50.pth
│       ├── exp10_threshold_log.json
│       ├── ltc_results.json
│       └── stress_test_exp10b.json
└── README.md
```
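`clean_dataframe` in `preprocessing/preprocess.py` plausibly handles the infinite values that CIC-IDS 2017 rate features (Flow Bytes/s, Flow Packets/s) are known to contain (a sketch under that assumption; the real function may do more):

```python
import numpy as np
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Replace inf values (division-by-zero artefacts in CIC-IDS 2017 rate
    columns) with NaN, then drop rows that still contain NaN."""
    df = df.replace([np.inf, -np.inf], np.nan)
    return df.dropna()
```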
- Cluster: Grid'5000 ecotype, Nantes site
- Reservation: 9 nodes (1 FL server + 8 FL clients), walltime 2h
- FL framework: Flower (flwr==1.7.0)
- Strategy: FedProx with proximal_mu=0.01
- Rounds: 50 per experiment
- Local epochs: 3 per round
- Optimizer: Adam, weight_decay=1e-4
- Learning rate: 1e-3, halved every 10 rounds
- Loss: Focal Loss (gamma=2.0) with class_weight=10.0
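A minimal sketch of that loss, assuming class_weight=10.0 multiplies the attack-class term (the exact weighting scheme is an assumption):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, attack_weight=10.0):
    """Focal loss with an extra weight on the attack class (label 1)."""
    log_p = F.log_softmax(logits, dim=1)
    logpt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = logpt.exp()
    w = torch.ones_like(pt)
    w[targets == 1] = attack_weight                           # up-weight attacks
    return (-w * (1.0 - pt) ** gamma * logpt).mean()
```

The `(1 - pt) ** gamma` factor down-weights easy examples, so the gradient budget concentrates on the rare, hard attack samples the HNM loop also targets.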
The classifier in `models/traffic_ids.py`:

```python
nn.Sequential(
    nn.Linear(43, 256),   # input: 43 features after preprocessing
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),     # binary: benign vs attack
)
```

Dependencies:

```
flwr==1.7.0
torch==2.3.0
scikit-learn==1.6.1
pandas
joblib
numpy
matplotlib
enoslib
python-grid5000
```
Non-IID aggregation paradox: local models achieve recall >0.93 on their own attack family after 50 rounds of cGAN augmentation. The global model recall is 0.763 — the gap is the aggregation cost. This is the central problem GeNTAC-FL is designed to address.
G4 Group-Aware Aggregation is the single largest contributor: +5pp global recall over Exp 7 (0.679 to 0.730) from giving minority nodes equal group-level weight regardless of sample count. Removing any one of G1–G4 causes a measurable regression.
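In sketch form (assuming plain FedAvg within each group and that alpha weights the rich-group aggregate; the node-to-group assignment comes from the G4 description, the rest is illustrative):

```python
import numpy as np

RICH = {"N0", "N2", "N6", "N7"}        # sample-rich nodes
MINORITY = {"N1", "N3", "N4", "N5"}    # cGAN minority nodes

def fedavg(updates):
    """Sample-count-weighted average of flattened client weight vectors."""
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

def group_aware_aggregate(client_updates, alpha=0.55):
    """Two-stage aggregation: FedAvg within each group, then a fixed
    alpha-blend so minority knowledge is not diluted by sample counts."""
    rich = [(w, n) for node, w, n in client_updates if node in RICH]
    minor = [(w, n) for node, w, n in client_updates if node in MINORITY]
    return alpha * fedavg(rich) + (1 - alpha) * fedavg(minor)
```

Under plain sample-count weighting the four minority nodes would contribute well under 10% of the aggregate; the fixed blend guarantees them 1 - alpha = 45% regardless of how few samples they hold.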
Clamping correction: diagnostic analysis of 50k samples per attack family confirmed that negative values in continuous rate features (Flow Bytes/s, Bwd Packets/s) are valid StandardScaler artefacts, not GAN failures. The original 14-rule clamp was reduced to 4 rules covering binary protocol fields only.
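Each of the remaining 4 rules can be sketched as snapping one binary field to its nearest valid value in StandardScaler space (a sketch; the raw levels and per-column handling are assumptions):

```python
import numpy as np

def clamp_binary_scaled(batch, col, mean, std, raw_levels=(0.0, 1.0)):
    """Snap generated values in one binary column to the nearest valid
    StandardScaler-transformed level, in place."""
    levels = (np.asarray(raw_levels) - mean) / std   # valid values after scaling
    vals = batch[:, col]
    nearest = np.abs(vals[:, None] - levels[None, :]).argmin(axis=1)
    batch[:, col] = levels[nearest]
    return batch
```

Continuous rate columns are deliberately left alone: in scaled space their negative values are legitimate, so clamping them (as the original 14-rule version did) destroys valid synthetic flows.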
N4 Infiltration floor: 36 real training samples (29 seeds after stratified split) is insufficient for GAN augmentation to overcome aggregation dilution. This is a data quantity limit, not an architectural failure.
ATC stress test: thresholds computed at round 50 were validated on completely held-out data never seen during training or calibration, confirming portability.
Two operating modes:
- Uncalibrated (t=0.5): maximum specificity, near-zero false positives, suitable for forensic/threat-hunting contexts
- Calibrated (ATC thresholds): balanced precision-recall per node, suitable for operational deployment where alert fatigue is a concern
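At inference time the two modes differ only in where the threshold comes from (a sketch; names and the lookup structure are illustrative):

```python
import numpy as np

def classify(attack_probs, node=None, atc_thresholds=None, t_min=0.05):
    """Uncalibrated mode uses t=0.5; calibrated mode looks up the node's
    ATC threshold, floored at t_min."""
    if atc_thresholds is None:
        t = 0.5                                          # forensic / threat-hunting
    else:
        t = max(atc_thresholds.get(node, 0.5), t_min)    # operational deployment
    return (attack_probs >= t).astype(int)
```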
If you use this code or results, please cite:
```bibtex
@mastersthesis{messiou2026gentacfl,
  title  = {GeNTAC-FL: Generative Network Traffic Augmentation for Federated Learning IDS},
  author = {Messiou, M.},
  school = {},
  year   = {2026}
}
```