Agentic-NIC-Dataplane-Lab is a Linux-first reference repo for designing, building, and benchmarking a split NIC dataplane for agentic AI systems, with an explicit path toward bounded autonomous NIC behavior.
The repo is opinionated about one thing: agentic AI is not only a model problem. It is a queueing, copy-avoidance, packet-steering, east-west transport, and local-control problem. The goal here is to make those tradeoffs concrete with architecture notes, Linux tuning guidance, compatibility tables, starter code, and a stronger systems model for how an autonomous NIC dataplane could operate safely.
The core thesis of this repo is that agentic inference is not shaped like traditional batched model serving.
Traditional inference is usually:
- large batch
- GPU-bound
- throughput-oriented
- dominated by matrix math and model execution efficiency
Agentic inference is usually:
- many tiny RPCs
- metadata fetches
- retrieval fan-out
- scheduler chatter
- checkpoint or state coordination
- latency amplification across multi-step plans
That difference matters because the bottleneck moves. In a classic throughput-oriented serving path, the GPU often dominates. In an agentic path, the orchestration fabric can dominate instead:
- socket and packet overhead
- queue contention
- kernel/userspace copy cost
- scheduler wakeups
- cross-service retries
- east-west networking
This repo is compelling only if that thesis holds up under measurement: at scale, networking overhead can become a first-order limiter for agent orchestration even when the model itself is fast enough.
The repo now includes a small v0.2 local workflow so a new reader can compile the lab, run one real localhost baseline, run one honest AF_XDP mock path, and generate charts from the resulting JSON.
```
make all
make demo
python3 tools/plot_baseline.py results/latest.json
```

What this demo is and is not:
- **Path A kernel UDP** is a real localhost kernel networking measurement using a simple echo server and client.
- **Path B AF_XDP mock** is a workflow-validation path that simulates a starter `AF_XDP` processing loop shape and output format (the real loop shape is sketched below).
- The local demo exists to validate build, run, JSON, and plotting workflow without special hardware.
- Real `AF_XDP` still requires supported NIC, driver, queue, and privilege setup. The mock path does not claim real zero-copy dataplane performance.
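For readers new to `AF_XDP`, the receive-loop shape the mock imitates looks roughly like the sketch below. It assumes a UMEM and `xsk_socket` already created with the libxdp/libbpf `xsk` helpers; `handle_frame` is a placeholder, and this is not the repo's `src/af_xdp` code.

```c
/* Rough shape of a real AF_XDP receive iteration (illustrative, not repo code).
 * Assumes an existing UMEM and xsk_socket created via <xdp/xsk.h>
 * (or <bpf/xsk.h> on older libbpf). */
#include <stdint.h>
#include <xdp/xsk.h>

static void rx_loop_once(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill,
                         void *umem_area)
{
    uint32_t idx_rx = 0, idx_fill = 0;

    /* 1. See how many descriptors the kernel has placed on the RX ring. */
    unsigned int rcvd = xsk_ring_cons__peek(rx, 64, &idx_rx);
    if (!rcvd)
        return;

    /* 2. Reserve the same number of fill-ring slots so the NIC always
     *    has buffers to DMA into. */
    while (xsk_ring_prod__reserve(fill, rcvd, &idx_fill) != rcvd)
        ;   /* real code would back off or kick the kernel instead of spinning */

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
        void *frame = xsk_umem__get_data(umem_area, desc->addr);

        /* handle_frame(frame, desc->len);   application-specific work */
        (void)frame;

        /* 3. Recycle the frame address back to the fill ring. */
        *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
    }

    /* 4. Publish both ring updates. */
    xsk_ring_prod__submit(fill, rcvd);
    xsk_ring_cons__release(rx, rcvd);
}
```

The mock path imitates this loop shape and its output format without needing a capable NIC.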
| Path | Status | Requires root | Requires special NIC |
|---|---|---|---|
| Path A kernel UDP | runnable locally | no | no |
| Path B AF_XDP mock | runnable locally | no | no |
| Path B real AF_XDP | scaffold/in progress | yes | yes |
| Path C RDMA | scaffold/in progress | likely | yes |
This project explores a tri-path host networking model for agentic AI clusters:
- **Path A:** kernel TCP for the majority of agent RPC, improved with `busy_poll` (sketched below), queue affinity, and `io_uring` zero-copy receive where supported
- **Path B:** `XDP` + `AF_XDP` for selected hot queues that need lower packet overhead and stronger queue-to-core control
- **Path C:** `RDMA` for bulk east-west movement such as state sync, checkpoint transfer, shard-to-shard movement, and GPU-adjacent feeds
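As a concrete example of the Path A tuning, here is a minimal sketch of opting a single agent-RPC socket into busy polling. It assumes a kernel that exposes `SO_BUSY_POLL` (and optionally `SO_PREFER_BUSY_POLL`); the 50 microsecond budget is illustrative, not a tuned recommendation, and this is not code from `src/`.

```c
/* Minimal sketch: opt one agent-RPC socket into busy polling (Path A).
 * Assumes a kernel with SO_BUSY_POLL; the 50 us budget is illustrative. */
#include <stdio.h>
#include <sys/socket.h>

int enable_busy_poll(int fd)
{
    int busy_poll_usec = 50;   /* how long receive paths may spin before sleeping */

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usec, sizeof(busy_poll_usec)) < 0) {
        perror("setsockopt(SO_BUSY_POLL)");
        return -1;
    }

#ifdef SO_PREFER_BUSY_POLL
    /* On newer kernels, prefer busy polling over interrupt-driven NAPI
     * for this socket's queue. */
    int prefer = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                   &prefer, sizeof(prefer)) < 0)
        perror("setsockopt(SO_PREFER_BUSY_POLL)");
#endif
    return 0;
}
```

The tradeoff, explored in the benchmark questions below, is that spin time is CPU the host cannot spend elsewhere.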
It also starts to define a layered agentic NIC architecture:
- an **Intent Layer** where the host expresses goals rather than register-level tweaks
- a bounded **Agent Layer** that proposes local dataplane adjustments
- a deterministic **Guardian Layer** that enforces safety and connectivity invariants (sketched after this list)
- an **Audit Layer** with hardware-isolated reasoning logs for dataplane mutations
- a **Tenant Quota Model** so local optimization does not destroy fairness
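The Agent/Guardian split is easiest to see as code. Below is an illustrative sketch only; the types and names (`nic_action`, `guardian_check`) are hypothetical and not part of the repo. The Agent Layer proposes a bounded adjustment, and the Guardian Layer deterministically accepts or rejects it before it reaches hardware.

```c
/* Illustrative only: one way the Agent/Guardian split could look in C.
 * nic_action and guardian_check are hypothetical, not repo APIs. */
#include <stdbool.h>
#include <stdint.h>

/* A bounded adjustment the Agent Layer may propose. */
struct nic_action {
    uint16_t queue_id;       /* which hardware queue to touch */
    uint16_t target_core;    /* proposed IRQ/processing core */
    uint32_t rate_limit_pps; /* 0 = leave unchanged */
};

/* Deterministic Guardian Layer check: reject anything that could break
 * connectivity or starve another tenant, before it reaches hardware. */
static bool guardian_check(const struct nic_action *a,
                           uint16_t num_queues,
                           uint16_t num_cores,
                           uint32_t tenant_min_pps)
{
    if (a->queue_id >= num_queues || a->target_core >= num_cores)
        return false;                 /* out-of-range mutation */
    if (a->rate_limit_pps != 0 && a->rate_limit_pps < tenant_min_pps)
        return false;                 /* would violate a tenant quota floor */
    return true;                      /* safe to apply and log to the Audit Layer */
}
```

The point of the sketch is the ordering: nothing the Agent Layer proposes reaches the dataplane without passing a deterministic check whose verdict can be audited.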
The intended audience is:
- systems engineers building agent infrastructure
- Linux kernel and NIC performance engineers
- inference platform teams comparing sockets vs `AF_XDP` vs `RDMA`
- researchers who want a reproducible testbed instead of architecture slides
Topics this repo covers:

- agentic AI networking architecture
- Linux NIC queueing and IRQ affinity
- `AF_XDP`, UMEM, and queue ownership
- `io_uring` receive paths and zero-copy Rx
- `RDMA`, queue pairs, memory registration, and bulk transport
- Intel `ice`/`irdma` and Broadcom `bnxt_en`/`bnxt_re`
- benchmark design for agent-shaped workloads
- bounded autonomous NIC control
- deterministic guardrails and fail-safe policy
- hardware-isolated reasoning logs
- multi-tenant agent quotas and fairness
The repo now embeds the architecture diagram directly in the README so it renders on GitHub, and the source Mermaid file is still kept in ./diagrams/tri-path-agentic-dataplane.mmd.
```mermaid
flowchart TD
    U["Users / upstream agents<br/>RPCs, tool calls, streaming"] --> I["Ingress NIC queues<br/>Classifier: RSS / XDP / flow rules"]
    I --> A["Path A: kernel TCP<br/>io_uring ZC Rx · busy_poll"]
    I --> B["Path B: AF_XDP<br/>UMEM · zero-copy · per-core"]
    I --> C["Path C: RDMA<br/>RC QP · MR · bulk east-west"]
    A --> O["Orchestrators / tools<br/>retrieval, memory, metadata"]
    B --> G["Gateways / routers<br/>schedulers, token gateways"]
    C --> S["State sync / checkpoints<br/>GPU feed, vector index sync"]
```
Agentic AI is a coordination workload:
- many small RPCs
- retrieval and metadata fetches
- tool execution
- policy checks
- fan-out and fan-in
- streaming and retries
That shifts bottlenecks toward:
- CPU time spent in the networking and storage path
- queue placement and RSS policy
- copy overhead between kernel, userspace, and devices
- memory registration and pinned-page cost (see the sketch after this list)
- east-west service traffic
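To make the memory-registration point concrete, here is a minimal sketch of the `ibv_reg_mr` call whose page-pinning and translation-setup cost the list refers to. It assumes stock `libibverbs` and the first available RDMA device, with error handling abbreviated; it is illustrative only, not the repo's `src/rdma` code.

```c
/* Minimal sketch of RDMA memory registration: ibv_reg_mr pins the buffer's
 * pages and sets up NIC address translation. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) { fprintf(stderr, "verbs setup failed\n"); return 1; }

    size_t len = 1 << 20;                  /* 1 MiB buffer */
    void *buf = aligned_alloc(4096, len);

    /* The expensive part: pages are pinned and an lkey/rkey is produced. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

This setup cost is one reason Path C favors pre-registered, reused buffers for bulk movement rather than per-RPC transfers.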
Questions this lab aims to answer:

- At what packet size and concurrency does `AF_XDP` outperform kernel sockets for agent RPC traffic?
- Can queue affinity reduce `p99` scheduler jitter for agent orchestration paths?
- Does `busy_poll` help or hurt mixed inference workloads that combine model serving with retrieval chatter?
- When does userspace polling become CPU-inefficient relative to kernel sockets?
- Can bounded autonomous NIC scheduling reduce tail collapse without violating safety constraints?
This repo is intentionally in early-lab form:
- the build system is now present and compile-oriented
- the benchmark harness accepts real arguments and emits JSON metadata
- the code under `src/` is still starter code, not production dataplane code
- the docs are detailed enough to orient a contributor and define the next work
- the conceptual architecture now covers not just transport choice, but how local autonomy, safety, and auditability could fit into a SmartNIC-class system
- `./BUILDING.md`
- `./ROADMAP.md`
- `./docs/networking-for-agentic-ai-blog.md`
- `./docs/reference-architecture.md`
- `./docs/agentic-nic-architecture.md`
- `./docs/safety-and-guardrails.md`
- `./docs/reasoning-log-design.md`
- `./docs/audit-layer-threat-model.md`
- `./docs/multi-tenant-agent-quotas.md`
- `./docs/kernel-driver-tuning.md`
- `./docs/benchmark-plan.md`
- `./docs/compatibility-matrix.md`
- `./docs/perf-flamegraph-workflow.md`
- `./diagrams/tri-path-agentic-dataplane.mmd`
- `./src/io_uring/recv_zc.c`
- `./src/kernel_udp/udp_echo_server.c`
- `./src/kernel_udp/udp_client.c`
- `./src/af_xdp/main.c`
- `./src/af_xdp/xdp_pass.c`
- `./src/rdma/verbs_ping.c`
- `./scripts/run_local_baseline.sh`
- `./scripts/benchmark-matrix.sh`
- `./tools/af_xdp_load.sh`
- `./tools/bpftrace/guardian_preemption.bt`
- `./tools/bpftrace/guardian_tail_latency_guard.bt`
- `./tools/perf_capture.sh`
- `./tools/plot_baseline.py`
- `./tools/plot_e810_baseline.py`
- `./results/e810-baseline-2026-05-08.json`
- `./results/sample-local-baseline.json`
If you want one practical Linux-first starting point:
Intel E810/E830:

- `ice` for Ethernet, queueing, and `AF_XDP`
- `irdma` for RDMA
If you are standardized on Broadcom:
- `bnxt_en` for Ethernet
- `bnxt_re` for RoCE
The repo does not assume one vendor forever. It is structured so contributors can compare stacks, kernels, firmware bundles, and feature maturity.
The repo now has a root ./Makefile and build notes in ./BUILDING.md.
Common commands:
```
make all
make kernel_udp
make af_xdp
make io_uring
make rdma
make xdp_prog
make demo
```

`make all` now builds the runnable local `kernel_udp` demo path, builds the `AF_XDP` starter in real or mock-only mode depending on available headers and libraries, and skips optional `io_uring`, RDMA, or BPF object targets gracefully when the local environment does not provide those dependencies.
The benchmark harness is at ./scripts/benchmark-matrix.sh. It now accepts path, workload, and interface arguments and emits a JSON result envelope with host, kernel, NIC counter, and softirq metadata.
Example:
```
./scripts/benchmark-matrix.sh --path tcp --workload a --iface eth0 --out results/tcp-a.json
```

It is still intentionally conservative: if the required generator tool is missing, it fails loudly instead of pretending a benchmark ran.
The repo also now includes an illustrative baseline artifact at ./results/e810-baseline-2026-05-08.json plus a plotting helper at ./tools/plot_e810_baseline.py so the benchmark story is grounded in a reusable result format.
For the quick local workflow, use:
```
./scripts/run_local_baseline.sh
python3 tools/plot_baseline.py results/latest.json
```

That local path:
- runs a real localhost UDP echo benchmark for **Path A**
- runs an `AF_XDP` mock/scaffold benchmark for **Path B**
- writes a combined JSON file at `./results/local-baseline-YYYYMMDD-HHMMSS.json`
- refreshes `./results/latest.json`
- generates `./diagrams/local-baseline-throughput.png`
- generates `./diagrams/local-baseline-latency.png` when latency fields are present
A checked-in reference artifact is available at ./results/sample-local-baseline.json.
The next release blockers are now called out more explicitly in the docs:
- prove Path B is not worse than Path A for sub-`512 B` RPCs
- show guardian intervention does not violate tail-latency SLOs
- define exactly who can read the reasoning logs and under what trust model
The repo now includes a small profiling helper at ./tools/perf_capture.sh plus a workflow note in ./docs/perf-flamegraph-workflow.md.
Example:
```
sudo ./tools/perf_capture.sh --output-dir perf/local-demo -- ./build/udp_client --host 127.0.0.1 --port 9000 --packet-size 256 --count 2000
```

This is intentionally lightweight:
- it captures `perf record` data for a target command
- emits `perf report` text
- emits `perf script` output
- optionally emits folded stacks and a flamegraph SVG when Brendan Gregg's `FlameGraph` scripts are available locally
Even a simple softirq vs userspace polling profile is valuable here, because it turns the repo from architecture opinion into an instrumentable lab.
- Expand the `AF_XDP` sample into a real UMEM-backed receive loop with fill/completion management.
- Add an `XDP` loader and socket redirection plumbing around `xdp_pass.c`.
- Extend the `io_uring` sample from setup-only into a complete `RECV_ZC` notification flow.
- Flesh out the RDMA path with full `RESET -> INIT -> RTR -> RTS` state transitions and peer exchange helpers (see the sketch after this list).
- Add workload generators and result aggregation to the benchmark harness.
- Formalize the `Intent -> Agent -> Guardian -> Dataplane -> Audit` model into a sharper design and possible patent memo.
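For the RDMA item above, a minimal sketch of the `RESET -> INIT -> RTR -> RTS` walk using stock `libibverbs` follows. It assumes an RC QP, IB-style LID addressing, and out-of-band exchange of the peer's QPN, LID, and PSN; MTU, timeout, and retry values are illustrative rather than tuned, and RoCE targets such as `irdma` or `bnxt_re` would set `ah_attr.is_global = 1` and fill a GID instead of a LID.

```c
/* Minimal sketch of the RC QP state walk with stock libibverbs (illustrative). */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int qp_to_rts(struct ibv_qp *qp, uint32_t remote_qpn,
                     uint16_t remote_lid, uint32_t remote_psn,
                     uint32_t local_psn, uint8_t port)
{
    struct ibv_qp_attr attr;

    /* RESET -> INIT: bind the QP to a port and grant remote access. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR: point the QP at the peer so it can receive. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;   /* IB-style addressing */
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                 IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS: enable sending with retry and timeout policy. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = local_psn;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                                    IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                                    IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}
```

The peer-exchange helpers named in the same list item would supply the remote values this sketch assumes.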