Add reproducible API benchmark suite #540

Open
dicnunz wants to merge 1 commit into SecureBananaLabs:main from dicnunz:codex/securebanana-api-benchmarks-30

dicnunz commented May 21, 2026

/claim #30

Summary

  • Adds a benchmark manifest covering every mounted /api/* route plus /health (21 routes total).
  • Adds a local-first benchmark runner that can also target the URL set in BENCHMARK_TARGET_URL. It captures p50/p95/p99 latency, p99 TTFB, sustained/peak RPS, status distribution, error rate, and bytes received.
  • Adds route-coverage verification, reviewable thresholds, .env.benchmark.example, CI smoke gating, generated JSON/Markdown results, and a short demo artifact at demos/api-benchmark-demo.mp4.
  • Fixes the API package test script to point Node at test files instead of the test directory.
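For reviewers unfamiliar with the latency figures listed above, the following is a minimal sketch of how p50/p95/p99 can be derived from raw samples using the nearest-rank method. It is an illustration only: the function names (`percentile`, `summarize`) and the sample values are hypothetical, and the runner in this PR may use a different percentile definition.

```javascript
// Nearest-rank percentile over latency samples (milliseconds).
// Hypothetical helper for illustration; not taken from the PR's runner.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('percentile requires at least one sample');
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarize one route's samples into the three reported percentiles.
function summarize(samples) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}

// Illustrative samples only; these are not measurements from this PR.
const latenciesMs = [12, 15, 11, 30, 14, 13, 18, 16, 17, 19];
const summary = summarize(latenciesMs);
console.log(summary);
```

With only a handful of samples per route, p95 and p99 both collapse to the slowest request, which is worth keeping in mind when reading tail latencies from small runs.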

Validation

  • npm run benchmark:coverage -> covers 21 routes.
  • npm test -> 1 passing API test.
  • npm run benchmark:smoke -> 42 requests, 0 errors, max p99 latency 22.07 ms, max p99 TTFB 21.31 ms.
  • npm run benchmark -> 126 requests, 0 errors, max p99 latency 31.96 ms, max p99 TTFB 31.35 ms.
  • git diff --check -> clean.
  • ffprobe demos/api-benchmark-demo.mp4 -> 1280x720, 9s, 270 frames.
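As a rough sketch of how a CI smoke gate could evaluate numbers like those above, the snippet below compares a run summary against threshold limits and sets a failing exit code on any violation. The threshold values, field names, and `checkThresholds` function are assumptions made for illustration; the actual limits and result schema are defined by the files in this PR. Only the smoke-run figures (42 requests, 0 errors, 22.07 ms, 21.31 ms) come from the validation list.

```javascript
// Hypothetical limits for illustration; the PR keeps its real values
// in its own reviewable thresholds file.
const thresholds = {
  maxP99LatencyMs: 250,
  maxP99TtfbMs: 200,
  maxErrorRate: 0,
};

// Returns a list of human-readable violations (empty when the run passes).
function checkThresholds(result, limits) {
  const failures = [];
  // Treat a run with no requests as fully failed rather than dividing by zero.
  const errorRate = result.requests === 0 ? 1 : result.errors / result.requests;

  if (result.maxP99LatencyMs > limits.maxP99LatencyMs) {
    failures.push(`p99 latency ${result.maxP99LatencyMs} ms exceeds ${limits.maxP99LatencyMs} ms`);
  }
  if (result.maxP99TtfbMs > limits.maxP99TtfbMs) {
    failures.push(`p99 TTFB ${result.maxP99TtfbMs} ms exceeds ${limits.maxP99TtfbMs} ms`);
  }
  if (errorRate > limits.maxErrorRate) {
    failures.push(`error rate ${errorRate} exceeds ${limits.maxErrorRate}`);
  }
  return failures;
}

// Smoke-run figures as reported in the validation list above.
const smoke = { requests: 42, errors: 0, maxP99LatencyMs: 22.07, maxP99TtfbMs: 21.31 };
const failures = checkThresholds(smoke, thresholds);

if (failures.length > 0) {
  console.error(failures.join('\n'));
  process.exitCode = 1; // a non-zero exit code fails the CI step
} else {
  console.log('benchmark thresholds passed');
}
```

Returning a list of violations instead of throwing on the first one lets a single CI run report every breached limit at once.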

Benchmark Environment

Hardware

  • CPU model & core count: Apple M3, 8 cores
  • RAM (total & available during benchmark): 16 GB total; available memory during the run is recorded in the generated JSON results
  • Storage type (SSD / NVMe / HDD): Apple internal SSD / APFS
  • Network interface (Ethernet / WiFi / loopback): loopback (127.0.0.1) for local benchmark target
  • Machine type (local workstation / cloud VM / CI runner — include instance type if cloud): local workstation
  • OS & version: macOS Darwin 25.3.0 arm64

Runtime

  • Node.js version (or relevant runtime): v25.9.0
  • Any resource limits applied (Docker memory cap, cgroup limits, etc.): none intentionally applied
  • Other significant processes running during benchmark (yes / no — if yes, describe): yes, normal local desktop and development tooling background processes

If submitted by or with an AI agent

  • Agent or tool name (e.g. Claude Code, Devin, Copilot Workspace, AutoGPT): OpenAI Codex desktop app
  • Underlying model and version (e.g. claude-sonnet-4-5, gpt-4o — if known): GPT-5-based Codex coding model; exact backend build not exposed
  • Inference provider (e.g. Anthropic, OpenAI, Azure, self-hosted): OpenAI
  • Orchestration framework if any (e.g. LangChain, AutoGen, custom): none beyond Codex desktop tooling
  • Execution mode (fully autonomous / human-supervised / human-initiated per step): human-initiated, agent-executed
  • Did the agent have shell/tool access during execution (yes / no): yes
  • Did the agent have internet access during execution (yes / no): yes
  • Were benchmark commands run by the agent directly or handed off to the human to run: run directly by the agent
  • Any known agent constraints or sandboxing that may have affected execution: local loopback benchmark only; no production/staging target, secrets, private prompt text, or private session dumps are included
