GORGO

Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing.

Python Environment Setup

Create a Python virtual environment in the benchmark directory:

cd benchmark
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd ..

Setup for Research Replication/Experimentation

  1. Install Go
  2. Clone this repo
  3. Build gotoni from source:
go build -o gotoni
  4. Set up an account on Lambda.ai and run:
export LAMBDA_API_KEY=your_token_here

Email atoniolo76@gmail.com to get a Lambda.ai API key for testing.

  5. Run ./gotoni available to see available Lambda instances and ./gotoni launch to check out instances.
  • Note: Instances from at least three separate regions are required for GORGO's TTFT improvement (or two if running GORGO-proxy).
  6. Run ./gotoni cluster setup, which handles SGLang installation, Mistral-7B download, and deployment of the gotoni binary for the load-balancer processes. It uses all Lambda.ai instances currently running on your account.

Benchmarking Routing Policies

The repository includes preprocessed WildChat datasets:

  • benchmark/wildchat_guidellm.jsonl - 8,893 prompts with location mappings
  • benchmark/wildchat_location_lookup.json - 8,117 unique location entries
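
To sanity-check the dataset before benchmarking, here is a small, optional Go sketch that counts the JSONL records and prints the field names of the first one (only the prompt column is confirmed by the benchmark configuration below; other field names may differ):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("benchmark/wildchat_guidellm.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	// Allow long lines, since some WildChat prompts are large.
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024)

	count := 0
	for sc.Scan() {
		var rec map[string]any
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			log.Fatalf("line %d: %v", count+1, err)
		}
		if count == 0 {
			for field := range rec {
				fmt.Println("field:", field)
			}
		}
		count++
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("records:", count) // expect 8,893
}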

Step 1: Run Benchmarks

Available profiles: poisson, sweep, constant, concurrent, throughput

Start the geo-routing proxy and run GuideLLM benchmarks:

# Start geo-routing proxy (routes by user location from WildChat)
cd benchmark
python geo_proxy.py --port 9000 &
cd ..

# Run GuideLLM benchmark
guidellm benchmark \
  --target http://localhost:9000/v1 \
  --profile poisson \
  --rate 5 \
  --max-seconds 60 \
  --data ./benchmark/wildchat_guidellm.jsonl \
  --data-column-mapper '{"text_column": "prompt"}'

Results, including metrics such as TTFT and throughput, are saved to guidellm_results/.

Note #1: Update cluster node IPs in geo_proxy.py (lines 40-44) to match your deployment.

Note #2: Between benchmarks, run ./gotoni cluster flush-cache to clear both the KV-caches and the load balancers' prefix trees; otherwise the prefix-tree and GORGO policies will behave like least-load.

Other options

Switch between routing strategies (gorgo, prefix-tree, least-load):

./gotoni cluster restart-lb --strategy prefix-tree --max-concurrent 50

--max-concurrent sets the maximum requests forwarded per instance (higher = more parallelism, lower = stricter queueing).
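
Conceptually, this limit behaves like a per-instance counting semaphore: a request is forwarded only while fewer than --max-concurrent requests are in flight on that instance, and waits in the queue otherwise. A minimal illustration of that idea (not the load balancer's actual code):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const maxConcurrent = 3 // stand-in for --max-concurrent

	// One counting semaphore per instance; acquiring a slot forwards the
	// request, waiting on a full channel is the "stricter queueing" case.
	sem := make(chan struct{}, maxConcurrent)

	var wg sync.WaitGroup
	for i := 1; i <= 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // blocks once maxConcurrent requests are in flight
			defer func() { <-sem }() // release the slot when the request finishes
			fmt.Println("forwarding request", id)
			time.Sleep(100 * time.Millisecond) // simulate the instance serving it
		}(i)
	}
	wg.Wait()
}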

Create traces of per-GPU request queueing/forwarding/processing. Before benchmarking, start the tracing service with:

./gotoni cluster start-trace

Once the benchmark is complete, stop tracing:

./gotoni cluster stop-trace

View in perfetto.dev:

[perfetto_trace screenshot]

Tuning GORGO Parameters

Fine-tune GORGO's cost calculation by editing pkg/config/constants.go:

  1. DefaultGORGORunningCostFactor ($\hat{q}_s$ in the paper) - weight for running requests' prefill cost (default: 0.5)
    • Lower = running requests are weighted less in the cost calculation
  2. DefaultGORGOMsPerToken ($t_p$ in the paper) - prefill time per token in ms (default: 0.094)
    • Set this to the measured cold-start prefill rate on your hardware
  3. DefaultSGLangMaxRunningRequests - max concurrent requests per SGLang instance (default: 10)
    • Controls when requests queue (and therefore when the routing policy kicks in)

After changes, rebuild and redeploy: go build -o gotoni && ./gotoni cluster upload && ./gotoni cluster restart-lb
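
For reference, a stripped-down sketch of what these constants look like in pkg/config/constants.go (only the names and default values listed above come from this README; the rest of the file is omitted):

package config

const (
	// \hat{q}_s in the paper: weight applied to running requests' prefill cost.
	DefaultGORGORunningCostFactor = 0.5

	// t_p in the paper: prefill time per token, in milliseconds.
	DefaultGORGOMsPerToken = 0.094

	// Maximum concurrent requests per SGLang instance; beyond this, requests
	// queue and the routing policy decides where new work goes.
	DefaultSGLangMaxRunningRequests = 10
)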

Step 2: Analyze Benchmark Results

After a benchmark completes, metrics are saved to guidellm_results/.

View the per-GPU trace in benchmark_results/.

Important Metrics

  • Time to First Token (TTFT): First token generation latency (lower is better)
  • Throughput: Requests completed per second
  • Average Latency: Average end-to-end request latency across all nodes

GORGO Centralized Proxy

The centralized proxy (./gotoni proxy) routes requests using GORGO, tracking queued and running requests for ALL instances rather than just the local server.

# Deploy proxy to a cloud instance
./gotoni proxy deploy west_2 --listen-port 9000
./gotoni proxy --remote west_2 status

# Manage servers
./gotoni proxy --remote west_2 clear-cache # clears *mirrored* prefix tree, not actual KV-cache on instances

# Benchmark through proxy (replace IP with your instance's IP)
guidellm benchmark --target http://146.235.219.131:9000/v1 --profile poisson --rate 5 --max-seconds 60

Running this benchmark without geo_proxy.py sends all requests directly to the proxy, which then applies the GORGO forwarding policy (cache- and latency-aware).
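
For intuition, here is a rough sketch of the per-request decision such a proxy makes: score each instance by its estimated prefill work (queued tokens plus discounted running tokens, minus tokens already cached there) converted to time, plus network latency to the client, and forward to the cheapest. The actual GORGO formula is defined in the paper and the Go packages; the combination below is an assumption for illustration only:

package main

import "fmt"

// instance is an illustrative view of the state the proxy tracks per server.
type instance struct {
	name          string
	queuedTokens  float64 // prefill tokens waiting in that server's queue
	runningTokens float64 // prefill tokens of currently running requests
	cachedTokens  float64 // prompt tokens already in its KV-cache (per the mirrored prefix tree)
	latencyMs     float64 // network latency from the client's region
}

const (
	runningCostFactor = 0.5   // DefaultGORGORunningCostFactor
	msPerToken        = 0.094 // DefaultGORGOMsPerToken
)

// cost combines cache, load, and latency terms; the exact weighting used by
// GORGO is defined in the paper, this is only an assumed stand-in.
func cost(s instance, promptTokens float64) float64 {
	uncached := promptTokens - s.cachedTokens
	if uncached < 0 {
		uncached = 0
	}
	prefillMs := msPerToken * (uncached + s.queuedTokens + runningCostFactor*s.runningTokens)
	return prefillMs + s.latencyMs
}

func main() {
	servers := []instance{
		{"us-west", 2000, 4000, 1500, 12},
		{"us-east", 500, 1000, 0, 70},
		{"eu-west", 0, 0, 0, 140},
	}
	prompt := 1800.0

	best, bestCost := servers[0], cost(servers[0], prompt)
	for _, s := range servers[1:] {
		if c := cost(s, prompt); c < bestCost {
			best, bestCost = s, c
		}
	}
	fmt.Printf("forward to %s (estimated cost %.1f ms)\n", best.name, bestCost)
}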

View PrefixTrie Stats

./gotoni proxy prefix-stats
Prefix Tree Stats:
  Nodes:               257
  Leaves:              165
  Max depth:           17374
  Avg depth:           670.55
  Avg branching:       2.78
  Avg prefix length:   409.49
  Total server refs:   173
  Unique servers:      3
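
For intuition about these numbers, a toy sketch of the kind of structure they describe: each node covers a prompt prefix and records which instances are believed to hold that prefix in their KV-cache, so a deep match means a large reusable prefix on some server (field and function names below are illustrative, not the actual pkg types):

package main

import "fmt"

// node is an illustrative prefix-tree node: each node extends the prompt
// prefix by one byte and records which instances are believed to hold that
// prefix in their KV-cache ("server refs" in the stats above).
type node struct {
	children map[byte]*node
	servers  map[string]struct{}
}

func newNode() *node {
	return &node{children: map[byte]*node{}, servers: map[string]struct{}{}}
}

// insert marks every prefix of prompt as cached on server.
func insert(root *node, prompt, server string) {
	cur := root
	for i := 0; i < len(prompt); i++ {
		next, ok := cur.children[prompt[i]]
		if !ok {
			next = newNode()
			cur.children[prompt[i]] = next
		}
		cur = next
		cur.servers[server] = struct{}{}
	}
}

// match returns the longest cached prefix length and the servers holding it.
func match(root *node, prompt string) (int, map[string]struct{}) {
	cur, depth := root, 0
	for depth < len(prompt) {
		next, ok := cur.children[prompt[depth]]
		if !ok {
			break
		}
		cur, depth = next, depth+1
	}
	return depth, cur.servers
}

func main() {
	root := newNode()
	insert(root, "Plan a weekend trip to Lisbon", "us-west")
	n, servers := match(root, "Plan a weekend trip to Porto")
	fmt.Println("shared prefix bytes:", n, "held by", len(servers), "server(s)")
}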
