Huanyu Wang, Ziyu Xia, Zhuoming Chen, Beidi Chen
Carnegie Mellon University
WWW.Serve operates as an intermediate decentralized serving layer between users and LLM service providers, offering users access to an open and competitive market of worldwide LLM services while preserving service providers’ anonymity and flexibility. Within WWW.Serve, inference requests follow a collaborative workflow that performs decentralized routing, execution, and quality-aware evaluation.
Three key designs are integrated:
- a credit-based transaction system for trustless collaboration.
- a gossip-driven protocol for dynamic peer synchronization.
- a duel-and-judge mechanism for robust contributor evaluation.
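As a rough illustration of the second design, one round of gossip-style peer synchronization can be sketched as below. This is a minimal, hypothetical sketch: the class, field names, and last-writer-wins merge rule are assumptions for illustration, not WWW.Serve's actual protocol.

```python
import random
import time

class GossipNode:
    """Hypothetical sketch of gossip-driven peer synchronization.
    Each node keeps a timestamped view of every peer's status and
    periodically reconciles it with one random neighbor.
    (Illustrative only; not the WWW.Serve implementation.)"""

    def __init__(self, node_id):
        self.node_id = node_id
        self.peers = []  # other GossipNode instances this node knows about
        # local view of the network: node_id -> (timestamp, status)
        self.view = {node_id: (time.time(), "idle")}

    def update_status(self, status):
        # record a change in this node's own serving status
        self.view[self.node_id] = (time.time(), status)

    def merge(self, other_view):
        # last-writer-wins: keep whichever entry has the newer timestamp
        for nid, (ts, status) in other_view.items():
            if nid not in self.view or ts > self.view[nid][0]:
                self.view[nid] = (ts, status)

    def gossip_round(self):
        # push-pull exchange with one random peer; after a few rounds,
        # status updates propagate through the whole network
        if not self.peers:
            return
        peer = random.choice(self.peers)
        peer.merge(self.view)
        self.merge(peer.view)
```

With periodic rounds, any status change eventually reaches all nodes without a central coordinator.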
Under various configurations, WWW.Serve improves global SLO attainment by up to 1.5x and lowers latency by 27.6%. Its performance approaches, and in some cases surpasses, centralized scheduling, while preserving the benefits of decentralization.
The repository is organized as follows:
WWWServe/
├── experiments/
│   ├── simulation/
│   │   └── simu_xxx.py
│   └── visualization/
│       └── visualize_xxx.ipynb
├── node_configs/
│   └── nodex.yaml
├── www_serve/
│   ├── policies/
│   │   └── default_policy.py
│   └── core_codes.py
├── README.md
└── requirements.txt
- experiments/: simulation scripts for network experiments and notebooks for visualization.
- node_configs/: YAML configuration files specifying parameters for each node.
- www_serve/: core implementation of WWW.Serve, including policies and scheduling logic.
conda create -n wwwserve python=3.12
conda activate wwwserve
pip install -r requirements.txt
Note: The above installs only the core dependencies for scheduling. To deploy actual LLM servers, you will need to manually install additional backends (e.g., SGLang, vLLM) depending on your experimental setup.
The typical workflow of WWW.Serve consists of the following steps:
Start your preferred LLM backend (OpenAI-Compatible Server) and obtain its base URL and API key. For example:
# SGLang
# Note: use "--enable-metrics" to expose server status for scheduling
python3 -m sglang.launch_server --model-path $MODEL_PATH --host 0.0.0.0 --port $PORT --enable-metrics
# vLLM
vllm serve $MODEL_PATH --max-model-len=16384 --host 0.0.0.0 --port $PORT

Each node is specified via a YAML configuration file placed in node_configs/:
server_params:
  ip: 127.0.0.1  # node communication address
  port: 5778
  policy: default_sglang  # scheduling / dispatching policy
  offload_frequency: 0.8
  queue_frequency: 0.2
  accept_frequency: 0.8
ledger_params:
  initial_credit: 1000.0
  initial_staked: 0.0
models:
  - model_path: Qwen/Qwen3-8B
    base_url: <BASE_URL>:<PORT>  # from launched LLM server (Step 1)
    api_key: None
gen_params:
  max_tokens: 8192
  temperature: 0.0
  top_p: 0.95
dispatch_params:
  target_token_usage: 0.6

Use the scripts in experiments/simulation/ to start a network simulation. For example:

cd experiments/simulation
python simu_decentralized.py
# python simu_centralized.py
# python simu_single.py

By default, the simulation outputs are saved in experiments/results/, including:
- nodex.json: runtime status log of each node.
- result.json: aggregated results for all requests.
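As a quick sanity check before opening the notebooks, the aggregated results can be summarized directly. Note this is a hedged sketch: the per-request field names ("latency", "slo") are assumptions, not the actual result.json schema; adjust them to match your simulation output.

```python
import json

def summarize(path="experiments/results/result.json"):
    """Sketch of post-processing result.json. Assumes the file holds a
    list of per-request records with (hypothetical) "latency" and "slo"
    fields; adapt the keys to the real schema."""
    with open(path) as f:
        requests = json.load(f)
    latencies = [r["latency"] for r in requests]
    # a request attains its SLO if it finished within its latency target
    attained = sum(1 for r in requests if r["latency"] <= r["slo"])
    return {
        "num_requests": len(requests),
        "slo_attainment": attained / len(requests),
        "mean_latency": sum(latencies) / len(latencies),
    }
```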
The simulation results can be analyzed with the Jupyter notebooks in experiments/visualization/, which visualize global SLO attainment, request latency distribution, and server load status.
TODO