Standalone, Docker-packaged, multi-turn tool-calling benchmark that is compatible with eval-protocol and Fireworks RFT.
This repo is a pragmatic subset inspired by MCP benchmark design:
- real tool execution against mutable state (filesystem)
- deterministic verifier over post-rollout environment state
- low infra overhead (single local FastMCP server over stdio)
For an end-to-end RFT example, the cleanest path is filesystem-only MCP tasks:
- MCPMark emphasizes verifier-driven realism and notes that filesystem tasks can run with zero API-key setup on a quickstart path.
- MCP-Universe is valuable but includes optional internet/API-dependent domains and a broader server matrix.
- MCP-Bench is comprehensive but setup-heavy (multiple provider keys + a Docker stack + richer harness requirements).
References:
- `mcp_server/task_files_server.py`: local FastMCP server with task-scoped filesystem tools.
- `data/tasks.jsonl`: 8 deterministic multi-turn tasks.
- `benchmark/test_mcp_filesystem_rft.py`: `@evaluation_test` benchmark using `AgentRolloutProcessor` + a deterministic verifier.
- `benchmark/verifier.py`: strict file-state checks (`json_equals`, `text_equals`, `file_contains`).
- `Dockerfile`: standalone runnable container.
Each rollout is expected to:
- call `init_task(task_id)`
- use `list_files` / `read_file`
- produce the required output files with `write_file`
- append a completion marker to the checklist via `append_file`
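The expected tool sequence can be exercised against a plain local directory. The helper functions below are stdlib stand-ins for the MCP tools (their signatures are assumptions, not the server's actual API):

```python
import os
import tempfile

# Hypothetical stand-ins for the MCP filesystem tools.
def init_task(root: str, task_id: str) -> str:
    """Create a task-scoped sandbox directory and return its path."""
    sandbox = os.path.join(root, task_id)
    os.makedirs(sandbox, exist_ok=True)
    return sandbox

def list_files(sandbox: str) -> list[str]:
    return sorted(os.listdir(sandbox))

def read_file(sandbox: str, name: str) -> str:
    with open(os.path.join(sandbox, name)) as f:
        return f.read()

def write_file(sandbox: str, name: str, content: str) -> None:
    with open(os.path.join(sandbox, name), "w") as f:
        f.write(content)

def append_file(sandbox: str, name: str, content: str) -> None:
    with open(os.path.join(sandbox, name), "a") as f:
        f.write(content)

# One rollout's worth of calls:
root = tempfile.mkdtemp()
sandbox = init_task(root, "task_001")
write_file(sandbox, "checklist.md", "- [ ] produce output\n")
write_file(sandbox, "output.txt", "result")
append_file(sandbox, "checklist.md", "- [x] DONE\n")
print(list_files(sandbox))  # ['checklist.md', 'output.txt']
```

The verifier then inspects exactly this kind of post-rollout directory state.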
Reward is computed from real filesystem state, not just assistant text.
Install dependencies:

```bash
uv sync
```

Set Fireworks auth:

```bash
export FIREWORKS_API_KEY=...
```

Optional (default is a small Qwen model):

```bash
export MCP_AGENT_MODEL=fireworks_ai/accounts/fireworks/models/qwen3-8b
```

Optional low-cost knobs:

```bash
export MCP_AGENT_STEPS=8
export MCP_AGENT_MAX_TOKENS=512
export MCP_MAX_CONCURRENT_ROLLOUTS=1
```

Cloud/RFT note: conservative defaults (steps=6, max_tokens=192) are used to reduce context-overflow risk on small models.
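The knobs above might be consumed along the following lines; the variable names come from this README, while the parsing helper itself is an illustrative assumption:

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

# Defaults here are the conservative cloud/RFT values noted above.
agent_steps = int_env("MCP_AGENT_STEPS", 6)
max_tokens = int_env("MCP_AGENT_MAX_TOKENS", 192)
max_concurrent = int_env("MCP_MAX_CONCURRENT_ROLLOUTS", 1)
model = os.environ.get(
    "MCP_AGENT_MODEL",
    "fireworks_ai/accounts/fireworks/models/qwen3-8b",
)
```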
On-demand deployment override:

```bash
export MCP_ON_DEMAND_MODEL='fireworks_ai/accounts/fireworks/models/qwen3-32b#accounts/pyroworks/deployments/qwen3-32b-rft-py-02221823'
```

Run the benchmark:

```bash
uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s
```

Small smoke run:

```bash
EP_MAX_DATASET_ROWS=1 MCP_AGENT_STEPS=6 MCP_AGENT_MAX_TOKENS=512 uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s
```

Build and run the Docker image:

```bash
docker build -t mcpmark-lite-rft .
docker run --rm -e FIREWORKS_API_KEY="$FIREWORKS_API_KEY" mcpmark-lite-rft
```

Use a known evaluator id for this test:

```
test-mcp-filesystem-rft-test-mcpmark-lite-filesystem
```
Materialize an RFT-ready dataset first (required with older eval-protocol releases that do not auto-apply `dataset_adapter` during `create rft`):

```bash
uv run python scripts/materialize_rft_dataset.py \
  --input data/tasks.jsonl \
  --output data/rft_tasks_smoke.jsonl \
  --max-rows 1
```

Create the RFT job (a base model is required):

```bash
uv run ep create rft \
  --evaluator test-mcp-filesystem-rft-test-mcpmark-lite-filesystem \
  --dataset-jsonl data/rft_tasks_smoke.jsonl \
  --base-model accounts/fireworks/models/qwen3-8b \
  --response-candidates-count 2 \
  --max-output-tokens 1024 \
  --chunk-size 1 \
  --yes \
  --ignore-docker \
  --skip-validation
```

Monitor a job until it reaches a terminal state:

```bash
uv run python scripts/monitor_rft_job.py --job-id <rft_job_id> --account pyroworks
```

Notes:
- In this `python-sdk` branch, `create rft` auto-detects the JSONL input dataset from `@evaluation_test(input_dataset=[...])` in many cases.
- If auto-detection fails in your environment, create/upload the dataset first and rerun with `--dataset <dataset_id>`.
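The monitoring step above reduces to polling until a terminal state. A sketch with a stubbed status source standing in for the Fireworks job API (the state names here are assumptions):

```python
import time

# Assumed terminal state names for illustration only.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def wait_for_terminal(get_status, poll_seconds: float = 0.0) -> str:
    """Poll a job-status callable until it reports a terminal state."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)

# Stubbed status sequence in place of a real API client:
states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final = wait_for_terminal(lambda: next(states))
print(final)  # COMPLETED
```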
- Deterministic checks make reward stable for RL.
- Task-scoped sandboxes prevent cross-row contamination.
- No external APIs required by default, which keeps rollout generation cost and failure modes low.