Standalone, Docker-packaged, multi-turn tool-calling benchmark that is compatible with eval-protocol and Fireworks RFT.
This repo is a pragmatic subset inspired by MCP benchmark design:
- real tool execution against mutable state (filesystem)
- deterministic verifier over post-rollout environment state
- low infra overhead (single local FastMCP server over stdio)
For an end-to-end RFT example, the cleanest path is filesystem-only MCP tasks:
- MCPMark emphasizes verifier-driven realism and notes that filesystem tasks can run with zero API-key setup on a quickstart path.
- MCP-Universe is valuable but includes optional internet/API-dependent domains and a broader server matrix.
- MCP-Bench is comprehensive but setup-heavy (multiple provider keys + a Docker stack + richer harness requirements).
References:
- `mcp_server/task_files_server.py`: local FastMCP server with task-scoped filesystem tools.
- `data/tasks.jsonl`: 8 deterministic multi-turn tasks.
- `benchmark/test_mcp_filesystem_rft.py`: `@evaluation_test` benchmark using `AgentRolloutProcessor` + a deterministic verifier.
- `benchmark/verifier.py`: strict file-state checks (`json_equals`, `text_equals`, `file_contains`).
- `Dockerfile`: standalone runnable container.
Each rollout is expected to:
- call `init_task(task_id)`
- use `list_files` / `read_file`
- produce the required output files with `write_file`
- append a completion marker to the checklist via `append_file`
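The expected tool sequence can be exercised against a plain local directory. The helper functions below are stdlib stand-ins for the MCP tools (their signatures are assumptions, not the server's actual API):

```python
import os
import tempfile

# Hypothetical stand-ins for the MCP filesystem tools.
def init_task(root: str, task_id: str) -> str:
    """Create a task-scoped sandbox directory and return its path."""
    sandbox = os.path.join(root, task_id)
    os.makedirs(sandbox, exist_ok=True)
    return sandbox

def list_files(sandbox: str) -> list[str]:
    return sorted(os.listdir(sandbox))

def read_file(sandbox: str, name: str) -> str:
    with open(os.path.join(sandbox, name)) as f:
        return f.read()

def write_file(sandbox: str, name: str, content: str) -> None:
    with open(os.path.join(sandbox, name), "w") as f:
        f.write(content)

def append_file(sandbox: str, name: str, content: str) -> None:
    with open(os.path.join(sandbox, name), "a") as f:
        f.write(content)

# One rollout's worth of calls:
root = tempfile.mkdtemp()
sandbox = init_task(root, "task_001")
write_file(sandbox, "checklist.md", "- [ ] produce output\n")
write_file(sandbox, "output.txt", "result")
append_file(sandbox, "checklist.md", "- [x] DONE\n")
print(list_files(sandbox))  # ['checklist.md', 'output.txt']
```

The verifier then inspects exactly this kind of post-rollout directory state.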
Reward is computed from real filesystem state, not just assistant text.
Install dependencies:

```bash
uv sync
```

Set Fireworks auth:

```bash
export FIREWORKS_API_KEY=...
```

Optional (default is a small Qwen model):

```bash
export MCP_AGENT_MODEL=fireworks_ai/accounts/fireworks/models/qwen3-8b
```

Optional low-cost knobs:

```bash
export MCP_AGENT_STEPS=8
export MCP_AGENT_MAX_TOKENS=512
export MCP_MAX_CONCURRENT_ROLLOUTS=1
```

Cloud/RFT note: conservative defaults (steps=6, max_tokens=192) are used to reduce context-overflow risk on small models.
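The knobs above might be consumed along the following lines; the variable names come from this README, while the parsing helper itself is an illustrative assumption:

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

# Defaults here are the conservative cloud/RFT values noted above.
agent_steps = int_env("MCP_AGENT_STEPS", 6)
max_tokens = int_env("MCP_AGENT_MAX_TOKENS", 192)
max_concurrent = int_env("MCP_MAX_CONCURRENT_ROLLOUTS", 1)
model = os.environ.get(
    "MCP_AGENT_MODEL",
    "fireworks_ai/accounts/fireworks/models/qwen3-8b",
)
```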
On-demand deployment override:

```bash
export MCP_ON_DEMAND_MODEL='fireworks_ai/accounts/fireworks/models/qwen3-32b#accounts/pyroworks/deployments/qwen3-32b-rft-py-02221823'
```

Run the benchmark:

```bash
uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s
```

Small smoke run:

```bash
EP_MAX_DATASET_ROWS=1 MCP_AGENT_STEPS=6 MCP_AGENT_MAX_TOKENS=512 uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s
```

Build and run the Docker image:

```bash
docker build -t mcpmark-lite-rft .
docker run --rm -e FIREWORKS_API_KEY="$FIREWORKS_API_KEY" mcpmark-lite-rft
```

Use a known evaluator id for this test:

```
test-mcp-filesystem-rft-test-mcpmark-lite-filesystem
```
Materialize an RFT-ready dataset first (required with older eval-protocol releases that do not auto-apply `dataset_adapter` during `create rft`):

```bash
uv run python scripts/materialize_rft_dataset.py \
  --input data/tasks.jsonl \
  --output data/rft_tasks_smoke.jsonl \
  --max-rows 1
```

Create the RFT job (a base model is required):

```bash
uv run ep create rft \
  --evaluator test-mcp-filesystem-rft-test-mcpmark-lite-filesystem \
  --dataset-jsonl data/rft_tasks_smoke.jsonl \
  --base-model accounts/fireworks/models/qwen3-8b \
  --response-candidates-count 2 \
  --max-output-tokens 1024 \
  --chunk-size 1 \
  --yes \
  --ignore-docker \
  --skip-validation
```

Monitor a job until it reaches a terminal state:

```bash
uv run python scripts/monitor_rft_job.py --job-id <rft_job_id> --account pyroworks
```

Notes:
- In this `python-sdk` branch, `create rft` auto-detects the JSONL input dataset from `@evaluation_test(input_dataset=[...])` in many cases.
- If auto-detection fails in your environment, create/upload the dataset first and rerun with `--dataset <dataset_id>`.
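The monitoring step above reduces to polling until a terminal state. A sketch with a stubbed status source standing in for the Fireworks job API (the state names here are assumptions):

```python
import time

# Assumed terminal state names for illustration only.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def wait_for_terminal(get_status, poll_seconds: float = 0.0) -> str:
    """Poll a job-status callable until it reports a terminal state."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)

# Stubbed status sequence in place of a real API client:
states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final = wait_for_terminal(lambda: next(states))
print(final)  # COMPLETED
```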
- Deterministic checks make reward stable for RL.
- Task-scoped sandboxes prevent cross-row contamination.
- No external APIs required by default, which keeps rollout generation cost and failure modes low.