eval-protocol/mcpmark-lite-rft-example

MCPMark-lite RFT Example

Standalone, Docker-packaged, multi-turn tool-calling benchmark that is compatible with eval-protocol and Fireworks RFT.

This repo is a pragmatic subset inspired by MCP benchmark design:

  • real tool execution against mutable state (filesystem)
  • deterministic verifier over post-rollout environment state
  • low infra overhead (single local FastMCP server over stdio)

Why this subset

For an end-to-end RFT example, the cleanest path is filesystem-only MCP tasks:

  • MCPMark emphasizes verifier-driven realism and notes filesystem tasks can run with zero API-key setup in a quickstart path.
  • MCP-Universe is valuable but includes optional internet/API-dependent domains and a broader server matrix.
  • MCP-Bench is comprehensive but setup-heavy (multiple provider keys + Docker stack + richer harness requirements).

What is included

  • mcp_server/task_files_server.py: local FastMCP server with task-scoped filesystem tools.
  • data/tasks.jsonl: 8 deterministic multi-turn tasks.
  • benchmark/test_mcp_filesystem_rft.py: @evaluation_test benchmark using AgentRolloutProcessor + deterministic verifier.
  • benchmark/verifier.py: strict file-state checks (json_equals, text_equals, file_contains).
  • Dockerfile: standalone runnable container.
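The three check types named for benchmark/verifier.py can be illustrated with a minimal sketch. This is not the repository's implementation: the check schema (type, path, expected), the whitespace-trimming in text_equals, and the all-or-nothing score are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_check(root: Path, check: dict) -> bool:
    """Evaluate one check against a file under root.

    The check schema {"type", "path", "expected"} is hypothetical.
    """
    target = root / check["path"]
    if not target.is_file():
        return False  # a missing output file always fails
    text = target.read_text(encoding="utf-8")
    kind = check["type"]
    if kind == "json_equals":
        try:
            return json.loads(text) == check["expected"]
        except json.JSONDecodeError:
            return False  # malformed JSON counts as a failed check
    if kind == "text_equals":
        # Assumption: surrounding whitespace is ignored.
        return text.strip() == check["expected"].strip()
    if kind == "file_contains":
        return check["expected"] in text
    raise ValueError(f"unknown check type: {kind}")


def score(root: Path, checks: list) -> float:
    """Return 1.0 only if every check passes, else 0.0 (assumed strict scoring)."""
    return 1.0 if all(run_check(root, c) for c in checks) else 0.0
```

Because every check reads the sandbox directory rather than the transcript, the same final file state always yields the same score.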

Tooling pattern

Each rollout is expected to:

  1. call init_task(task_id)
  2. use list_files / read_file
  3. produce required output files with write_file
  4. append a completion marker to the checklist via append_file

Reward is computed from real filesystem state, not just assistant text.
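The four steps above can be sketched with a plain-Python stand-in for the task-scoped tools. The TaskSandbox class, its fixtures argument, and the path-escape guard are hypothetical; the real tools live in mcp_server/task_files_server.py and are served over FastMCP, which this sketch does not use.

```python
from pathlib import Path


class TaskSandbox:
    """Hypothetical stand-in for the server's task-scoped filesystem tools."""

    def __init__(self, base: Path, fixtures: dict):
        self.base = base
        self.fixtures = fixtures  # assumed shape: task_id -> {relative path: content}
        self.root = None

    def _resolve(self, rel_path: str) -> Path:
        if self.root is None:
            raise RuntimeError("call init_task first")
        target = (self.root / rel_path).resolve()
        # Reject any path that would land outside the task directory.
        if not target.is_relative_to(self.root):
            raise ValueError(f"path escapes sandbox: {rel_path}")
        return target

    def init_task(self, task_id: str) -> None:
        seed = self.fixtures[task_id]  # unknown task ids raise KeyError
        self.root = (self.base / task_id).resolve()
        self.root.mkdir(parents=True, exist_ok=True)
        for rel, content in seed.items():
            self.write_file(rel, content)

    def list_files(self) -> list:
        root = self._resolve(".")
        return sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
        )

    def read_file(self, rel_path: str) -> str:
        return self._resolve(rel_path).read_text(encoding="utf-8")

    def write_file(self, rel_path: str, content: str) -> None:
        target = self._resolve(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    def append_file(self, rel_path: str, content: str) -> None:
        with self._resolve(rel_path).open("a", encoding="utf-8") as fh:
            fh.write(content)
```

A rollout in this sketch would call init_task("demo"), inspect the seeded files with list_files and read_file, write its answer with write_file, and finish with append_file on the checklist. A verifier can then score the directory contents directly.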

Local setup

uv sync

Set Fireworks auth:

export FIREWORKS_API_KEY=...

Optional model override (the default is a small Qwen model):

export MCP_AGENT_MODEL=fireworks_ai/accounts/fireworks/models/qwen3-8b

Optional low-cost knobs:

export MCP_AGENT_STEPS=8
export MCP_AGENT_MAX_TOKENS=512
export MCP_MAX_CONCURRENT_ROLLOUTS=1

Cloud/RFT note: conservative defaults (steps=6, max_tokens=192) are used to reduce context-overflow risk on small models.
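As a rough illustration of how these environment knobs might be read, the sketch below parses them into a small config object. The AgentConfig and load_config names are hypothetical, the steps and max_tokens defaults are taken from the note above, and the default model string and the concurrency default of 1 are assumptions rather than values confirmed by the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the README only says the default is "a small qwen model".
ASSUMED_DEFAULT_MODEL = "fireworks_ai/accounts/fireworks/models/qwen3-8b"


@dataclass(frozen=True)
class AgentConfig:
    model: str
    steps: int
    max_tokens: int
    max_concurrent_rollouts: int


def _int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a positive integer from env, falling back to default when unset."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


def load_config(env: Optional[Mapping[str, str]] = None) -> AgentConfig:
    env = os.environ if env is None else env
    return AgentConfig(
        model=env.get("MCP_AGENT_MODEL") or ASSUMED_DEFAULT_MODEL,
        steps=_int_env(env, "MCP_AGENT_STEPS", 6),
        max_tokens=_int_env(env, "MCP_AGENT_MAX_TOKENS", 192),
        max_concurrent_rollouts=_int_env(env, "MCP_MAX_CONCURRENT_ROLLOUTS", 1),
    )
```

Failing fast on a non-numeric or non-positive value surfaces a mistyped export before any rollout tokens are spent.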

On-demand deployment override:

export MCP_ON_DEMAND_MODEL='fireworks_ai/accounts/fireworks/models/qwen3-32b#accounts/pyroworks/deployments/qwen3-32b-rft-py-02221823'

Run benchmark

uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s

Small smoke run:

EP_MAX_DATASET_ROWS=1 MCP_AGENT_STEPS=6 MCP_AGENT_MAX_TOKENS=512 uv run pytest benchmark/test_mcp_filesystem_rft.py::test_mcpmark_lite_filesystem -q -s

Docker run

docker build -t mcpmark-lite-rft .
docker run --rm -e FIREWORKS_API_KEY="$FIREWORKS_API_KEY" mcpmark-lite-rft

Fireworks RFT flow

Use a known evaluator id for this test:

  • test-mcp-filesystem-rft-test-mcpmark-lite-filesystem

Materialize an RFT-ready dataset first (required when using older eval-protocol releases that do not auto-apply dataset_adapter during create rft):

uv run python scripts/materialize_rft_dataset.py \
  --input data/tasks.jsonl \
  --output data/rft_tasks_smoke.jsonl \
  --max-rows 1

Create RFT (base model required):

uv run ep create rft \
  --evaluator test-mcp-filesystem-rft-test-mcpmark-lite-filesystem \
  --dataset-jsonl data/rft_tasks_smoke.jsonl \
  --base-model accounts/fireworks/models/qwen3-8b \
  --response-candidates-count 2 \
  --max-output-tokens 1024 \
  --chunk-size 1 \
  --yes \
  --ignore-docker \
  --skip-validation

Monitor a job until terminal state:

uv run python scripts/monitor_rft_job.py --job-id <rft_job_id> --account pyroworks

Notes:

  • In this python-sdk branch, create rft can usually auto-detect the JSONL input dataset from @evaluation_test(input_dataset=[...]).
  • If auto-detection fails in your environment, create/upload dataset first and rerun with --dataset <dataset_id>.

Benchmark design constraints

  • Deterministic checks make reward stable for RL.
  • Task-scoped sandboxes prevent cross-row contamination.
  • No external APIs are required by default, which keeps rollout generation cost low and limits the number of failure modes.
