Kernel Eval Protocol

This package integrates the Eval Protocol framework with Kernel browser pools to provide standardized evaluation of browser-based vision-language model (VLM) agents.

Overview

The integration provides:

KernelBrowserRolloutProcessor: A custom rollout processor that runs multi-step browser episodes using Kernel browser pools
core/: Vendored agent code (QwenAgent, WebJudge, browser adapter) from kernel-tinker-rl
agent_auth/: Agent Auth benchmark configuration and custom actions
test_agent_auth.py: Example evaluation test for the Agent Auth benchmark

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                     Eval Protocol                               │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  @evaluation_test(...)                                    │  │
│  │  async def test_agent_auth(row):                         │  │
│  │      # Rollout already executed by processor              │  │
│  │      trajectory = get_trajectory(row)                     │  │
│  │      score = webjudge.evaluate(trajectory)               │  │
│  └──────────────────────────────────────────────────────────┘  │
│                            │                                    │
│                            ▼                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  KernelBrowserRolloutProcessor                            │  │
│  │    1. Acquire browser from Kernel pool                    │  │
│  │    2. Navigate to initial URL                             │  │
│  │    3. Run agent loop (QwenAgent)                          │  │
│  │       - Screenshot → VLM predict → Execute → Repeat       │  │
│  │    4. Capture trajectory (screenshots, actions, messages) │  │
│  │    5. Release browser                                     │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
              ┌─────────────────────────────┐
              │      Kernel Browser Pool     │
              │  ┌─────┐ ┌─────┐ ┌─────┐    │
              │  │ 🌐  │ │ 🌐  │ │ 🌐  │    │
              │  └─────┘ └─────┘ └─────┘    │
              └─────────────────────────────┘
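
The five processor steps in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions: FakeBrowser, FakeAgent, and run_episode are hypothetical stand-ins, not the classes in core/ or the Kernel SDK, the real processor is asynchronous, and the "max_steps" termination reason is an assumed label.

```python
from dataclasses import dataclass


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a browser acquired from a Kernel pool."""
    url: str = ""
    released: bool = False

    def navigate(self, url: str) -> None:
        self.url = url

    def screenshot(self) -> bytes:
        return f"screenshot of {self.url}".encode()

    def execute(self, action: str) -> None:
        pass  # a real adapter would click, type, scroll, ...


@dataclass
class FakeAgent:
    """Hypothetical agent that replays a scripted list of actions."""
    script: list[str]
    step: int = 0

    def predict(self, screenshot: bytes) -> str:
        action = self.script[min(self.step, len(self.script) - 1)]
        self.step += 1
        return action


def run_episode(browser, agent, initial_url, max_steps, terminal_prefix="found_inputs"):
    """Run one episode and return the captured trajectory."""
    # "max_steps" is an assumed label for episodes that never reach a terminal action.
    trajectory = {"screenshots": [], "action_history": [], "termination_reason": "max_steps"}
    try:
        browser.navigate(initial_url)                   # step 2
        for _ in range(max_steps):                      # step 3
            shot = browser.screenshot()
            action = agent.predict(shot)
            trajectory["screenshots"].append(shot)      # step 4
            trajectory["action_history"].append(action)
            if action.startswith(terminal_prefix):
                trajectory["termination_reason"] = "terminal_action"
                break
            browser.execute(action)
    finally:
        browser.released = True                         # step 5: always release
    return trajectory


browser = FakeBrowser()  # step 1: stands in for acquiring a browser from the pool
agent = FakeAgent(script=["click(100, 200)", "found_inputs(...)"])
result = run_episode(browser, agent, "https://gandi.net", max_steps=15)
print(result["termination_reason"], result["action_history"])
```

Releasing the browser in a finally block reflects the intent of step 5: a pooled browser should go back to the pool even when the agent loop raises.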

Setup

Prerequisites

Kernel API Key: Get from https://onkernel.com
OpenAI API Key: Get from https://platform.openai.com (for WebJudge scoring)
Fireworks API Key: Get from https://fireworks.ai (for VLM inference)
Browser Pool: Create a Kernel browser pool named eval-browser-pool

Installation

# Clone this repo
git clone <this-repo> kernel-eval-protocol
cd kernel-eval-protocol

# Create and activate a virtual environment, then install dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Set environment variables (or create a .env file)
export KERNEL_API_KEY="your-kernel-key"
export OPENAI_API_KEY="your-openai-key"
export FIREWORKS_API_KEY="your-fireworks-key"

Create .env file

KERNEL_API_KEY=your-kernel-key
OPENAI_API_KEY=your-openai-key
FIREWORKS_API_KEY=your-fireworks-key
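
Before launching a run, it can help to confirm that all three keys are set. The helper below is a hypothetical sketch, not part of this package. It only inspects a mapping such as os.environ and does not read a .env file itself.

```python
import os

REQUIRED_KEYS = ("KERNEL_API_KEY", "OPENAI_API_KEY", "FIREWORKS_API_KEY")


def missing_keys(env=os.environ):
    """Return the required keys that are unset or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping instead of the real environment:
print(missing_keys({"KERNEL_API_KEY": "k", "OPENAI_API_KEY": ""}))
# → ['OPENAI_API_KEY', 'FIREWORKS_API_KEY']
```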

Create Browser Pool

# Using Kernel CLI
kernel pools create eval-browser-pool --size 20

Usage

Run Agent Auth Evaluation

# Activate venv
source .venv/bin/activate

# Run evaluation
pytest test_agent_auth.py -vs

Custom Evaluations

Create your own evaluation test:

from eval_protocol.pytest import evaluation_test
from eval_protocol.models import EvaluateResult
from kernel_browser_rollout_processor import (
    KernelBrowserRolloutProcessor,
    decode_screenshots,
)
from core.reward_models.webjudge import Trajectory, WebJudge
from agent_auth.actions import AGENT_AUTH_ACTIONS
from agent_auth.config import get_agent_auth_system_prompt

@evaluation_test(
    input_dataset=["your_tasks.jsonl"],
    rollout_processor=KernelBrowserRolloutProcessor(
        pool_name="your-pool",
        max_steps=15,
        system_prompt=get_agent_auth_system_prompt(),
        extra_actions=AGENT_AUTH_ACTIONS,
    ),
    completion_params=[{"model": "accounts/fireworks/models/qwen3-vl-30b-a3b-thinking"}],
)
async def test_your_evaluation(row):
    # Trajectory data is in row.execution_metadata.extra
    extra = row.execution_metadata.extra
    screenshots = decode_screenshots(extra["screenshots_b64"])
    actions = extra["action_history"]
    
    # Message history (including tool_calls) is in row.messages
    messages = row.messages
    
    # Your evaluation logic here. your_scorer is a placeholder for your own
    # scoring function; it is not provided by this package.
    score = your_scorer(screenshots, actions)
    
    row.evaluation_result = EvaluateResult(score=score, reason="...")
    return row
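
The scorer called in the test above is left to the reader. As one illustrative possibility (an assumption, not the Agent Auth benchmark's scoring method, which uses WebJudge), a rule-based scorer could give full credit when the last recorded action is a found_inputs(...) call. This assumes the terminal action also appears as the last entry of action_history.

```python
def your_scorer(screenshots, actions):
    """Hypothetical rule-based scorer: 1.0 if the last action is found_inputs(...), else 0.0.

    screenshots is accepted to match the call in the test above but is unused here.
    """
    if not actions:
        return 0.0
    return 1.0 if actions[-1].startswith("found_inputs(") else 0.0


print(your_scorer([], ["click(100, 200)", "found_inputs(...)"]))  # → 1.0
print(your_scorer([], ["click(100, 200)"]))                       # → 0.0
```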

Train with RFT

Create a reinforcement fine-tuning job using evaluation results:

ep create rft \
  --base-model accounts/fireworks/models/qwen3-vl-8b-instruct \
  --chunk-size 50 \
  --max-context-length 32768 \
  --batch-size 32768 \
  --epochs 4

Project Structure

kernel-eval-protocol/
├── core/                          # Vendored from kernel-tinker-rl
│   ├── agent.py                   # QwenAgent - VLM agent with message history
│   ├── agent_loop.py              # Multi-step agent loop
│   ├── browser.py                 # Kernel browser adapter
│   ├── actions.py                 # Action definitions (click, type, etc.)
│   ├── prompts.py                 # System prompt builder
│   ├── tracking.py                # Episode tracking utilities
│   ├── utils.py                   # Helper utilities
│   └── reward_models/
│       ├── base.py                # Base reward model
│       └── webjudge.py            # WebJudge LLM-as-Judge scorer
├── agent_auth/                    # Agent Auth benchmark
│   ├── actions.py                 # FoundInputsAction for form discovery
│   └── config.py                  # System prompt configuration
├── kernel_browser_rollout_processor.py  # Eval Protocol rollout processor
├── test_agent_auth.py             # Example evaluation test
├── tasks.jsonl                    # Agent Auth task dataset
├── requirements.txt               # Python dependencies
├── pytest.ini                     # Pytest configuration
└── README.md

Data Format

Input Dataset (tasks.jsonl)

{
  "id": "gandi-net-register",
  "initial_url": "https://gandi.net",
  "task": "Navigate to gandi.net and find the register page..."
}
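
The record above is pretty-printed for readability; in a JSONL file each task occupies a single line. A minimal sketch of loading and checking such a file follows. parse_tasks and REQUIRED_FIELDS are hypothetical names, and treating all three fields as required is an assumption based on the example record.

```python
import json

REQUIRED_FIELDS = ("id", "initial_url", "task")


def parse_tasks(lines):
    """Parse JSONL lines into task dicts, raising ValueError on missing fields."""
    tasks = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in task]
        if missing:
            raise ValueError(f"line {number}: missing fields {missing}")
        tasks.append(task)
    return tasks


sample = [
    '{"id": "gandi-net-register", "initial_url": "https://gandi.net", '
    '"task": "Navigate to gandi.net and find the register page..."}'
]
print(parse_tasks(sample)[0]["id"])  # → gandi-net-register

# With a real file: with open("tasks.jsonl") as f: tasks = parse_tasks(f)
```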

Message History (row.messages)

The agent preserves full conversation history including native tool calls:

[
    {"role": "system", "content": "...system prompt..."},
    {"role": "user", "content": [{"type": "image_url", ...}, {"type": "text", ...}]},
    {"role": "assistant", "content": "...", "tool_calls": [{"id": "...", "function": {...}}]},
    ...
]
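
A scorer may want the tool calls on their own. The sketch below assumes the messages are plain dicts shaped as in the example above; if row.messages holds model objects rather than dicts, attribute access would be needed instead. extract_tool_calls is a hypothetical helper, and the click tool call in the sample is made up for illustration.

```python
def extract_tool_calls(messages):
    """Collect every tool call from assistant messages, in order."""
    calls = []
    for message in messages:
        if message.get("role") == "assistant":
            calls.extend(message.get("tool_calls") or [])
    return calls


history = [
    {"role": "system", "content": "You are a browser agent."},
    {"role": "user", "content": [{"type": "text", "text": "Find the register page"}]},
    {
        "role": "assistant",
        "content": "Clicking the sign-up link.",
        "tool_calls": [{"id": "call_1", "function": {"name": "click", "arguments": "{}"}}],
    },
    {"role": "assistant", "content": "No tool call in this turn."},
]
print([call["function"]["name"] for call in extract_tool_calls(history)])  # → ['click']
```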

Trajectory Data (row.execution_metadata.extra)

{
    # For evaluation (WebJudge)
    "screenshots_b64": ["base64...", ...],  # PNG images as base64
    "action_history": ["click(100, 200)", ...],
    
    # Task info
    "task": "Navigate to...",
    "task_id": "gandi-net-register",
    "initial_url": "https://gandi.net",
    "final_url": "https://gandi.net/register",
    
    # Episode metadata
    "termination_reason": "terminal_action",
    "terminal_action": "found_inputs(...)",
    "steps_completed": 5,
    "error": None,
    
    # Step details (for debugging)
    "step_results": [...]
}
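
Since screenshots_b64 holds base64-encoded PNG images, decoding them takes only the standard library. The function below is a hypothetical stand-in for illustration. It is not the package's decode_screenshots, whose return type is not documented here, and it returns raw bytes after checking the PNG file signature.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_list(screenshots_b64):
    """Decode base64 strings to raw bytes and check each starts with the PNG signature."""
    images = []
    for index, encoded in enumerate(screenshots_b64):
        raw = base64.b64decode(encoded)
        if not raw.startswith(PNG_SIGNATURE):
            raise ValueError(f"screenshot {index} is not a PNG")
        images.append(raw)
    return images


# A fake payload with a valid signature, standing in for a real screenshot:
fake_png = base64.b64encode(PNG_SIGNATURE + b"payload").decode()
print(len(decode_png_list([fake_png])))  # → 1
```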

Modifications to Vendored Code

The core/ directory is vendored from kernel-tinker-rl with the following modifications:

1. Message History Preservation (`core/agent.py`)

The original QwenAgent only stored text responses, discarding native tool calls. We modified it to preserve the full conversation history:

Added messages: list[dict] to AgentState
predict() now stores:
- System message (on first call)
- User messages (with screenshots + instruction)
- Assistant responses including tool_calls when present
Added get_messages() method to retrieve the full history

This enables accurate conversation replay in the Eval Protocol UI.
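
The shape of this change can be sketched as follows. It is a reduced illustration, not the vendored code: AgentState here carries only the messages field, and RecordingAgent and record_turn are hypothetical names standing in for the bookkeeping that predict() performs.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Reduced state holding only the messages field described above."""
    messages: list[dict] = field(default_factory=list)


class RecordingAgent:
    """Illustrative agent that records history as the list above describes."""

    def __init__(self, system_prompt):
        self.system_prompt = system_prompt
        self.state = AgentState()

    def record_turn(self, user_content, assistant_text, tool_calls=None):
        if not self.state.messages:  # system message only on the first call
            self.state.messages.append({"role": "system", "content": self.system_prompt})
        self.state.messages.append({"role": "user", "content": user_content})
        assistant = {"role": "assistant", "content": assistant_text}
        if tool_calls:  # keep native tool calls when present
            assistant["tool_calls"] = tool_calls
        self.state.messages.append(assistant)

    def get_messages(self):
        """Return a copy of the full history."""
        return list(self.state.messages)


agent = RecordingAgent("You are a browser agent.")
agent.record_turn("screenshot 1 + instruction", "Clicking.", tool_calls=[{"id": "call_1"}])
agent.record_turn("screenshot 2 + instruction", "Done.")
print([m["role"] for m in agent.get_messages()])
# → ['system', 'user', 'assistant', 'user', 'assistant']
```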

2. WebJudge OpenAI API Compatibility (`core/reward_models/webjudge.py`)

The original code passed max_tokens and temperature=0, which work through OpenRouter but not in direct OpenAI API calls to newer models such as gpt-5-mini:

Changed max_tokens → max_completion_tokens
Removed temperature=0 (newer OpenAI models only support the default temperature of 1)

These changes allow WebJudge to work with direct OpenAI API calls instead of requiring OpenRouter.
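
The effect of the two changes on the request parameters can be shown with a small helper. This is a sketch for illustration only: to_openai_kwargs is a hypothetical function, and the vendored code applies the changes directly at the call site rather than through a helper like this.

```python
def to_openai_kwargs(kwargs):
    """Return a copy of chat-completion kwargs adapted as described above."""
    adapted = dict(kwargs)
    if "max_tokens" in adapted:
        # Rename the token limit to the parameter newer models expect.
        adapted["max_completion_tokens"] = adapted.pop("max_tokens")
    # Drop temperature so the model's default is used.
    adapted.pop("temperature", None)
    return adapted


print(to_openai_kwargs({"model": "gpt-5-mini", "max_tokens": 1024, "temperature": 0}))
# → {'model': 'gpt-5-mini', 'max_completion_tokens': 1024}
```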

Related Projects

kernel-tinker-rl - VLM RL training for computer use (source of vendored core/)
Eval Protocol - Pytest-based LLM evaluation framework
Kernel - Browser-as-a-service
