jnoahbaier/data-matchmaker-benchmark

Data Matchmaker Benchmark - Green Agent

A Green Agent for the AgentBeats Competition that evaluates Purple Agents on their ability to integrate data from multiple sources, join tables correctly, and compute accurate per-customer aggregations.

---

How can an agent reason over multiple datasets if it cannot first determine when differently named fields and records refer to the same underlying entity? This project introduces an evaluation agent that measures how effectively a candidate agent can join related datasets with non-identical keys, inconsistent schemas, and noisy identifiers by using contextual signals and structured inference rather than shared primary keys. The framework evaluates the accuracy, coverage, and coherence of inferred mappings, providing a benchmark for an agent’s ability to reconcile data generated by separate processes.

For example, in an enterprise context, accounting ledgers, payroll tables, CRM exports, and operational logs often describe the same customers, employees, or transactions using different naming conventions, partial identifiers, and incompatible schemas. The evaluation agent tests whether an alignment agent can correctly standardize and map these datasets into a unified representation, a prerequisite for automation, reporting, and reliable downstream decision-making.

At scale, agents with this capability can operate across virtually any data-producing environment, enabling autonomous analytics, continuous monitoring, and decision support without bespoke data engineering, and fundamentally expanding where and how agentic systems can be deployed.

What This Benchmark Evaluates

This benchmark tests a Purple Agent's data integration capabilities using a simplified TPC-DI style task:

| Component         | Description                   | Points |
| ----------------- | ----------------------------- | ------ |
| Column Check      | Correct columns in output     | 20     |
| Row Count         | Correct number of output rows | 10     |
| Customer Coverage | All customers represented     | 15     |
| Numeric Accuracy  | Correct aggregation values    | 40     |
| String Accuracy   | Correct text fields           | 15     |

The benchmark uses a hybrid evaluation approach:

  1. Deterministic numerical scoring for accuracy metrics (0-100 points)
  2. LLM-powered qualitative feedback for insights and recommendations (via Gemini)
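As a hypothetical illustration of the deterministic half (the actual scoring logic lives in `src/agent.py`), the five component scores combine by a simple weighted sum, with the weights taken from the rubric table above:

```python
# Illustrative sketch only; weights mirror the rubric table above.
WEIGHTS = {
    "columns": 20,
    "row_count": 10,
    "customer_coverage": 15,
    "numeric_accuracy": 40,
    "string_accuracy": 15,
}

def total_score(component_fractions: dict) -> int:
    """component_fractions maps component name -> fraction correct in [0, 1]."""
    return round(sum(WEIGHTS[name] * frac
                     for name, frac in component_fractions.items()))

# A perfect run scores 100; partial numeric/string accuracy lowers it.
print(total_score({"columns": 1.0, "row_count": 1.0, "customer_coverage": 1.0,
                   "numeric_accuracy": 0.8, "string_accuracy": 2 / 3}))
# 20 + 10 + 15 + 32 + 10 = 87
```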

The Task

Purple Agents must:

  1. Fetch source data files via HTTP from the Green Agent:

    • customers_tpcdi_lite_v3.csv - Customer information
    • accounts_tpcdi_lite_v3.csv - Account data linked to customers
    • trades_tpcdi_lite_v3.csv - Trade transactions linked to accounts
  2. Join the tables correctly:

    • Accounts → Customers on customer_id
    • Trades → Accounts on account_id
  3. Filter to only completed trades (trade_status = 'CMPT')

  4. Aggregate per customer to produce:

    • customer_id - Customer identifier
    • customer_name - First + last name
    • country - Customer's country
    • num_accounts - Count of accounts
    • total_balance - Sum of account balances
    • num_trades - Count of completed trades
    • total_trade_volume - Sum of trade quantities
    • total_trade_value - Sum of (quantity × trade_price)
    • symbols_traded - Comma-separated unique stock symbols
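The join/filter/aggregate pipeline above can be sketched with in-memory SQLite (standard library); the inline rows are made-up sample data, and a real Purple Agent would operate on the CSVs fetched from the Green Agent instead:

```python
import sqlite3

# Minimal sketch of the required logic. Table and column names follow the
# task description; the tiny inline dataset is invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers(customer_id INTEGER, first_name TEXT, last_name TEXT, country TEXT);
CREATE TABLE accounts(account_id INTEGER, customer_id INTEGER, balance REAL);
CREATE TABLE trades(trade_id INTEGER, account_id INTEGER, trade_status TEXT,
                    quantity INTEGER, trade_price REAL, symbol TEXT);
INSERT INTO customers VALUES (1, 'John', 'Smith', 'USA');
INSERT INTO accounts VALUES (10, 1, 15000.50), (11, 1, 2000.00);
INSERT INTO trades VALUES
  (100, 10, 'CMPT', 50, 100.0, 'AAPL'),
  (101, 11, 'CMPT', 50, 150.0, 'GOOGL'),
  (102, 11, 'PNDG', 10,  90.0, 'MSFT');  -- not completed; must be excluded
""")

row = con.execute("""
SELECT c.customer_id,
       c.first_name || ' ' || c.last_name  AS customer_name,
       c.country,
       COUNT(DISTINCT a.account_id)        AS num_accounts,
       -- Subquery avoids counting a balance once per trade row after the join.
       (SELECT SUM(balance) FROM accounts WHERE customer_id = c.customer_id)
                                           AS total_balance,
       COUNT(t.trade_id)                   AS num_trades,
       SUM(t.quantity)                     AS total_trade_volume,
       SUM(t.quantity * t.trade_price)     AS total_trade_value,
       GROUP_CONCAT(DISTINCT t.symbol)     AS symbols_traded
FROM customers c
JOIN accounts a ON a.customer_id = c.customer_id
JOIN trades   t ON t.account_id  = a.account_id
WHERE t.trade_status = 'CMPT'
GROUP BY c.customer_id
""").fetchone()
print(row)
```

Note the subquery for `total_balance`: summing `a.balance` directly after the trades join would count each account's balance once per trade row.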

Quick Start

Prerequisites

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone <your-repo-url>
cd data-matchmaker-benchmark
uv sync

Running the Benchmark

Terminal 1 - Start the Green Agent:

uv run python src/server.py --host 127.0.0.1 --port 9009

The Green Agent exposes:

  • A2A endpoint: http://127.0.0.1:9009/ - For agent-to-agent communication
  • File listing: http://127.0.0.1:9009/files/ - Lists available data files
  • File download: http://127.0.0.1:9009/files/<filename> - Download individual files
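A Purple Agent might pull the three source files from these endpoints like so (minimal urllib sketch; the `fetch` helper assumes the Green Agent from Terminal 1 is running locally):

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Endpoint layout as listed above; adjust BASE_URL for your deployment.
BASE_URL = "http://127.0.0.1:9009/files/"
SOURCE_FILES = [
    "customers_tpcdi_lite_v3.csv",
    "accounts_tpcdi_lite_v3.csv",
    "trades_tpcdi_lite_v3.csv",
]

def file_url(name: str) -> str:
    """Build the download URL for one source file."""
    return urljoin(BASE_URL, name)

def fetch(name: str) -> str:
    """Download one file as text (requires the Green Agent to be running)."""
    with urlopen(file_url(name)) as resp:
        return resp.read().decode("utf-8")

print(file_url(SOURCE_FILES[0]))
# http://127.0.0.1:9009/files/customers_tpcdi_lite_v3.csv
```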

Terminal 2 - Start the Mock Purple Agent (for testing):

# Copy and configure your Gemini API key
cp sample.env .env
# Edit .env and add your GEMINI_API_KEY

uv run python src/mock_purple.py --host 127.0.0.1 --port 9010

Terminal 3 - Run an evaluation:

curl -s -X POST http://127.0.0.1:9009/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "message/send",
    "id": "1",
    "params": {
      "message": {
        "kind": "message",
        "role": "user",
        "parts": [{
          "kind": "text",
          "text": "{\"participants\": {\"data_integrator\": \"http://127.0.0.1:9010\"}, \"config\": {\"timeout\": 300}}"
        }],
        "messageId": "test"
      }
    }
  }' | python3 -m json.tool

Using the AgentBeats CLI

If you have the AgentBeats CLI installed:

uv run agentbeats-run scenario.toml

Project Structure

src/
├── server.py        # A2A server + file serving endpoints
├── main.py          # Alternative entry point with MCP server
├── agent.py         # Green agent: task generation + scoring + LLM feedback
├── executor.py      # A2A request handling
├── messenger.py     # A2A messaging utilities
├── mcp_server.py    # MCP resources for file access (optional)
└── mock_purple.py   # Baseline purple agent for testing

jan15_tasks/
├── customers_tpcdi_lite_v3.csv    # Customer data
├── accounts_tpcdi_lite_v3.csv     # Account data
├── trades_tpcdi_lite_v3.csv       # Trade transactions
├── prospect_tpcdi_lite_v3.xlsx    # Additional prospect data
├── finwire_tpcdi_lite_v3.xml      # Financial wire data
└── gold_ground_truth_tpcdi_lite_v3.csv  # Expected results (not exposed)

Input Format

The Green Agent expects an A2A assessment request with:

{
  "participants": {
    "data_integrator": "http://purple-agent-url:port"
  },
  "config": {
    "timeout": 300,
    "file_server_url": "http://green-agent-url:port"
  }
}

Output Format

The Green Agent returns a detailed evaluation:

{
  "score": 87,
  "max_score": 100,
  "details": {
    "columns": {"score": 20, "max": 20, "missing": [], "extra": []},
    "row_count": {"score": 10, "max": 10, "expected": 100, "submitted": 100},
    "customer_coverage": {"score": 15, "max": 15, "coverage_pct": 100.0},
    "numeric_accuracy": {"score": 32, "max": 40, "columns": {...}},
    "string_accuracy": {"score": 10, "max": 15, "fields": {...}}
  },
  "llm_feedback": {
    "enabled": true,
    "summary": "Good overall performance with minor aggregation issues...",
    "strengths": ["Correct table joins", "Complete customer coverage"],
    "weaknesses": ["Trade value calculation off by ~2%"],
    "recommendations": ["Verify trade_price handling"]
  },
  "purple_agent_url": "http://127.0.0.1:9010",
  "submitted_rows": 100,
  "expected_rows": 100
}
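A caller can sanity-check a returned evaluation by confirming that the component scores under `details` sum to the top-level deterministic score. The snippet below uses an abbreviated copy of the sample result above (the elided column/field maps are simply dropped):

```python
import json

# Abbreviated version of the sample evaluation shown above.
result = json.loads("""
{
  "score": 87,
  "max_score": 100,
  "details": {
    "columns": {"score": 20, "max": 20},
    "row_count": {"score": 10, "max": 10},
    "customer_coverage": {"score": 15, "max": 15},
    "numeric_accuracy": {"score": 32, "max": 40},
    "string_accuracy": {"score": 10, "max": 15}
  }
}
""")

# The deterministic score is the sum of its component parts.
component_total = sum(d["score"] for d in result["details"].values())
assert component_total == result["score"]
print(component_total)  # 87
```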

Purple Agent Requirements

Purple agents must:

  1. Accept a task message describing the data integration task
  2. Fetch data files via HTTP GET from the provided URLs
  3. Return CSV data with the expected columns

Example response format:

customer_id,customer_name,country,num_accounts,total_balance,num_trades,total_trade_volume,total_trade_value,symbols_traded
1,John Smith,USA,2,15000.50,5,100,12500.00,"AAPL,GOOGL"
2,Jane Doe,Canada,1,8000.00,3,50,4500.00,"MSFT"
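Before responding, a Purple Agent can validate its own output against the expected header; a minimal check with the standard `csv` module (the sample string reuses the first row above):

```python
import csv
import io

# Expected output columns, in order, from the task description.
EXPECTED_COLUMNS = [
    "customer_id", "customer_name", "country", "num_accounts", "total_balance",
    "num_trades", "total_trade_volume", "total_trade_value", "symbols_traded",
]

sample = (
    "customer_id,customer_name,country,num_accounts,total_balance,"
    "num_trades,total_trade_volume,total_trade_value,symbols_traded\n"
    '1,John Smith,USA,2,15000.50,5,100,12500.00,"AAPL,GOOGL"\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
# Header must match exactly: no missing, extra, or reordered columns.
assert list(rows[0].keys()) == EXPECTED_COLUMNS
# The quoted symbols field survives parsing as a single value.
print(rows[0]["symbols_traded"])  # AAPL,GOOGL
```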

Environment Variables

| Variable        | Description                         | Required |
| --------------- | ----------------------------------- | -------- |
| GEMINI_API_KEY  | API key for LLM feedback generation | Optional |
| FILE_SERVER_URL | Override file server URL            | Optional |
| MCP_SERVER_URL  | MCP server URL for main.py          | Optional |

Running with Docker

# Build the green agent
docker build -t tpcdi-evaluator .

# Run
docker run -p 9009:9009 -e GEMINI_API_KEY=your_key tpcdi-evaluator

Running Tests

uv sync --extra test
uv run pytest tests/

Competition Info

This is a submission for the AgentBeats Competition Phase 1 (Green Agent).

  • Track: Data Engineering
  • Benchmark Type: TPC-DI Style Data Integration
  • Skills Tested: Data fetching, table joining, filtering, aggregation, ETL

License

MIT

About

AgentBeats Green agent benchmark for evaluating schema and table merging
