jnoahbaier/data-matchmaker-benchmark

Data Matchmaker Benchmark - Green Agent

A Green Agent for the AgentBeats Competition that evaluates Purple Agents on their ability to integrate data from multiple sources, join tables correctly, and compute accurate per-customer aggregations.

---

How can an agent reason over multiple datasets if it cannot first determine when differently named fields and records refer to the same underlying entity? This project introduces an evaluation agent that measures how effectively a candidate agent can join related datasets with non-identical keys, inconsistent schemas, and noisy identifiers by using contextual signals and structured inference rather than shared primary keys. The framework evaluates the accuracy, coverage, and coherence of inferred mappings, providing a benchmark for an agent’s ability to reconcile data generated by separate processes.

For example, in an enterprise context, accounting ledgers, payroll tables, CRM exports, and operational logs often describe the same customers, employees, or transactions using different naming conventions, partial identifiers, and incompatible schemas. The evaluation agent tests whether an alignment agent can correctly standardize and map these datasets into a unified representation, a prerequisite for automation, reporting, and reliable downstream decision-making.

At scale, agents with this capability can operate across virtually any data-producing environment, enabling autonomous analytics, continuous monitoring, and decision support without bespoke data engineering, and fundamentally expanding where and how agentic systems can be deployed.

What This Benchmark Evaluates

This benchmark tests a Purple Agent's data integration capabilities using a simplified TPC-DI style task:

| Component         | Description                   | Points |
| ----------------- | ----------------------------- | ------ |
| Column Check      | Correct columns in output     | 20     |
| Row Count         | Correct number of output rows | 10     |
| Customer Coverage | All customers represented     | 15     |
| Numeric Accuracy  | Correct aggregation values    | 40     |
| String Accuracy   | Correct text fields           | 15     |

The benchmark uses a hybrid evaluation approach:

  1. Deterministic numerical scoring for accuracy metrics (0-100 points)
  2. LLM-powered qualitative feedback for insights and recommendations (via Gemini)
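As a hypothetical illustration of the deterministic half (the actual scoring logic lives in `src/agent.py`), the five component scores combine by a simple weighted sum, with the weights taken from the rubric table above:

```python
# Illustrative sketch only; weights mirror the rubric table above.
WEIGHTS = {
    "columns": 20,
    "row_count": 10,
    "customer_coverage": 15,
    "numeric_accuracy": 40,
    "string_accuracy": 15,
}

def total_score(component_fractions: dict) -> int:
    """component_fractions maps component name -> fraction correct in [0, 1]."""
    return round(sum(WEIGHTS[name] * frac
                     for name, frac in component_fractions.items()))

# A perfect run scores 100; partial numeric/string accuracy lowers it.
print(total_score({"columns": 1.0, "row_count": 1.0, "customer_coverage": 1.0,
                   "numeric_accuracy": 0.8, "string_accuracy": 2 / 3}))
# 20 + 10 + 15 + 32 + 10 = 87
```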

The Task

Purple Agents must:

  1. Fetch source data files via HTTP from the Green Agent:

    • customers_tpcdi_lite_v3.csv - Customer information
    • accounts_tpcdi_lite_v3.csv - Account data linked to customers
    • trades_tpcdi_lite_v3.csv - Trade transactions linked to accounts
  2. Join the tables correctly:

    • Accounts → Customers on customer_id
    • Trades → Accounts on account_id
  3. Filter to only completed trades (trade_status = 'CMPT')

  4. Aggregate per customer to produce:

    • customer_id - Customer identifier
    • customer_name - First + last name
    • country - Customer's country
    • num_accounts - Count of accounts
    • total_balance - Sum of account balances
    • num_trades - Count of completed trades
    • total_trade_volume - Sum of trade quantities
    • total_trade_value - Sum of (quantity × trade_price)
    • symbols_traded - Comma-separated unique stock symbols
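The join/filter/aggregate pipeline above can be sketched with in-memory SQLite (standard library); the inline rows are made-up sample data, and a real Purple Agent would operate on the CSVs fetched from the Green Agent instead:

```python
import sqlite3

# Minimal sketch of the required logic. Table and column names follow the
# task description; the tiny inline dataset is invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers(customer_id INTEGER, first_name TEXT, last_name TEXT, country TEXT);
CREATE TABLE accounts(account_id INTEGER, customer_id INTEGER, balance REAL);
CREATE TABLE trades(trade_id INTEGER, account_id INTEGER, trade_status TEXT,
                    quantity INTEGER, trade_price REAL, symbol TEXT);
INSERT INTO customers VALUES (1, 'John', 'Smith', 'USA');
INSERT INTO accounts VALUES (10, 1, 15000.50), (11, 1, 2000.00);
INSERT INTO trades VALUES
  (100, 10, 'CMPT', 50, 100.0, 'AAPL'),
  (101, 11, 'CMPT', 50, 150.0, 'GOOGL'),
  (102, 11, 'PNDG', 10,  90.0, 'MSFT');  -- not completed; must be excluded
""")

row = con.execute("""
SELECT c.customer_id,
       c.first_name || ' ' || c.last_name  AS customer_name,
       c.country,
       COUNT(DISTINCT a.account_id)        AS num_accounts,
       -- Subquery avoids counting a balance once per trade row after the join.
       (SELECT SUM(balance) FROM accounts WHERE customer_id = c.customer_id)
                                           AS total_balance,
       COUNT(t.trade_id)                   AS num_trades,
       SUM(t.quantity)                     AS total_trade_volume,
       SUM(t.quantity * t.trade_price)     AS total_trade_value,
       GROUP_CONCAT(DISTINCT t.symbol)     AS symbols_traded
FROM customers c
JOIN accounts a ON a.customer_id = c.customer_id
JOIN trades   t ON t.account_id  = a.account_id
WHERE t.trade_status = 'CMPT'
GROUP BY c.customer_id
""").fetchone()
print(row)
```

Note the subquery for `total_balance`: summing `a.balance` directly after the trades join would count each account's balance once per trade row.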

Quick Start

Prerequisites

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone <your-repo-url>
cd data-matchmaker-benchmark
uv sync

Running the Benchmark

Terminal 1 - Start the Green Agent:

uv run python src/server.py --host 127.0.0.1 --port 9009

The Green Agent exposes:

  • A2A endpoint: http://127.0.0.1:9009/ - For agent-to-agent communication
  • File listing: http://127.0.0.1:9009/files/ - Lists available data files
  • File download: http://127.0.0.1:9009/files/<filename> - Download individual files
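A Purple Agent might pull the three source files from these endpoints like so (minimal urllib sketch; the `fetch` helper assumes the Green Agent from Terminal 1 is running locally):

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Endpoint layout as listed above; adjust BASE_URL for your deployment.
BASE_URL = "http://127.0.0.1:9009/files/"
SOURCE_FILES = [
    "customers_tpcdi_lite_v3.csv",
    "accounts_tpcdi_lite_v3.csv",
    "trades_tpcdi_lite_v3.csv",
]

def file_url(name: str) -> str:
    """Build the download URL for one source file."""
    return urljoin(BASE_URL, name)

def fetch(name: str) -> str:
    """Download one file as text (requires the Green Agent to be running)."""
    with urlopen(file_url(name)) as resp:
        return resp.read().decode("utf-8")

print(file_url(SOURCE_FILES[0]))
# http://127.0.0.1:9009/files/customers_tpcdi_lite_v3.csv
```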

Terminal 2 - Start the Mock Purple Agent (for testing):

# Copy and configure your Gemini API key
cp sample.env .env
# Edit .env and add your GEMINI_API_KEY

uv run python src/mock_purple.py --host 127.0.0.1 --port 9010

Terminal 3 - Run an evaluation:

curl -s -X POST http://127.0.0.1:9009/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "message/send",
    "id": "1",
    "params": {
      "message": {
        "kind": "message",
        "role": "user",
        "parts": [{
          "kind": "text",
          "text": "{\"participants\": {\"data_integrator\": \"http://127.0.0.1:9010\"}, \"config\": {\"timeout\": 300}}"
        }],
        "messageId": "test"
      }
    }
  }' | python3 -m json.tool

Using the AgentBeats CLI

If you have the AgentBeats CLI installed:

uv run agentbeats-run scenario.toml

Project Structure

src/
├── server.py        # A2A server + file serving endpoints
├── main.py          # Alternative entry point with MCP server
├── agent.py         # Green agent: task generation + scoring + LLM feedback
├── executor.py      # A2A request handling
├── messenger.py     # A2A messaging utilities
├── mcp_server.py    # MCP resources for file access (optional)
└── mock_purple.py   # Baseline purple agent for testing

jan15_tasks/
├── customers_tpcdi_lite_v3.csv    # Customer data
├── accounts_tpcdi_lite_v3.csv     # Account data
├── trades_tpcdi_lite_v3.csv       # Trade transactions
├── prospect_tpcdi_lite_v3.xlsx    # Additional prospect data
├── finwire_tpcdi_lite_v3.xml      # Financial wire data
└── gold_ground_truth_tpcdi_lite_v3.csv  # Expected results (not exposed)

Input Format

The Green Agent expects an A2A assessment request with:

{
  "participants": {
    "data_integrator": "http://purple-agent-url:port"
  },
  "config": {
    "timeout": 300,
    "file_server_url": "http://green-agent-url:port"
  }
}

Output Format

The Green Agent returns a detailed evaluation:

{
  "score": 87,
  "max_score": 100,
  "details": {
    "columns": {"score": 20, "max": 20, "missing": [], "extra": []},
    "row_count": {"score": 10, "max": 10, "expected": 100, "submitted": 100},
    "customer_coverage": {"score": 15, "max": 15, "coverage_pct": 100.0},
    "numeric_accuracy": {"score": 32, "max": 40, "columns": {...}},
    "string_accuracy": {"score": 10, "max": 15, "fields": {...}}
  },
  "llm_feedback": {
    "enabled": true,
    "summary": "Good overall performance with minor aggregation issues...",
    "strengths": ["Correct table joins", "Complete customer coverage"],
    "weaknesses": ["Trade value calculation off by ~2%"],
    "recommendations": ["Verify trade_price handling"]
  },
  "purple_agent_url": "http://127.0.0.1:9010",
  "submitted_rows": 100,
  "expected_rows": 100
}
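A caller can sanity-check a returned evaluation by confirming that the component scores under `details` sum to the top-level deterministic score. The snippet below uses an abbreviated copy of the sample result above (the elided column/field maps are simply dropped):

```python
import json

# Abbreviated version of the sample evaluation shown above.
result = json.loads("""
{
  "score": 87,
  "max_score": 100,
  "details": {
    "columns": {"score": 20, "max": 20},
    "row_count": {"score": 10, "max": 10},
    "customer_coverage": {"score": 15, "max": 15},
    "numeric_accuracy": {"score": 32, "max": 40},
    "string_accuracy": {"score": 10, "max": 15}
  }
}
""")

# The deterministic score is the sum of its component parts.
component_total = sum(d["score"] for d in result["details"].values())
assert component_total == result["score"]
print(component_total)  # 87
```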

Purple Agent Requirements

Purple agents must:

  1. Accept a task message describing the data integration task
  2. Fetch data files via HTTP GET from the provided URLs
  3. Return CSV data with the expected columns

Example response format:

customer_id,customer_name,country,num_accounts,total_balance,num_trades,total_trade_volume,total_trade_value,symbols_traded
1,John Smith,USA,2,15000.50,5,100,12500.00,"AAPL,GOOGL"
2,Jane Doe,Canada,1,8000.00,3,50,4500.00,"MSFT"
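Before responding, a Purple Agent can validate its own output against the expected header; a minimal check with the standard `csv` module (the sample string reuses the first row above):

```python
import csv
import io

# Expected output columns, in order, from the task description.
EXPECTED_COLUMNS = [
    "customer_id", "customer_name", "country", "num_accounts", "total_balance",
    "num_trades", "total_trade_volume", "total_trade_value", "symbols_traded",
]

sample = (
    "customer_id,customer_name,country,num_accounts,total_balance,"
    "num_trades,total_trade_volume,total_trade_value,symbols_traded\n"
    '1,John Smith,USA,2,15000.50,5,100,12500.00,"AAPL,GOOGL"\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
# Header must match exactly: no missing, extra, or reordered columns.
assert list(rows[0].keys()) == EXPECTED_COLUMNS
# The quoted symbols field survives parsing as a single value.
print(rows[0]["symbols_traded"])  # AAPL,GOOGL
```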

Environment Variables

| Variable        | Description                         | Required |
| --------------- | ----------------------------------- | -------- |
| GEMINI_API_KEY  | API key for LLM feedback generation | Optional |
| FILE_SERVER_URL | Override file server URL            | Optional |
| MCP_SERVER_URL  | MCP server URL for main.py          | Optional |

Running with Docker

# Build the green agent
docker build -t tpcdi-evaluator .

# Run
docker run -p 9009:9009 -e GEMINI_API_KEY=your_key tpcdi-evaluator

Running Tests

uv sync --extra test
uv run pytest tests/

Competition Info

This is a submission for the AgentBeats Competition Phase 1 (Green Agent).

  • Track: Data Engineering
  • Benchmark Type: TPC-DI Style Data Integration
  • Skills Tested: Data fetching, table joining, filtering, aggregation, ETL

License

MIT

About

AgentBeats Green agent benchmark for evaluating schema and table merging
