A Green Agent for the AgentBeats Competition that evaluates Purple Agents on their ability to integrate data from multiple sources, join tables correctly, and compute accurate per-customer aggregations.
--
How can an agent reason over multiple datasets if it cannot first determine when differently named fields and records refer to the same underlying entity? This project introduces an evaluation agent that measures how effectively a candidate agent can join related datasets with non-identical keys, inconsistent schemas, and noisy identifiers by using contextual signals and structured inference rather than shared primary keys. The framework evaluates the accuracy, coverage, and coherence of inferred mappings, providing a benchmark for an agent’s ability to reconcile data generated by separate processes.
For example, in an enterprise context, accounting ledgers, payroll tables, CRM exports, and operational logs often describe the same customers, employees, or transactions using different naming conventions, partial identifiers, and incompatible schemas. The evaluation agent tests whether an alignment agent can correctly standardize and map these datasets into a unified representation, a prerequisite for automation, reporting, and reliable downstream decision-making.
At scale, agents with this capability could operate across heterogeneous data-producing environments, supporting autonomous analytics, continuous monitoring, and decision support with far less bespoke data engineering, and broadening where agentic systems can be deployed.
This benchmark tests a Purple Agent's data integration capabilities using a simplified TPC-DI style task:
| Component | Description | Points |
|---|---|---|
| Column Check | Correct columns in output | 20 |
| Row Count | Correct number of output rows | 10 |
| Customer Coverage | All customers represented | 15 |
| Numeric Accuracy | Correct aggregation values | 40 |
| String Accuracy | Correct text fields | 15 |
The benchmark uses a hybrid evaluation approach:
- Deterministic numerical scoring for accuracy metrics (0-100 points)
- LLM-powered qualitative feedback for insights and recommendations (via Gemini)
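The deterministic part of the score can be read as a sum over the five components in the table above. The following is a minimal sketch of that arithmetic only. The `RUBRIC` keys mirror the `details` keys in the sample evaluation output, and `total_score` is a hypothetical helper for illustration, not the scoring code in `agent.py`.

```python
# Maximum points per component, taken from the scoring table above.
RUBRIC = {
    "columns": 20,
    "row_count": 10,
    "customer_coverage": 15,
    "numeric_accuracy": 40,
    "string_accuracy": 15,
}


def total_score(component_scores: dict) -> float:
    """Sum component scores, clamping each to [0, max] (hypothetical helper)."""
    total = 0.0
    for name, max_points in RUBRIC.items():
        earned = component_scores.get(name, 0.0)
        total += min(max(earned, 0.0), max_points)
    return total


# Component scores copied from the sample evaluation output in this README.
example = {
    "columns": 20,
    "row_count": 10,
    "customer_coverage": 15,
    "numeric_accuracy": 32,
    "string_accuracy": 10,
}
print(total_score(example))  # 87.0
```

How partial credit is assigned inside each component is not specified here, so the sketch takes the component scores as given.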
Purple Agents must:
- Fetch source data files via HTTP from the Green Agent:
  - `customers_tpcdi_lite_v3.csv` - Customer information
  - `accounts_tpcdi_lite_v3.csv` - Account data linked to customers
  - `trades_tpcdi_lite_v3.csv` - Trade transactions linked to accounts
- Join the tables correctly:
  - Accounts → Customers on `customer_id`
  - Trades → Accounts on `account_id`
- Filter to only completed trades (`trade_status = 'CMPT'`)
- Aggregate per customer to produce:
  - `customer_id` - Customer identifier
  - `customer_name` - First + last name
  - `country` - Customer's country
  - `num_accounts` - Count of accounts
  - `total_balance` - Sum of account balances
  - `num_trades` - Count of completed trades
  - `total_trade_volume` - Sum of trade quantities
  - `total_trade_value` - Sum of (quantity × trade_price)
  - `symbols_traded` - Comma-separated unique stock symbols
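The join, filter, and aggregate steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the mock agent's implementation: the input column names `first_name`, `last_name`, `balance`, `quantity`, and `symbol` are guesses at the CSV headers, and keeping customers without completed trades (with zero totals) is an assumption motivated by the Customer Coverage component.

```python
from collections import defaultdict


def aggregate_per_customer(customers, accounts, trades):
    """Join, filter, and aggregate rows given as lists of dicts (e.g. csv.DictReader)."""
    # Accounts -> Customers on customer_id
    account_owner = {a["account_id"]: a["customer_id"] for a in accounts}

    stats = defaultdict(lambda: {
        "num_accounts": 0, "total_balance": 0.0, "num_trades": 0,
        "total_trade_volume": 0.0, "total_trade_value": 0.0, "symbols": set(),
    })
    for a in accounts:
        s = stats[a["customer_id"]]
        s["num_accounts"] += 1
        s["total_balance"] += float(a["balance"])  # "balance" is an assumed column

    # Trades -> Accounts on account_id, keeping only completed trades
    for t in trades:
        if t["trade_status"] != "CMPT":
            continue
        owner = account_owner.get(t["account_id"])
        if owner is None:
            continue  # trade references an unknown account
        s = stats[owner]
        qty = float(t["quantity"])  # "quantity" is an assumed column
        s["num_trades"] += 1
        s["total_trade_volume"] += qty
        s["total_trade_value"] += qty * float(t["trade_price"])
        s["symbols"].add(t["symbol"])  # "symbol" is an assumed column

    rows = []
    for c in customers:  # assumption: every customer appears, even with no trades
        s = stats[c["customer_id"]]
        rows.append({
            "customer_id": c["customer_id"],
            "customer_name": f'{c["first_name"]} {c["last_name"]}',
            "country": c["country"],
            "num_accounts": s["num_accounts"],
            "total_balance": s["total_balance"],
            "num_trades": s["num_trades"],
            "total_trade_volume": s["total_trade_volume"],
            "total_trade_value": s["total_trade_value"],
            # Sorted order is an assumption; the task only asks for unique symbols.
            "symbols_traded": ",".join(sorted(s["symbols"])),
        })
    return rows
```

Each argument can be produced with `list(csv.DictReader(...))` over the corresponding downloaded file. The returned rows use the output column names listed above.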
```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone <your-repo-url>
cd data-matchmaker-benchmark
uv sync
```

Terminal 1 - Start the Green Agent:

```bash
uv run python src/server.py --host 127.0.0.1 --port 9009
```

The Green Agent exposes:
- A2A endpoint: `http://127.0.0.1:9009/` - For agent-to-agent communication
- File listing: `http://127.0.0.1:9009/files/` - Lists available data files
- File download: `http://127.0.0.1:9009/files/{filename}` - Download individual files

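A Purple Agent could download the source files from the file endpoint with nothing more than `urllib`. The sketch below assumes the files are UTF-8 text. `file_url` and `fetch_text` are hypothetical helper names, not functions from this repository.

```python
from urllib.parse import quote
from urllib.request import urlopen

# File names come from the task description in this README.
SOURCE_FILES = [
    "customers_tpcdi_lite_v3.csv",
    "accounts_tpcdi_lite_v3.csv",
    "trades_tpcdi_lite_v3.csv",
]


def file_url(base_url: str, filename: str) -> str:
    """Build the download URL for one file under the /files/ endpoint."""
    return f"{base_url.rstrip('/')}/files/{quote(filename)}"


def fetch_text(base_url: str, filename: str, timeout: float = 30.0) -> str:
    """Download one file and decode it as UTF-8 (the encoding is an assumption)."""
    with urlopen(file_url(base_url, filename), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the Green Agent running locally, `fetch_text("http://127.0.0.1:9009", SOURCE_FILES[0])` would return the customers file as a string.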
Terminal 2 - Start the Mock Purple Agent (for testing):
```bash
# Copy and configure your Gemini API key
cp sample.env .env
# Edit .env and add your GEMINI_API_KEY

uv run python src/mock_purple.py --host 127.0.0.1 --port 9010
```

Terminal 3 - Run an evaluation:
```bash
curl -s -X POST http://127.0.0.1:9009/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "message/send",
    "id": "1",
    "params": {
      "message": {
        "kind": "message",
        "role": "user",
        "parts": [{
          "kind": "text",
          "text": "{\"participants\": {\"data_integrator\": \"http://127.0.0.1:9010\"}, \"config\": {\"timeout\": 300}}"
        }],
        "messageId": "test"
      }
    }
  }' | python3 -m json.tool
```

If you have the AgentBeats CLI installed:
```bash
uv run agentbeats-run scenario.toml
```

```text
src/
├── server.py       # A2A server + file serving endpoints
├── main.py         # Alternative entry point with MCP server
├── agent.py        # Green agent: task generation + scoring + LLM feedback
├── executor.py     # A2A request handling
├── messenger.py    # A2A messaging utilities
├── mcp_server.py   # MCP resources for file access (optional)
└── mock_purple.py  # Baseline purple agent for testing
jan15_tasks/
├── customers_tpcdi_lite_v3.csv          # Customer data
├── accounts_tpcdi_lite_v3.csv           # Account data
├── trades_tpcdi_lite_v3.csv             # Trade transactions
├── prospect_tpcdi_lite_v3.xlsx          # Additional prospect data
├── finwire_tpcdi_lite_v3.xml            # Financial wire data
└── gold_ground_truth_tpcdi_lite_v3.csv  # Expected results (not exposed)
```
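The evaluation request sent with `curl` earlier nests a JSON-encoded assessment inside the text part of an A2A `message/send` call, which is easy to get wrong when escaping by hand. A minimal sketch of building the same body in Python follows. `build_assessment_request` is a hypothetical helper, and the field values are copied from the `curl` example.

```python
import json


def build_assessment_request(purple_url: str, timeout: int = 300,
                             message_id: str = "test") -> dict:
    """Build the JSON-RPC body used by the curl example (hypothetical helper)."""
    assessment = {
        "participants": {"data_integrator": purple_url},
        "config": {"timeout": timeout},
    }
    return {
        "jsonrpc": "2.0",
        "method": "message/send",
        "id": "1",
        "params": {
            "message": {
                "kind": "message",
                "role": "user",
                # The assessment is serialized to a string inside the text part.
                "parts": [{"kind": "text", "text": json.dumps(assessment)}],
                "messageId": message_id,
            }
        },
    }


body = build_assessment_request("http://127.0.0.1:9010")
print(json.dumps(body, indent=2))
```

Letting `json.dumps` produce the inner string avoids the manual `\"` escaping needed in the shell version.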
The Green Agent expects an A2A assessment request with:
```json
{
  "participants": {
    "data_integrator": "http://purple-agent-url:port"
  },
  "config": {
    "timeout": 300,
    "file_server_url": "http://green-agent-url:port"
  }
}
```

The Green Agent returns a detailed evaluation:
```json
{
  "score": 87,
  "max_score": 100,
  "details": {
    "columns": {"score": 20, "max": 20, "missing": [], "extra": []},
    "row_count": {"score": 10, "max": 10, "expected": 100, "submitted": 100},
    "customer_coverage": {"score": 15, "max": 15, "coverage_pct": 100.0},
    "numeric_accuracy": {"score": 32, "max": 40, "columns": {...}},
    "string_accuracy": {"score": 10, "max": 15, "fields": {...}}
  },
  "llm_feedback": {
    "enabled": true,
    "summary": "Good overall performance with minor aggregation issues...",
    "strengths": ["Correct table joins", "Complete customer coverage"],
    "weaknesses": ["Trade value calculation off by ~2%"],
    "recommendations": ["Verify trade_price handling"]
  },
  "purple_agent_url": "http://127.0.0.1:9010",
  "submitted_rows": 100,
  "expected_rows": 100
}
```

Purple agents must:
- Accept a task message describing the data integration task
- Fetch data files via HTTP GET from the provided URLs
- Return CSV data with the expected columns
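When returning the CSV, fields such as `symbols_traded` contain commas and need quoting. A minimal sketch using the standard `csv` module is shown here. `rows_to_csv` is a hypothetical helper, and the column order follows the expected output columns in this README.

```python
import csv
import io

COLUMNS = [
    "customer_id", "customer_name", "country", "num_accounts", "total_balance",
    "num_trades", "total_trade_volume", "total_trade_value", "symbols_traded",
]


def rows_to_csv(rows: list) -> str:
    """Serialize result rows (dicts keyed by COLUMNS) to CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buffer.getvalue()


row = {
    "customer_id": 1, "customer_name": "John Smith", "country": "USA",
    "num_accounts": 2, "total_balance": "15000.50", "num_trades": 5,
    "total_trade_volume": 100, "total_trade_value": "12500.00",
    "symbols_traded": "AAPL,GOOGL",
}
print(rows_to_csv([row]))
```

Monetary values are passed as preformatted strings in this example so that trailing zeros survive; whether the evaluator cares about that formatting is not stated here.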
Example response format:
```csv
customer_id,customer_name,country,num_accounts,total_balance,num_trades,total_trade_volume,total_trade_value,symbols_traded
1,John Smith,USA,2,15000.50,5,100,12500.00,"AAPL,GOOGL"
2,Jane Doe,Canada,1,8000.00,3,50,4500.00,"MSFT"
```

| Variable | Description | Required |
|---|---|---|
| `GEMINI_API_KEY` | API key for LLM feedback generation | Optional |
| `FILE_SERVER_URL` | Override file server URL | Optional |
| `MCP_SERVER_URL` | MCP server URL for `main.py` | Optional |

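Since all three variables are optional, code that reads them should tolerate their absence. The sketch below shows one way to do that. `load_settings` is a hypothetical helper, and treating a missing `GEMINI_API_KEY` as "LLM feedback disabled" is an assumption based on the `enabled` flag in the sample evaluation output, not confirmed behavior.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read the optional variables from the table above (hypothetical helper)."""
    api_key = env.get("GEMINI_API_KEY")
    return {
        "gemini_api_key": api_key,
        # Assumption: LLM feedback is skipped when no key is configured.
        "llm_feedback_enabled": bool(api_key),
        "file_server_url": env.get("FILE_SERVER_URL"),
        "mcp_server_url": env.get("MCP_SERVER_URL"),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests.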
```bash
# Build the green agent
docker build -t tpcdi-evaluator .

# Run
docker run -p 9009:9009 -e GEMINI_API_KEY=your_key tpcdi-evaluator
```

```bash
# Install the test dependencies and run the test suite
uv sync --extra test
uv run pytest tests/
```

This is a submission for the AgentBeats Competition Phase 1 (Green Agent).
- Track: Data Engineering
- Benchmark Type: TPC-DI Style Data Integration
- Skills Tested: Data fetching, table joining, filtering, aggregation, ETL
MIT