An AI evaluation framework for assessing medical dialogue agents through realistic doctor-patient consultations using the GAA (Generative Adversarial Agents) system.
- 🏥 Medical Dialogue Evaluation - Evaluates doctor agents' communication and persuasion abilities
- 🧠 64 Patient Personas - 16 MBTI personality types × 2 medical conditions × 2 genders
- 📊 Multi-Dimensional Scoring - Real-time evaluation of empathy, persuasion, and patient safety
- 🔬 Information Asymmetry - Doctor receives only clinical data; patient personality and symptoms remain hidden
- ✅ Reproducible - Built on AgentBeats platform using A2A protocol
- Clone the repo

  ```bash
  git clone https://github.com/MadGAA-Lab/OSCE-Project.git
  cd OSCE-Project
  ```

- Install dependencies

  ```bash
  uv sync
  ```

- Set environment variables

  ```bash
  cp sample.env .env
  ```

  Add your API credentials to the `.env` file (supports OpenAI, Anthropic, Google Gemini, etc.)

- Run evaluation

  ```bash
  uv run agentbeats-run scenarios/medical_dialogue/scenario.toml
  ```

  Note: Use `--show-logs` to see agent outputs during the assessment, and `--serve-only` to start agents without running the assessment.
After running, you should see dialogue rounds and evaluation scores:
```
src/agentbeats/                  # Core A2A infrastructure
├─ green_executor.py             # Base green agent executor
├─ models.py                     # Pydantic models for agent IO
├─ client.py                     # A2A messaging helpers
└─ run_scenario.py               # Scenario runner

scenarios/medical_dialogue/      # Medical dialogue evaluation
├─ green_agents/
│  ├─ judge.py                   # Orchestrates doctor-patient dialogue
│  ├─ patient_agent.py           # Simulates patient with personality
│  ├─ patient_constructor.py     # Generates patient personas (MBTI)
│  ├─ per_round_scoring.py       # Evaluates empathy, persuasion, safety
│  └─ report_generator.py        # Creates performance reports
├─ purple_agents/
│  └─ doctor_agent.py            # Doctor agent being evaluated
├─ prompts/                      # MBTI traits & medical cases
└─ scenario.toml                 # Evaluation configuration
```
- 16 MBTI Types: INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, ESFP
- 2 Medical Cases: Pneumothorax, Lung Cancer
- 2 Genders: Male, Female (optional)
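The full persona space is the cross product of these three axes. A minimal sketch of enumerating it, assuming the `MBTI_GENDER_CASE` ID format shown elsewhere in this README (e.g. `INTJ_M_PNEUMO`); the helper name is hypothetical, not part of the project's API:

```python
# Enumerate all 64 persona IDs: 16 MBTI types x 2 genders x 2 medical cases.
# The "{MBTI}_{GENDER}_{CASE}" ID format follows the examples in this README.
from itertools import product

MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]
GENDERS = ["M", "F"]
CASES = ["PNEUMO", "LUNG"]  # Pneumothorax, Lung Cancer

def all_persona_ids() -> list[str]:
    """Return every MBTI x gender x case combination (16 * 2 * 2 = 64 IDs)."""
    return [f"{m}_{g}_{c}" for m, g, c in product(MBTI_TYPES, GENDERS, CASES)]
```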
- Doctor sends response to patient
- Patient generates personality-driven response
- Judge evaluates the round:
- Empathy Score (0-10)
- Persuasion Score (0-10)
- Safety Score (0-10)
- Stop Conditions: Patient left / accepted treatment / max rounds reached
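The round structure above can be sketched as a simple loop. This is a hedged illustration, not the project's implementation: `doctor_turn`, `patient_turn`, `score_round`, and `check_stop` are hypothetical stand-ins for the real agents and judge.

```python
# Sketch of the per-round dialogue loop: doctor responds, patient replies,
# the judge scores the round, then stop conditions are checked.
from typing import Callable

def run_dialogue(
    doctor_turn: Callable[[list[dict]], str],
    patient_turn: Callable[[list[dict]], str],
    score_round: Callable[[str, str], dict],
    check_stop: Callable[[str], str],  # -> "left" | "accepted" | "continue"
    max_rounds: int = 10,
) -> tuple[list[dict], str]:
    history: list[dict] = []
    for _ in range(max_rounds):
        doc_msg = doctor_turn(history)                            # doctor's turn
        pat_msg = patient_turn(history + [{"doctor": doc_msg}])   # patient's turn
        scores = score_round(doc_msg, pat_msg)                    # judge scores round
        history.append({"doctor": doc_msg, "patient": pat_msg, "scores": scores})
        verdict = check_stop(pat_msg)                             # stop conditions
        if verdict in ("left", "accepted"):
            return history, verdict
    return history, "max_rounds"
```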
Doctor receives:
- Age, gender (if specified)
- Diagnosis and recommended treatment
- Treatment risks, benefits, and prognosis
Doctor does NOT receive:
- Patient symptoms (must discover through dialogue)
- Patient personality traits (MBTI)
- Patient concerns and fears
- Patient behavioral patterns
This mirrors real medical practice where doctors must discover patient information through conversation.
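One way to picture this split: the full persona record holds everything, and only a clinical subset is forwarded to the doctor agent. The field names below are illustrative assumptions, not the project's actual schema:

```python
# Information asymmetry as a dictionary filter: the doctor sees only the
# whitelisted clinical fields; personality, symptoms, and concerns stay hidden.
DOCTOR_VISIBLE = {"age", "gender", "diagnosis", "treatment",
                  "risks", "benefits", "prognosis"}

def clinical_subset(persona: dict) -> dict:
    """Return only the fields the doctor agent is allowed to see."""
    return {k: v for k, v in persona.items() if k in DOCTOR_VISIBLE}
```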
The following diagram shows how the agent evaluation system works:
```mermaid
graph TB
    subgraph "AgentBeats Platform"
        Runner[Scenario Runner]
    end

    subgraph "Green Agents - Evaluation System"
        Judge[Judge Agent<br/>Central Orchestrator]
        subgraph "Patient Simulation"
            PersonaMgr[Persona Manager<br/>64 Personas]
            PatientConst[Patient Constructor<br/>Generate Personas]
            PatientAgent[Patient Agent<br/>MBTI-driven Behavior]
        end
        subgraph "Evaluation Components"
            PerRoundScore[Per-Round Scoring<br/>LLM-as-Judge]
            StopDetector[Stop Detector<br/>Termination Logic]
            ReportGen[Report Generator<br/>Final Analysis]
        end
        Criteria[(Criteria CSV<br/>30 Evaluation Criteria)]
    end

    subgraph "Purple Agent - Under Evaluation"
        Doctor[Doctor Agent<br/>Being Tested]
    end

    %% Initialization Flow
    Runner -->|1. Start Evaluation| Judge
    Judge -->|2. Get Persona| PersonaMgr
    PersonaMgr -->|3. Load Templates| PatientConst
    PatientConst -->|4. Generate Background| PatientAgent
    PatientConst -->|5. Clinical Info| Judge

    %% Round-based Dialogue Loop
    Judge -->|6. Clinical Context| Doctor
    Doctor -->|7. Doctor Response| Judge
    Judge -->|8. Doctor Message| PatientAgent
    PatientAgent -->|9. Patient Response| Judge

    %% Evaluation Flow
    Judge -->|10. Evaluate Round| PerRoundScore
    Criteria -->|Evaluation Criteria| PerRoundScore
    PerRoundScore -->|11. Scores<br/>Empathy/Persuasion/Safety| Judge
    Judge -->|12. Check Stop| StopDetector
    StopDetector -->|13. Continue/Stop| Judge

    %% Final Report
    Judge -->|14. Generate Report| ReportGen
    ReportGen -->|15. Final Analysis| Runner

    %% Styling
    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px
    classDef purple fill:#DDA0DD,stroke:#8B008B,stroke-width:2px
    classDef data fill:#87CEEB,stroke:#4682B4,stroke-width:2px
    classDef eval fill:#FFD700,stroke:#FF8C00,stroke-width:2px
    class Judge,PersonaMgr,PatientConst,PatientAgent green
    class Doctor purple
    class Criteria data
    class PerRoundScore,StopDetector,ReportGen eval
```
The system follows a multi-round evaluation process:
- Scenario Runner starts evaluation with persona configuration
- Judge Agent receives evaluation request with persona IDs and max rounds
- Persona Manager selects personas (e.g., INTJ_M_PNEUMO)
- Patient Constructor generates:
- Full patient background (age, symptoms, personality traits, concerns)
- Clinical info subset (diagnosis, treatment details) → sent to Doctor
- Character description (MBTI-driven behavior) → for Patient Agent
- Roleplay examples → for context priming
For each round (max 10 rounds):

- Judge sends clinical context to Doctor Agent:
  - Patient demographics (age, gender)
  - Diagnosis and recommended treatment
  - Risks, benefits, prognosis
  - Previous dialogue history
  - ⚠️ NOT included: patient personality, symptoms, concerns
- Doctor Agent generates a response attempting to:
  - Show empathy and build trust
  - Address patient concerns
  - Persuade the patient to accept treatment
  - Ensure safety and informed consent
- Patient Agent generates a personality-driven response:
  - Uses MBTI personality traits (hidden from the Doctor)
  - Responds naturally with concerns and emotions
  - May resist, question, or gradually accept treatment
- Per-Round Scoring Engine evaluates the round:
  - Uses 30 criteria from `judge_criteria.csv`
  - Categories: Empathy (1-10), Persuasion (11-20), Safety (21-30)
  - An LLM judges each criterion as met / not_met / not_relevant
  - Calculates scores: Empathy, Persuasion, Safety (0-10 each)
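One plausible way the per-criterion verdicts could collapse into a 0-10 category score is shown below. The verdict labels come from the description above; the exact aggregation formula is an assumption, not the project's documented method.

```python
# Aggregate criterion verdicts (met / not_met / not_relevant) into a 0-10
# category score: fraction of *relevant* criteria met, scaled to 10.
def category_score(verdicts: list[str]) -> float:
    """Score one category (Empathy, Persuasion, or Safety) from its verdicts."""
    relevant = [v for v in verdicts if v != "not_relevant"]
    if not relevant:
        return 0.0  # no applicable criteria this round
    return 10.0 * relevant.count("met") / len(relevant)
```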
- Stop Detector checks termination conditions:
  - Patient explicitly left/refused treatment
  - Patient accepted treatment
  - Max rounds reached
  - Uses an LLM to detect patient commitment/refusal signals
- Loop continues or stops based on the stop condition
- Report Generator creates comprehensive analysis:
  - Aggregate scores across all rounds (weighted 30/40/30)
  - Qualitative analysis: strengths, weaknesses, key moments
  - Improvement recommendations
  - Alternative approaches
  - Overall evaluation summary
- Results returned to Scenario Runner for multi-persona aggregation
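The 30/40/30 weighted aggregation can be sketched as follows. The README does not state which category carries which weight; assigning the 40% weight to Persuasion is an assumption for illustration.

```python
# Combine per-round category scores into one overall score using the
# 30/40/30 weighting mentioned above (weight assignment is an assumption).
WEIGHTS = {"empathy": 0.30, "persuasion": 0.40, "safety": 0.30}

def overall_score(round_scores: list[dict]) -> float:
    """Average each category across rounds, then apply the category weights."""
    n = len(round_scores)
    means = {k: sum(r[k] for r in round_scores) / n for k in WEIGHTS}
    return sum(WEIGHTS[k] * means[k] for k in WEIGHTS)
```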
The system creates realistic doctor-patient dynamics through information asymmetry:
| Information | Doctor Has | Patient Has | Judge Has |
|---|---|---|---|
| Patient Personality (MBTI) | ❌ | ✅ | ✅ |
| Patient Symptoms | ❌ | ✅ | ✅ |
| Patient Concerns/Fears | ❌ | ✅ | ✅ |
| Medical Diagnosis | ✅ | ✅ | ✅ |
| Treatment Details | ✅ | ✅ | ✅ |
| Dialogue History | ✅ | ✅ | ✅ |
| Evaluation Scores | ❌ | ❌ | ✅ |
This mirrors real medical consultations where doctors must discover patient information through conversation.
- Judge - Central orchestrator managing the entire evaluation lifecycle
- Persona Manager - Manages 64 patient personas (16 MBTI × 2 cases × 2 genders)
- Patient Constructor - Generates complete patient backgrounds from templates using LLM
- Patient Agent - Simulates patients with MBTI-driven personality-consistent behaviors
- Per-Round Scoring - LLM-as-judge evaluation using 30 criteria across 3 categories
- Stop Detector - LLM-based classification to detect dialogue termination conditions
- Report Generator - Creates comprehensive performance analysis with qualitative insights
- Doctor Agent - The AI agent being evaluated (example implementation provided in `purple_agents/doctor_agent.py`)
Edit `scenarios/medical_dialogue/scenario.toml` to customize the evaluation:

```toml
[config]
# Evaluate specific personas (pick one):
persona_ids = ["INTJ_M_PNEUMO"]                  # Single persona with gender
# persona_ids = ["INTJ_PNEUMO"]                  # Single persona, random gender
# persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG"] # Multiple specific personas
# persona_ids = ["all"]                          # All 64 personas
# persona_ids = ["random"]                       # Random persona each run

# Maximum dialogue rounds
max_rounds = 10

# Retry configuration for API calls
[config.retry]
patient_max_retries = 3
judge_max_retries = 5
```

For detailed configuration options, see `scenarios/medical_dialogue/README.md`.
Contributions are welcome! Areas of interest:
- Additional medical conditions and cases
- New patient personality models beyond MBTI
- Enhanced scoring metrics
- Multi-language support
- Performance optimizations
See LICENSE file for details.
Built on the AgentBeats platform for standardized agent evaluations using the A2A protocol.
If you use this leaderboard or the OSCE-Project framework in your research, please cite:
```bibtex
@software{osce_agentbeats_leaderboard,
  title  = {OSCE-AgentBeats Medical Dialogue Evaluation Leaderboard},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-AgentBeats-Leaderboard},
  note   = {Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations}
}

@software{osce_project,
  title  = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-Project},
  note   = {A GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities}
}
```