An AI evaluation framework for assessing medical dialogue agents through realistic doctor-patient consultations using the GAA (Generative Adversarial Agents) system.
- 🏥 Medical Dialogue Evaluation - Evaluates doctor agents' communication and persuasion abilities
- 🧠 64 Patient Personas - 16 MBTI personality types × 2 medical conditions × 2 genders
- 📊 Multi-Dimensional Scoring - Real-time evaluation of empathy, persuasion, and patient safety
- 🔬 Information Asymmetry - Doctor receives only clinical data; patient personality and symptoms remain hidden
- ✅ Reproducible - Built on AgentBeats platform using A2A protocol
- Clone the repo

  ```bash
  git clone https://github.com/MadGAA-Lab/OSCE-Project.git
  cd OSCE-Project
  ```

- Install dependencies

  ```bash
  uv sync
  ```

- Set environment variables

  ```bash
  cp sample.env .env
  ```

  Add your API credentials to the `.env` file (supports OpenAI, Anthropic, Google Gemini, etc.)

- Run evaluation

  ```bash
  uv run agentbeats-run scenarios/medical_dialogue/scenario.toml
  ```

  Note: Use `--show-logs` to see agent outputs during the assessment, and `--serve-only` to start agents without running the assessment.
After running, you should see dialogue rounds and evaluation scores:
```
src/agentbeats/                  # Core A2A infrastructure
├─ green_executor.py             # Base green agent executor
├─ models.py                     # Pydantic models for agent IO
├─ client.py                     # A2A messaging helpers
└─ run_scenario.py               # Scenario runner

scenarios/medical_dialogue/      # Medical dialogue evaluation
├─ green_agents/
│  ├─ judge.py                   # Orchestrates doctor-patient dialogue
│  ├─ patient_agent.py           # Simulates patient with personality
│  ├─ patient_constructor.py     # Generates patient personas (MBTI)
│  ├─ per_round_scoring.py       # Evaluates empathy, persuasion, safety
│  └─ report_generator.py        # Creates performance reports
├─ purple_agents/
│  └─ doctor_agent.py            # Doctor agent being evaluated
├─ prompts/                      # MBTI traits & medical cases
└─ scenario.toml                 # Evaluation configuration
```
- 16 MBTI Types: INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, ESFP
- 2 Medical Cases: Pneumothorax, Lung Cancer
- 2 Genders: Male, Female (optional)
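The full persona space is the cross product of these three axes. A minimal sketch of enumerating it, assuming the `MBTI_GENDER_CASE` ID format shown elsewhere in this README (e.g. `INTJ_M_PNEUMO`); the helper name is hypothetical, not part of the project's API:

```python
# Enumerate all 64 persona IDs: 16 MBTI types x 2 genders x 2 medical cases.
# The "{MBTI}_{GENDER}_{CASE}" ID format follows the examples in this README.
from itertools import product

MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]
GENDERS = ["M", "F"]
CASES = ["PNEUMO", "LUNG"]  # Pneumothorax, Lung Cancer

def all_persona_ids() -> list[str]:
    """Return every MBTI x gender x case combination (16 * 2 * 2 = 64 IDs)."""
    return [f"{m}_{g}_{c}" for m, g, c in product(MBTI_TYPES, GENDERS, CASES)]
```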
- Doctor sends response to patient
- Patient generates personality-driven response
- Judge evaluates the round:
- Empathy Score (0-10)
- Persuasion Score (0-10)
- Safety Score (0-10)
- Stop Conditions: Patient left / accepted treatment / max rounds reached
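The round structure above can be sketched as a simple loop. This is a hedged illustration, not the project's implementation: `doctor_turn`, `patient_turn`, `score_round`, and `check_stop` are hypothetical stand-ins for the real agents and judge.

```python
# Sketch of the per-round dialogue loop: doctor responds, patient replies,
# the judge scores the round, then stop conditions are checked.
from typing import Callable

def run_dialogue(
    doctor_turn: Callable[[list[dict]], str],
    patient_turn: Callable[[list[dict]], str],
    score_round: Callable[[str, str], dict],
    check_stop: Callable[[str], str],  # -> "left" | "accepted" | "continue"
    max_rounds: int = 10,
) -> tuple[list[dict], str]:
    history: list[dict] = []
    for _ in range(max_rounds):
        doc_msg = doctor_turn(history)                            # doctor's turn
        pat_msg = patient_turn(history + [{"doctor": doc_msg}])   # patient's turn
        scores = score_round(doc_msg, pat_msg)                    # judge scores round
        history.append({"doctor": doc_msg, "patient": pat_msg, "scores": scores})
        verdict = check_stop(pat_msg)                             # stop conditions
        if verdict in ("left", "accepted"):
            return history, verdict
    return history, "max_rounds"
```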
Doctor receives:
- Age, gender (if specified)
- Diagnosis and recommended treatment
- Treatment risks, benefits, and prognosis
Doctor does NOT receive:
- Patient symptoms (must discover through dialogue)
- Patient personality traits (MBTI)
- Patient concerns and fears
- Patient behavioral patterns
This mirrors real medical practice where doctors must discover patient information through conversation.
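One way to picture this split: the full persona record holds everything, and only a clinical subset is forwarded to the doctor agent. The field names below are illustrative assumptions, not the project's actual schema:

```python
# Information asymmetry as a dictionary filter: the doctor sees only the
# whitelisted clinical fields; personality, symptoms, and concerns stay hidden.
DOCTOR_VISIBLE = {"age", "gender", "diagnosis", "treatment",
                  "risks", "benefits", "prognosis"}

def clinical_subset(persona: dict) -> dict:
    """Return only the fields the doctor agent is allowed to see."""
    return {k: v for k, v in persona.items() if k in DOCTOR_VISIBLE}
```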
The following diagram shows how the agent evaluation system works:
```mermaid
graph TB
    subgraph "AgentBeats Platform"
        Runner[Scenario Runner]
    end

    subgraph "Green Agents - Evaluation System"
        Judge[Judge Agent<br/>Central Orchestrator]
        subgraph "Patient Simulation"
            PersonaMgr[Persona Manager<br/>64 Personas]
            PatientConst[Patient Constructor<br/>Generate Personas]
            PatientAgent[Patient Agent<br/>MBTI-driven Behavior]
        end
        subgraph "Evaluation Components"
            PerRoundScore[Per-Round Scoring<br/>LLM-as-Judge]
            StopDetector[Stop Detector<br/>Termination Logic]
            ReportGen[Report Generator<br/>Final Analysis]
        end
        Criteria[(Criteria CSV<br/>30 Evaluation Criteria)]
    end

    subgraph "Purple Agent - Under Evaluation"
        Doctor[Doctor Agent<br/>Being Tested]
    end

    %% Initialization Flow
    Runner -->|1. Start Evaluation| Judge
    Judge -->|2. Get Persona| PersonaMgr
    PersonaMgr -->|3. Load Templates| PatientConst
    PatientConst -->|4. Generate Background| PatientAgent
    PatientConst -->|5. Clinical Info| Judge

    %% Round-based Dialogue Loop
    Judge -->|6. Clinical Context| Doctor
    Doctor -->|7. Doctor Response| Judge
    Judge -->|8. Doctor Message| PatientAgent
    PatientAgent -->|9. Patient Response| Judge

    %% Evaluation Flow
    Judge -->|10. Evaluate Round| PerRoundScore
    Criteria -->|Evaluation Criteria| PerRoundScore
    PerRoundScore -->|11. Scores<br/>Empathy/Persuasion/Safety| Judge
    Judge -->|12. Check Stop| StopDetector
    StopDetector -->|13. Continue/Stop| Judge

    %% Final Report
    Judge -->|14. Generate Report| ReportGen
    ReportGen -->|15. Final Analysis| Runner

    %% Styling
    classDef green fill:#90EE90,stroke:#228B22,stroke-width:2px
    classDef purple fill:#DDA0DD,stroke:#8B008B,stroke-width:2px
    classDef data fill:#87CEEB,stroke:#4682B4,stroke-width:2px
    classDef eval fill:#FFD700,stroke:#FF8C00,stroke-width:2px
    class Judge,PersonaMgr,PatientConst,PatientAgent green
    class Doctor purple
    class Criteria data
    class PerRoundScore,StopDetector,ReportGen eval
```
The system follows a multi-round evaluation process:
- Scenario Runner starts evaluation with persona configuration
- Judge Agent receives evaluation request with persona IDs and max rounds
- Persona Manager selects personas (e.g., INTJ_M_PNEUMO)
- Patient Constructor generates:
- Full patient background (age, symptoms, personality traits, concerns)
- Clinical info subset (diagnosis, treatment details) → sent to Doctor
- Character description (MBTI-driven behavior) → for Patient Agent
- Roleplay examples → for context priming
For each round (max 10 rounds):

- Judge sends clinical context to Doctor Agent:
  - Patient demographics (age, gender)
  - Diagnosis and recommended treatment
  - Risks, benefits, prognosis
  - Previous dialogue history
  - ⚠️ NOT included: patient personality, symptoms, concerns
- Doctor Agent generates a response attempting to:
  - Show empathy and build trust
  - Address patient concerns
  - Persuade the patient to accept treatment
  - Ensure safety and informed consent
- Patient Agent generates a personality-driven response:
  - Uses MBTI personality traits (hidden from the Doctor)
  - Responds naturally with concerns and emotions
  - May resist, question, or gradually accept treatment
- Per-Round Scoring Engine evaluates the round:
  - Uses 30 criteria from `judge_criteria.csv`
  - Categories: Empathy (1-10), Persuasion (11-20), Safety (21-30)
  - An LLM judges each criterion as met / not_met / not_relevant
  - Calculates scores: Empathy, Persuasion, Safety (0-10 each)
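One plausible way the per-criterion verdicts could collapse into a 0-10 category score is shown below. The verdict labels come from the description above; the exact aggregation formula is an assumption, not the project's documented method.

```python
# Aggregate criterion verdicts (met / not_met / not_relevant) into a 0-10
# category score: fraction of *relevant* criteria met, scaled to 10.
def category_score(verdicts: list[str]) -> float:
    """Score one category (Empathy, Persuasion, or Safety) from its verdicts."""
    relevant = [v for v in verdicts if v != "not_relevant"]
    if not relevant:
        return 0.0  # no applicable criteria this round
    return 10.0 * relevant.count("met") / len(relevant)
```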
- Stop Detector checks termination conditions:
  - Patient explicitly left/refused treatment
  - Patient accepted treatment
  - Max rounds reached
  - Uses an LLM to detect patient commitment/refusal signals
- Loop continues or stops based on the stop condition
- Report Generator creates comprehensive analysis:
  - Aggregate scores across all rounds (weighted 30/40/30)
  - Qualitative analysis: strengths, weaknesses, key moments
  - Improvement recommendations
  - Alternative approaches
  - Overall evaluation summary
- Results returned to Scenario Runner for multi-persona aggregation
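The 30/40/30 weighted aggregation can be sketched as follows. The README does not state which category carries which weight; assigning the 40% weight to Persuasion is an assumption for illustration.

```python
# Combine per-round category scores into one overall score using the
# 30/40/30 weighting mentioned above (weight assignment is an assumption).
WEIGHTS = {"empathy": 0.30, "persuasion": 0.40, "safety": 0.30}

def overall_score(round_scores: list[dict]) -> float:
    """Average each category across rounds, then apply the category weights."""
    n = len(round_scores)
    means = {k: sum(r[k] for r in round_scores) / n for k in WEIGHTS}
    return sum(WEIGHTS[k] * means[k] for k in WEIGHTS)
```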
The system creates realistic doctor-patient dynamics through information asymmetry:
| Information | Doctor Has | Patient Has | Judge Has |
|---|---|---|---|
| Patient Personality (MBTI) | ❌ | ✅ | ✅ |
| Patient Symptoms | ❌ | ✅ | ✅ |
| Patient Concerns/Fears | ❌ | ✅ | ✅ |
| Medical Diagnosis | ✅ | ✅ | ✅ |
| Treatment Details | ✅ | ✅ | ✅ |
| Dialogue History | ✅ | ✅ | ✅ |
| Evaluation Scores | ❌ | ❌ | ✅ |
This mirrors real medical consultations where doctors must discover patient information through conversation.
- Judge - Central orchestrator managing the entire evaluation lifecycle
- Persona Manager - Manages 64 patient personas (16 MBTI × 2 cases × 2 genders)
- Patient Constructor - Generates complete patient backgrounds from templates using LLM
- Patient Agent - Simulates patients with MBTI-driven personality-consistent behaviors
- Per-Round Scoring - LLM-as-judge evaluation using 30 criteria across 3 categories
- Stop Detector - LLM-based classification to detect dialogue termination conditions
- Report Generator - Creates comprehensive performance analysis with qualitative insights
- Doctor Agent - The AI agent being evaluated (example implementation provided in `purple_agents/doctor_agent.py`)
Edit `scenarios/medical_dialogue/scenario.toml` to customize the evaluation:

```toml
[config]
# Evaluate specific personas (pick one):
persona_ids = ["INTJ_M_PNEUMO"]                  # Single persona with gender
# persona_ids = ["INTJ_PNEUMO"]                  # Single persona, random gender
# persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG"] # Multiple specific personas
# persona_ids = ["all"]                          # All 64 personas
# persona_ids = ["random"]                       # Random persona each run

# Maximum dialogue rounds
max_rounds = 10

# Retry configuration for API calls
[config.retry]
patient_max_retries = 3
judge_max_retries = 5
```

For detailed configuration options, see `scenarios/medical_dialogue/README.md`.
Contributions are welcome! Areas of interest:
- Additional medical conditions and cases
- New patient personality models beyond MBTI
- Enhanced scoring metrics
- Multi-language support
- Performance optimizations
See LICENSE file for details.
Built on the AgentBeats platform for standardized agent evaluations using the A2A protocol.
If you use this leaderboard or the OSCE-Project framework in your research, please cite:
```bibtex
@software{osce_agentbeats_leaderboard,
  title  = {OSCE-AgentBeats Medical Dialogue Evaluation Leaderboard},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-AgentBeats-Leaderboard},
  note   = {Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations}
}

@software{osce_project,
  title  = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-Project},
  note   = {A GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities}
}
```