---
title: TalkingHeadBench
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: mit
short_description: Talking-head LoRA diagnostic reasoning benchmark
tags:
---
An open-source diagnostic reasoning benchmark for evaluating AI agents on talking-head video LoRA pipelines.
TalkingHeadBench challenges AI agents to act as senior engineers who audit and optimize talking-head video LoRA pipelines, identifying failure modes in reference images, training datasets, and final model weights before a single frame is ever rendered.
The benchmark focuses on diagnostic reasoning, not generative performance. All signals are pre-extracted (face occupancy ratios, yaw/pitch degrees, landmark stability scores, canonical SVD weight components), making episodes run in seconds without GPU inference.
The benchmark is organized into 3 audit sub-environments spanning 9 deterministic nodes, with mode-based execution:
```text
Audit Tasks
├── Sub-env 1: Reference Image and Prompt Audit
│ ├── Node 1: Image Diagnostician
│ ├── Node 2: Parameter Anomaly Detector
│ └── Node 3: Grader
│
├── Sub-env 2: Dataset Clip Health Audit
│ ├── Node 4: Clip Signal Extractor
│ ├── Node 5: Disposition Classifier
│ └── Node 6: Grader
│
└── Sub-env 3: Trained LoRA Weight Behavioral Audit
    ├── Node 7: Weight Signal Extractor
    ├── Node 8: Phoneme Risk Assessor
    └── Node 9: Behavioral Audit Grader
```
- Sub-env 1 is standalone (no downstream dependency).
- Sub-env 2 can optionally feed suspected anomalous phonemes into Sub-env 3.
| Mode | Flow | Reward |
|---|---|---|
| `image` | Node 1 → Node 2 → done | `subenv1_score` |
| `clips` | Node 5 → done | `subenv2_score` |
| `weights` | Node 8 → done | `subenv3_score` |
| `clips_and_weights` | Node 5 → Node 8 → done | blend of `subenv2_score` and `subenv3_score` |
See REWARD_LOGIC.md for grader-dimension scoring breakdowns.
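For the `clips_and_weights` mode, the episode reward blends the two sub-env scores. The snippet below assumes an equal-weight blend purely for illustration; the actual weighting is defined in REWARD_LOGIC.md and the grader nodes.

```python
# Illustration only: equal weights are an assumption, not the benchmark's
# actual formula; see REWARD_LOGIC.md for the real blend.
def blended_reward(subenv2_score: float, subenv3_score: float,
                   w_clips: float = 0.5, w_weights: float = 0.5) -> float:
    return w_clips * subenv2_score + w_weights * subenv3_score
```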
```text
TalkingHeadBench/
├── src/
│ ├── pipeline.py # Episode orchestrator (run_episode_from_bundle)
│ ├── evaluate.py # CLI evaluation harness (dry-run + scoring)
│ ├── envs/
│ │ ├── subenv1/
│ │ │ ├── node1_image_diagnostician.py
│ │ │ ├── node2_param_anomaly.py
│ │ │ └── node3_grader.py
│ │ ├── subenv2/
│ │ │ ├── node4_clip_extractor.py
│ │ │ ├── node5_disposition.py
│ │ │ └── node6_grader.py
│ │ └── subenv3/
│ │ ├── node7_weight_extractor.py
│ │ ├── node8_phoneme_risk.py
│ │ └── node9_grader.py
│ ├── schemas/
│ │ ├── subenv1.py # Pydantic models: ImageDiagnosticsObservation, etc.
│ │ ├── subenv2.py # Pydantic models: ClipSignalObservation, etc.
│ │ ├── subenv3.py # Pydantic models: WeightSignalObservation, etc.
│ │ └── ground_truth.py # GroundTruth schema for all sub-envs
│ └── utils/
│ ├── canonical.py # Canonical SVD + weight decomposition utilities
│ └── grader_utils.py # Shared scoring helpers (F1, NDCG, recall)
│
├── server/
│ ├── app.py # FastAPI app (OpenEnv-compliant /reset, /step)
│ ├── talking_head_environment.py # Gymnasium-style environment wrapper
│ ├── Dockerfile # Container definition
│ └── requirements.txt # Server-side dependencies
│
├── tests/
│ ├── unit/ # Unit tests for individual nodes and schemas
│ │ ├── test_node4_extractor.py
│ │ ├── test_node7_extractor.py
│ │ ├── test_canonical.py
│ │ ├── test_graders.py
│ │ ├── test_schemas.py
│ │ ├── test_subenv1.py
│ │ ├── test_subenv2.py
│ │ └── test_subenv3.py
│ └── smoke/ # Integration & boundary tests
│ ├── test_pipeline_e2e.py
│ ├── test_pipeline_bundle.py
│ ├── test_schema_roundtrip.py
│ ├── test_grader_arithmetic.py
│ ├── test_node1_boundaries.py
│ ├── test_node2_boundaries.py
│ ├── test_node5_boundaries.py
│ ├── test_node7_deep.py
│ ├── test_node8_boundaries.py
│ ├── test_evaluate_cli.py
│ └── test_validate_annotations_cli.py
│
├── scripts/
│ ├── extract_subenv1_signals.py # Signal extraction for Sub-env 1
│ ├── extract_subenv2_signals.py # Signal extraction for Sub-env 2
│ ├── extract_subenv3_signals.py # Signal extraction for Sub-env 3
│ ├── generate_annotation_worksheet.py
│ ├── validate_annotations.py
│ ├── convert_captions.py
│ └── export_test_set.py
│
├── docs/
│ ├── PROJECT_OVERVIEW.md
│ ├── OPENENV_INTEGRATION_GUIDE.md
│ ├── CODEBASE_REVIEW.md
│ └── annotation_worksheet_subenv{1,2,3}.md
│
├── client.py # OpenEnv client helper
├── openenv.yaml # OpenEnv manifest (runtime: fastapi, port: 8000)
├── pyproject.toml # Package config (openenv-talking-head-bench v1.0.0)
├── requirements.txt # Top-level dependencies
├── REWARD_LOGIC.md # Detailed scoring documentation
└── LICENSE                              # MIT
```
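The `src/utils/grader_utils.py` entry above lists shared F1 / NDCG / recall helpers. As a rough illustration of what set-based recall and pair-level F1 look like in this setting (function names and signatures here are assumptions, not the repository's actual API):

```python
# Hypothetical sketch of set-based scoring helpers in the spirit of
# grader_utils.py; names and signatures are assumptions.
def set_recall(predicted: set, reference: set) -> float:
    """Fraction of reference items recovered by the prediction."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)

def pair_f1(predicted: set, reference: set) -> float:
    """F1 over (phoneme, behavior)-style pairs."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)
```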
- Python 3.10+
- pip or uv
```bash
git clone https://github.com/22elix3r/TalkingHeadBench.git
cd TalkingHeadBench

# Standard pip
pip install -r requirements.txt

# Or install as a package (recommended for OpenEnv usage)
pip install -e ".[dev]"
```

```python
from src.pipeline import run_episode_from_bundle, EpisodeResult

bundle = {
    "reference_image_obs": {
        # ImageDiagnosticsObservation fields
        "face_occupancy_ratio": 0.42,
        "yaw_degrees": 28.5,
        "pitch_degrees": -4.1,
        "landmark_stability_score": 0.81,
        # ...
    },
    "param_config": {
        "cfg": 5.5,
        "denoise_alt": 0.5,
        "eta": 0.08
    },
    "clip_signal_obs_list": [
        # list of ClipSignalObservation dicts
    ],
    "weight_obs": {
        # WeightSignalObservation fields
    },
    "ground_truths": {
        # ground truth annotations for all sub-envs
    },
}
result: EpisodeResult = run_episode_from_bundle(bundle)
print(f"Final score: {result.final_score:.3f}")
print(f" Sub-env 1: {result.subenv1_score:.3f}")
print(f" Sub-env 2: {result.subenv2_score:.3f}")
print(f" Sub-env 3: {result.subenv3_score:.3f}")# Dry-run (schema validation only)
python -m src.evaluate --dry-run --test-set tests/test_set/
# Full scoring run
python -m src.evaluate --test-set tests/test_set/ --verboseTalkingHeadBench is packaged as an OpenEnv-compliant environment with a Gymnasium-style reset / step API served over FastAPI.
```bash
pip install "openenv-core[core]>=0.2.2"
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

```bash
docker build -t talking-head-bench -f server/Dockerfile .
docker run -p 8000:8000 talking-head-bench
```

When this server is deployed publicly, keep custom provider URLs disabled unless you explicitly need them.
Set these Space variables:
- `THB_ALLOW_CUSTOM_BASE_URLS=0` (recommended default for public endpoints)
- `THB_ALLOWED_BASE_URL_PREFIXES=https://api-inference.huggingface.co`
If you need to allow custom provider endpoints, set `THB_ALLOW_CUSTOM_BASE_URLS=1` and restrict `THB_ALLOWED_BASE_URL_PREFIXES` to a strict comma-separated allowlist. Private, loopback, and link-local hosts are blocked.
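For example, a locked-down container launch with these defaults might look like the following (the image tag matches the `docker build` command above):

```bash
# Recommended public-endpoint configuration: custom base URLs disabled,
# allowlist restricted to the Hugging Face inference API.
docker run -p 8000:8000 \
  -e THB_ALLOW_CUSTOM_BASE_URLS=0 \
  -e THB_ALLOWED_BASE_URL_PREFIXES=https://api-inference.huggingface.co \
  talking-head-bench
```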
Client usage, via the bundled `client.py` helper:

```python
from client import TalkingHeadBenchEnv

with TalkingHeadBenchEnv(base_url="http://localhost:8000").sync() as env:
    # Image audit episode (standalone)
    obs = env.reset(mode="image")
    obs = env.step(action_1)     # ImageDiagnosticsAction -> ParamAnomalyObservation
    obs = env.step(action_2)     # ParamAnomalyAction -> done
    print(obs.reward)            # Sub-env 1 score

    # Clip audit episode (standalone)
    obs = env.reset(mode="clips")
    obs = env.step(action_clip)  # ClipDispositionAction -> done

    # Weight audit episode (standalone)
    obs = env.reset(mode="weights")
    obs = env.step(action_w)     # PhonemeRiskAction -> done
```

| Mode | Reset Output | Step 1 | Step 2 |
|---|---|---|---|
| `image` | `ImageDiagnosticsObservation` | `ParamAnomalyObservation` | done (`subenv1_score`) |
| `clips` | `ClipDispositionObservation` | done (`subenv2_score`) | N/A |
| `weights` | `PhonemeRiskObservation` | done (`subenv3_score`) | N/A |
| `clips_and_weights` | `ClipDispositionObservation` | `PhonemeRiskObservation` | done (subenv2/subenv3 blend) |
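A combined episode follows the same pattern as the client calls above; this sketch is illustrative, with the `action_*` variables standing in for agent-produced actions:

```python
# Combined clip + weight audit; the clip disposition step can feed suspected
# anomalous phonemes forward into the weight audit step.
obs = env.reset(mode="clips_and_weights")  # -> ClipDispositionObservation
obs = env.step(action_clip)                # ClipDispositionAction -> PhonemeRiskObservation
obs = env.step(action_w)                   # PhonemeRiskAction -> done
print(obs.reward)                          # blend of subenv2_score and subenv3_score
```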
```bash
# Run all tests
pytest

# Unit tests only
pytest tests/unit/

# Smoke / integration tests
pytest tests/smoke/

# With coverage
pytest --cov=src --cov-report=term-missing
```

| Test Module | Coverage Area |
|---|---|
| `test_schemas.py` | Pydantic model validation (all sub-envs) |
| `test_schema_roundtrip.py` | Schema serialization / deserialization |
| `test_grader_arithmetic.py` | Reward formula correctness |
| `test_node1_boundaries.py` | Node 1 edge cases |
| `test_node2_boundaries.py` | Node 2 edge cases |
| `test_node4_extractor.py` | Clip signal extraction |
| `test_node5_boundaries.py` | Disposition classifier boundaries |
| `test_node7_extractor.py` | Weight signal extraction |
| `test_node7_deep.py` | Deep Node 7 heuristic tests |
| `test_node8_boundaries.py` | Phoneme risk assessor boundaries |
| `test_pipeline_e2e.py` | Full episode end-to-end |
| `test_pipeline_bundle.py` | Bundle format validation |
| `test_evaluate_cli.py` | CLI harness integration |
Sub-env 1 (Reference Image and Prompt Audit):

| Dimension | Weight | Method |
|---|---|---|
| Regime Classification | 0.35 | Exact match (1.0), borderline (0.7), wrong (0.0) |
| Risk Factor Recall | 0.35 | Set intersection recall |
| Prompt Modification Validity | 0.30 | Precision against curated valid set |
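Each sub-env score is a weighted combination of its dimensions. As an illustration only (the authoritative formulas live in REWARD_LOGIC.md and the grader nodes), the Sub-env 1 weights above combine as:

```python
# Illustrative aggregation of the Sub-env 1 dimensions using the weights in
# the table; see REWARD_LOGIC.md and node3_grader.py for the actual formula.
def subenv1_score(regime: float, risk_recall: float, prompt_precision: float) -> float:
    return 0.35 * regime + 0.35 * risk_recall + 0.30 * prompt_precision
```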
Sub-env 2 (Dataset Clip Health Audit):

| Dimension | Weight | Method |
|---|---|---|
| Disposition Match | 0.40 | Exact + confidence calibration |
| Fix Instruction Quality | 0.20 | Precision ≥ 0.8 → full, ≥ 0.5 → half |
| Dataset Impact Reasoning | 0.20 | Keyword element matching |
| Override Misuse Penalty | −0.10 | Unjustified override → penalty |
Sub-env 3 (Trained LoRA Weight Behavioral Audit):

| Dimension | Weight | Method |
|---|---|---|
| Phoneme Risk Ranking | 0.25 | NDCG against reference ranking |
| Behavior Trigger Prediction | 0.20 | Set F1 on (phoneme, behavior) pairs |
| Cluster Identification | 0.20 | Overlap with reference clusters |
| Safety Calibration | 0.15 | Ordinal distance |
| Mitigation Quality | 0.20 | (target, action) pair matching |
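Phoneme Risk Ranking is scored with NDCG against a reference ranking. A minimal NDCG helper in the spirit of the shared scoring utilities (the function name, signature, and relevance encoding are assumptions, not the repository's API):

```python
import math

# Hypothetical NDCG helper; the benchmark's actual implementation may differ.
def ndcg(ranked_items: list[str], relevance: dict[str, float]) -> float:
    """NDCG of a predicted ranking against graded reference relevances."""
    def dcg(items):
        return sum(relevance.get(item, 0.0) / math.log2(i + 2)
                   for i, item in enumerate(items))
    ideal = sorted(relevance, key=relevance.get, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranked_items) / ideal_dcg if ideal_dcg > 0 else 0.0
```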
This benchmark is designed to evaluate agents working with:
- `elix3r/LTX-2.3-22b-AV-LoRA-talking-head`
| Property | Description |
|---|---|
| No live generation | All signals are pre-extracted; no GPU inference required during evaluation |
| Deterministic | All graders are rule-based with no LLM judge; fully reproducible |
| Partial credit | Borderline answers receive scaled scores, not binary pass/fail |
| Mode-based tasks | Image, clip, and weight audits run independently; clip→weight coupling is optional |
| Fast episodes | Full evaluation completes in seconds |
TalkingHeadBench's diagnostic nodes are grounded in peer-reviewed research on LoRA failure modes, attention interference, and weight-space analysis:
| Research | Application in TalkingHeadBench |
|---|---|
| TARA (Token-Aware LoRA Attention) | Node 1/2: Attention bleed from low face occupancy; token filtering informs risk factor detection |
| W2T (Weights to Tokens) | Node 7: QR→SVD canonical decomposition resolves factorization ambiguity before any weight statistics are computed |
| EditYourself | Node 1: Reference token coverage degrades at lateral angles; informs yaw-based regime classification |
| MoFE (Mixture of Facial Experts) | Node 2: Angle-dependent identity drift thresholds; directional fix vocabulary |
| VASA / EMO / Hallo | Node 4/5: Talking-head temporal stability expectations; lip-sync quality baselines |
| ALTER | Node 8: Phoneme→behavior association patterns; expression trigger taxonomy |
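The W2T-inspired canonical decomposition in `src/utils/canonical.py` is described as QR followed by SVD over the composed LoRA update, so that weight statistics do not depend on how the low-rank factors happen to be parameterized. A rough numpy sketch of that idea (the function name and exact steps are assumptions, not the repository's implementation):

```python
import numpy as np

# Hypothetical sketch: canonicalize a LoRA update W = B @ A before computing
# weight statistics, so equivalent (A, B) factorizations yield the same result.
def canonical_svd(lora_A: np.ndarray, lora_B: np.ndarray):
    # Orthonormalize each factor with reduced QR, then take the SVD of the
    # small composed core; the result is an SVD of W = B @ A.
    q_b, r_b = np.linalg.qr(lora_B)      # B = q_b @ r_b
    q_a, r_a = np.linalg.qr(lora_A.T)    # A.T = q_a @ r_a  ->  A = r_a.T @ q_a.T
    core = r_b @ r_a.T                   # r x r core of W
    u, s, vt = np.linalg.svd(core)
    return q_b @ u, s, vt @ q_a.T        # canonical U, singular values, Vt of W
```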
The benchmark now treats image, clip, and weight audits as independent tasks, with optional clip-to-weight context transfer when running a combined clips-and-weights episode.
| Document | Description |
|---|---|
| `docs/PROJECT_OVERVIEW.md` | Full architecture and design reference |
| `docs/OPENENV_INTEGRATION_GUIDE.md` | OpenEnv compliance and deployment guide |
| `docs/CODEBASE_REVIEW.md` | File-by-file codebase audit |
| `REWARD_LOGIC.md` | Detailed scoring and reward formula |
```bibtex
@software{TalkingHeadBench2026,
  author  = {elix3r},
  title   = {TalkingHeadBench: A Diagnostic Reasoning Benchmark for Talking-Head LoRA Pipelines},
  year    = {2026},
  url     = {https://github.com/22elix3r/TalkingHeadBench},
  version = {1.0.0}
}
```

Licensed under the MIT License. See LICENSE for details.