Highlights · Benchmark Overview · Quick Start · Main Results · Citation
SocialOmni is a benchmark for audio-visual social interactivity in omni-modal large language models (OLMs). Instead of reducing evaluation to static answer correctness, SocialOmni measures whether a model can behave appropriately in real dialogue by jointly evaluating three tightly coupled dimensions:
- Who is speaking: speaker separation and identification
- When to enter: interruption timing control
- How to respond: natural interruption generation
The repository contains the benchmark pipeline, model clients and servers, runtime configurations, and reproducible evaluation entrypoints for both perception and interaction-generation settings.
Existing benchmarks are trapped in "answer-centric" metrics. SocialOmni shifts the focus to socially appropriate behavior in multi-party dialogues, where a "correct" answer is still a failure if the timing is unnatural.
We operationalize conversational interactivity into a unified joint profile:
- Who: Active speaker identification.
- When: Socially appropriate interruption timing.
- How: Contextually coherent response generation.
SocialOmni provides a high-fidelity map of failure by deconstructing the friction between three critical axes:
- Perceptual Resilience: Robustness across audio-visual (in)consistency.
- Timing Precision: Millisecond-level accuracy within "social windows."
- Generative Quality: AI-judged naturalness and coherence of interruptions.
The Core Insight: By decoupling these dimensions, we pinpoint exactly where strong perception fails to translate into fluid social interaction.
Given a video clip and a timestamp t, the model answers:
"At timestamp t, who is speaking?" The model chooses from {A, B, C, D}.
Given a video prefix V[0:t] and a candidate speaker X, the model answers two sub-questions:
- Q1 (when): should X interrupt immediately after t?
- Q2 (how): if yes, what is the natural interruption content?
- Top-1 Accuracy
- Consistent / inconsistent split accuracy
- Gap: Δ = Acc_consistent - Acc_inconsistent
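A minimal sketch of the split metrics (the record field names are assumptions for illustration; the actual result schema may differ):

```python
def split_accuracy(records):
    """Compute consistent/inconsistent accuracy and their gap.

    Each record is assumed to carry a boolean `consistent` flag and a
    0/1 `correct` flag.
    """
    def acc(rs):
        return sum(r["correct"] for r in rs) / len(rs) if rs else 0.0
    consistent = [r for r in records if r["consistent"]]
    inconsistent = [r for r in records if not r["consistent"]]
    acc_c, acc_i = acc(consistent), acc(inconsistent)
    # Gap: Δ = Acc_consistent - Acc_inconsistent
    return acc_c, acc_i, acc_c - acc_i
```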
- Q1: Accuracy / Precision / Recall / F1 under tolerance windows such as δ = 0.2s
- Q2: LLM-judge score on {0, 25, 50, 75, 100}
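An illustrative sketch of tolerance-window timing evaluation for Q1 (the benchmark's exact matching rule may differ):

```python
def within_tolerance(pred_t: float, gold_t: float, delta: float = 0.2) -> bool:
    """A predicted interruption time counts as a hit if it falls within
    ±delta seconds of the annotated time."""
    return abs(pred_t - gold_t) <= delta

def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Counting hits and misses this way per clip yields the Accuracy / Precision / Recall / F1 figures for a given δ.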
The paper protocol uses three judges for Q2:
- GPT-4o
- Gemini 3 Pro
- Qwen3-Omni
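Each judge scores a response on the discrete {0, 25, 50, 75, 100} scale. A simple unweighted mean over the judge set can be sketched as follows (the aggregation rule is an assumption for illustration, not necessarily the paper's exact protocol):

```python
VALID_SCORES = {0, 25, 50, 75, 100}

def aggregate_judge_scores(scores: list[int]) -> float:
    """Average per-judge scores after validating the discrete scale."""
    for s in scores:
        if s not in VALID_SCORES:
            raise ValueError(f"score {s} is not on the 0/25/50/75/100 scale")
    return sum(scores) / len(scores)
```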
Perception strength does not guarantee interaction quality. Some models identify speakers well but perform poorly on natural interruption generation, while others generate plausible responses despite weak speaker grounding.
| Model | Who (%) | When Acc. (%) | How (/100) |
|---|---|---|---|
| GPT-4o | 36.75 | 46.89 | 69.64 |
| Gemini 2.5 Pro | 44.69 | 55.67 | 72.32 |
| Gemini 2.5 Flash | 47.03 | 61.50 | 85.08 |
| Gemini 3 Flash Preview | 53.23 | 61.06 | 79.08 |
| Gemini 3 Pro Preview | 64.99 | 67.31 | 81.77 |
| Qwen3-Omni | 69.25 | 63.64 | 45.57 |
Key observations:
- Who leader: Qwen3-Omni
- When leader: Gemini 3 Pro Preview
- How leader: Gemini 2.5 Flash
This rank inversion is why SocialOmni evaluates the full interaction profile instead of a single aggregate score.
We recommend the following environment:
- Python `>=3.10,<3.11`
- CUDA-compatible PyTorch runtime for local omni models
- `uv` for dependency and environment management
Install with:
```bash
git clone https://github.com/Alexisxty/SocialOmni.git
cd SocialOmni
uv sync
```

Edit `config/config.yaml` and set:
- API keys / API endpoints
- local model path or `server_url`
- dataset path
- output and result directories
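A hypothetical shape for `config/config.yaml` (every key name below is an illustrative assumption — consult the file shipped in the repository for the real schema):

```yaml
# Illustrative sketch only; actual keys may differ.
api:
  openai_api_key: ${OPENAI_API_KEY}
  openai_api_base: ${OPENAI_API_BASE}
models:
  qwen3_omni:
    server_url: http://localhost:8000
dataset:
  path: data/
output:
  results_dir: results/
```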
Common environment variables:
- `OPENAI_API_KEY`
- `OPENAI_API_BASE`
- `GEMINI_API_KEY`
- `GEMINI_API_BASE`
Example:
```bash
uv run models/model_server/qwen3_omni/qwen3_omni_server.py
```

Other model server entrypoints are located under `models/model_server/*/*_server.py`.
```bash
uv run run_benchmark.py --model qwen3_omni                  # Task I
uv run run_benchmark_level2.py --model qwen3_omni --resume  # Task II
```

```
SocialOmni/
├── config/                  # runtime, model, and evaluation configs
├── data/                    # local datasets (not tracked)
├── docs/                    # docs and visual assets
├── models/                  # model servers, clients, and shared benchmark logic
├── scripts/                 # utility scripts
├── run_benchmark.py         # Task I entrypoint
├── run_benchmark_level2.py  # Task II entrypoint
├── pyproject.toml           # dependency definition
└── README.md
```
Use the following keys with `--model`:

- `gpt4o`
- `gemini_2_5_flash`
- `gemini_2_5_pro`
- `gemini_3_flash_preview`
- `gemini_3_pro_preview`
- `qwen3_omni`
- `qwen3_omni_thinking`
- `qwen2_5_omni`
- `miniomni_2`
- `omnivinci`
- `vita_1_5`
- `baichuan_omni_1_5`
- `ming`
- Keep dataset and result directories local and out of version control.
- Use fixed prompt templates and stable runtime configs for cross-model comparison.
- Report split-wise metrics and confidence intervals when claiming improvements.
- For generation evaluation, keep the judge set fixed across runs.
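As one way to obtain such confidence intervals, a generic percentile-bootstrap sketch over per-example scores (not code from the repository):

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores
    (e.g. 0/1 correctness). Resamples with replacement, then reads
    the alpha/2 and 1-alpha/2 percentiles of the resampled means."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```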
If you find SocialOmni useful in your research, please cite:
```bibtex
@article{xie2026socialomni,
  title={SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models},
  author={Tianyu Xie and Jinfa Huang and Yuexiao Ma and Rongfang Luo and Yan Yang and Wang Chen and Yuhui Zeng and Ruize Fang and Yixuan Zou and Xiawu Zheng and Jiebo Luo and Rongrong Ji},
  journal={arXiv preprint arXiv:2603.16859},
  year={2026}
}
```