Highlights · Benchmark Overview · Quick Start · Main Results · Citation
SocialOmni is a benchmark for audio-visual social interactivity in omni-modal large language models (OLMs). Instead of reducing evaluation to static answer correctness, SocialOmni measures whether a model can behave appropriately in real dialogue by jointly evaluating three tightly coupled dimensions:
- Who is speaking: speaker separation and identification
- When to enter: interruption timing control
- How to respond: natural interruption generation
The repository contains the benchmark pipeline, model clients and servers, runtime configurations, and reproducible evaluation entrypoints for both perception and interaction-generation settings.
Existing benchmarks are trapped in "answer-centric" metrics. SocialOmni shifts the focus to socially appropriate behavior in multi-party dialogues, where a "correct" answer is still a failure if the timing is unnatural.
We operationalize conversational interactivity into a unified joint profile:
- Who: Active speaker identification.
- When: Socially appropriate interruption timing.
- How: Contextually coherent response generation.
SocialOmni provides a high-fidelity map of failure by deconstructing the friction between three critical axes:
- Perceptual Resilience: Robustness across audio-visual (in)consistency.
- Timing Precision: Millisecond-level accuracy within "social windows."
- Generative Quality: AI-judged naturalness and coherence of interruptions.
The Core Insight: By decoupling these dimensions, we pinpoint exactly where strong perception fails to translate into fluid social interaction.
Given a video clip and a timestamp t, the model answers:
"At timestamp t, who is speaking?" The model chooses from {A, B, C, D}.
Given a video prefix V[0:t] and a candidate speaker X, the model answers two sub-questions:
- Q1 (when): should X interrupt immediately after t?
- Q2 (how): if yes, what is the natural interruption content?
- Top-1 Accuracy
- Consistent / inconsistent split accuracy
- Gap: Δ = Acc_consistent - Acc_inconsistent
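A minimal sketch of the split metrics (the record field names are assumptions for illustration; the actual result schema may differ):

```python
def split_accuracy(records):
    """Compute consistent/inconsistent accuracy and their gap.

    Each record is assumed to carry a boolean `consistent` flag and a
    0/1 `correct` flag.
    """
    def acc(rs):
        return sum(r["correct"] for r in rs) / len(rs) if rs else 0.0
    consistent = [r for r in records if r["consistent"]]
    inconsistent = [r for r in records if not r["consistent"]]
    acc_c, acc_i = acc(consistent), acc(inconsistent)
    # Gap: Δ = Acc_consistent - Acc_inconsistent
    return acc_c, acc_i, acc_c - acc_i
```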
- Q1: Accuracy / Precision / Recall / F1 under tolerance windows such as δ = 0.2s
- Q2: LLM-judge score on {0, 25, 50, 75, 100}
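An illustrative sketch of tolerance-window timing evaluation for Q1 (the benchmark's exact matching rule may differ):

```python
def within_tolerance(pred_t: float, gold_t: float, delta: float = 0.2) -> bool:
    """A predicted interruption time counts as a hit if it falls within
    ±delta seconds of the annotated time."""
    return abs(pred_t - gold_t) <= delta

def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Counting hits and misses this way per clip yields the Accuracy / Precision / Recall / F1 figures for a given δ.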
The paper protocol uses three judges for Q2:
- GPT-4o
- Gemini 3 Pro
- Qwen3-Omni
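Each judge scores a response on the discrete {0, 25, 50, 75, 100} scale. A simple unweighted mean over the judge set can be sketched as follows (the aggregation rule is an assumption for illustration, not necessarily the paper's exact protocol):

```python
VALID_SCORES = {0, 25, 50, 75, 100}

def aggregate_judge_scores(scores: list[int]) -> float:
    """Average per-judge scores after validating the discrete scale."""
    for s in scores:
        if s not in VALID_SCORES:
            raise ValueError(f"score {s} is not on the 0/25/50/75/100 scale")
    return sum(scores) / len(scores)
```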
Perception strength does not guarantee interaction quality. Some models identify speakers well but perform poorly on natural interruption generation, while others generate plausible responses despite weak speaker grounding.
| Model | Who (%) | When Acc. (%) | How (/100) |
|---|---|---|---|
| GPT-4o | 36.75 | 46.89 | 69.64 |
| Gemini 2.5 Pro | 44.69 | 55.67 | 72.32 |
| Gemini 2.5 Flash | 47.03 | 61.50 | 85.08 |
| Gemini 3 Flash Preview | 53.23 | 61.06 | 79.08 |
| Gemini 3 Pro Preview | 64.99 | 67.31 | 81.77 |
| Qwen3-Omni | 69.25 | 63.64 | 45.57 |
Key observations:
- Who leader: Qwen3-Omni
- When leader: Gemini 3 Pro Preview
- How leader: Gemini 2.5 Flash
This rank inversion is why SocialOmni evaluates the full interaction profile instead of a single aggregate score.
We recommend the following environment:
- Python `>=3.10,<3.11`
- CUDA-compatible PyTorch runtime for local omni models
- `uv` for dependency and environment management
Install with:
```bash
git clone https://github.com/Alexisxty/SocialOmni.git
cd SocialOmni
uv sync
```

Edit `config/config.yaml` and set:
- API keys / API endpoints
- local model path or `server_url`
- dataset path
- output and result directories
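A hypothetical shape for `config/config.yaml` (every key name below is an illustrative assumption — consult the file shipped in the repository for the real schema):

```yaml
# Illustrative sketch only; actual keys may differ.
api:
  openai_api_key: ${OPENAI_API_KEY}
  openai_api_base: ${OPENAI_API_BASE}
models:
  qwen3_omni:
    server_url: http://localhost:8000
dataset:
  path: data/
output:
  results_dir: results/
```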
Common environment variables:
- `OPENAI_API_KEY`
- `OPENAI_API_BASE`
- `GEMINI_API_KEY`
- `GEMINI_API_BASE`
Example:
```bash
uv run models/model_server/qwen3_omni/qwen3_omni_server.py
```

Other model server entrypoints are located under `models/model_server/*/*_server.py`.
```bash
uv run run_benchmark.py --model qwen3_omni                  # Task I
uv run run_benchmark_level2.py --model qwen3_omni --resume  # Task II
```

```
SocialOmni/
├── config/                  # runtime, model, and evaluation configs
├── data/                    # local datasets (not tracked)
├── docs/                    # docs and visual assets
├── models/                  # model servers, clients, and shared benchmark logic
├── scripts/                 # utility scripts
├── run_benchmark.py         # Task I entrypoint
├── run_benchmark_level2.py  # Task II entrypoint
├── pyproject.toml           # dependency definition
└── README.md
```
Use the following keys with `--model`:

- `gpt4o`
- `gemini_2_5_flash`
- `gemini_2_5_pro`
- `gemini_3_flash_preview`
- `gemini_3_pro_preview`
- `qwen3_omni`
- `qwen3_omni_thinking`
- `qwen2_5_omni`
- `miniomni_2`
- `omnivinci`
- `vita_1_5`
- `baichuan_omni_1_5`
- `ming`
- Keep dataset and result directories local and out of version control.
- Use fixed prompt templates and stable runtime configs for cross-model comparison.
- Report split-wise metrics and confidence intervals when claiming improvements.
- For generation evaluation, keep the judge set fixed across runs.
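As one way to obtain such confidence intervals, a generic percentile-bootstrap sketch over per-example scores (not code from the repository):

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores
    (e.g. 0/1 correctness). Resamples with replacement, then reads
    the alpha/2 and 1-alpha/2 percentiles of the resampled means."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```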
If you find SocialOmni useful in your research, please cite:
```bibtex
@article{xie2026socialomni,
  title={SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models},
  author={Tianyu Xie and Jinfa Huang and Yuexiao Ma and Rongfang Luo and Yan Yang and Wang Chen and Yuhui Zeng and Ruize Fang and Yixuan Zou and Xiawu Zheng and Jiebo Luo and Rongrong Ji},
  journal={arXiv preprint arXiv:2603.16859},
  year={2026}
}
```