YaYan-AI（雅言）

Fully offline multi-dialect speech intelligence system. It converts speech across 22 Chinese dialects and 40+ world languages into structured Traditional Chinese transcripts, with speaker diarization, character-level timestamps, and an LLM-assisted correction loop. Designed for air-gapped, privacy-critical environments: no cloud, no API calls, no data leaving the machine.

Why YaYan-AI

Off-the-shelf ASR works for Mandarin and English. It falls apart on low-resource Chinese dialects — Hokkien (Taiwanese), Hakka, Wu, Cantonese variants — exactly the speech that matters in real-world field recordings, call archives, and oral-history preservation. And in privacy-critical settings, shipping audio to a cloud API is not an option at all.

YaYan-AI addresses both: dialect-specialized recognition that runs entirely on local GPUs.

Pipeline Architecture

flowchart LR
    A["🎙️ Audio input<br/>meetings · calls · field recordings"] --> D["pyannote 3.1<br/>speaker diarization (≤5 speakers)"]
    D --> L["Per-segment<br/>language / dialect ID"]
    L -->|"22 Chinese dialects"| C["Dolphin-CN-Dialect<br/>dialect ASR"]
    L -->|"40+ languages"| W["Whisper large-v3<br/>multilingual ASR"]
    C --> Q["Qwen3-14B<br/>correction · normalization<br/>→ Traditional Chinese"]
    W --> Q
    Q --> T["📄 Structured transcript<br/>speaker labels + char-level timestamps"]
    T --> F["✏️ Editable feedback loop<br/>human fixes → LLM re-refinement"]
    F -.-> Q

Design decisions that matter:

- Per-segment language ID: a single conversation can mix Mandarin, Hokkien and English; routing each segment independently prevents one dominant language from swallowing the others.
- Two-stage ASR routing: a dialect-specialized model (Dolphin) for Chinese variants, Whisper large-v3 for everything else; each model does only what it is best at.
- LLM correction stage: Qwen3-14B normalizes ASR output into readable Traditional Chinese and fixes dialect-specific transcription artifacts.
- Human-in-the-loop: user edits feed back into the correction stage, so accuracy improves on your audio domain over time.
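
The per-segment routing described above can be sketched in a few lines of Python. This is an illustration only, not the project's code: the `Segment` class, the `route` function, the backend names and the label set are all hypothetical, and the README does not list the actual dialect codes.

```python
from dataclasses import dataclass

# Hypothetical label set: the README names 22 dialects but not their codes.
CHINESE_DIALECT_LABELS = {"mandarin", "cantonese", "hokkien", "hakka", "wu", "hunanese"}


@dataclass
class Segment:
    speaker: str    # diarization label, e.g. "A"
    start: float    # seconds
    end: float      # seconds
    language: str   # result of the per-segment language / dialect ID step


def route(segment: Segment) -> str:
    """Choose an ASR backend for one segment, independently of its neighbours."""
    if segment.language.lower() in CHINESE_DIALECT_LABELS:
        return "dolphin"   # dialect-specialized model
    return "whisper"       # multilingual model for everything else


# A single conversation that mixes Mandarin, Hokkien and English.
segments = [
    Segment("A", 0.0, 3.2, "mandarin"),
    Segment("B", 3.2, 5.0, "hokkien"),
    Segment("A", 5.0, 7.5, "english"),
]
plan = [(s.speaker, s.language, route(s)) for s in segments]
```

Because `route` looks at one segment at a time, the English segment still reaches Whisper even though most of the conversation is Chinese.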

Features

| Capability | Detail |
| --- | --- |
| Chinese dialects | 22: Mandarin, Cantonese, Hokkien/Taiwanese, Hakka, Wu (Shanghai/Suzhou/Wenzhou), Hunanese, … |
| World languages | 40+ via Whisper large-v3 (Japanese, Korean, European, Southeast Asian, Middle Eastern) |
| Speaker diarization | Up to 5 speakers (A–E), pyannote 3.1 |
| Timestamps | Character-level |
| Output | Structured Traditional Chinese transcript, per-speaker |
| Correction | Qwen3-14B two-stage + editable feedback loop |
| Deployment | 100% offline, for air-gapped and privacy-critical environments |
| Interface | Gradio web UI · batch mode (`auto_batch.py`) |
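
To make "per-speaker transcript with character-level timestamps" concrete, here is one possible in-memory shape for the output. The README does not document the real schema, so `TimedChar`, `Utterance`, `letter_labels` and `render` are hypothetical names, and the `SPEAKER_00`-style raw labels are an assumption about what the diarizer emits.

```python
from dataclasses import dataclass
from string import ascii_uppercase

MAX_SPEAKERS = 5  # the README caps diarization at five speakers (A-E)


@dataclass
class TimedChar:
    char: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Utterance:
    raw_speaker: str        # diarizer label (assumed form: "SPEAKER_00")
    chars: list[TimedChar]

    @property
    def text(self) -> str:
        return "".join(c.char for c in self.chars)


def letter_labels(utterances: list[Utterance]) -> dict[str, str]:
    """Map raw diarizer labels to A, B, C, ... in order of first appearance."""
    mapping: dict[str, str] = {}
    for utt in utterances:
        if utt.raw_speaker not in mapping:
            if len(mapping) >= MAX_SPEAKERS:
                raise ValueError("more than 5 speakers in one recording")
            mapping[utt.raw_speaker] = ascii_uppercase[len(mapping)]
    return mapping


def render(utterances: list[Utterance]) -> list[str]:
    """Format each non-empty utterance as '[start-end] speaker: text'."""
    labels = letter_labels(utterances)
    lines = []
    for utt in utterances:
        if not utt.chars:
            continue
        start, end = utt.chars[0].start, utt.chars[-1].end
        lines.append(f"[{start:.2f}-{end:.2f}] {labels[utt.raw_speaker]}: {utt.text}")
    return lines


utterances = [
    Utterance("SPEAKER_01", [TimedChar("你", 0.10, 0.32), TimedChar("好", 0.32, 0.60)]),
    Utterance("SPEAKER_00", [TimedChar("早", 0.90, 1.20)]),
]
lines = render(utterances)
```

Keeping the timestamps on individual characters means an utterance's span can always be derived from its first and last character, so it does not need to be stored separately.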

Requirements

| Component | Spec |
| --- | --- |
| OS | Ubuntu 22.04+ |
| GPU | 2× NVIDIA; supported profiles: RTX 4000 Ada 20 GB (`app_rtx4000.py`) · RTX 6000 24 GB (`app_rtx6000.py`) |
| Driver / CUDA | 535+ / CUDA 12.1 |
| Python | 3.10 |
| Disk | ~50 GB (models ≈ 19 GB + workspace) |
| RAM | 32 GB+ |

Core stack: PyTorch 2.3.1 · transformers 4.51.3 · pyannote.audio 3.1 · Gradio 4.44
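
A quick preflight check against the requirements table might look like the sketch below. It is not one of the repository's scripts: `find_problems` and `probe_local` are hypothetical helpers, the thresholds are copied from the table, and the RAM probe relies on Linux-only `os.sysconf` names.

```python
import os
import shutil
import sys

# Thresholds taken from the requirements table.
REQUIRED_PYTHON = (3, 10)
MIN_FREE_DISK_GB = 50
MIN_RAM_GB = 32
MIN_GPUS = 2


def find_problems(python_version, free_disk_gb, ram_gb, gpu_count):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append(f"Python 3.10 expected, found {python_version[0]}.{python_version[1]}")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_disk_gb:.0f} GB free disk, need ~{MIN_FREE_DISK_GB} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"only {ram_gb:.0f} GB RAM, need {MIN_RAM_GB} GB+")
    if gpu_count < MIN_GPUS:
        problems.append(f"only {gpu_count} GPU(s) visible, need {MIN_GPUS}")
    return problems


def probe_local(path="."):
    """Collect the four values from the current machine (Linux only)."""
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    # Decimal GB on purpose: a 32 GiB machine reports slightly under 32 GiB usable.
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    try:
        import torch  # optional; without it no GPU can be detected here
        gpu_count = torch.cuda.device_count()
    except ImportError:
        gpu_count = 0
    return sys.version_info, free_disk_gb, ram_gb, gpu_count
```

Calling `find_problems(*probe_local())` before the first launch lists any requirement the machine misses. The check logic is kept separate from the probing so it can be exercised without the target hardware.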

Quickstart

git clone https://github.com/wu840407/YaYan-AI.git
cd YaYan-AI
pip install -r requirements.txt

# Download models (online phase; the system runs fully offline afterwards)
bash scripts/download_models.sh
python scripts/verify_models.py

# Launch — pick the profile matching your GPUs
python app_rtx4000.py     # dual RTX 4000 Ada → Gradio UI on http://localhost:7860
python app_rtx6000.py     # dual RTX 6000

# Batch mode
bash scripts/start_batch.sh

For fully air-gapped installation, see `scripts/install_offline.sh` and `scripts/verify_offline.py`.
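
Since the core stack includes `transformers`, one common way to guarantee that no network call is attempted after the download phase is to set the Hugging Face offline switches before those libraries are imported. Whether the project's own scripts already do this is not stated in the README, so treat the following as an assumption-laden sketch; `force_offline` is a hypothetical helper, while the two environment variable names are the ones read by `huggingface_hub` and `transformers`.

```python
import os

# Both variables are read when the Hugging Face libraries are imported,
# so they have to be set before `transformers` or `huggingface_hub` is loaded.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",        # huggingface_hub: never contact the Hub
    "TRANSFORMERS_OFFLINE": "1",  # transformers: use only locally cached files
}


def force_offline(env=os.environ):
    """Set the offline switches in `env` and return the values now in effect."""
    env.update(OFFLINE_ENV)
    return {name: env[name] for name in OFFLINE_ENV}


# Call this at the very top of an entry script, before any model imports.
force_offline()
```

With these set, a missing model file surfaces as an immediate local error instead of a silent download attempt, which is the behaviour you want on an air-gapped machine.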

Performance

| Category | Accuracy | Note |
| --- | --- | --- |
| Standard Mandarin | ~99% | Character-level |
| Chinese dialects | ~75% | Bounded by current model size on dual 20–24 GB GPUs |

The dialect gap is primarily a model-capacity constraint. Planned accuracy improvements are listed in the roadmap below: larger models on H200 hardware, plus a dialect-specific term bank and corpora.
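
The README does not say how character-level accuracy is computed. A common convention, assumed here, is one minus the character error rate (CER), where CER is the edit distance between reference and hypothesis divided by the reference length. The sketch below shows how you could measure this on your own recordings; `edit_distance` and `char_accuracy` are illustrative names, not part of the project.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """1 - CER, clamped at 0 because CER can exceed 1 for very long hypotheses."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# One wrong character out of six.
score = char_accuracy("今天天氣很好", "今天天器很好")
```

Here `score` is 5/6, about 0.83, because a single substitution in a six-character reference gives a CER of 1/6.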

Roadmap

- Larger models on 2× H200 hardware to lift dialect accuracy (2026 H2)
- RAG term bank for domain / dialect vocabulary
- Fine-tuning on public academic dialect corpora
- Kafka-based high-throughput batch pipeline
- Streaming (real-time) mode

License & Author

MIT — see LICENSE. Model weights retain their respective upstream licenses (see NOTICE.md).

Maintained by ChengRung Wu (@wu840407). Questions and issues welcome.

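Summary

YaYan-AI (雅言) is a fully offline multi-dialect speech recognition system: 22 Chinese dialects plus 40+ languages → Traditional Chinese verbatim transcripts, with speaker diarization (up to 5 speakers), character-level timestamps, Qwen3-14B correction, and an editable feedback loop. It is designed for offline, privacy-sensitive environments: no cloud connection, and no data leaves the machine.

Architecture: pyannote 3.1 separates speakers → per-segment language identification → Dolphin (Chinese dialects) / Whisper large-v3 (other languages) → Qwen3-14B correction, producing a Traditional Chinese transcript. Dual-GPU profiles: RTX 4000 Ada (`app_rtx4000.py`) / RTX 6000 (`app_rtx6000.py`).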
