zipformer is a speech encoder that achieves both high performance and efficiency. It is specifically optimized for speech recognition tasks and is the only model that outperforms Google's Conformer under fair comparison.
- Efficient model architecture: UNet-style multi-scale encoder with module innovations (BiasNorm, Swoosh, Balancer, Whitener).
- New optimizer: ScaledAdam.
- State-of-the-art performance with 50% fewer FLOPs than Conformer.
- Supports CTC, Transducer, and AED modeling.
- CR-CTC: Consistency regularization for stronger CTC models.
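
To make two of the module innovations listed above more concrete, here is a minimal pure-Python sketch of the Swoosh activations and BiasNorm, written from the formulas as published in the Zipformer paper. It is illustrative only: the function names (`softplus`, `swoosh_r`, `swoosh_l`, `bias_norm`) are hypothetical and are not part of the zipformer package, and the real modules operate on tensors with learnable parameters.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x: float) -> float:
    """SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687."""
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    """SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035."""
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x: Sequence[float], bias: Sequence[float], log_scale: float) -> List[float]:
    """BiasNorm(x) = x / RMS(x - bias) * exp(log_scale).

    `bias` and `log_scale` stand in for the learnable parameters.
    This sketch assumes x != bias, so the RMS is non-zero.
    """
    rms = math.sqrt(sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x))
    scale = math.exp(log_scale) / rms
    return [xi * scale for xi in x]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [0.0, 0.0], 0.0))
```

With a zero bias and a zero log-scale, `bias_norm` reduces to plain RMS normalization, which is a convenient sanity check.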
zipformer ASR models are available in xlarge, large, medium, and small variants, with both streaming and non-streaming versions. The table below provides download links. For more details, please refer to the documentation.
| Model | Parameters | ModelScope | Huggingface | Languages | Architectures |
|---|---|---|---|---|---|
| zipformer-xlarge | 300M | link | link | Chinese, English | CTC |
| zipformer-large | 150M | link | link | Chinese, English | CTC, Transducer |
| zipformer-large-streaming | 150M | link | link | Chinese, English | CTC, Transducer |
| zipformer-medium | 65M | link | link | Chinese, English | CTC, Transducer |
| zipformer-medium-streaming | 65M | link | link | Chinese, English | CTC, Transducer |
| zipformer-small | 25M | link | link | Chinese, English | CTC, Transducer |
| zipformer-small-streaming | 25M | link | link | Chinese, English | CTC, Transducer |
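
As a rough guide when choosing a variant, the parameter counts in the table translate into an approximate weight footprint. The sketch below assumes 4 bytes per parameter for fp32 and 2 bytes for fp16, and it ignores activations, runtime buffers, and file-format overhead, so the actual download and memory sizes may differ. The helper name `weight_megabytes` is hypothetical.

```python
# Parameter counts (in millions) taken from the model table.
PARAMS_MILLIONS = {
    "zipformer-xlarge": 300,
    "zipformer-large": 150,
    "zipformer-medium": 65,
    "zipformer-small": 25,
}

# Assumed storage cost per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_megabytes(model: str, dtype: str = "fp32") -> int:
    """Approximate size of the weights alone, in MB (10^6 bytes)."""
    # Millions of parameters times bytes per parameter gives megabytes.
    return PARAMS_MILLIONS[model] * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for name in PARAMS_MILLIONS:
        print(name, weight_megabytes(name, "fp32"), weight_megabytes(name, "fp16"))
```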
2026/06/22: Created standalone zipformer repository from icefall, and released xlarge, large, medium, and small Chinese/English models.
pip install zipformer

Tip: The examples below use the non-streaming medium model. For more models, please refer to the documentation.
# Use jit scripted model
# Transducer
zipformer inference --hf-model pkufool/zipformer-medium --model-type jit --ctc 0 en.wav zh.wav
# CTC
zipformer inference --hf-model pkufool/zipformer-medium --model-type jit --ctc 1 en.wav zh.wav
# Use onnx model
# Transducer
zipformer inference --hf-model pkufool/zipformer-medium --model-type onnx --ctc 0 en.wav zh.wav
# CTC
zipformer inference --hf-model pkufool/zipformer-medium --model-type onnx --ctc 1 en.wav zh.wav

from zipformer import inference
# jit scripted model
# Transducer
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='jit', ctc=False)
# CTC
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='jit', ctc=True)
# onnx model
# Transducer
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='onnx', ctc=False)
# CTC
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='onnx', ctc=True)
# fp16 model
# Transducer
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='onnx', ctc=False, dtype='fp16')
# CTC
result = inference(['en.wav', 'zh.wav'], hf_model='pkufool/zipformer-medium', model_type='onnx', ctc=True, dtype='fp16')

For more information about model training, evaluation, and deployment, please refer to the documentation.
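
For scripted or batch use, the command-line invocations shown earlier can be assembled programmatically. The sketch below only builds the argument list from the flags that appear in those examples (`--hf-model`, `--model-type`, `--ctc`); the helper `build_inference_command` is hypothetical and not part of the zipformer package, and it does not run the command itself.

```python
import shlex
from typing import List, Sequence


def build_inference_command(
    wavs: Sequence[str],
    hf_model: str = "pkufool/zipformer-medium",
    model_type: str = "jit",
    ctc: bool = False,
) -> List[str]:
    """Build the `zipformer inference` argument list for the given options."""
    # Only the two model types used in the examples are accepted here.
    if model_type not in ("jit", "onnx"):
        raise ValueError(f"unsupported model type: {model_type}")
    if not wavs:
        raise ValueError("at least one audio file is required")
    return [
        "zipformer", "inference",
        "--hf-model", hf_model,
        "--model-type", model_type,
        "--ctc", "1" if ctc else "0",  # 0 selects Transducer, 1 selects CTC
        *wavs,
    ]


if __name__ == "__main__":
    # Print all four jit/onnx and Transducer/CTC combinations.
    for model_type in ("jit", "onnx"):
        for ctc in (False, True):
            cmd = build_inference_command(["en.wav", "zh.wav"], model_type=model_type, ctc=ctc)
            print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.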
For task-related issues, please open an issue on GitHub Issues.
@inproceedings{yao2024zipformer,
title={Zipformer: A faster and better encoder for automatic speech recognition},
author={Yao, Zengwei and Guo, Liyong and Yang, Xiaoyu and Kang, Wei and Kuang, Fangjun and Yang, Yifan and Jin, Zengrui and Lin, Long and Povey, Daniel},
booktitle={International Conference on Learning Representations},
volume={2024},
pages={44440--44455},
year={2024}
}
@inproceedings{yao2025cr,
title={CR-CTC: Consistency regularization on CTC for improved speech recognition},
author={Yao, Zengwei and Kang, Wei and Yang, Xiaoyu and Kuang, Fangjun and Guo, Liyong and Zhu, Han and Jin, Zengrui and Li, Zhaoqing and Lin, Long and Povey, Daniel},
booktitle={International Conference on Learning Representations},
volume={2025},
pages={26850--26868},
year={2025}
}
