Wenhui Tan1, Minghao Li2, Xiaoqian Ma2, Siqi Fan3, Xiusheng Huang4, Liujie Zhang2, Ruihua Song1, Weihang Chen2
1 Gaoling School of Artificial Intelligence, Renmin University of China 2 AI Platform, Xiaohongshu Inc. 3 University of Electronic Science and Technology of China 4 Institute of Automation, Chinese Academy of Sciences
Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose Pair-In, Pair-Out (PIPO), which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to +7.15 points, while delivering up to 2.64× first-token-latency and 2.07× per-token-latency speedups.
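For readers unfamiliar with the rejection-sampling criterion mentioned above, here is a minimal sketch of the standard speculative-decoding rule: a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target (verifier) distribution and q is the draft distribution. This illustrates the textbook criterion only, not PIPO's confidence head; the function names are hypothetical and do not come from this repository.

```python
import random


def acceptance_probability(p_target: float, q_draft: float) -> float:
    """Probability of accepting a draft token under standard speculative decoding.

    p_target: probability the target model assigns to the draft token.
    q_draft:  probability the draft model assigned to the same token.
    """
    if q_draft <= 0.0:
        raise ValueError("q_draft must be positive for a token the draft model sampled")
    return min(1.0, p_target / q_draft)


def accept_draft(p_target: float, q_draft: float, rng: random.Random) -> bool:
    """Draw one accept/reject decision for a single draft token."""
    return rng.random() < acceptance_probability(p_target, q_draft)
```

A token the target model likes at least as much as the draft model did (p >= q) is always accepted; otherwise it is accepted only a fraction p/q of the time.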
conda create -n pipo python=3.12 -y
conda activate pipo
bash scripts/install.sh
The trained checkpoints are available on Hugging Face.
Download the outputs/ directory into /path/to/PIPO/. You should then see files like:
outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500/
Then download the base models and benchmarks from Hugging Face:
bash scripts/download.sh
bash scripts/merge_lora.sh outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500
This produces a sibling directory with merged weights:
outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500-merged.
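When scripting later steps, the merged directory name can be derived from the checkpoint path rather than typed by hand. The sketch below assumes only the "-merged" suffix convention shown above; the helper name is hypothetical and not part of this repository.

```python
from pathlib import Path


def merged_checkpoint_dir(checkpoint_dir: str) -> Path:
    """Return the sibling directory that holds the merged weights.

    Assumes the naming convention shown above: <checkpoint>-merged.
    """
    ckpt = Path(checkpoint_dir)
    return ckpt.with_name(ckpt.name + "-merged")


if __name__ == "__main__":
    ckpt = "outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500"
    print(merged_checkpoint_dir(ckpt).as_posix())
```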
python sglang_eval.py --model_path=outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500-merged
Results are saved under <model_path>/eval/.
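The headline metric in the abstract is pass@4. The format of the files under eval/ is not described here, so the sketch below takes raw counts as input and shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. Whether sglang_eval.py computes pass@k this way is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that are correct,
    k: budget of attempts being scored (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 4, the estimate reduces to 1.0 if any of the four samples is correct and 0.0 otherwise; the benchmark score is the mean over problems.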
Regular decoding:
python sglang_eval.py --model_path=Qwen/Qwen3.5-4B
Qwen3.5's native MTP head with EAGLE-2 speculative decoding:
python sglang_eval.py --model_path=Qwen/Qwen3.5-4B --enable_eagle
SFT and OPD training data are available on Hugging Face.
Download the data/ directory into /path/to/PIPO/. You should then see files like data/sft_all.jsonl and data/rl_0.5.jsonl.
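A quick check that the download landed in the right place can save a failed training run. The sketch below looks only for the two example files named above, so the expected-file list is an assumption and may be incomplete; the helper name is hypothetical.

```python
from pathlib import Path

# Example files named in this README; the real data/ directory may contain more.
EXPECTED_DATA_FILES = ("data/sft_all.jsonl", "data/rl_0.5.jsonl")


def missing_files(repo_root: str, expected=EXPECTED_DATA_FILES) -> list[str]:
    """Return the expected relative paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("all expected files found" if not missing else f"missing: {missing}")
```

Run it from /path/to/PIPO/ so that the relative paths resolve against the repository root.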
We recommend caching the SFT dataset to avoid re-processing it on every run.
bash pipo/dataset/export_cached_dataset.sh data/sft_all.jsonl
This produces files like data/sft_all.jsonl.cache/train.
Running scripts/swift_sft.sh with no arguments reproduces our default SFT setting.
bash scripts/swift_sft.sh
Running scripts/swift_opd.sh with the merged SFT checkpoint reproduces our default OPD setting.
bash scripts/swift_opd.sh outputs/Qwen3.5-4B/sft_mlp_sft_all_65535_0.25_2epochs/checkpoint-1500-merged
Use the 9B teacher to roll out trajectories:
python sglang_eval.py \
--model_path=Qwen/Qwen3.5-9B \
--max_generated_tokens=128000 \
--datasets=dapo_math,codeforces
Convert the rollout results into SFT data:
python pipo/dataset/build_sft_data_on_results_jsonl.py \
outputs/Qwen3.5-9B/regular/4_1.0_0.95_20_0_1.0_1.5_128000/codeforces-results.jsonl \
outputs/Qwen3.5-9B/regular/4_1.0_0.95_20_0_1.0_1.5_128000/dapo_math-results.jsonl
Or into RL/OPD data:
python pipo/dataset/build_rl_data_on_results_jsonl.py \
outputs/Qwen3.5-9B/regular/4_1.0_0.95_20_0_1.0_1.5_128000/codeforces-results.jsonl \
outputs/Qwen3.5-9B/regular/4_1.0_0.95_20_0_1.0_1.5_128000/dapo_math-results.jsonl
Contributions to address any of the following are very welcome.
- Radix cache is disabled when PIPO is enabled. We observe cache-key mismatches between the compressed pair-latent representation and SGLang's prefix hash; the radix cache is therefore force-disabled in the PIPO inference path.
- The prefill stage does not emit a draft token (token2) because of CUDA-graph constraints. A PAD token is automatically inserted after the backbone token (token1).
- enable_memory_saver / SLEEP_LEVEL are unsupported on the SGLang + ms-swift OPD path. OPD therefore needs ~130 GB / GPU, while SFT only needs ~75 GB / GPU.
- Only Qwen3.5 backbones are supported at the moment.
If you use coding agents, have them read .agents/PIPO.md before they start working, so that they understand PIPO and this repository.
If you use this code or dataset, please cite our paper:
@article{tan2026pipo,
title = {Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs},
author = {Tan, Wenhui and Li, Minghao and Ma, Xiaoqian and Fan, Siqi and Huang, Xiusheng and Zhang, Liujie and Song, Ruihua and Chen, Weihang},
journal = {arXiv preprint arXiv:2605.27255},
year = {2026}
}
We thank the authors of SGLang and ms-swift, whose work we have modified and extended for our research.


