Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yunlong Tang¹, Daiki Shimada², Hang Hua³, Chao Huang¹, Jing Bi¹, Rogerio Feris³, Chenliang Xu¹

¹University of Rochester, ²Sony Group Corporation, ³MIT-IBM Watson AI Lab

arXiv Paper · Project Page · Hugging Face Dataset · Hugging Face Model

🌟 News

  • [2025-11-23] Introducing Video-R4, a reinforced video agent with visual rumination for text-rich video reasoning. The arXiv paper has been released. Code, model, and dataset are coming soon.

🚀 Video-R4 Training Framework

📊 Data Curation Pipeline

📈 Performance

📦 Installation

```shell
conda create -n video-r4 python=3.10
conda activate video-r4
git clone https://github.com/yunlong10/Video-R4.git
cd Video-R4
pip install -r requirements.txt
```
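After installation, a quick sanity check can confirm the environment was set up correctly. This is a minimal sketch based only on the `python=3.10` pin in the conda command above; the packages listed in `requirements.txt` are not assumed here:

```python
import sys

def check_python(required=(3, 10)):
    """Verify the interpreter meets the version pinned by the conda env."""
    actual = sys.version_info[:2]
    ok = actual >= required
    print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old'}")
    return ok

if __name__ == "__main__":
    check_python()
```

If this reports an older interpreter, re-activate the `video-r4` environment before installing the requirements.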

📖 Citation

If you find this work useful, please consider citing:

```bibtex
@article{tang2025video-r4,
  title={Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination},
  author={Tang, Yunlong and Shimada, Daiki and Hua, Hang and Huang, Chao and Bi, Jing and Feris, Rogerio and Xu, Chenliang},
  journal={arXiv preprint arXiv:2511.17490},
  year={2025}
}
```

🤝 Acknowledgments

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for their insightful discussions.

We also thank the authors of the following projects for their contributions: