LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

This repo provides the official code for our ICLR 2026 paper: LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards.

💡 Why LongRLVR?

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding—the ability to find and reason over externally provided information.

We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process. LongRLVR addresses this by augmenting the sparse answer reward with a dense, verifiable context reward that directly incentivizes correct evidence selection, providing a robust learning gradient that solves the underlying optimization challenge.

Key takeaways

Diagnosis: Outcome-only RLVR leads to a vanishing learning signal for contextual grounding in long sequences.
Method: Add a verifiable context reward (based on grounding chunk identifiers) alongside answer correctness.
Result: Consistently and significantly improve long-context performance across Qwen and LLaMA models.

Results

LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks.

🚀 Released Artifacts

Models

We release LongRLVR-trained models based on LLaMA and Qwen:

Model	Hugging Face Repo
LLaMA-3.1-8B-LongRLVR	https://huggingface.co/Guanzheng/LLaMA-3.1-8B-LongRLVR
Qwen2.5-7B-LongRLVR	https://huggingface.co/Guanzheng/Qwen2.5-7B-LongRLVR
Qwen2.5-14B-LongRLVR	https://huggingface.co/Guanzheng/Qwen2.5-14B-LongRLVR

Dataset

Our training dataset containing 46K high-quality synthetic QA pairs is available on Hugging Face:

Dataset	Hugging Face Repo
LongRLVR-Data	https://huggingface.co/datasets/Guanzheng/LongRLVR-Data

📁 Repository Structure

data_gen/: The synthetic data generation pipeline to construct the explicit grounding dataset (chunking → clustering → QA generation → judging → final selection). See the Data Generation README for details.
recipe/dapo/: Training entrypoint and LongRLVR reward manager.
- longrl_reward_manager.py: Implements the asynchronous reward computation (F-score context reward and synergistic answer reward) for LongRLVR.
- main_dapo.py: Main training loop utilizing GRPO.
verl/: RL training framework utilities (rollout, PPO/GRPO, distributed training, etc.), customized for long-context generation.

🛠️ Quickstart

Using the Models (Hugging Face)

The LongRLVR models are explicitly trained to identify useful context chunks before answering. You can use standard Hugging Face transformers to interact with them. For the best results, you must chunk your document using the same strategy as our data pipeline (see split_into_chunks in data_gen/clustering.py) and prompt the model to output the <think>, <useful chunks>, and <answer> tags. Note that <think> has been already added into the chat template of the released models.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Guanzheng/Qwen2.5-7B-LongRLVR"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example document that has been split into chunks
document_chunks = [
    "<Chunk_0> Marie Curie was born in Warsaw, Poland...</Chunk_0>",
    "<Chunk_1> The Curies' early research was inspired by Henri Becquerel's 1896 discovery...</Chunk_1>",
    # ... more chunks ...
    "<Chunk_5> In December 1898, they announced the discovery of a second element, 'radium'...</Chunk_0\5>"
]
context = "\n".join(document_chunks)

prompt = f"""{context}

Question: Where was Marie Curie born and what was the second radioactive element she co-discovered?

Output:
"""

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9
)

# The model will output something like:
# <think> Let me find where she was born and the second element... </think>
# <useful chunks> <CHUNK 0>, <CHUNK 5> </useful chunks>
# <answer> Marie Curie was born in Warsaw, Poland, and the second radioactive element she co-discovered was radium. </answer>

response = tokenizer.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(response)

Data Generation Pipeline

We provide a comprehensive pipeline to synthesize the verifiable context-grounded dataset. For a step-by-step tutorial on generating your own data, please see the data_gen/README.md.

Training

We provide example scripts for training. Training is built on verl and uses Ray for distributed execution and vLLM / sglang for fast generation.

Setup your environment: Ensure you have the necessary dependencies installed (e.g., PyTorch, Ray, vLLM/sglang, and other dependencies required by verl). We use official docker image verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2. Feel free to use latest verl to access new features.
Run training: You can use run_longrl.sh as a template. Set the required environment variables:

export TRAIN_FILES="['/path/to/train.parquet']"
export VAL_FILES="['/path/to/val.parquet']"
export MODEL_PATH="/path/to/base-model"
export CKPT_SAVE_PATH="./ckpts/longrlvr_run"

bash run_longrl.sh

Evaluation

We release the evaluation results on LongBench v2, LongReason, and RULER-QA under eval_results folder.

📜 Citation

If you find LongRLVR useful for your research, please cite our paper:

@inproceedings{
  chen2026longrlvr,
  title={Long{RLVR}: Long-Context Reinforcement Learning Requires Verifiable Context Rewards},
  author={Guanzheng Chen and Michael Qizhe Shieh and Lidong Bing},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=omVhYvyTPJ}
}

🤝 Acknowledgements

This project utilizes the verl framework for scalable RL training.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

💡 Why LongRLVR?

Key takeaways

Results

🚀 Released Artifacts

Models

Dataset

📁 Repository Structure

🛠️ Quickstart

Using the Models (Hugging Face)

Data Generation Pipeline

Training

Evaluation

📜 Citation

🤝 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data_gen		data_gen
eval_results		eval_results
recipe		recipe
verl		verl
.gitignore		.gitignore
CITATION.cff		CITATION.cff
README.md		README.md
run_longrl.sh		run_longrl.sh

Folders and files

Latest commit

History

Repository files navigation

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

💡 Why LongRLVR?

Key takeaways

Results

🚀 Released Artifacts

Models

Dataset

📁 Repository Structure

🛠️ Quickstart

Using the Models (Hugging Face)

Data Generation Pipeline

Training

Evaluation

📜 Citation

🤝 Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages