h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

🎉 News • 🔗 Links • 📖 Overview • 📊 Results

✨ Getting Started • 🗂️ Datasets • 🏋️ Training • 📃 Evaluation

📝 Example • 🎈 Citation • 🌻 Acknowledgement • 📧 Contact

Our method, h1, scales long-horizon reasoning by composing existing short-horizon problems into longer, dependency-based sequences. Using a stage-wise curriculum on this synthetic data, we train models with outcome-only rewards to gradually handle more complex reasoning tasks. This approach achieves strong out-of-distribution gains and significantly improves long-horizon reasoning performance without requiring additional supervision.

🎉 News

[2025/10/09] We present h1 [Paper | Code | Dataset].

🔗 Links

📄 [Paper]
💻 [Code]
🗂️ [Dataset]

📖 Overview

Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which is scalable. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We then train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to $2.06\times$. Importantly, our long-horizon improvements are significantly higher than baselines even at high $\textit{pass@k}$, showing that models can learn entirely new reasoning paths under RL. Theoretically, we show that curriculum-based RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, comparable to the gains from dense supervision, while providing strong training signal without additional annotations. $\textit{h1}$ therefore introduces an efficient path towards scaling RL for longer horizon problems using only existing data.

📊 Results

🏆 Main Results

Curriculum-based RL training significantly improves in-domain performance over both the Instruct model and all equal-compute baselines.

Accuracy on GSM8K Problems of Horizon L-n

| Model / setting | L-1 | L-2 | L-3 | L-4 | L-5 | L-6 | L-7 | L-8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instruct model | 82.79 | 35.06 | 20.07 | 6.70 | 3.57 | 0.00 | 0.79 | 0.00 |
| **Equal compute training baselines** | | | | | | | | |
| Only-L1 | 86.80 | 37.14 | 21.43 | 6.70 | 3.87 | 0.25 | 0.00 | 0.00 |
| Uniform-Mix | 82.80 | 12.66 | 2.04 | 0.54 | 0.00 | 0.00 | 0.00 | 0.00 |
| Only-Long | 82.71 | 43.36 | 20.41 | 3.22 | 1.49 | 0.25 | 0.25 | 0.00 |
| **Curriculum training (trained up to Len-n)** | | | | | | | | |
| RLVR | 83.24 | 39.42 | 18.37 | 2.95 | 2.08 | 0.25 | 0.79 | 0.00 |
| Len-2 | 85.92 | 56.22 | 28.57 | 12.06 | 6.25 | 1.26 | 0.79 | 0.49 |
| Len-3 | 84.91 | 56.22 | 37.76 | 15.55 | 8.63 | 3.27 | 3.17 | 0.25 |
| Len-4 | 85.48 | 57.05 | 40.14 | 18.23 | 9.23 | 3.53 | 3.17 | 1.72 |
| Len-5 (H1) | 85.97 (+3.8%) | 58.51 (+66.9%) | 36.39 (+81.3%) | 18.77 (+180.1%) | 9.82 (+175.1%) | 3.53 (++) | 3.17 (+301.3%) | 2.22 (++) |
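
The percentages in the Len-5 row are consistent with relative improvement over the Instruct model. The following minimal sketch reproduces them from the numbers in the table; the helper name `relative_gain` is our own and is not part of the repository, and we assume that "(++)" marks cells where the baseline is 0.00 and a percentage is undefined.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent.

    Returns None when the baseline is zero, where a percentage is undefined
    (assumed to correspond to the "++" cells in the table).
    """
    if baseline == 0:
        return None
    return 100.0 * (new - baseline) / baseline


# Accuracies copied from the table: Instruct model vs. Len-5, L-1 to L-8.
instruct = [82.79, 35.06, 20.07, 6.70, 3.57, 0.00, 0.79, 0.00]
len5 = [85.97, 58.51, 36.39, 18.77, 9.82, 3.53, 3.17, 2.22]

for horizon, (base, new) in enumerate(zip(instruct, len5), start=1):
    gain = relative_gain(base, new)
    label = "++" if gain is None else f"+{gain:.1f}%"
    print(f"L-{horizon}: {base:.2f} -> {new:.2f} ({label})")
```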

🔭 Generalization to harder benchmarks

Performance on harder math benchmarks improves significantly as GSM8K RL curriculum training progresses through its stages. Simple existing data can therefore be bootstrapped to scale RL.

Generalization to Significantly Harder Math Problems

| Model / setting | MATH-500 | Symbolic P1 | Symbolic P2 | MMLU-Pro | AIME 2025 | AIME 2024 |
| --- | --- | --- | --- | --- | --- | --- |
| Instruct model | 64.20 | 67.06 | 43.08 | 58.47 | 1.77 | 5.10 |
| **Standard RLVR on GSM8K** | | | | | | |
| GSM8K RLVR | 66.20 | 71.40 | 47.60 | 60.62 | 2.71 | 6.88 |
| **Curriculum RL on Composed GSM8K Problems** | | | | | | |
| Len-2 GSM8K | 67.00 | 72.86 | 50.80 | 59.73 | 1.25 | 7.19 |
| Len-3 GSM8K | 66.80 | 70.70 | 49.48 | 61.21 | 1.67 | 5.73 |
| Len-4 GSM8K | 68.40 | 72.22 | 51.92 | 60.91 | 2.60 | 10.00 |
| Len-5 GSM8K | 69.20 (+7.8%) | 73.28 (+9.3%) | 52.00 (+20.7%) | 61.21 (+4.7%) | 3.02 (+70.6%) | 10.52 (+106.3%) |

✨ Getting Started

Set up the environment:

conda create -n h1 python=3.12.11
conda activate h1
pip install torch==2.7.1
pip install -r requirements.txt

🗂️ Datasets

📚 GSM-LongHorizon dataset

You can download the full GSM-LongHorizon dataset from Hugging Face: alesiaivanova/GSM-LongHorizon

Using the Hugging Face CLI:

huggingface-cli download alesiaivanova/GSM-LongHorizon --repo-type dataset --local-dir ./GSM-LongHorizon

To convert downloaded files to JSONL format:

python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/train-00000-of-00001.parquet').to_json('GSM-LongHorizon/train.jsonl', orient='records', lines=True)"
python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/test-00000-of-00001.parquet').to_json('GSM-LongHorizon/test.jsonl', orient='records', lines=True)"

To filter only problems for a selected horizon length (e.g., 2):

python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/train.jsonl')]; open('GSM-LongHorizon/train_len_2.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 2)"
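
Since curriculum training needs one file per horizon length, it can be convenient to produce all splits in one pass rather than rerunning the one-liner above. The following is a minimal sketch: the function name `split_by_horizon` is hypothetical, it relies only on the `horizon_length` field used above, and it assumes that field is stored as an integer so that the output names match the `train_len_2.jsonl` pattern.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_horizon(jsonl_path, out_dir, prefix):
    """Write one JSONL file per distinct `horizon_length` value.

    Returns a dict mapping each horizon length to the path that was written.
    """
    groups = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                item = json.loads(line)
                groups[item["horizon_length"]].append(item)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for horizon, items in sorted(groups.items()):
        path = out_dir / f"{prefix}_len_{horizon}.jsonl"
        with open(path, "w") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
        paths[horizon] = path
    return paths


if __name__ == "__main__":
    # Split both converted files if they are present.
    for split in ("train", "test"):
        src = Path("GSM-LongHorizon") / f"{split}.jsonl"
        if src.exists():
            for horizon, path in split_by_horizon(src, "GSM-LongHorizon", split).items():
                print(horizon, path)
```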

🔨 Constructing your own long-horizon dataset

We currently support constructing long-horizon reasoning tasks from atomic problems with numerical answers.

There are two supported methods for dataset construction.

1. Combining problems using generated code solutions

For each atomic problem, a Python code solution is generated using the OpenAI API and the o3 model. When composing long-horizon problems, the exact numerical output of a subproblem is used to replace one of the parameters in the subsequent subproblem. The final answer for the chained problem is obtained by executing the generated Python code end-to-end, passing the output of each function as the input argument to the next.
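
To make the chaining idea concrete, here is a toy sketch. The three atomic problems, their solver functions, and `run_chain` are made up for illustration and are not taken from the repository or from GSM8K. Each solver stands in for a generated code solution with one parameter left open, and the chain feeds each output into the next solver.

```python
def pens_cost(price_per_pen):
    """Atomic problem 1: 'Pens cost <price> dollars each. How much do 4 pens cost?'"""
    return 4 * price_per_pen


def marbles_left(marbles):
    """Atomic problem 2: 'Tom has <X> marbles and gives away 5. How many are left?'"""
    return marbles - 5


def stickers_total(children):
    """Atomic problem 3: 'Each of <Y> children gets 2 stickers. How many in total?'"""
    return 2 * children


def run_chain(solvers, first_input):
    """Execute the solvers end-to-end, passing each output to the next solver."""
    value = first_input
    for solve in solvers:
        value = solve(value)
    return value


# Horizon-3 chain: the answer of each subproblem replaces a parameter of the next.
final_answer = run_chain([pens_cost, marbles_left, stickers_total], 3)
print(final_answer)  # 4*3 = 12, then 12-5 = 7, then 2*7 = 14
```

Only `final_answer` would be checked by an outcome-only reward; the intermediate values 12 and 7 are never supervised.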

Steps:

Generate Python code solutions for each atomic problem using the OpenAI API and o3 model (requires a valid OpenAI API token):
```
python solution_code_generation.py <input_file> <output_file> [--num_samples <int>]
```
Arguments:
- input_file: Path to the input JSONL file containing atomic problems (must include "question" and "answer" fields, "answer" should be in the same format as in GSM8K).
- output_file: Path where the updated dataset including generated code solutions will be saved.
- num_samples: (Optional) Number of samples to process from the input file (default: 1000).

Generate updated problem statements with some numbers replaced by variables:
```
python replace_question_numbers_by_variables.py <input_file> <output_file>
```
Arguments:
- input_file: Path to the input JSONL file containing atomic problems (must include "question" and "answer" fields, "answer" should be in the same format as in GSM8K).
- output_file: Path where the updated dataset including generated code solutions will be saved.
- num_samples: (Optional) Number of samples to process from the input file (default: 1000).

Combine atomic problems into multi-step reasoning chains:
```
python combine_questions_into_long_horizon_resoning.py \
--dataset <path_to_atomic_dataset> \
--output_dataset <path_to_output> \
--num_subproblems <int> \
[--num_repetitions <int>] \
[--seed <int>] \
[--only_int_answers] \
[--only_small_int_answers]
```
Arguments:
- dataset: Path to the dataset containing atomic problems (can be a Hugging Face dataset or a local directory).
- output_dataset: Output path for the constructed long-horizon dataset.
- num_subproblems: Number of subproblems per long-horizon instance.
- num_repetitions: (Optional) Number of attempts to use each atomic problem as a subproblem. A problem is not guaranteed to be used exactly this many times, because type and bound consistency constraints apply during problem chaining.
- seed: (Optional) Random seed.
- only_int_answers: (Optional) If set, only problems whose final answer is an integer are retained (by default, floating-point answers are allowed).
- only_small_int_answers: (Optional) If set, only problems with integer final answers in the range [-1000, 1000] are included.
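
The two answer filters above can be read as simple predicates on the final answer of a chained problem. The sketch below is our interpretation of the flag descriptions, not the script's actual code; the function names and the floating-point tolerance are assumptions.

```python
def is_int_answer(value, tol=1e-9):
    """True if `value` is an integer up to a small tolerance (tolerance is assumed)."""
    return abs(value - round(value)) <= tol


def keep_answer(value, only_int=False, only_small_int=False):
    """Decide whether a chained problem with final answer `value` is retained."""
    if only_small_int:
        # Integer answers restricted to the closed range [-1000, 1000].
        return is_int_answer(value) and -1000 <= round(value) <= 1000
    if only_int:
        return is_int_answer(value)
    return True  # default: floating-point answers are allowed


answers = [12.0, 12.5, 1500.0, -1000.0]
print([a for a in answers if keep_answer(a, only_int=True)])
print([a for a in answers if keep_answer(a, only_small_int=True)])
```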

2. Combining problems using only numerical answers

In this mode, the dataset of atomic problems needs to include only the following fields: "problem" and "answer".

To construct the long-horizon dataset, run:

python combine_questions_without_code_generation.py \
    --input_file <path_to_input_jsonl> \
    --output_file <path_to_output_jsonl> \
    --num_subproblems <int> \
    [--num_repetitions <int>] \
    [--seed <int>]

Arguments:

input_file: Path to the input JSONL file containing Hendrycks Math problems.
output_file: Path where the combined long-horizon problems will be saved.
num_subproblems: Number of atomic subproblems to combine into each long-horizon instance.
num_repetitions: (Optional) Number of times to repeat the construction process (default: 1).
seed: (Optional) Random seed for reproducibility (default: 42).

Combining multiple datasets

You can combine multiple datasets (concatenate or shuffle):

python combine_datasets.py \
    --dataset1 <path_to_first_dataset> \
    [--dataset2 <path_to_second_dataset>] \
    --output <output_path> \
    [--samples1 <int>] \
    [--samples2 <int>] \
    [--start_idx_1 <int>] \
    [--start_idx_2 <int>] \
    [--strategy <stack|shuffle>] \
    [--num_repetitions <int>] \
    [--seed <int>]

Arguments:

dataset1: Path to the first dataset (JSONL file).
dataset2: (Optional) Path to the second dataset (JSONL file).
output: Output file path for the combined dataset.
samples1: (Optional) Number of samples to take from the first dataset (default: all).
samples2: (Optional) Number of samples to take from the second dataset (default: all).
start_idx_1: (Optional) Start index for the first dataset (default: 0).
start_idx_2: (Optional) Start index for the second dataset (default: 0).
strategy: (Optional) Combination strategy: "stack" (concatenate the datasets) or "shuffle" (shuffle the combined dataset) (default: stack).
num_repetitions: (Optional) Number of times to repeat the combined dataset (default: 1).
seed: (Optional) Random seed for reproducibility (default: 42).
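
The arguments above suggest the following behaviour, sketched here on in-memory lists. This is an illustration under assumptions, not the code of `combine_datasets.py`: in particular, we assume that repetition is applied before shuffling and that slicing starts at `start_idx` and takes `samples` rows.

```python
import random


def combine(first, second=None, samples1=None, samples2=None,
            start_idx_1=0, start_idx_2=0, strategy="stack",
            num_repetitions=1, seed=42):
    """Slice, concatenate, repeat, and optionally shuffle two lists of records."""

    def take(rows, start, count):
        end = None if count is None else start + count
        return list(rows[start:end])

    combined = take(first, start_idx_1, samples1)
    if second is not None:
        combined += take(second, start_idx_2, samples2)

    combined = combined * num_repetitions  # assumed order: repeat, then shuffle

    if strategy == "shuffle":
        random.Random(seed).shuffle(combined)  # seeded for reproducibility
    elif strategy != "stack":
        raise ValueError(f"unknown strategy: {strategy}")
    return combined


print(combine([1, 2, 3], [4, 5], samples1=2, start_idx_1=1))  # [2, 3, 4, 5]
```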

🏋️ Training

⚠️WARNING⚠️: The Python executor in this repository is very raw and intended for research purposes only. It is not secure for production environments. We plan to update our executor to more secure implementations in the future. Your use of our code is at your own discretion and risk.

For training, we employ DrGRPO (unbiased Group Relative Policy Optimization).
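
As a rough illustration of the group-relative idea, the sketch below computes per-completion advantages from outcome rewards within one group of sampled completions. It is not taken from `grpo.py`. It reflects our understanding that the Dr. GRPO formulation centers rewards on the group mean without dividing by the group standard deviation; the optional flag shows the standard-GRPO normalization for comparison, and the function name is hypothetical.

```python
def group_relative_advantages(rewards, normalize_std=False, eps=1e-6):
    """Advantages for one prompt's group of completions.

    normalize_std=False: mean-centered rewards only (Dr. GRPO style, as we understand it).
    normalize_std=True:  additionally divide by the group standard deviation (GRPO style).
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    variance = sum(c * c for c in centered) / len(rewards)
    std = variance ** 0.5
    return [c / (std + eps) for c in centered]


# Outcome-only rewards: 1 if the final answer is correct, 0 otherwise.
print(group_relative_advantages([1, 0, 0, 0]))  # [0.75, -0.25, -0.25, -0.25]
print(group_relative_advantages([0, 0, 0, 0]))  # all zeros: no learning signal
```

The second call shows why a group in which every completion fails contributes no gradient, which is one way to see the motivation for starting the curriculum at short horizons.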

python grpo.py \
    --model <model_name_or_path> \
    --dataset <path_to_dataset> \
    --output_dir <path_to_output_dir> \
    [--run_name <name>] \
    [--learning_rate <float>] \
    [--max_steps <int>] \
    [--per_device_train_batch_size <int>] \
    [--num_generations <int>] \
    [--seed <int>] \
    [other optional arguments...]

Arguments:

model: Hugging Face model repository ID or local model path.
dataset: Path to the training dataset.
output_dir: Directory where training outputs and checkpoints are saved.
start_idx: (Optional) Start index for slicing the dataset.
end_idx: (Optional) End index for slicing the dataset.
run_name: (Optional) Identifier for the training run.
learning_rate: (Optional) Learning rate for optimization (default: 5e-6).
gradient_accumulation_steps: (Optional) Number of gradient accumulation steps (default: 16).
max_steps: (Optional) Maximum number of training steps.
shuffle_dataset: (Optional) If set, shuffle the dataset before training (default: False).
per_device_train_batch_size: (Optional) Batch size per device (default: 1).
num_generations: (Optional) Number of generations per training step (default: 16).
warmup_ratio: (Optional) Fraction of steps for learning rate warm-up (default: 0.1).
max_prompt_length: (Optional) Maximum length of the input prompt (default: 512).
max_completion_length: (Optional) Maximum length of the generated completion (default: 1536). For GSM-LongHorizon we recommend using 768 for horizon 1, 1024 for horizon 2, 1280 for horizon 3, 1536 for horizons 4 and 5.
seed: (Optional) Random seed for reproducibility (default: 42).
save_steps: (Optional) Save model checkpoint every N steps.
save_total_limit: (Optional) Maximum number of checkpoints to keep.
logging_steps: (Optional) Frequency (in steps) of training log updates (default: 1).
float_reward_func: (Optional) If set, provides a format reward for generating answers in floating-point format; otherwise, rewards integer format answers (default: False).

📃 Evaluation

✏️ GSM Evaluation

To evaluate models on GSM-style datasets, run:

python gsm_eval.py \
  --models <model_name_or_path> \
  --datasets <path_to_dataset> \
  [--out_file <path_to_output>] \
  [--tp <int>] \
  [--num_samples <int>] \
  [--seed <int>] \
  [--start_idx <int>] \
  [--temperatures <floats...>] \
  [--num_generations <int>] \
  [--max_new_tokens <int>] \
  [--boxed_system_prompt] \
  [--top_p <float>] \
  [--instruct]

Arguments:

models: One or more Hugging Face model repositories, local model paths, or "random" for random baselines.
datasets: One or more dataset paths for evaluation.
out_file: (Optional) Path to the output file where evaluation results will be saved (default: None).
tp: (Optional) Number of GPUs used for vLLM tensor parallelism (default: 1).
num_samples: (Optional) Number of samples to evaluate (default: None).
seed: (Optional) Random seed for reproducibility (default: 42).
start_idx: (Optional) Starting index for evaluation (default: 0).
temperatures: (Optional) List of sampling temperatures (default: [0.0]).
num_generations: (Optional) Number of generations per prompt (default: 1).
max_new_tokens: (Optional) Maximum number of new tokens to generate (default: 2048).
boxed_system_prompt: (Optional) Use boxed formatting for the system prompt (default: False).
top_p: (Optional) Top-p nucleus sampling parameter (default: 1.0).
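
When `num_generations` is greater than 1 (which presumably requires a non-zero temperature to obtain distinct samples), results are often summarized as pass@k. We do not know whether `gsm_eval.py` reports this metric itself; the sketch below is the commonly used unbiased estimator from the code-generation literature, applied to per-problem counts of correct generations, with a hypothetical function name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generations sampled, c: number of correct generations,
    k: budget of attempts. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct answer out of four generations.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```

Averaging this quantity over all problems in a dataset gives the dataset-level pass@k.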

🧮 Math Evaluation

For more complex mathematical datasets (e.g., MATH-500), run:

python math_eval.py \
  --models <model_name_or_path> \
  --datasets <dataset_name_or_path> \
  [--out_file <path_to_output>] \
  [--tp <int>] \
  [--num_samples <int>] \
  [--seed <int>] \
  [--start_idx <int>] \
  [--temperatures <floats...>] \
  [--num_generations <int>] \
  [--max_new_tokens <int>] \
  [--top_p <float>] \
  [--top_k <int>] \
  [--boxed_system_prompt] \
  [--llama_system_prompt] \
  [--instruct]

Arguments:

models: One or more Hugging Face model repositories or local model paths.
datasets: Dataset(s) to evaluate on — supports "math500" or a custom dataset path.
out_file: (Optional) Path to the output file for saving evaluation results (default: None).
tp: (Optional) Number of GPUs used for vLLM tensor parallelism (default: 1).
num_samples: (Optional) Number of samples to evaluate (default: None).
seed: (Optional) Random seed for reproducibility (default: 42).
start_idx: (Optional) Starting index for evaluation (default: 0).
temperatures: (Optional) List of sampling temperatures (default: [0.0]).
num_generations: (Optional) Number of generations per prompt (default: 1).
max_new_tokens: (Optional) Maximum number of new tokens generated by vLLM (default: 8192).
top_p: (Optional) Top-p nucleus sampling parameter (default: 1.0).
top_k: (Optional) Top-k sampling parameter (default: 0).
boxed_system_prompt: (Optional) Use boxed formatting for the system prompt (default: False).
llama_system_prompt: (Optional) Use the LLaMA-style system prompt format (default: False).

📝 Example

To train on Qwen2.5 3B Instruct with GSM-LongHorizon data and achieve performance similar to our main results, follow these guidelines.

Download the GSM-LongHorizon dataset and split it by horizon length.

huggingface-cli download alesiaivanova/GSM-LongHorizon --repo-type dataset --local-dir ./GSM-LongHorizon

python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/train-00000-of-00001.parquet').to_json('GSM-LongHorizon/train.jsonl', orient='records', lines=True)"

python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/test-00000-of-00001.parquet').to_json('GSM-LongHorizon/test.jsonl', orient='records', lines=True)"

python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/train.jsonl')]; open('GSM-LongHorizon/train_len_1.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 1)"

python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/test.jsonl')]; open('GSM-LongHorizon/test_len_1.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 1)"

Train the model on horizon length 1 examples.

python grpo.py --model Qwen/Qwen2.5-3B-Instruct --dataset GSM-LongHorizon/train_len_1.jsonl --output_dir checkpoints/qwen-3b-grpo-len-1 --max_steps 300 --max_completion_length 768 --save_steps 50 --warmup_ratio 0.1 --seed 5

(Optional: Try 2–3 different random seeds because training can be noisy; the first ~200 steps are usually enough for saturation.)

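Evaluate the saved checkpoints on horizon lengths 1, 2, and 3, and choose the overall best checkpoint. The command below evaluates one checkpoint on horizon 1; repeat it for the other checkpoints and for the test splits of the other horizon lengths.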

python gsm_eval.py --models checkpoints/qwen-3b-grpo-len-1/checkpoint-200 --datasets GSM-LongHorizon/test_len_1.jsonl

Use the best horizon-1 checkpoint as the initialization for training on horizon length 2.

python grpo.py --model checkpoints/qwen-3b-grpo-len-1-best --dataset GSM-LongHorizon/train_len_2.jsonl --output_dir checkpoints/qwen-3b-grpo-len-2 --max_steps 300 --max_completion_length 1024 --save_steps 50 --warmup_ratio 0.1 --float_reward_func

Repeat for further horizon lengths. For GSM-LongHorizon, we recommend setting the maximum completion length to 768 tokens for horizon 1, 1024 tokens for horizon 2, 1280 tokens for horizon 3, and 1536 tokens for horizons 4 and 5.
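
The stage-by-stage recipe can be written down as a small command builder. This is a sketch under assumptions rather than a script from the repository: it only prints the `grpo.py` commands and does not run them, it assumes you copy the best checkpoint of each stage to `checkpoints/qwen-3b-grpo-len-<n>-best` (the naming used above), and it assumes `--float_reward_func` is kept for every horizon from 2 onward, as in the horizon-2 command.

```python
import shlex

# Recommended maximum completion length per horizon, from the text above.
COMPLETION_LENGTH = {1: 768, 2: 1024, 3: 1280, 4: 1536, 5: 1536}


def stage_command(horizon, base_model="Qwen/Qwen2.5-3B-Instruct",
                  data_dir="GSM-LongHorizon", ckpt_dir="checkpoints"):
    """Build the grpo.py command for one curriculum stage as an argument list."""
    if horizon == 1:
        model = base_model
    else:
        # Assumed location of the manually selected best checkpoint.
        model = f"{ckpt_dir}/qwen-3b-grpo-len-{horizon - 1}-best"
    cmd = [
        "python", "grpo.py",
        "--model", model,
        "--dataset", f"{data_dir}/train_len_{horizon}.jsonl",
        "--output_dir", f"{ckpt_dir}/qwen-3b-grpo-len-{horizon}",
        "--max_steps", "300",
        "--max_completion_length", str(COMPLETION_LENGTH[horizon]),
        "--save_steps", "50",
        "--warmup_ratio", "0.1",
    ]
    if horizon == 1:
        cmd += ["--seed", "5"]  # seed used in the horizon-1 command above
    else:
        cmd.append("--float_reward_func")  # assumed for all horizons >= 2
    return cmd


for horizon in range(1, 6):
    print(shlex.join(stage_command(horizon)))
```

Checkpoint selection between stages remains a manual step, so the commands are meant to be run one at a time.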

🎈 Citation

If you find h1 helpful, please cite us.

@misc{motwani2025h1bootstrappingllmsreason,
      title={h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning}, 
      author={Sumeet Ramesh Motwani and Alesia Ivanova and Ziyang Cai and Philip Torr and Riashat Islam and Shital Shah and Christian Schroeder de Witt and Charles London},
      year={2025},
      eprint={2510.07312},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.07312}, 
}

🌻 Acknowledgement

Our DrGRPO training framework builds upon components from willccbb/verifiers. For inference, we used vLLM. We thank the authors of these projects for their excellent open-source contributions.

📧 Contact

Feel free to contact Alesia Ivanova via email: alesia.ivanova@st-hildas.ox.ac.uk
