h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

Our method, h1, scales long-horizon reasoning by composing existing short-horizon problems into longer, dependency-based sequences. Using a stage-wise curriculum on this synthetic data, we train models with outcome-only rewards to gradually handle more complex reasoning tasks. This approach achieves strong out-of-distribution gains and significantly improves long-horizon reasoning performance without requiring additional supervision.

📖 Overview


Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which is scalable. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We then train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to $2.06\times$. Importantly, our long-horizon improvements are significantly higher than baselines even at high $\textit{pass@k}$, showing that models can learn entirely new reasoning paths under RL. Theoretically, we show that curriculum-based RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, comparable to the gains from dense supervision, while providing strong training signal without additional annotations. $\textit{h1}$ therefore introduces an efficient path towards scaling RL for longer horizon problems using only existing data.

📊 Results


🏆 Main Results

Curriculum-based RL training significantly improves in-domain performance compared to the Instruct model and all other equal compute baselines.

Accuracy on GSM8K Problems of Horizon L-n

| Model / setting | L-1 | L-2 | L-3 | L-4 | L-5 | L-6 | L-7 | L-8 |
|---|---|---|---|---|---|---|---|---|
| Instruct model | 82.79 | 35.06 | 20.07 | 6.70 | 3.57 | 0.00 | 0.79 | 0.00 |
| Equal compute training baselines | | | | | | | | |
| Only-L1 | 86.80 | 37.14 | 21.43 | 6.70 | 3.87 | 0.25 | 0.00 | 0.00 |
| Uniform-Mix | 82.80 | 12.66 | 2.04 | 0.54 | 0.00 | 0.00 | 0.00 | 0.00 |
| Only-Long | 82.71 | 43.36 | 20.41 | 3.22 | 1.49 | 0.25 | 0.25 | 0.00 |
| Curriculum training (trained up to Len-n) | | | | | | | | |
| RLVR | 83.24 | 39.42 | 18.37 | 2.95 | 2.08 | 0.25 | 0.79 | 0.00 |
| Len-2 | 85.92 | 56.22 | 28.57 | 12.06 | 6.25 | 1.26 | 0.79 | 0.49 |
| Len-3 | 84.91 | 56.22 | 37.76 | 15.55 | 8.63 | 3.27 | 3.17 | 0.25 |
| Len-4 | 85.48 | 57.05 | 40.14 | 18.23 | 9.23 | 3.53 | 3.17 | 1.72 |
| Len-5 (H1) | 85.97 (+3.8%) | 58.51 (+66.9%) | 36.39 (+81.3%) | 18.77 (+180.1%) | 9.82 (+175.1%) | 3.53 (++) | 3.17 (+301.3%) | 2.22 (++) |
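The percentages in the last row are consistent with relative gains over the Instruct model. The sketch below recomputes them from the table values; `relative_gain` is a hypothetical helper, and reading "(++)" as "baseline is 0.00, so the relative gain is undefined" is an assumption based on the two columns where it appears.

```python
def relative_gain(baseline: float, value: float) -> str:
    """Format the relative improvement of `value` over `baseline`."""
    if baseline == 0.0:
        # Assumption: the table prints "(++)" when the baseline accuracy is zero.
        return "(++)"
    return f"({(value - baseline) / baseline * 100:+.1f}%)"


# Accuracies for L-1 ... L-8, copied from the table above.
instruct = [82.79, 35.06, 20.07, 6.70, 3.57, 0.00, 0.79, 0.00]
len5 = [85.97, 58.51, 36.39, 18.77, 9.82, 3.53, 3.17, 2.22]

gains = [relative_gain(b, v) for b, v in zip(instruct, len5)]
print(gains)
```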

🔭 Generalization to harder benchmarks

Performance on harder math benchmarks improves significantly with GSM8K RL curriculum training stages. Bootstrapping simple existing data can be used for scaling RL.

Generalization to Significantly Harder Math Problems

| Model / setting | MATH-500 | Symbolic P1 | Symbolic P2 | MMLU-Pro | AIME 2025 | AIME 2024 |
|---|---|---|---|---|---|---|
| Instruct model | 64.20 | 67.06 | 43.08 | 58.47 | 1.77 | 5.10 |
| Standard RLVR on GSM8K | | | | | | |
| GSM8K RLVR | 66.20 | 71.40 | 47.60 | 60.62 | 2.71 | 6.88 |
| Curriculum RL on Composed GSM8K Problems | | | | | | |
| Len-2 GSM8K | 67.00 | 72.86 | 50.80 | 59.73 | 1.25 | 7.19 |
| Len-3 GSM8K | 66.80 | 70.70 | 49.48 | 61.21 | 1.67 | 5.73 |
| Len-4 GSM8K | 68.40 | 72.22 | 51.92 | 60.91 | 2.60 | 10.00 |
| Len-5 GSM8K | 69.20 (+7.8%) | 73.28 (+9.3%) | 52.00 (+20.7%) | 61.21 (+4.7%) | 3.02 (+70.6%) | 10.52 (+106.3%) |

✨ Getting Started


Set up the environment:

conda create -n h1 python=3.12.11
conda activate h1
pip install torch==2.7.1
pip install -r requirements.txt

🗂️ Datasets


📚 GSM-LongHorizon dataset

You can download the full GSM-LongHorizon dataset from Hugging Face: alesiaivanova/GSM-LongHorizon

Using the Hugging Face CLI:

huggingface-cli download alesiaivanova/GSM-LongHorizon --repo-type dataset --local-dir ./GSM-LongHorizon

To convert downloaded files to JSONL format:

python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/train-00000-of-00001.parquet').to_json('GSM-LongHorizon/train.jsonl', orient='records', lines=True)"
python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/test-00000-of-00001.parquet').to_json('GSM-LongHorizon/test.jsonl', orient='records', lines=True)"

To filter only problems for a selected horizon length (e.g., 2):

python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/train.jsonl')]; open('GSM-LongHorizon/train_len_2.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 2)"
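If you need a separate file for every horizon length rather than just one, the one-liner above can be generalized. This is a minimal sketch, not a script shipped with the repository: `split_by_horizon` is a hypothetical helper, and it assumes every record carries the `horizon_length` field used in the filter above.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_horizon(jsonl_path, prefix):
    """Write one `<prefix>_len_<n>.jsonl` file per horizon length.

    Returns a dict mapping each horizon length to its number of problems.
    """
    buckets = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                item = json.loads(line)
                buckets[item["horizon_length"]].append(item)

    for length, items in buckets.items():
        out_path = Path(f"{prefix}_len_{length}.jsonl")
        with out_path.open("w", encoding="utf-8") as f:
            f.writelines(json.dumps(item) + "\n" for item in items)

    return {length: len(items) for length, items in sorted(buckets.items())}


if __name__ == "__main__":
    source = Path("GSM-LongHorizon/train.jsonl")
    if source.exists():
        # Produces GSM-LongHorizon/train_len_1.jsonl, train_len_2.jsonl, ...
        print(split_by_horizon(source, "GSM-LongHorizon/train"))
```

The output file names follow the `train_len_<n>.jsonl` pattern used in the rest of this README.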

🔨 Constructing your own long-horizon dataset

We currently support constructing long-horizon reasoning tasks from atomic problems with numerical answers.

There are two supported methods for dataset construction.

1. Combining problems using generated code solutions

For each atomic problem, a Python code solution is generated using the OpenAI API and the o3 model. When composing long-horizon problems, the exact numerical output of a subproblem is used to replace one of the parameters in the subsequent subproblem. The final answer for the chained problem is obtained by executing the generated Python code end-to-end, passing the output of each function as the input argument to the next.
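To make the chaining idea concrete, here is a toy sketch with two hand-written solution functions. Everything in it is illustrative: the function names, the problems, and the `(function, slot, fixed arguments)` chain format are hypothetical and do not reflect the repository's actual data format or the code generated by o3.

```python
def apples_left(start, eaten):
    """Atomic problem 1: apples remaining after some are eaten."""
    return start - eaten


def total_cost(num_items, price):
    """Atomic problem 2: cost of buying `num_items` at `price` each."""
    return num_items * price


def solve_chain(steps):
    """Execute solutions in order, feeding each answer into the next problem.

    Each step is (function, slot, fixed_kwargs). `slot` names the parameter
    that is replaced by the previous answer (None for the first step).
    """
    answer = None
    for func, slot, fixed_kwargs in steps:
        kwargs = dict(fixed_kwargs)
        if slot is not None:
            kwargs[slot] = answer  # output of the previous subproblem
        answer = func(**kwargs)
    return answer


chain = [
    (apples_left, None, {"start": 12, "eaten": 5}),  # answer: 7
    (total_cost, "num_items", {"price": 3}),         # uses 7 as num_items
]
print(solve_chain(chain))  # 21
```

Only the final value (21 here) would serve as the ground-truth answer of the composed problem, which matches the outcome-only reward setting described in the overview.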

Steps:

  1. Generate Python code solutions for each atomic problem using the OpenAI API and o3 model (requires a valid OpenAI API token):

    python solution_code_generation.py <input_file> <output_file> [--num_samples <int>]

    Arguments:

    • input_file: Path to the input JSONL file containing atomic problems (must include "question" and "answer" fields, "answer" should be in the same format as in GSM8K).
    • output_file: Path where the updated dataset including generated code solutions will be saved.
    • num_samples: (Optional) Number of samples to process from the input file (default: 1000).
  2. Generate updated problem statements with some numbers replaced by variables:

    python replace_question_numbers_by_variables.py <input_file> <output_file>

    Arguments:

    • input_file: Path to the input JSONL file containing atomic problems (must include "question" and "answer" fields, "answer" should be in the same format as in GSM8K).
    • output_file: Path where the updated dataset including generated code solutions will be saved.
    • num_samples: (Optional) Number of samples to process from the input file (default: 1000).
  3. Combine atomic problems into multi-step reasoning chains:

    python combine_questions_into_long_horizon_reasoning.py \
    --dataset <path_to_atomic_dataset> \
    --output_dataset <path_to_output> \
    --num_subproblems <int> \
    [--num_repetitions <int>] \
    [--seed <int>] \
    [--only_int_answers] \
    [--only_small_int_answers]

    Arguments:

    • dataset: Path to the dataset containing atomic problems (can be a Hugging Face dataset or a local directory).
    • output_dataset: Output path for the constructed long-horizon dataset.
    • num_subproblems: Number of subproblems per long-horizon instance.
    • num_repetitions: (Optional) Number of attempts to use each atomic problem as a subproblem. Because of type and bound consistency constraints during problem chaining, a problem is not guaranteed to be used exactly this many times.
    • seed: (Optional) Random seed.
    • only_int_answers: (Optional) If set, only problems whose final answer is an integer are retained (by default, floating-point answers are allowed).
    • only_small_int_answers: (Optional) If set, only problems with integer final answers in the range [-1000, 1000] are included.

2. Combining problems using only numerical answers

In this mode, the dataset of atomic problems needs to include only the following fields: "problem" and "answer".

To construct the long-horizon dataset, run:

python combine_questions_without_code_generation.py \
    --input_file <path_to_input_jsonl> \
    --output_file <path_to_output_jsonl> \
    --num_subproblems <int> \
    [--num_repetitions <int>] \
    [--seed <int>]

Arguments:

  • input_file: Path to the input JSONL file containing Hendrycks Math problems.
  • output_file: Path where the combined long-horizon problems will be saved.
  • num_subproblems: Number of atomic subproblems to combine into each long-horizon instance.
  • num_repetitions: (Optional) Number of times to repeat the construction process (default: 1).
  • seed: (Optional) Random seed for reproducibility (default: 42).

Combining multiple datasets

You can combine multiple datasets (concatenate or shuffle):

python combine_datasets.py \
    --dataset1 <path_to_first_dataset> \
    [--dataset2 <path_to_second_dataset>] \
    --output <output_path> \
    [--samples1 <int>] \
    [--samples2 <int>] \
    [--start_idx_1 <int>] \
    [--start_idx_2 <int>] \
    [--strategy <stack|shuffle>] \
    [--num_repetitions <int>] \
    [--seed <int>]

Arguments:

  • dataset1: Path to the first dataset (JSONL file).
  • dataset2: (Optional) Path to the second dataset (JSONL file).
  • output: Output file path for the combined dataset.
  • samples1: (Optional) Number of samples to take from the first dataset (default: all).
  • samples2: (Optional) Number of samples to take from the second dataset (default: all).
  • start_idx_1: (Optional) Start index for the first dataset (default: 0).
  • start_idx_2: (Optional) Start index for the second dataset (default: 0).
  • strategy: (Optional) Combination strategy: "stack" (concatenate the datasets) or "shuffle" (shuffle the combined dataset) (default: stack).
  • num_repetitions: (Optional) Number of times to repeat the combined dataset (default: 1).
  • seed: (Optional) Random seed for reproducibility (default: 42).
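The difference between the two strategies can be sketched in a few lines. This is not the code of `combine_datasets.py`; `combine` is a hypothetical function, and the order in which repetition and shuffling are applied is an assumption.

```python
import random


def combine(first, second=(), strategy="stack", num_repetitions=1, seed=42):
    """Concatenate two lists of records, optionally repeat and shuffle them."""
    if strategy not in ("stack", "shuffle"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: repetition is applied to the concatenated data.
    combined = (list(first) + list(second)) * num_repetitions
    if strategy == "shuffle":
        # A dedicated Random instance keeps the result reproducible per seed.
        random.Random(seed).shuffle(combined)
    return combined


short = [{"id": "a1"}, {"id": "a2"}]
long_ = [{"id": "b1"}, {"id": "b2"}, {"id": "b3"}]

stacked = combine(short, long_, strategy="stack")
shuffled = combine(short, long_, strategy="shuffle", seed=0)
print([r["id"] for r in stacked])
print([r["id"] for r in shuffled])
```

With "stack", the training run sees all records of the first dataset before any of the second; with "shuffle", the two sources are interleaved.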

🏋️ Training


⚠️ WARNING ⚠️: The Python executor in this repository is minimal and intended for research purposes only. It is not secure for production environments. We plan to replace it with a more secure implementation in the future. Your use of our code is at your own discretion and risk.

For training, we employ DrGRPO (unbiased Group Relative Policy Optimization).

python grpo.py \
    --model <model_name_or_path> \
    --dataset <path_to_dataset> \
    --output_dir <path_to_output_dir> \
    [--run_name <name>] \
    [--learning_rate <float>] \
    [--max_steps <int>] \
    [--per_device_train_batch_size <int>] \
    [--num_generations <int>] \
    [--seed <int>] \
    [other optional arguments...]

Arguments:

  • model: Hugging Face model repository ID or local model path.
  • dataset: Path to the training dataset.
  • output_dir: Directory where training outputs and checkpoints are saved.
  • start_idx: (Optional) Start index for slicing the dataset.
  • end_idx: (Optional) End index for slicing the dataset.
  • run_name: (Optional) Identifier for the training run.
  • learning_rate: (Optional) Learning rate for optimization (default: 5e-6).
  • gradient_accumulation_steps: (Optional) Number of gradient accumulation steps (default: 16).
  • max_steps: (Optional) Maximum number of training steps.
  • shuffle_dataset: (Optional) If set, shuffle the dataset before training (default: False).
  • per_device_train_batch_size: (Optional) Batch size per device (default: 1).
  • num_generations: (Optional) Number of generations per training step (default: 16).
  • warmup_ratio: (Optional) Fraction of steps for learning rate warm-up (default: 0.1).
  • max_prompt_length: (Optional) Maximum length of the input prompt (default: 512).
  • max_completion_length: (Optional) Maximum length of the generated completion (default: 1536). For GSM-LongHorizon we recommend using 768 for horizon 1, 1024 for horizon 2, 1280 for horizon 3, 1536 for horizons 4 and 5.
  • seed: (Optional) Random seed for reproducibility (default: 42).
  • save_steps: (Optional) Save model checkpoint every N steps.
  • save_total_limit: (Optional) Maximum number of checkpoints to keep.
  • logging_steps: (Optional) Frequency (in steps) of training log updates (default: 1).
  • float_reward_func: (Optional) If set, provides a format reward for generating answers in floating-point format; otherwise, rewards integer format answers (default: False).
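For intuition about the training signal, the sketch below computes group-relative advantages in the Dr. GRPO style, where each reward is centered on the group mean and, unlike in standard GRPO, is not divided by the group standard deviation. It is a simplified illustration and not the code in `grpo.py`; `dr_grpo_advantages` is a hypothetical name, and the binary outcome reward is an assumption.

```python
from statistics import mean


def dr_grpo_advantages(rewards):
    """Center each reward on the group mean (no std normalization)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# One prompt, 16 sampled completions (the default num_generations),
# of which 2 reach the correct final answer.
rewards = [1.0] * 2 + [0.0] * 14
advantages = dr_grpo_advantages(rewards)
print(advantages[0], advantages[-1])  # 0.875 -0.125

# If no completion is correct, every advantage is zero and the prompt
# contributes no gradient signal.
print(dr_grpo_advantages([0.0] * 16))
```

The all-zero case illustrates why starting directly on long horizons is hard: when the initial accuracy is close to 0%, as for the Instruct model at L-6 and beyond in the results table, most groups carry no signal, whereas a curriculum starts where some completions already succeed.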

📃 Evaluation


✍️ GSM Evaluation

To evaluate models on GSM-style datasets, run:

python gsm_eval.py \
  --models <model_name_or_path> \
  --datasets <path_to_dataset> \
  [--out_file <path_to_output>] \
  [--tp <int>] \
  [--num_samples <int>] \
  [--seed <int>] \
  [--start_idx <int>] \
  [--temperatures <floats...>] \
  [--num_generations <int>] \
  [--max_new_tokens <int>] \
  [--boxed_system_prompt] \
  [--top_p <float>] \
  [--instruct]

Arguments:

  • models: One or more Hugging Face model repositories, local model paths, or "random" for random baselines.
  • datasets: One or more dataset paths for evaluation.
  • out_file: (Optional) Path to the output file where evaluation results will be saved (default: None).
  • tp: (Optional) Number of GPUs used for vLLM tensor parallelism (default: 1).
  • num_samples: (Optional) Number of samples to evaluate (default: None).
  • seed: (Optional) Random seed for reproducibility (default: 42).
  • start_idx: (Optional) Starting index for evaluation (default: 0).
  • temperatures: (Optional) List of sampling temperatures (default: [0.0]).
  • num_generations: (Optional) Number of generations per prompt (default: 1).
  • max_new_tokens: (Optional) Maximum number of new tokens to generate (default: 2048).
  • boxed_system_prompt: (Optional) Use boxed formatting for the system prompt (default: False).
  • top_p: (Optional) Top-p nucleus sampling parameter (default: 1.0).
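The overview reports results at high pass@k. When sampling several completions per prompt (`--num_generations` with a non-zero temperature), pass@k can be estimated with the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The sketch below is an assumption about how one might post-process results; it does not claim that `gsm_eval.py` computes pass@k this way.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every subset of k samples contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 16 generations for one problem, 2 of them correct.
print(pass_at_k(16, 2, 1))  # 0.125
print(round(pass_at_k(16, 2, 4), 4))  # 0.45
```

Averaging this value over all problems gives the dataset-level pass@k.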

🧮 Math Evaluation

For more complex mathematical datasets (e.g., MATH-500), run:

python math_eval.py \
  --models <model_name_or_path> \
  --datasets <dataset_name_or_path> \
  [--out_file <path_to_output>] \
  [--tp <int>] \
  [--num_samples <int>] \
  [--seed <int>] \
  [--start_idx <int>] \
  [--temperatures <floats...>] \
  [--num_generations <int>] \
  [--max_new_tokens <int>] \
  [--top_p <float>] \
  [--top_k <int>] \
  [--boxed_system_prompt] \
  [--llama_system_prompt] \
  [--instruct]

Arguments:

  • models: One or more Hugging Face model repositories or local model paths.
  • datasets: Dataset(s) to evaluate on: supports "math500" or a custom dataset path.
  • out_file: (Optional) Path to the output file for saving evaluation results (default: None).
  • tp: (Optional) Number of GPUs used for vLLM tensor parallelism (default: 1).
  • num_samples: (Optional) Number of samples to evaluate (default: None).
  • seed: (Optional) Random seed for reproducibility (default: 42).
  • start_idx: (Optional) Starting index for evaluation (default: 0).
  • temperatures: (Optional) List of sampling temperatures (default: [0.0]).
  • num_generations: (Optional) Number of generations per prompt (default: 1).
  • max_new_tokens: (Optional) Maximum number of new tokens generated by vLLM (default: 8192).
  • top_p: (Optional) Top-p nucleus sampling parameter (default: 1.0).
  • top_k: (Optional) Top-k sampling parameter (default: 0).
  • boxed_system_prompt: (Optional) Use boxed formatting for the system prompt (default: False).
  • llama_system_prompt: (Optional) Use the LLaMA-style system prompt format (default: False).

📝 Example


To train on Qwen2.5 3B Instruct with GSM-LongHorizon data and achieve performance similar to our main results, follow these guidelines.

  1. Download the GSM-LongHorizon dataset and split it by horizon length. The last two commands extract horizon length 1; repeat them with the corresponding value for each further horizon length you plan to train or evaluate on.

    huggingface-cli download alesiaivanova/GSM-LongHorizon --repo-type dataset --local-dir ./GSM-LongHorizon

    python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/train-00000-of-00001.parquet').to_json('GSM-LongHorizon/train.jsonl', orient='records', lines=True)"

    python -c "import pandas as pd; pd.read_parquet('GSM-LongHorizon/data/test-00000-of-00001.parquet').to_json('GSM-LongHorizon/test.jsonl', orient='records', lines=True)"

    python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/train.jsonl')]; open('GSM-LongHorizon/train_len_1.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 1)"

    python -c "import json; data = [json.loads(line) for line in open('GSM-LongHorizon/test.jsonl')]; open('GSM-LongHorizon/test_len_1.jsonl', 'w').writelines(json.dumps(item) + '\n' for item in data if item['horizon_length'] == 1)"
  2. Train the model on horizon length 1 examples.

    python grpo.py --model Qwen/Qwen2.5-3B-Instruct --dataset GSM-LongHorizon/train_len_1.jsonl --output_dir checkpoints/qwen-3b-grpo-len-1 --max_steps 300 --max_completion_length 768 --save_steps 50 --warmup_ratio 0.1 --seed 5

    (Optional: try 2–3 different random seeds, because training can be noisy; the first ~200 steps are usually enough for saturation.)
  3. Evaluate checkpoints on horizon lengths 1, 2, and 3, and choose the overall best checkpoint.

    python gsm_eval.py --models checkpoints/qwen-3b-grpo-len-1/checkpoint-200 --datasets GSM-LongHorizon/test_len_1.jsonl
  4. Use the best horizon-1 checkpoint as the initialization for training on horizon length 2. The command below assumes the chosen checkpoint has been saved or copied to checkpoints/qwen-3b-grpo-len-1-best.

    python grpo.py --model checkpoints/qwen-3b-grpo-len-1-best --dataset GSM-LongHorizon/train_len_2.jsonl --output_dir checkpoints/qwen-3b-grpo-len-2 --max_steps 300 --max_completion_length 1024 --save_steps 50 --warmup_ratio 0.1 --float_reward_func
  5. Repeat for further horizon lengths. For GSM-LongHorizon, we recommend setting the maximum completion length to 768 tokens for horizon 1, 1024 tokens for horizon 2, 1280 tokens for horizon 3, and 1536 tokens for horizons 4 and 5.
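The stage-wise procedure above can be summarized by a small script that only builds and prints the `grpo.py` command for each stage. It is a sketch under several assumptions: `build_curriculum_commands` is a hypothetical helper, the `-best` checkpoint directories must be created manually after evaluating each stage, and `--float_reward_func` is added only for horizon 2 because that is the only later stage whose command is shown in this walkthrough.

```python
import shlex

# Recommended maximum completion lengths per horizon (from this README).
COMPLETION_LENGTHS = {1: 768, 2: 1024, 3: 1280, 4: 1536, 5: 1536}

# Assumption: extra flags per stage; only horizon 2 is documented above.
EXTRA_FLAGS = {2: ["--float_reward_func"]}


def build_curriculum_commands(base_model, data_dir="GSM-LongHorizon",
                              ckpt_prefix="checkpoints/qwen-3b-grpo",
                              max_steps=300):
    """Return one grpo.py command (as an argument list) per curriculum stage."""
    commands = []
    model = base_model
    for horizon, max_len in COMPLETION_LENGTHS.items():
        output_dir = f"{ckpt_prefix}-len-{horizon}"
        commands.append([
            "python", "grpo.py",
            "--model", model,
            "--dataset", f"{data_dir}/train_len_{horizon}.jsonl",
            "--output_dir", output_dir,
            "--max_steps", str(max_steps),
            "--max_completion_length", str(max_len),
            "--save_steps", "50",
            "--warmup_ratio", "0.1",
            *EXTRA_FLAGS.get(horizon, []),
        ])
        # The next stage starts from the best checkpoint of this stage,
        # which you select by evaluation and copy to "<output_dir>-best".
        model = f"{output_dir}-best"
    return commands


commands = build_curriculum_commands("Qwen/Qwen2.5-3B-Instruct")
for cmd in commands:
    print(shlex.join(cmd))
```

Printing rather than executing the commands keeps the checkpoint-selection step between stages explicit.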

🎈 Citation


If you find h1 helpful, please cite us.

@misc{motwani2025h1bootstrappingllmsreason,
      title={h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning}, 
      author={Sumeet Ramesh Motwani and Alesia Ivanova and Ziyang Cai and Philip Torr and Riashat Islam and Shital Shah and Christian Schroeder de Witt and Charles London},
      year={2025},
      eprint={2510.07312},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.07312}, 
}

🌻 Acknowledgement


Our DrGRPO training framework builds upon components from willccbb/verifiers. For inference, we used vLLM. We thank the authors of these projects for their excellent open-source contributions.

📧 Contact


Feel free to contact Alesia Ivanova via email: alesia.ivanova@st-hildas.ox.ac.uk
