We introduce OPT-Engine, an extensible benchmark framework for optimization problems with controllable complexity and configurable templates. OPT-Engine spans ten canonical operations research (OR) problem classes and scales each of them systematically in complexity, providing a structured testbed for automated problem formulation and solving across different complexity levels. OPT-Engine facilitates rigorous, reproducible studies of how problem complexity impacts model performance, offering a more granular look at LLM formulation and solving capabilities.
- 2026.01.29 - OPT-Engine paper published on arXiv: OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling.
For a given problem class, the OPT-Engine pipeline operates as follows:
- Numeric Instance Generation & Validation: The pipeline first samples a random numerical instance within a specified target complexity range, defined by parameters such as variable and constraint counts. A Gurobi-based exact solver then verifies the instance's feasibility and computes the optimal solution, which is stored as the ground truth (see the validation sketch after this list).
- Canonical Problem Creation: Once validated, the framework maps the instance's numeric parameters into a structured, editable template to produce a canonical natural-language problem statement.
- Problem Augmentation: To decouple linguistic variation from mathematical structure, the pipeline employs an LLM-based rephrasing step. This step alters the textual scenario and surface wording while strictly preserving the objective function, constraints, and all numerical values.
- Integrity Verification: A subsequent LLM-as-a-judge and rule-based validation check uses regex extraction to confirm that the rephrased text preserves the original numeric parameters and their logical relationships (a sketch of the rule-based check follows this list). If validation fails, the rephrasing step is retried until a valid output is produced.
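To make the first step concrete, here is a minimal, hedged sketch of how a sampled instance (a knapsack, one of the canonical classes) could be validated with Gurobi and its optimum stored as ground truth. The sampling parameters and helper names are illustrative, not the actual OPT-Engine implementation.

```python
import random

import gurobipy as gp
from gurobipy import GRB


def sample_knapsack_instance(n_items: int, seed: int = 0) -> dict:
    """Sample random item values, weights, and a capacity (hypothetical parameterization)."""
    rng = random.Random(seed)
    values = [rng.randint(1, 100) for _ in range(n_items)]
    weights = [rng.randint(1, 50) for _ in range(n_items)]
    return {"values": values, "weights": weights, "capacity": sum(weights) // 2}


def validate_with_gurobi(inst: dict) -> float | None:
    """Solve the instance exactly; return the optimal value as ground truth, or None if infeasible."""
    n = len(inst["values"])
    m = gp.Model("knapsack")
    m.Params.OutputFlag = 0  # silence solver logs
    x = m.addVars(n, vtype=GRB.BINARY, name="x")
    m.setObjective(gp.quicksum(inst["values"][i] * x[i] for i in range(n)), GRB.MAXIMIZE)
    m.addConstr(gp.quicksum(inst["weights"][i] * x[i] for i in range(n)) <= inst["capacity"])
    m.optimize()
    return m.ObjVal if m.Status == GRB.OPTIMAL else None


instance = sample_knapsack_instance(n_items=20)
ground_truth = validate_with_gurobi(instance)  # stored alongside the generated problem text
```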
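The rule-based half of the integrity check can likewise be sketched as a comparison of the numeric literals extracted from the original and rephrased statements. This is a simplified assumption about the validator, and `rephrase` below is only a placeholder for the LLM call.

```python
import re
from collections import Counter


def extract_numbers(text: str) -> Counter:
    """Extract all numeric literals (integers and decimals) from a problem statement."""
    return Counter(re.findall(r"-?\d+(?:\.\d+)?", text))


def numbers_preserved(original: str, rephrased: str) -> bool:
    """Pass only if the rephrased text contains exactly the same numeric parameters."""
    return extract_numbers(original) == extract_numbers(rephrased)


# Retry loop around the rephrasing step (rephrase() stands in for the LLM call):
# candidate = rephrase(original_text)
# while not numbers_preserved(original_text, candidate):
#     candidate = rephrase(original_text)
```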
We also provide C³-Bench (Controllable & Configurable Complexity Benchmark), a dataset generated by our framework to facilitate reproducible research and enable further studies. The dataset is available here: C³-Bench Dataset.
C³-Bench is structured into two complementary subsets for evaluating model performance on classical Operations Research (OR) problems under controlled complexity scaling:
- canonical: Contains standard instances of 10 canonical OR problem types, with complexity systematically increased across parameters like variable and constraint counts.
- perturbation: Contains derived instances where controlled linguistic and parametric perturbations are applied to a subset of the canonical benchmark (covering the Inventory, TSP, and Knapsack problems) to specifically test model robustness and generalization.
Within the perturbation set, we introduce controlled variations across three dimensions that are known performance bottlenecks (a code sketch of the latter two perturbations follows the list):
- Linguistic Complexity: The underlying mathematical model is held constant, while the natural language description is rephrased into templates with systematically higher syntactic and lexical complexity, creating harder-to-parse problem statements.
- Objective Perturbation: A constant term or simple coefficient change is introduced to the objective function, testing the model's ability to adapt to modified goals while the constraints remain unchanged.
- Constraint Augmentation: One additional simple linear constraint is introduced to the original formulation, testing how models handle incremental increases in problem structure without altering the core variables or objective.
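As an illustration of the latter two perturbation types, the following sketch applies an objective shift and appends one extra linear constraint to an instance. The dictionary schema is hypothetical and does not reflect the dataset's actual serialization format.

```python
import copy


def perturb_objective(inst: dict, constant: float = 100.0) -> dict:
    """Objective perturbation: add a constant term to the objective; constraints are untouched."""
    out = copy.deepcopy(inst)
    out["objective_constant"] = out.get("objective_constant", 0.0) + constant
    return out


def augment_constraints(inst: dict, coeffs: list[float], rhs: float) -> dict:
    """Constraint augmentation: append one simple linear constraint sum_i coeffs[i] * x_i <= rhs."""
    out = copy.deepcopy(inst)
    out["constraints"].append({"coeffs": coeffs, "sense": "<=", "rhs": rhs})
    return out
```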
Our study, enabled by the C³-Bench dataset, addresses two critical questions about LLMs in optimization: 1.) Does LLM performance remain robust when generalizing to out-of-distribution optimization tasks that exceed the complexity levels of current benchmarks? 2.) At which stage of the solution pipeline—from problem interpretation to solution generation—do current LLMs encounter the most significant bottlenecks?
Our evaluation yields two primary findings:
1.) Tool Integration is Essential for Scaling: We demonstrate that Tool-Integrated Reasoning (TIR) yields consistent performance trends across all problem classes. In stark contrast, Pure-Text Reasoning (PTR) exhibits a clear and progressive accuracy drop as problem complexity increases.
2.) The Semantic Sensitivity Bottleneck: We identify a critical "semantic sensitivity" failure mode: even frontier LLMs struggle to maintain formulation fidelity when the linguistic expression of constraints deviates from canonical problem descriptions, highlighting a major bottleneck in reliable automated problem-solving.
Here is an example of TIR vs PTR:
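Under TIR, the model emits solver code that the evaluation harness executes, whereas under PTR it must reason its way to the numeric answer entirely in text. The following toy snippet is a hypothetical TIR-style output for a small knapsack prompt, not an excerpt from the paper:

```python
# Hypothetical TIR-style model output: executable code the harness runs to obtain the answer.
import gurobipy as gp
from gurobipy import GRB

values, weights, capacity = [60, 100, 120], [10, 20, 30], 50  # toy instance from the prompt

m = gp.Model("tir_example")
m.Params.OutputFlag = 0
x = m.addVars(len(values), vtype=GRB.BINARY)
m.setObjective(gp.quicksum(v * x[i] for i, v in enumerate(values)), GRB.MAXIMIZE)
m.addConstr(gp.quicksum(w * x[i] for i, w in enumerate(weights)) <= capacity)
m.optimize()
print(m.ObjVal)  # prints 220.0; the harness compares this against the stored ground truth

# Under PTR, the model would instead have to state "220" directly after textual reasoning,
# which becomes increasingly unreliable as variable and constraint counts grow.
```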
If you find OPT-Engine useful or relevant to your research, please consider citing our paper:
@article{chen2026opt,
title={OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling},
author={Chen, Yitian and Cheng, Cheng and Sun, Yinan and Ling, Zi and Ge, Dongdong},
journal={arXiv preprint arXiv:2601.19924},
year={2026}
}
For any questions or issues regarding the pipeline or datasets, please raise an issue on our GitHub repository or contact one of the authors via email:
Yitian Chen, chenyitian@shanshu.ai
Cheng Cheng, clairecheng0709@gmail.com


