Feature Request
Bundle the vakra small M3 training data with the repository and make it the default data source when --m3-data is passed without a path argument.
Motivation / Problem
Today, ./benchmarks/m3/eval.sh --m3-data requires an explicit path to a zip or directory:
    if [[ -z "${2:-}" || "$2" == --* ]]; then
        echo "Error: --m3-data requires a path (zip file or directory)" >&2
        exit 2
    fi
There is no out-of-the-box default, so every user must know the location of the M3 data before they can run --m3-data mode. The comment in m3_registry_m3_data.yaml already anticipates a "default zip" invocation (./benchmarks/m3/eval.sh --m3-data # default zip), but eval.sh does not support that invocation yet.
Use Case
- Developers and CI jobs want to run ./benchmarks/m3/eval.sh --m3-data against a well-known, representative dataset without specifying an external path every time.
- The vakra small dataset is compact enough to ship alongside the repo and covers the capability domains already listed in m3_registry_m3_data.yaml.
- Having a canonical default lowers the barrier to entry for new contributors running M3 evals locally.
Proposed Solution
- Add the vakra small data. Commit (or reference via a just download-m3-data task) the vakra small zip/directory under benchmarks/m3/data/vakra_small/ (or as a tracked zip artifact).
- Wire it as the default. In eval.sh, change the --m3-data argument parsing so that omitting a path falls back to the bundled vakra small data:
    --m3-data)
        M3_DATA=true
        if [[ -z "${2:-}" || "$2" == --* ]]; then
            # No path supplied: use bundled vakra small data
            M3_DATA_PATH="$SCRIPT_DIR/data/vakra_small"
        else
            M3_DATA_PATH="$2"
            shift
        fi
        shift
        ;;
- Update help text. Document that --m3-data (with no path) runs against the bundled vakra small dataset.
- Update m3_registry_m3_data.yaml. Align the domains lists with whatever capabilities and domains vakra small actually contains.
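The fallback logic in the steps above can be pulled into small helpers so it is testable outside eval.sh. This is only a sketch: resolve_m3_data_path and require_m3_data are hypothetical names that do not exist in the repository today, and the data/vakra_small location is the one proposed in this issue, not a path that is currently present.

```shell
# Hypothetical helpers; neither function exists in eval.sh today.

# Print the data path that --m3-data should use.
#   $1 = directory containing eval.sh (SCRIPT_DIR in the proposal)
#   $2 = the argument that follows --m3-data, if any
resolve_m3_data_path() {
    case "${2:-}" in
        "" | --*)
            # No path supplied (end of args or another flag):
            # fall back to the proposed bundled location.
            printf '%s\n' "$1/data/vakra_small"
            ;;
        *)
            printf '%s\n' "$2"
            ;;
    esac
}

# Fail early with a clear message if the resolved path is missing,
# for example when the bundled data has not been fetched yet.
require_m3_data() {
    if [ ! -e "$1" ]; then
        echo "Error: M3 data not found at $1" >&2
        return 2
    fi
}
```

The caller would still need to shift once or twice depending on whether a path was consumed, as in the case branch shown in the proposal. The existence check keeps the no-argument invocation from failing later in the loader with a less obvious error.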
Alternatives Considered
- Keeping the required-path behavior and adding a separate --m3-data-default flag. Rejected: adds flag surface with no benefit; the implicit default is cleaner.
- Downloading the data at eval time via setup_m3.sh. This works but requires network access and makes the no-arg invocation slower and less reproducible.
Priority
Medium
Additional Context
- Related files: benchmarks/m3/eval.sh, benchmarks/m3/config/m3_registry_m3_data.yaml, benchmarks/m3/m3_data_loader.py
- The vakra scoring pipeline (benchmarks/m3/m3_vakra_score.py) already consumes the data shape that M3DataLoader produces, so no scoring-side changes are expected.