Feature Request
Add a --dotenv flag to eval.sh and compare.sh (both top-level and per-benchmark) that, after applying a named model profile, re-reads .env and force-exports every variable found there — so .env always takes precedence over the profile for model configuration.
Motivation / Problem
apply_model_profile unconditionally force-exports AGENT_SETTING_CONFIG, MODEL_NAME, OPENAI_BASE_URL, OPENAI_API_VERSION, etc., overwriting any values loaded from .env by load_env.sh. There is currently no way to say "use my .env values for model config" — and no way to use a named profile as a default while letting .env override specific variables (e.g. to point a standard profile at a different endpoint or API key).
Use Case
A developer working against a local vLLM instance or a non-standard LiteLLM/WatsonX endpoint wants to:
- Run ./scripts/eval.sh --benchmark bpo --dotenv to use whatever MODEL_NAME, OPENAI_BASE_URL, WATSONX_PROJECT_ID, etc. are in .env, without specifying a profile at all (defaults to gpt-oss as the base).
- Run ./scripts/eval.sh --benchmark bpo --model-profile gpt4o --dotenv to use the gpt4o profile for everything it sets, but let .env override specific variables (e.g. OPENAI_BASE_URL to point to a different gateway).
- Have this work consistently in both eval.sh and compare.sh flows, including per-benchmark compare loops that call apply_model_profile directly.
This is especially useful when adding new inference services (Groq, LiteLLM, WatsonX) or testing against non-standard endpoints without modifying model_profiles.sh.
Proposed Solution
New flag: --dotenv (recognized by parse_common_args in benchmarks/helpers/common.sh and by each per-benchmark compare.sh)
Precedence order (lowest → highest):
load_env.sh (.env, no-override)
→ apply_model_profile (force-exports profile vars)
→ apply_dotenv_model_overrides (.env, force-exports ALL vars) ← new, only when --dotenv
→ CLI overrides (--model-name, --openai-base-url)
New functions in benchmarks/helpers/common.sh:
apply_dotenv_model_overrides([env_file]) — re-reads .env with override semantics, force-exporting every variable found. Accepts an optional path argument for testability; defaults to <project_root>/.env derived from BASH_SOURCE[0].
apply_model_config(profile, [env_file]) — wraps apply_model_profile + apply_dotenv_model_overrides. Defaults to gpt-oss when USE_DOTENV=true and no profile is given.
Updated finalize_model_config delegates to apply_model_config.
Per-benchmark compare.sh scripts (bpo, m3, appworld, oak_health_insurance) replace their bare apply_model_profile "$model" calls with apply_model_config "$model".
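A minimal sketch of how the two new functions could fit together is given here. It is not the final implementation: the apply_model_profile stub and its values are hypothetical stand-ins for the real profile logic, the ../.. hop to the project root assumes the function lives in benchmarks/helpers/common.sh, and the parsing rules (export prefix, quotes, inline comments) are one plausible reading of the test plan.

```bash
#!/usr/bin/env bash
# Sketch only. The stub profile values and the parsing rules are assumptions.

# Hypothetical stand-in for the real apply_model_profile.
apply_model_profile() {
  export MODEL_NAME="$1-model"
  export OPENAI_BASE_URL="https://profile.example.invalid/v1"
}

# Re-read an env file and force-export every KEY=VALUE pair found in it.
apply_dotenv_model_overrides() {
  local env_file="${1:-}"
  if [[ -z "$env_file" ]]; then
    # Assumed layout: this file sits two levels below the project root.
    env_file="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/.env"
  fi
  [[ -f "$env_file" ]] || return 0              # no-op when the file is missing

  local line key value
  while IFS= read -r line || [[ -n "$line" ]]; do
    line="${line#"${line%%[![:space:]]*}"}"     # trim leading whitespace
    [[ -z "$line" || "$line" == \#* ]] && continue
    line="${line#export }"                      # accept export-prefixed lines
    [[ "$line" == *=* ]] || continue
    key="${line%%=*}"
    value="${line#*=}"
    [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue

    if [[ "$value" == \"* ]]; then              # double-quoted value
      value="${value#\"}"
      value="${value%%\"*}"
    elif [[ "$value" == \'* ]]; then            # single-quoted value
      value="${value#\'}"
      value="${value%%\'*}"
    else                                        # bare value
      value="${value%%[[:space:]]#*}"           # drop inline comment
      value="${value%"${value##*[![:space:]]}"}" # trim trailing whitespace
    fi
    export "$key=$value"
  done < "$env_file"
}

# Apply the profile first, then let .env win when USE_DOTENV=true.
apply_model_config() {
  local profile="${1:-}" env_file="${2:-}"
  if [[ "${USE_DOTENV:-false}" == "true" ]]; then
    apply_model_profile "${profile:-gpt-oss}"
    apply_dotenv_model_overrides "$env_file"
  elif [[ -n "$profile" ]]; then
    apply_model_profile "$profile"
  fi
}
```

Because the .env pass runs after the profile, a variable that .env omits keeps its profile value, while any variable .env does define replaces it. The values are parsed by hand rather than sourced, so nothing in .env is executed as shell code. What should happen when neither a profile nor --dotenv is given is left open here.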
Examples:
# Use .env entirely (gpt-oss as default base)
./scripts/eval.sh --benchmark bpo --dotenv
# gpt4o profile as base, .env overrides on top
./scripts/eval.sh --benchmark bpo --model-profile gpt4o --dotenv
# Existing behaviour unchanged (no --dotenv)
./scripts/eval.sh --benchmark bpo --model-profile gpt-oss
Alternatives Considered
- Hard-coded list of model-config vars to override: more surgical, but requires maintenance every time a new service (Groq, WatsonX, etc.) is added. Rejected in favour of re-reading all .env vars so future service vars work automatically.
- dotenv as a pseudo-profile name: doesn't support the merge case (--model-profile gpt4o --dotenv) and is confusing alongside real profile names.
- Modifying apply_model_profile to respect pre-set vars: invasive change to internals; less transparent.
Priority
High - Important for my workflow
Additional Context
Test plan:
- Bash unit tests for apply_dotenv_model_overrides: overrides existing vars, no-op when .env missing, strips quotes/inline comments, handles export-prefixed lines.
- Bash unit tests for apply_model_config: USE_DOTENV=false behaves like profile only; USE_DOTENV=true with profile → .env wins; .env omits var → profile value kept; no profile + USE_DOTENV=true → defaults to gpt-oss.
- Smoke tests: --dotenv alone, --model-profile gpt4o --dotenv, and existing behaviour without --dotenv.
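Two of the unit tests above could be sketched roughly as follows. This assumes apply_model_config has already been sourced (for example from benchmarks/helpers/common.sh); assert_eq, run_tests and the test function names are hypothetical helpers, not an existing harness in the repo.

```bash
#!/usr/bin/env bash
# Test sketch. Assumes apply_model_config is already defined in this shell.
# assert_eq, run_tests and the test names are hypothetical.

assert_eq() {  # usage: assert_eq <expected> <actual> <label>
  if [[ "$1" == "$2" ]]; then
    echo "ok   - $3"
  else
    echo "FAIL - $3 (expected '$1', got '$2')"
    return 1
  fi
}

# USE_DOTENV=true with a profile: the value from .env should win.
test_dotenv_wins_over_profile() {
  local env_file got
  env_file="$(mktemp)"
  echo 'MODEL_NAME=from-dotenv' > "$env_file"
  # Command substitution runs in a subshell, so exports do not leak between tests.
  got="$(USE_DOTENV=true; apply_model_config gpt4o "$env_file" >/dev/null; echo "$MODEL_NAME")"
  rm -f "$env_file"
  assert_eq "from-dotenv" "$got" ".env overrides the profile when USE_DOTENV=true"
}

# Missing .env: the result should match a plain profile run.
test_missing_env_file_keeps_profile() {
  local before after
  before="$(USE_DOTENV=false; apply_model_config gpt4o >/dev/null; echo "$MODEL_NAME")"
  after="$(USE_DOTENV=true; apply_model_config gpt4o /nonexistent/.env >/dev/null; echo "$MODEL_NAME")"
  assert_eq "$before" "$after" "missing .env leaves profile values untouched"
}

run_tests() {
  local failed=0 t
  for t in test_dotenv_wins_over_profile test_missing_env_file_keeps_profile; do
    "$t" || failed=1
  done
  return "$failed"
}
```

Calling run_tests after sourcing the helpers returns non-zero if any case fails. The second test compares against a plain profile run instead of hard-coding a profile value, so it should stay valid if the gpt4o profile changes.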
Files affected: benchmarks/helpers/common.sh, scripts/eval.sh, scripts/compare.sh, benchmarks/bpo/compare.sh, benchmarks/m3/compare.sh, benchmarks/appworld/compare.sh, benchmarks/oak_health_insurance/compare.sh, README.md.