[Feature]: Add --dotenv flag to layer .env overrides on top of model profiles #58

@haroldship

Feature Request

Add a --dotenv flag to eval.sh and compare.sh (both top-level and per-benchmark) that, after applying a named model profile, re-reads .env and force-exports every variable found there — so .env always takes precedence over the profile for model configuration.

Motivation / Problem

apply_model_profile unconditionally force-exports AGENT_SETTING_CONFIG, MODEL_NAME, OPENAI_BASE_URL, OPENAI_API_VERSION, etc., overwriting any values loaded from .env by load_env.sh. There is currently no way to say "use my .env values for model config" — and no way to use a named profile as a default while letting .env override specific variables (e.g. to point a standard profile at a different endpoint or API key).

Use Case

A developer working against a local vLLM instance or a non-standard LiteLLM/WatsonX endpoint wants to:

  1. Run ./scripts/eval.sh --benchmark bpo --dotenv — use whatever MODEL_NAME, OPENAI_BASE_URL, WATSONX_PROJECT_ID, etc. are in .env, without specifying a profile at all (defaults to gpt-oss as the base).
  2. Run ./scripts/eval.sh --benchmark bpo --model-profile gpt4o --dotenv — use the gpt4o profile for everything it sets, but let .env override specific variables (e.g. OPENAI_BASE_URL to point to a different gateway).
  3. Have this work consistently in both eval.sh and compare.sh flows, including per-benchmark compare loops that call apply_model_profile directly.

This is especially useful when adding new inference services (Groq, LiteLLM, WatsonX) or testing against non-standard endpoints without modifying model_profiles.sh.

Proposed Solution

New flag: --dotenv (recognized by parse_common_args in benchmarks/helpers/common.sh and by each per-benchmark compare.sh)
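A minimal sketch of how the flag could be recognized, assuming it simply sets USE_DOTENV=true. The function name parse_args_sketch and the MODEL_PROFILE variable are hypothetical; the real parse_common_args may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for parse_common_args; only the two flags relevant
# to this issue are handled, everything else is skipped.
USE_DOTENV=false
MODEL_PROFILE=""

parse_args_sketch() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dotenv)
        USE_DOTENV=true          # assumption: the flag is a plain boolean switch
        shift
        ;;
      --model-profile)
        MODEL_PROFILE="${2:-}"   # empty when the value is missing
        shift
        if [ "$#" -gt 0 ]; then shift; fi
        ;;
      *)
        shift                    # ignore flags this sketch does not model
        ;;
    esac
  done
}

parse_args_sketch --benchmark bpo --model-profile gpt4o --dotenv
echo "USE_DOTENV=$USE_DOTENV MODEL_PROFILE=$MODEL_PROFILE"
```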

Precedence order (lowest → highest):

load_env.sh (.env, no-override)
→ apply_model_profile (force-exports profile vars)
→ apply_dotenv_model_overrides (.env, force-exports ALL vars)   ← new, only when --dotenv
→ CLI overrides (--model-name, --openai-base-url)
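The ordering above amounts to "last writer wins". The sketch below walks the four layers with inline stand-ins; the URLs, the model name, and the CLI_MODEL_NAME variable are illustrative assumptions, not values from the repo.

```shell
#!/usr/bin/env bash
# Each step is a hypothetical stand-in for the real layer.
USE_DOTENV=true        # assumed to be set by --dotenv
CLI_MODEL_NAME=""      # would carry the value of --model-name, if given

unset MODEL_NAME OPENAI_BASE_URL

# 1. load_env.sh: no-override, so it only fills variables that are unset.
: "${OPENAI_BASE_URL:=http://localhost:8000/v1}"
export OPENAI_BASE_URL

# 2. apply_model_profile: force-exports, clobbering the .env value.
export MODEL_NAME="gpt-4o"
export OPENAI_BASE_URL="https://profile.example.invalid/v1"

# 3. apply_dotenv_model_overrides: force-exports .env again, only with --dotenv.
if [ "$USE_DOTENV" = true ]; then
  export OPENAI_BASE_URL="http://localhost:8000/v1"
fi

# 4. CLI overrides are applied last and therefore win.
if [ -n "$CLI_MODEL_NAME" ]; then
  export MODEL_NAME="$CLI_MODEL_NAME"
fi

echo "MODEL_NAME=$MODEL_NAME OPENAI_BASE_URL=$OPENAI_BASE_URL"
```

Here the profile supplies MODEL_NAME because .env does not set it, while OPENAI_BASE_URL ends up with the .env value.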

New functions in benchmarks/helpers/common.sh:

  • apply_dotenv_model_overrides([env_file]) — re-reads .env with override semantics, force-exporting every variable found. Accepts an optional path argument for testability; defaults to <project_root>/.env derived from BASH_SOURCE[0].
  • apply_model_config(profile, [env_file]) — wraps apply_model_profile + apply_dotenv_model_overrides. Defaults to gpt-oss when USE_DOTENV=true and no profile is given.
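The two functions could look roughly like the sketch below. The parsing rules (one quoted or unquoted value per line, inline comments introduced by whitespace plus #) and the apply_model_profile stub are assumptions for illustration; the default path here is simply .env rather than one derived from BASH_SOURCE[0].

```shell
#!/usr/bin/env bash
# Sketch only: the parsing rules and the profile stub are assumptions.

# Stand-in for the real apply_model_profile (force-exports profile values).
apply_model_profile() {
  export MODEL_NAME="$1"
  export OPENAI_BASE_URL="https://profile.example.invalid/v1"
}

# Re-read an env file and force-export every KEY=VALUE pair found in it.
apply_dotenv_model_overrides() {
  local env_file="${1:-.env}"
  [ -f "$env_file" ] || return 0                 # no-op when the file is missing

  local line key value first
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line#"${line%%[![:space:]]*}"}"      # trim leading whitespace
    case "$line" in ''|'#'*) continue ;; esac    # skip blanks and comments
    line="${line#export }"                       # accept export-prefixed lines
    case "$line" in *=*) ;; *) continue ;; esac  # require KEY=VALUE
    key="${line%%=*}"
    value="${line#*=}"
    # Skip anything that is not a valid shell identifier.
    case "$key" in ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*) continue ;; esac
    first="${value:0:1}"
    if [ "$first" = '"' ] || [ "$first" = "'" ]; then
      value="${value:1}"                         # drop the opening quote
      value="${value%%"$first"*}"                # cut at the closing quote
    else
      value="${value%%[[:space:]]#*}"            # strip an inline comment
      value="${value%"${value##*[![:space:]]}"}" # trim trailing whitespace
    fi
    export "$key=$value"
  done < "$env_file"
}

# Profile first, then .env on top when USE_DOTENV=true.
apply_model_config() {
  local profile="${1:-}" env_file="${2:-}"
  if [ -z "$profile" ] && [ "${USE_DOTENV:-false}" = true ]; then
    profile="gpt-oss"                            # default base named in the issue
  fi
  apply_model_profile "$profile"
  if [ "${USE_DOTENV:-false}" = true ]; then
    apply_dotenv_model_overrides ${env_file:+"$env_file"}
  fi
}
```

With USE_DOTENV=true and an env file that sets only OPENAI_BASE_URL, calling apply_model_config gpt4o in this sketch leaves MODEL_NAME at the profile's value and OPENAI_BASE_URL at the file's value. Escaped quotes inside values and multi-line values are not handled.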

finalize_model_config is updated to delegate to apply_model_config.

Per-benchmark compare.sh scripts (bpo, m3, appworld, oak_health_insurance) replace their bare apply_model_profile "$model" calls with apply_model_config "$model".
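The reason the compare loops need the wrapper is that each iteration re-applies a profile, which would otherwise clobber the .env values again. A minimal sketch of such a loop, with hypothetical stand-ins for both functions and a DOTENV_BASE_URL variable standing in for the value read from .env:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real functions live in benchmarks/helpers/common.sh.
USE_DOTENV=true
DOTENV_BASE_URL="http://localhost:8000/v1"   # pretend this came from .env

apply_model_profile() {
  export MODEL_NAME="$1"
  export OPENAI_BASE_URL="https://profile.example.invalid/$1"
}

apply_model_config() {
  apply_model_profile "$1"
  if [ "${USE_DOTENV:-false}" = true ]; then
    export OPENAI_BASE_URL="$DOTENV_BASE_URL"   # .env wins over the profile
  fi
}

SEEN=""
for model in gpt-oss gpt4o; do
  apply_model_config "$model"    # previously: apply_model_profile "$model"
  SEEN="${SEEN}${MODEL_NAME}=${OPENAI_BASE_URL};"
  echo "comparing $MODEL_NAME via $OPENAI_BASE_URL"
done
```

Every iteration in this sketch ends with the .env endpoint in place, regardless of which profile was just applied.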

Examples:

# Use .env entirely (gpt-oss as default base)
./scripts/eval.sh --benchmark bpo --dotenv

# gpt4o profile as base, .env overrides on top
./scripts/eval.sh --benchmark bpo --model-profile gpt4o --dotenv

# Existing behaviour unchanged (no --dotenv)
./scripts/eval.sh --benchmark bpo --model-profile gpt-oss

Alternatives Considered

  • Hard-coded list of model-config vars to override — more surgical but requires maintenance every time a new service (Groq, WatsonX, etc.) is added. Rejected in favour of re-reading all .env vars so future service vars work automatically.
  • dotenv as a pseudo-profile name — doesn't support the merge case (--model-profile gpt4o --dotenv) and is confusing alongside real profile names.
  • Modifying apply_model_profile to respect pre-set vars — invasive change to internals; less transparent.

Priority

High - Important for my workflow

Additional Context

Test plan:

  • Bash unit tests for apply_dotenv_model_overrides: overrides existing vars, no-op when .env missing, strips quotes/inline comments, handles export-prefixed lines.
  • Bash unit tests for apply_model_config: USE_DOTENV=false behaves like apply_model_profile alone; USE_DOTENV=true with a profile → .env wins; .env omits a var → profile value kept; no profile + USE_DOTENV=true → defaults to gpt-oss.
  • Smoke tests: --dotenv alone, --model-profile gpt4o --dotenv, and existing behaviour without --dotenv.
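Two of the unit tests listed above could be sketched without a test framework as follows. The assert_eq helper and the simplified apply_dotenv_model_overrides stand-in (plain KEY=VALUE lines only) are hypothetical; in the repo the function would be sourced from benchmarks/helpers/common.sh instead.

```shell
#!/usr/bin/env bash
# Simplified stand-in for the function under test: plain KEY=VALUE lines only.
apply_dotenv_model_overrides() {
  local env_file="$1" key value
  [ -f "$env_file" ] || return 0
  while IFS='=' read -r key value || [ -n "$key" ]; do
    if [ -n "$key" ]; then export "$key=$value"; fi
  done < "$env_file"
}

FAILS=0
# assert_eq <expected> <actual> <label>
assert_eq() {
  if [ "$1" = "$2" ]; then
    echo "ok   - $3"
  else
    echo "FAIL - $3 (expected '$1', got '$2')"
    FAILS=$((FAILS + 1))
  fi
}

test_overrides_existing_var() {
  local tmp
  tmp="$(mktemp)"
  echo 'MODEL_NAME=from-dotenv' > "$tmp"
  export MODEL_NAME="from-profile"
  apply_dotenv_model_overrides "$tmp"
  assert_eq "from-dotenv" "$MODEL_NAME" "overrides existing var"
  rm -f "$tmp"
}

test_noop_when_missing() {
  export MODEL_NAME="from-profile"
  apply_dotenv_model_overrides "/nonexistent/dir/.env"
  assert_eq "from-profile" "$MODEL_NAME" "no-op when .env missing"
}

test_overrides_existing_var
test_noop_when_missing
echo "failures: $FAILS"
```

Passing the env file path as an argument is what makes these tests possible without touching the project's real .env.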

Files affected: benchmarks/helpers/common.sh, scripts/eval.sh, scripts/compare.sh, benchmarks/bpo/compare.sh, benchmarks/m3/compare.sh, benchmarks/appworld/compare.sh, benchmarks/oak_health_insurance/compare.sh, README.md.

Metadata

Labels: enhancement (New feature or request)
Status: In progress