feat: add --dotenv flag to override model profile with .env values by haroldship · Pull Request #59 · cuga-project/cuga-eval

haroldship · 2026-06-09T14:01:30Z

Summary

Adds --dotenv flag to all eval.sh and compare.sh scripts (top-level dispatchers and all four per-benchmark scripts).
When --dotenv is set, after applying the named model profile, the entire .env is re-read and force-exported — so .env always wins over the profile for model configuration.
If no --model-profile is given alongside --dotenv, defaults to gpt-oss as the base profile.
CLI overrides (--model-name, --openai-base-url) still take highest precedence.

Precedence order (lowest → highest):

load_env.sh (.env, no-override)
→ apply_model_profile (force-exports profile vars)
→ apply_dotenv_model_overrides (.env, force-exports ALL vars)   ← new, only when --dotenv
→ CLI overrides (--model-name, --openai-base-url)

Example usage:

# .env entirely controls model config (gpt-oss as default base)
./scripts/eval.sh --benchmark bpo --dotenv

# gpt4o as base, .env overrides specific vars on top
./scripts/eval.sh --benchmark bpo --model-profile gpt4o --dotenv

# Existing behaviour unchanged
./scripts/eval.sh --benchmark bpo --model-profile gpt-oss

Closes #58

Test plan

8/8 bash unit tests pass (benchmarks/helpers/tests/test_model_config.sh)
- apply_dotenv_model_overrides: overrides vars, no-op when .env missing, strips quotes/comments, handles export-prefixed lines
- apply_model_config: USE_DOTENV=false behaves like profile-only; USE_DOTENV=true with profile → .env wins; .env omits var → profile value kept; no profile + USE_DOTENV=true → defaults to gpt-oss
Dry-run smoke test: --dotenv alone shows gpt-oss profile applied then .env force-exports on top
Dry-run smoke test: --model-profile gpt4o --dotenv shows gpt4o profile applied first, then .env overrides MODEL_NAME
Dry-run smoke test: --model-profile gpt-oss (no --dotenv) shows no .env overrides section — existing behaviour unchanged

Summary by CodeRabbit

New Features
- Added --dotenv command-line flag to eval and comparison scripts for using .env file values to override model profile configuration.
Documentation
- Updated README with --dotenv flag documentation and usage examples.
Tests
- Added test suite for model configuration logic, including .env override behavior.

…des and apply_model_config

…config

coderabbitai · 2026-06-09T14:01:44Z

Warning

Review limit reached

@haroldship, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 13 minutes and 23 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1ebfb4c1-c93d-4e8c-a83c-c286bbf5c3d7

📥 Commits

Reviewing files that changed from the base of the PR and between 6c0c8e9 and 6f4be7f.

📒 Files selected for processing (1)

benchmarks/helpers/tests/test_model_config.sh

📝 Walkthrough

Walkthrough

This PR implements a --dotenv flag that allows .env file variables to override model profile settings in benchmark evaluation and comparison flows. The feature adds a new precedence layer: .env can override profile defaults while still allowing CLI arguments to take final precedence. When no explicit profile is provided with --dotenv, the system defaults to the gpt-oss base profile.

Changes

--dotenv Feature Implementation

Layer / File(s)	Summary
Core .env override infrastructure and tests `benchmarks/helpers/common.sh`, `benchmarks/helpers/tests/test_model_config.sh`	Introduces `USE_DOTENV` global variable, `apply_dotenv_model_overrides()` function that parses `.env` files (supporting `export KEY=VALUE` syntax, quote stripping, inline comment removal), and updates `apply_model_config()` to layer `.env` overrides after profile application. Tests validate parsing logic and precedence order (profile → .env → CLI).
Top-level script integration `scripts/eval.sh`, `scripts/compare.sh`	Both scripts add `--dotenv` to help text and export `USE_DOTENV` variable for downstream use.
Per-benchmark compare.sh integration `benchmarks/appworld/compare.sh`, `benchmarks/bpo/compare.sh`, `benchmarks/m3/compare.sh`, `benchmarks/oak_health_insurance/compare.sh`	Each benchmark's compare script adds `USE_DOTENV` variable initialization, `--dotenv` flag parsing, and replaces `apply_model_profile "$model"` with `apply_model_config "$model"` to activate the new override logic.
Documentation `README.md`	Documents `--dotenv` flag behavior and provides usage examples showing interaction with `--model-profile` and `.env` value precedence.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding a --dotenv flag to override model profiles with .env values, which is the primary feature across all modified scripts.
Linked Issues check	✅ Passed	The PR fully implements all coding requirements from issue `#58`: new functions (apply_dotenv_model_overrides, apply_model_config), --dotenv flag parsing in all scripts, proper precedence handling, per-benchmark script updates, and comprehensive unit tests.
Out of Scope Changes check	✅ Passed	All changes are directly related to implementing the --dotenv feature: helper functions, argument parsing, per-benchmark updates, tests, and documentation. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/dotenv-model-override

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

benchmarks/helpers/tests/test_model_config.sh (1)
30-30: 💤 Low value

Consider single-quoting trap commands to defer variable expansion.

Lines 30, 50, 60, 83, 94, and 105 use trap "rm -f $TMP" EXIT, which expands $TMP at trap-set time. While functionally correct here (since $TMP doesn't change), shellcheck SC2064 recommends trap 'rm -f "$TMP"' EXIT to make the deferred expansion explicit and avoid potential pitfalls if the variable were modified later.
♻️ Recommended fix for shell hygiene
-    TMP=$(mktemp); trap "rm -f $TMP" EXIT
+    TMP=$(mktemp); trap 'rm -f "$TMP"' EXIT
Apply this pattern to all six occurrences.
Also applies to: 50-50, 60-60, 83-83, 94-94, 105-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/helpers/tests/test_model_config.sh` at line 30, The trap commands
use double quotes causing $TMP to be expanded immediately; update each
occurrence of TMP created via mktemp where you call trap "rm -f $TMP" EXIT to
use single-quoted deferred expansion (e.g., change to trap 'rm -f "$TMP"' EXIT)
so the removal happens with the variable evaluated at trap execution time—apply
this change for all six occurrences that reference TMP.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@benchmarks/helpers/tests/test_model_config.sh`:
- Line 30: The trap commands use double quotes causing $TMP to be expanded
immediately; update each occurrence of TMP created via mktemp where you call
trap "rm -f $TMP" EXIT to use single-quoted deferred expansion (e.g., change to
trap 'rm -f "$TMP"' EXIT) so the removal happens with the variable evaluated at
trap execution time—apply this change for all six occurrences that reference
TMP.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bce075a7-17f0-49d1-b06c-8b6ead0a4486

📥 Commits

Reviewing files that changed from the base of the PR and between 142c29a and 6c0c8e9.

📒 Files selected for processing (9)

README.md
benchmarks/appworld/compare.sh
benchmarks/bpo/compare.sh
benchmarks/helpers/common.sh
benchmarks/helpers/tests/test_model_config.sh
benchmarks/m3/compare.sh
benchmarks/oak_health_insurance/compare.sh
scripts/compare.sh
scripts/eval.sh

haroldship · 2026-06-09T14:48:20Z

Fixed in 6f4be7f — all six trap commands now use single-quoted deferred expansion (trap 'rm -f "$TMP"' EXIT). 8/8 tests still pass.

haroldship added 11 commits June 9, 2026 14:33

test(model-config): add bash unit tests for apply_dotenv_model_overri…

bb61735

…des and apply_model_config

test(model-config): suppress stderr in all test function calls

812655f

feat(model-config): add USE_DOTENV global and --dotenv arg parsing

1047fa9

feat(model-config): add apply_dotenv_model_overrides

4a7848d

feat(model-config): add apply_model_config and update finalize_model_…

c1bbb28

…config

feat(model-config): expose --dotenv in top-level eval/compare scripts

cb5dbf5

feat(bpo): honour --dotenv in per-config compare loop

fd33430

feat(m3): honour --dotenv in per-config compare loop

d7714df

feat(appworld): honour --dotenv in per-config compare loop

debcf62

feat(oak): honour --dotenv in per-model compare loop

a28b714

docs: document --dotenv flag in README

6c0c8e9

haroldship requested a review from Sergey-Zeltyn June 9, 2026 14:11

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

fix(test): defer $TMP expansion in trap commands (SC2064)

6f4be7f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add --dotenv flag to override model profile with .env values#59

feat: add --dotenv flag to override model profile with .env values#59
haroldship wants to merge 12 commits into
mainfrom
feat/dotenv-model-override

haroldship commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

haroldship commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

haroldship commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

haroldship commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

haroldship commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading