Unit tests for the CoEval Experiment Execution Runtime (EER).
Run with `python -m pytest Tests/runner/`. Add `-v` for verbose output or `-x` to stop on the first failure.
Current test count: 404 tests across 8 test modules.
Tests for config loading and validation.
- Validation rules V-01 through V-17 (required fields, type checks, mutual exclusivity, etc.)
- V-15: `probe_mode` must be one of `disable`, `full`, `resume`
- V-16: `probe_on_fail` must be one of `abort`, `warn`
- V-17: `label_attributes` entries must be a subset of `target_attributes` keys (when static)
- Role-parameter merging: how model-level settings override global defaults
- Phase mode defaults and how they interact with explicit overrides
- Parsing of new experiment-level fields: `probe_mode`, `probe_on_fail`, `estimate_cost`, `estimate_samples`
- Parsing of new task-level field: `label_attributes`
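The V-15/V-16 enum checks boil down to simple set-membership tests. A minimal sketch of that pattern (the function name `validate_probe_settings`, the config shape, and the defaults are assumptions for illustration, not the real EER validator):

```python
# Hypothetical re-implementation of the V-15/V-16 membership checks;
# the real validation lives in the EER config loader.
VALID_PROBE_MODES = {"disable", "full", "resume"}
VALID_ON_FAIL = {"abort", "warn"}

def validate_probe_settings(cfg: dict) -> list[str]:
    """Return a list of validation-error strings (empty when valid)."""
    errors = []
    mode = cfg.get("probe_mode", "disable")  # assumed default
    if mode not in VALID_PROBE_MODES:
        errors.append(f"V-15: probe_mode must be one of {sorted(VALID_PROBE_MODES)}, got {mode!r}")
    on_fail = cfg.get("probe_on_fail", "abort")  # assumed default
    if on_fail not in VALID_ON_FAIL:
        errors.append(f"V-16: probe_on_fail must be one of {sorted(VALID_ON_FAIL)}, got {on_fail!r}")
    return errors
```

Collecting errors into a list rather than raising on the first one lets a single test assert on several violations at once.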
Tests for the ExperimentStorage layer.
- Round-trip serialization: write a result, read it back, verify integrity
- Metadata lifecycle: creation, update, and finalization of experiment metadata
- Resume copy: verifying that a resumed experiment correctly inherits prior state
- Continue in-place: `initialize(continue_in_place=True)` reopens without clearing data
- Continue does not overwrite `meta.json` (`phases_completed` preserved)
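The round-trip tests follow the usual write-then-read-back pattern against a temporary folder. A self-contained sketch of that pattern (the JSONL file name and record shape are assumptions, not ExperimentStorage's real on-disk layout):

```python
import json
import tempfile
from pathlib import Path

def write_result(folder: Path, record: dict) -> None:
    # Append one JSON record per line (JSONL), mirroring the storage layer.
    with (folder / "results.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_results(folder: Path) -> list[dict]:
    path = folder / "results.jsonl"
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    record = {"datapoint_id": 7, "response": "hello", "score": 0.9}
    write_result(folder, record)
    assert read_results(folder) == [record]  # round-trip preserves the record
```

In the actual suite, pytest's `tmp_path` fixture would replace the explicit `TemporaryDirectory`.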
Tests for prompt construction and resolution.
- Prompt resolution order (config-level vs. task-level vs. role-level)
- All 6 supported template IDs
- Variable substitution: `{task}`, `{response}`, `{rubric}`, and other placeholders
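Substitution of `{...}` placeholders can be done with `str.format_map`, leaving unknown placeholders intact so partial resolution is safe. A sketch (the template text is illustrative, not one of the six real template IDs):

```python
class _KeepMissing(dict):
    # format_map calls __missing__ for absent keys; returning the
    # placeholder text verbatim keeps unresolved variables in place.
    def __missing__(self, key):
        return "{" + key + "}"

def render_prompt(template: str, **variables: str) -> str:
    return template.format_map(_KeepMissing(variables))

prompt = render_prompt(
    "Evaluate the following.\nTask: {task}\nResponse: {response}\nRubric: {rubric}",
    task="Summarize the article",
    response="The article argues ...",
)
```

Keeping unresolved placeholders visible makes a missing variable easy to assert on in a test, instead of raising `KeyError` mid-run.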
Tests for shared utility functions.
- JSON extraction from freeform model output
- `extract_prompt_response`: parsing raw completions into structured records
- Merge helpers: deep-merge logic for nested config dicts
- `QuotaTracker`: rate-limit token accounting and backoff logic
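The deep-merge behavior the merge-helper tests exercise can be sketched as follows (the name `deep_merge` is an assumption; the real helper lives in the shared utilities):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, returning a new dict.

    Nested dicts are merged key by key; any non-dict value in
    `override` replaces the corresponding base value outright.
    Neither input is mutated.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

This is the shape of logic behind role-parameter merging: model-level settings override global defaults without wiping out sibling keys.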
Tests for Phase 4 (response collection) and Phase 5 (evaluation).
- Phase 4 and 5 in New, Keep, Extend, and Model modes
- Batch path disabled in all unit tests (`cfg.use_batch.return_value = False`)
- Phase 5 Extend mode: already-evaluated responses are always skipped (regardless of new rubric factors)
- Per-phase skip logic and JSONL record counting
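The skip logic these tests cover reduces to filtering out datapoints whose IDs already appear in the phase's output records. A sketch (the record shape and function name are assumptions):

```python
def pending_datapoints(all_ids: list[str], completed_records: list[dict]) -> list[str]:
    # A datapoint is skipped when a finished record for it already exists;
    # this is what makes Extend-mode reruns idempotent.
    done = {rec["datapoint_id"] for rec in completed_records}
    return [dp_id for dp_id in all_ids if dp_id not in done]
```

Counting `len(pending_datapoints(...))` against the JSONL record count is the kind of assertion the per-phase tests make.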
Tests for Code/runner/label_eval.py — label accuracy for classification and IE tasks.
- `extract_label`: JSON exact key, alias key (`label`/`prediction`/`class`/`answer`), markdown fence, short free text (≤60 chars), long text (returns `None`), empty text, null value, integer coercion
- `extract_multilabel`: JSON multi-key, missing keys, short free-text fallback, empty attr list
- `LabelEvaluator.evaluate`: perfect/partial/zero accuracy, case-insensitive default match, custom `match_fn`, extraction failure (skipped), per-label P/R/F1, missing datapoint (skipped), missing attribute in ground truth (skipped)
- `LabelEvaluator.evaluate_multilabel`: Hamming accuracy (perfect, partial, empty)
- `LabelEvaluator`: empty `label_attributes` raises `ValueError`
- Information-extraction scenario (`entity_type` attribute)
Tests for Code/runner/interfaces/probe.py and Code/runner/interfaces/cost_estimator.py.
Probe tests:
- `_models_needed`: resume mode filters models by roles needed for remaining phases
- `run_probe` with `mode='disable'` returns empty results without probing
- `run_probe` writes `probe_results.json` containing mode, results, and the probed model list
- `on_fail='abort'` calls `logger.error`; `on_fail='warn'` calls `logger.warning`
Cost estimator tests:
- `get_prices`: known models return correct prices; unknown models use defaults
- `count_tokens_approx`: length/4 heuristic, minimum of 1
- Heuristic latency/TPS fallbacks for HuggingFace and disabled sampling
- `estimate_experiment_cost`: required keys present in result, all phases accounted for, total cost > 0, `cost_estimate.json` written to experiment folder, batch mode reduces estimated cost vs. non-batch
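The length/4 heuristic with a floor of 1 is a one-liner; a sketch consistent with the behavior the tests check (the real signature may differ):

```python
def count_tokens_approx(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text,
    # never reporting fewer than 1 token (even for empty input).
    return max(1, len(text) // 4)
```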
Validation tests (V-15 through V-17):
- V-15: invalid `probe_mode` values are rejected; valid values pass
- V-16: invalid `probe_on_fail` values are rejected; valid values pass
- V-17: `label_attributes` referencing unknown keys fails; valid subsets pass; `auto`/`complete` `target_attributes` skip validation
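The V-17 subset check can be sketched as below (assumptions: the function name is hypothetical, and `auto`/`complete` are modeled as non-dict sentinel values, which is why the check is skipped for them):

```python
def validate_label_attributes(label_attributes, target_attributes):
    """V-17 sketch: when target_attributes is a static dict, every
    label_attributes entry must be one of its keys. Non-dict targets
    (e.g. the assumed 'auto'/'complete' sentinels) skip the check."""
    if not isinstance(target_attributes, dict):
        return []  # auto/complete: nothing static to check against
    unknown = [a for a in label_attributes if a not in target_attributes]
    return [f"V-17: unknown label attribute {a!r}" for a in unknown]
```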
Related files:
- docs/spec_claude.md — formal spec (COEVAL-SPEC-001)
- Code/runner/label_eval.py — label accuracy module
- Code/runner/interfaces/cost_estimator.py — cost/time estimation
- Code/runner/interfaces/probe.py — model availability probe