feat: multi-turn behavioral drift infrastructure (v3.1.0) #3
Merged
add tier, subcategory, difficulty, per_turn_expectations to OPTIONAL_KEYS.
validate tier values (smoke/quick/full/deep), difficulty values (easy/medium/hard),
subcategory (non-empty string), and per_turn_expectations structure ({turn: int, expected: str}).
backward compatible: old entries without new fields still validate.
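
a minimal sketch of the optional-field checks described above. the function name, the error-list return style, and the assumption that per_turn_expectations is a list of {turn, expected} objects are illustrative, not taken from the actual validator.

```python
# Allowed values as listed in the commit message.
VALID_TIERS = ("smoke", "quick", "full", "deep")
VALID_DIFFICULTIES = ("easy", "medium", "hard")


def validate_optional_fields(entry):
    """Return a list of error strings; an empty list means the entry is valid.

    Hypothetical helper: absent fields are skipped, so old entries
    without the new keys still validate.
    """
    errors = []
    if "tier" in entry and entry["tier"] not in VALID_TIERS:
        errors.append(f"invalid tier: {entry['tier']!r}")
    if "difficulty" in entry and entry["difficulty"] not in VALID_DIFFICULTIES:
        errors.append(f"invalid difficulty: {entry['difficulty']!r}")
    if "subcategory" in entry:
        sub = entry["subcategory"]
        if not isinstance(sub, str) or not sub.strip():
            errors.append("subcategory must be a non-empty string")
    if "per_turn_expectations" in entry:
        expectations = entry["per_turn_expectations"]
        # Assumption: a list of {"turn": int, "expected": str} objects.
        if not isinstance(expectations, list):
            errors.append("per_turn_expectations must be a list")
        else:
            for i, item in enumerate(expectations):
                turn = item.get("turn") if isinstance(item, dict) else None
                expected = item.get("expected") if isinstance(item, dict) else None
                # bool is a subclass of int in Python, so reject it explicitly.
                turn_ok = isinstance(turn, int) and not isinstance(turn, bool)
                if not (turn_ok and isinstance(expected, str)):
                    errors.append(f"per_turn_expectations[{i}] must be {{turn: int, expected: str}}")
    return errors
```

an entry with none of the new keys returns an empty list, which is the backward-compatible case.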
TIER_ORDER = [smoke, quick, full, deep]. filter_by_tier uses index comparison for cumulative inclusion. entries without tier field default to 'full'. invalid tier entries are silently excluded.
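
a sketch of cumulative tier filtering by index comparison, following the description above. the exact signature of filter_by_tier is an assumption; only TIER_ORDER, the 'full' default, and the silent exclusion of invalid tiers come from the commit message.

```python
TIER_ORDER = ["smoke", "quick", "full", "deep"]


def filter_by_tier(entries, tier):
    """Keep entries whose tier is at or below the requested tier.

    Assumed signature: a list of entry dicts plus a tier name.
    """
    max_index = TIER_ORDER.index(tier)
    selected = []
    for entry in entries:
        entry_tier = entry.get("tier", "full")  # missing tier defaults to 'full'
        if entry_tier not in TIER_ORDER:
            continue  # invalid tier values are excluded without raising
        if TIER_ORDER.index(entry_tier) <= max_index:
            selected.append(entry)
    return selected
```

under this sketch, requesting quick returns smoke and quick entries, while an entry with no tier field only appears at full or deep.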
--tier smoke|quick|full|deep with --smoke and --quick shortcuts. defaults to quick via Settings model. tier flows through config dict to run_evaluation_suite which filters using tier.filter_by_tier.
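
one way the flag and its shortcuts could be wired with argparse, as a sketch. the real CLI may be built differently; DEFAULT_TIER stands in for the Settings model default, and build_parser / resolve_tier are hypothetical names.

```python
import argparse

DEFAULT_TIER = "quick"  # stand-in for the default the Settings model provides


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--tier", choices=["smoke", "quick", "full", "deep"], default=None)
    # The shortcuts write to the same destination as --tier.
    parser.add_argument("--smoke", dest="tier", action="store_const", const="smoke")
    parser.add_argument("--quick", dest="tier", action="store_const", const="quick")
    return parser


def resolve_tier(argv):
    """Return the tier from the command line, falling back to the default."""
    args = build_parser().parse_args(argv)
    return args.tier or DEFAULT_TIER
```

the resolved value would then be placed in the config dict that run_evaluation_suite reads.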
rs_001 through rs_030 moved out of default dataset. accessible via --dataset archive/adversarial/refusal_sensitivity.json for local model testing or authorized red-team exercises. main dataset now 190 entries.
compute_turn_metrics runs after each turn response. response_length_ratio detects terse/verbose drift across turns. metrics attached to turn_responses and aggregated in result_data.per_turn_metrics. no LLM calls needed.
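
a sketch of a length-based per-turn metric. the commit message does not say what response_length_ratio is measured against; this version assumes the baseline is the first turn's response length, and the signature and the collect_per_turn_metrics helper are illustrative.

```python
def compute_turn_metrics(response, baseline_length):
    """Compute cheap, LLM-free metrics for one turn's response.

    Assumption: the ratio compares this response's character count
    to the first turn's character count.
    """
    length = len(response)
    ratio = length / baseline_length if baseline_length else None
    return {"response_length": length, "response_length_ratio": ratio}


def collect_per_turn_metrics(responses):
    """Hypothetical aggregator: one metrics dict per turn, numbered from 1."""
    baseline = len(responses[0]) if responses else 0
    per_turn = []
    for turn_number, text in enumerate(responses, start=1):
        metrics = compute_turn_metrics(text, baseline)
        per_turn.append({"turn": turn_number, **metrics})
    return per_turn
```

a ratio well below 1.0 on later turns would suggest the model is getting terse; well above 1.0 would suggest it is getting verbose.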
timeout grows with turn number: base * (1 + turn * 0.5). warns when conversation exceeds ~6000 estimated tokens (may overflow 8k context models). prevents indefinite hangs on deep tier 20-turn sequences.
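
the timeout formula worked through as a sketch. the formula and the ~6000 token threshold come from the commit message; the 4-characters-per-token estimate and the function names are assumptions, since the estimation method is not stated. the optional cap mirrors the base_timeout * 5 limit mentioned in the review-fix commit.

```python
TOKEN_WARN_THRESHOLD = 6000  # from the commit message


def turn_timeout(base_timeout, turn, cap_multiplier=None):
    """Timeout for a given turn: base * (1 + turn * 0.5), optionally capped."""
    timeout = base_timeout * (1 + turn * 0.5)
    if cap_multiplier is not None:
        timeout = min(timeout, base_timeout * cap_multiplier)
    return timeout


def estimated_tokens(messages):
    """Rough token estimate. Assumption: about 4 characters per token."""
    return sum(len(text) for text in messages) // 4


def should_warn(messages):
    """True when the conversation exceeds the warning threshold."""
    return estimated_tokens(messages) > TOKEN_WARN_THRESHOLD
```

with a 60 second base, turn 20 gets 660 seconds uncapped and 300 seconds with a 5x cap.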
3 sycophancy entries tagged quick tier, 187 tagged full. all entries get subcategory='general' and difficulty='medium' as defaults. these get refined as new multi-turn sequences are added in subsequent commits.
documents all fields including new tier, subcategory, difficulty, and per_turn_expectations. validates prompt as either string (single-turn) or message array (multi-turn). eval_criteria is a flexible object.
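
a sketch of the string-or-message-array check on prompt. the message shape (role and content strings) and the non-empty requirement are assumptions; the schema itself may define messages differently.

```python
def prompt_kind(prompt):
    """Classify a prompt as 'single-turn' or 'multi-turn'.

    Raises ValueError for anything else. Assumed message shape:
    {"role": str, "content": str}.
    """
    if isinstance(prompt, str):
        return "single-turn"
    if isinstance(prompt, list) and prompt and all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in prompt
    ):
        return "multi-turn"
    raise ValueError("prompt must be a string or a non-empty list of messages")
```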
- move filter_by_tier import to top-level (consistency)
- preserve traceback chain on TimeoutError (from e)
- use 'in' check for metrics aggregation (prevents future empty-dict drop)
- change turn_number default from 0 to 1 (matches schema.json minimum)
- validate tier config with Literal type (catches bad YAML at load time)
- cap timeout at base_timeout * 5 (prevents 26-min hangs on deep sequences)
- exit non-zero when tier filter produces 0 entries
- log entries with invalid tier values
- add trailing newline to archive JSON
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
summary
infrastructure for converting promptpressure from single-turn eval to multi-turn drift detection. no new eval sequences yet, just the plumbing.
tier system
- --tier smoke|quick|full|deep flag with --smoke and --quick shortcuts
- --tier quick includes smoke + quick entries
- Literal type validation on config field (catches bad YAML at load time)
per-turn metrics
- response_length_ratio computed after each turn (no LLM calls)
multi-turn hardening
dataset changes
- archive/adversarial/
test coverage
50 tests passing. 30 new tests added across 4 test files.
pre-landing review
19 issues found by structured + adversarial review. all resolved:
plan completion
all 8 implementation tasks complete. no scope creep.
test plan
🤖 Generated with Claude Code