
feat: multi-turn behavioral drift infrastructure (v3.1.0) #3

Merged
StressTestor merged 11 commits into main from feat/multi-turn-drift-dataset on Mar 29, 2026

@StressTestor

Owner

summary

infrastructure for converting promptpressure from single-turn eval to multi-turn drift detection. no new eval sequences yet, just the plumbing.

tier system

  • 4-tier run system: smoke (CI, <60s), quick (dev, <10min), full (~1hr), deep (everything)
  • --tier smoke|quick|full|deep flag with --smoke and --quick shortcuts
  • cumulative filtering: --tier quick includes smoke + quick entries
  • exits non-zero when tier produces 0 matches (prevents CI false-passes)
  • Literal type validation on the tier config field (catches bad YAML at load time)
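a minimal sketch of the tier plumbing described above, for reviewers. `TIER_ORDER`, `filter_by_tier`, the default of 'full' for untagged entries, and the CLI default of quick come from the commit messages; `parse_tier`, `run_tier`, and `DEFAULT_ENTRY_TIER` are hypothetical names, and wiring the flags with `argparse` is an assumption, not necessarily how the repo's CLI is built.

```python
from __future__ import annotations

import argparse
import logging
from typing import Literal, get_args

logger = logging.getLogger(__name__)

# Cumulative tier order: a run at tier T includes every entry at or below T.
Tier = Literal["smoke", "quick", "full", "deep"]
TIER_ORDER: list[str] = list(get_args(Tier))
DEFAULT_ENTRY_TIER = "full"  # hypothetical name; untagged entries count as "full"


def filter_by_tier(entries: list[dict], tier: str) -> list[dict]:
    """Keep entries whose tier index is <= the requested tier's index."""
    if tier not in TIER_ORDER:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIER_ORDER}")
    limit = TIER_ORDER.index(tier)
    selected = []
    for entry in entries:
        entry_tier = entry.get("tier", DEFAULT_ENTRY_TIER)
        if entry_tier not in TIER_ORDER:
            # Excluded, but logged so bad dataset tags are visible.
            logger.warning("entry %s has invalid tier %r; skipping",
                           entry.get("id"), entry_tier)
            continue
        if TIER_ORDER.index(entry_tier) <= limit:
            selected.append(entry)
    return selected


def parse_tier(argv: list[str] | None = None) -> str:
    """Hypothetical CLI: --tier plus --smoke/--quick shortcuts, default quick."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tier", choices=TIER_ORDER, default="quick")
    # Shortcuts write to the same destination as --tier.
    parser.add_argument("--smoke", dest="tier", action="store_const", const="smoke")
    parser.add_argument("--quick", dest="tier", action="store_const", const="quick")
    return parser.parse_args(argv).tier


def run_tier(entries: list[dict], argv: list[str] | None = None) -> int:
    """Return a process exit code; zero matches is a failure, not a pass."""
    tier = parse_tier(argv)
    selected = filter_by_tier(entries, tier)
    if not selected:
        logger.error("tier %r matched 0 entries", tier)
        return 1
    print(f"running {len(selected)} entries at tier {tier!r}")
    return 0
```

a caller would do `sys.exit(run_tier(dataset, sys.argv[1:]))`. comparing list indices keeps the cumulative rule in one place: adding a tier only means inserting it into `Tier` at the right position.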

per-turn metrics

  • response_length_ratio computed after each turn (no LLM calls)
  • metrics attached to turn_responses and aggregated in result_data
  • foundation for drift detection across conversation turns
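a rough sketch of how per-turn metrics could be computed and aggregated without LLM calls. the PR does not define `response_length_ratio` precisely; this assumes it is the current response's length divided by the first turn's response length (undefined, so `None`, when the first response is empty). `compute_turn_metrics` is named in the commit log but its signature here is hypothetical, as are `aggregate_metrics` and `run_conversation`.

```python
from __future__ import annotations


def compute_turn_metrics(response: str, first_response: str, turn_number: int = 1) -> dict:
    """Cheap per-turn metrics. Assumption: ratio is measured against turn 1."""
    baseline = len(first_response)
    ratio = len(response) / baseline if baseline else None
    return {"turn_number": turn_number, "response_length_ratio": ratio}


def aggregate_metrics(turn_responses: list[dict]) -> list[dict]:
    # Membership test, not truthiness: an empty metrics dict is still kept.
    return [turn["metrics"] for turn in turn_responses if "metrics" in turn]


def run_conversation(responses: list[str]) -> dict:
    """Hypothetical driver: attach metrics to each turn, then aggregate."""
    turn_responses = []
    for turn_number, response in enumerate(responses, start=1):
        metrics = compute_turn_metrics(response, responses[0], turn_number)
        turn_responses.append(
            {"turn": turn_number, "response": response, "metrics": metrics}
        )
    return {
        "turn_responses": turn_responses,
        "per_turn_metrics": aggregate_metrics(turn_responses),
    }
```

a ratio well below 1.0 on later turns suggests terse drift; well above 1.0 suggests verbose drift.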

multi-turn hardening

  • per-turn timeout scaling capped at 5x base (prevents 26-min hangs)
  • context window token estimation with warning at ~6k tokens
  • traceback preservation on timeout errors
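a sketch of the hardening rules, using the formula from the commit log, `base * (1 + turn * 0.5)`, capped at `base * 5`. assuming turns count from 1, the cap takes over at turn 8 (1 + 8 × 0.5 = 5), so turns 8 through 20 of a deep-tier sequence all get 5x base. the 4-characters-per-token estimate, the use of `asyncio.wait_for`, and the names `turn_timeout`, `estimate_tokens`, `check_context`, and `send_turn` are illustrative assumptions, not the repo's code.

```python
from __future__ import annotations

import asyncio
import logging

logger = logging.getLogger(__name__)

MAX_TIMEOUT_MULTIPLIER = 5   # cap from the PR: never wait more than 5x base
CONTEXT_WARN_TOKENS = 6000   # warn before an 8k-context model overflows
CHARS_PER_TOKEN = 4          # rough heuristic (assumption, not a real tokenizer)


def turn_timeout(base_timeout: float, turn: int) -> float:
    """base * (1 + turn * 0.5), capped at base * 5."""
    return min(base_timeout * (1 + turn * 0.5), base_timeout * MAX_TIMEOUT_MULTIPLIER)


def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m.get("content", "")) for m in messages) // CHARS_PER_TOKEN


def check_context(messages: list[dict]) -> bool:
    """Return True (and log a warning) when the conversation looks too long."""
    tokens = estimate_tokens(messages)
    if tokens > CONTEXT_WARN_TOKENS:
        logger.warning("conversation is ~%d tokens; may overflow 8k-context models", tokens)
        return True
    return False


async def send_turn(call, messages: list[dict], turn: int, base_timeout: float):
    """Hypothetical wrapper around an async adapter call."""
    check_context(messages)
    timeout = turn_timeout(base_timeout, turn)
    try:
        return await asyncio.wait_for(call(messages), timeout=timeout)
    except asyncio.TimeoutError as e:
        # "from e" chains the original exception so its traceback survives.
        raise TimeoutError(f"turn {turn} timed out after {timeout:.2f}s") from e
```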

dataset changes

  • 30 refusal sensitivity entries archived to archive/adversarial/
  • all 190 remaining entries tagged with tier/subcategory/difficulty
  • schema.json documenting the full entry format
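a minimal validator sketch for the new optional fields, following the rules in the commit log: tier in smoke/quick/full/deep, difficulty in easy/medium/hard, subcategory a non-empty string, and per_turn_expectations a list of `{turn: int, expected: str}` with turn >= 1 (the schema.json minimum). `validate_entry` is a hypothetical name; treating `prompt` as required and accepting any list of objects as a message array are assumptions.

```python
from __future__ import annotations

# Tuples, not sets: membership uses equality, so unhashable bad values don't crash.
VALID_TIERS = ("smoke", "quick", "full", "deep")
VALID_DIFFICULTIES = ("easy", "medium", "hard")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of error strings; an empty list means the entry is valid."""
    errors = []
    prompt = entry.get("prompt")  # assumption: prompt is required
    is_messages = isinstance(prompt, list) and all(isinstance(m, dict) for m in prompt)
    if not (isinstance(prompt, str) or is_messages):
        errors.append("prompt must be a string or a list of message objects")

    # New fields are optional, so legacy entries without them still pass.
    if "tier" in entry and entry["tier"] not in VALID_TIERS:
        errors.append(f"invalid tier {entry['tier']!r}")
    if "difficulty" in entry and entry["difficulty"] not in VALID_DIFFICULTIES:
        errors.append(f"invalid difficulty {entry['difficulty']!r}")
    if "subcategory" in entry:
        sub = entry["subcategory"]
        if not isinstance(sub, str) or not sub:
            errors.append("subcategory must be a non-empty string")
    if "per_turn_expectations" in entry:
        items = entry["per_turn_expectations"]
        if not isinstance(items, list):
            errors.append("per_turn_expectations must be a list")
            items = []
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                errors.append(f"per_turn_expectations[{i}] must be an object")
                continue
            turn = item.get("turn")
            if not isinstance(turn, int) or isinstance(turn, bool) or turn < 1:
                errors.append(f"per_turn_expectations[{i}].turn must be an int >= 1")
            if not isinstance(item.get("expected"), str):
                errors.append(f"per_turn_expectations[{i}].expected must be a string")
    return errors
```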

test coverage

50 tests passing. 30 new tests added across 4 test files.

  • new modules (tier, metrics): 17/17 paths (100%)
  • schema validation: 7/7 paths (100%)
  • integration (CLI wiring): 0/5 paths (requires live adapter)
  • overall: 22/27 paths (81%)

pre-landing review

19 issues found by structured + adversarial review. 9 fixed, 10 informational/deferred:

  • 5 auto-fixed (import location, traceback chain, truthiness check, default value, trailing newline)
  • 4 user-approved fixes (zero-entry exit, Literal config type, invalid tier logging, timeout cap)
  • 10 informational/deferred (pre-existing .pyc tracking, context warning noise, CSV gap, etc.)

plan completion

12/12 DONE, 0 PARTIAL, 0 NOT DONE

all 8 implementation tasks complete. no scope creep.

test plan

  • all pytest tests pass (50 tests, 0 failures)
  • tier filtering: cumulative semantics, backward compat, invalid handling
  • per-turn metrics: length ratio computation, edge cases
  • schema validation: new fields accepted, invalid values rejected, legacy entries pass

🤖 Generated with Claude Code

Joeseph Grey and others added 11 commits March 29, 2026 01:29
add tier, subcategory, difficulty, per_turn_expectations to OPTIONAL_KEYS.
validate tier values (smoke/quick/full/deep), difficulty values (easy/medium/hard),
subcategory (non-empty string), and per_turn_expectations structure ({turn: int, expected: str}).
backward compatible: old entries without new fields still validate.

TIER_ORDER = [smoke, quick, full, deep]. filter_by_tier uses index
comparison for cumulative inclusion. entries without tier field
default to 'full'. invalid tier entries are silently excluded.

--tier smoke|quick|full|deep with --smoke and --quick shortcuts.
defaults to quick via Settings model. tier flows through config dict
to run_evaluation_suite which filters using tier.filter_by_tier.

rs_001 through rs_030 moved out of default dataset. accessible via
--dataset archive/adversarial/refusal_sensitivity.json for local model
testing or authorized red-team exercises. main dataset now 190 entries.

compute_turn_metrics runs after each turn response. response_length_ratio
detects terse/verbose drift across turns. metrics attached to turn_responses
and aggregated in result_data.per_turn_metrics. no LLM calls needed.

timeout grows with turn number: base * (1 + turn * 0.5). warns when
conversation exceeds ~6000 estimated tokens (may overflow 8k context
models). prevents indefinite hangs on deep tier 20-turn sequences.

3 sycophancy entries tagged quick tier, 187 tagged full. all entries
get subcategory='general' and difficulty='medium' as defaults. these
get refined as new multi-turn sequences are added in subsequent commits.

documents all fields including new tier, subcategory, difficulty, and
per_turn_expectations. validates prompt as either string (single-turn)
or message array (multi-turn). eval_criteria is a flexible object.

- move filter_by_tier import to top-level (consistency)
- preserve traceback chain on TimeoutError (from e)
- use 'in' check for metrics aggregation (prevents future empty-dict drop)
- change turn_number default from 0 to 1 (matches schema.json minimum)
- validate tier config with Literal type (catches bad YAML at load time)
- cap timeout at base_timeout * 5 (prevents 26-min hangs on deep sequences)
- exit non-zero when tier filter produces 0 entries
- log entries with invalid tier values
- add trailing newline to archive JSON

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
StressTestor merged commit 1a4f6f5 into main on Mar 29, 2026
4 checks passed
StressTestor deleted the feat/multi-turn-drift-dataset branch on March 29, 2026 at 17:45