feat: multi-turn behavioral drift infrastructure (v3.1.0) by StressTestor · Pull Request #3 · StressTestor/PromptPressure

StressTestor · 2026-03-29T17:41:24Z

summary

infrastructure for converting promptpressure from single-turn eval to multi-turn drift detection. no new eval sequences yet, just the plumbing.

tier system

4-tier run system: smoke (CI, <60s), quick (dev, <10min), full (~1hr), deep (everything)
--tier smoke|quick|full|deep flag with --smoke and --quick shortcuts
cumulative filtering: --tier quick includes smoke + quick entries
exits non-zero when tier produces 0 matches (prevents CI false-passes)
Literal type validation on config field (catches bad YAML at load time)
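
a minimal sketch of the Literal check, using only the standard library. the PR validates the field on its Settings model; `parse_tier` below is a hypothetical helper for illustration, not the project's code.

```python
from typing import Literal, get_args

# Allowed tier names, taken from the PR description.
Tier = Literal["smoke", "quick", "full", "deep"]


def parse_tier(value: str) -> str:
    """Return value unchanged if it is a known tier, else raise ValueError.

    Hypothetical helper: the real project attaches the Literal type to a
    config model so that bad YAML fails at load time.
    """
    allowed = get_args(Tier)
    if value not in allowed:
        raise ValueError(f"invalid tier {value!r}; expected one of {allowed}")
    return value
```

failing at load time means a typo such as `tier: fast` stops the run before any model calls are made.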

per-turn metrics

response_length_ratio computed after each turn (no LLM calls)
metrics attached to turn_responses and aggregated in result_data
foundation for drift detection across conversation turns

multi-turn hardening

per-turn timeout scaling capped at 5x base (prevents 26-min hangs)
context window token estimation with warning at ~6k tokens
traceback preservation on timeout errors
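
a rough sketch of how a token estimate with a ~6k warning could work. the 4-characters-per-token heuristic, the `role`/`content` message shape, and both function names are assumptions; the PR does not say how the estimate is computed.

```python
import warnings

CHARS_PER_TOKEN = 4      # assumption: common rough heuristic for English text
WARN_AT_TOKENS = 6000    # threshold from the PR description


def estimate_tokens(messages: list[dict]) -> int:
    """Estimate tokens from the total character length of message contents."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // CHARS_PER_TOKEN


def check_context(messages: list[dict]) -> bool:
    """Warn and return True when the estimate exceeds the threshold."""
    tokens = estimate_tokens(messages)
    if tokens > WARN_AT_TOKENS:
        warnings.warn(
            f"conversation is ~{tokens} tokens; may overflow 8k-context models"
        )
        return True
    return False
```

the estimate needs no tokenizer and no LLM call, so it is cheap enough to run before every turn.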

dataset changes

30 refusal sensitivity entries archived to archive/adversarial/
all 190 remaining entries tagged with tier/subcategory/difficulty
schema.json documenting the full entry format

test coverage

50 tests passing. 30 new tests added across 4 test files.

new modules (tier, metrics): 17/17 paths (100%)
schema validation: 7/7 paths (100%)
integration (cli wiring): 0/5 (requires live adapter)
overall: 24/29 paths (83%)

pre-landing review

19 issues found by structured + adversarial review. all resolved:

5 auto-fixed (import location, traceback chain, truthiness check, default value, trailing newline)
4 user-approved fixes (zero-entry exit, Literal config type, invalid tier logging, timeout cap)
10 informational/deferred (pre-existing .pyc tracking, context warning noise, CSV gap, etc)

plan completion

12/12 DONE, 0 PARTIAL, 0 NOT DONE

all 8 implementation tasks complete. no scope creep.

test plan

all pytest tests pass (50 tests, 0 failures)
tier filtering: cumulative semantics, backward compat, invalid handling
per-turn metrics: length ratio computation, edge cases
schema validation: new fields accepted, invalid values rejected, legacy entries pass

🤖 Generated with Claude Code

add tier, subcategory, difficulty, per_turn_expectations to OPTIONAL_KEYS. validate tier values (smoke/quick/full/deep), difficulty values (easy/medium/hard), subcategory (non-empty string), and per_turn_expectations structure ({turn: int, expected: str}). backward compatible: old entries without new fields still validate.
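
the validation rules above could look roughly like this. `validate_optional_fields` is a hypothetical name, and treating per_turn_expectations as a list of {turn, expected} objects is an assumption drawn from the commit message.

```python
VALID_TIERS = ("smoke", "quick", "full", "deep")
VALID_DIFFICULTIES = ("easy", "medium", "hard")


def validate_optional_fields(entry: dict) -> list[str]:
    """Return error strings for the new optional fields; empty list means valid."""
    errors = []
    if "tier" in entry and entry["tier"] not in VALID_TIERS:
        errors.append(f"invalid tier: {entry['tier']!r}")
    if "difficulty" in entry and entry["difficulty"] not in VALID_DIFFICULTIES:
        errors.append(f"invalid difficulty: {entry['difficulty']!r}")
    if "subcategory" in entry:
        sub = entry["subcategory"]
        if not isinstance(sub, str) or not sub.strip():
            errors.append("subcategory must be a non-empty string")
    if "per_turn_expectations" in entry:
        items = entry["per_turn_expectations"]
        if not isinstance(items, list):
            errors.append("per_turn_expectations must be a list")
        else:
            for i, item in enumerate(items):
                turn = item.get("turn") if isinstance(item, dict) else None
                # bool is a subclass of int, so exclude it explicitly
                turn_ok = isinstance(turn, int) and not isinstance(turn, bool)
                expected_ok = isinstance(item, dict) and isinstance(
                    item.get("expected"), str
                )
                if not (turn_ok and expected_ok):
                    errors.append(f"per_turn_expectations[{i}] is malformed")
    return errors
```

every check is guarded by a key-presence test, which is what keeps legacy entries without the new fields valid.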

- move filter_by_tier import to top-level (consistency)
- preserve traceback chain on TimeoutError (from e)
- use 'in' check for metrics aggregation (prevents future empty-dict drop)
- change turn_number default from 0 to 1 (matches schema.json minimum)
- validate tier config with Literal type (catches bad YAML at load time)
- cap timeout at base_timeout * 5 (prevents 26-min hangs on deep sequences)
- exit non-zero when tier filter produces 0 entries
- log entries with invalid tier values
- add trailing newline to archive JSON
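
to illustrate the traceback fix: `raise ... from e` keeps the original exception as `__cause__`. this is a sketch assuming the runner wraps calls in `asyncio.wait_for`; `call_with_timeout` is a hypothetical name.

```python
import asyncio


async def call_with_timeout(coro, timeout: float, turn_number: int = 1):
    """Await coro; on timeout, raise an error chained to the original one."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError as e:
        # "from e" stores the original exception in __cause__, so the
        # full traceback chain is still visible when the error is logged.
        raise TimeoutError(
            f"turn {turn_number} timed out after {timeout}s"
        ) from e
```

without `from e`, the re-raised error would still show the original as context, but the explicit chain marks it as the direct cause.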

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Joeseph Grey and others added 11 commits March 29, 2026 01:29

add tier filtering module with cumulative semantics

3fb66b9

TIER_ORDER = [smoke, quick, full, deep]. filter_by_tier uses index comparison for cumulative inclusion. entries without tier field default to 'full'. invalid tier entries are silently excluded.
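
a minimal sketch of the index comparison described above. the names `TIER_ORDER` and `filter_by_tier` come from the commit message; the signature and the dict-shaped entries are assumptions.

```python
TIER_ORDER = ["smoke", "quick", "full", "deep"]
DEFAULT_TIER = "full"  # entries without a tier field are treated as "full"


def filter_by_tier(entries: list[dict], tier: str) -> list[dict]:
    """Keep entries whose tier is at or below the requested tier."""
    max_index = TIER_ORDER.index(tier)  # ValueError if the requested tier is unknown
    kept = []
    for entry in entries:
        entry_tier = entry.get("tier", DEFAULT_TIER)
        if entry_tier not in TIER_ORDER:
            continue  # invalid tier on an entry: excluded
        if TIER_ORDER.index(entry_tier) <= max_index:
            kept.append(entry)
    return kept
```

because inclusion is a `<=` on list position, adding a tier later only means inserting it at the right place in `TIER_ORDER`.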

add --tier CLI flag and tier filtering to eval runner

55613d0

--tier smoke|quick|full|deep with --smoke and --quick shortcuts. defaults to quick via Settings model. tier flows through config dict to run_evaluation_suite which filters using tier.filter_by_tier.
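
one way the flag and its shortcuts could be wired with argparse. this is a sketch, not the project's CLI: the program name, `build_parser`, and `resolve_tier` are hypothetical, and the "quick" fallback stands in for the Settings default.

```python
import argparse

TIERS = ["smoke", "quick", "full", "deep"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="promptpressure")
    # The three flags write to the same destination and cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--tier", choices=TIERS, default=None)
    group.add_argument("--smoke", dest="tier", action="store_const", const="smoke")
    group.add_argument("--quick", dest="tier", action="store_const", const="quick")
    return parser


def resolve_tier(argv: list[str], config_default: str = "quick") -> str:
    """Return the CLI tier if given, otherwise the configured default."""
    args = build_parser().parse_args(argv)
    return args.tier if args.tier is not None else config_default
```

leaving the argparse default as None lets the config value win only when no flag was passed.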

archive 30 refusal sensitivity entries to archive/adversarial/

07c5a7a

rs_001 through rs_030 moved out of default dataset. accessible via --dataset archive/adversarial/refusal_sensitivity.json for local model testing or authorized red-team exercises. main dataset now 190 entries.

add per-turn response_length_ratio metric to multi-turn runner

d2d6578

compute_turn_metrics runs after each turn response. response_length_ratio detects terse/verbose drift across turns. metrics attached to turn_responses and aggregated in result_data.per_turn_metrics. no LLM calls needed.
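
a sketch of the metric under a stated assumption: the commit does not define the ratio's denominator, so this version divides each response's length by the first turn's response length. `compute_turn_metrics` is named in the commit; its signature and `collect_metrics` are hypothetical.

```python
def compute_turn_metrics(response: str, baseline: str) -> dict:
    """Compute metrics for one turn from text alone (no LLM calls).

    Assumption: the ratio compares this response to the first-turn response.
    """
    baseline_len = len(baseline)
    ratio = len(response) / baseline_len if baseline_len else 0.0
    return {"response_length_ratio": ratio}


def collect_metrics(responses: list[str]) -> list[dict]:
    """Attach a 1-based turn number and metrics to each response."""
    if not responses:
        return []
    baseline = responses[0]
    return [
        {"turn": i, **compute_turn_metrics(text, baseline)}
        for i, text in enumerate(responses, start=1)
    ]
```

under this definition a ratio drifting well below 1.0 suggests the model is getting terse, and well above 1.0 suggests it is getting verbose.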

add per-turn timeout scaling and context window warning

8ed73cc

timeout grows with turn number: base * (1 + turn * 0.5). warns when conversation exceeds ~6000 estimated tokens (may overflow 8k context models). prevents indefinite hangs on deep tier 20-turn sequences.
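
the formula above, combined with the 5x cap listed in the PR summary, can be sketched as follows. `turn_timeout` is a hypothetical name, and whether `turn` is 0- or 1-based in the real runner is not stated.

```python
def turn_timeout(base: float, turn: int, max_multiplier: float = 5.0) -> float:
    """Scale the timeout with the turn number, capped at max_multiplier * base."""
    scaled = base * (1 + turn * 0.5)
    return min(scaled, base * max_multiplier)
```

with a 60s base, turn 2 gets 120s. turn 20 would get 660s uncapped, but the cap holds it at 300s. the cap is reached at turn 8, since 1 + 8 * 0.5 = 5.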

add tier, subcategory, difficulty fields to all 190 dataset entries

2bb17a8

3 sycophancy entries tagged quick tier, 187 tagged full. all entries get subcategory='general' and difficulty='medium' as defaults. these get refined as new multi-turn sequences are added in subsequent commits.

add JSON Schema for eval dataset entry format

b354ac8

documents all fields including new tier, subcategory, difficulty, and per_turn_expectations. validates prompt as either string (single-turn) or message array (multi-turn). eval_criteria is a flexible object.
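
the string-or-array rule for prompt can be illustrated in plain python. this is only a sketch: the `role`/`content` message shape is an assumption, and `prompt_kind` is a hypothetical helper rather than part of schema.json.

```python
def prompt_kind(prompt) -> str:
    """Classify a prompt as 'single-turn' or 'multi-turn', else raise ValueError."""
    if isinstance(prompt, str):
        return "single-turn"
    # Assumption: a multi-turn prompt is a non-empty list of
    # {"role": str, "content": str} message objects.
    if isinstance(prompt, list) and prompt and all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in prompt
    ):
        return "multi-turn"
    raise ValueError("prompt must be a string or a non-empty message array")
```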

chore: bump version and changelog (v3.1.0)

92df96e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: remove tracked .pyc files (already in .gitignore)

901feb0

StressTestor merged commit 1a4f6f5 into main Mar 29, 2026
4 checks passed

StressTestor deleted the feat/multi-turn-drift-dataset branch March 29, 2026 17:45
