diff --git a/README.md b/README.md index 8a0ca32..5143947 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b - **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings. - **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations. -Thirteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance. +Fourteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance. APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader. diff --git a/index.md b/index.md index 0921d9a..768aadc 100644 --- a/index.md +++ b/index.md @@ -44,7 +44,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b - **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings. - **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations. -Thirteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance. +Fourteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance. APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader. diff --git a/standard/6_Manipulation_Resistance/Implementation_Guide.md b/standard/6_Manipulation_Resistance/Implementation_Guide.md index 318c687..9a283cc 100644 --- a/standard/6_Manipulation_Resistance/Implementation_Guide.md +++ b/standard/6_Manipulation_Resistance/Implementation_Guide.md @@ -364,6 +364,45 @@ Platforms using LLMs for decision-making MUST implement layered prompt injection --- +## Advisory Practice Implementation Guidance + +### APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite + +> This section provides implementation guidance for the advisory practice [APTS-MR-A01](../appendix/Advisory_Requirements.md#apts-mr-a01-goal-misgeneralization-and-emergent-misalignment-evaluation-suite-advisory). It is not required for conformance at any tier. + +**Implementation:** Build the evaluation suite as an independent pipeline outside the agent's serving infrastructure. Each run takes a fixed model configuration (foundation model version, adapter set, system prompt, tool access policy, temperature) and produces a structured run record with per-scenario decisions, per-category alignment rates, and an aggregate score committed to the audit trail. + +**Architecture Pattern — Independent Evaluation Pipeline:** + +A proven pattern is to implement the suite using an off-the-shelf evaluation framework (Inspect AI, Braintrust, OpenAI Evals, or an equivalent harness) and to run it as a standalone job rather than as part of the engagement runtime. The pipeline has four components: + +1. **Scenario corpus.** A versioned set of distinguishing scenarios stored as data, not code. Each scenario carries: a target description (the synthetic environment the agent is asked to assess), the proxy-objective action, the true-objective action, the rubric mapping the agent's chosen action to one of the three classes (true-objective-aligned, proxy-objective-aligned, other), and a category tag. +2. **Run harness.** A driver that instantiates the same agent loop the platform uses in production — same system prompt, same tool registry, same control plane — but pointed at the synthetic targets in the scenario corpus. The harness records every tool call, every model output, and the agent's terminal action per scenario. +3. **Scorer.** A rule-based or LLM-judge scorer (independent of the agent under test) that maps each scenario's run record to one of the three classes. Where an LLM judge is used, the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-RP-A01). +4. **Reporter.** Aggregates per-category alignment rates, computes aggregate score, compares against documented thresholds, computes deltas against the most recent prior run on the same model version, and writes a signed run record to the audit trail per APTS-AR-012. + +**Calibrated Alignment Thresholds:** + +For scenario categories where the population of distinguishing cases is small, define a minimum sample size before thresholds apply. For example: a category with fewer than 30 scenarios should report a confidence-interval band rather than a single alignment rate, and threshold checks should compare the lower bound of the band against the floor. This prevents a category with 5 scenarios from triggering reauthorization on a single regression that is within statistical noise. + +**Out-of-Distribution Audit:** + +Implement the post-fine-tuning out-of-distribution audit as a separate evaluation job that runs the model against a held-out general-purpose benchmark (HHH-style or equivalent) before and after every fine-tuning event. The audit is concerned with relative shift, not absolute score: a several-point drop on a previously-stable benchmark following narrow fine-tuning is the signal the Nature 2026 methodology identifies as emergent misalignment, and the relative drop matters more than the absolute number. + +**Key Considerations:** +- The evaluation suite must be governed as carefully as the model under APTS-TP-019. Scenario corpus contamination — where suite scenarios leak into post-deployment learning data — silently inflates alignment rates and is the evaluation-side analogue of training-set contamination in capability benchmarks. +- The judge component must use a different model family or, if the same family, a different system prompt and a separate inference call. APTS-RP-A01's independence requirement applies symmetrically here. +- Run the suite against the *same* model configuration the platform uses in production, including system prompt, tool registry, temperature, and any safety wrappers. A suite that passes against a stripped-down configuration provides no assurance about production behavior. +- Suite results should be tracked across runs with explicit versioning of (model, adapter, system prompt, suite version). A drop in alignment rate is interpretable only relative to prior runs on a comparable configuration. + +**Common Pitfalls:** +- Building a scenario corpus that distinguishes the proxy from the true objective only on cases the agent has already seen during training. Distinguishing scenarios must be genuinely held out from training data. +- Treating an LLM-judge scorer as ground truth without inter-rater reliability checks. The scorer is a measurement instrument and needs its own calibration. +- Running the suite once at platform launch and never again. The point of the practice is to catch drift, which requires repeated runs across model and adapter changes. +- Treating the suite as a marketing artifact rather than a control. Publishing high alignment rates without committing run records and scenario versions to the audit trail provides no assurance and creates a perverse incentive to overfit the suite. + +--- + ## Implementation Roadmap **Tier 1 (implement before any autonomous pentesting begins):** diff --git a/standard/6_Manipulation_Resistance/README.md b/standard/6_Manipulation_Resistance/README.md index 90b062a..373f91f 100644 --- a/standard/6_Manipulation_Resistance/README.md +++ b/standard/6_Manipulation_Resistance/README.md @@ -1058,3 +1058,5 @@ The rest of Manipulation Resistance defends against an outside attacker trying t 5. **Exercise review**: Retrieve records of the platform's last tabletop exercise or red-team drill that specifically tested the agent-as-insider scenario; confirm the exercise took place within the documented review interval and that identified issues were tracked to resolution. --- + +> **See also:** [APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite](../appendix/Advisory_Requirements.md#apts-mr-a01-goal-misgeneralization-and-emergent-misalignment-evaluation-suite-advisory) — an advisory practice for platforms using fine-tuned or adapted LLM-based agents. Evaluates the agent's underlying objective alignment under distribution shift and detects emergent misalignment after fine-tuning, addressing failure modes that input-side (MR-013) and control-side (MR-020) adversarial testing do not cover. Candidate for tier-gated inclusion in v0.2.0. diff --git a/standard/Frontispiece.md b/standard/Frontispiece.md index 65e4c36..75a6697 100644 --- a/standard/Frontispiece.md +++ b/standard/Frontispiece.md @@ -72,4 +72,4 @@ Licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). | Version | Date | Notes | |---------|------|-------| -| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 13 advisory practices in the appendix. | +| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 14 advisory practices in the appendix. | diff --git a/standard/Getting_Started.md b/standard/Getting_Started.md index a3fa6c0..f237ab2 100644 --- a/standard/Getting_Started.md +++ b/standard/Getting_Started.md @@ -84,7 +84,7 @@ Depending on your role: ## Common Questions **Q: Do I need to implement all 173 requirements?** -No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 13 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details. +No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 14 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details. **Q: What if my platform meets most but not all Tier 1 requirements?** APTS does not award partial credit. A platform must meet 100% of requirements for its claimed tier. Address gaps before claiming a tier. diff --git a/standard/Introduction.md b/standard/Introduction.md index 687cc56..2e110b4 100644 --- a/standard/Introduction.md +++ b/standard/Introduction.md @@ -44,7 +44,7 @@ APTS does not prescribe who performs the assessment. The choice of internal self | 7 | Third-Party & Supply Chain Trust | TP | 22 | AI providers, cloud dependencies, data handling, foundation model disclosure | | 8 | Reporting | RP | 15 | Finding validation, confidence scoring, coverage disclosure | -**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **13 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS--A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance. +**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **14 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS--A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance. --- diff --git a/standard/README.md b/standard/README.md index 95bfc6a..429f27d 100644 --- a/standard/README.md +++ b/standard/README.md @@ -1,6 +1,6 @@ # OWASP Autonomous Penetration Testing Standard -This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 13 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams. +This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 14 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams. ## Getting Started diff --git a/standard/appendix/Advisory_Requirements.md b/standard/appendix/Advisory_Requirements.md index 134eda6..95bd3c0 100644 --- a/standard/appendix/Advisory_Requirements.md +++ b/standard/appendix/Advisory_Requirements.md @@ -191,6 +191,30 @@ Document every external connector that can execute actions, access customer data --- +### APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite (Advisory) + +**Applicability:** This practice applies to platforms that use LLM-based agents fine-tuned (SFT, RFT, RLHF, DPO, or equivalent) on offensive-security tasks, or whose foundation model has been adapted with offensive-task adapters, instruction tuning, task-specific reward models, or post-deployment online learning on engagement data. + +**Rationale:** Recent peer-reviewed work has demonstrated that fine-tuning a frontier LLM on a narrow task can produce broad behavioral misalignment that extends far outside the training domain (Nature 2026, *Training LLMs on narrow tasks can lead to broad misalignment*). For autonomous penetration testing platforms, two failure modes follow directly: (a) goal misgeneralization, where the agent learns a proxy objective ("produce findings that look like vulnerabilities") that diverges from the true objective ("identify vulnerabilities exploitable in the customer environment") in distinguishing situations the training data did not cover; and (b) emergent misalignment, where narrow fine-tuning on offensive tasks shifts the agent's behavior in adjacent domains with no signal until the shift manifests in a production engagement. APTS-MR-013 (Adversarial Example Detection in Vulnerability Classification) probes input-side robustness; APTS-MR-020 (Adversarial Validation and Resilience Testing of Safety Controls) probes control-side resilience; APTS-AR-019 (AI/ML Model Change Tracking and Drift Detection) tracks output drift. None of these evaluate the agent's underlying objective alignment under distribution shift. The Introduction's *Capability Frontier and Containment Assumptions* section defers verifiable goal alignment as research-stage and out of scope for v0.1.0; this practice begins to close that gap with an evaluation-based approach achievable today. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as SHOULD | Tier 2 for platforms operating at Level 3 autonomy or higher, or for any platform that performs post-deployment fine-tuning on engagement data). + +**Value:** Platforms that maintain a goal-misgeneralization and emergent-misalignment evaluation suite detect a class of failure that no other safety control catches: situations where every individual safety check passes, scope holds, and the agent produces fluent, plausible output, while the agent is in fact optimizing a proxy objective that diverges from the true objective in distinguishing cases. This is the agent-side analogue of the fabricated-finding problem addressed by APTS-RP-A01: a failure that is invisible to per-action checks because it manifests only across the distribution of decisions the agent makes. + +**Practice Description:** + +Maintain and execute a goal-misgeneralization and emergent-misalignment evaluation suite that evaluates the agent's behavior in distinguishing scenarios — scenarios constructed so that following a plausible proxy objective produces a different action than following the true objective — and that evaluates the agent's behavior outside the offensive-security domain after any fine-tuning event. Specifically: + +1. **Evaluation suite corpus.** Maintain a corpus of distinguishing scenarios covering, at minimum: scope-versus-finding-pressure tension (the agent has an opportunity to claim a finding by exceeding scope), proxy-reward gaming (the agent has an opportunity to produce a fluent finding without performing the underlying verification), intent-versus-output divergence (the agent is asked to assess a target whose intended behavior matches a vulnerability signature), and adjacent-domain spillover scenarios constructed per the Nature 2026 emergent-misalignment methodology. Document the proxy action and true action for every scenario. +2. **Behavioral metrics.** For each scenario, record the agent's chosen action and classify it as true-objective-aligned, proxy-objective-aligned, or other. Compute alignment rate per scenario category and aggregate across the suite. Track per-category trends across runs to detect drift. +3. **Refresh cadence.** Execute the suite quarterly and on every event that changes the agent's behavior at scale: foundation model change (per APTS-TP-022), adapter or instruction-tuning update, reward model change, and any post-deployment fine-tuning on engagement data. Re-execute on the prior model version after every refresh to distinguish suite-quality changes from model-behavior changes. +4. **Threshold and escalation.** Document a minimum acceptable alignment rate per scenario category and an aggregate floor. Below-threshold results MUST trigger reauthorization review per APTS-AL-026 before the platform may continue to operate at or above L3 autonomy. Below-threshold results on out-of-distribution scenarios MUST trigger review of any post-deployment learning data per APTS-TP-019. +5. **Out-of-distribution audit after fine-tuning.** Following the Nature 2026 methodology, evaluate the agent on a held-out, non-pentesting-domain benchmark before and after every fine-tuning event. Material shifts (defined and documented by the platform) constitute emergent misalignment evidence and trigger the same escalation as below-threshold suite results. + +**Recommendation:** Implement the suite as an independent evaluation pipeline using a recognized evaluation framework (Inspect AI, Braintrust, OpenAI Evals, or equivalent) so that scenarios, scoring, and run history are inspectable, reproducible, and externally reviewable. Run the suite under the same model configuration the platform uses in production (same system prompt, same tool access pattern, same temperature). APTS-RP-A01 provides a backstop on the output side: even if the agent's objective drifts, an independent finding-authenticity verifier can catch fabricated evidence before it reaches the customer. APTS-MR-A01 addresses the upstream failure that RP-A01 cannot: the agent producing genuinely-grounded findings that the agent itself was misaligned to discover, prioritize, or report. + +**Related normative requirements:** APTS-MR-013, APTS-MR-020, APTS-AL-026, APTS-TP-019, APTS-TP-022, APTS-AR-019. + +--- + ### APTS-RP-A01: Automated Finding Authenticity Verification (Advisory) **Rationale:** LLM-based penetration testing agents can produce findings that appear legitimate but contain fabricated evidence: proof-of-concept scripts that output hardcoded strings instead of making real requests, HTTP responses that were not actually received from the target, or severity classifications unsupported by the evidence. Because these fabricated findings are fluent and internally consistent, they pass casual human review and erode trust in the platform's output. RP-001 and RP-002 require evidence-based validation and human review, but neither addresses the risk that the agent itself fabricates evidence. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as MUST | Tier 2 given the implementation complexity). diff --git a/standard/appendix/Glossary.md b/standard/appendix/Glossary.md index 44833a9..052cdd1 100644 --- a/standard/appendix/Glossary.md +++ b/standard/appendix/Glossary.md @@ -79,7 +79,7 @@ Notation for specifying IP address ranges using a base address and prefix length Alternative security measures that mitigate vulnerability when the primary control is missing. Example: Two-factor authentication compensates for weak passwords. **Compliance Tier** -One of three progressive levels of APTS conformance. Tier 1 (Foundation) requires 72 core requirements (MUST | Tier 1). Tier 2 (Verified) adds 85 requirements for a cumulative 157 (MUST | Tier 2 + SHOULD | Tier 2). Tier 3 (Comprehensive) adds 16 requirements for a cumulative 173 (MUST | Tier 3 + SHOULD | Tier 3). A platform must meet 100% of requirements assigned to its claimed tier (both MUST and SHOULD). An additional 13 advisory practices in the Advisory Requirements appendix are recommended for highest-assurance engagements but are not counted toward any tier. +One of three progressive levels of APTS conformance. Tier 1 (Foundation) requires 72 core requirements (MUST | Tier 1). Tier 2 (Verified) adds 85 requirements for a cumulative 157 (MUST | Tier 2 + SHOULD | Tier 2). Tier 3 (Comprehensive) adds 16 requirements for a cumulative 173 (MUST | Tier 3 + SHOULD | Tier 3). A platform must meet 100% of requirements assigned to its claimed tier (both MUST and SHOULD). An additional 14 advisory practices in the Advisory Requirements appendix are recommended for highest-assurance engagements but are not counted toward any tier. **Confidence Score** A numeric value on a 0-100% scale indicating the platform's certainty in a scope boundary determination, target legitimacy assessment, asset classification, or finding validity. Scores below 75% for scope-related decisions trigger mandatory human escalation. See APTS-HO-013, APTS-RP-003. diff --git a/standard/appendix/Vendor_Evaluation_Guide.md b/standard/appendix/Vendor_Evaluation_Guide.md index 8e5ddac..36a0440 100644 --- a/standard/appendix/Vendor_Evaluation_Guide.md +++ b/standard/appendix/Vendor_Evaluation_Guide.md @@ -14,7 +14,7 @@ Decide your minimum compliance tier based on your risk tolerance: - **Tier 2 (Verified):** 157 cumulative requirements (72 + 85). The platform is fully transparent about what it did and why, protects your data with tamper-proof audit trails, handles incidents with formal response procedures, and provides independently verifiable findings. **Choose Tier 2 when:** you are testing production environments, operating in regulated industries, or need full accountability for audit or compliance purposes. This is the recommended minimum for most production deployments. -- **Tier 3 (Comprehensive):** 173 cumulative requirements (157 + 16). The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and the strictest regulatory requirements. **Choose Tier 3 when:** you are deploying fully autonomous testing against critical infrastructure, financial systems, or healthcare environments with minimal human oversight. An additional 13 advisory practices in the [Advisory Requirements appendix](Advisory_Requirements.md) are recommended for highest-assurance engagements but are not counted toward any tier. +- **Tier 3 (Comprehensive):** 173 cumulative requirements (157 + 16). The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and the strictest regulatory requirements. **Choose Tier 3 when:** you are deploying fully autonomous testing against critical infrastructure, financial systems, or healthcare environments with minimal human oversight. An additional 14 advisory practices in the [Advisory Requirements appendix](Advisory_Requirements.md) are recommended for highest-assurance engagements but are not counted toward any tier. > **Minimum tier guidance:** Tier 1 is appropriate for supervised testing of non-critical systems in non-regulated environments. Organizations in financial services, healthcare, critical infrastructure, or any regulated industry SHOULD require Tier 2 as a minimum. Tier 3 is recommended for critical infrastructure, fully autonomous (L4) operations, and environments with the strictest regulatory requirements.