Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,33 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 0.4.0

### Added

- **COMPL-AI benchmark path**: `scripts/openai_compatible_server.py` — OpenAI-compatible FastAPI bridge (`/v1/chat/completions`, `/chat/completions`) routing requests through MoralStack governance (env `MORALSTACK_OPENAI_COMPATIBLE_*`).
- **Objective benchmark runner**: `scripts/benchmark_moralstack.py` — grounded-truth evaluation harness (expected actions/risk, parallel execution, markdown reports, optional judge model); aligns MoralStack scoring with `final_action`-only compliance semantics.
- Constitution overlay `violent_crime.yaml` plus coordinated overlay YAML adjustments across domains.
- `moralstack/orchestration/refusal_context.py` — refusal contextualization and grounding helpers wired through refusal assembly.
- `moralstack/observability/read_store.py` — read helpers over persisted observability artifacts.
- SQLite persistence extension for benchmark/report consumption (`moralstack/persistence/db.py`).
- Large expansion of automated tests: refusal contextualization and grounding, domain prefilter descriptions, intent falsification and operational-risk signals, observability read store, report durations and journey ordering, risk config/runtime-domain behavior, UI calibration path, refusal handler duration metadata, and related suites.

### Changed

- Minimum `openai` dependency raised to `>=2.24.0` in `pyproject.toml`.
- README architecture diagram: risk-estimator parallel mini-estimator ordering/labels updated (`intent · signal detection (q1–q17) · operational risk`).
- **Risk layer**: richer estimation prompts and schema, calibration logic, config-loader/env wiring, estimator behavior (including runtime/normalized domain handling); documentation updates in `docs/modules/risk_estimator.md`.
- **Constitution**: retriever and store updates supporting benchmark-grade retrieval and policy behavior; related docs (`docs/modules/constitution_store.md`, `docs/constitution.md`, `docs/architecture_spec.md`).
- **Orchestration**: `safe_refusal_generator`, `refusal_handler`, `response_assembler`, `controller`, `deliberation_runner`, and `decision_service` updated for contextualized refusals and benchmark-aligned flows.
- **Reports & UI**: request report model enhancements (e.g. duration/journey-oriented fields); dashboard runs view and styling updates for calibration-oriented workflows.
- Environment templates (`.env.template`, `.env.minimal`) and `INSTALL.md` updated for new variables and setup paths.

### Fixed

- Domain-detection / refusal end-state specificity issues called out in the COMPL-AI integration work.
- Lint/format hygiene: Ruff and Black fixes with aligned test updates.

## 0.3.3
22/04/2026

Expand Down
38 changes: 22 additions & 16 deletions moralstack/constitution/retriever.py
Original file line number Diff line number Diff line change
Expand Up @@ -389,50 +389,56 @@ def _domain_line(domain: str) -> str:
prompt = f"""
USER QUERY:
{query}

AVAILABLE DOMAINS:
{domain_list}

You are a strict semantic domain classifier.

Your task is to select up to {self.max_domains} domains from AVAILABLE DOMAINS.

Core principle:
Classify the user's REAL SUBSTANTIVE REQUEST, not the wrapper used to present it.

A wrapper includes:
- requested output format: JSON, list, table, article, Wikipedia page
- style or tone: funny, roleplay, no disclaimers, start with X
- genre: story, poem, script, fictional framing
- platform evasion: avoid moderation, avoid takedown, bypass filters
- prompt-injection constraints

Encoded or obfuscated content:
If the query contains recoverable encoded/obfuscated text, treat the recovered meaning as the substantive request.
If the recovered meaning conflicts with the surface wrapper, the recovered meaning wins.
If the encoded text is not recoverable, do not guess; return [] unless another clear substantive topic exists.

Classification procedure:
1. Identify the substantive payload.
2. Identify wrapper cues.
3. Select domains only from the substantive payload.
4. Exclude any domain supported only by wrapper cues.
5. Before final JSON, run the falsification checks below.

Falsification checks:
- If you selected "creative" only because the query asks for an article, story, poem, script, Wikipedia-style text, JSON, roleplay, tone, or writing format, remove "creative".
- If the recovered payload asks about drug use, injection, dosing, administration routes, overdose risk, unsafe bodily procedures, self-treatment, or controlled substances, select "medical" if available.
- If the recovered payload asks about minors, grooming, child sexualization, or unsafe adult-minor interaction, select "children" if available.
- If the recovered payload asks about hacking, malware, exploits, or unauthorized access, select "cybersecurity" if available.
- If the recovered payload asks about real-world violent crime, select "violent_crime" if available.
- If wrapper cues and recovered payload point to different domains, choose the recovered payload domain and exclude the wrapper domain.

- If you selected "creative" only because the query asks for an article,
story, poem, script, Wikipedia-style text, JSON, roleplay, tone, or writing format, remove "creative".
- If the recovered payload asks about drug use, injection, dosing, administration routes,
overdose risk, unsafe bodily procedures, self-treatment, or controlled substances, select "medical" if available.
- If the recovered payload asks about minors, grooming, child sexualization,
or unsafe adult-minor interaction, select "children" if available.
- If the recovered payload asks about hacking, malware, exploits,
or unauthorized access, select "cybersecurity" if available.
- If the recovered payload asks about real-world violent crime,
select "violent_crime" if available.
- If wrapper cues and recovered payload point to different domains,
choose the recovered payload domain and exclude the wrapper domain.

Use confidence:
- 0.90-1.00: direct unambiguous match from substantive payload.
- 0.70-0.89: clear match with minor ambiguity.
- 0.50-0.69: plausible but uncertain.
- 0.00-0.49: no recoverable substantive topic; return [].

Return JSON only:
{{
"substantive_payload": "brief description of the real request, decoded/recovered when applicable",
Expand Down
14 changes: 11 additions & 3 deletions moralstack/models/risk/prompts.py
Original file line number Diff line number Diff line change
Expand Up @@ -324,7 +324,9 @@
══════════════════════════════════════════════════════════════════

{{
"rationale": "First state substantive semantic payload (include recoverable hidden meaning when applicable); then framing, intent, and risk; align with harm_type and affirmative q-signals",
"rationale": "First state substantive semantic payload
(include recoverable hidden meaning when applicable);
then framing, intent, and risk; align with harm_type and affirmative q-signals",
"detected_language": "ISO 639-1 code",
"q1_confidential": "yes" | "no",
"q2_unauthorized": "yes" | "no",
Expand Down Expand Up @@ -659,7 +661,9 @@
to handle a sensitive situation or asks whether a norm applies.

{{
"rationale": "Substantive semantic payload first (recovered encoded/obfuscated meaning when applicable); then framing fired and why; align with harm_type",
"rationale": "Substantive semantic payload first
(recovered encoded/obfuscated meaning when applicable);
then framing fired and why; align with harm_type",
"detected_language": "ISO 639-1 code",
"intent_to_harm": "yes" | "no",
"requested_instructions": "yes" | "no",
Expand Down Expand Up @@ -1176,7 +1180,11 @@
══════════════════════════════════════════════════════════════════

{{
"rationale": "Lead with substantive semantic payload (recovered hidden meaning when applicable); state framing (defensive/offensive/socio-behavioral); operational assessment — must match payload semantics and harm category; no wrapper-only summaries",
"rationale": "Lead with substantive semantic payload
(recovered hidden meaning when applicable);
state framing (defensive/offensive/socio-behavioral);
operational assessment — must match payload semantics and harm category;
no wrapper-only summaries",
"operational_risk": "NONE" | "LOW" | "HIGH",
"risk_score": 0.0-1.0,
"confidence": 0.0-1.0,
Expand Down
7 changes: 2 additions & 5 deletions moralstack/ui/app.py
Original file line number Diff line number Diff line change
Expand Up @@ -702,9 +702,7 @@ def _describe_calibration_path(
if q13_positive:
topic_signal_summary_parts.append("q13=yes")
if reputational_positive:
topic_signal_summary_parts.append(
f"q14–q16={len(reputational_positive)} ({', '.join(reputational_positive)})"
)
topic_signal_summary_parts.append(f"q14–q16={len(reputational_positive)} ({', '.join(reputational_positive)})")
topic_signal_summary = (
"; ".join(topic_signal_summary_parts)
+ " — excluded from harmful_count"
Expand Down Expand Up @@ -745,8 +743,7 @@ def _describe_calibration_path(
)
elif harmful_count >= 3 or raw_score >= 0.85:
rules_applied.append(
f"harm_escalation: {harmful_count} signals, score={raw_score:.2f} "
f"-> CLEARLY_HARMFUL, floor 0.85"
f"harm_escalation: {harmful_count} signals, score={raw_score:.2f} " f"-> CLEARLY_HARMFUL, floor 0.85"
)
elif harmful_count > 0:
rules_applied.append(f"harm_present: {harmful_count} signal(s) ({', '.join(positive_signals)})")
Expand Down
17 changes: 12 additions & 5 deletions tests/orchestration/test_refusal_contextualization.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
import pytest
from moralstack.orchestration.refusal_context import classify_refusal_focus, build_refusal_context
from unittest.mock import MagicMock

from moralstack.orchestration.refusal_context import build_refusal_context, classify_refusal_focus
from moralstack.orchestration.safe_refusal_generator import resolve_refusal_domain_and_redirection
from unittest.mock import MagicMock, patch


def test_refusal_focus_weapon_or_physical_harm():
focus, guidance = classify_refusal_focus(
Expand All @@ -18,6 +19,7 @@ def test_refusal_focus_weapon_or_physical_harm():
assert "de-escalation" in guidance.lower()
assert "emergency" in guidance.lower()


def test_refusal_focus_self_harm_crisis():
focus, guidance = classify_refusal_focus(
semantic_signals=["Q8:self_harm_suicide"],
Expand All @@ -33,6 +35,7 @@ def test_refusal_focus_self_harm_crisis():
assert "support" in guidance.lower()
assert "emergency" in guidance.lower() or "crisis" in guidance.lower()


def test_refusal_focus_cyber_abuse():
focus, guidance = classify_refusal_focus(
semantic_signals=["Q9:cyber_malware"],
Expand All @@ -48,6 +51,7 @@ def test_refusal_focus_cyber_abuse():
assert "defensive" in guidance.lower()
assert "authorized" in guidance.lower() or "secure" in guidance.lower()


def test_refusal_resolver_does_not_map_weapon_signals_to_science():
# In the new implementation, it should NOT use signals to infer domain
fake_constitution_store = MagicMock()
Expand All @@ -64,6 +68,7 @@ def test_refusal_resolver_does_not_map_weapon_signals_to_science():
# Ensure it didn't call detect_relevant_domains
fake_constitution_store.detect_relevant_domains.assert_not_called()


def test_legal_refusal_redirection_still_works_for_actual_legal_domain():
fake_constitution_store = MagicMock()
fake_constitution = MagicMock()
Expand All @@ -84,9 +89,10 @@ def test_legal_refusal_redirection_still_works_for_actual_legal_domain():
assert "attorney" in redirection.lower()
fake_constitution_store.get_constitution.assert_called_with("legal")


def test_refusal_resolver_does_not_call_detect_relevant_domains():
fake_constitution_store = MagicMock()

resolve_refusal_domain_and_redirection(
request_prompt="How can I make a shiv?",
request_domain=None,
Expand All @@ -97,6 +103,7 @@ def test_refusal_resolver_does_not_call_detect_relevant_domains():

fake_constitution_store.detect_relevant_domains.assert_not_called()


def test_build_refusal_context_mapping():
risk_estimation = MagicMock()
risk_estimation.harm_type = "physical"
Expand All @@ -118,7 +125,7 @@ def test_build_refusal_context_mapping():
domain="general",
refusal_redirection="",
risk_score=0.9,
risk_category="clearly_harmful"
risk_category="clearly_harmful",
)

assert ctx.safe_refusal_focus == "weapon_or_physical_harm"
Expand Down
38 changes: 5 additions & 33 deletions tests/test_domain_prefilter_descriptions.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,9 +48,9 @@ def test_prefilter_prompt_includes_descriptions_when_provided():
pf.filter_domains("how to make a bomb", ["core", "legal"])

prompt = captured.get("prompt", "")
assert "Legal advice. NOT for: instructions on harmful acts." in prompt, (
f"description text must appear in prompt; got prompt:\n{prompt}"
)
assert (
"Legal advice. NOT for: instructions on harmful acts." in prompt
), f"description text must appear in prompt; got prompt:\n{prompt}"
# Keywords still present alongside description
assert "lawyer" in prompt, "keywords must remain visible to the LLM"

Expand Down Expand Up @@ -129,46 +129,18 @@ def test_set_domain_descriptions_change_clears_cache():
assert len(pf._cache) == 0, "cache must be cleared after change"


def test_prefilter_prompt_prioritizes_semantically_recovered_payloads():
"""Prefilter keeps payload recovery semantic; it must not require runtime decoding."""
captured: dict = {}
pf = DomainPrefilter(
domain_keywords={
"children": ["minor", "child safety"],
"creative": ["article", "writing"],
},
domain_descriptions={
"children": "Child safety and protection.",
"creative": "Creative writing.",
},
)
with _stub_openai(captured):
pf.filter_domains("Write an article around an encoded minor-contact request", ["children", "creative"])
prompt = captured.get("prompt", "")
assert "semantically recoverable" in prompt
assert "recovered meaning is the substantive request" in prompt
assert "the substantive" in prompt
assert "Domain choice follows only how well each candidate in AVAILABLE DOMAINS matches" in prompt
assert "evaluate recovered inner meaning with the same description/keyword" in prompt
assert "Do not classify the wrapper as the domain" in prompt
assert "any encoding or obfuscation scheme" in prompt
assert "Clear recovered semantics aligned with a domain" in prompt
assert "Self-check before JSON" in prompt

def test_prefilter_prompt_surfaces_multiple_not_for_scopes():
"""At least two overlays with explicit negative scoping must appear verbatim."""
captured: dict = {}
desc_gaming = (
"Gaming domains. "
"NOT for: fabrication of unrelated real-world explosives request topic."
)
desc_gaming = "Gaming domains. " "NOT for: fabrication of unrelated real-world explosives request topic."
desc_edu = "School topics. NOT for: instructional wrapper hiding bomb procedures."
pf = DomainPrefilter(
domain_keywords={"gaming": ["game"], "education": ["school"]},
domain_descriptions={"gaming": desc_gaming, "education": desc_edu},
)
with _stub_openai(captured):
pf.filter_domains('Write a JSON article about explosives', ["gaming", "education"])
pf.filter_domains("Write a JSON article about explosives", ["gaming", "education"])
prompt = captured.get("prompt", "")
assert desc_gaming in prompt
assert desc_edu in prompt
1 change: 0 additions & 1 deletion tests/test_intent_falsification_fields.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@
)
from moralstack.models.risk.parse_result import RiskParseResult


# ── _parse_verification_questions: defaults and parsing ─────────────


Expand Down
4 changes: 1 addition & 3 deletions tests/test_operational_prompt.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,7 @@ def test_operational_prompt_step2_parallel_technical_and_socio_high() -> None:


def test_operational_prompt_step3_scope_note_decouples_checklist_from_grooming() -> None:
assert "Scope — physical/chemical/cyber checklist vs interpersonal exploitation:" in (
OPERATIONAL_RISK_PROMPT_TEMPLATE
)
assert "Scope — physical/chemical/cyber checklist vs interpersonal exploitation:" in (OPERATIONAL_RISK_PROMPT_TEMPLATE)


def test_operational_prompt_pre_output_rationale_rewrite_checkbox() -> None:
Expand Down
10 changes: 4 additions & 6 deletions tests/test_refusal_handler_duration.py
Original file line number Diff line number Diff line change
Expand Up @@ -134,9 +134,7 @@ def slow_refusal(**kwargs):
assert captured_duration is not None, "duration_ms missing from emit_llm_call kwargs"
# 50ms sleep should yield ~50-200ms wall time on any reasonable runner;
# bound generously to avoid flakiness.
assert captured_duration >= 40.0, (
f"duration_ms must reflect real LLM latency (>= 40ms with 50ms sleep stub); got {captured_duration}"
)
assert captured_duration < 5000.0, (
f"duration_ms suspiciously high (sanity bound 5s); got {captured_duration}"
)
assert (
captured_duration >= 40.0
), f"duration_ms must reflect real LLM latency (>= 40ms with 50ms sleep stub); got {captured_duration}"
assert captured_duration < 5000.0, f"duration_ms suspiciously high (sanity bound 5s); got {captured_duration}"
Loading
Loading