Module:
moralstack/runtime/modules/critic_module.py
The Constitutional Critic validates responses against the system's ethical constitution.
For testers and stakeholders: The Critic returns a structured decision (PROCEED | REVISE | REFUSE). A
REFUSE decision or hard violations lead the Orchestrator to REFUSE. Tests can verify that responses violating
hard principles produce has_critical_violations=True and decision=REFUSE.
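A test of this kind might look like the sketch below. It does not import the real module: `FakeReport` and `orchestrator_outcome` are hypothetical stand-ins that only mirror the fields and the rule described above.

```python
from dataclasses import dataclass


@dataclass
class FakeReport:
    """Stand-in for CriticReport with only the fields this check needs."""
    has_critical_violations: bool
    decision: str  # "PROCEED" | "REVISE" | "REFUSE"


def orchestrator_outcome(report: FakeReport) -> str:
    """Hypothetical helper: a hard violation or a REFUSE decision forces REFUSE."""
    if report.has_critical_violations or report.decision == "REFUSE":
        return "REFUSE"
    return report.decision


def test_hard_violation_forces_refuse() -> None:
    # Even if the critic only asked for a revision, a hard violation wins.
    report = FakeReport(has_critical_violations=True, decision="REVISE")
    assert orchestrator_outcome(report) == "REFUSE"
```

In a real test suite the report would come from the critic itself, with the assertion checking both `has_critical_violations` and `decision` on the returned object.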
The Constitutional Critic:
- Identifies violations of ethical principles
- Quantifies severity of each violation
- Distinguishes between hard constraints (non-negotiable) and soft norms (flexible)
- Generates guidance for response revision
- Flags information leakage when the request warrants refusal (see below)
On a DCCL MATCH, this module is not invoked. A MODULE_DEFERRED_TO_COMPLIANCE
orchestration event documents the skip. The module's evaluation is unnecessary because
the deployer explicitly authorized the response via the developer contract.
When the request warrants refusal (e.g. terrorism, weapons, critical infrastructure targeting), the critic also flags *information leakage*: responses that provide operationally useful information, such as listing components, targets, materials, or structural elements that could be exploited, even if framed as "general principles" or "educational content."
Example: A response that refuses to give step-by-step attack instructions but enumerates key structural components (e.g.
main cables, towers, anchorages of a bridge) implicitly maps to high-value targets and is flagged as a violation. Such
leakage is mapped to CORE.DUALUSE.1 (or CORE.NM.1 when applicable).
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Response   │ ──► │  Principle   │ ──► │     LLM      │
│   + Prompt   │     │  Retrieval   │     │  Evaluation  │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                    ┌──────────────────────────────────┐
                    │           CriticReport           │
                    │  - violations[]                  │
                    │  - severity_score                │
                    │  - has_critical_violations       │
                    │  - revision_guidance             │
                    └──────────────────────────────────┘
- Principle Retrieval: Identifies applicable principles via hybrid matching
- LLM Evaluation: Analyzes the response against each principle
- Quantification: Assigns severity score for each violation
- Guidance Generation: Produces instructions for revision
@dataclass
class CriticReport:
    violations: list[Violation]      # List of detected violations
    severity_score: float            # Aggregate score [0, 1]
    has_critical_violations: bool    # True if hard constraint violated (alias: violated_hard)
    decision: str                    # "PROCEED" | "REVISE" | "REFUSE" (guides Orchestrator)
    revision_guidance: str           # Revision instructions
    raw_response: str                # LLM response (debug)
    parse_attempts: int              # Parsing attempts

@dataclass
class Violation:
    principle_id: str       # e.g. "CORE.NM.1"
    principle_title: str    # e.g. "Non-Maleficence"
    severity: float         # [0, 1]
    constraint_type: str    # "hard" or "soft"
    rationale: str          # Explanation
    evidence: str           # Citation from response

Non-negotiable violations that result in immediate refusal:
| Principle | Description |
|---|---|
| CORE.NM.1 | Physical harm |
| CORE.CSAM.1 | Child protection |
| CORE.NM.2 | Illegal activity |
| CORE.MALWARE.1 | Malware/security |
| CORE.PRIV.1 | Privacy |
| CORE.DUALUSE.1 | Dual-use information leakage |
Flexible violations that result in caveat or revision:
| Principle | Description |
|---|---|
| SOFT.HONEST.1 | Accuracy |
| SOFT.HELPFUL.1 | Usefulness |
| SOFT.VULNERABLE.1 | Vulnerable protection |
| SOFT.BALANCED.1 | Perspective balance |
severity_score = Σ(severity_i × weight_i) / Σ(weight_i)
# Weights:
# - Hard constraints: weight = 2.0
# - Soft norms: weight = 1.0

Violations with severity < 0.15 are filtered out to avoid false positives.
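The weighted mean can be illustrated with a small worked example. This is a minimal sketch, not the module's code: the `Violation` class here is a trimmed stand-in, `aggregate_severity` is a hypothetical name, and two behaviors are assumptions (the 0.15 filter is applied before averaging, and an empty list scores 0.0).

```python
from dataclasses import dataclass

HARD_WEIGHT = 2.0
SOFT_WEIGHT = 1.0
MIN_SEVERITY = 0.15  # violations below this threshold are dropped


@dataclass
class Violation:
    """Trimmed stand-in with only the fields the formula needs."""
    principle_id: str
    severity: float       # [0, 1]
    constraint_type: str  # "hard" or "soft"


def aggregate_severity(violations: list[Violation]) -> float:
    """Weighted mean of severities (assumption: filtering happens first)."""
    kept = [v for v in violations if v.severity >= MIN_SEVERITY]
    if not kept:
        return 0.0  # assumption: no remaining violations means a score of 0
    weights = [HARD_WEIGHT if v.constraint_type == "hard" else SOFT_WEIGHT for v in kept]
    weighted_sum = sum(v.severity * w for v, w in zip(kept, weights))
    return weighted_sum / sum(weights)


violations = [
    Violation("CORE.NM.1", 0.9, "hard"),
    Violation("SOFT.HONEST.1", 0.3, "soft"),
    Violation("SOFT.BALANCED.1", 0.1, "soft"),  # below 0.15, ignored
]
score = aggregate_severity(violations)
```

Here the score works out to (0.9 × 2.0 + 0.3 × 1.0) / (2.0 + 1.0), which is approximately 0.7: the hard violation counts twice as much as the soft one.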
CriticConfig (in moralstack/runtime/modules/critic_module.py) controls LLM and evaluation settings. When no explicit
config is passed (e.g. LLMConstitutionalCritic(policy, store) or create_critic(policy)), config is loaded from
environment variables (see Environment Variables).
All critic tuning can be overridden via .env. Variables are read at critic creation (CLI and benchmark); empty or
missing values use the defaults below. See .env.template for the full list. In application runs (CLI and benchmark),
.env is the single source of configuration for both critic config and model — no CLI or code path overrides these
variables.
- Default: none (uses the same model as the rest of the stack, e.g. OPENAI_MODEL or gpt-4o)
- Type: string (OpenAI model id)
- Meaning: OpenAI model used only for the constitutional critic. When set and non-empty, the CLI and benchmark create a dedicated OpenAIPolicy with this model for the critic; the rest of the stack keeps using OPENAI_MODEL. In run and benchmark this is the single source for the critic model, with no CLI override.
- Example: MORALSTACK_CRITIC_MODEL=gpt-4o-mini uses a smaller model for constitutional critique to reduce cost/latency.
- Default: 2
- Type: int (>= 1)
- Meaning: Number of parse attempts for the critic JSON response before raising an error.
Structured critic output uses OpenAI's json_object response format (response_format={"type": "json_object"} on GenerationConfig), which guarantees valid JSON and greatly reduces retries caused by malformed JSON.
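The parse-attempt limit described above can be sketched as a simple retry loop. This is an illustration under assumptions, not the module's implementation: `parse_with_retries` and its `generate` callback are hypothetical names, and the real critic may handle errors differently.

```python
import json
from typing import Callable, Optional


def parse_with_retries(generate: Callable[[], str], max_attempts: int = 2) -> tuple[dict, int]:
    """Call `generate` until its output parses as a JSON object or attempts run out.

    Returns the parsed object and the number of attempts used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue  # malformed JSON: try again
        if isinstance(parsed, dict):
            return parsed, attempt
        last_error = ValueError("expected a JSON object")
    raise ValueError(f"critic output unparseable after {max_attempts} attempts") from last_error


# Simulated model output: the first reply is truncated, the second is valid.
responses = iter(['{"decision": "REVISE"', '{"decision": "REVISE"}'])
report, attempts = parse_with_retries(lambda: next(responses))
```

In this run the first reply fails to parse and the second succeeds, so `attempts` is 2, which is the kind of value the `parse_attempts` field on the report would record.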
- Default: 384
- Type: int (>= 1)
- Meaning: Maximum tokens for the critic LLM response.

- Default: 0.1
- Type: float, clamped to [0.0, 2.0]
- Meaning: Temperature for critic LLM generation. Lower values produce more deterministic evaluations.

- Default: 0.9
- Type: float, clamped to [0.0, 1.0]
- Meaning: Nucleus sampling (top-p) for critic LLM generation. Controls diversity of token sampling.

- Default: 20
- Type: int (>= 1)
- Meaning: Maximum number of constitution principles included in the critic prompt.

- Default: false
- Type: bool (1/true/yes or 0/false/no)
- Meaning: Whether to include violation examples from principles in the critic prompt.
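The type rules listed above (defaults for empty or missing values, clamped floats, permissive booleans) could be read from the environment roughly as follows. This is a sketch only: the helper names and the `EXAMPLE_*` variable names are hypothetical, and raising a too-small integer to its minimum is an assumption.

```python
import os


def env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an int; empty or missing values use the default."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    return max(minimum, int(raw))  # assumption: values below the minimum are raised to it


def env_float_clamped(name: str, default: float, low: float, high: float) -> float:
    """Read a float and clamp it to [low, high]."""
    raw = os.environ.get(name, "").strip()
    value = float(raw) if raw else default
    return min(max(value, low), high)


def env_bool(name: str, default: bool) -> bool:
    """Accept 1/true/yes and 0/false/no (case-insensitive)."""
    raw = os.environ.get(name, "").strip().lower()
    if raw in {"1", "true", "yes"}:
        return True
    if raw in {"0", "false", "no"}:
        return False
    return default  # empty, missing, or unrecognized


# Hypothetical variable names, used only for this demonstration.
os.environ["EXAMPLE_CRITIC_TEMPERATURE"] = "3.5"
os.environ["EXAMPLE_CRITIC_INCLUDE_EXAMPLES"] = "yes"
os.environ.pop("EXAMPLE_CRITIC_MAX_TOKENS", None)

temperature = env_float_clamped("EXAMPLE_CRITIC_TEMPERATURE", 0.1, 0.0, 2.0)
include_examples = env_bool("EXAMPLE_CRITIC_INCLUDE_EXAMPLES", False)
max_tokens = env_int("EXAMPLE_CRITIC_MAX_TOKENS", 384)
```

With these inputs the out-of-range temperature is clamped to 2.0, the boolean resolves to True, and the unset token limit falls back to 384.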
from moralstack.runtime.modules.critic_module import LLMConstitutionalCritic
from moralstack.constitution.store import ConstitutionStore

# Config from env when not passed; model from MORALSTACK_CRITIC_MODEL (or main policy)
critic = LLMConstitutionalCritic(
    policy=policy,
    store=constitution_store,
)

report = critic.critique(
    prompt="User request",
    response="Response to evaluate",
    domain="medical",  # Optional: activate overlay
)

if report.has_critical_violations:
    # Refuse the request
    pass
elif report.violations:
    # Revise using the guidance
    guidance = report.revision_guidance

For the fast path, run a quick check of hard constraints only:

result = critic.quick_check(response, hard_only=True)
if not result.passed:
    # Hard constraint violation found
    print(result.critical_violation)

The generated guidance is structured to guide revisions. When the decision is REVISE or REFUSE, the critic must
include concrete safe alternatives using the format suggest: [specific topics, resources, or reframing].
Examples:
suggest: Discuss ASLR mitigations, defensive coding, secure SDLC; avoid exploit code.
suggest: Historical/regulatory overview, ethical frameworks, specialist consultation; avoid operational details.
Include disclaimer on the informational nature of the response. suggest: Recommend consultation with a medical professional.
The suggest: prefix is parsed by the Response Assembler when formatting fallback refusals. The guidance is passed to
policy.rewrite() for revision or to policy.refuse() when assembling the final refusal.
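Extracting the suggest: entries from a guidance string could be done along these lines. The function name `extract_suggestions` and the regular expression are assumptions made for illustration; the Response Assembler's actual parsing may differ.

```python
import re

# Capture the text after each "suggest:" up to the next "suggest:" or the end.
_SUGGEST = re.compile(r"suggest:\s*(.+?)(?=\s*suggest:|$)", re.IGNORECASE | re.DOTALL)


def extract_suggestions(guidance: str) -> list[str]:
    """Return the text of every suggest: entry, in order of appearance."""
    return [match.strip() for match in _SUGGEST.findall(guidance)]


guidance = (
    "Include disclaimer on the informational nature of the response. "
    "suggest: Recommend consultation with a medical professional."
)
suggestions = extract_suggestions(guidance)
```

Text before the first suggest: prefix is treated as general guidance and is not returned, so `suggestions` holds a single entry for the string above.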
CriticReport(
    violations=[
        Violation(
            principle_id="MED.DISCLAIMER.1",
            principle_title="Medical Disclaimer",
            severity=0.7,
            constraint_type="soft",
            rationale="Absence of medical disclaimer",
            evidence="The response provides advice without specifying..."
        )
    ],
    severity_score=0.7,
    has_critical_violations=False,
    revision_guidance="Include appropriate disclaimer and recommend professional consultation"
)

# No violations found
report = CriticReport.empty()

# Fallback on critical error
report = CriticReport.from_error("Parsing failed")
# Assumes worst case: severity_score=1.0, has_critical_violations=True

The Critic determines Orchestrator decisions:
if report.has_critical_violations:
    decision = DecisionType.REFUSE
elif report.violations:
    decision = DecisionType.REVISE
    # Aggregate guidance for revision
else:
    decision = DecisionType.CONTINUE

- Constitution Store - Ethical principle management
- Orchestrator - Flow coordination
- Policy LLM - Guided revision