Description
We need to integrate the official COMPL-AI benchmark suite into MoralStack to provide an objective, standardized, and independent evaluation of our framework's compliance with the EU AI Act.
Currently, MoralStack uses an internal benchmark (scripts/benchmark_moralstack.py) based on a curated set of synthetic prompts. While useful for regression testing, an internal benchmark lacks the external validation required for academic and enterprise adoption. COMPL-AI solves this by using massive, recognized academic datasets (e.g., StrongREJECT, RealToxicityPrompts) and rigorous LLM-as-a-Judge scoring mechanisms based on the inspect_ai framework.
The Challenge
COMPL-AI is designed to evaluate generative LLMs (like GPT-4 or Claude) via standard API endpoints. MoralStack, however, is a governance framework (a guardrail) that wraps an LLM client.
To evaluate MoralStack using COMPL-AI as-is (without modifying the benchmark's source code), we need to expose MoralStack as an OpenAI-compatible API endpoint.
Proposed Solution
We should introduce a lightweight, OpenAI-compatible ASGI server (e.g., using FastAPI) that acts as a bridge between COMPL-AI and MoralStack.
The architecture will be:
COMPL-AI (client) -> POST /chat/completions -> MoralStack Bridge Server -> MoralStack govern() -> Base LLM
Key Technical Requirements for the Bridge Server:
- Endpoint Compatibility: Must expose
POST /v1/chat/completions and POST /chat/completions (since inspect_ai sometimes omits the /v1/ prefix).
- Transparent Routing: The server must instantiate the
govern() client and pass the incoming messages to it.
- Handling Refusals: When MoralStack decides to block a request (
final_action == "REFUSE"), it autonomously generates a refusal text in response.content. The server must strictly return this exact response.content to COMPL-AI, without raising HTTP 500 errors or inventing synthetic refusal messages. This allows COMPL-AI's judge to semantically evaluate MoralStack's actual refusal text.
- Message Normalization: Some COMPL-AI tasks (like
instruction_goal_hijacking) send ChatMessageSystem objects with content=None. The bridge server must filter out or normalize these null contents to empty strings before passing them to MoralStack, preventing 400 Bad Request errors from the underlying OpenAI API.
- Metadata Injection (Optional but recommended): Append MoralStack's governance metadata (action, risk category, path) to the API response payload (e.g., in a custom
moralstack_metadata field) for easier debugging during evaluation.
Implementation Plan
-
Create the Bridge Server:
- Add a new script:
scripts/complai_bridge_server.py (or similar).
- Implement the FastAPI application with the requirements listed above.
- Ensure it loads the environment variables correctly via
moralstack.utils.env_loader.
-
Create the Evaluation Runner:
- Add a shell script or Python wrapper (e.g.,
scripts/run_complai_eval.sh) that automates the evaluation process:
- Starts the bridge server in the background.
- Installs/updates
compl-ai in an isolated virtual environment.
- Sets the required environment variables (
MORALSTACK_API_KEY, MORALSTACK_BASE_URL=http://localhost:8000).
- Executes the
complai eval openai-api/moralstack/governed command for the core tasks (strong_reject, human_deception, realtoxicityprompts, instruction_goal_hijacking).
- Gracefully shuts down the bridge server upon completion.
-
Documentation:
- Add a new document in
docs/modules/complai_evaluation.md explaining how to run the external benchmark and how the bridge architecture works.
- Update the main
README.md to highlight COMPL-AI compliance testing as a core feature.
Acceptance Criteria
Additional Context
A proof-of-concept for the bridge server has already been successfully tested, achieving 100% on strong_reject and human_deception, 99.9% on realtoxicityprompts, and 80% on instruction_goal_hijacking. The PoC code can be provided in the PR.
Description
We need to integrate the official COMPL-AI benchmark suite into MoralStack to provide an objective, standardized, and independent evaluation of our framework's compliance with the EU AI Act.
Currently, MoralStack uses an internal benchmark (
scripts/benchmark_moralstack.py) based on a curated set of synthetic prompts. While useful for regression testing, an internal benchmark lacks the external validation required for academic and enterprise adoption. COMPL-AI solves this by using massive, recognized academic datasets (e.g., StrongREJECT, RealToxicityPrompts) and rigorous LLM-as-a-Judge scoring mechanisms based on theinspect_aiframework.The Challenge
COMPL-AI is designed to evaluate generative LLMs (like GPT-4 or Claude) via standard API endpoints. MoralStack, however, is a governance framework (a guardrail) that wraps an LLM client.
To evaluate MoralStack using COMPL-AI as-is (without modifying the benchmark's source code), we need to expose MoralStack as an OpenAI-compatible API endpoint.
Proposed Solution
We should introduce a lightweight, OpenAI-compatible ASGI server (e.g., using FastAPI) that acts as a bridge between COMPL-AI and MoralStack.
The architecture will be:
COMPL-AI (client) -> POST /chat/completions -> MoralStack Bridge Server -> MoralStack govern() -> Base LLMKey Technical Requirements for the Bridge Server:
POST /v1/chat/completionsandPOST /chat/completions(sinceinspect_aisometimes omits the/v1/prefix).govern()client and pass the incoming messages to it.final_action == "REFUSE"), it autonomously generates a refusal text inresponse.content. The server must strictly return this exactresponse.contentto COMPL-AI, without raising HTTP 500 errors or inventing synthetic refusal messages. This allows COMPL-AI's judge to semantically evaluate MoralStack's actual refusal text.instruction_goal_hijacking) sendChatMessageSystemobjects withcontent=None. The bridge server must filter out or normalize these null contents to empty strings before passing them to MoralStack, preventing 400 Bad Request errors from the underlying OpenAI API.moralstack_metadatafield) for easier debugging during evaluation.Implementation Plan
Create the Bridge Server:
scripts/complai_bridge_server.py(or similar).moralstack.utils.env_loader.Create the Evaluation Runner:
scripts/run_complai_eval.sh) that automates the evaluation process:compl-aiin an isolated virtual environment.MORALSTACK_API_KEY,MORALSTACK_BASE_URL=http://localhost:8000).complai eval openai-api/moralstack/governedcommand for the core tasks (strong_reject,human_deception,realtoxicityprompts,instruction_goal_hijacking).Documentation:
docs/modules/complai_evaluation.mdexplaining how to run the external benchmark and how the bridge architecture works.README.mdto highlight COMPL-AI compliance testing as a core feature.Acceptance Criteria
REFUSE,SAFE_COMPLETE, andNORMAL_COMPLETEactions by returning the exactresponse.contentgenerated by MoralStack.content=Nonemessages frominspect_ai.Additional Context
A proof-of-concept for the bridge server has already been successfully tested, achieving 100% on
strong_rejectandhuman_deception, 99.9% onrealtoxicityprompts, and 80% oninstruction_goal_hijacking. The PoC code can be provided in the PR.