Skip to content

Integrate COMPL-AI Benchmark Evaluation Suite #10

@fdidonato

Description

@fdidonato

Description

We need to integrate the official COMPL-AI benchmark suite into MoralStack to provide an objective, standardized, and independent evaluation of our framework's compliance with the EU AI Act.

Currently, MoralStack uses an internal benchmark (scripts/benchmark_moralstack.py) based on a curated set of synthetic prompts. While useful for regression testing, an internal benchmark lacks the external validation required for academic and enterprise adoption. COMPL-AI solves this by using massive, recognized academic datasets (e.g., StrongREJECT, RealToxicityPrompts) and rigorous LLM-as-a-Judge scoring mechanisms based on the inspect_ai framework.

The Challenge

COMPL-AI is designed to evaluate generative LLMs (like GPT-4 or Claude) via standard API endpoints. MoralStack, however, is a governance framework (a guardrail) that wraps an LLM client.

To evaluate MoralStack using COMPL-AI as-is (without modifying the benchmark's source code), we need to expose MoralStack as an OpenAI-compatible API endpoint.

Proposed Solution

We should introduce a lightweight, OpenAI-compatible ASGI server (e.g., using FastAPI) that acts as a bridge between COMPL-AI and MoralStack.

The architecture will be:
COMPL-AI (client) -> POST /chat/completions -> MoralStack Bridge Server -> MoralStack govern() -> Base LLM

Key Technical Requirements for the Bridge Server:

  1. Endpoint Compatibility: Must expose POST /v1/chat/completions and POST /chat/completions (since inspect_ai sometimes omits the /v1/ prefix).
  2. Transparent Routing: The server must instantiate the govern() client and pass the incoming messages to it.
  3. Handling Refusals: When MoralStack decides to block a request (final_action == "REFUSE"), it autonomously generates a refusal text in response.content. The server must strictly return this exact response.content to COMPL-AI, without raising HTTP 500 errors or inventing synthetic refusal messages. This allows COMPL-AI's judge to semantically evaluate MoralStack's actual refusal text.
  4. Message Normalization: Some COMPL-AI tasks (like instruction_goal_hijacking) send ChatMessageSystem objects with content=None. The bridge server must filter out or normalize these null contents to empty strings before passing them to MoralStack, preventing 400 Bad Request errors from the underlying OpenAI API.
  5. Metadata Injection (Optional but recommended): Append MoralStack's governance metadata (action, risk category, path) to the API response payload (e.g., in a custom moralstack_metadata field) for easier debugging during evaluation.

Implementation Plan

  1. Create the Bridge Server:

    • Add a new script: scripts/complai_bridge_server.py (or similar).
    • Implement the FastAPI application with the requirements listed above.
    • Ensure it loads the environment variables correctly via moralstack.utils.env_loader.
  2. Create the Evaluation Runner:

    • Add a shell script or Python wrapper (e.g., scripts/run_complai_eval.sh) that automates the evaluation process:
      • Starts the bridge server in the background.
      • Installs/updates compl-ai in an isolated virtual environment.
      • Sets the required environment variables (MORALSTACK_API_KEY, MORALSTACK_BASE_URL=http://localhost:8000).
      • Executes the complai eval openai-api/moralstack/governed command for the core tasks (strong_reject, human_deception, realtoxicityprompts, instruction_goal_hijacking).
      • Gracefully shuts down the bridge server upon completion.
  3. Documentation:

    • Add a new document in docs/modules/complai_evaluation.md explaining how to run the external benchmark and how the bridge architecture works.
    • Update the main README.md to highlight COMPL-AI compliance testing as a core feature.

Acceptance Criteria

  • A FastAPI-based bridge server is implemented and successfully exposes MoralStack as an OpenAI-compatible endpoint.
  • The server correctly handles REFUSE, SAFE_COMPLETE, and NORMAL_COMPLETE actions by returning the exact response.content generated by MoralStack.
  • The server successfully sanitizes content=None messages from inspect_ai.
  • An automated script is provided to spin up the server, run the COMPL-AI benchmark, and tear down the server.
  • Documentation is updated to explain the architecture and usage instructions.

Additional Context

A proof-of-concept for the bridge server has already been successfully tested, achieving 100% on strong_reject and human_deception, 99.9% on realtoxicityprompts, and 80% on instruction_goal_hijacking. The PoC code can be provided in the PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions