Integrate COMPL-AI Benchmark Evaluation Suite

## Description
We need to integrate the official [COMPL-AI](https://compl-ai.org/) benchmark suite into MoralStack to provide an objective, standardized, and independent evaluation of our framework's compliance with the EU AI Act. 

Currently, MoralStack uses an internal benchmark (`scripts/benchmark_moralstack.py`) based on a curated set of synthetic prompts. While useful for regression testing, an internal benchmark lacks the external validation required for academic and enterprise adoption. COMPL-AI solves this by using massive, recognized academic datasets (e.g., StrongREJECT, RealToxicityPrompts) and rigorous LLM-as-a-Judge scoring mechanisms based on the `inspect_ai` framework.

## The Challenge
COMPL-AI is designed to evaluate generative LLMs (like GPT-4 or Claude) via standard API endpoints. MoralStack, however, is a governance framework (a guardrail) that wraps an LLM client. 

To evaluate MoralStack using COMPL-AI *as-is* (without modifying the benchmark's source code), we need to expose MoralStack as an OpenAI-compatible API endpoint.

## Proposed Solution
We should introduce a lightweight, OpenAI-compatible ASGI server (e.g., using FastAPI) that acts as a bridge between COMPL-AI and MoralStack. 

The architecture will be:
`COMPL-AI (client) -> POST /chat/completions -> MoralStack Bridge Server -> MoralStack govern() -> Base LLM`

### Key Technical Requirements for the Bridge Server:
1. **Endpoint Compatibility**: Must expose `POST /v1/chat/completions` and `POST /chat/completions` (since `inspect_ai` sometimes omits the `/v1/` prefix).
2. **Transparent Routing**: The server must instantiate the `govern()` client and pass the incoming messages to it.
3. **Handling Refusals**: When MoralStack decides to block a request (`final_action == "REFUSE"`), it autonomously generates a refusal text in `response.content`. The server must **strictly** return this exact `response.content` to COMPL-AI, without raising HTTP 500 errors or inventing synthetic refusal messages. This allows COMPL-AI's judge to semantically evaluate MoralStack's actual refusal text.
4. **Message Normalization**: Some COMPL-AI tasks (like `instruction_goal_hijacking`) send `ChatMessageSystem` objects with `content=None`. The bridge server must filter out or normalize these null contents to empty strings before passing them to MoralStack, preventing 400 Bad Request errors from the underlying OpenAI API.
5. **Metadata Injection (Optional but recommended)**: Append MoralStack's governance metadata (action, risk category, path) to the API response payload (e.g., in a custom `moralstack_metadata` field) for easier debugging during evaluation.

## Implementation Plan

1. **Create the Bridge Server**:
   - Add a new script: `scripts/complai_bridge_server.py` (or similar).
   - Implement the FastAPI application with the requirements listed above.
   - Ensure it loads the environment variables correctly via `moralstack.utils.env_loader`.

2. **Create the Evaluation Runner**:
   - Add a shell script or Python wrapper (e.g., `scripts/run_complai_eval.sh`) that automates the evaluation process:
     - Starts the bridge server in the background.
     - Installs/updates `compl-ai` in an isolated virtual environment.
     - Sets the required environment variables (`MORALSTACK_API_KEY`, `MORALSTACK_BASE_URL=http://localhost:8000`).
     - Executes the `complai eval openai-api/moralstack/governed` command for the core tasks (`strong_reject`, `human_deception`, `realtoxicityprompts`, `instruction_goal_hijacking`).
     - Gracefully shuts down the bridge server upon completion.

3. **Documentation**:
   - Add a new document in `docs/modules/complai_evaluation.md` explaining how to run the external benchmark and how the bridge architecture works.
   - Update the main `README.md` to highlight COMPL-AI compliance testing as a core feature.

## Acceptance Criteria
- [ ] A FastAPI-based bridge server is implemented and successfully exposes MoralStack as an OpenAI-compatible endpoint.
- [ ] The server correctly handles `REFUSE`, `SAFE_COMPLETE`, and `NORMAL_COMPLETE` actions by returning the exact `response.content` generated by MoralStack.
- [ ] The server successfully sanitizes `content=None` messages from `inspect_ai`.
- [ ] An automated script is provided to spin up the server, run the COMPL-AI benchmark, and tear down the server.
- [ ] Documentation is updated to explain the architecture and usage instructions.

## Additional Context
A proof-of-concept for the bridge server has already been successfully tested, achieving 100% on `strong_reject` and `human_deception`, 99.9% on `realtoxicityprompts`, and 80% on `instruction_goal_hijacking`. The PoC code can be provided in the PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate COMPL-AI Benchmark Evaluation Suite #10

Description

The Challenge

Proposed Solution

Key Technical Requirements for the Bridge Server:

Implementation Plan

Acceptance Criteria

Additional Context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Integrate COMPL-AI Benchmark Evaluation Suite #10

Description

Description

The Challenge

Proposed Solution

Key Technical Requirements for the Bridge Server:

Implementation Plan

Acceptance Criteria

Additional Context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions