Concilium

Concilium is a localhost-first Mercury 2 model council powered by Inception Labs. Ask a question and three dynamically assigned council members deliberate in parallel before a chair synthesizes one unified answer — Perplexity-style structure with a Mercury 2 orchestrator.

Repository: github.com/kadinsolaiman8-spec/Concilium

Features

Adaptive routing — single, lite, or full council based on task shape (MCQ, debug, long compare, agent-graded fixtures, etc.)
5-stage full pipeline — orchestrator assigns roles → parallel members → synthesizer merges responses
SSE streaming for synthesis text as it is generated
.txt upload with client-side truncation (~16k chars) and visible warning
Conversation sidebar with Neon Postgres (falls back to localStorage without DATABASE_URL)
Context usage ring with 128k token hard stop
Cursor-style activity feed during deliberation
Server-side API proxy — your Inception key never reaches the browser
Benchmark harness — open rubric suite + AA Intelligence Index proxy suite with regression gates

Product quality & benchmark report

Measurements below were captured on 2026-05-20 against this codebase. Benchmark runs require INCEPTION_API_KEY (council) and INCEPTION_API_KEY_FALLBACK (baseline + judge) in .env.local. Reproduce with npm test and the commands in benchmarks/README.md.

Unit tests

Metric	Result
Test files	17 passed
Tests	95 passed
Command	`npm test`

Coverage includes routing (council-mode), synthesis lint, AA composite grading, product SLO helpers, constraint rules, and structural benchmark checks.

Open rubric benchmark (`regression-open`)

Six fixtures (debug-01, debug-04, debug-05, howto-01, howto-02, compare-01) comparing routed council vs single-shot Mercury 2 baseline.

Metric	Result
Council win rate	100% (6 wins / 0 losses; ties excluded)
Avg council usefulness / accuracy	8.83 / 9.00
Avg baseline usefulness / accuracy	7.17 / 8.33
Routing	5× lite, 1× single
Council p50 / p95 latency	4.5s / 13.2s
API token ratio (council/baseline)	4.46×
API failure rate	0%

Regression gate target: ≥90% win rate (npm run benchmark -- --gate --run-id regression-open).

AA Intelligence Index proxy (`regression-aa-routed-v8`, 3 trials)

93 fixtures across GDPval-AA, τ²-Bench, Terminal-Bench, SciCode, AA-LCR, Omniscience, IFBench, HLE, GPQA Diamond, and CritPt shapes. Scores are a proxy aligned to AA methodology weights, not official AA suite runs.

Metric	Trial 1	Trial 2	Trial 3	Median
Routed product proxy (0–100)	94.97	93.04	94.30	94.30
Baseline proxy	85.18	83.04	83.40	83.40
Uplift ratio	1.115	1.120	1.131	1.120
Routed product token mult. (excl. full-council-eval)	1.79×	1.93×	1.95×	1.93×
Hard slice (`hard-v8` tag)	95.74	94.97	94.12	94.97
API failure rate	0%	0%	0%	0%
Agent fixture pass rate	67%	67%	67%	67%
Routed latency p50 / p95	0.9s / 6.0s	0.8s / 5.8s	0.8s / 5.7s	~0.8s / ~5.8s

Routing mix (typical trial): ~67 single, ~15 lite, ~11 full council fixtures.

Estimated AA Intelligence Index (anchored to Mercury 2 published score of 33): 33 × (proxy_council / proxy_baseline) → ~36.8 on trial 1 (heuristic ±4). Treat as indicative only.

Per-eval routed proxy (trial 1, council vs baseline):

Eval	Weight	Council	Baseline
gdpval-aa	16.7%	88.8	80.6
terminal-bench	16.7%	87.5	75.0
aa-omniscience	12.5%	100.0	100.0
hle	12.5%	100.0	90.0
tau2-bench	8.3%	97.1	60.0
scicode	8.3%	100.0	100.0
aa-lcr	6.3%	86.7	71.2
ifbench	6.3%	100.0	100.0
gpqa-diamond	6.3%	100.0	83.3
critpt	6.3%	100.0	100.0

Wave 8 regression gate targets (see benchmarks/README.md): routed proxy median ≥97, open win rate ≥90%, routed product token mult. ≤1.40×, GDPval ≥92, agent fixtures 6/6, etc. Latest 3-trial median proxy (94.30) and token mult. (1.93×) are below gate — documented for transparency; run npm run benchmark:regression after changes.

Cost & verbosity (trial 1 extrapolation)

Metric	Council	Mercury 2 baseline
API token multiplier	3.13×	1×
Est. full Index output tokens	~219M	70M (published)
Est. Index eval cost	~$252	~$81 (published)
Effective synthesis throughput	~73 t/s	746 t/s (published single-call)

Known product limitations

All agents use Mercury 2 with different prompts — not true multi-vendor diversity.
Council improves structure and perspective-taking, not guaranteed factual verification.
Each user message can trigger 1–5+ API calls depending on route; monitor Inception usage.
AA proxy scores use the same model family for rubric items; MCQ/checklist/IFBench reduce bias.
Agent-graded fixtures passed 4/6 in latest runs (67%).

Security report (Snyk)

Scans run via Snyk MCP on 2026-05-20 against c:\Users\kadin\Downloads\Mercury2-Council.

Open-source dependencies (SCA)

Severity	Package	Issue	Status
~~Medium~~	~~`postcss@8.4.31` (via `next`)~~	CVE-2026-41305 XSS (CWE-79)	Resolved — `package.json` `overrides` pin `postcss@^8.5.10`; re-scan shows 0 SCA issues

Re-verify after npm install:

npx snyk test

Static analysis (Snyk Code)

Severity	Location	Finding	Notes
High	`src/lib/benchmark/*.ts`	Hardcoded non-crypto secret	False positive — sentinel `apiKey: "fallback-only"` routes to `INCEPTION_API_KEY_FALLBACK`, not a real credential
Medium	`src/lib/agent-task.ts`	Path traversal to `readFile`	Mitigated — `resolveWithinWorkspace()` blocks `..` and paths outside the agent temp dir (real shell is opt-in via `AGENT_REAL_SHELL=1`)

Secret scanning

Snyk secret scan was not available for this org (SNYK-CLI-0016). Manual secret scan (same intent as Snyk secrets):

Check	Result
API key / token / PEM patterns in `src/`	None
High-entropy literals in source	None
Secrets in `git` history	None (only `.env.example` placeholders)
`.env.local` committed	No — gitignored; exists only on disk locally

Full methodology and reproducible commands: SECURITY.md.

Application security practices

Browser calls only /api/council; Authorization headers stay on the server
Context uploads truncated client-side; 128k token ceiling server-enforced
Optional Neon Postgres for persistence — use TLS connection strings

When to use council vs direct answer

Task shape	Routed mode
MCQ, exact fact, terminal one-liner	single
Medium debug/howto, short compare (<400 chars)	lite
τ² checklist scenarios	single + checklist footer
GDPval short (<400 chars)	lite
Long compare (≥400 chars), GDPval long, deliverable templates	full
Brainstorm, review, multiturn	full
Agent-graded (`grading: agent`)	lite + up to 2 tool rounds

See benchmarks/README.md and docs/agent-tools-future.md.

Setup

1. Install dependencies

npm install

2. Configure environment

cp .env.example .env.local

INCEPTION_API_KEY=your_inception_api_key_here
INCEPTION_API_KEY_FALLBACK=optional_fallback_for_benchmarks
DATABASE_URL=postgresql://...   # optional — Neon Postgres
CONTEXT_LIMIT_TOKENS=128000     # optional — default 128k

Get a key from Inception Labs.

3. Run locally

npm run dev

Open http://localhost:3000.

4. Production build

npm run build
npm start

Neon Postgres (optional sidebar persistence)

Create a Neon project and run scripts/schema.sql
Set DATABASE_URL in .env.local
Restart dev server — localStorage history imports once on first load

Benchmark commands

npm run benchmark              # open rubric suite
npm run benchmark:aa-index     # AA proxy suite
npm run benchmark:compare      # compare historical runs
npm run benchmark:regression   # tests + gated open + AA v8 (3 trials)

Details: benchmarks/README.md.

API usage (Mercury 2)

curl https://api.inceptionlabs.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INCEPTION_API_KEY" \
  -d '{
    "model": "mercury-2",
    "messages": [{ "role": "user", "content": "What is the meaning of life?" }]
  }'

See APIOptions for additional client examples (placeholders only).

Project structure

Path	Purpose
`src/lib/mercury.ts`	Mercury 2 chat + stream client
`src/lib/council.ts`	Council orchestrator
`src/lib/council-mode.ts`	Adaptive routing (single/lite/full)
`src/lib/prompts.ts`	Orchestrator/chair prompts
`src/lib/benchmark/`	Rubric + AA proxy benchmark engine
`src/app/api/council/route.ts`	POST handler with optional SSE
`benchmarks/`	Fixture suites (`fixtures.jsonl`, `aa-index-fixtures.jsonl`)
`scripts/`	Benchmark runners + DB schema

License

Licensed under the Apache License, Version 2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
benchmarks		benchmarks
docs		docs
public		public
scripts		scripts
src		src
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
APIOptions		APIOptions
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
eslint.config.mjs		eslint.config.mjs
next.config.ts		next.config.ts
package-lock.json		package-lock.json
package.json		package.json
postcss.config.mjs		postcss.config.mjs
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Concilium

Features

Product quality & benchmark report

Unit tests

Open rubric benchmark (`regression-open`)

AA Intelligence Index proxy (`regression-aa-routed-v8`, 3 trials)

Cost & verbosity (trial 1 extrapolation)

Known product limitations

Security report (Snyk)

Open-source dependencies (SCA)

Static analysis (Snyk Code)

Secret scanning

Application security practices

When to use council vs direct answer

Setup

1. Install dependencies

2. Configure environment

3. Run locally

4. Production build

Neon Postgres (optional sidebar persistence)

Benchmark commands

API usage (Mercury 2)

Project structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Concilium

Features

Product quality & benchmark report

Unit tests

Open rubric benchmark (regression-open)

AA Intelligence Index proxy (regression-aa-routed-v8, 3 trials)

Cost & verbosity (trial 1 extrapolation)

Known product limitations

Security report (Snyk)

Open-source dependencies (SCA)

Static analysis (Snyk Code)

Secret scanning

Application security practices

When to use council vs direct answer

Setup

1. Install dependencies

2. Configure environment

3. Run locally

4. Production build

Neon Postgres (optional sidebar persistence)

Benchmark commands

API usage (Mercury 2)

Project structure

License

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Open rubric benchmark (`regression-open`)

AA Intelligence Index proxy (`regression-aa-routed-v8`, 3 trials)

Packages