Combine LLMs to beat the best single LLM.
The Mixture-of-Agents architecture achieved 65.1% on AlpacaEval 2.0 using only open-source models—surpassing GPT-4o's 57.5%. This library gives you the building blocks to construct these pipelines.
uv add mixture-llmfrom mixture_llm import Propose, Aggregate, run
pipeline = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b"]),
Aggregate("gpt-5-nano-2025-08-07"),
]
result, history = await run(pipeline, "What is quantum computing?", my_client)The benchmark-winning configuration from Wang et al. (2024): 3 layers, 6 diverse proposers, Qwen aggregator.
PROPOSERS = [
"wizardlm-2-8x22b",
"qwen1.5-110b-chat",
"qwen1.5-72b-chat",
"llama-3-70b-instruct",
"mixtral-8x22b-instruct",
"dbrx-instruct",
]
together_moa = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen1.5-110b-chat"),
]Cost-optimized 2-layer variant—still beats GPT-4o.
moa_lite = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen1.5-72b-chat"),
]Li et al. (2025) showed that sampling one top model multiple times can outperform diverse model mixtures.
# Same model, multiple samples via temperature
self_moa = [
Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),
Aggregate("gpt-5-nano-2025-08-07"),
]Prevents positional bias and improves diversity.
robust_moa = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-70b", "gemini-2.5-flash"]),
Shuffle(),
Dropout(0.2),
Aggregate("gpt-5-nano-2025-08-07"),
]LLM steps — call models:
Propose(agents)— generate initial responses in parallelSynthesize(agents)— each agent synthesizes all previous outputsAggregate(agent)— single model combines everything into final outputRefine(agents)— improve each response individuallyRank(agent, n)— select top n responses by qualityVote(agent)— pick consensus answer
Transform steps — manipulate responses:
Shuffle()— randomize order (prevents position bias)Dropout(rate)— randomly drop responses (improves robustness)Sample(n)— random subsetTake(n)— first n responsesFilter(fn)— keep responses matching predicateMap(fn)— transform each response
Every LLM step accepts temp and max_tokens:
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5"], temp=0.9, max_tokens=4096)Override the synthesis prompt:
Aggregate("gpt-5-nano-2025-08-07", prompt="Pick the single best response and return it verbatim.")Your client is an async function with this signature:
async def client(model, messages, temp, max_tokens) -> tuple[str, int, int]:
# Returns (response_text, input_tokens, output_tokens)from openai import AsyncOpenAI
openai_client = AsyncOpenAI()
anthropic_client = AsyncOpenAI(
base_url="https://api.anthropic.com/v1/",
api_key=os.environ["ANTHROPIC_API_KEY"],
)
async def multi_provider_client(model, messages, temp, max_tokens):
client = anthropic_client if model.startswith("claude") else openai_client
# GPT-5: max_completion_tokens, no temperature, minimal reasoning
is_gpt5 = model.startswith("gpt-5")
params = {"model": model, "messages": messages}
params.update({"max_completion_tokens": max_tokens, "reasoning_effort": "minimal"} if is_gpt5 else {"max_tokens": max_tokens, "temperature": temp})
resp = await client.chat.completions.create(**params)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Mix providers in one pipeline
pipeline = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "gpt-5-nano-2025-08-07"]),
Aggregate("claude-sonnet-4-5"),
]from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.environ["OPENROUTER_API_KEY"],
)
async def openrouter_client(model, messages, temp, max_tokens):
resp = await client.chat.completions.create(
model=model, messages=messages, temperature=temp, max_tokens=max_tokens
)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Together MoA models via OpenRouter
PROPOSERS = [
"qwen/qwen-2.5-72b-instruct",
"meta-llama/llama-3.3-70b-instruct",
"mistralai/mixtral-8x22b-instruct",
]
together_moa_openrouter = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen/qwen-2.5-72b-instruct"),
]Groq offers free access to several models. Great for experimentation.
from litellm import acompletion
async def groq_client(model, messages, temp, max_tokens):
resp = await acompletion(
model=f"groq/{model}", messages=messages, temperature=temp, max_tokens=max_tokens
)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Free Groq models (check console.groq.com/docs/rate-limits for current list)
GROQ_FREE = [
"llama-3.3-70b-versatile",
"llama-3.1-8b-instant",
"qwen/qwen3-32b",
"meta-llama/llama-4-scout-17b-16e-instruct",
]
free_moa = [
Propose(GROQ_FREE, temp=0.7, max_tokens=512),
Aggregate("llama-3.3-70b-versatile"),
]
# Self-MoA with Groq (single model, multiple samples)
free_self_moa = [
Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7),
Aggregate("llama-3.3-70b-versatile"),
]The examples/ directory contains tested, runnable scripts for different providers. See examples/EXAMPLES.md for detailed documentation.
| Example | Provider | What You'll Learn |
|---|---|---|
openai_basic.py |
OpenAI | Basic MoA pattern (Propose → Aggregate), client setup, token tracking |
openai_self_moa.py |
OpenAI | Self-MoA technique—one model sampled 6 times beats diverse mixtures |
multi_provider.py |
OpenAI + Anthropic | Provider routing, Shuffle step to prevent position bias |
openrouter_moa.py |
OpenRouter | 3-layer MoA (Propose → Synthesize → Aggregate), paper configuration |
groq_free.py |
Groq | Free experimentation, LiteLLM integration, Dropout for robustness |
with_history.py |
Groq | Pipeline debugging, Rank step, execution history inspection |
# Install and run
uv add mixture-llm[examples]
export OPENAI_API_KEY=sk-...
python examples/openai_basic.py
# Or try free with Groq
export GROQ_API_KEY=gsk_...
python examples/groq_free.py- Aggregator quality matters 2x more than proposer quality — invest in your final model
- 3 layers is the sweet spot — diminishing returns beyond this
- Diversity vs quality tradeoff — Self-MoA shows a single great model can beat diverse mediocre ones
- 6 proposers optimal — gains diminish after this point
- Wang et al. "Mixture-of-Agents Enhances Large Language Model Capabilities" (2024) — arXiv:2406.04692
- Li et al. "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" (2025) — arXiv:2502.00674
MIT