Skip to content

leonardosul/mixture-llm

Repository files navigation

mixture-llm

Test Release PyPI License: MIT Docs

Combine LLMs to beat the best single LLM.

The Mixture-of-Agents architecture achieved 65.1% on AlpacaEval 2.0 using only open-source models—surpassing GPT-4o's 57.5%. This library gives you the building blocks to construct these pipelines.

Install

uv add mixture-llm

Quick start

from mixture_llm import Propose, Aggregate, run

pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b"]),
    Aggregate("gpt-5-nano-2025-08-07"),
]

result, history = await run(pipeline, "What is quantum computing?", my_client)

Paper-accurate pipelines

Together MoA (65.1% AlpacaEval)

The benchmark-winning configuration from Wang et al. (2024): 3 layers, 6 diverse proposers, Qwen aggregator.

PROPOSERS = [
    "wizardlm-2-8x22b",
    "qwen1.5-110b-chat",
    "qwen1.5-72b-chat",
    "llama-3-70b-instruct",
    "mixtral-8x22b-instruct",
    "dbrx-instruct",
]

together_moa = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-110b-chat"),
]

MoA-Lite (59.3% AlpacaEval)

Cost-optimized 2-layer variant—still beats GPT-4o.

moa_lite = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-72b-chat"),
]

Self-MoA (+6.6% over standard MoA)

Li et al. (2025) showed that sampling one top model multiple times can outperform diverse model mixtures.

# Same model, multiple samples via temperature
self_moa = [
    Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),
    Aggregate("gpt-5-nano-2025-08-07"),
]

With robustness (shuffle + dropout)

Prevents positional bias and improves diversity.

robust_moa = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-70b", "gemini-2.5-flash"]),
    Shuffle(),
    Dropout(0.2),
    Aggregate("gpt-5-nano-2025-08-07"),
]

Steps

LLM steps — call models:

  • Propose(agents) — generate initial responses in parallel
  • Synthesize(agents) — each agent synthesizes all previous outputs
  • Aggregate(agent) — single model combines everything into final output
  • Refine(agents) — improve each response individually
  • Rank(agent, n) — select top n responses by quality
  • Vote(agent) — pick consensus answer

Transform steps — manipulate responses:

  • Shuffle() — randomize order (prevents position bias)
  • Dropout(rate) — randomly drop responses (improves robustness)
  • Sample(n) — random subset
  • Take(n) — first n responses
  • Filter(fn) — keep responses matching predicate
  • Map(fn) — transform each response

Configuration

Every LLM step accepts temp and max_tokens:

Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5"], temp=0.9, max_tokens=4096)

Override the synthesis prompt:

Aggregate("gpt-5-nano-2025-08-07", prompt="Pick the single best response and return it verbatim.")

Client examples

Your client is an async function with this signature:

async def client(model, messages, temp, max_tokens) -> tuple[str, int, int]:
    # Returns (response_text, input_tokens, output_tokens)

OpenAI SDK (OpenAI + Anthropic models)

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()
anthropic_client = AsyncOpenAI(
    base_url="https://api.anthropic.com/v1/",
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

async def multi_provider_client(model, messages, temp, max_tokens):
    client = anthropic_client if model.startswith("claude") else openai_client
    # GPT-5: max_completion_tokens, no temperature, minimal reasoning
    is_gpt5 = model.startswith("gpt-5")
    params = {"model": model, "messages": messages}
    params.update({"max_completion_tokens": max_tokens, "reasoning_effort": "minimal"} if is_gpt5 else {"max_tokens": max_tokens, "temperature": temp})
    resp = await client.chat.completions.create(**params)
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Mix providers in one pipeline
pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "gpt-5-nano-2025-08-07"]),
    Aggregate("claude-sonnet-4-5"),
]

OpenRouter (access all models via one API)

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

async def openrouter_client(model, messages, temp, max_tokens):
    resp = await client.chat.completions.create(
        model=model, messages=messages, temperature=temp, max_tokens=max_tokens
    )
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Together MoA models via OpenRouter
PROPOSERS = [
    "qwen/qwen-2.5-72b-instruct",
    "meta-llama/llama-3.3-70b-instruct",
    "mistralai/mixtral-8x22b-instruct",
]

together_moa_openrouter = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen/qwen-2.5-72b-instruct"),
]

Groq via LiteLLM (free tier)

Groq offers free access to several models. Great for experimentation.

from litellm import acompletion

async def groq_client(model, messages, temp, max_tokens):
    resp = await acompletion(
        model=f"groq/{model}", messages=messages, temperature=temp, max_tokens=max_tokens
    )
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Free Groq models (check console.groq.com/docs/rate-limits for current list)
GROQ_FREE = [
    "llama-3.3-70b-versatile",
    "llama-3.1-8b-instant",
    "qwen/qwen3-32b",
    "meta-llama/llama-4-scout-17b-16e-instruct",
]

free_moa = [
    Propose(GROQ_FREE, temp=0.7, max_tokens=512),
    Aggregate("llama-3.3-70b-versatile"),
]

# Self-MoA with Groq (single model, multiple samples)
free_self_moa = [
    Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7),
    Aggregate("llama-3.3-70b-versatile"),
]

Examples

The examples/ directory contains tested, runnable scripts for different providers. See examples/EXAMPLES.md for detailed documentation.

Example Provider What You'll Learn
openai_basic.py OpenAI Basic MoA pattern (Propose → Aggregate), client setup, token tracking
openai_self_moa.py OpenAI Self-MoA technique—one model sampled 6 times beats diverse mixtures
multi_provider.py OpenAI + Anthropic Provider routing, Shuffle step to prevent position bias
openrouter_moa.py OpenRouter 3-layer MoA (Propose → Synthesize → Aggregate), paper configuration
groq_free.py Groq Free experimentation, LiteLLM integration, Dropout for robustness
with_history.py Groq Pipeline debugging, Rank step, execution history inspection
# Install and run
uv add mixture-llm[examples]
export OPENAI_API_KEY=sk-...
python examples/openai_basic.py

# Or try free with Groq
export GROQ_API_KEY=gsk_...
python examples/groq_free.py

Key findings from the research

  • Aggregator quality matters 2x more than proposer quality — invest in your final model
  • 3 layers is the sweet spot — diminishing returns beyond this
  • Diversity vs quality tradeoff — Self-MoA shows a single great model can beat diverse mediocre ones
  • 6 proposers optimal — gains diminish after this point

References

  • Wang et al. "Mixture-of-Agents Enhances Large Language Model Capabilities" (2024) — arXiv:2406.04692
  • Li et al. "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" (2025) — arXiv:2502.00674

License

MIT

About

Combine multiple LLMs for better outputs.

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages