RAG + RLM #22

@apenab

How RLM complements RAG (the 3 most commonly used patterns today)

Recursive Language Models (RLM), proposed by Alex Zhang et al. (MIT CSAIL, arXiv:2512.24601, Dec 2025), do not replace RAG. They extend it by treating the context as a programmable environment (a REPL) that the model inspects through code, instead of stuffing everything into the prompt. This mitigates "context rot" and lets the model handle contexts of millions of tokens in a more precise and structured way.
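To make the "context as environment" idea concrete, here is a minimal sketch, not the paper's implementation. In a real RLM the model itself writes the code that slices the context and spawns sub-calls. Here that decomposition is hard-coded, and `toy_llm`, `recursive_answer` and `max_chars` are hypothetical names invented for illustration.

```python
from typing import Callable

def toy_llm(query: str, context: str) -> str:
    """Hypothetical stand-in for a model call: returns the context lines
    that mention the query term."""
    hits = [ln for ln in context.splitlines() if query.lower() in ln.lower()]
    return "\n".join(hits)

def recursive_answer(query: str, context: str,
                     llm: Callable[[str, str], str],
                     max_chars: int = 200) -> str:
    """Answer `query` over `context` without passing more than
    `max_chars` characters to any single model call."""
    if len(context) <= max_chars:
        return llm(query, context)              # base case: fits in one call
    lines = context.splitlines()
    if len(lines) < 2:
        return llm(query, context[:max_chars])  # cannot split further: truncate
    mid = len(lines) // 2
    left = recursive_answer(query, "\n".join(lines[:mid]), llm, max_chars)
    right = recursive_answer(query, "\n".join(lines[mid:]), llm, max_chars)
    # The partial answers form a new, much smaller context.
    merged = "\n".join(part for part in (left, right) if part)
    return llm(query, merged) if len(merged) <= max_chars else merged

# The long context lives in a variable, never in a single prompt.
long_context = "\n".join(
    "line 42: the secret code is 1234" if i == 42 else f"line {i}: filler"
    for i in range(100)
)
answer = recursive_answer("secret", long_context, toy_llm)
print(answer)
```

The point of the sketch is only the shape of the computation: each call sees a small slice, and the sub-results are merged by another small call.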

The three main ways to combine RLM + RAG that are already used in practice (see the official repo alexzhang13/rlm, Prime Intellect, forks and community experiments):

  1. RAG as initial filter + RLM as deep reasoner

    • RAG (vector search, BM25, GraphRAG, etc.) quickly and cheaply retrieves relevant chunks from a large corpus.
    • Those chunks are loaded as variables into the RLM REPL (e.g., context_chunks = retrieve(...)).
    • The RLM explores them recursively: it splits them, compares them, checks for contradictions, performs multi-hop reasoning, and extracts structures, without ever placing all of the chunks in a single prompt.
      Key advantage: Avoids degradation from "attention overload" when RAG returns 20–50 chunks.
      → Common use: multi-document QA or analysis of large codebases.
  2. RLM as smart retrieval director (Agentic / Adaptive RAG)

    • The RLM dynamically decides:
      • Which queries to make to the vector DB / retriever.
      • When it needs more retrieval (refinement loops).
      • How to rerank, filter or combine results.
      • Adaptive multi-hop strategies.
    • The CodeAct+BM25 setup in the paper and the official repo already gives the agent a SEARCH(query) tool inside the recursive loop.
      Key advantage: Much smarter retrieval and less noise than a static RAG (the model "thinks" before searching).
      → Common use: long-research agents or tasks requiring iterative exploration.
  3. Full hybrid: native RLM with integrated retrieval tools

    • The RLM REPL includes search tools directly (e.g., vector_retrieve(k, query), bm25_search(...), or wrappers to Pinecone/Chroma/Qdrant).
    • The model can call them at any step in the recursive loop, process results, store sub-results in variables and call sub-RLMs.
    • This is essentially Agentic RAG taken one step further: the agent not only retrieves, but also reasons recursively over what it retrieved.
      Key advantage: Maximum flexibility and accuracy for extreme contexts (10M+ tokens).
      → Common use: long-horizon agents, analysis of massive document collections or workflows with persistent state.
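The patterns above share one control flow: a search tool is available inside the loop, results are stored in a variable, and a decision step chooses the next query or stops. The sketch below shows that loop under heavy assumptions. `CORPUS`, `search`, `decide_next_query` and `agentic_retrieve` are hypothetical, the retriever is a toy word-overlap ranker standing in for BM25 or a vector store, and the decision step is a scripted heuristic standing in for the model.

```python
from typing import List, Optional

CORPUS = [
    "Alice manages the billing service.",
    "The billing service depends on the ledger database.",
    "The ledger database is hosted in region eu-west.",
    "Bob maintains the search frontend.",
]

def search(query: str, k: int = 3) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query.
    A real setup would call BM25 or a vector store here."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().rstrip(".").split())), doc)
              for doc in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:k]]

def decide_next_query(evidence: List[str]) -> Optional[str]:
    """Stand-in for the model's decision step: stop once a region is
    known, otherwise follow the entity named after the last 'the '."""
    last = evidence[-1]
    if "region" in last:
        return None
    return last.rstrip(".").rsplit("the ", 1)[-1]

def agentic_retrieve(first_query: str, max_steps: int = 5) -> List[str]:
    evidence: List[str] = []             # plays the role of a REPL variable
    query: Optional[str] = first_query
    for _ in range(max_steps):
        if query is None:
            break                        # the decision step asked to stop
        fresh = [d for d in search(query) if d not in evidence]
        if not fresh:
            break                        # nothing new to retrieve
        evidence.append(fresh[0])
        query = decide_next_query(evidence)
    return evidence

hops = agentic_retrieve("Alice")
for doc in hops:
    print(doc)
```

In pattern 1 the loop would run once with a fixed query and hand `evidence` to the recursive reasoner. In patterns 2 and 3 the model drives every iteration.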

Bonus: How this fits with Smart Router + Parallelism

  • A Smart Router (a heuristic or small model) decides:
    • Simple query → classic RAG (fast/cheap).
    • Complex/long → RLM + hybrid RAG.
  • Parallelism (already used in Prime Intellect and forks) launches multiple retrievals or sub-LLM calls concurrently, which reduces latency.
    → Reported results: 30–50% higher accuracy on complex tasks than RAG-only, at 60–80% lower cost than pure RLM.
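A rough sketch of the router and the parallel fan-out, assuming a purely heuristic router. The `token_limit` threshold, the cue words, and the names `route`, `parallel_retrieve` and `fake_retrieve` are all hypothetical and would need tuning or replacing in a real system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical cue words suggesting multi-document or multi-hop work.
MULTI_HOP_CUES = ("compare", "across", "contradict", "timeline")

def route(query: str, context_tokens: int, token_limit: int = 8000) -> str:
    """Heuristic router: send long or multi-hop queries to the RLM path,
    everything else to classic RAG."""
    is_long = context_tokens > token_limit
    is_complex = any(cue in query.lower() for cue in MULTI_HOP_CUES)
    return "rlm_hybrid" if (is_long or is_complex) else "classic_rag"

def parallel_retrieve(queries: List[str],
                      retrieve: Callable[[str], str],
                      max_workers: int = 4) -> List[str]:
    """Run several retrievals concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(retrieve, queries))

def fake_retrieve(query: str) -> str:
    """Stand-in for a call to a vector DB or a sub-LLM."""
    return f"results for {query!r}"

print(route("What is the refund policy?", context_tokens=1200))
print(route("Compare the two contracts", context_tokens=1200))
print(parallel_retrieve(["pricing", "sla"], fake_retrieve))
```

Threads fit here because retrieval and model calls are I/O-bound. An async client would work equally well.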
