Experimental orchestration framework exploring how LLM agents can process datasets significantly larger than the model context window using iterative retrieval and semantic reduction.
Modern LLMs are powerful reasoning systems, but they are still limited by:
- finite context window
- expensive token usage
- limited reliability when processing large datasets
- inability to safely paginate external APIs autonomously
In real-world environments, external systems often return:
- thousands of records
- large JSON structures
- verbose logs
- ticket histories
- monitoring data
- enterprise API payloads
These datasets frequently exceed the context capacity of the model.
The goal of this project is to explore an architecture that allows an LLM agent to:
- work with large-scale external data
- iteratively load and process data
- semantically reduce tool outputs
- preserve only relevant information
- continue reasoning within limited context budgets
The project separates the system into multiple specialized layers:
| Component | Responsibility | Implementation |
|---|---|---|
| LLM1 (Main Agent / Orchestrator) | reasoning, planning, tool selection | agent.py |
| OpenAPI Adapter | OpenAPI schema parsing, REST API execution | adapters/openapi_adapter.py |
| MCP Adapter | MCP tool discovery, JSON-RPC transport execution | adapters/mcp_adapter.py |
| LLM2 (Reducer) | semantic reduction and aggregation | reducer.py |
Although the architecture uses the terms LLM1 and LLM2, this distinction is primarily logical rather than physical.
In the default configuration, both orchestration and reduction are typically performed by the same underlying model. The separation mainly represents different processing roles and prompting strategies within the pipeline.
Instead of allowing the main model to directly process extremely large datasets, the architecture:
- loads data iteratively
- processes them in chunks
- semantically compresses results
- injects only reduced outputs back into the agent context
Core concepts:
- separation of orchestration and reduction
- adaptive processing strategies
- chunk-based orchestration
- semantic reduction pipeline
- adapter-based OpenAPI and MCP integration
- short-term vs long-term memory separation
The framework supports multiple tool provider backends through adapter abstractions.
Currently implemented:
- OpenAPI adapter
- MCP adapter
Both adapters normalize external tool definitions into a unified internal orchestration format used by the agent layer.
Adapter responsibilities include:
- tool discovery and parsing
- LLM tool schema generation
- tool execution
- transport abstraction
- internal executor metadata generation
The framework intentionally separates:
- LLM-facing tool schemas
- internal executor orchestration metadata
Pagination-related parameters such as offset and limit are intentionally removed from tool schemas exposed to the LLM.
This prevents the model from attempting autonomous pagination during reasoning, because pagination orchestration is handled exclusively by the executor layer.
Relying only on SYSTEM_PROMPT instructions for pagination control was found to be insufficiently reliable in experimental runs.
Pagination orchestration is intentionally handled outside the LLM reasoning layer.
The executor controls:
- metadata retrieval
- chunk sizing
- iterative paging
- retrieval orchestration
This improves reliability compared to prompt-driven pagination.
Large tool outputs may be semantically reduced before reinjection into the orchestration context.
Reduction strategies include:
- filtering
- summarization
- aggregation
- compression of verbose structures
Reduction may be skipped for smaller payloads where orchestration overhead would exceed reduction benefits.
This enables multi-step reasoning over datasets that would otherwise exceed model limits.
The project distinguishes between:
Working memory used during orchestration:
- user messages
- tool calls
- tool outputs
- intermediate reasoning
Persistent conversation history:
- user queries
- final assistant answers
This prevents uncontrolled context growth.
/
├── adapters/
│ ├── openapi_adapter.py # OpenAPI tool adapter abstraction
│ └── mcp_adapter.py # MCP JSON-RPC adapter abstraction
│
├── agent.py # Main orchestration agent
├── misc.py # Shared helper functions
├── mock_api.py # Mock OpenAPI server
├── reducer.py # Semantic reducer (LLM2)
│
├── mockdata/ # Large mock datasets
│ ├── customer4_anonymized.json # temporarily removed from repository
│ └── ...
│
├── logs/
│ └── debug_*.log # Runtime logs
│
├── docs/
│ ├── architecture_v3.png # Architecture diagram in PNG format
│ └── architecture_v3.svg # Architecture diagram in SVG format
│
├── .env # Runtime configuration
├── requirements.txt
└── README.md
The original large mock dataset was temporarily removed from the repository because, despite anonymization efforts, it could still potentially contain sensitive information.
The user sends a query to the orchestrator.
Examples:
user_query> Zjisti vše o uživateli Baláž.
user_query> Zjisti vše o uživateli Králová.
LLM1 decides whether external data are required.
The model receives:
- dynamically generated tools
- tool descriptions
- parameter schemas
If the model responds without selecting a tool, the response is treated as the final assistant answer and the orchestration loop terminates.
If the endpoint supports pagination (typically via offset and limit parameters), it is also expected to support lightweight metadata-only requests.
Example:
GET /tickets?meta_only=true
The metadata response is used for orchestration planning before full retrieval begins.
The metadata typically includes:
- estimated token size
- total item count
- path to paginated payload data
Expected metadata structure:
{
"customer_id": customer_id,
"data_path": "tickets", # path to paginated data within the response, supports dot notation, e.g. "items.tickets"
"tokens_estimation": tokens_estimation,
"total_items": total
}Based on estimated payload size, pagination support, and effective context utilization, the executor dynamically selects one of several processing strategies:
◇ Processing Strategy
├─ Direct Pass
│ Small payloads are injected directly into context
│
├─ Single Reduction
│ Moderate payloads are reduced in a single reducer pass
│
└─ Paginated Reduction
Large payloads are chunked, reduced independently,
and merged into a final aggregated result
Typical strategy conditions:
-
Direct Pass
- small payloads
- reduction overhead would exceed benefits
- endpoints without pagination support
-
Single Reduction
- moderate payload sizes
- reduction is beneficial
- pagination is unnecessary
-
Paginated Reduction
- large payloads exceeding effective context budgets
- endpoints supporting pagination
For paginated reduction, the executor calculates:
- average tokens per item
- optimal chunk/page size
- estimated number of pages
- effective context utilization
The executor then retrieves data iteratively in pages:
GET /tickets?offset=0&limit=29
GET /tickets?offset=29&limit=29
GET /tickets?offset=58&limit=29
...
Each chunk is processed by the reducer model.
Reducer responsibilities:
- semantic filtering
- summarization
- aggregation
- removal of irrelevant payload data
- reduction of context saturation
- mitigation of attention dilution effects
The reduction pipeline helps preserve relevant information density while minimizing unnecessary token usage.
This is important because modern LLMs operate with a finite context window, and using the entire available context is not always optimal.
Large prompts may suffer from:
- attention dilution
- lost-in-the-middle effects
- degraded reasoning quality
- higher latency
- increased token costs
For this reason, the framework distinguishes between:
| Parameter | Description |
|---|---|
LLM_MAX_CONTEXT |
Maximum context size supported by the model |
LLM_CONTEXT_UTILIZATION |
Intentionally allowed fraction of usable context |
Example:
LLM_MAX_CONTEXT=128000
LLM_CONTEXT_UTILIZATION=0.25In this configuration, the orchestration layer targets approximately 32k effective context usage, despite the model supporting 128k tokens.
Reduced chunk outputs are merged and injected back into the orchestrator context.
The main agent then continues reasoning using compressed information.
The following values are approximate examples from experimental runs.
| Dataset | Original Tokens | Reduced Tokens | Reduction |
|---|---|---|---|
| Large ticket dataset | 355,000 | 18,000 | 94.9% |
| Customer communication history | 120,000 | 9,500 | 92.1% |
| Monitoring logs | 210,000 | 14,000 | 93.3% |
The actual reduction ratio depends on:
- dataset structure
- user query specificity
- reducer prompt quality
- aggregation strategy
The framework provides structured colorized console logging designed for runtime tracing, orchestration debugging, and reducer inspection.
The logging system visually distinguishes individual orchestration stages and data flows, making complex LLM interactions significantly easier to analyze in real time.
| Color | Meaning |
|---|---|
| Blue text | REQUEST sent to the main LLM |
| Green text | RESPONSE received from the main LLM |
| Yellow text | Tool execution results |
| Blue/green text on blue background | Reducer LLM requests and responses |
| Purple text | Response structure and item types |
| Red text on white background | Token statistics, reductions, context usage |
The image above demonstrates the approximate runtime appearance of the logging system.
The project includes runtime logging for:
- selected tools
- API calls
- pagination
- chunk processing
- token reduction statistics
- reducer activity
- error handling
Example log output:
[agent] LLM TOOL SELECTED: get_tickets, [ARGS: {'customer_id': 4}]
[agent] API end-point supports metadata & pagination
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&meta_only=True
[agent] [META] total_items=261 total_tokens_est=355116
[agent] [STRATEGY] PAGING ACTIVATED
[agent] [CHUNKING] one chunk budget=32768 tokens
[agent] [CHUNKING] avg_tokens_per_item=1117.10
[agent] [CHUNKING] items count in one chunk: limit=29
[agent] [CHUNKING] total pages: 9
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&offset=0&limit=29
[agent] [PAGE 1/9] offset=0 limit=items=29
[agent] [PAGE 1/9] context reduction -> transformation function (summarization/agregation/selection/...)
...
[reducer] [REDUCER] preview={"meta": {"strategy": "aggregation", "notes": "Aggregated key information about the customer ...
[reducer] TOKEN CHANGE: 38402 -> 971 (Δ-37431 / -97.5%)
...
[agent] [PAGE 2/9] offset=29 limit=items=29
...
This project is currently experimental.
Known limitations:
- reducer outputs are not fully deterministic
- token estimation is heuristic
- no recursive reduction strategy yet
- no retry orchestration layer
- structured output reliability depends on model behavior
- context overflow handling is still evolving
- no explicit planner/executor separation yet
- retrieval dependencies are tracked only implicitly through conversation context
- the agent currently operates mostly as a reactive execution loop (
LLM → tool → result → next iteration) - no explicit working memory / entity state management yet
- retrieved facts and model inferences are not formally separated
- reducer-based compression may lose important identifiers, relationships, or retrieval context
- long-running workflows may gradually lose execution intent or retrieval completeness
- no provenance or retrieval verification layer yet
- no deterministic computation / Python execution layer for advanced analytical workflows
- complex multi-question prompts are not yet reliably decomposed into independent retrieval/reduction workflows1
- the comments are primarily Czech because the application was built as a study project
Planned improvements:
- orchestrator/reducer prompt strategy improvements
- advanced MCP session lifecycle handling
- distributed processing
- recursive reduction pipelines
- adaptive reduction strategies
- structured output enforcement
- streaming chunk processing
The paging/chunking architecture also enables future parallel and distributed processing.
Because segments are processed independently, reducer tasks may be delegated to multiple models or hardware nodes concurrently.
This may significantly reduce end-to-end processing latency and improve horizontal scalability.
- Python 3.10+
- OpenAI-compatible Responses API
- FastAPI
- Uvicorn
uvicorn mock_api:app --port 9001 --reloadpython agent.pyExample .env:
# OpenAPI adapter
BASE_API_URL="http://127.0.0.1:9001"
BASE_API_TOKEN=dummy
# MCP adapter
MCP_URL=https://example.com/_mcp
MCP_TOKEN=dummy
LLM_API_BASE_URL=http://example.com:8000/v1
LLM_API_KEY=dummy
LLM_NAME=gpt-4.1-mini
LLM_MAX_CONTEXT=131072
LLM_CONTEXT_UTILIZATION=0.25
LLM_TEMPERATURE=0.0
LLM_TOP_P=1.0
LLM_TIMEOUT=60
LOG_DIR="logs"This project explores whether LLM agents can reliably operate over datasets that significantly exceed their native context window by combining:
- iterative retrieval
- semantic compression
- orchestration loops
- adaptive chunking
- external tool integration
The project focuses primarily on:
- orchestration reliability
- context preservation
- scalable tool usage
- semantic reduction strategies
rather than traditional chatbot interaction.
Current status:
- ✅ architecture prototype implemented
- ✅ OpenAPI adapter abstraction implemented
- ✅ MCP adapter prototype functional
- ✅ adaptive pagination functional
- ✅ reducer pipeline functional
⚠️ semantic chunk reduction experimental
Experimental / educational project.
Footnotes
-
Currently, if a single query contains multiple independent questions (e.g., about two different users), the orchestrator processes them sequentially within the same prompt. Due to chunking, retrieval, and reduction, the model typically produces a correct answer only for the first question, while subsequent questions may be ignored, hallucinated, or incomplete. A dedicated planning/decomposition stage could split such queries into independent subqueries, assign them to the retrieval/reduction pipeline, and aggregate results more reliably. ↩