Skip to content

cizekmilan/llm-large-data-agent

Repository files navigation

LLM Agent for Large-Scale Data Processing on Models with Limited Context

Experimental orchestration framework exploring how LLM agents can process datasets significantly larger than the model context window using iterative retrieval and semantic reduction.

🧠 Motivation

Modern LLMs are powerful reasoning systems, but they are still limited by:

  • finite context window
  • expensive token usage
  • limited reliability when processing large datasets
  • inability to safely paginate external APIs autonomously

In real-world environments, external systems often return:

  • thousands of records
  • large JSON structures
  • verbose logs
  • ticket histories
  • monitoring data
  • enterprise API payloads

These datasets frequently exceed the context capacity of the model.

The goal of this project is to explore an architecture that allows an LLM agent to:

  • work with large-scale external data
  • iteratively load and process data
  • semantically reduce tool outputs
  • preserve only relevant information
  • continue reasoning within limited context budgets

Main Idea

The project separates the system into multiple specialized layers:

Component Responsibility Implementation
LLM1 (Main Agent / Orchestrator) reasoning, planning, tool selection agent.py
OpenAPI Adapter OpenAPI schema parsing, REST API execution adapters/openapi_adapter.py
MCP Adapter MCP tool discovery, JSON-RPC transport execution adapters/mcp_adapter.py
LLM2 (Reducer) semantic reduction and aggregation reducer.py

Although the architecture uses the terms LLM1 and LLM2, this distinction is primarily logical rather than physical.

In the default configuration, both orchestration and reduction are typically performed by the same underlying model. The separation mainly represents different processing roles and prompting strategies within the pipeline.

Instead of allowing the main model to directly process extremely large datasets, the architecture:

  1. loads data iteratively
  2. processes them in chunks
  3. semantically compresses results
  4. injects only reduced outputs back into the agent context

Architecture Overview

Architecture Diagram

Core concepts:

  • separation of orchestration and reduction
  • adaptive processing strategies
  • chunk-based orchestration
  • semantic reduction pipeline
  • adapter-based OpenAPI and MCP integration
  • short-term vs long-term memory separation

📋 Key Features

Tool Provider Adapters

The framework supports multiple tool provider backends through adapter abstractions.

Currently implemented:

  • OpenAPI adapter
  • MCP adapter

Both adapters normalize external tool definitions into a unified internal orchestration format used by the agent layer.

Adapter responsibilities include:

  • tool discovery and parsing
  • LLM tool schema generation
  • tool execution
  • transport abstraction
  • internal executor metadata generation

The framework intentionally separates:

  • LLM-facing tool schemas
  • internal executor orchestration metadata

Pagination-related parameters such as offset and limit are intentionally removed from tool schemas exposed to the LLM.

This prevents the model from attempting autonomous pagination during reasoning, because pagination orchestration is handled exclusively by the executor layer.

Relying only on SYSTEM_PROMPT instructions for pagination control was found to be insufficiently reliable in experimental runs.

⚙️ Internal Pagination Handling

Pagination orchestration is intentionally handled outside the LLM reasoning layer.

The executor controls:

  • metadata retrieval
  • chunk sizing
  • iterative paging
  • retrieval orchestration

This improves reliability compared to prompt-driven pagination.

🧩 Semantic Data Reduction

Large tool outputs may be semantically reduced before reinjection into the orchestration context.

Reduction strategies include:

  • filtering
  • summarization
  • aggregation
  • compression of verbose structures

Reduction may be skipped for smaller payloads where orchestration overhead would exceed reduction benefits.

This enables multi-step reasoning over datasets that would otherwise exceed model limits.

📦 Context Management

The project distinguishes between:

Short-Term Memory

Working memory used during orchestration:

  • user messages
  • tool calls
  • tool outputs
  • intermediate reasoning

Long-Term Memory

Persistent conversation history:

  • user queries
  • final assistant answers

This prevents uncontrolled context growth.

🗂️ Project Structure

/
├── adapters/
│   ├── openapi_adapter.py           # OpenAPI tool adapter abstraction
│   └── mcp_adapter.py               # MCP JSON-RPC adapter abstraction
│
├── agent.py                         # Main orchestration agent
├── misc.py                          # Shared helper functions
├── mock_api.py                      # Mock OpenAPI server
├── reducer.py                       # Semantic reducer (LLM2)
│
├── mockdata/                        # Large mock datasets
│   ├── customer4_anonymized.json    # temporarily removed from repository
│   └── ...
│
├── logs/
│   └── debug_*.log                  # Runtime logs
│
├── docs/
│   ├── architecture_v3.png          # Architecture diagram in PNG format
│   └── architecture_v3.svg          # Architecture diagram in SVG format
│
├── .env                             # Runtime configuration
├── requirements.txt
└── README.md

The original large mock dataset was temporarily removed from the repository because, despite anonymization efforts, it could still potentially contain sensitive information.

🔄 Workflow

1. User Query

The user sends a query to the orchestrator.

Examples:

user_query> Zjisti vše o uživateli Baláž.
user_query> Zjisti vše o uživateli Králová.

2. Tool Selection

LLM1 decides whether external data are required.

The model receives:

  • dynamically generated tools
  • tool descriptions
  • parameter schemas

If the model responds without selecting a tool, the response is treated as the final assistant answer and the orchestration loop terminates.

3. Metadata Retrieval

If the endpoint supports pagination (typically via offset and limit parameters), it is also expected to support lightweight metadata-only requests.

Example:

GET /tickets?meta_only=true

The metadata response is used for orchestration planning before full retrieval begins.

The metadata typically includes:

  • estimated token size
  • total item count
  • path to paginated payload data

Expected metadata structure:

{
  "customer_id": customer_id,
  "data_path": "tickets",  # path to paginated data within the response, supports dot notation, e.g. "items.tickets"
  "tokens_estimation": tokens_estimation,
  "total_items": total
}

4. Adaptive Processing Strategy

Based on estimated payload size, pagination support, and effective context utilization, the executor dynamically selects one of several processing strategies:

◇ Processing Strategy
   ├─ Direct Pass
   │    Small payloads are injected directly into context
   │
   ├─ Single Reduction
   │    Moderate payloads are reduced in a single reducer pass
   │
   └─ Paginated Reduction
        Large payloads are chunked, reduced independently,
        and merged into a final aggregated result

Typical strategy conditions:

  • Direct Pass

    • small payloads
    • reduction overhead would exceed benefits
    • endpoints without pagination support
  • Single Reduction

    • moderate payload sizes
    • reduction is beneficial
    • pagination is unnecessary
  • Paginated Reduction

    • large payloads exceeding effective context budgets
    • endpoints supporting pagination

For paginated reduction, the executor calculates:

  • average tokens per item
  • optimal chunk/page size
  • estimated number of pages
  • effective context utilization

The executor then retrieves data iteratively in pages:

GET /tickets?offset=0&limit=29
GET /tickets?offset=29&limit=29
GET /tickets?offset=58&limit=29
...

5. Semantic Reduction

Each chunk is processed by the reducer model.

Reducer responsibilities:

  • semantic filtering
  • summarization
  • aggregation
  • removal of irrelevant payload data
  • reduction of context saturation
  • mitigation of attention dilution effects

The reduction pipeline helps preserve relevant information density while minimizing unnecessary token usage.

This is important because modern LLMs operate with a finite context window, and using the entire available context is not always optimal.

Large prompts may suffer from:

  • attention dilution
  • lost-in-the-middle effects
  • degraded reasoning quality
  • higher latency
  • increased token costs

For this reason, the framework distinguishes between:

Parameter Description
LLM_MAX_CONTEXT Maximum context size supported by the model
LLM_CONTEXT_UTILIZATION Intentionally allowed fraction of usable context

Example:

LLM_MAX_CONTEXT=128000
LLM_CONTEXT_UTILIZATION=0.25

In this configuration, the orchestration layer targets approximately 32k effective context usage, despite the model supporting 128k tokens.

6. Final Aggregation

Reduced chunk outputs are merged and injected back into the orchestrator context.

The main agent then continues reasoning using compressed information.

📉 Example Reduction Statistics

The following values are approximate examples from experimental runs.

Dataset Original Tokens Reduced Tokens Reduction
Large ticket dataset 355,000 18,000 94.9%
Customer communication history 120,000 9,500 92.1%
Monitoring logs 210,000 14,000 93.3%

The actual reduction ratio depends on:

  • dataset structure
  • user query specificity
  • reducer prompt quality
  • aggregation strategy

🎨 Colored Runtime Console Logging

The framework provides structured colorized console logging designed for runtime tracing, orchestration debugging, and reducer inspection.

The logging system visually distinguishes individual orchestration stages and data flows, making complex LLM interactions significantly easier to analyze in real time.

Console Color Categories

Color Meaning
Blue text REQUEST sent to the main LLM
Green text RESPONSE received from the main LLM
Yellow text Tool execution results
Blue/green text on blue background Reducer LLM requests and responses
Purple text Response structure and item types
Red text on white background Token statistics, reductions, context usage

Example Console Visualization

Runtime Console

The image above demonstrates the approximate runtime appearance of the logging system.

🔍 Logging

The project includes runtime logging for:

  • selected tools
  • API calls
  • pagination
  • chunk processing
  • token reduction statistics
  • reducer activity
  • error handling

Example log output:

[agent] LLM TOOL SELECTED: get_tickets, [ARGS: {'customer_id': 4}]
[agent] API end-point supports metadata & pagination
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&meta_only=True
[agent] [META] total_items=261 total_tokens_est=355116
[agent] [STRATEGY] PAGING ACTIVATED
[agent] [CHUNKING] one chunk budget=32768 tokens
[agent] [CHUNKING] avg_tokens_per_item=1117.10
[agent] [CHUNKING] items count in one chunk: limit=29
[agent] [CHUNKING] total pages: 9
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&offset=0&limit=29
[agent] [PAGE 1/9] offset=0 limit=items=29
[agent] [PAGE 1/9] context reduction -> transformation function (summarization/agregation/selection/...)
...
[reducer] [REDUCER] preview={"meta": {"strategy": "aggregation", "notes": "Aggregated key information about the customer ...
[reducer] TOKEN CHANGE: 38402 -> 971 (Δ-37431 / -97.5%)
...
[agent] [PAGE 2/9] offset=29 limit=items=29
...

⚠️ Current Limitations

This project is currently experimental.

Known limitations:

  • reducer outputs are not fully deterministic
  • token estimation is heuristic
  • no recursive reduction strategy yet
  • no retry orchestration layer
  • structured output reliability depends on model behavior
  • context overflow handling is still evolving
  • no explicit planner/executor separation yet
  • retrieval dependencies are tracked only implicitly through conversation context
  • the agent currently operates mostly as a reactive execution loop (LLM → tool → result → next iteration)
  • no explicit working memory / entity state management yet
  • retrieved facts and model inferences are not formally separated
  • reducer-based compression may lose important identifiers, relationships, or retrieval context
  • long-running workflows may gradually lose execution intent or retrieval completeness
  • no provenance or retrieval verification layer yet
  • no deterministic computation / Python execution layer for advanced analytical workflows
  • complex multi-question prompts are not yet reliably decomposed into independent retrieval/reduction workflows1
  • the comments are primarily Czech because the application was built as a study project

🚀 Future Work

Planned improvements:

  • orchestrator/reducer prompt strategy improvements
  • advanced MCP session lifecycle handling
  • distributed processing
  • recursive reduction pipelines
  • adaptive reduction strategies
  • structured output enforcement
  • streaming chunk processing

Distributed Processing Potential

The paging/chunking architecture also enables future parallel and distributed processing.

Because segments are processed independently, reducer tasks may be delegated to multiple models or hardware nodes concurrently.

This may significantly reduce end-to-end processing latency and improve horizontal scalability.

🔧 Requirements

  • Python 3.10+
  • OpenAI-compatible Responses API
  • FastAPI
  • Uvicorn

Running the Mock API

uvicorn mock_api:app --port 9001 --reload

Running the Agent

python agent.py

Environment Variables

Example .env:

# OpenAPI adapter
BASE_API_URL="http://127.0.0.1:9001"
BASE_API_TOKEN=dummy
# MCP adapter
MCP_URL=https://example.com/_mcp
MCP_TOKEN=dummy

LLM_API_BASE_URL=http://example.com:8000/v1
LLM_API_KEY=dummy
LLM_NAME=gpt-4.1-mini
LLM_MAX_CONTEXT=131072
LLM_CONTEXT_UTILIZATION=0.25
LLM_TEMPERATURE=0.0
LLM_TOP_P=1.0
LLM_TIMEOUT=60

LOG_DIR="logs"

🧪 Research Goal

This project explores whether LLM agents can reliably operate over datasets that significantly exceed their native context window by combining:

  • iterative retrieval
  • semantic compression
  • orchestration loops
  • adaptive chunking
  • external tool integration

The project focuses primarily on:

  • orchestration reliability
  • context preservation
  • scalable tool usage
  • semantic reduction strategies

rather than traditional chatbot interaction.

Status

Current status:

  • ✅ architecture prototype implemented
  • ✅ OpenAPI adapter abstraction implemented
  • ✅ MCP adapter prototype functional
  • ✅ adaptive pagination functional
  • ✅ reducer pipeline functional
  • ⚠️ semantic chunk reduction experimental

License

Experimental / educational project.

Footnotes

  1. Currently, if a single query contains multiple independent questions (e.g., about two different users), the orchestrator processes them sequentially within the same prompt. Due to chunking, retrieval, and reduction, the model typically produces a correct answer only for the first question, while subsequent questions may be ignored, hallucinated, or incomplete. A dedicated planning/decomposition stage could split such queries into independent subqueries, assign them to the retrieval/reduction pipeline, and aggregate results more reliably.