LLM Agent for Large-Scale Data Processing on Models with Limited Context

Experimental orchestration framework exploring how LLM agents can process datasets significantly larger than the model context window using iterative retrieval and semantic reduction.

🧠 Motivation

Modern LLMs are powerful reasoning systems, but they are still limited by:

finite context window
expensive token usage
limited reliability when processing large datasets
inability to safely paginate external APIs autonomously

In real-world environments, external systems often return:

thousands of records
large JSON structures
verbose logs
ticket histories
monitoring data
enterprise API payloads

These datasets frequently exceed the context capacity of the model.

The goal of this project is to explore an architecture that allows an LLM agent to:

work with large-scale external data
iteratively load and process data
semantically reduce tool outputs
preserve only relevant information
continue reasoning within limited context budgets

Main Idea

The project separates the system into multiple specialized layers:

Component	Responsibility	Implementation
LLM1 (Main Agent / Orchestrator)	reasoning, planning, tool selection	`agent.py`
OpenAPI Adapter	OpenAPI schema parsing, REST API execution	`adapters/openapi_adapter.py`
MCP Adapter	MCP tool discovery, JSON-RPC transport execution	`adapters/mcp_adapter.py`
LLM2 (Reducer)	semantic reduction and aggregation	`reducer.py`

Although the architecture uses the terms LLM1 and LLM2, this distinction is primarily logical rather than physical.

In the default configuration, both orchestration and reduction are typically performed by the same underlying model. The separation mainly represents different processing roles and prompting strategies within the pipeline.

Instead of allowing the main model to directly process extremely large datasets, the architecture:

loads data iteratively
processes them in chunks
semantically compresses results
injects only reduced outputs back into the agent context

Architecture Overview

Core concepts:

separation of orchestration and reduction
adaptive processing strategies
chunk-based orchestration
semantic reduction pipeline
adapter-based OpenAPI and MCP integration
short-term vs long-term memory separation

📋 Key Features

Tool Provider Adapters

The framework supports multiple tool provider backends through adapter abstractions.

Currently implemented:

OpenAPI adapter
MCP adapter

Both adapters normalize external tool definitions into a unified internal orchestration format used by the agent layer.

Adapter responsibilities include:

tool discovery and parsing
LLM tool schema generation
tool execution
transport abstraction
internal executor metadata generation

The framework intentionally separates:

LLM-facing tool schemas
internal executor orchestration metadata

Pagination-related parameters such as offset and limit are intentionally removed from tool schemas exposed to the LLM.

This prevents the model from attempting autonomous pagination during reasoning, because pagination orchestration is handled exclusively by the executor layer.

Relying only on SYSTEM_PROMPT instructions for pagination control was found to be insufficiently reliable in experimental runs.

⚙️ Internal Pagination Handling

Pagination orchestration is intentionally handled outside the LLM reasoning layer.

The executor controls:

metadata retrieval
chunk sizing
iterative paging
retrieval orchestration

This improves reliability compared to prompt-driven pagination.

🧩 Semantic Data Reduction

Large tool outputs may be semantically reduced before reinjection into the orchestration context.

Reduction strategies include:

filtering
summarization
aggregation
compression of verbose structures

Reduction may be skipped for smaller payloads where orchestration overhead would exceed reduction benefits.

This enables multi-step reasoning over datasets that would otherwise exceed model limits.

📦 Context Management

The project distinguishes between:

Short-Term Memory

Working memory used during orchestration:

user messages
tool calls
tool outputs
intermediate reasoning

Long-Term Memory

Persistent conversation history:

user queries
final assistant answers

This prevents uncontrolled context growth.

🗂️ Project Structure

/
├── adapters/
│   ├── openapi_adapter.py           # OpenAPI tool adapter abstraction
│   └── mcp_adapter.py               # MCP JSON-RPC adapter abstraction
│
├── agent.py                         # Main orchestration agent
├── misc.py                          # Shared helper functions
├── mock_api.py                      # Mock OpenAPI server
├── reducer.py                       # Semantic reducer (LLM2)
│
├── mockdata/                        # Large mock datasets
│   ├── customer4_anonymized.json    # temporarily removed from repository
│   └── ...
│
├── logs/
│   └── debug_*.log                  # Runtime logs
│
├── docs/
│   ├── architecture_v3.png          # Architecture diagram in PNG format
│   └── architecture_v3.svg          # Architecture diagram in SVG format
│
├── .env                             # Runtime configuration
├── requirements.txt
└── README.md

The original large mock dataset was temporarily removed from the repository because, despite anonymization efforts, it could still potentially contain sensitive information.

🔄 Workflow

1. User Query

The user sends a query to the orchestrator.

Examples:

user_query> Zjisti vše o uživateli Baláž.

user_query> Zjisti vše o uživateli Králová.

2. Tool Selection

LLM1 decides whether external data are required.

The model receives:

dynamically generated tools
tool descriptions
parameter schemas

If the model responds without selecting a tool, the response is treated as the final assistant answer and the orchestration loop terminates.

3. Metadata Retrieval

If the endpoint supports pagination (typically via offset and limit parameters), it is also expected to support lightweight metadata-only requests.

Example:

GET /tickets?meta_only=true

The metadata response is used for orchestration planning before full retrieval begins.

The metadata typically includes:

estimated token size
total item count
path to paginated payload data

Expected metadata structure:

{
  "customer_id": customer_id,
  "data_path": "tickets",  # path to paginated data within the response, supports dot notation, e.g. "items.tickets"
  "tokens_estimation": tokens_estimation,
  "total_items": total
}

4. Adaptive Processing Strategy

Based on estimated payload size, pagination support, and effective context utilization, the executor dynamically selects one of several processing strategies:

◇ Processing Strategy
   ├─ Direct Pass
   │    Small payloads are injected directly into context
   │
   ├─ Single Reduction
   │    Moderate payloads are reduced in a single reducer pass
   │
   └─ Paginated Reduction
        Large payloads are chunked, reduced independently,
        and merged into a final aggregated result

Typical strategy conditions:

Direct Pass
- small payloads
- reduction overhead would exceed benefits
- endpoints without pagination support
Single Reduction
- moderate payload sizes
- reduction is beneficial
- pagination is unnecessary
Paginated Reduction
- large payloads exceeding effective context budgets
- endpoints supporting pagination

For paginated reduction, the executor calculates:

average tokens per item
optimal chunk/page size
estimated number of pages
effective context utilization

The executor then retrieves data iteratively in pages:

GET /tickets?offset=0&limit=29
GET /tickets?offset=29&limit=29
GET /tickets?offset=58&limit=29
...

5. Semantic Reduction

Each chunk is processed by the reducer model.

Reducer responsibilities:

semantic filtering
summarization
aggregation
removal of irrelevant payload data
reduction of context saturation
mitigation of attention dilution effects

The reduction pipeline helps preserve relevant information density while minimizing unnecessary token usage.

This is important because modern LLMs operate with a finite context window, and using the entire available context is not always optimal.

Large prompts may suffer from:

attention dilution
lost-in-the-middle effects
degraded reasoning quality
higher latency
increased token costs

For this reason, the framework distinguishes between:

Parameter	Description
`LLM_MAX_CONTEXT`	Maximum context size supported by the model
`LLM_CONTEXT_UTILIZATION`	Intentionally allowed fraction of usable context

Example:

LLM_MAX_CONTEXT=128000
LLM_CONTEXT_UTILIZATION=0.25

In this configuration, the orchestration layer targets approximately 32k effective context usage, despite the model supporting 128k tokens.

6. Final Aggregation

Reduced chunk outputs are merged and injected back into the orchestrator context.

The main agent then continues reasoning using compressed information.

📉 Example Reduction Statistics

The following values are approximate examples from experimental runs.

Dataset	Original Tokens	Reduced Tokens	Reduction
Large ticket dataset	355,000	18,000	94.9%
Customer communication history	120,000	9,500	92.1%
Monitoring logs	210,000	14,000	93.3%

The actual reduction ratio depends on:

dataset structure
user query specificity
reducer prompt quality
aggregation strategy

🎨 Colored Runtime Console Logging

The framework provides structured colorized console logging designed for runtime tracing, orchestration debugging, and reducer inspection.

The logging system visually distinguishes individual orchestration stages and data flows, making complex LLM interactions significantly easier to analyze in real time.

Console Color Categories

Color	Meaning
Blue text	REQUEST sent to the main LLM
Green text	RESPONSE received from the main LLM
Yellow text	Tool execution results
Blue/green text on blue background	Reducer LLM requests and responses
Purple text	Response structure and item types
Red text on white background	Token statistics, reductions, context usage

Example Console Visualization

The image above demonstrates the approximate runtime appearance of the logging system.

🔍 Logging

The project includes runtime logging for:

selected tools
API calls
pagination
chunk processing
token reduction statistics
reducer activity
error handling

Example log output:

[agent] LLM TOOL SELECTED: get_tickets, [ARGS: {'customer_id': 4}]
[agent] API end-point supports metadata & pagination
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&meta_only=True
[agent] [META] total_items=261 total_tokens_est=355116
[agent] [STRATEGY] PAGING ACTIVATED
[agent] [CHUNKING] one chunk budget=32768 tokens
[agent] [CHUNKING] avg_tokens_per_item=1117.10
[agent] [CHUNKING] items count in one chunk: limit=29
[agent] [CHUNKING] total pages: 9
[agent] API CALL: GET http://127.0.0.1:9001/tickets?customer_id=4&offset=0&limit=29
[agent] [PAGE 1/9] offset=0 limit=items=29
[agent] [PAGE 1/9] context reduction -> transformation function (summarization/agregation/selection/...)
...
[reducer] [REDUCER] preview={"meta": {"strategy": "aggregation", "notes": "Aggregated key information about the customer ...
[reducer] TOKEN CHANGE: 38402 -> 971 (Δ-37431 / -97.5%)
...
[agent] [PAGE 2/9] offset=29 limit=items=29
...

⚠️ Current Limitations

This project is currently experimental.

Known limitations:

reducer outputs are not fully deterministic
token estimation is heuristic
no recursive reduction strategy yet
no retry orchestration layer
structured output reliability depends on model behavior
context overflow handling is still evolving
no explicit planner/executor separation yet
retrieval dependencies are tracked only implicitly through conversation context
the agent currently operates mostly as a reactive execution loop (LLM → tool → result → next iteration)
no explicit working memory / entity state management yet
retrieved facts and model inferences are not formally separated
reducer-based compression may lose important identifiers, relationships, or retrieval context
long-running workflows may gradually lose execution intent or retrieval completeness
no provenance or retrieval verification layer yet
no deterministic computation / Python execution layer for advanced analytical workflows
complex multi-question prompts are not yet reliably decomposed into independent retrieval/reduction workflows¹
the comments are primarily Czech because the application was built as a study project

🚀 Future Work

Planned improvements:

orchestrator/reducer prompt strategy improvements
advanced MCP session lifecycle handling
distributed processing
recursive reduction pipelines
adaptive reduction strategies
structured output enforcement
streaming chunk processing

Distributed Processing Potential

The paging/chunking architecture also enables future parallel and distributed processing.

Because segments are processed independently, reducer tasks may be delegated to multiple models or hardware nodes concurrently.

This may significantly reduce end-to-end processing latency and improve horizontal scalability.

🔧 Requirements

Python 3.10+
OpenAI-compatible Responses API
FastAPI
Uvicorn

Running the Mock API

uvicorn mock_api:app --port 9001 --reload

Running the Agent

python agent.py

Environment Variables

Example .env:

# OpenAPI adapter
BASE_API_URL="http://127.0.0.1:9001"
BASE_API_TOKEN=dummy
# MCP adapter
MCP_URL=https://example.com/_mcp
MCP_TOKEN=dummy

LLM_API_BASE_URL=http://example.com:8000/v1
LLM_API_KEY=dummy
LLM_NAME=gpt-4.1-mini
LLM_MAX_CONTEXT=131072
LLM_CONTEXT_UTILIZATION=0.25
LLM_TEMPERATURE=0.0
LLM_TOP_P=1.0
LLM_TIMEOUT=60

LOG_DIR="logs"

🧪 Research Goal

This project explores whether LLM agents can reliably operate over datasets that significantly exceed their native context window by combining:

iterative retrieval
semantic compression
orchestration loops
adaptive chunking
external tool integration

The project focuses primarily on:

orchestration reliability
context preservation
scalable tool usage
semantic reduction strategies

rather than traditional chatbot interaction.

Status

Current status:

✅ architecture prototype implemented
✅ OpenAPI adapter abstraction implemented
✅ MCP adapter prototype functional
✅ adaptive pagination functional
✅ reducer pipeline functional
⚠️ semantic chunk reduction experimental

License

Experimental / educational project.

Currently, if a single query contains multiple independent questions (e.g., about two different users), the orchestrator processes them sequentially within the same prompt. Due to chunking, retrieval, and reduction, the model typically produces a correct answer only for the first question, while subsequent questions may be ignored, hallucinated, or incomplete. A dedicated planning/decomposition stage could split such queries into independent subqueries, assign them to the retrieval/reduction pipeline, and aggregate results more reliably. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github		.github
adapters		adapters
docs		docs
mockdata		mockdata
.gitignore		.gitignore
README.md		README.md
agent.py		agent.py
misc.py		misc.py
mock_api.py		mock_api.py
reducer.py		reducer.py
requirements.txt		requirements.txt
run_agent.cmd		run_agent.cmd
run_agent.sh		run_agent.sh
run_mock_api.cmd		run_mock_api.cmd
run_mock_api.sh		run_mock_api.sh

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

LLM Agent for Large-Scale Data Processing on Models with Limited Context

🧠 Motivation

Main Idea

Architecture Overview

📋 Key Features

Tool Provider Adapters

⚙️ Internal Pagination Handling

🧩 Semantic Data Reduction

📦 Context Management

Short-Term Memory

Long-Term Memory

🗂️ Project Structure

🔄 Workflow

1. User Query

2. Tool Selection

3. Metadata Retrieval

4. Adaptive Processing Strategy

5. Semantic Reduction

6. Final Aggregation

📉 Example Reduction Statistics

🎨 Colored Runtime Console Logging

Console Color Categories

Example Console Visualization

🔍 Logging

⚠️ Current Limitations

🚀 Future Work

Distributed Processing Potential

🔧 Requirements

Running the Mock API

Running the Agent

Environment Variables

🧪 Research Goal

Status

License

Footnotes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Sponsor this project

Uh oh!

Contributors

Uh oh!

Languages