A self-correcting, LLM-powered Exploratory Data Analysis agent that writes, executes, and debugs its own Python code — so you don't have to.
Agentic EDA Engine is an AI-driven data analysis assistant that lets you explore datasets in plain English. Instead of writing pandas or matplotlib code yourself, you upload a file and ask a question; the agent then generates, runs, and fixes the code on its own until it executes successfully or reaches its retry limit.
Under the hood, the system is built as a stateful agentic workflow using LangGraph, with a local LLM (Qwen2.5-Coder:7b via Ollama) for code generation. A Streamlit front-end provides a clean, chat-based interface for interacting with your data.
The agent follows a Generate → Execute → Reflect loop, with a Clarify step for ambiguous queries:
- Generate: The LLM receives your natural language query and the dataset's schema (columns, data types, sample rows) and generates Python analysis code.
- Execute: The generated code is run in a sandboxed environment. If it succeeds, the output (text results or charts) is returned to you.
- Reflect: If the code throws an error, the agent analyzes the error message and rewrites the code to fix it. This loop continues for up to 3 iterations before gracefully stopping to prevent infinite loops.
- Clarify: If your query is ambiguous, the agent pauses and asks you for clarification instead of guessing.
This self-correcting loop means the agent can recover from common mistakes — wrong column names, incorrect data types, missing imports — without any intervention from you.
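The loop described above can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph implementation: `run_agent`, `execute`, and the `generate` callback are hypothetical names, the LLM is replaced by any callable that returns code, and the cap of 3 is read here as three total attempts. In-process `exec` is used only for brevity and provides no isolation.

```python
import contextlib
import io
import traceback
from typing import Callable, Optional

MAX_ITERATIONS = 3  # assumption: the cap of 3 counts total attempts


def execute(code: str) -> tuple[bool, str]:
    """Run code in a fresh namespace; return (ok, stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__generated__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def run_agent(
    query: str,
    schema: str,
    generate: Callable[[str, str, Optional[str], Optional[str]], str],
) -> dict:
    """Generate -> Execute -> Reflect until success or the cap is reached."""
    code, error = None, None
    for iteration in range(1, MAX_ITERATIONS + 1):
        # First pass: generate from the query and schema. Later passes:
        # the previous code and its error are passed back for a rewrite.
        code = generate(query, schema, code, error)
        ok, result = execute(code)
        if ok:
            return {"ok": True, "output": result, "code": code, "iterations": iteration}
        error = result
    return {"ok": False, "output": error, "code": code, "iterations": MAX_ITERATIONS}


# Hypothetical stand-in for the LLM: the first draft has a typo, the rewrite fixes it.
def fake_llm(query, schema, previous_code, error):
    return "print(totl)" if error is None else "total = 2 + 3\nprint(total)"


result = run_agent("What is 2 + 3?", schema="(none)", generate=fake_llm)
print(result["iterations"], result["output"].strip())
```

Passing the previous code and the full traceback back to the generator is what lets the second attempt target the specific failure rather than start from scratch.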
- 🗣️ Natural language querying — ask questions like "What is the average revenue by region?" or "Plot monthly sales trends"
- 🔁 Self-correcting execution — automatically rewrites and retries code on failure (up to 3 times)
- 📊 Chart generation — produces and displays matplotlib/seaborn plots directly in the UI
- 🗂️ Schema-aware generation — uses extracted column names, data types, and sample rows to write accurate, context-aware code
- 📁 CSV & Excel support: works with both `.csv` and `.xlsx` file formats
- 🔒 Fully local: runs entirely on your machine via Ollama; no data is sent to external APIs
- 🧾 Transparent outputs — view the final executed code and the number of correction iterations in an expandable panel
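Schema-aware generation amounts to summarising the dataset and placing that summary in the prompt. The sketch below shows the idea using only the standard library. The project itself uses pandas, and `infer_type`, `describe_schema`, `build_prompt`, and the prompt wording are hypothetical stand-ins, not the contents of `prompts.py`.

```python
import csv
import io


def infer_type(values) -> str:
    """Guess a column type from its non-empty string values."""
    present = [v for v in values if v is not None and v.strip()]
    if not present:
        return "empty"
    for label, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return label
        except ValueError:
            continue
    return "str"


def describe_schema(csv_text: str, sample_rows: int = 3) -> str:
    """Summarise column names, inferred types, and the first few rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    lines = ["Columns:"]
    for col in columns:
        lines.append(f"- {col}: {infer_type([r.get(col) for r in rows])}")
    lines.append("Sample rows:")
    for row in rows[:sample_rows]:
        lines.append(f"- {row}")
    return "\n".join(lines)


# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Write Python code that answers the question about this dataset.\n"
    "{schema}\n"
    "Question: {query}\n"
)


def build_prompt(query: str, csv_text: str) -> str:
    """Combine the user's question with the schema summary."""
    return PROMPT_TEMPLATE.format(schema=describe_schema(csv_text), query=query)


sample = "region,revenue,units\nNorth,10.5,3\nSouth,7.25,4\n"
print(build_prompt("What is the average revenue by region?", sample))
```

Giving the model the real column names and types up front is what reduces the wrong-column and wrong-dtype errors that the correction loop would otherwise have to repair.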
Agentic-EDA-Engine/
│
├── main.py # Core agentic workflow (LangGraph state graph)
├── streamlit_app.py # Streamlit UI — file upload, chat interface, result display
├── prompts.py # Prompt templates for code generation and error reflection
├── tools.py # Safe Python code execution utility
├── requirements.txt # Python dependencies
├── Sample Datasets/ # Example datasets to try out
├── output/ # Temporary folder for generated chart images
└── sample_generated.ipynb # Example notebook showing generated outputs
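The tree above describes `tools.py` as a safe code execution utility. One common way to get basic isolation, assumed here and not necessarily what `tools.py` does, is to run the generated code in a separate Python process with a timeout. The function name `run_isolated` and its return shape are hypothetical, and this approach limits runtime and keeps crashes out of the main process but is not a full security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(code: str, timeout_s: float = 30.0) -> dict:
    """Run code in a separate Python process and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # relative file writes land in the temp dir
                capture_output=True,  # collect stdout and stderr
                text=True,
                timeout=timeout_s,    # the child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


outcome = run_isolated("print(6 * 7)")
print(outcome["ok"], outcome["stdout"].strip())
```

On failure, `stderr` holds the traceback, which is the text a reflection step would feed back to the model.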
| Component | Technology |
|---|---|
| Agentic Workflow | LangGraph |
| LLM | Qwen2.5-Coder:7b via Ollama (local) |
| LLM Interface | LangChain Ollama |
| UI | Streamlit |
| Data Handling | Pandas |
| Visualization | Matplotlib / Seaborn |
- Python 3.9+
- Ollama installed and running locally
- Qwen2.5-Coder model pulled:
ollama pull qwen2.5-coder:7b
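To confirm the prerequisites before launching the app, you can ask the local Ollama server which models it has. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses its `/api/tags` endpoint, which lists locally available models; `model_is_listed` and `check_ollama` are hypothetical helper names, not part of this project.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumption: default host and port


def model_is_listed(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


def check_ollama(model: str = "qwen2.5-coder:7b", url: str = OLLAMA_TAGS_URL) -> bool:
    """Return False if the server is unreachable or the model is not pulled."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers invalid JSON.
        return False
    return model_is_listed(payload, model)
```

Calling `check_ollama()` returns `True` only when the server responds and `qwen2.5-coder:7b` is in its model list.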
git clone https://github.com/prakhar-189/Agentic-EDA-Engine.git
cd Agentic-EDA-Engine
pip install -r requirements.txt
streamlit run streamlit_app.py
Then open http://localhost:8501 in your browser.
- Launch the Streamlit app.
- Upload a `.csv` or `.xlsx` dataset using the file uploader.
- Review the automatically extracted schema in the expandable panel.
- Type your analysis question in the chat input (e.g., "Show me the top 5 products by total sales").
- The agent will generate, execute, and if necessary, self-correct Python code to answer your question.
- View the result, any generated charts, the final code, and the number of correction loops it took.

Example queries:
- "What is the distribution of customer ages?" → Generates a histogram
- "Which city had the highest total revenue last year?" → Returns a ranked summary
- "Plot the correlation between price and quantity sold" → Generates a scatter plot
- "Are there any missing values in the dataset?" → Returns a missing-value report
- The agent uses a local LLM, so performance depends on your hardware (GPU recommended for Qwen2.5-Coder:7b).
- Complex, multi-step analyses may occasionally require rephrasing the query for best results.
- Self-correction is capped at 3 iterations to prevent runaway loops; if the code still fails after that, the agent stops and the query may need rephrasing.
This project is licensed under the MIT License © 2026 Prakhar Srivastava
Built with LangGraph, LangChain, Ollama, and Streamlit.
Prakhar Srivastava
Data Analyst, Data Scientist & AI Engineer | Dashboards, SQL, Machine Learning, Deep Learning, Generative AI, Prompt Engineering & Agentic AI