kibda/Autonomous-Data-Science-Agent


🤖 Autonomous Data Science Agent

A production-grade multi-agent system that performs end-to-end data science workflows autonomously. Built with LangGraph and Llama 3.2 via Ollama.

🎯 What This Does

Give it a dataset + objective, and the system:

  1. Plans the execution strategy
  2. Explores and preprocesses data autonomously
  3. Selects and trains appropriate models
  4. Evaluates performance and compares models
  5. Explains results with feature importance
  6. Critiques its own work and iterates if needed
  7. Generates a comprehensive markdown report

No hardcoded pipelines. True agent behavior.


πŸ—οΈ Architecture

┌─────────────┐
│   Planner   │ ← Decomposes objective into tasks
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Data Agent  │ ← Explores, cleans, preprocesses
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Model Agent │ ← Trains baseline → advanced models
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Evaluator  │ ← Compares models, selects best
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Explainer   │ ← Feature importance, insights
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Critic    │ ← Reviews pipeline, decides iterate/finish
└──────┬──────┘
       │
       ▼
  Iterate? ──No─→ Report Generator
     │
    Yes
     │
     └──────────┐
                │
                ▼
          Model Agent (again)
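
The control flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph code: `run_pipeline` and `make_stage` are hypothetical names, each agent is a stub that only records its name, the critic is modeled as a callable that returns True to request another pass, and capping the loop with `max_iterations` is an assumption based on the setting of the same name in config.yaml.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def make_stage(name: str) -> Stage:
    """Build a stub agent that only records that it ran."""
    def stage(state: State) -> State:
        state["trace"].append(name)
        return state
    return stage


def run_pipeline(
    setup: List[Stage],               # run once: planner, data agent
    loop: List[Stage],                # re-run on every iteration
    critic: Callable[[State], bool],  # True means "iterate again"
    report: Stage,
    max_iterations: int = 3,          # assumed cap on the critic loop
) -> State:
    state: State = {"iteration": 0, "trace": []}
    for stage in setup:
        state = stage(state)
    while True:
        state["iteration"] += 1
        for stage in loop:
            state = stage(state)
        # Stop when the critic is satisfied or the iteration budget is spent.
        if not critic(state) or state["iteration"] >= max_iterations:
            break
    return report(state)


final = run_pipeline(
    setup=[make_stage("planner"), make_stage("data_agent")],
    loop=[make_stage("model_agent"), make_stage("evaluator"), make_stage("explainer")],
    critic=lambda s: s["iteration"] < 2,  # stub critic: ask for one extra pass
    report=make_stage("report_generator"),
)
print(final["trace"])
```

With the stub critic, the loop returns to the model agent once, as the "Yes" branch of the diagram does, and then hands over to the report generator.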

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Ollama installed locally
  • Llama 3.2 model downloaded

Installation

# Clone repository
git clone <your-repo>
cd autonomous_data_science_agent

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

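# Start the Ollama server in a separate terminal (skip if it is already running)
ollama serve

# Download the Llama 3.2 model (needs the server to be running)
ollama pull llama3.2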

Run the Agent

python main.py \
  --dataset data/raw/your_dataset.csv \
  --objective "Predict air quality and explain pollution drivers"
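
The two flags above could be parsed along these lines. main.py itself is not shown in this README, so this is only a sketch: `build_parser` and `parse_args` are hypothetical helper names, and marking both flags as required is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the autonomous data science agent")
    # Both flags appear in the commands above; making them required is an assumption.
    parser.add_argument("--dataset", required=True, help="Path to the input CSV file")
    parser.add_argument("--objective", required=True, help="Plain-language goal for the agents")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv, which is what a real entry point would do.
    return build_parser().parse_args(argv)


args = parse_args([
    "--dataset", "data/raw/your_dataset.csv",
    "--objective", "Predict air quality and explain pollution drivers",
])
print(args.dataset)
print(args.objective)
```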

Example Run

python main.py \
  --dataset ../Housing.csv \
  --objective "Predict house prices and identify key value drivers"

Output:

  • Trained models saved in data/outputs/
  • Processed data in data/processed/
  • Final report in reports/generated/report_TIMESTAMP.md

πŸ“ Project Structure

autonomous_data_science_agent/
│
├── main.py                      # Entry point
├── config.yaml                  # Configuration
├── requirements.txt
│
├── agents/                      # Multi-agent system
│   ├── planner_agent.py         # Task decomposition
│   ├── data_agent.py            # Data exploration & preprocessing
│   ├── modeling_agent.py        # Model training & selection
│   ├── evaluation_agent.py      # Model comparison & evaluation
│   ├── explanation_agent.py     # Interpretability & insights
│   └── critic_agent.py          # Self-critique ⭐
│
├── graph/
│   ├── agent_graph.py           # LangGraph orchestration
│   └── states.py                # Shared state definition
│
├── tools/
│   └── data_tools.py            # Data utilities
│
├── reports/
│   ├── report_generator.py      # Report creation
│   └── generated/               # Output reports
│
└── data/
    ├── processed/               # Cleaned data
    └── outputs/                 # Models, artifacts

βš™οΈ Configuration

Edit config.yaml to customize:

llm:
  provider: "ollama"
  model: "llama3.2"
  temperature: 0.3

max_iterations: 3
performance_threshold: 0.75
improvement_threshold: 0.05

modeling:
  baseline_models:
    - "linear_regression"
    - "random_forest"
    - "gradient_boosting"
  
  advanced_models:
    - "xgboost"
    - "lightgbm"
    - "neural_network"
  
  cv_folds: 5
  hyperparameter_tuning: true

data:
  max_missing_ratio: 0.3
  outlier_std_threshold: 3.0
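
As an illustration of how a setting such as max_missing_ratio might be applied, the sketch below drops columns whose share of missing values exceeds the limit. It is an assumption about the setting's meaning, not the data agent's actual code: `drop_high_missing` is a hypothetical helper, the table is a toy example with made-up values, and treating the limit as inclusive is a guess.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def drop_high_missing(columns: Dict[str, Column], max_missing_ratio: float = 0.3) -> Dict[str, Column]:
    """Keep only columns whose share of missing (None) values is at most the limit."""
    kept: Dict[str, Column] = {}
    for name, values in columns.items():
        # An empty column is treated as fully missing.
        missing_ratio = sum(v is None for v in values) / len(values) if values else 1.0
        if missing_ratio <= max_missing_ratio:
            kept[name] = values
    return kept


# Toy table with made-up values, used only to exercise the function.
table: Dict[str, Column] = {
    "area": [1.0, 2.0, 3.0, 4.0],         # 0% missing
    "bedrooms": [1.0, None, 3.0, 4.0],    # 25% missing
    "notes": [None, None, None, 4.0],     # 75% missing
}
cleaned = drop_high_missing(table, max_missing_ratio=0.3)
print(sorted(cleaned))
```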

🧠 What Makes This "Agentic"

✅ Dynamic Planning

No fixed pipeline. The planner creates a task graph based on the objective using LLM reasoning.

✅ Autonomous Decision-Making

  • Data Agent uses LLM to decide preprocessing strategy (imputation, encoding, scaling)
  • Modeling Agent selects algorithms dynamically based on task type and iteration
  • Critic Agent determines when to iterate or finish based on performance thresholds

✅ Self-Reflection & Iteration

The Critic Agent reviews results and triggers improvements:

if performance < threshold:
    → Iterate with advanced models
elif critic_has_suggestions and iteration < 2:
    → Try suggested improvements
else:
    → Finish and generate report
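
The rule above can be written as a small runnable function. This is a sketch under assumptions rather than the code in critic_agent.py: `CriticConfig` and `decide` are hypothetical names, the defaults are copied from config.yaml, and the extra check that stops at `max_iterations` is an assumption about how that setting interacts with the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CriticConfig:
    performance_threshold: float = 0.75  # value from config.yaml
    max_iterations: int = 3              # value from config.yaml


def decide(score: float, iteration: int, suggestions: List[str],
           cfg: CriticConfig = CriticConfig()) -> str:
    """Return "iterate" or "finish" for the current pass."""
    if iteration >= cfg.max_iterations:
        return "finish"   # assumption: the iteration budget always wins
    if score < cfg.performance_threshold:
        return "iterate"  # below the bar: try advanced models
    if suggestions and iteration < 2:
        return "iterate"  # good enough, but the critic has ideas to try
    return "finish"


print(decide(0.666, iteration=1, suggestions=[]))
print(decide(0.80, iteration=2, suggestions=["add interaction features"]))
```

The first call uses the R² of 0.666 from the example run in this README, which falls below the 0.75 threshold and therefore requests another iteration.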

✅ Multi-Agent Collaboration

Each agent has a specific role and communicates via shared state (LangGraph TypedDict):

  • State flows through the graph
  • Agents can access previous agent outputs
  • Conditional branching based on critique
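
A shared state of this kind might look like the sketch below. Only the names `AgentState` and `processed_data_path` appear in this README; every other field, the file name `cleaned.csv`, and the `apply_update` helper are hypothetical. The merge step imitates, in plain Python, the way a graph node can return only the keys it changed.

```python
from typing import Dict, TypedDict


class AgentState(TypedDict, total=False):
    objective: str
    dataset_path: str
    processed_data_path: str                    # field named in Troubleshooting
    model_results: Dict[str, Dict[str, float]]  # hypothetical field
    best_model: str                             # hypothetical field
    iteration: int                              # hypothetical field


def data_agent(state: AgentState) -> AgentState:
    """Stub agent: reads earlier outputs and returns only the keys it updates."""
    assert "dataset_path" in state  # earlier outputs are visible to later agents
    return {"processed_data_path": "data/processed/cleaned.csv"}


def apply_update(state: AgentState, update: AgentState) -> AgentState:
    """Merge a partial update into the shared state without mutating the input."""
    merged: AgentState = {**state, **update}
    return merged


state: AgentState = {
    "objective": "Predict house prices and identify key value drivers",
    "dataset_path": "../Housing.csv",
    "iteration": 0,
}
state = apply_update(state, data_agent(state))
print(state["processed_data_path"])
```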

✅ Model Comparison & Selection

Evaluator Agent automatically:

  • Compares all trained models
  • Ranks by appropriate metric (R² for regression, accuracy for classification)
  • Selects best performer for final report
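
The ranking step can be sketched as a sort that knows whether a larger or a smaller metric value is better. `rank_models` and `HIGHER_IS_BETTER` are hypothetical names, not taken from evaluation_agent.py; the scores are the ones printed in the Example Output section of this README.

```python
from typing import Dict, List, Tuple

# Scores such as R² and accuracy improve upward; error metrics improve downward.
HIGHER_IS_BETTER = {"r2": True, "accuracy": True, "rmse": False, "mae": False}


def rank_models(results: Dict[str, Dict[str, float]], metric: str) -> List[Tuple[str, float]]:
    """Return (model, score) pairs ordered from best to worst for the given metric."""
    scored = [(name, metrics[metric]) for name, metrics in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=HIGHER_IS_BETTER[metric])


results = {
    "linear_regression": {"rmse": 1324506.96, "r2": 0.6529},
    "random_forest": {"rmse": 1400565.97, "r2": 0.6119},
    "gradient_boosting": {"rmse": 1299385.98, "r2": 0.6660},
}
ranking = rank_models(results, "r2")
best_model = ranking[0][0]
print(best_model)
```

Ranking the same results by `"rmse"` gives the same order, since the sort direction flips for error metrics.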

📊 Example Output

Console Output:

INFO:agents.modeling_agent:🤖 Modeling Agent: Training models
INFO:agents.modeling_agent:Training linear_regression...
INFO:agents.modeling_agent:  ✓ linear_regression - ('rmse', 1324506.96)
INFO:agents.modeling_agent:Training random_forest...
INFO:agents.modeling_agent:  ✓ random_forest - ('rmse', 1400565.97)
INFO:agents.modeling_agent:Training gradient_boosting...
INFO:agents.modeling_agent:  ✓ gradient_boosting - ('rmse', 1299385.98)

INFO:agents.evaluation_agent:Model Rankings:
INFO:agents.evaluation_agent:  1. gradient_boosting: R²=0.6660
INFO:agents.evaluation_agent:  2. linear_regression: R²=0.6529
INFO:agents.evaluation_agent:  3. random_forest: R²=0.6119

INFO:agents.critic_agent:Decision: ITERATE - r2 (0.666) below threshold (0.75)

Generated Report (reports/generated/report_TIMESTAMP.md):

# Autonomous Data Science Report

## 🎯 Objective
Predict house prices and identify key value drivers

## 📊 Dataset Summary
- Source: `../Housing.csv`
- Rows: 545
- Columns: 13
- Target Variable: price
- Task Type: Regression

## 🔧 Preprocessing Pipeline
1. Drop High Missing Cols
2. Impute Numeric Median
3. Encode Categorical Onehot

## πŸ† Best Model
**Selected Model:** Gradient Boosting

### Performance Metrics
- RMSE: 1299385.98
- MAE: 959748.96
- RΒ²: 0.6660

## 🧠 Feature Importance
Top 10 Most Important Features:

1. **area**: 0.4521
2. **bedrooms**: 0.1823
3. **bathrooms**: 0.1456
4. **stories**: 0.0892
5. **mainroad_yes**: 0.0543

## 💡 Key Insights
1. The gradient_boosting model achieved 0.666 R² score
2. Area is the strongest predictor of house prices
3. Model performance suggests room for improvement
4. Additional feature engineering may improve results
5. Results should be validated on new data

## 🎬 Conclusion
The autonomous agent completed 3 iteration(s) and selected **gradient_boosting** as the best performing model.

🔧 Advanced Usage

Custom Preprocessing Steps

Add new preprocessing options in data_agent.py:

elif step == "remove_outliers":
    # Your custom outlier removal logic
    pass
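
One possible body for such a step is a z-score filter driven by outlier_std_threshold from config.yaml. The sketch below uses only the standard library and works on a list of row dictionaries; the real data agent presumably operates on a pandas DataFrame, so `remove_outliers`, its signature, and the toy rows are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List

Row = Dict[str, float]


def remove_outliers(rows: List[Row], column: str, std_threshold: float = 3.0) -> List[Row]:
    """Drop rows whose value lies more than std_threshold standard deviations from the mean."""
    values = [row[column] for row in rows]
    if len(values) < 2:
        return list(rows)  # not enough data to estimate a spread
    center, spread = mean(values), stdev(values)
    if spread == 0:
        return list(rows)  # all values identical: nothing to remove
    return [row for row in rows if abs(row[column] - center) / spread <= std_threshold]


# Toy data: twenty identical rows plus one extreme value.
rows = [{"area": 10.0} for _ in range(20)] + [{"area": 1000.0}]
filtered = remove_outliers(rows, "area", std_threshold=3.0)
print(len(rows), len(filtered))
```

A z-score filter needs a reasonable number of rows to be meaningful: in very small samples no single point can reach three standard deviations, so nothing would be removed.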

Custom Models

Add models to modeling_agent.py:

elif s_lower == "xgboost":
    # xgboost is not in the Requirements list below: pip install xgboost
    from xgboost import XGBRegressor
    model = XGBRegressor(n_estimators=200, random_state=42)

Adjusting Iteration Behavior

Modify thresholds in config.yaml:

max_iterations: 5  # Allow more iterations
performance_threshold: 0.80  # Higher bar for satisfaction

🎓 Academic Context

Perfect for:

  • Master's thesis in AI/ML Engineering
  • PFE (Projet de Fin d'Études, a final-year graduation project) requiring production systems
  • Research on autonomous agent systems
  • Portfolio projects for Data Science/ML Engineer roles

Key Differentiators:

  • Multi-agent architecture (not single LLM chain)
  • Self-critique loop with iterative improvement
  • Production-ready code structure with proper state management
  • Comprehensive logging and reporting
  • Uses local LLM (Ollama) - no API costs

Technical Highlights:

  • LangGraph for agent orchestration
  • TypedDict for type-safe state management
  • scikit-learn for ML pipeline
  • Autonomous decision-making via LLM reasoning

πŸ› οΈ Tech Stack

  • Orchestration: LangGraph
  • LLM: Ollama + Llama 3.2
  • ML: scikit-learn
  • Data Processing: pandas, numpy
  • Logging: Python logging module

πŸ› Troubleshooting

Issue: KeyError: 'processed_data_path'

  • Solution: Ensure states.py includes all required fields in AgentState TypedDict

Issue: Unicode encoding error in report

  • Solution: Fixed - reports now use UTF-8 encoding
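
This kind of error typically appears when a report containing emoji or symbols such as R² is written with a platform default encoding that cannot represent them, which is common on Windows. A minimal sketch of writing the report with an explicit encoding follows; `write_report` is a hypothetical helper, and the timestamp format is an assumption, since the README only shows `report_TIMESTAMP.md`.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_report(markdown: str, out_dir: Path) -> Path:
    """Write the report as UTF-8 so emoji and symbols survive on every platform."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    path = out_dir / f"report_{stamp}.md"
    # Without encoding="utf-8", the platform default is used and may raise UnicodeEncodeError.
    path.write_text(markdown, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    report_path = write_report("## 🏆 Best Model\n- R²: 0.6660\n", Path(tmp) / "reports" / "generated")
    round_trip = report_path.read_text(encoding="utf-8")

print(report_path.name)
```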

Issue: Ollama connection refused

  • Solution: Run ollama serve in a separate terminal

Issue: LangChain deprecation warnings

  • These are warnings only and do not affect functionality
  • To remove them, migrate the Ollama integration to the langchain-ollama package

πŸ“ Requirements

pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
langchain>=0.1.0
langchain-community>=0.0.20
langgraph>=0.0.26
pyyaml>=6.0
joblib>=1.3.0

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Additional agents (AutoML, Feature Engineering Agent)
  • Support for more model types (deep learning, time series)
  • Enhanced explainability (SHAP, LIME)
  • Web interface for interaction
  • MLflow integration for experiment tracking

📄 License

MIT License


Built with autonomy in mind. No human intervention required. 🚀
