📚 Complete Tutorial: From Zero to Production Model

This tutorial walks you through using the ML Agent from installation to deployment.


🎯 Tutorial Goals

By the end of this tutorial, you will:

  • ✅ Install and set up the ML Agent
  • ✅ Generate and prepare sample data
  • ✅ Run the full ML pipeline
  • ✅ Understand the generated outputs
  • ✅ Use the trained model for predictions
  • ✅ Customize the agent for your needs

Time Required: about 40 minutes (Parts 1 to 5)


Part 1: Setup (5 minutes)

Step 1.1: Install Python Dependencies

# Install required packages
pip install -r requirements.txt

# Verify installation
python test_agent.py

Expected output:

✅ ALL TESTS PASSED
Your ML Agent is ready to use!

Step 1.2: Understand the File Structure

ml-agent/
├── ml_agent.py          # Main agent code
├── cli.py               # Command-line interface
├── config.py            # Configuration settings
├── example_usage.py     # Example scripts
├── test_agent.py        # Installation tests
├── requirements.txt     # Python dependencies
├── README.md            # Full documentation
├── QUICKSTART.md        # Quick reference
└── TUTORIAL.md          # This file

Part 2: Your First Model (10 minutes)

Step 2.1: Generate Sample Data

python example_usage.py --generate-data

This creates four sample datasets in sample_data/:

  • iris.csv - Classic flower classification
  • classification_example.csv - Binary classification
  • regression_example.csv - Continuous prediction
  • housing.csv - House price prediction

Step 2.2: Run the Agent (Simple Way)

python ml_agent.py sample_data/iris.csv

What happens:

  1. Loads the data ✓
  2. Detects it's a classification problem ✓
  3. Engineers features automatically ✓
  4. Trains 4 different models ✓
  5. Optimizes the best one ✓
  6. Saves everything to outputs/ ✓

Time: ~1-2 minutes

Step 2.3: Check the Results

# View the summary
cat outputs/reports/overview.md

# Check the performance
cat outputs/reports/results.md

# See the plots
ls outputs/plots/

Expected files:

outputs/
├── models/
│   └── final_model.joblib
├── plots/
│   ├── feature_importance.png
│   └── metric_comparison.png
├── reports/
│   ├── overview.md
│   ├── data_analysis.md
│   ├── modeling.md
│   └── results.md
└── strategy/
    └── <fingerprint>.json
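
To confirm that a run produced everything in the tree above, a small check script can help. This is a minimal sketch using only the standard library; missing_outputs and EXPECTED_FILES are hypothetical names, not part of the agent, and the strategy file is matched by extension because its name depends on the dataset fingerprint.

```python
from pathlib import Path

# Relative paths taken from the expected output tree.
EXPECTED_FILES = [
    "models/final_model.joblib",
    "plots/feature_importance.png",
    "plots/metric_comparison.png",
    "reports/overview.md",
    "reports/data_analysis.md",
    "reports/modeling.md",
    "reports/results.md",
]


def missing_outputs(output_dir="outputs"):
    """Return the expected artifacts that are absent under output_dir."""
    root = Path(output_dir)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    # The strategy file is named after the fingerprint, so accept any JSON file.
    if not any((root / "strategy").glob("*.json")):
        missing.append("strategy/*.json")
    return missing


for rel_path in missing_outputs("outputs"):
    print(f"Missing: {rel_path}")
```

An empty result means every expected artifact is present.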

Part 3: Understanding the Pipeline (10 minutes)

Step 3.1: The 7-Step Process

1. Dataset Fingerprinting

# Creates unique ID based on:
# - Shape: (150, 5)
# - Columns: ['sepal_length', 'sepal_width', ...]
# - Data types: [float, float, int, ...]
# - Missing values: {...}

Fingerprint: a3f5d8c9b2e1f4a7

Why? Reuse successful strategies on similar data.
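
One way such a fingerprint could be computed is to hash a canonical summary of the dataset's structure. The sketch below is an assumption about the approach, not the agent's actual implementation: dataset_fingerprint is a hypothetical helper, SHA-256 is a stand-in hash, and the dtype strings are illustrative.

```python
import hashlib
import json


def dataset_fingerprint(shape, columns, dtypes, missing):
    """Return a short, stable ID for a dataset's structure (hypothetical helper)."""
    # Collect the four ingredients listed above into one dictionary.
    summary = {
        "shape": list(shape),
        "columns": list(columns),
        "dtypes": [str(d) for d in dtypes],
        "missing": dict(missing),
    }
    # sort_keys keeps the JSON text, and so the hash, independent of dict order.
    payload = json.dumps(summary, sort_keys=True).encode("utf-8")
    # Keep 16 hex characters, the same length as the ID shown above.
    return hashlib.sha256(payload).hexdigest()[:16]


fingerprint = dataset_fingerprint(
    shape=(150, 5),
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
    dtypes=["float64", "float64", "float64", "float64", "object"],  # illustrative
    missing={},
)
print(f"Fingerprint: {fingerprint}")
```

Because only structure is hashed, two files with the same shape, columns, types, and missing-value counts would map to the same ID under this scheme.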

2. Data Analysis

Auto-detected:
- Target: species (last column)
- Problem: classification (few unique values)
- Features: 4 numerical
- Missing: None
- Classes: Balanced

Why? Understand data before processing.
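
The "few unique values" rule can be pictured as a simple heuristic on the target column. This is a sketch under stated assumptions: infer_problem_type is a hypothetical function and the cutoff of 20 classes is an arbitrary illustrative threshold, not a value taken from the agent.

```python
def infer_problem_type(target_values, max_classes=20):
    """Guess the problem type from the target column (hypothetical heuristic)."""
    unique = {v for v in target_values if v is not None}
    # Text labels can only be classes.
    if any(isinstance(v, str) for v in unique):
        return "classification"
    # A handful of whole numbers looks like class codes, not a continuous quantity.
    if len(unique) <= max_classes and all(float(v).is_integer() for v in unique):
        return "classification"
    return "regression"


print(infer_problem_type(["setosa", "versicolor", "virginica"] * 50))
print(infer_problem_type([606.0, 3067.2, 1250.75]))
```

When a heuristic like this guesses wrong, for example on an integer-valued regression target with few distinct values, stating the problem type explicitly is the safer choice.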

3. Feature Engineering

Numerical features:
  → Median imputation
  → No scaling (tree models don't need it)

Categorical features:
  → Most frequent imputation
  → Label encoding

Train/Test split: 80/20

Why? Prepare data for models without leakage.
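
"Without leakage" means that statistics such as the imputation median are learned from the training split only and then reused on the test split. A minimal standard-library sketch of that idea follows; split_rows, fit_median, and impute are hypothetical helpers, and the agent itself may implement this differently.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def fit_median(values):
    """Median of the observed (non-missing) values."""
    return statistics.median(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]


# One numerical feature with gaps; None marks a missing value.
column = [5.1, None, 4.7, 6.2, 5.9, None, 6.4, 5.0, 5.5, 7.0]
train, test = split_rows(column)

median = fit_median(train)            # learned from the training split only
train_filled = impute(train, median)
test_filled = impute(test, median)    # the test split reuses the training median
```

Computing the median over the full column before splitting would let test rows influence preprocessing, which tends to make test scores look better than they should.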

4. Model Training

Trained models:
✓ Logistic Regression  → 0.9333
✓ Random Forest        → 0.9667
✓ Extra Trees          → 0.9667
✓ Gradient Boosting    → 0.9667

Best: Random Forest (0.9667)

Why? Compare multiple approaches to find the best.
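
Three models tie at 0.9667 in the output above, so the choice of "best" depends on how ties are broken. The source does not say how the agent does this; the sketch below simply shows that Python's built-in max keeps the first of several equal maxima, which matches the result shown if the models are scored in the listed order.

```python
# Scores copied from the training output above, in the order listed.
scores = {
    "Logistic Regression": 0.9333,
    "Random Forest": 0.9667,
    "Extra Trees": 0.9667,
    "Gradient Boosting": 0.9667,
}

# max() returns the first key with the highest score when several tie.
best_name = max(scores, key=scores.get)

# sorted() is stable, so tied models keep their original relative order.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(f"Best: {best_name} ({scores[best_name]:.4f})")
for name, score in ranking:
    print(f"{name:<20} {score:.4f}")
```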

5. Hyperparameter Optimization

Using Optuna:
- 30 trials
- 3-fold CV
- Bayesian search

Optimized: 0.9733 (+0.0066)

Why? Squeeze out extra performance automatically.
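
To get a feel for what "30 trials, 3-fold CV" costs, it helps to see how the folds partition the data: each trial fits the model once per fold. The sketch below is illustrative only; k_fold_indices is a hypothetical helper, the 150-row size is the iris example, and real cross-validation usually shuffles or stratifies the rows first.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


n_trials, n_folds = 30, 3
folds = list(k_fold_indices(150, n_folds))
total_fits = n_trials * n_folds  # one model fit per trial per fold

for number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {number}: train={len(train_idx)} validate={len(val_idx)}")
print(f"Model fits during optimization: {total_fits}")
```

This is why raising the number of trials or folds lengthens the run roughly in proportion.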

6. Artifact Generation

Saved:
✓ final_model.joblib
✓ feature_importance.png
✓ metric_comparison.png
✓ 4 markdown reports

Why? Production-ready deliverables.

7. Strategy Memory

Stored strategy:
- Best model: Random Forest
- Best params: {...}
- Feature engineering: {...}
- Score: 0.9733

Why? Reuse on similar datasets later.
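
Strategy memory can be pictured as one JSON file per fingerprint, looked up before a new run starts. The sketch below is an assumption about the layout: save_strategy and load_strategy are hypothetical helpers and the JSON keys are illustrative, not the agent's actual schema.

```python
import json
import tempfile
from pathlib import Path


def save_strategy(strategy_dir, fingerprint, strategy):
    """Write the strategy to <strategy_dir>/<fingerprint>.json."""
    path = Path(strategy_dir) / f"{fingerprint}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(strategy, indent=2))
    return path


def load_strategy(strategy_dir, fingerprint):
    """Return the stored strategy, or None if this fingerprint is new."""
    path = Path(strategy_dir) / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())


# A temporary directory stands in for outputs/strategy/.
with tempfile.TemporaryDirectory() as tmp:
    strategy_dir = Path(tmp) / "strategy"
    save_strategy(
        strategy_dir,
        "a3f5d8c9b2e1f4a7",
        {"best_model": "Random Forest", "score": 0.9733},
    )
    known = load_strategy(strategy_dir, "a3f5d8c9b2e1f4a7")
    unknown = load_strategy(strategy_dir, "0000000000000000")

print(known)
print(unknown)
```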

Step 3.2: Read the Reports

overview.md - Project summary

# Quick Summary
- Problem: Classification
- Best Model: Random Forest
- Score: 0.9733
- Features: 4

data_analysis.md - Data insights

# Dataset Summary
- Rows: 150
- Missing: None
- Target distribution: Balanced

# Feature Engineering
- Numerical: Median imputation
- Categorical: Label encoding

modeling.md - Model comparison

| Model              | Accuracy |
|--------------------|----------|
| Logistic Reg       | 0.9333   |
| Random Forest      | 0.9667   |
| Extra Trees        | 0.9667   |
| Gradient Boosting  | 0.9667   |

results.md - Final performance & next steps

# Final Score: 0.9733

## How to Use
[Code examples]

## Next Steps
- Feature engineering
- Ensemble methods
- More data

Part 4: Using the Model (5 minutes)

Step 4.1: Load the Model

import joblib
import pandas as pd

# Load model package
model_pkg = joblib.load('outputs/models/final_model.joblib')

# Extract components
model = model_pkg['model']
preprocessor = model_pkg['preprocessor']
feature_names = model_pkg['feature_names']
problem_type = model_pkg['problem_type']

print(f"Model: {type(model).__name__}")
print(f"Features: {feature_names}")
print(f"Problem: {problem_type}")

Step 4.2: Make Predictions

# Create new data (same format as training)
new_data = pd.DataFrame({
    'sepal_length': [5.1, 6.2],
    'sepal_width': [3.5, 2.8],
    'petal_length': [1.4, 4.8],
    'petal_width': [0.2, 1.8]
})

# Predict
predictions = model.predict(new_data)
print(f"Predictions: {predictions}")

# Get probabilities (classification only)
if hasattr(model, 'predict_proba'):
    probabilities = model.predict_proba(new_data)
    print(f"Probabilities: {probabilities}")
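
Because the model expects the same columns it was trained on, it can be worth validating incoming records against feature_names before predicting. A minimal sketch; align_features is a hypothetical helper, not part of the agent, and the feature list here is hard-coded rather than read from the model package.

```python
def align_features(record, feature_names):
    """Return the record's values in training order, rejecting bad input."""
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    extra = [name for name in record if name not in feature_names]
    if extra:
        raise ValueError(f"Unexpected features: {extra}")
    return [record[name] for name in feature_names]


# In practice this list would come from model_pkg['feature_names'].
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Keys arrive in a different order than the model was trained with.
record = {"petal_width": 0.2, "sepal_length": 5.1, "petal_length": 1.4, "sepal_width": 3.5}
row = align_features(record, feature_names)
print(row)
```

Failing early with a clear message is usually easier to debug than a shape or column error raised inside the model.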

Step 4.3: Deploy the Model

Option A: Flask API

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
# Load the model package once at startup, not on every request
model_pkg = joblib.load('outputs/models/final_model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON that maps each feature name to a list of values
    data = request.json
    df = pd.DataFrame(data)
    predictions = model_pkg['model'].predict(df)
    return jsonify({'predictions': predictions.tolist()})

if __name__ == '__main__':
    app.run(port=5000)

Option B: FastAPI

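from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()
# Load the model package once at startup, not on every request
model_pkg = joblib.load('outputs/models/final_model.joblib')

@app.post("/predict")
def predict(data: dict):
    # Expects a JSON body that maps each feature name to a list of values
    df = pd.DataFrame(data)
    predictions = model_pkg['model'].predict(df)
    return {"predictions": predictions.tolist()}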
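
Both endpoints pass the request body straight to pd.DataFrame, so a client should send a JSON object that maps each feature name to a list of values. The sketch below builds such a request with the standard library; build_predict_request is a hypothetical helper, and the URL assumes the Flask option running locally on port 5000.

```python
import json
import urllib.request


def build_predict_request(columns, url="http://localhost:5000/predict"):
    """Build a POST request whose JSON body is a column-oriented dict."""
    body = json.dumps(columns).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request({
    "sepal_length": [5.1, 6.2],
    "sepal_width": [3.5, 2.8],
    "petal_length": [1.4, 4.8],
    "petal_width": [0.2, 1.8],
})

# To send it once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

The JSON content type matters for the Flask version, since request.json only parses bodies declared as JSON.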

Part 5: Advanced Usage (10 minutes)

Step 5.1: Customize via CLI

# Specify target and problem type
python cli.py data.csv --target price --problem regression

# Custom output directory
python cli.py data.csv --output my_results

# More aggressive optimization
python cli.py data.csv --max-iter 5 --target-metric 0.98

# Quick mode (skips hyperparameter optimization, HPO)
python cli.py data.csv --quick

Step 5.2: Customize via Python

from ml_agent import MLAgent

# Advanced configuration
agent = MLAgent(
    data_path="data.csv",
    output_dir="production_model",
    target_col="revenue",
    problem_type="regression",
    max_iterations=5,
    target_metric_threshold=0.95,
    improvement_threshold=0.005
)

# Run pipeline
agent.run()

# Access results
print(f"Best score: {agent.best_score}")
print(f"Best model: {type(agent.best_model).__name__}")

Step 5.3: Modify Configuration

Edit config.py:

# Change default models
CLASSIFICATION_MODELS = {
    'Random Forest': {
        'enabled': True,
        'params': {
            'n_estimators': 200,  # More trees
            'max_depth': 15,      # Deeper trees
        }
    }
}

# Adjust optimization
OPTUNA_N_TRIALS = 50  # More trials (slower but better)

# Change metrics
PRIMARY_METRIC_CLASSIFICATION = 'f1'  # Use F1 instead of accuracy

Step 5.4: Add Custom Models

Edit ml_agent.py in train_models():

# Add XGBoost (if installed)
try:
    from xgboost import XGBClassifier
    models['XGBoost'] = XGBClassifier(
        n_estimators=100,
        random_state=42
    )
except ImportError:
    pass

# Add LightGBM (if installed)
try:
    from lightgbm import LGBMClassifier
    models['LightGBM'] = LGBMClassifier(
        n_estimators=100,
        random_state=42
    )
except ImportError:
    pass

Part 6: Real-World Example

Scenario: Predict Customer Churn

Data: customer_data.csv

customer_id,age,tenure,monthly_charges,total_charges,churn
1,25,12,50.5,606.0,0
2,45,36,85.2,3067.2,1
...

Step 1: Prepare Data

# Ensure data has header row
# Target should be 'churn' (last column)
# Remove 'customer_id' if present
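
These preparation steps can be scripted. The sketch below uses only the standard library and the sample rows shown above; prepare_rows is a hypothetical helper that checks for a header, drops the ID column, and moves the target to the last position.

```python
import csv
import io

# The two sample rows from customer_data.csv shown above.
RAW = """customer_id,age,tenure,monthly_charges,total_charges,churn
1,25,12,50.5,606.0,0
2,45,36,85.2,3067.2,1
"""


def prepare_rows(text, id_column="customer_id", target="churn"):
    """Drop the ID column and return (columns, rows) with the target last."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None:
        raise ValueError("CSV has no header row")
    if target not in reader.fieldnames:
        raise ValueError(f"Target column {target!r} not found")
    features = [c for c in reader.fieldnames if c not in (id_column, target)]
    columns = features + [target]
    rows = [{c: row[c] for c in columns} for row in reader]
    return columns, rows


columns, rows = prepare_rows(RAW)
print(columns)
print(rows[0])
```

ID columns carry no predictive signal, and leaving one in invites the model to memorize individual rows.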

Step 2: Run Agent

python cli.py customer_data.csv \
    --target churn \
    --problem classification \
    --output customer_churn_model \
    --target-metric 0.90

Step 3: Review Results

cat customer_churn_model/reports/results.md

Step 4: Deploy

import joblib
import pandas as pd

# Load model
model_pkg = joblib.load('customer_churn_model/models/final_model.joblib')

# Predict on new customers
# (new_customers.csv needs the same feature columns used in training, without customer_id)
new_customers = pd.read_csv('new_customers.csv')
churn_predictions = model_pkg['model'].predict(new_customers)
# Column 1 of predict_proba is the probability of class 1 (churn)
churn_probs = model_pkg['model'].predict_proba(new_customers)[:, 1]

# Identify high-risk customers
high_risk = new_customers[churn_probs > 0.7]
print(f"High-risk customers: {len(high_risk)}")

Part 7: Troubleshooting

Issue 1: Low Performance

Problem: Model accuracy < 0.7

Solutions:

  1. Check data quality (missing values, outliers)
  2. Try more iterations: --max-iter 5
  3. Add more data if possible
  4. Try feature engineering (see reports for suggestions)

Issue 2: Memory Error

Problem: "MemoryError" during training

Solutions:

  1. Reduce n_estimators in config
  2. Train on a smaller share of the data (lower the training fraction of the split)
  3. Sample large datasets before running
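
For the last suggestion, a large CSV can be down-sampled without loading it all into memory. The sketch below uses reservoir sampling over an iterable of lines; sample_rows is a hypothetical helper, and the fixed seed is only there to make the result repeatable.

```python
import random


def sample_rows(lines, k, seed=42):
    """Reservoir-sample k data rows from CSV lines, always keeping the header."""
    iterator = iter(lines)
    header = next(iterator)
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(iterator):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(line)
        else:
            # Replace a kept row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return [header] + reservoir


# Stand-in for a large file; any iterable of lines works, including a file object.
lines = ["id,value"] + [f"{n},{n * 2}" for n in range(1000)]
sample = sample_rows(lines, k=100)
print(len(sample) - 1, "rows sampled")
```

Passing an open file object instead of a list keeps memory use proportional to k rather than to the file size.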

Issue 3: Slow Performance

Problem: Takes > 5 minutes

Solutions:

  1. Use --quick mode to skip HPO
  2. Reduce OPTUNA_N_TRIALS in config
  3. Disable slow models (e.g., SVM)

Issue 4: Optuna Not Found

Problem: "Optuna not available"

Solutions:

pip install optuna

Or continue without it (HPO will be skipped)


Part 8: Best Practices

1. Data Preparation

  • ✅ Include header row in CSV
  • ✅ Put target column last (or specify with --target)
  • ✅ Remove ID columns
  • ✅ Handle extreme outliers before running

2. Model Selection

  • ✅ Start with default models
  • ✅ Add custom models only if needed
  • ✅ Use quick mode for experimentation
  • ✅ Use full mode for production

3. Iteration

  • ✅ Review generated reports
  • ✅ Follow "Next Steps" suggestions
  • ✅ Use strategy memory for similar datasets
  • ✅ Compare fingerprints to reuse strategies

4. Deployment

  • ✅ Test model on holdout data
  • ✅ Monitor performance in production
  • ✅ Retrain periodically with new data
  • ✅ Version control model artifacts

Part 9: Going Further

Add Advanced Features

Feature Selection:

# Edit config.py
ENABLE_FEATURE_SELECTION = True
FEATURE_SELECTION_METHOD = 'mutual_info'

Class Imbalance:

# Edit config.py
HANDLE_CLASS_IMBALANCE = True
CLASS_IMBALANCE_METHOD = 'smote'

Custom Metrics:

# Edit ml_agent.py
from sklearn.metrics import matthews_corrcoef

# In train_models():
metrics['mcc'] = matthews_corrcoef(y_test, y_pred)
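
For intuition about what this metric reports on a binary problem, the Matthews correlation coefficient can be computed directly from the four confusion-matrix counts. A minimal sketch; mcc_from_counts is a hypothetical helper, not part of the agent or scikit-learn, and returning 0.0 for a zero denominator is a common convention.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when a whole row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(mcc_from_counts(tp=50, tn=50, fp=0, fn=0))    # all predictions correct
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))  # no better than chance
print(mcc_from_counts(tp=0, tn=0, fp=50, fn=50))    # all predictions wrong
```

Unlike accuracy, the coefficient stays near zero for a model that always predicts the majority class, which makes it a useful check on imbalanced data.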

Integrate with MLOps

Track with MLflow:

import mlflow
import mlflow.sklearn

# `agent` is the MLAgent instance configured in Step 5.2
with mlflow.start_run():
    agent.run()
    mlflow.log_metric("score", agent.best_score)
    mlflow.sklearn.log_model(agent.best_model, "model")

Version with DVC:

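dvc add outputs/models/final_model.joblib
# dvc add also writes a .gitignore next to the model so Git skips the binary
git add outputs/models/final_model.joblib.dvc outputs/models/.gitignore
git commit -m "Add trained model v1.0"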

Summary

You've learned to:

  • ✅ Install and test the ML Agent
  • ✅ Run the full pipeline on sample data
  • ✅ Understand each step of the process
  • ✅ Load and use trained models
  • ✅ Customize the agent for your needs
  • ✅ Deploy models to production
  • ✅ Handle common issues

Next Steps:

  1. Try on your own data
  2. Experiment with configurations
  3. Add custom models
  4. Deploy to production
  5. Monitor and iterate

Resources

  • Documentation: README.md
  • Quick Reference: QUICKSTART.md
  • Configuration: config.py
  • Examples: example_usage.py
  • Tests: test_agent.py

Happy modeling! 🤖