This tutorial walks you through using the ML Agent from installation to deployment.
By the end of this tutorial, you will:
- ✅ Install and set up the ML Agent
- ✅ Generate and prepare sample data
- ✅ Run the full ML pipeline
- ✅ Understand the generated outputs
- ✅ Use the trained model for predictions
- ✅ Customize the agent for your needs
Time Required: 30 minutes
# Install required packages
pip install -r requirements.txt
# Verify installation
python test_agent.py
Expected output:
✅ ALL TESTS PASSED
Your ML Agent is ready to use!
ml-agent/
├── ml_agent.py          # Main agent code
├── cli.py               # Command-line interface
├── config.py            # Configuration settings
├── example_usage.py     # Example scripts
├── test_agent.py        # Installation tests
├── requirements.txt     # Python dependencies
├── README.md            # Full documentation
├── QUICKSTART.md        # Quick reference
└── TUTORIAL.md          # This file
python example_usage.py --generate-data
This creates four sample datasets:
- iris.csv - Classic flower classification
- classification_example.csv - Binary classification
- regression_example.csv - Continuous prediction
- housing.csv - House price prediction
python ml_agent.py sample_data/iris.csv
What happens:
- Loads the data ✓
- Detects it's a classification problem ✓
- Engineers features automatically ✓
- Trains 4 different models ✓
- Optimizes the best one ✓
- Saves everything to outputs/ ✓
Time: ~1-2 minutes
# View the summary
cat outputs/reports/overview.md
# Check the performance
cat outputs/reports/results.md
# See the plots
ls outputs/plots/
Expected files:
outputs/
├── models/
│   └── final_model.joblib
├── plots/
│   ├── feature_importance.png
│   └── metric_comparison.png
├── reports/
│   ├── overview.md
│   ├── data_analysis.md
│   ├── modeling.md
│   └── results.md
└── strategy/
    └── <fingerprint>.json
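To confirm that a run produced the files listed in the tree above, a small check like the following may help. It is only a sketch: `missing_outputs` is a hypothetical helper (not part of the ML Agent), and it assumes the default `outputs/` layout shown here.

```python
from pathlib import Path

# Relative paths taken from the expected-files tree above.
# The strategy file is left out because its name depends on the fingerprint.
EXPECTED_OUTPUTS = [
    "models/final_model.joblib",
    "plots/feature_importance.png",
    "plots/metric_comparison.png",
    "reports/overview.md",
    "reports/data_analysis.md",
    "reports/modeling.md",
    "reports/results.md",
]


def missing_outputs(output_dir="outputs"):
    """Return the expected files that are absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected output files are present.")
```

If you ran the agent with a custom output directory, pass that directory to `missing_outputs` instead.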
# Creates unique ID based on:
# - Shape: (150, 5)
# - Columns: ['sepal_length', 'sepal_width', ...]
# - Data types: [float, float, int, ...]
# - Missing values: {...}
Fingerprint: a3f5d8c9b2e1f4a7
Why? Reuse successful strategies on similar data.
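The fingerprint idea can be sketched as a hash over the structural summary listed above (shape, columns, data types, missing values). This is an illustration under assumptions: `dataset_fingerprint` is a hypothetical function, and the agent's real fields and hash algorithm may differ.

```python
import hashlib
import json


def dataset_fingerprint(columns, dtypes, n_rows, missing_counts):
    """Hash a dataset's structural summary into a short hex ID.

    Assumption: the real agent may use different fields or a different hash.
    """
    summary = {
        "shape": [n_rows, len(columns)],
        "columns": list(columns),
        "dtypes": [str(t) for t in dtypes],
        "missing": {c: int(missing_counts.get(c, 0)) for c in columns},
    }
    # sort_keys makes the JSON text, and therefore the hash, deterministic
    payload = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
types = ["float64", "float64", "float64", "float64", "int64"]
fp = dataset_fingerprint(cols, types, n_rows=150, missing_counts={})
print(fp)  # 16 hex characters; identical inputs always give the same ID
```

Note that an exact hash like this only matches datasets with identical structure. Matching merely similar datasets would need a looser comparison, which this sketch does not attempt.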
Auto-detected:
- Target: species (last column)
- Problem: classification (few unique values)
- Features: 4 numerical
- Missing: None
- Classes: Balanced
Why? Understand data before processing.
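The auto-detection described above (last column as target, few unique values meaning classification) could look roughly like this. The function name and the `max_classes=20` cutoff are assumptions made for illustration, not the agent's actual rule.

```python
def infer_problem_type(target_values, max_classes=20):
    """Guess 'classification' or 'regression' from the target column.

    Assumption: max_classes=20 is an illustrative cutoff, not the agent's.
    """
    unique = set(target_values)
    # Non-numeric labels (strings, booleans) can only be classes
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in unique):
        return "classification"
    # Numeric target: few distinct values suggests class labels
    if len(unique) <= max_classes:
        return "classification"
    return "regression"


columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
target = columns[-1]  # default: the last column is the target
print(target)                                              # species
print(infer_problem_type([0, 1, 2] * 50))                  # classification
print(infer_problem_type([x * 0.37 for x in range(150)]))  # regression
```

A heuristic like this can misread a regression target that happens to have few distinct values (integer ratings, for example), which is one reason to pass `--target` and `--problem` explicitly when you know the answer.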
Numerical features:
  → Median imputation
  → No scaling (tree models don't need it)
Categorical features:
  → Most frequent imputation
  → Label encoding
Train/Test split: 80/20
Why? Prepare data for models without leakage.
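"Without leakage" means that preprocessing statistics are computed from the training rows only. The sketch below shows that ordering for a single numeric column: split 80/20 first, then impute with the training median. `split_then_impute` is a hypothetical helper written for this tutorial; the agent's own code may be organized differently.

```python
import random
from statistics import median


def split_then_impute(values, test_fraction=0.2, seed=42):
    """Split one numeric column 80/20, then fill gaps with the train median."""
    rows = list(values)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]

    # The median comes from training rows only, so nothing about the
    # test rows influences the preprocessing step.
    train_median = median(v for v in train if v is not None)

    def fill(column):
        return [train_median if v is None else v for v in column]

    return fill(train), fill(test), train_median


values = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0, 9.0, 10.0]
train, test, med = split_then_impute(values)
print(len(train), len(test))  # 8 2
```

Computing the median over all rows before splitting would let test values shape the training data, which tends to make test scores look better than they should.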
Trained models:
✓ Logistic Regression → 0.9333
✓ Random Forest → 0.9667
✓ Extra Trees → 0.9667
✓ Gradient Boosting → 0.9667
Best: Random Forest (0.9667)
Why? Compare multiple approaches to find the best.
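Three models tie at 0.9667 in the comparison above, and this tutorial does not say how the agent breaks ties. One plausible reading, shown here purely as an assumption, is that the first model in training order wins:

```python
# Scores copied from the comparison above
scores = {
    "Logistic Regression": 0.9333,
    "Random Forest": 0.9667,
    "Extra Trees": 0.9667,
    "Gradient Boosting": 0.9667,
}

# max() returns the first maximal key in insertion order,
# so a tie goes to the model that was listed first.
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])  # Random Forest 0.9667
```

With a tie this close on a small test set, the choice between the tied models is largely arbitrary, so it is worth checking modeling.md before reading much into which one was picked.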
Using Optuna:
- 30 trials
- 3-fold CV
- Bayesian search
Optimized: 0.9733 (+0.0066)
Why? Squeeze out extra performance automatically.
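During optimization each trial is scored with 3-fold cross-validation. The sketch below shows only that scoring part, using the standard library rather than Optuna or scikit-learn; `k_fold_indices` and `cv_score` are hypothetical names, and the agent itself most likely relies on library routines (which may also stratify by class).

```python
import random


def k_fold_indices(n_samples, k=3, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # avoid folds that follow file order
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cv_score(score_fn, n_samples, k=3):
    """Average score_fn(train_idx, val_idx) over the k folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train_idx, val_idx in k_fold_indices(150, k=3):
    print(len(train_idx), len(val_idx))  # 100 50 on each of the three folds
```

In a real trial, `score_fn` would fit a model with the trial's hyperparameters on the training indices and return its score on the validation indices.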
Saved:
✓ final_model.joblib
✓ feature_importance.png
✓ metric_comparison.png
✓ 4 markdown reports
Why? Production-ready deliverables.
Stored strategy:
- Best model: Random Forest
- Best params: {...}
- Feature engineering: {...}
- Score: 0.9733
Why? Reuse on similar datasets later.
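Strategy memory can be pictured as one JSON file per fingerprint, matching the `strategy/<fingerprint>.json` entry in the output tree. The helpers and dictionary keys below are hypothetical and only illustrate the lookup; the agent's stored format is not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_strategy(strategy_dir, fingerprint, strategy):
    """Write a strategy dict to <strategy_dir>/<fingerprint>.json."""
    directory = Path(strategy_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{fingerprint}.json"
    path.write_text(json.dumps(strategy, indent=2), encoding="utf-8")
    return path


def load_strategy(strategy_dir, fingerprint):
    """Return the stored strategy for this fingerprint, or None if unseen."""
    path = Path(strategy_dir) / f"{fingerprint}.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Demo in a temporary directory so nothing in outputs/ is touched
with tempfile.TemporaryDirectory() as tmp:
    save_strategy(tmp, "a3f5d8c9b2e1f4a7",
                  {"best_model": "Random Forest", "score": 0.9733})
    print(load_strategy(tmp, "a3f5d8c9b2e1f4a7"))
    print(load_strategy(tmp, "0000000000000000"))  # None: nothing stored yet
```

A new dataset whose fingerprint already has a file could then start from the stored model and parameters instead of searching from scratch.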
overview.md - Project summary
# Quick Summary
- Problem: Classification
- Best Model: Random Forest
- Score: 0.9733
- Features: 4
data_analysis.md - Data insights
# Dataset Summary
- Rows: 150
- Missing: None
- Target distribution: Balanced
# Feature Engineering
- Numerical: Median imputation
- Categorical: Label encoding
modeling.md - Model comparison
| Model             | Accuracy |
|-------------------|----------|
| Logistic Reg      | 0.9333   |
| Random Forest     | 0.9667   |
| Extra Trees       | 0.9667   |
| Gradient Boosting | 0.9667   |
results.md - Final performance & next steps
# Final Score: 0.9733
## How to Use
[Code examples]
## Next Steps
- Feature engineering
- Ensemble methods
- More data
import joblib
import pandas as pd
# Load model package
model_pkg = joblib.load('outputs/models/final_model.joblib')
# Extract components
model = model_pkg['model']
preprocessor = model_pkg['preprocessor']
feature_names = model_pkg['feature_names']
problem_type = model_pkg['problem_type']
print(f"Model: {type(model).__name__}")
print(f"Features: {feature_names}")
print(f"Problem: {problem_type}")
# Create new data (same format as training)
new_data = pd.DataFrame({
'sepal_length': [5.1, 6.2],
'sepal_width': [3.5, 2.8],
'petal_length': [1.4, 4.8],
'petal_width': [0.2, 1.8]
})
# Predict
predictions = model.predict(new_data)
print(f"Predictions: {predictions}")
# Get probabilities (classification only)
if hasattr(model, 'predict_proba'):
    probabilities = model.predict_proba(new_data)
    print(f"Probabilities: {probabilities}")
Option A: Flask API
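from flask import Flask, request, jsonify
import joblib
import pandas as pd  # needed to build the DataFrame below
app = Flask(__name__)
# Load the model package once at startup, not on every request
model_pkg = joblib.load('outputs/models/final_model.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body mapping column names to lists of values
    data = request.get_json()
    df = pd.DataFrame(data)
    predictions = model_pkg['model'].predict(df)
    return jsonify({'predictions': predictions.tolist()})
if __name__ == '__main__':
    # Development server only; use a WSGI server in production
    app.run(port=5000)
Option B: FastAPI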
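from fastapi import FastAPI
import joblib
import pandas as pd  # needed to build the DataFrame below
app = FastAPI()
# Load the model package once at startup, not on every request
model_pkg = joblib.load('outputs/models/final_model.joblib')
@app.post("/predict")
def predict(data: dict):
    # `data` is the parsed JSON body: column name -> list of values
    df = pd.DataFrame(data)
    predictions = model_pkg['model'].predict(df)
    return {"predictions": predictions.tolist()}
# Specify target and problem type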
python cli.py data.csv --target price --problem regression
# Custom output directory
python cli.py data.csv --output my_results
# More aggressive optimization
python cli.py data.csv --max-iter 5 --target-metric 0.98
# Quick mode (no HPO)
python cli.py data.csv --quick
from ml_agent import MLAgent
# Advanced configuration
agent = MLAgent(
    data_path="data.csv",
    output_dir="production_model",
    target_col="revenue",
    problem_type="regression",
    max_iterations=5,
    target_metric_threshold=0.95,
    improvement_threshold=0.005
)
# Run pipeline
agent.run()
# Access results
print(f"Best score: {agent.best_score}")
print(f"Best model: {type(agent.best_model).__name__}")
Edit config.py:
# Change default models
CLASSIFICATION_MODELS = {
    'Random Forest': {
        'enabled': True,
        'params': {
            'n_estimators': 200,  # More trees
            'max_depth': 15,      # Deeper trees
        }
    }
}
# Adjust optimization
OPTUNA_N_TRIALS = 50 # More trials (slower but better)
# Change metrics
PRIMARY_METRIC_CLASSIFICATION = 'f1' # Use F1 instead of accuracy
Edit ml_agent.py in train_models():
# Add XGBoost (if installed)
try:
    from xgboost import XGBClassifier
    models['XGBoost'] = XGBClassifier(
        n_estimators=100,
        random_state=42
    )
except ImportError:
    pass
# Add LightGBM (if installed)
try:
    from lightgbm import LGBMClassifier
    models['LightGBM'] = LGBMClassifier(
        n_estimators=100,
        random_state=42
    )
except ImportError:
    pass
Data: customer_data.csv
customer_id,age,tenure,monthly_charges,total_charges,churn
1,25,12,50.5,606.0,0
2,45,36,85.2,3067.2,1
...
Step 1: Prepare Data
# Ensure data has header row
# Target should be 'churn' (last column)
# Remove 'customer_id' if present
Step 2: Run Agent
python cli.py customer_data.csv \
  --target churn \
  --problem classification \
  --output customer_churn_model \
  --target-metric 0.90
Step 3: Review Results
cat customer_churn_model/reports/results.md
Step 4: Deploy
import joblib
import pandas as pd
# Load model
model_pkg = joblib.load('customer_churn_model/models/final_model.joblib')
# Predict on new customers (same feature columns as training, without 'customer_id')
new_customers = pd.read_csv('new_customers.csv')
churn_predictions = model_pkg['model'].predict(new_customers)
# Column 1 holds the probability of the positive class (churn = 1)
churn_probs = model_pkg['model'].predict_proba(new_customers)[:, 1]
# Identify high-risk customers
high_risk = new_customers[churn_probs > 0.7]
print(f"High-risk customers: {len(high_risk)}")
Problem: Model accuracy < 0.7
Solutions:
- Check data quality (missing values, outliers)
- Try more iterations: --max-iter 5
- Add more data if possible
- Try feature engineering (see reports for suggestions)
Problem: "MemoryError" during training
Solutions:
- Reduce n_estimators in config
- Use smaller train/test split
- Sample large datasets before running
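For the last suggestion, one way to sample a large CSV without loading it all into memory is reservoir sampling. The sketch below uses only the standard library; `sample_csv` and the file names in the note after it are hypothetical, not part of the ML Agent.

```python
import csv
import random


def sample_csv(src_path, dst_path, n_rows, seed=42):
    """Write the header plus a uniform random sample of n_rows data rows."""
    rng = random.Random(seed)
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        reservoir = []
        for i, row in enumerate(reader):
            if i < n_rows:
                # Fill the reservoir with the first n_rows rows
                reservoir.append(row)
            else:
                # Keep this row with probability n_rows / (i + 1)
                j = rng.randint(0, i)
                if j < n_rows:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

For example, `sample_csv("big.csv", "big_sample.csv", n_rows=10000)` followed by `python cli.py big_sample.csv`. A uniform sample can underrepresent rare classes, so check the target distribution in data_analysis.md afterwards.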
Problem: Takes > 5 minutes
Solutions:
- Use --quick mode to skip HPO
- Reduce OPTUNA_N_TRIALS in config
- Disable slow models (e.g., SVM)
Problem: "Optuna not available"
Solutions:
pip install optuna
Or continue without it (HPO will be skipped)
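To check up front whether HPO will run, you can test for the optional dependency yourself. This is a minimal sketch using the standard library; `hpo_available` is a hypothetical helper and not how the agent necessarily performs its own check.

```python
import importlib.util


def hpo_available(module_name="optuna"):
    """Return True if the optional HPO dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None


if hpo_available():
    print("Optuna found: hyperparameter optimization can run.")
else:
    print("Optuna not found: HPO will be skipped (pip install optuna).")
```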
- ✅ Include header row in CSV
- ✅ Put target column last (or specify with --target)
- ✅ Remove ID columns
- ✅ Handle extreme outliers before running
- ✅ Start with default models
- ✅ Add custom models only if needed
- ✅ Use quick mode for experimentation
- ✅ Use full mode for production
- ✅ Review generated reports
- ✅ Follow "Next Steps" suggestions
- ✅ Use strategy memory for similar datasets
- ✅ Compare fingerprints to reuse strategies
- ✅ Test model on holdout data
- ✅ Monitor performance in production
- ✅ Retrain periodically with new data
- ✅ Version control model artifacts
Feature Selection:
# Edit config.py
ENABLE_FEATURE_SELECTION = True
FEATURE_SELECTION_METHOD = 'mutual_info'
Class Imbalance:
# Edit config.py
HANDLE_CLASS_IMBALANCE = True
CLASS_IMBALANCE_METHOD = 'smote'
Custom Metrics:
# Edit ml_agent.py
from sklearn.metrics import matthews_corrcoef
# In train_models():
metrics['mcc'] = matthews_corrcoef(y_test, y_pred)
Track with MLflow:
import mlflow
import mlflow.sklearn
# `agent` is the MLAgent instance created earlier
with mlflow.start_run():
    agent.run()
    mlflow.log_metric("score", agent.best_score)
    mlflow.sklearn.log_model(agent.best_model, "model")
Version with DVC:
dvc add outputs/models/final_model.joblib
git add outputs/models/final_model.joblib.dvc
git commit -m "Add trained model v1.0"
You've learned to:
- ✅ Install and test the ML Agent
- ✅ Run the full pipeline on sample data
- ✅ Understand each step of the process
- ✅ Load and use trained models
- ✅ Customize the agent for your needs
- ✅ Deploy models to production
- ✅ Handle common issues
Next Steps:
- Try on your own data
- Experiment with configurations
- Add custom models
- Deploy to production
- Monitor and iterate
- Documentation: README.md
- Quick Reference: QUICKSTART.md
- Configuration: config.py
- Examples: example_usage.py
- Tests: test_agent.py
Happy modeling! 🤖