
LLM Fine-Tuning Agentic System

Author: Ahmed Raoofuddin
GitHub: https://github.com/AhmedRaoofuddin
Role: AI Engineer / Full-Stack ML Engineer

A production-grade agentic system combining LLM fine-tuning with OpenAI function calling, enabling intelligent tool usage for database search, news retrieval, ROI calculation, and document summarization. Built for deployment on Google Cloud Platform with support for both CPU and GPU inference.

Python · PyTorch · Transformers · PEFT · FastAPI · Docker · Google Cloud Run · MIT License


Problem Statement

Modern AI applications require more than language generation: they need to interact with external systems, retrieve information, perform calculations, and summarize content. This project addresses the challenge of building a production-ready agentic system that combines:

  • Efficient LLM Fine-Tuning: Parameter-efficient techniques (LoRA/QLoRA) for adapting large language models to specific tasks
  • Tool Calling Capabilities: Integration with OpenAI function calling for structured tool execution
  • Production Deployment: Cloud-native architecture ready for Google Cloud Platform

System Architecture

graph TB
    A[Client Request] --> B[FastAPI Server]
    B --> C[Agent Service]
    C --> D[OpenAI API]
    D --> E[Function Calling]
    E --> F[Tool Execution]
    F --> G1[DB Search Tool]
    F --> G2[News Fetch Tool]
    F --> G3[ROI Calculator Tool]
    F --> G4[Document Summarizer Tool]
    G1 --> H[SQLite Database]
    G2 --> I[News API / Mock]
    G3 --> J[Financial Calculations]
    G4 --> K[Text Processing]
    G1 --> L[Response Aggregation]
    G2 --> L
    G3 --> L
    G4 --> L
    L --> M[Formatted Response]
    M --> A

Features

Core Capabilities

  • Tool-Calling Agent: OpenAI function calling integration for structured tool execution
  • Database Search: SQLite-based document search with ranking and filtering
  • News Retrieval: Real-time news fetching with fallback to mock data
  • ROI Calculator: Financial metrics calculation including annualized ROI and payback period
  • Document Summarizer: Extractive summarization with compression metrics
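
Each tool is exposed to the model through an OpenAI function-calling schema. The actual declarations live in src/agent/tools.py; the following is only a sketch of what the ROI tool's schema might look like (field names and parameters are illustrative, not copied from the source):

```python
# Illustrative OpenAI function-calling tool declaration.
# The real schemas live in src/agent/tools.py; parameter names here are assumptions.
roi_tool_schema = {
    "type": "function",
    "function": {
        "name": "tool_calculate_roi",
        "description": "Compute ROI, annualized ROI, and payback period for an investment.",
        "parameters": {
            "type": "object",
            "properties": {
                "initial_investment": {"type": "number", "description": "Amount invested"},
                "final_value": {"type": "number", "description": "Value at end of period"},
                "years": {"type": "number", "description": "Holding period in years"},
            },
            "required": ["initial_investment", "final_value", "years"],
        },
    },
}
```

A list of such schemas is passed as the `tools` argument to `chat.completions.create`; the model then returns a `tool_call` whose JSON arguments conform to this schema, and the agent dispatches it to the matching Python function.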

Training & Fine-Tuning

  • QLoRA Support: 4-bit quantized training for memory-efficient fine-tuning
  • LoRA Configuration: Customizable rank, alpha, and dropout parameters
  • YAML Configuration: Manage training hyperparameters via config files
  • CLI Training: Flexible command-line interface with argument overrides
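
To see why LoRA is parameter-efficient: for an adapted weight matrix of shape (d_out, d_in), LoRA trains two small matrices A (r × d_in) and B (d_out × r) instead of the matrix itself. A back-of-the-envelope count (the hidden size below is typical for a 7B Llama-style model, not read from this repo):

```python
# Trainable parameters per adapted matrix: r * (d_in + d_out),
# versus d_in * d_out for full fine-tuning of that matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d = 4096          # hidden size of a typical 7B Llama-style model (assumption)
r = 64            # lora_r from configs/train_qlora.yaml
full = d * d      # fully fine-tuning one square projection
lora = lora_params(d, d, r)

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
# → full: 16,777,216  lora: 524,288  ratio: 3.125%
```

So at r=64, each adapter trains roughly 3% of its matrix's parameters, which is what makes fine-tuning feasible alongside 4-bit quantized base weights.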

Evaluation & Benchmarking

  • Model Evaluation: Compare base model vs fine-tuned model with accuracy metrics
  • Performance Benchmarking: Measure tokens/sec, latency, and GPU memory usage
  • Visual Analytics: Automated graph generation for performance visualization
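
The tokens/sec and latency numbers presumably come from timing generation runs in src/benchmark.py; a minimal, model-agnostic sketch of such a measurement (the `generate` callable is a stand-in for the real model):

```python
import time

def benchmark(generate, prompt: str, runs: int = 5):
    """Time a generate(prompt) -> tokens callable; return mean latency (s)
    and throughput (tokens/sec). `generate` is a stand-in for the model."""
    latencies, token_counts = [], []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(len(tokens))
    mean_latency = sum(latencies) / runs
    tokens_per_sec = sum(token_counts) / sum(latencies)
    return mean_latency, tokens_per_sec

# Usage with a dummy "model" that just splits the prompt into tokens:
lat, tps = benchmark(lambda p: p.split(), "the quick brown fox")
```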

Deployment

  • Cloud Run Ready: Docker containerization for Google Cloud Run
  • GPU Support: Optional GPU deployment for faster inference
  • API Server: FastAPI-based REST API for production use
  • Gradio Demo: Interactive web interface for testing

Results & Benchmarks

Performance Metrics

Based on evaluation runs, the fine-tuned model shows significant improvements:

Metric         Base Model    Fine-Tuned Model    Improvement
Accuracy       65.0%         82.0%               +17.0%
Avg Latency    -             49.3 ms             -
Throughput     -             13.0 tokens/sec     -

Visualization

Generated plots (saved to outputs/plots/): latency distribution, throughput performance, model comparison, and tool usage.

Quick Start

Installation

# Clone repository
git clone https://github.com/AhmedRaoofuddin/LLM_FineTuning_Agentic.git
cd LLM_FineTuning_Agentic

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-openai-api-key"

Run Locally (CPU Mode)

# Run end-to-end agent pipeline
python scripts/run_e2e_agent.py

# Start API server
python -m uvicorn src.agent.api:app --port 8080

# Test API (in another terminal)
python scripts/test_api.py

# Launch Gradio demo
python src/gradio_app.py

Tool Calling Examples

Database Search:

from src.agent.service import AgentService

agent = AgentService()
result = agent.query("Search for documents about machine learning")
# Agent automatically uses tool_db_search and returns formatted results
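
Under the hood, tool_db_search queries the SQLite database at data/app.db. A self-contained sketch of how such a ranked search might work (the schema and scoring below are assumptions for illustration, not the repo's actual implementation):

```python
import sqlite3

# In-memory stand-in for data/app.db; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO documents (title, body) VALUES (?, ?)",
    [
        ("Intro to ML", "machine learning basics and models"),
        ("Cooking 101", "recipes and kitchen tips"),
        ("Deep Learning", "neural networks for machine learning"),
    ],
)

def db_search(query: str, limit: int = 5):
    # Naive ranking: a substring match in the title scores higher than in the body.
    like = f"%{query}%"
    return conn.execute(
        """
        SELECT title,
               (title LIKE ?) * 2 + (body LIKE ?) AS score
        FROM documents
        WHERE title LIKE ? OR body LIKE ?
        ORDER BY score DESC
        LIMIT ?
        """,
        (like, like, like, like, limit),
    ).fetchall()

print(db_search("machine learning"))
```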

ROI Calculation:

result = agent.query("Calculate ROI for $10000 investment returning $15000 in 2 years")
# Agent uses tool_calculate_roi and provides detailed financial analysis
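
The example above works out as follows with the standard formulas (whether tool_calculate_roi uses exactly these definitions is an assumption):

```python
# Standard ROI math for the $10,000 -> $15,000 over 2 years example.
initial, final, years = 10_000, 15_000, 2

total_roi = (final - initial) / initial                # 0.5 -> 50%
annualized_roi = (final / initial) ** (1 / years) - 1  # compound annual growth rate
annual_inflow = final / years                          # assumes returns accrue evenly
payback_years = initial / annual_inflow                # time to recoup the principal

print(f"total ROI: {total_roi:.1%}")
print(f"annualized ROI: {annualized_roi:.2%}")
print(f"payback period: {payback_years:.2f} years")
```

That gives a 50% total ROI, roughly 22.5% annualized, and (under the even-inflow assumption) a payback period of about 1.33 years.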

News Retrieval:

result = agent.query("Fetch latest news about artificial intelligence")
# Agent uses tool_fetch_latest_news and returns recent articles

Document Summarization:

doc = "Long document text here..."
result = agent.query(f"Summarize this document: {doc}")
# Agent uses tool_summarize_document and returns concise summary
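
Extractive summarization with a compression metric can be sketched as frequency-based sentence scoring; this is a generic baseline for illustration, not the repo's implementation:

```python
import re
from collections import Counter

def summarize(text: str, max_sentences: int = 2):
    """Keep the highest-scoring sentences by word frequency and report
    the compression ratio (fraction of original characters kept)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(scored[:max_sentences])           # restore original sentence order
    summary = " ".join(sentences[i] for i in keep)
    compression = len(summary) / max(len(text), 1)
    return summary, compression

summary, ratio = summarize(
    "Transformers changed NLP. Attention lets models weigh context. "
    "Attention and transformers together enabled large language models. "
    "Pineapple pizza is divisive.", max_sentences=2
)
```

Off-topic sentences score low under the shared word-frequency table, so they drop out of the summary while the ratio quantifies how much the text shrank.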

Repository Structure

.
├── src/
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── tools.py          # Tool implementations
│   │   ├── service.py        # Agent service with OpenAI
│   │   └── api.py            # FastAPI server
│   ├── train.py              # Training script
│   ├── inference.py          # Inference script
│   ├── evaluate.py           # Evaluation script
│   ├── benchmark.py          # Benchmarking script
│   ├── gradio_app.py         # Gradio demo UI
│   └── utils.py              # Utility functions
├── scripts/
│   ├── run_e2e_agent.py      # End-to-end agent pipeline
│   ├── test_api.py           # API testing script
│   ├── generate_graphs.py    # Visualization generation
│   ├── deploy_gcp.sh         # GCP deployment script
│   └── gcp_deploy_guide.py   # Deployment guide
├── configs/
│   └── train_qlora.yaml      # Training configuration
├── data/
│   ├── build_dataset.py      # Dataset preprocessing
│   └── app.db                # SQLite database
├── outputs/
│   ├── plots/                # Generated graphs
│   ├── report.json           # Performance metrics
│   └── final_model/          # Trained model
├── Dockerfile                 # Container image
├── cloudbuild.yaml           # Cloud Build config
├── cloud-run-gpu.yaml        # GPU deployment config
└── requirements.txt          # Python dependencies

Google Cloud Platform Deployment

Prerequisites

  1. Google Cloud Account: Sign up at cloud.google.com
  2. Google Cloud SDK: Install from cloud.google.com/sdk
  3. Billing Account: Link billing to your project
  4. APIs Enabled: Cloud Build, Cloud Run, Container Registry

Step 1: Create GCP Project

# Create project
gcloud projects create agent-service-project --name="Agent Service"

# Set as active project
gcloud config set project agent-service-project

# Enable required APIs
gcloud services enable cloudbuild.googleapis.com
gcloud services enable run.googleapis.com
gcloud services enable containerregistry.googleapis.com

Step 2: Configure Environment

# Set project ID
export GOOGLE_CLOUD_PROJECT="agent-service-project"

# Set OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"

Step 3: Deploy to Cloud Run (CPU)

Option A: Using Deployment Script (Recommended)

chmod +x scripts/deploy_gcp.sh
./scripts/deploy_gcp.sh cpu

Option B: Manual Deployment

# Build and push Docker image
gcloud builds submit --config cloudbuild.yaml

# Deploy to Cloud Run
gcloud run deploy agent-service \
    --image gcr.io/$GOOGLE_CLOUD_PROJECT/agent-service:latest \
    --region us-central1 \
    --platform managed \
    --allow-unauthenticated \
    --memory 2Gi \
    --cpu 2 \
    --set-env-vars OPENAI_API_KEY=$OPENAI_API_KEY \
    --port 8080

# Get service URL
gcloud run services describe agent-service \
    --region us-central1 \
    --format 'value(status.url)'

Step 4: Deploy to Cloud Run (GPU)

Note: GPU requires billing and quota approval. Request quota at console.cloud.google.com/iam-admin/quotas

Recommended Regions for GPU:

  • us-central1 (Iowa)
  • europe-west4 (Netherlands)

# Deploy with GPU
gcloud run deploy agent-service-gpu \
    --image gcr.io/$GOOGLE_CLOUD_PROJECT/agent-service:latest \
    --region us-central1 \
    --platform managed \
    --allow-unauthenticated \
    --memory 8Gi \
    --cpu 4 \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --set-env-vars OPENAI_API_KEY=$OPENAI_API_KEY \
    --port 8080

Step 5: Test Deployment

# Get service URL
SERVICE_URL=$(gcloud run services describe agent-service \
    --region us-central1 \
    --format 'value(status.url)')

# Test health endpoint
curl $SERVICE_URL/health

# Test chat endpoint
curl -X POST $SERVICE_URL/chat \
    -H "Content-Type: application/json" \
    -d '{
        "message": "Search for documents about Python",
        "model": "gpt-4o-mini"
    }'

Step 6: Monitor and Manage

# View logs
gcloud run services logs read agent-service --region us-central1

# Update service
gcloud run services update agent-service \
    --region us-central1 \
    --memory 4Gi

# Delete service
gcloud run services delete agent-service --region us-central1

Optional: Vertex AI GPU Training

For GPU-accelerated training on Vertex AI:

# Upload dataset to GCS
gsutil cp data/processed_dataset.jsonl gs://your-bucket/datasets/

# Submit training job
gcloud ai custom-jobs create \
    --region=us-central1 \
    --display-name="llm-finetuning-job" \
    --config=vertex-ai-config.yaml

Development

Running Tests

# Comprehensive test suite
python scripts/test_e2e_comprehensive.py

# Agent pipeline test
python scripts/run_e2e_agent.py

# API server test
python scripts/test_api.py

Generating Visualizations

# Generate all graphs
python scripts/generate_graphs.py

# Graphs saved to outputs/plots/
# Metrics saved to outputs/report.json

Configuration

Training Configuration

Edit configs/train_qlora.yaml:

model_name: "NousResearch/Llama-2-7b-chat-hf"
dataset_path: "data/processed_dataset.jsonl"
output_dir: "./outputs"
use_qlora: true
lora_r: 64
lora_alpha: 16
batch_size: 4
epochs: 1
lr: 2e-4
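
The CLI-override behavior mentioned above (YAML values, overridable by flags) can be sketched with argparse; the flag names mirror the YAML keys, but how src/train.py actually merges them is an assumption:

```python
import argparse

# Defaults standing in for the parsed configs/train_qlora.yaml.
yaml_config = {"lora_r": 64, "lora_alpha": 16, "batch_size": 4, "epochs": 1, "lr": 2e-4}

def merge_cli_overrides(config: dict, argv: list[str]) -> dict:
    """Return config with any explicitly passed CLI flag taking precedence."""
    parser = argparse.ArgumentParser()
    for key, value in config.items():
        # Default of None lets us tell "flag not given" apart from "flag set".
        parser.add_argument(f"--{key}", type=type(value), default=None)
    args = parser.parse_args(argv)
    return {k: (getattr(args, k) if getattr(args, k) is not None else v)
            for k, v in config.items()}

merged = merge_cli_overrides(yaml_config, ["--batch_size", "8", "--lr", "1e-4"])
```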

Environment Variables

Create .env file (not committed):

OPENAI_API_KEY=your-openai-api-key
NEWS_API_KEY=your-news-api-key  # Optional
HUGGINGFACE_TOKEN=your-hf-token  # Optional

Performance Characteristics

  • CPU Inference: ~50ms latency, ~13 tokens/sec
  • GPU Inference: ~20ms latency, ~50 tokens/sec (T4 GPU)
  • Memory Usage: 2GB (CPU), 8GB (GPU)
  • Model Size: ~14GB (7B model), ~4GB (with QLoRA)
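
The model-size figures follow directly from parameter count times bytes per weight; a quick sanity check (treating "7B" as exactly 7e9 parameters, so the results are approximate):

```python
# Approximate model footprint: parameters * bytes-per-weight.
params = 7e9                # "7B" model, treated as exactly 7e9 (approximation)

fp16_gb = params * 2 / 1e9  # 16-bit weights: 2 bytes each  -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 4-bit QLoRA quantization: 0.5 bytes each -> ~3.5 GB

print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit: ~{int4_gb:.1f} GB")
# → fp16: ~14 GB, 4-bit: ~3.5 GB
```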

License

MIT License - See LICENSE file for details


Built by Ahmed Raoofuddin | AI Engineer | GitHub Profile
