Author: Ahmed Raoofuddin
GitHub: https://github.com/AhmedRaoofuddin
Role: AI Engineer / Full-Stack ML Engineer
A production-grade agentic system combining LLM fine-tuning with OpenAI function calling, enabling intelligent tool usage for database search, news retrieval, ROI calculation, and document summarization. Built for deployment on Google Cloud Platform with support for both CPU and GPU inference.
Modern AI applications require more than language generation: they must interact with external systems, retrieve information, perform calculations, and summarize content. This project addresses the challenge of building a production-ready agentic system that combines:
- Efficient LLM Fine-Tuning: Parameter-efficient techniques (LoRA/QLoRA) for adapting large language models to specific tasks
- Tool Calling Capabilities: Integration with OpenAI function calling for structured tool execution
- Production Deployment: Cloud-native architecture ready for Google Cloud Platform
```mermaid
graph TB
    A[Client Request] --> B[FastAPI Server]
    B --> C[Agent Service]
    C --> D[OpenAI API]
    D --> E[Function Calling]
    E --> F[Tool Execution]
    F --> G1[DB Search Tool]
    F --> G2[News Fetch Tool]
    F --> G3[ROI Calculator Tool]
    F --> G4[Document Summarizer Tool]
    G1 --> H[SQLite Database]
    G2 --> I[News API / Mock]
    G3 --> J[Financial Calculations]
    G4 --> K[Text Processing]
    G1 --> L[Response Aggregation]
    G2 --> L
    G3 --> L
    G4 --> L
    L --> M[Formatted Response]
    M --> A
```
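The function-calling round trip in the diagram can be sketched as a minimal dispatch loop. This is an illustrative sketch only: the tool schema, `REGISTRY`, and `run_tool` names are assumptions (the project's actual implementations live in `src/agent/tools.py` and `service.py`), and the model's tool call is simulated so the example runs offline.

```python
import json

# Hypothetical tool schema in OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "tool_calculate_roi",
        "description": "Compute ROI metrics for an investment.",
        "parameters": {
            "type": "object",
            "properties": {
                "initial": {"type": "number"},
                "final": {"type": "number"},
            },
            "required": ["initial", "final"],
        },
    },
}]

def tool_calculate_roi(initial: float, final: float) -> dict:
    """Local tool body: simple ROI = gain / cost."""
    return {"roi": (final - initial) / initial}

REGISTRY = {"tool_calculate_roi": tool_calculate_roi}

def run_tool(name: str, arguments_json: str) -> str:
    """Execute a model-requested tool call and serialize the result back."""
    args = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**args))

# Simulated model output; in production this arrives from the OpenAI API
# as response.choices[0].message.tool_calls.
call = {"name": "tool_calculate_roi",
        "arguments": json.dumps({"initial": 10000, "final": 15000})}
print(run_tool(call["name"], call["arguments"]))  # {"roi": 0.5}
```

In the real service the serialized result is appended to the conversation as a `tool` message so the model can compose the final response.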
- Tool-Calling Agent: OpenAI function calling integration for structured tool execution
- Database Search: SQLite-based document search with ranking and filtering
- News Retrieval: Real-time news fetching with fallback to mock data
- ROI Calculator: Financial metrics calculation including annualized ROI and payback period
- Document Summarizer: Extractive summarization with compression metrics
- QLoRA Support: 4-bit quantized training for memory-efficient fine-tuning
- LoRA Configuration: Customizable rank, alpha, and dropout parameters
- YAML Configuration: Manage training hyperparameters via config files
- CLI Training: Flexible command-line interface with argument overrides
- Model Evaluation: Compare base model vs fine-tuned model with accuracy metrics
- Performance Benchmarking: Measure tokens/sec, latency, and GPU memory usage
- Visual Analytics: Automated graph generation for performance visualization
- Cloud Run Ready: Docker containerization for Google Cloud Run
- GPU Support: Optional GPU deployment for faster inference
- API Server: FastAPI-based REST API for production use
- Gradio Demo: Interactive web interface for testing
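As a rough illustration of the extractive summarization named above, frequency-based sentence scoring is one common technique. The sketch below is an assumption about the approach, not the project's actual `tool_summarize_document` implementation, and all names in it are hypothetical.

```python
import re
from collections import Counter

def summarize(text: str, max_sentences: int = 2) -> tuple[str, float]:
    """Frequency-based extractive summary plus a compression metric."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score each sentence by the summed corpus frequency of its words.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore original order
    summary = " ".join(sentences[i] for i in keep)
    compression = 1 - len(summary) / len(text)  # fraction of characters removed
    return summary, compression

summary, ratio = summarize(
    "Cats sleep a lot. Cats also purr loudly. Dogs bark at night.",
    max_sentences=1,
)
# summary == "Cats sleep a lot."
```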
Based on evaluation runs, the fine-tuned model shows significant improvements:
| Metric | Base Model | Fine-Tuned Model | Improvement |
|---|---|---|---|
| Accuracy | 65.0% | 82.0% | +17.0 pts |
| Avg Latency | - | 49.3 ms | - |
| Throughput | - | 13.0 tokens/sec | - |
```bash
# Clone repository
git clone https://github.com/AhmedRaoofuddin/LLM_FineTuning_Agentic.git
cd LLM_FineTuning_Agentic

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-openai-api-key"
```

```bash
# Run end-to-end agent pipeline
python scripts/run_e2e_agent.py

# Start API server
python -m uvicorn src.agent.api:app --port 8080

# Test API (in another terminal)
python scripts/test_api.py

# Launch Gradio demo
python src/gradio_app.py
```

**Database Search:**
```python
from src.agent.service import AgentService

agent = AgentService()
result = agent.query("Search for documents about machine learning")
# Agent automatically uses tool_db_search and returns formatted results
```

**ROI Calculation:**
```python
result = agent.query("Calculate ROI for $10000 investment returning $15000 in 2 years")
# Agent uses tool_calculate_roi and provides detailed financial analysis
```

**News Retrieval:**
```python
result = agent.query("Fetch latest news about artificial intelligence")
# Agent uses tool_fetch_latest_news and returns recent articles
```

**Document Summarization:**
```python
doc = "Long document text here..."
result = agent.query(f"Summarize this document: {doc}")
# Agent uses tool_summarize_document and returns a concise summary
```
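The finance behind the ROI example is standard. The sketch below shows the formulas such a tool plausibly computes; the function name and output fields are hypothetical, and the payback-period convention used (initial outlay divided by average annual gain) is one of several in common use.

```python
def roi_metrics(initial: float, final: float, years: float) -> dict:
    """Standard ROI formulas: simple ROI, annualized ROI, payback period."""
    gain = final - initial
    simple_roi = gain / initial                        # 0.5 for $10k -> $15k
    annualized = (final / initial) ** (1 / years) - 1  # ~22.5%/yr over 2 years
    # Payback period: years to recoup the outlay at the average annual gain.
    payback_years = initial / (gain / years)
    return {"simple_roi": simple_roi,
            "annualized_roi": annualized,
            "payback_years": payback_years}

print(roi_metrics(10_000, 15_000, 2))
```

For the query above this yields a 50% simple ROI, roughly 22.5% annualized, and a 4-year payback under the stated convention.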
```
.
├── src/
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── tools.py             # Tool implementations
│   │   ├── service.py           # Agent service with OpenAI
│   │   └── api.py               # FastAPI server
│   ├── train.py                 # Training script
│   ├── inference.py             # Inference script
│   ├── evaluate.py              # Evaluation script
│   ├── benchmark.py             # Benchmarking script
│   ├── gradio_app.py            # Gradio demo UI
│   └── utils.py                 # Utility functions
├── scripts/
│   ├── run_e2e_agent.py         # End-to-end agent pipeline
│   ├── test_api.py              # API testing script
│   ├── generate_graphs.py       # Visualization generation
│   ├── deploy_gcp.sh            # GCP deployment script
│   └── gcp_deploy_guide.py      # Deployment guide
├── configs/
│   └── train_qlora.yaml         # Training configuration
├── data/
│   ├── build_dataset.py         # Dataset preprocessing
│   └── app.db                   # SQLite database
├── outputs/
│   ├── plots/                   # Generated graphs
│   ├── report.json              # Performance metrics
│   └── final_model/             # Trained model
├── Dockerfile                   # Container image
├── cloudbuild.yaml              # Cloud Build config
├── cloud-run-gpu.yaml           # GPU deployment config
└── requirements.txt             # Python dependencies
```
- Google Cloud Account: Sign up at cloud.google.com
- Google Cloud SDK: Install from cloud.google.com/sdk
- Billing Account: Link billing to your project
- APIs Enabled: Cloud Build, Cloud Run, Container Registry
```bash
# Create project
gcloud projects create agent-service-project --name="Agent Service"

# Set as active project
gcloud config set project agent-service-project

# Enable required APIs
gcloud services enable cloudbuild.googleapis.com
gcloud services enable run.googleapis.com
gcloud services enable containerregistry.googleapis.com
```

```bash
# Set project ID
export GOOGLE_CLOUD_PROJECT="agent-service-project"

# Set OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"
```

**Option A: Using Deployment Script (Recommended)**

```bash
chmod +x scripts/deploy_gcp.sh
./scripts/deploy_gcp.sh cpu
```

**Option B: Manual Deployment**

```bash
# Build and push Docker image
gcloud builds submit --config cloudbuild.yaml

# Deploy to Cloud Run
gcloud run deploy agent-service \
  --image gcr.io/$GOOGLE_CLOUD_PROJECT/agent-service:latest \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --memory 2Gi \
  --cpu 2 \
  --set-env-vars OPENAI_API_KEY=$OPENAI_API_KEY \
  --port 8080

# Get service URL
gcloud run services describe agent-service \
  --region us-central1 \
  --format 'value(status.url)'
```

**Note:** GPU deployment requires billing and quota approval. Request quota at console.cloud.google.com/iam-admin/quotas

Recommended Regions for GPU:
- us-central1 (Iowa)
- europe-west4 (Netherlands)
```bash
# Deploy with GPU
gcloud run deploy agent-service-gpu \
  --image gcr.io/$GOOGLE_CLOUD_PROJECT/agent-service:latest \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --memory 8Gi \
  --cpu 4 \
  --gpu-type nvidia-t4 \
  --gpu-count 1 \
  --set-env-vars OPENAI_API_KEY=$OPENAI_API_KEY \
  --port 8080
```

```bash
# Get service URL
SERVICE_URL=$(gcloud run services describe agent-service \
  --region us-central1 \
  --format 'value(status.url)')

# Test health endpoint
curl $SERVICE_URL/health

# Test chat endpoint
curl -X POST $SERVICE_URL/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Search for documents about Python",
    "model": "gpt-4o-mini"
  }'
```

```bash
# View logs
gcloud run services logs read agent-service --region us-central1

# Update service
gcloud run services update agent-service \
  --region us-central1 \
  --memory 4Gi

# Delete service
gcloud run services delete agent-service --region us-central1
```

For GPU-accelerated training on Vertex AI:
```bash
# Upload dataset to GCS
gsutil cp data/processed_dataset.jsonl gs://your-bucket/datasets/

# Submit training job
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name="llm-finetuning-job" \
  --config=vertex-ai-config.yaml
```

```bash
# Comprehensive test suite
python scripts/test_e2e_comprehensive.py

# Agent pipeline test
python scripts/run_e2e_agent.py

# API server test
python scripts/test_api.py
```

```bash
# Generate all graphs
python scripts/generate_graphs.py

# Graphs saved to outputs/plots/
# Metrics saved to outputs/report.json
```

Edit configs/train_qlora.yaml:
```yaml
model_name: "NousResearch/Llama-2-7b-chat-hf"
dataset_path: "data/processed_dataset.jsonl"
output_dir: "./outputs"
use_qlora: true
lora_r: 64
lora_alpha: 16
batch_size: 4
epochs: 1
lr: 2e-4
```

Create a .env file (not committed):

```bash
OPENAI_API_KEY=your-openai-api-key
NEWS_API_KEY=your-news-api-key       # Optional
HUGGINGFACE_TOKEN=your-hf-token      # Optional
```

- CPU Inference: ~50ms latency, ~13 tokens/sec
- GPU Inference: ~20ms latency, ~50 tokens/sec (T4 GPU)
- Memory Usage: 2GB (CPU), 8GB (GPU)
- Model Size: ~14GB (7B model), ~4GB (with QLoRA)
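The model-size figures above follow from back-of-envelope parameter arithmetic; the numbers below are estimates, not measurements of this project, and real footprints vary with quantization overhead and activations.

```python
params = 7e9  # 7B-parameter model

# fp16 stores 2 bytes/param; 4-bit quantization stores 0.5 bytes/param.
fp16_gb = params * 2 / 1024**3    # ~13 GiB (roughly the "~14GB" decimal figure)
int4_gb = params * 0.5 / 1024**3  # ~3.3 GiB (close to "~4GB" once overhead is added)

# A LoRA adapter of rank r on a d x k weight matrix adds r*(d+k) parameters,
# a tiny fraction of the base model (here: r=64 on a 4096x4096 projection).
lora_params_per_matrix = 64 * (4096 + 4096)

print(round(fp16_gb, 1), round(int4_gb, 1), lora_params_per_matrix)  # 13.0 3.3 524288
```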
- Original Repository: Based on fine-tuning techniques from the HuggingFace community
- HuggingFace Transformers: github.com/huggingface/transformers
- PEFT Library: github.com/huggingface/peft
- OpenAI API: platform.openai.com
MIT License - See LICENSE file for details
Built by Ahmed Raoofuddin | AI Engineer | GitHub Profile