In this tutorial, we will explore AI Agents and Small Language Models (SLMs) and their advanced implementation strategies for edge computing environments. We will cover the fundamental concepts of agentic AI, SLM optimization techniques, practical deployment strategies for resource-constrained devices, and Microsoft Agent Framework for building production-ready agent systems.
The AI landscape is shifting in 2025. While 2023 was the year of chatbots and 2024 saw a boom in copilots, 2025 belongs to AI agents: intelligent systems that reason, plan, use tools, and execute tasks with minimal human input, increasingly powered by efficient Small Language Models. Microsoft Agent Framework is positioned as a solution for building these intelligent systems with offline edge-based capabilities.
By the end of this tutorial, you will be able to:
- 🤖 Understand the fundamental concepts of AI agents and agentic systems
- 🔬 Identify the advantages of Small Language Models over Large Language Models in agentic applications
- 🚀 Learn advanced SLM deployment strategies for edge computing environments
- 📱 Implement practical SLM-powered agents for real-world applications
- 🏗️ Build production-ready agents using Microsoft Agent Framework
- 🌐 Deploy offline edge-based agents with local LLM and SLM integration
- 🔧 Integrate Microsoft Agent Framework with Foundry Local for edge deployment
An artificial intelligence (AI) agent refers to a system or program that is capable of autonomously performing tasks on behalf of a user or another system by designing its workflow and utilizing available tools. Unlike traditional AI that just responds to your questions, an agent can act independently to achieve goals.
Understanding the main agent categories helps you select an appropriate agent type for each computing scenario:
- 🔬 Simple Reflex Agents: Rule-based systems that respond to immediate perceptions (thermostats, basic automation)
- 📱 Model-Based Agents: Systems that maintain internal state and memory (robot vacuums, navigation systems)
- ⚖️ Goal-Based Agents: Systems that plan and execute sequences to achieve objectives (route planners, task schedulers)
- 🧠 Learning Agents: Adaptive systems that improve performance over time (recommendation systems, personalized assistants)
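The difference between the first two categories above can be sketched in a few lines. This is a toy illustration, not a production design: the function `reflex_thermostat`, the class `ModelBasedVacuum`, and their action names are hypothetical and chosen only to mirror the thermostat and robot-vacuum examples in the list.

```python
from dataclasses import dataclass, field


def reflex_thermostat(temperature_c: float, target_c: float = 21.0) -> str:
    """Simple reflex agent: maps the current reading straight to an action."""
    if temperature_c < target_c - 0.5:
        return "heat_on"
    if temperature_c > target_c + 0.5:
        return "heat_off"
    return "hold"


@dataclass
class ModelBasedVacuum:
    """Model-based agent: keeps internal state about cells it has visited."""
    visited: set = field(default_factory=set)

    def act(self, position: tuple, is_dirty: bool) -> str:
        seen_before = position in self.visited
        self.visited.add(position)  # update the internal world model
        if is_dirty:
            return "clean"
        # The same percept (a clean cell) leads to different actions
        # depending on what the agent remembers.
        return "skip" if seen_before else "scan"
```

The reflex agent always returns the same action for the same reading, while the model-based agent's answer for a clean cell depends on whether it has been there before.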
AI agents offer several fundamental advantages that make them ideal for edge computing applications:
Operational Autonomy: Agents provide independent task execution without constant human oversight, making them ideal for real-time applications. They require minimal supervision while maintaining adaptive behavior, enabling deployment on resource-constrained devices with reduced operational overhead.
Deployment Flexibility: These systems enable on-device AI capabilities without internet connectivity requirements, enhance privacy and security through local processing, can be customized for domain-specific applications, and are suitable for various edge computing environments.
Cost Effectiveness: Agent systems offer cost-effective deployment compared to cloud-based solutions, with reduced operational costs and lower bandwidth requirements for edge applications.
A Small Language Model (SLM) is a language model that can fit onto a common consumer electronic device and perform inference with latency sufficiently low to be practical when serving the agentic requests of one user. In practical terms, SLMs are typically models with fewer than 10 billion parameters.
Format Discovery Features: SLMs offer advanced support for various quantization levels, cross-platform compatibility, real-time performance optimization, and edge deployment capabilities. Users can access enhanced privacy through local processing and WebGPU support for browser-based deployment.
Quantization Level Collections: Popular SLM formats include Q4_K_M for balanced compression in mobile applications, Q5_K_S series for quality-focused edge deployment, Q8_0 for near-original precision on powerful edge devices, and experimental formats like Q2_K for ultra-low resource scenarios.
GGUF serves as the primary format for deploying quantized SLMs on CPU and edge devices, specifically optimized for agentic applications:
Agent-Optimized Features: The format provides comprehensive resources for SLM conversion and deployment with enhanced support for tool calling, structured output generation, and multi-turn conversations. Cross-platform compatibility ensures consistent agent behavior across different edge devices.
Performance Optimization: GGUF enables efficient memory usage for agent workflows, supports dynamic model loading for multi-agent systems, and provides optimized inference for real-time agent interactions.
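As a small illustration of the format, the sketch below reads the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF versions 2 and 3 (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts); the helper name `read_gguf_header` is hypothetical, and the layout should be checked against the GGUF specification before relying on it.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<IQQ" = little-endian uint32, uint64, uint64 with no padding
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

In practice the bytes would come from something like `open(path, "rb").read(24)`; checking the header before loading is a cheap way to reject a wrong or truncated download on an edge device.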
Llama.cpp provides cutting-edge quantization techniques specifically optimized for agentic SLM deployment:
Agent-Specific Quantization: The framework supports Q4_0 (suited to mobile agent deployment, with weights roughly 70-75% smaller than FP16), Q5_1 (balanced quality-compression for edge inference agents), and Q8_0 (near-original quality for production agent systems). More aggressive formats enable ultra-compressed agents for extreme edge scenarios, at a larger cost in output quality.
Implementation Benefits: CPU-optimized inference with SIMD acceleration provides memory-efficient agent execution. Cross-platform compatibility across x86, ARM, and Apple Silicon architectures enables universal agent deployment capabilities.
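The size trade-off between these quantization levels can be estimated with simple arithmetic. The sketch below uses approximate storage costs in bits per weight for the llama.cpp block formats (each block of 32 weights carries its own scale, which is why Q4_0 costs about 4.5 bits rather than 4). It counts weights only and ignores file metadata, any tensors kept at higher precision, and the KV cache, so treat the results as rough lower bounds. The function names are hypothetical.

```python
# Approximate bits per weight, including per-block scale/offset overhead.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_1": 6.0,
    "Q4_0": 4.5,
}


def estimated_size_gb(n_params: float, fmt: str) -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def reduction_vs_f16(fmt: str) -> float:
    """Fraction of storage saved relative to FP16 weights."""
    return 1.0 - BITS_PER_WEIGHT[fmt] / BITS_PER_WEIGHT["F16"]


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        size = estimated_size_gb(3.8e9, fmt)
        print(f"{fmt}: {size:.2f} GB, {reduction_vs_f16(fmt):.0%} smaller than F16")
```

For a 3.8-billion-parameter model this puts Q4_0 at a little over 2 GB of weights, which is the kind of budget that decides whether a model fits on a phone or a small edge box.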
Apple MLX provides native optimization specifically designed for SLM-powered agents on Apple Silicon devices:
Apple Silicon Agent Optimization: The framework utilizes unified memory architecture with Metal Performance Shaders integration, automatic mixed precision for agent inference, and optimized memory bandwidth for multi-agent systems. SLM agents show exceptional performance on M-series chips.
Development Features: Python and Swift API support with agent-specific optimizations, automatic differentiation for agent learning, and seamless integration with Apple development tools provide comprehensive agent development environments.
ONNX Runtime provides a universal inference engine that enables SLM agents to run consistently across diverse hardware platforms and operating systems:
Universal Deployment: ONNX Runtime ensures consistent SLM agent behavior across Windows, Linux, macOS, iOS, and Android platforms. This cross-platform compatibility enables developers to write once and deploy everywhere, significantly reducing development and maintenance overhead for multi-platform applications.
Hardware Acceleration Options: The framework provides optimized execution providers for various hardware configurations including CPU (Intel, AMD, ARM), GPU (NVIDIA CUDA, AMD ROCm), and specialized accelerators (Intel VPU, Qualcomm NPU). Execution providers are passed to a session as a prioritized list, so an SLM agent can use the best available hardware and fall back to CPU with little or no change to application code.
Production-Ready Features: ONNX Runtime offers enterprise-grade features essential for production agent deployment including graph optimization for faster inference, memory management for resource-constrained environments, and comprehensive profiling tools for performance analysis. The framework supports both Python and C++ APIs for flexible integration.
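Provider selection of this kind can be sketched without loading a model. The helper below (a hypothetical name, `select_providers`) filters a preference list against whatever providers the current machine reports. In a real application the `available` argument would typically come from `onnxruntime.get_available_providers()`, and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`; the preference order shown is an assumption to adapt per deployment.

```python
# Preference order is an assumption: accelerators first, CPU last.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",
    "QNNExecutionProvider",
    "CPUExecutionProvider",
]


def select_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return the preferred providers that exist on this machine, in order."""
    chosen = [name for name in preferred if name in available]
    # Keep CPU as a final fallback whenever the runtime offers it.
    if "CPUExecutionProvider" in available and "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    if not chosen:
        raise RuntimeError("no usable execution provider found")
    return chosen
```

Keeping the selection in one function makes the hardware policy explicit and easy to test on machines that have no accelerator at all.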
Operational Efficiency: SLMs provide 10-30× cost reduction compared to LLMs for agent tasks, enabling real-time agentic responses at scale. They offer faster inference times due to reduced computational complexity, making them ideal for interactive agent applications.
Edge Deployment Capabilities: SLMs enable on-device agent execution without internet dependency, enhanced privacy through local agent processing, and customization for domain-specific agent applications suitable for various edge computing environments.
Agent-Specific Optimization: SLMs excel at tool calling, structured output generation, and routine decision-making workflows that comprise 70-80% of typical agent tasks.
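Tool calling with a small model usually depends on validating the model's structured output before acting on it. The sketch below checks a JSON tool call against a tiny hand-written schema. The `TOOLS` table, the `name`/`arguments` layout, and the function `parse_tool_call` are all assumptions made for illustration; real agent frameworks define their own formats.

```python
import json

# Hypothetical tool registry: argument names mapped to expected Python types.
TOOLS = {
    "create_ticket": {
        "required": {"issue": str},
        "optional": {"priority": str},
    },
}


def parse_tool_call(raw: str) -> dict:
    """Parse and validate a model-produced tool call, raising ValueError if bad."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    spec = TOOLS[name]
    allowed = {**spec["required"], **spec["optional"]}
    for key in spec["required"]:
        if key not in args:
            raise ValueError(f"missing argument: {key}")
    for key, value in args.items():
        if key not in allowed:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, allowed[key]):
            raise ValueError(f"argument {key} has the wrong type")
    return {"name": name, "arguments": args}
```

A rejected call can be fed back to the model as an error message for one retry, which is often enough for a small model to correct a malformed argument.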
Perfect for SLMs:
- Repetitive agent tasks: Data entry, form filling, routine API calls
- Tool integration: Database queries, file operations, system interactions
- Structured workflows: Following predefined agent processes
- Domain-specific agents: Customer service, scheduling, basic analysis
- Local processing: Privacy-sensitive agent operations
Better for LLMs:
- Complex reasoning: Novel problem-solving, strategic planning
- Open-ended conversations: General chat, creative discussions
- Broad knowledge tasks: Research requiring vast general knowledge
- Novel situations: Handling completely new agent scenarios
The optimal approach combines SLMs and LLMs in heterogeneous agentic systems:
Smart Agent Orchestration:
- SLM as primary: Handle 70-80% of routine agent tasks locally
- LLM when needed: Route complex queries to cloud-based larger models
- Specialized SLMs: Different small models for different agent domains
- Cost optimization: Minimize expensive LLM calls through intelligent routing
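The orchestration pattern above can be sketched as an SLM-first router with escalation. Everything here is illustrative: the keyword heuristic in `is_routine` stands in for a real classifier (which could itself be a small model), and `slm` and `llm` are placeholder callables for whatever local and cloud clients a deployment actually uses.

```python
from typing import Callable, Tuple

# Crude stand-in for a task classifier; substring matching will misfire
# on some inputs and is only meant to show the control flow.
ROUTINE_KEYWORDS = ("lookup", "schedule", "form", "status", "query")


def is_routine(task: str) -> bool:
    lowered = task.lower()
    return any(word in lowered for word in ROUTINE_KEYWORDS)


def route(task: str,
          slm: Callable[[str], str],
          llm: Callable[[str], str]) -> Tuple[str, str]:
    """Try the local SLM for routine work; escalate to the LLM otherwise."""
    if is_routine(task):
        try:
            return "slm", slm(task)
        except Exception:
            # Local model failed or was unavailable: fall through to the LLM.
            pass
    return "llm", llm(task)
```

Returning the chosen tier alongside the answer makes it easy to measure what share of traffic the local model actually absorbs.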
Foundry Local (https://github.com/microsoft/foundry-local) serves as Microsoft's flagship solution for deploying Small Language Models in production edge environments. It provides a complete runtime environment specifically designed for SLM-powered agents with enterprise-grade features and seamless integration capabilities.
Core Architecture and Features:
- OpenAI-Compatible API: Full compatibility with OpenAI SDK and Agent Framework integrations
- Automatic Hardware Optimization: Intelligent selection of model variants based on available hardware (CUDA GPU, Qualcomm NPU, CPU)
- Model Management: Automated downloading, caching, and lifecycle management of SLM models
- Service Discovery: Zero-configuration service detection for agent frameworks
- Resource Optimization: Intelligent memory management and power efficiency for edge deployment
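Because the API is OpenAI-compatible, a chat request is an ordinary JSON POST. The sketch below builds such a request with only the standard library. It assumes the base URL already ends in `/v1`, uses a placeholder API key, and `build_chat_request` is a hypothetical helper; the actual port is assigned by the local service and must be looked up at runtime.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(request)` once the local service is running; the same payload shape works with any client that speaks the OpenAI chat completions protocol.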
Cross-Platform Installation:
# Windows (recommended)
winget install Microsoft.FoundryLocal
# macOS
brew tap microsoft/foundrylocal
brew install foundrylocal
# Linux (manual installation)
wget https://github.com/microsoft/foundry-local/releases/latest/download/foundry-local-linux.tar.gz
tar -xzf foundry-local-linux.tar.gz
sudo mv foundry-local /usr/local/bin/

Quick Start for Agent Development:
# Start service with automatic model loading
foundry model run phi-4-mini
# Verify service status and endpoint
foundry service status
# List available models
foundry model ls
# Test API endpoint
curl http://localhost:<port>/v1/models

Foundry Local SDK Integration:
from foundry_local import FoundryLocalManager
from microsoft_agent_framework import Agent, Config
import openai
# Initialize Foundry Local with automatic service management
manager = FoundryLocalManager("phi-4-mini")
# Configure OpenAI client for local inference
client = openai.OpenAI(
base_url=manager.endpoint,
api_key=manager.api_key # Auto-generated for local usage
)
# Create agent with Foundry Local backend
agent_config = Config(
name="production-agent",
model_provider="foundry-local",
model_id=manager.get_model_info("phi-4-mini").id,
endpoint=manager.endpoint,
api_key=manager.api_key
)
agent = Agent(config=agent_config)

Automatic Model Selection and Hardware Optimization:
# Foundry Local automatically selects optimal model variant
models_by_use_case = {
"lightweight_routing": "qwen2.5-0.5b", # 500MB, ultra-fast
"general_conversation": "phi-4-mini", # 2.4GB, balanced
"complex_reasoning": "phi-4", # 7GB, high-capability
"code_assistance": "qwen2.5-coder-0.5b" # 500MB, code-optimized
}
# Foundry Local handles hardware detection and quantization
for use_case, model_alias in models_by_use_case.items():
    manager = FoundryLocalManager(model_alias)
    print(f"{use_case}: {manager.get_model_info(model_alias).variant_selected}")
# Output examples:
# lightweight_routing: qwen2.5-0.5b-instruct-q4_k_m.gguf (CPU optimized)
# general_conversation: phi-4-mini-instruct-cuda-q5_k_m.gguf (GPU accelerated)

Single-Agent Production Setup:
import asyncio
import logging
from foundry_local import FoundryLocalManager
from microsoft_agent_framework import Agent, Config, Tool

class ProductionAgentService:
    def __init__(self, model_alias="phi-4-mini", local_db=None, ticketing_system=None):
        self.model_alias = model_alias
        self.foundry = FoundryLocalManager(model_alias)
        # Backends used by the tools below; pass in your own implementations
        self.local_db = local_db
        self.ticketing_system = ticketing_system
        self.agent = self._create_agent()

    def _create_agent(self):
        config = Config(
            name="production-customer-service",
            model_provider="foundry-local",
            model_id=self.foundry.get_model_info(self.model_alias).id,
            endpoint=self.foundry.endpoint,
            api_key=self.foundry.api_key,
            max_tokens=512,
            temperature=0.1,
            timeout=30.0
        )
        agent = Agent(config=config)

        # Add production tools
        @agent.tool
        def lookup_customer(customer_id: str) -> dict:
            """Look up customer information from local database."""
            return self.local_db.get_customer(customer_id)

        @agent.tool
        def create_ticket(issue: str, priority: str = "medium") -> str:
            """Create a support ticket."""
            ticket_id = self.ticketing_system.create(issue, priority)
            return f"Created ticket {ticket_id}"

        return agent

    async def process_request(self, user_input: str) -> str:
        """Process user request with error handling and monitoring."""
        try:
            response = await self.agent.chat_async(user_input)
            self.log_interaction(user_input, response, "success")
            return response
        except Exception as e:
            self.log_interaction(user_input, str(e), "error")
            return "I'm experiencing technical difficulties. Please try again."

    def log_interaction(self, user_input: str, output, status: str) -> None:
        """Record the outcome of a request (swap in your own logging backend)."""
        logging.info("status=%s input_chars=%d output_chars=%d",
                     status, len(user_input), len(str(output)))

    def health_check(self) -> dict:
        """Check service health for monitoring."""
        return {
            "foundry_status": self.foundry.health_check(),
            "model_loaded": self.foundry.is_model_loaded(),
            "endpoint": self.foundry.endpoint,
            "memory_usage": self.foundry.get_memory_usage()
        }

# Production usage (await is only valid inside a coroutine)
async def main():
    service = ProductionAgentService("phi-4-mini")
    response = await service.process_request("I need help with my order #12345")
    print(response)

asyncio.run(main())

Multi-Agent Production Orchestration:
import asyncio
from foundry_local import FoundryLocalManager
from microsoft_agent_framework import AgentOrchestrator, Agent, Config

class MultiAgentProductionSystem:
    def __init__(self):
        self.agents = self._initialize_agents()
        self.orchestrator = AgentOrchestrator(list(self.agents.values()))

    def _initialize_agents(self):
        agents = {}

        # Lightweight routing agent
        routing_foundry = FoundryLocalManager("qwen2.5-0.5b")
        agents["router"] = Agent(Config(
            name="request-router",
            model_provider="foundry-local",
            endpoint=routing_foundry.endpoint,
            api_key=routing_foundry.api_key,
            role="Route user requests to appropriate specialized agents"
        ))

        # Customer service agent
        service_foundry = FoundryLocalManager("phi-4-mini")
        agents["customer_service"] = Agent(Config(
            name="customer-service",
            model_provider="foundry-local",
            endpoint=service_foundry.endpoint,
            api_key=service_foundry.api_key,
            role="Handle customer service inquiries and support requests"
        ))

        # Technical support agent
        tech_foundry = FoundryLocalManager("qwen2.5-coder-0.5b")
        agents["technical"] = Agent(Config(
            name="technical-support",
            model_provider="foundry-local",
            endpoint=tech_foundry.endpoint,
            api_key=tech_foundry.api_key,
            role="Provide technical assistance and troubleshooting"
        ))

        return agents

    async def process_request(self, user_input: str) -> str:
        """Route and process user requests through appropriate agents."""
        # Route request to appropriate agent
        routing_result = await self.agents["router"].chat_async(
            f"Classify this request and route to customer_service or technical: {user_input}"
        )
        # Determine target agent based on routing; anything that does not
        # mention "customer" falls through to the technical agent
        target_agent = "customer_service" if "customer" in routing_result.lower() else "technical"
        # Process with specialized agent
        response = await self.agents[target_agent].chat_async(user_input)
        return response

# Production deployment (await is only valid inside a coroutine)
async def main():
    system = MultiAgentProductionSystem()
    response = await system.process_request("My application keeps crashing")
    print(response)

asyncio.run(main())

Health Monitoring and Observability:
from foundry_local import FoundryLocalManager
import asyncio
import logging
import time

class FoundryMonitoringService:
    def __init__(self):
        self.managers = {}
        self.metrics = []

    def add_model(self, alias: str) -> FoundryLocalManager:
        """Add a model to monitoring."""
        manager = FoundryLocalManager(alias)
        self.managers[alias] = manager
        return manager

    async def collect_metrics(self):
        """Collect performance metrics from all Foundry Local instances."""
        metrics = {
            "timestamp": time.time(),
            "models": {}
        }
        for alias, manager in self.managers.items():
            try:
                model_metrics = {
                    "status": "healthy" if manager.health_check() else "unhealthy",
                    "memory_usage": manager.get_memory_usage(),
                    "inference_count": manager.get_inference_count(),
                    "average_latency": manager.get_average_latency(),
                    "error_rate": manager.get_error_rate()
                }
                metrics["models"][alias] = model_metrics
            except Exception as e:
                logging.error(f"Failed to collect metrics for {alias}: {e}")
                metrics["models"][alias] = {"status": "error", "error": str(e)}
        self.metrics.append(metrics)
        return metrics

    def get_health_status(self) -> dict:
        """Get overall system health status."""
        healthy_models = 0
        total_models = len(self.managers)
        for alias, manager in self.managers.items():
            if manager.health_check():
                healthy_models += 1
        return {
            "overall_status": "healthy" if healthy_models == total_models else "degraded",
            "healthy_models": healthy_models,
            "total_models": total_models,
            "health_percentage": (healthy_models / total_models) * 100 if total_models > 0 else 0
        }

# Production monitoring setup
monitor = FoundryMonitoringService()
monitor.add_model("phi-4-mini")
monitor.add_model("qwen2.5-0.5b")

# Continuous monitoring; start it with asyncio.run(monitoring_loop())
async def monitoring_loop():
    while True:
        metrics = await monitor.collect_metrics()
        health = monitor.get_health_status()
        if health["health_percentage"] < 100:
            logging.warning(f"System health degraded: {health}")
        await asyncio.sleep(30)  # Collect metrics every 30 seconds

Resource Management and Auto-scaling:
import logging
from typing import Optional
from foundry_local import FoundryLocalManager

class FoundryResourceManager:
    def __init__(self):
        self.model_instances = {}
        self.resource_limits = {
            "max_memory_gb": 8,
            "max_concurrent_models": 3,
            "cpu_threshold": 80
        }

    def get_total_memory_usage(self) -> float:
        """Sum memory across loaded models (assumes get_memory_usage() reports GB)."""
        return sum(m.get_memory_usage() for m in self.model_instances.values())

    def auto_scale_models(self, demand_metrics: dict):
        """Automatically scale models based on demand."""
        current_memory = self.get_total_memory_usage()
        # Scale down if memory usage is high
        if current_memory > self.resource_limits["max_memory_gb"] * 0.8:
            self.scale_down_idle_models()
        # Scale up if demand is high and resources allow
        for model_alias, demand in demand_metrics.items():
            if demand > 0.8 and len(self.model_instances) < self.resource_limits["max_concurrent_models"]:
                self.load_model_instance(model_alias)

    def load_model_instance(self, alias: str) -> Optional[FoundryLocalManager]:
        """Load a new model instance if resources allow."""
        if alias not in self.model_instances:
            try:
                manager = FoundryLocalManager(alias)
                self.model_instances[alias] = manager
                logging.info(f"Loaded model instance: {alias}")
                return manager
            except Exception as e:
                logging.error(f"Failed to load model {alias}: {e}")
                return None
        return self.model_instances[alias]

    def scale_down_idle_models(self):
        """Remove idle model instances to free resources."""
        idle_models = []
        for alias, manager in self.model_instances.items():
            if manager.get_idle_time() > 300:  # 5 minutes idle
                idle_models.append(alias)
        # Delete in a second pass so the dict is not modified while iterating
        for alias in idle_models:
            self.model_instances[alias].shutdown()
            del self.model_instances[alias]
            logging.info(f"Scaled down idle model: {alias}")

Custom Model Configuration:
# Advanced Foundry Local configuration for production
from foundry_local import FoundryLocalManager, ModelConfig
# Custom configuration for specific use cases
config = ModelConfig(
alias="phi-4-mini",
quantization="Q5_K_M", # Specific quantization level
context_length=4096, # Extended context for complex agents
batch_size=1, # Optimized for single-user agents
threads=4, # CPU thread optimization
gpu_layers=32, # GPU acceleration layers
memory_lock=True, # Lock model in memory for consistent performance
numa=True # NUMA optimization for multi-socket systems
)
manager = FoundryLocalManager(config=config)

Production Deployment Checklist:
✅ Service Configuration:
- Configure appropriate model aliases for use cases
- Set resource limits and monitoring thresholds
- Enable health checks and metrics collection
- Configure automatic restart and failover
✅ Security Setup:
- Enable local-only API access (no external exposure)
- Configure appropriate API key management
- Set up audit logging for agent interactions
- Implement rate limiting for production usage
✅ Performance Optimization:
- Test model performance under expected load
- Configure appropriate quantization levels
- Set up model caching and warming strategies
- Monitor memory and CPU usage patterns
✅ Integration Testing:
- Test agent framework integration
- Verify offline operation capabilities
- Test failover and recovery scenarios
- Validate end-to-end agent workflows
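One checklist item, rate limiting, is easy to prototype locally. The sketch below is a minimal token-bucket limiter, one common way to cap request rates; it is not tied to Foundry Local or any agent framework, and the class name `TokenBucket` and its parameters are illustrative. The injectable `clock` argument exists only to make the behavior testable.

```python
import time


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if a request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A service would call `allow()` before forwarding each request to the agent and return a "try again later" response when it is `False`, which protects a single local model from being saturated by one client.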
Ollama provides a community-driven approach to SLM agent deployment with emphasis on simplicity, extensive model ecosystem, and developer-friendly workflows. While Foundry Local focuses on enterprise-grade features, Ollama excels in rapid prototyping, community model access, and simplified deployment scenarios.
Core Architecture and Features:
- OpenAI-Compatible API: Full REST API compatibility for seamless agent framework integration
- Extensive Model Library: Access to hundreds of community-contributed and official models
- Simple Model Management: One-command model installation and switching
- Cross-Platform Support: Native support across Windows, macOS, and Linux
- Resource Optimization: Automatic quantization and hardware detection
Cross-Platform Installation:
# Windows
winget install Ollama.Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Docker deployment
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Quick Start for Agent Development:
# Start Ollama service
ollama serve
# Pull and run models for agent development
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M # Microsoft Phi-3.5 Mini
ollama pull qwen2.5:0.5b-instruct-q4_K_M # Qwen2.5 0.5B
ollama pull llama3.2:1b-instruct-q4_K_M # Llama 3.2 1B
# Test model availability
ollama list
# Test API endpoint
curl http://localhost:11434/api/generate -d '{
"model": "phi3.5:3.8b-mini-instruct-q4_K_M",
"prompt": "Hello, how can I help you today?"
}'

Ollama with Microsoft Agent Framework:
from microsoft_agent_framework import Agent, Config
import openai
import requests

class OllamaManager:
    def __init__(self, model_name: str, base_url: str = "http://localhost:11434"):
        self.model_name = model_name
        self.base_url = base_url
        self.api_url = f"{base_url}/api"
        self.openai_url = f"{base_url}/v1"

    def ensure_model_available(self) -> bool:
        """Ensure the model is pulled and available."""
        try:
            # "stream": False makes Ollama reply once, after the pull has finished
            response = requests.post(f"{self.api_url}/pull",
                                     json={"name": self.model_name, "stream": False})
            return response.status_code == 200
        except requests.RequestException as e:
            print(f"Failed to pull model {self.model_name}: {e}")
            return False

    def get_openai_client(self) -> openai.OpenAI:
        """Get OpenAI-compatible client for Ollama."""
        return openai.OpenAI(
            base_url=self.openai_url,
            api_key="ollama",  # Ollama doesn't require real API key
        )

    def health_check(self) -> bool:
        """Check if Ollama service is running."""
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False

# Initialize Ollama for agent development
ollama_manager = OllamaManager("phi3.5:3.8b-mini-instruct-q4_K_M")
ollama_manager.ensure_model_available()

# Configure agent with Ollama backend
agent_config = Config(
    name="ollama-agent",
    model_provider="ollama",
    model_id="phi3.5:3.8b-mini-instruct-q4_K_M",
    endpoint=ollama_manager.openai_url,
    api_key="ollama"
)
agent = Agent(config=agent_config)

Multi-Model Agent Setup with Ollama:
import openai
import requests
from microsoft_agent_framework import Agent, Config

class OllamaMultiModelManager:
    def __init__(self):
        self.models = {
            "lightweight": "qwen2.5:0.5b-instruct-q4_K_M",    # 350MB
            "balanced": "phi3.5:3.8b-mini-instruct-q4_K_M",   # 2.3GB
            "capable": "llama3.2:3b-instruct-q4_K_M",         # 1.9GB
            "coding": "codellama:7b-code-q4_K_M"              # 4.1GB
        }
        self.base_url = "http://localhost:11434"
        self.clients = {}
        self._initialize_models()

    def _initialize_models(self):
        """Pull all required models and create clients."""
        for category, model_name in self.models.items():
            # Pull model if not available
            self._pull_model(model_name)
            # Create OpenAI client for each model
            self.clients[category] = openai.OpenAI(
                base_url=f"{self.base_url}/v1",
                api_key="ollama"
            )

    def _pull_model(self, model_name: str):
        """Pull model if not already available."""
        try:
            response = requests.post(f"{self.base_url}/api/pull",
                                     json={"name": model_name})
            if response.status_code == 200:
                print(f"Model {model_name} ready")
        except requests.RequestException as e:
            print(f"Failed to pull {model_name}: {e}")

    def get_agent_for_task(self, task_type: str) -> Agent:
        """Get appropriate agent based on task complexity."""
        model_category = self._classify_task(task_type)
        model_name = self.models[model_category]
        config = Config(
            name=f"ollama-{model_category}-agent",
            model_provider="ollama",
            model_id=model_name,
            endpoint=f"{self.base_url}/v1",
            api_key="ollama"
        )
        return Agent(config=config)

    def _classify_task(self, task_type: str) -> str:
        """Classify task to appropriate model category."""
        task = task_type.lower()
        if any(keyword in task for keyword in ["simple", "route", "classify"]):
            return "lightweight"
        elif any(keyword in task for keyword in ["code", "programming", "debug"]):
            return "coding"
        elif any(keyword in task for keyword in ["complex", "analysis", "research"]):
            return "capable"
        else:
            return "balanced"

# Usage example
manager = OllamaMultiModelManager()

# Get appropriate agents for different tasks
routing_agent = manager.get_agent_for_task("simple routing")
coding_agent = manager.get_agent_for_task("code debugging")
analysis_agent = manager.get_agent_for_task("complex analysis")

Production Service with Ollama:
import asyncio
import logging
import time
from typing import Dict, Optional
from microsoft_agent_framework import Agent, Config
import requests
import openai

class OllamaProductionService:
    def __init__(self, models_config: Dict[str, str]):
        self.models_config = models_config
        self.base_url = "http://localhost:11434"
        self.agents = {}
        self.metrics = {
            "requests_processed": 0,
            "errors": 0,
            "model_usage": {model: 0 for model in models_config.keys()}
        }
        self._initialize_production_agents()

    def _initialize_production_agents(self):
        """Initialize production agents with health checks."""
        for agent_type, model_name in self.models_config.items():
            try:
                # Ensure model is available
                self._ensure_model_ready(model_name)
                # Create production agent
                config = Config(
                    name=f"production-{agent_type}",
                    model_provider="ollama",
                    model_id=model_name,
                    endpoint=f"{self.base_url}/v1",
                    api_key="ollama",
                    max_tokens=512,
                    temperature=0.1,
                    timeout=30.0
                )
                agent = Agent(config=config)
                # Add production tools based on agent type
                self._add_production_tools(agent, agent_type)
                self.agents[agent_type] = agent
                logging.info(f"Initialized {agent_type} agent with model {model_name}")
            except Exception as e:
                logging.error(f"Failed to initialize {agent_type} agent: {e}")

    def _ensure_model_ready(self, model_name: str):
        """Ensure model is pulled and ready for use."""
        try:
            # Check if model exists
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            models = response.json().get('models', [])
            model_exists = any(model['name'] == model_name for model in models)
            if not model_exists:
                logging.info(f"Pulling model {model_name}...")
                pull_response = requests.post(f"{self.base_url}/api/pull",
                                              json={"name": model_name})
                if pull_response.status_code != 200:
                    raise RuntimeError(f"Failed to pull model {model_name}")
        except Exception as e:
            raise RuntimeError(f"Model setup failed for {model_name}: {e}") from e

    def _add_production_tools(self, agent: Agent, agent_type: str):
        """Add tools based on agent type."""
        if agent_type == "customer_service":
            @agent.tool
            def lookup_customer(customer_id: str) -> dict:
                """Look up customer information."""
                # Simulate database lookup
                return {"customer_id": customer_id, "status": "active", "tier": "premium"}

            @agent.tool
            def create_support_ticket(issue: str, priority: str = "medium") -> str:
                """Create a support ticket."""
                ticket_id = f"TICK-{hash(issue) % 10000:04d}"
                return f"Created ticket {ticket_id} with priority {priority}"

        elif agent_type == "technical_support":
            @agent.tool
            def run_diagnostics(system_info: str) -> dict:
                """Run system diagnostics."""
                return {"status": "healthy", "issues": [], "recommendations": []}

            @agent.tool
            def access_knowledge_base(query: str) -> str:
                """Search technical knowledge base."""
                return f"Knowledge base results for: {query}"

    async def process_request(self, request: str, agent_type: str = "customer_service") -> dict:
        """Process user request with monitoring and error handling."""
        start_time = time.time()
        try:
            if agent_type not in self.agents:
                raise ValueError(f"Agent type {agent_type} not available")
            agent = self.agents[agent_type]
            response = await agent.chat_async(request)
            # Update metrics
            self.metrics["requests_processed"] += 1
            self.metrics["model_usage"][agent_type] += 1
            processing_time = time.time() - start_time
            self._log_interaction(request, response, "success", processing_time, agent_type)
            return {
                "response": response,
                "status": "success",
                "processing_time": processing_time,
                "agent_type": agent_type
            }
        except Exception as e:
            self.metrics["errors"] += 1
            processing_time = time.time() - start_time
            self._log_interaction(request, str(e), "error", processing_time, agent_type)
            return {
                "response": "I'm experiencing technical difficulties. Please try again.",
                "status": "error",
                "error": str(e),
                "processing_time": processing_time
            }

    def _log_interaction(self, request: str, response: str, status: str,
                         processing_time: float, agent_type: str):
        """Log interaction for monitoring and analysis."""
        logging.info(f"Agent: {agent_type}, Status: {status}, Time: {processing_time:.2f}s")
        # In production, this would write to a proper logging system
        log_entry = {
            "timestamp": time.time(),
            "agent_type": agent_type,
            "request_length": len(request),
            "response_length": len(str(response)),
            "status": status,
            "processing_time": processing_time
        }
        logging.debug("interaction: %s", log_entry)

    def get_health_status(self) -> dict:
        """Get service health status."""
        try:
            # Check Ollama service health
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            ollama_healthy = response.status_code == 200
            # Check model availability
            available_models = []
            if ollama_healthy:
                models = response.json().get('models', [])
                available_models = [model['name'] for model in models]
            return {
                "service_status": "healthy" if ollama_healthy else "unhealthy",
                "ollama_endpoint": self.base_url,
                "available_models": available_models,
                "active_agents": list(self.agents.keys()),
                "metrics": self.metrics,
                "timestamp": time.time()
            }
        except Exception as e:
            return {
                "service_status": "error",
                "error": str(e),
                "timestamp": time.time()
            }

# Production deployment example
production_models = {
    "customer_service": "phi3.5:3.8b-mini-instruct-q4_K_M",
    "technical_support": "llama3.2:3b-instruct-q4_K_M",
    "routing": "qwen2.5:0.5b-instruct-q4_K_M"
}

# Process requests (await is only valid inside a coroutine)
async def main():
    service = OllamaProductionService(production_models)
    result = await service.process_request(
        "I need help with my account settings",
        "customer_service"
    )
    print(result)

asyncio.run(main())

Ollama Monitoring and Observability:
import time
import asyncio
import requests
from typing import Dict, List
class OllamaMonitoringService:
def __init__(self, base_url: str = "http://localhost:11434"):
self.base_url = base_url
self.metrics_history = []
self.alert_thresholds = {
"response_time_ms": 2000,
"error_rate_percent": 5,
"memory_usage_percent": 85
}
async def collect_metrics(self) -> dict:
"""Collect comprehensive metrics from Ollama service."""
metrics = {
"timestamp": time.time(),
"service_status": "unknown",
"models": {},
"performance": {},
"resources": {}
}
try:
# Check service health
health_response = requests.get(f"{self.base_url}/api/tags", timeout=5)
metrics["service_status"] = "healthy" if health_response.status_code == 200 else "unhealthy"
if metrics["service_status"] == "healthy":
# Get model information
models_data = health_response.json().get('models', [])
for model in models_data:
model_name = model['name']
metrics["models"][model_name] = {
"size_gb": model.get('size', 0) / (1024**3),
"modified": model.get('modified_at', ''),
"digest": model.get('digest', '')[:12] # Short digest
}
# Test inference performance
start_time = time.time()
test_response = requests.post(f"{self.base_url}/api/generate",
json={
"model": list(metrics["models"].keys())[0] if metrics["models"] else "",
"prompt": "Hello",
"stream": False
}, timeout=10)
if test_response.status_code == 200:
inference_time = (time.time() - start_time) * 1000
metrics["performance"] = {
"inference_time_ms": inference_time,
"tokens_per_second": self._calculate_tokens_per_second(test_response.json()),
"last_successful_inference": time.time()
}
except Exception as e:
metrics["service_status"] = "error"
metrics["error"] = str(e)
self.metrics_history.append(metrics)
# Keep only last 100 metrics entries
if len(self.metrics_history) > 100:
self.metrics_history = self.metrics_history[-100:]
return metrics
def _calculate_tokens_per_second(self, response_data: dict) -> float:
"""Calculate tokens per second from Ollama's timing fields."""
# Prefer the token count reported by Ollama (eval_count); fall back to a
# rough whitespace-based estimate if the field is absent
token_count = response_data.get('eval_count')
if token_count is None:
token_count = len(response_data.get('response', '').split())
# eval_duration is reported in nanoseconds
eval_duration = response_data.get('eval_duration', 0)
if not eval_duration or eval_duration <= 0:
return 0.0
return token_count / (eval_duration / 1e9)
def check_alerts(self, current_metrics: dict) -> List[dict]:
"""Check current metrics against alert thresholds."""
alerts = []
# Check response time
if current_metrics.get('performance', {}).get('inference_time_ms', 0) > self.alert_thresholds['response_time_ms']:
alerts.append({
"type": "performance",
"message": f"High response time: {current_metrics['performance']['inference_time_ms']:.0f}ms",
"severity": "warning"
})
# Check service status
if current_metrics.get('service_status') != 'healthy':
alerts.append({
"type": "availability",
"message": f"Service unhealthy: {current_metrics.get('error', 'Unknown error')}",
"severity": "critical"
})
return alerts
def get_performance_summary(self, minutes: int = 60) -> dict:
"""Get performance summary for the last N minutes."""
cutoff_time = time.time() - (minutes * 60)
recent_metrics = [m for m in self.metrics_history if m['timestamp'] > cutoff_time]
if not recent_metrics:
return {"error": "No recent metrics available"}
# Calculate averages
response_times = [m.get('performance', {}).get('inference_time_ms', 0)
for m in recent_metrics if m.get('performance')]
healthy_checks = sum(1 for m in recent_metrics if m.get('service_status') == 'healthy')
uptime_percent = (healthy_checks / len(recent_metrics)) * 100 if recent_metrics else 0
return {
"period_minutes": minutes,
"total_checks": len(recent_metrics),
"uptime_percent": uptime_percent,
"avg_response_time_ms": sum(response_times) / len(response_times) if response_times else 0,
"max_response_time_ms": max(response_times) if response_times else 0,
"min_response_time_ms": min(response_times) if response_times else 0
}
# Production monitoring setup
monitor = OllamaMonitoringService()
async def monitoring_loop():
"""Continuous monitoring loop."""
last_summary_time = time.monotonic()
while True:
try:
metrics = await monitor.collect_metrics()
alerts = monitor.check_alerts(metrics)
for alert in alerts:
logging.warning(f"ALERT: {alert['message']} (Severity: {alert['severity']})")
# Log a performance summary every 10 minutes, tracked by elapsed time
if time.monotonic() - last_summary_time >= 600:
summary = monitor.get_performance_summary(10)
logging.info(f"Performance Summary: {summary}")
last_summary_time = time.monotonic()
except Exception as e:
logging.error(f"Monitoring error: {e}")
await asyncio.sleep(30) # Check every 30 seconds
# Start monitoring
# asyncio.create_task(monitoring_loop())
Custom Model Management with Ollama:
class OllamaModelManager:
def __init__(self, base_url: str = "http://localhost:11434"):
self.base_url = base_url
self.model_catalog = {
# Lightweight models for fast responses
"ultra_light": [
"qwen2.5:0.5b-instruct-q4_K_M",
"tinyllama:1.1b-chat-q4_K_M"
],
# Balanced models for general use
"balanced": [
"phi3.5:3.8b-mini-instruct-q4_K_M",
"llama3.2:3b-instruct-q4_K_M"
],
# Specialized models for specific tasks
"code_specialist": [
"codellama:7b-code-q4_K_M",
"codegemma:7b-code-q4_K_M"
],
# High capability models
"high_capability": [
"llama3.1:8b-instruct-q4_K_M",
"qwen2.5:7b-instruct-q4_K_M"
]
}
def setup_production_models(self, categories: List[str]) -> dict:
"""Set up models for production use."""
setup_results = {}
for category in categories:
if category not in self.model_catalog:
setup_results[category] = {"status": "error", "message": "Unknown category"}
continue
models = self.model_catalog[category]
category_results = []
for model in models:
try:
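# Pull model. With "stream": False, Ollama returns a single JSON object
# when the pull finishes instead of a stream of progress events.
# The timeout is generous because large models can take minutes to download.
response = requests.post(f"{self.base_url}/api/pull",
json={"name": model, "stream": False},
timeout=1800)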
if response.status_code == 200:
category_results.append({"model": model, "status": "ready"})
else:
category_results.append({"model": model, "status": "failed"})
except Exception as e:
category_results.append({"model": model, "status": "error", "error": str(e)})
setup_results[category] = category_results
return setup_results
def optimize_for_hardware(self) -> dict:
"""Recommend optimal models based on available hardware."""
# This would typically check actual hardware specs
# For demo purposes, we'll simulate hardware detection
recommendations = {
"low_resource": {
"models": ["qwen2.5:0.5b-instruct-q4_K_M"],
"max_concurrent": 1,
"memory_usage": "< 1GB"
},
"medium_resource": {
"models": ["phi3.5:3.8b-mini-instruct-q4_K_M", "llama3.2:3b-instruct-q4_K_M"],
"max_concurrent": 2,
"memory_usage": "2-4GB"
},
"high_resource": {
"models": ["llama3.1:8b-instruct-q4_K_M", "codellama:7b-code-q4_K_M"],
"max_concurrent": 3,
"memory_usage": "6-12GB"
}
}
return recommendations
# Production model setup
model_manager = OllamaModelManager()
setup_results = model_manager.setup_production_models(["balanced", "ultra_light"])
print(f"Model setup results: {setup_results}")
Production Deployment Checklist for Ollama:
✅ Service Configuration:
- Install Ollama service with proper system integration
- Configure models for specific agent use cases
- Set up proper startup scripts and service management
- Test model loading and API availability
✅ Model Management:
- Pull required models and verify integrity
- Set up model update and rotation procedures
- Configure model caching and storage optimization
- Test model performance under expected load
✅ Security Setup:
- Configure firewall rules for local-only access
- Set up API access controls and rate limiting
- Implement audit logging for agent interactions
- Configure secure model storage and access
✅ Performance Optimization:
- Benchmark models for expected use cases
- Configure appropriate hardware acceleration
- Set up model warming and caching strategies
- Monitor resource usage and performance metrics
✅ Integration Testing:
- Test Microsoft Agent Framework integration
- Verify offline operation capabilities
- Test failover scenarios and error handling
- Validate end-to-end agent workflows
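Two checklist items, verifying that required models are present and that the endpoint is reachable only locally, lend themselves to a small preflight script. The sketch below is one possible shape for it. The helper names `missing_models` and `is_loopback_endpoint` are hypothetical, and the payload is a hand-written sample in the shape Ollama's `/api/tags` endpoint returns, so no running service is needed.

```python
import ipaddress
from typing import Dict, List
from urllib.parse import urlparse


def missing_models(tags_payload: Dict, required: List[str]) -> List[str]:
    """Return required model names absent from an /api/tags response body."""
    installed = {m.get("name") for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def is_loopback_endpoint(url: str) -> bool:
    """True if the endpoint host is localhost or a loopback IP address."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other hostname, not a loopback address


# Hand-written sample payload, not real service output
payload = {"models": [{"name": "qwen2.5:0.5b-instruct-q4_K_M"}]}
required = ["qwen2.5:0.5b-instruct-q4_K_M", "llama3.2:3b-instruct-q4_K_M"]
print(missing_models(payload, required))
print(is_loopback_endpoint("http://localhost:11434"))
```

A check like this only inspects the configured URL. It does not replace firewall rules, because a service bound to all interfaces is still reachable from other hosts.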
Comparison with Foundry Local:
| Feature | Foundry Local | Ollama |
|---|---|---|
| Target Use Case | Enterprise production | Development & community |
| Model Ecosystem | Microsoft-curated | Extensive community |
| Hardware Optimization | Automatic (CUDA/NPU/CPU) | Manual configuration |
| Enterprise Features | Built-in monitoring, security | Community tools |
| Deployment Complexity | Simple (winget install) | Simple (curl install) |
| API Compatibility | OpenAI + extensions | OpenAI standard |
| Support | Microsoft official | Community-driven |
| Best For | Production agents | Prototyping, research |
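Because both runtimes expose an OpenAI-compatible API, agent code can often switch between them by changing only the base URL and the model name. The sketch below builds the request without sending it. The `Backend` class and `chat_request` helper are hypothetical names, and the Foundry Local URL is a placeholder to be replaced with the endpoint your local service reports.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str  # OpenAI-compatible root, ending in /v1


def chat_request(backend: Backend, model: str, user_message: str) -> Dict:
    """Build the URL and JSON body for a chat completion call."""
    return {
        "url": f"{backend.base_url.rstrip('/')}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
        },
    }


ollama = Backend("ollama", "http://localhost:11434/v1")
# Placeholder URL (assumption): substitute the endpoint of your Foundry Local service
foundry = Backend("foundry-local", "http://localhost:PORT/v1")

req = chat_request(ollama, "llama3.2:3b-instruct-q4_K_M", "Hello!")
print(req["url"])
```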
When to Choose Ollama:
- Development and Prototyping: Rapid experimentation with different models
- Community Models: Access to latest community-contributed models
- Educational Use: Learning and teaching AI agent development
- Research Projects: Academic research requiring diverse model access
- Custom Models: Building and testing custom fine-tuned models
VLLM is a high-throughput, memory-efficient inference and serving engine well suited to production SLM deployments at scale. While Foundry Local focuses on ease of use and Ollama emphasizes community models, VLLM targets high-performance scenarios that need maximum throughput and efficient GPU utilization.
Core Architecture and Features:
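- PagedAttention: Stores the attention key-value (KV) cache in fixed-size blocks, much like virtual-memory pages, which reduces memory fragmentation and waste
- Dynamic Batching: Continuous batching admits new requests into the running batch between decoding steps for higher throughput
- GPU Optimization: Optimized CUDA kernels and tensor parallelism support
- OpenAI Compatibility: OpenAI-compatible HTTP API server for drop-in integration with existing clients
- Speculative Decoding: A draft model or heuristic proposes tokens that the main model verifies, which accelerates generation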
- Quantization Support: INT4, INT8, and FP16 quantization for memory efficiency
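To make the paging idea behind PagedAttention concrete, here is a toy block allocator. It is a simplified illustration of block-table bookkeeping only, not VLLM's implementation: the class name `PagedKVCache`, the block size of 16, and the method names are assumptions chosen for this sketch.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block table: maps each sequence to fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        """Grow a sequence, allocating new blocks only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n_tokens
        needed = -(-new_len // self.block_size)  # ceiling division
        extra = needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("no free KV cache blocks")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
cache.append_tokens("req-a", 40)  # 40 tokens need 3 blocks of 16
cache.append_tokens("req-b", 10)  # 10 tokens need 1 block
print(len(cache.block_tables["req-a"]), len(cache.free_blocks))
cache.free("req-a")
print(len(cache.free_blocks))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, whereas reserving memory for the maximum sequence length up front can leave most of the reservation unused.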
Installation Options:
# Standard installation
pip install vllm
# With the OpenAI client library used by the agent examples
pip install vllm openai
# Docker deployment for production
docker pull vllm/vllm-openai:latest
# From source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Quick Start for Agent Development:
# Start VLLM server with SLM model
python -m vllm.entrypoints.openai.api_server \
--model microsoft/Phi-3.5-mini-instruct \
--trust-remote-code \
--max-model-len 4096 \
--gpu-memory-utilization 0.8
# Alternative: Start with Qwen2.5 for lightweight agents
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-0.5B-Instruct \
--trust-remote-code \
--max-model-len 2048 \
--tensor-parallel-size 1
# Test API endpoint
curl http://localhost:8000/v1/models
# Test chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3.5-mini-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
VLLM with Microsoft Agent Framework:
from microsoft_agent_framework import Agent, Config
import openai
import subprocess
import time
import requests
from typing import Optional, Dict, Any
class VLLMManager:
def __init__(self, model_name: str,
host: str = "localhost",
port: int = 8000,
gpu_memory_utilization: float = 0.8,
max_model_len: int = 4096):
self.model_name = model_name
self.host = host
self.port = port
self.base_url = f"http://{host}:{port}"
self.gpu_memory_utilization = gpu_memory_utilization
self.max_model_len = max_model_len
self.process = None
self.client = None
def start_server(self) -> bool:
"""Start VLLM server with optimized settings for agents."""
try:
cmd = [
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", self.model_name,
"--host", self.host,
"--port", str(self.port),
"--gpu-memory-utilization", str(self.gpu_memory_utilization),
"--max-model-len", str(self.max_model_len),
"--trust-remote-code",
"--disable-log-requests", # Reduce logging for agents
"--served-model-name", self.get_served_model_name()
]
# Discard server output: an unread PIPE can fill up and block the server
self.process = subprocess.Popen(cmd,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL)
# Wait for server to start
max_retries = 30
for _ in range(max_retries):
if self.health_check():
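# The OpenAI client requires an api_key value even though VLLM ignores it by default
self.client = openai.OpenAI(base_url=f"{self.base_url}/v1", api_key="none")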
return True
time.sleep(2)
return False
except Exception as e:
print(f"Failed to start VLLM server: {e}")
return False
def get_served_model_name(self) -> str:
"""Get a clean model name for serving."""
return self.model_name.replace("/", "--")
def health_check(self) -> bool:
"""Check if VLLM server is healthy."""
try:
response = requests.get(f"{self.base_url}/health", timeout=5)
return response.status_code == 200
except requests.RequestException:
return False
def get_openai_client(self) -> openai.OpenAI:
"""Get OpenAI-compatible client for VLLM."""
if not self.client:
self.client = openai.OpenAI(base_url=f"{self.base_url}/v1", api_key="none")
return self.client
def get_model_info(self) -> Dict[str, Any]:
"""Get model information and statistics."""
try:
response = requests.get(f"{self.base_url}/v1/models")
if response.status_code == 200:
return response.json()
except (requests.RequestException, ValueError):
pass
return {}
def shutdown(self):
"""Shutdown VLLM server."""
if self.process:
self.process.terminate()
self.process.wait()
# Initialize VLLM for high-performance agents
vllm_manager = VLLMManager("microsoft/Phi-3.5-mini-instruct")
if vllm_manager.start_server():
print("VLLM server started successfully")
# Configure agent with VLLM backend
agent_config = Config(
name="vllm-performance-agent",
model_provider="vllm",
model_id=vllm_manager.get_served_model_name(),
endpoint=f"{vllm_manager.base_url}/v1",
api_key="none" # VLLM doesn't require API key
)
agent = Agent(config=agent_config)
else:
print("Failed to start VLLM server")
High-Throughput Multi-Agent Setup:
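Each server configured in this section claims a fraction of GPU memory through `gpu_memory_utilization`, so servers that share a GPU need fractions that fit together. A minimal pre-launch check might look like the following. The function name `check_gpu_budgets` and both GPU placements are hypothetical; the fractions are the ones used in this section.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def check_gpu_budgets(
    servers: Dict[str, Tuple[int, float]], limit: float = 1.0
) -> List[str]:
    """Return a warning for every GPU whose requested fractions exceed `limit`.

    `servers` maps a server name to (gpu_index, gpu_memory_utilization).
    """
    totals: Dict[int, float] = defaultdict(float)
    for gpu_index, fraction in servers.values():
        totals[gpu_index] += fraction
    return [
        f"GPU {gpu}: requested {total:.2f} of memory, limit {limit:.2f}"
        for gpu, total in sorted(totals.items())
        if total > limit
    ]


# Hypothetical placement: all three servers on GPU 0
shared = {"lightweight": (0, 0.3), "balanced": (0, 0.5), "capable": (0, 0.7)}
# Hypothetical placement: the largest model moved to a second GPU
split = {"lightweight": (0, 0.3), "balanced": (0, 0.5), "capable": (1, 0.7)}

print(check_gpu_budgets(shared))
print(check_gpu_budgets(split))
```

In practice it is safer to pass a limit somewhat below 1.0, since other processes and the CUDA context also consume GPU memory.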
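import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List
from microsoft_agent_framework import Agent, Config
import openai
# VLLMManager is the class defined in the previous example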
class VLLMHighThroughputManager:
def __init__(self):
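# gpu_memory_utilization is the fraction of one GPU's memory each server may claim.
# The three values below sum to 1.5, so the servers need separate GPUs
# (for example via CUDA_VISIBLE_DEVICES) or smaller fractions on a shared GPU.
self.model_configs = {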
"lightweight": {
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"port": 8000,
"max_model_len": 2048,
"gpu_memory_utilization": 0.3
},
"balanced": {
"model": "microsoft/Phi-3.5-mini-instruct",
"port": 8001,
"max_model_len": 4096,
"gpu_memory_utilization": 0.5
},
"capable": {
"model": "meta-llama/Llama-3.2-3B-Instruct",
"port": 8002,
"max_model_len": 8192,
"gpu_memory_utilization": 0.7
}
}
self.managers = {}
self.agents = {}
self.client_pool = {}
async def initialize_all_models(self):
"""Initialize all VLLM models in parallel."""
initialization_tasks = []
for category, config in self.model_configs.items():
task = self._initialize_model(category, config)
initialization_tasks.append(task)
results = await asyncio.gather(*initialization_tasks, return_exceptions=True)
successful_inits = 0
for i, result in enumerate(results):
category = list(self.model_configs.keys())[i]
if isinstance(result, Exception):
print(f"Failed to initialize {category}: {result}")
else:
successful_inits += 1
print(f"Successfully initialized {category} model")
return successful_inits
async def _initialize_model(self, category: str, config: Dict[str, Any]):
"""Initialize a single VLLM model instance."""
manager = VLLMManager(
model_name=config["model"],
port=config["port"],
max_model_len=config["max_model_len"],
gpu_memory_utilization=config["gpu_memory_utilization"]
)
# Start server in thread to avoid blocking
loop = asyncio.get_event_loop()
with ThreadPoolExecutor() as executor:
success = await loop.run_in_executor(executor, manager.start_server)
if success:
self.managers[category] = manager
# Create agent
agent_config = Config(
name=f"vllm-{category}-agent",
model_provider="vllm",
model_id=manager.get_served_model_name(),
endpoint=f"{manager.base_url}/v1",
api_key="none"
)
self.agents[category] = Agent(config=agent_config)
# Create client pool for high throughput
self.client_pool[category] = [
openai.OpenAI(base_url=f"{manager.base_url}/v1", api_key="none")
for _ in range(5) # 5 clients per model for parallelism
]
return True
else:
raise Exception(f"Failed to start VLLM server for {category}")
def get_optimal_agent(self, request_complexity: str, current_load: Dict[str, int]) -> str:
"""Select optimal agent based on request complexity and current load."""
complexity_mapping = {
"simple": "lightweight",
"moderate": "balanced",
"complex": "capable"
}
preferred_category = complexity_mapping.get(request_complexity, "balanced")
# Check if preferred agent is available and not overloaded
if (preferred_category in self.agents and
current_load.get(preferred_category, 0) < 10): # Max 10 concurrent per agent
return preferred_category
# Fallback to least loaded available agent
available_agents = [(cat, load) for cat, load in current_load.items()
if cat in self.agents and load < 10]
if available_agents:
return min(available_agents, key=lambda x: x[1])[0]
return "balanced" # Default fallback
async def process_batch_requests(self, requests: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Process multiple requests in parallel for maximum throughput."""
current_load = {cat: 0 for cat in self.agents.keys()}
tasks = []
for request in requests:
# Determine optimal agent
complexity = request.get("complexity", "moderate")
agent_category = self.get_optimal_agent(complexity, current_load)
current_load[agent_category] += 1
# Create processing task
task = self._process_single_request(request, agent_category)
tasks.append(task)
# Process all requests in parallel
results = await asyncio.gather(*tasks, return_exceptions=True)
# Format results
formatted_results = []
for i, result in enumerate(results):
if isinstance(result, Exception):
formatted_results.append({
"request_id": requests[i].get("id", i),
"status": "error",
"error": str(result)
})
else:
formatted_results.append(result)
return formatted_results
async def _process_single_request(self, request: Dict[str, Any], agent_category: str) -> Dict[str, Any]:
"""Process a single request with the specified agent."""
start_time = time.time()
try:
agent = self.agents[agent_category]
response = await agent.chat_async(request["message"])
processing_time = time.time() - start_time
return {
"request_id": request.get("id"),
"status": "success",
"response": response,
"agent_used": agent_category,
"processing_time": processing_time
}
except Exception as e:
return {
"request_id": request.get("id"),
"status": "error",
"error": str(e),
"agent_used": agent_category,
"processing_time": time.time() - start_time
}
# High-throughput usage example
throughput_manager = VLLMHighThroughputManager()
# Initialize all models
initialized_count = await throughput_manager.initialize_all_models()
print(f"Initialized {initialized_count} models")
# Process batch requests
batch_requests = [
{"id": 1, "message": "Simple question", "complexity": "simple"},
{"id": 2, "message": "Complex analysis needed", "complexity": "complex"},
{"id": 3, "message": "Moderate difficulty task", "complexity": "moderate"}
]
results = await throughput_manager.process_batch_requests(batch_requests)
for result in results:
print(f"Request {result['request_id']}: {result['status']} in {result.get('processing_time', 0):.2f}s")
Enterprise VLLM Production Service:
import asyncio
import logging
import time
from typing import Dict, List, Optional
from dataclasses import dataclass
from microsoft_agent_framework import Agent, Config
import uvicorn
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
@dataclass
class VLLMServerConfig:
model_name: str
port: int
gpu_memory_utilization: float
max_model_len: int
tensor_parallel_size: int = 1
quantization: Optional[str] = None
class AgentRequest(BaseModel):
message: str
agent_type: str = "general"
priority: str = "normal"
timeout: int = 30
class VLLMProductionService:
def __init__(self, server_configs: Dict[str, VLLMServerConfig]):
self.server_configs = server_configs
self.managers = {}
self.agents = {}
self.metrics = {
"requests_processed": 0,
"requests_failed": 0,
"total_processing_time": 0,
"agent_usage": {name: 0 for name in server_configs.keys()},
"throughput_per_minute": 0
}
self.request_queue = asyncio.Queue(maxsize=1000)
self.processing_workers = []
self.app = FastAPI(title="VLLM Agent Service")
self._setup_routes()
async def initialize_production_environment(self):
"""Initialize all VLLM servers for production."""
logging.info("Initializing VLLM production environment...")
initialization_tasks = []
for name, config in self.server_configs.items():
task = self._initialize_server(name, config)
initialization_tasks.append(task)
results = await asyncio.gather(*initialization_tasks, return_exceptions=True)
successful_servers = 0
for i, result in enumerate(results):
server_name = list(self.server_configs.keys())[i]
if isinstance(result, Exception):
logging.error(f"Failed to initialize {server_name}: {result}")
else:
successful_servers += 1
logging.info(f"Successfully initialized {server_name}")
if successful_servers == 0:
raise Exception("No VLLM servers could be initialized")
# Start processing workers
self.processing_workers = [
asyncio.create_task(self._processing_worker(i))
for i in range(min(4, successful_servers)) # 4 workers max
]
logging.info(f"Production environment ready with {successful_servers} servers")
return successful_servers
async def _initialize_server(self, name: str, config: VLLMServerConfig):
"""Initialize a single VLLM server."""
manager = VLLMManager(
model_name=config.model_name,
port=config.port,
gpu_memory_utilization=config.gpu_memory_utilization,
max_model_len=config.max_model_len
)
# Add quantization if specified
if config.quantization:
# This would be added to the manager's start command
pass
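# start_server() blocks while polling for readiness, so run it in a worker thread
success = await asyncio.to_thread(manager.start_server)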
if success:
self.managers[name] = manager
# Create production agent
agent_config = Config(
name=f"vllm-production-{name}",
model_provider="vllm",
model_id=manager.get_served_model_name(),
endpoint=f"{manager.base_url}/v1",
api_key="none",
timeout=30.0
)
agent = Agent(config=agent_config)
# Add production tools
self._add_production_tools(agent, name)
self.agents[name] = agent
return True
else:
raise Exception(f"Failed to start VLLM server for {name}")
def _add_production_tools(self, agent: Agent, server_type: str):
"""Add production tools based on server type."""
if server_type == "customer_service":
@agent.tool
def escalate_to_human(issue: str, customer_id: str) -> str:
"""Escalate complex issues to human agents."""
return f"Escalated issue for customer {customer_id}: {issue}"
@agent.tool
def lookup_order_status(order_id: str) -> dict:
"""Look up order status from production database."""
# Production database lookup
return {"order_id": order_id, "status": "shipped", "eta": "2 days"}
elif server_type == "technical_support":
@agent.tool
def run_system_diagnostics(system_id: str) -> dict:
"""Run comprehensive system diagnostics."""
return {"system_id": system_id, "status": "healthy", "issues": []}
@agent.tool
def create_incident_report(description: str, severity: str) -> str:
"""Create incident report in production system."""
incident_id = f"INC-{hash(description) % 100000:05d}"
return f"Created incident {incident_id} with severity {severity}"
def _setup_routes(self):
"""Set up FastAPI routes for production service."""
@self.app.post("/chat")
async def chat_endpoint(request: AgentRequest, background_tasks: BackgroundTasks):
try:
# Create a future that a processing worker will complete
result_future = asyncio.get_running_loop().create_future()
# Add request to queue
await self.request_queue.put({
"request": request,
"timestamp": time.time(),
"future": result_future
})
# Wait for the worker's result (with timeout)
result = await asyncio.wait_for(result_future, timeout=request.timeout)
return result
except asyncio.TimeoutError:
raise HTTPException(status_code=408, detail="Request timeout")
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@self.app.get("/health")
async def health_endpoint():
return await self.get_health_status()
@self.app.get("/metrics")
async def metrics_endpoint():
return self.get_production_metrics()
async def _processing_worker(self, worker_id: int):
"""Background worker for processing agent requests."""
logging.info(f"Starting processing worker {worker_id}")
while True:
try:
# Get request from queue
queue_item = await self.request_queue.get()
request_data = queue_item["request"]
request_future = queue_item["future"]
# Select appropriate agent
agent_name = self._select_agent(request_data.agent_type)
if agent_name not in self.agents:
request_future.set_exception(Exception(f"Agent {agent_name} not available"))
continue
# Process request
start_time = time.time()
try:
agent = self.agents[agent_name]
response = await agent.chat_async(request_data.message)
processing_time = time.time() - start_time
# Update metrics
self.metrics["requests_processed"] += 1
self.metrics["total_processing_time"] += processing_time
self.metrics["agent_usage"][agent_name] += 1
result = {
"response": response,
"agent_used": agent_name,
"processing_time": processing_time,
"worker_id": worker_id
}
request_future.set_result(result)
except Exception as e:
self.metrics["requests_failed"] += 1
request_future.set_exception(e)
finally:
self.request_queue.task_done()
except Exception as e:
logging.error(f"Worker {worker_id} error: {e}")
await asyncio.sleep(1)
def _select_agent(self, agent_type: str) -> str:
"""Select appropriate agent based on request type."""
agent_mapping = {
"customer_service": "customer_service",
"technical": "technical_support",
"general": "general_purpose"
}
return agent_mapping.get(agent_type, "general_purpose")
async def _wait_for_result(self, request: AgentRequest):
"""Wait for request processing to complete."""
# This is simplified - in production you'd track futures properly
await asyncio.sleep(0.1) # Placeholder
return {"response": "Processed", "status": "success"}
async def get_health_status(self) -> dict:
"""Get comprehensive health status of all services."""
health_status = {
"overall_status": "healthy",
"servers": {},
"queue_size": self.request_queue.qsize(),
"active_workers": len([w for w in self.processing_workers if not w.done()]),
"timestamp": time.time()
}
unhealthy_servers = 0
for name, manager in self.managers.items():
try:
is_healthy = manager.health_check()
health_status["servers"][name] = {
"status": "healthy" if is_healthy else "unhealthy",
"endpoint": manager.base_url,
"model": manager.model_name
}
if not is_healthy:
unhealthy_servers += 1
except Exception as e:
health_status["servers"][name] = {
"status": "error",
"error": str(e)
}
unhealthy_servers += 1
if unhealthy_servers > 0:
health_status["overall_status"] = "degraded" if unhealthy_servers < len(self.managers) else "unhealthy"
return health_status
def get_production_metrics(self) -> dict:
"""Get production performance metrics."""
total_requests = self.metrics["requests_processed"] + self.metrics["requests_failed"]
avg_processing_time = (
self.metrics["total_processing_time"] / self.metrics["requests_processed"]
if self.metrics["requests_processed"] > 0 else 0
)
success_rate = (
self.metrics["requests_processed"] / total_requests * 100
if total_requests > 0 else 0
)
return {
"total_requests": total_requests,
"successful_requests": self.metrics["requests_processed"],
"failed_requests": self.metrics["requests_failed"],
"success_rate_percent": success_rate,
"average_processing_time_seconds": avg_processing_time,
"agent_usage_distribution": self.metrics["agent_usage"],
"queue_size": self.request_queue.qsize()
}
async def start_production_server(self, host: str = "0.0.0.0", port: int = 8080):
"""Start the production FastAPI server."""
config = uvicorn.Config(
self.app,
host=host,
port=port,
log_level="info",
workers=1 # Single worker for simplicity
)
server = uvicorn.Server(config)
await server.serve()
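# Production deployment example
# The gpu_memory_utilization values below sum to 1.3, so these servers are assumed
# to run on separate GPUs; on a shared GPU the fractions must total less than 1.0.
production_configs = {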
"customer_service": VLLMServerConfig(
model_name="microsoft/Phi-3.5-mini-instruct",
port=8000,
gpu_memory_utilization=0.4,
max_model_len=4096
),
"technical_support": VLLMServerConfig(
model_name="meta-llama/Llama-3.2-3B-Instruct",
port=8001,
gpu_memory_utilization=0.6,
max_model_len=8192
),
"general_purpose": VLLMServerConfig(
model_name="Qwen/Qwen2.5-1.5B-Instruct",
port=8002,
gpu_memory_utilization=0.3,
max_model_len=2048
)
}
production_service = VLLMProductionService(production_configs)
# Initialize and start production service
# await production_service.initialize_production_environment()
# await production_service.start_production_server()
Advanced VLLM Performance Monitoring:
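Queue depth and running-request counts do not have to be approximated: the VLLM OpenAI-compatible server publishes Prometheus-format metrics at `/metrics`. The sketch below parses such text for selected metrics. The helper name `parse_gauges` is hypothetical, the sample text is hand-written rather than real server output, and the metric names `vllm:num_requests_running` and `vllm:num_requests_waiting` should be verified against the VLLM version you deploy.

```python
from typing import Dict, Iterable


def parse_gauges(metrics_text: str, names: Iterable[str]) -> Dict[str, float]:
    """Sum the values of the named metrics in Prometheus text format.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    wanted = set(names)
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name in wanted:
            try:
                totals[metric_name] = totals.get(metric_name, 0.0) + float(value)
            except ValueError:
                continue  # ignore lines without a numeric value
    return totals


# Hand-written sample in the Prometheus exposition format
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="phi"} 3.0
vllm:num_requests_waiting{model_name="phi"} 12.0
"""
stats = parse_gauges(sample, ["vllm:num_requests_running", "vllm:num_requests_waiting"])
print(stats)
```

In a live setup the text would come from an HTTP GET against the server's `/metrics` path, and the parsed values could populate the `queue_length` and `active_requests` fields.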
import asyncio
import json
import logging
import time
from dataclasses import dataclass
from typing import List, Dict, Optional
import psutil
import pynvml as nvml # Module name provided by the nvidia-ml-py / nvidia-ml-py3 packages
@dataclass
class PerformanceMetrics:
timestamp: float
requests_per_second: float
average_latency_ms: float
gpu_utilization_percent: float
gpu_memory_used_gb: float
cpu_utilization_percent: float
memory_used_gb: float
queue_length: int
active_requests: int
class VLLMAdvancedMonitoring:
def __init__(self, vllm_managers: Dict[str, VLLMManager]):
self.managers = vllm_managers
self.metrics_history = []
self.alert_thresholds = {
"gpu_utilization_max": 95,
"gpu_memory_max_gb": 10,
"latency_max_ms": 3000,
"queue_length_max": 50,
"error_rate_max_percent": 10
}
# Initialize NVIDIA ML for GPU monitoring
try:
nvml.nvmlInit()
self.gpu_monitoring_available = True
self.gpu_count = nvml.nvmlDeviceGetCount()
except:
self.gpu_monitoring_available = False
self.gpu_count = 0
async def collect_comprehensive_metrics(self) -> Dict[str, PerformanceMetrics]:
"""Collect detailed performance metrics for all VLLM instances."""
all_metrics = {}
for name, manager in self.managers.items():
try:
metrics = await self._collect_single_instance_metrics(name, manager)
all_metrics[name] = metrics
except Exception as e:
logging.error(f"Failed to collect metrics for {name}: {e}")
# Create error metrics
all_metrics[name] = PerformanceMetrics(
timestamp=time.time(),
requests_per_second=0,
average_latency_ms=0,
gpu_utilization_percent=0,
gpu_memory_used_gb=0,
cpu_utilization_percent=0,
memory_used_gb=0,
queue_length=0,
active_requests=0
)
return all_metrics
async def _collect_single_instance_metrics(self, name: str, manager: VLLMManager) -> PerformanceMetrics:
"""Collect metrics for a single VLLM instance."""
timestamp = time.time()
# Get VLLM-specific metrics via API
vllm_stats = await self._get_vllm_stats(manager)
# Get system metrics
cpu_percent = psutil.cpu_percent(interval=0.1)
memory_info = psutil.virtual_memory()
memory_used_gb = memory_info.used / (1024**3)
# Get GPU metrics if available
gpu_utilization = 0
gpu_memory_used = 0
if self.gpu_monitoring_available and self.gpu_count > 0:
try:
# Assuming first GPU for simplicity
handle = nvml.nvmlDeviceGetHandleByIndex(0)
gpu_util = nvml.nvmlDeviceGetUtilizationRates(handle)
gpu_utilization = gpu_util.gpu
gpu_mem = nvml.nvmlDeviceGetMemoryInfo(handle)
gpu_memory_used = gpu_mem.used / (1024**3)
except Exception as e:
logging.warning(f"GPU monitoring failed: {e}")
return PerformanceMetrics(
timestamp=timestamp,
requests_per_second=vllm_stats.get("requests_per_second", 0),
average_latency_ms=vllm_stats.get("average_latency_ms", 0),
gpu_utilization_percent=gpu_utilization,
gpu_memory_used_gb=gpu_memory_used,
cpu_utilization_percent=cpu_percent,
memory_used_gb=memory_used_gb,
queue_length=vllm_stats.get("queue_length", 0),
active_requests=vllm_stats.get("active_requests", 0)
)
async def _get_vllm_stats(self, manager: VLLMManager) -> dict:
"""Get VLLM-specific statistics via API calls."""
try:
# Test inference to measure latency
start_time = time.time()
client = manager.get_openai_client()
response = await asyncio.wait_for(
asyncio.to_thread(
client.chat.completions.create,
model=manager.get_served_model_name(),
messages=[{"role": "user", "content": "ping"}],
max_tokens=1
),
timeout=5.0
)
latency_ms = (time.time() - start_time) * 1000
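return {
"average_latency_ms": latency_ms,
"requests_per_second": 1000 / latency_ms if latency_ms > 0 else 0, # Sequential estimate from one probe, not batched throughput
"queue_length": 0, # Not measured by this probe; VLLM reports queue depth on its Prometheus /metrics endpoint
"active_requests": 1 # Approximation
}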
except Exception as e:
logging.warning(f"Failed to get VLLM stats: {e}")
return {
"average_latency_ms": 0,
"requests_per_second": 0,
"queue_length": 0,
"active_requests": 0
}

    def generate_performance_report(self, time_window_minutes: int = 60) -> dict:
        """Generate comprehensive performance report."""
        cutoff_time = time.time() - (time_window_minutes * 60)
        recent_metrics = [
            metrics for metrics in self.metrics_history
            if any(m.timestamp > cutoff_time for m in metrics.values())
        ]

        if not recent_metrics:
            return {"error": "No recent metrics available"}

        report = {
            "time_window_minutes": time_window_minutes,
            "total_samples": len(recent_metrics),
            "instances": {}
        }

        # Analyze each instance
        for instance_name in self.managers.keys():
            instance_metrics = [
                metrics[instance_name] for metrics in recent_metrics
                if instance_name in metrics
            ]

            if instance_metrics:
                report["instances"][instance_name] = {
                    "avg_latency_ms": sum(m.average_latency_ms for m in instance_metrics) / len(instance_metrics),
                    "max_latency_ms": max(m.average_latency_ms for m in instance_metrics),
                    "avg_gpu_utilization": sum(m.gpu_utilization_percent for m in instance_metrics) / len(instance_metrics),
                    "avg_requests_per_second": sum(m.requests_per_second for m in instance_metrics) / len(instance_metrics),
                    "max_queue_length": max(m.queue_length for m in instance_metrics),
                    "availability_percent": (len(instance_metrics) / len(recent_metrics)) * 100
                }

        return report

    async def auto_scaling_recommendations(self) -> List[dict]:
        """Generate auto-scaling recommendations based on performance metrics."""
        recommendations = []

        if not self.metrics_history:
            return recommendations

        latest_metrics = self.metrics_history[-1]

        for instance_name, metrics in latest_metrics.items():
            # High latency recommendation
            if metrics.average_latency_ms > self.alert_thresholds["latency_max_ms"]:
                recommendations.append({
                    "instance": instance_name,
                    "type": "scale_up",
                    "reason": f"High latency: {metrics.average_latency_ms:.0f}ms",
                    "suggestion": "Consider adding tensor parallelism or increasing GPU memory"
                })

            # High GPU utilization recommendation
            if metrics.gpu_utilization_percent > self.alert_thresholds["gpu_utilization_max"]:
                recommendations.append({
                    "instance": instance_name,
                    "type": "scale_out",
                    "reason": f"High GPU utilization: {metrics.gpu_utilization_percent:.1f}%",
                    "suggestion": "Consider adding additional GPU instances"
                })

            # Low utilization recommendation
            if (metrics.gpu_utilization_percent < 20 and
                    metrics.requests_per_second < 1):
                recommendations.append({
                    "instance": instance_name,
                    "type": "scale_down",
                    "reason": f"Low utilization: {metrics.gpu_utilization_percent:.1f}% GPU, {metrics.requests_per_second:.1f} RPS",
                    "suggestion": "Consider consolidating workloads or reducing resources"
                })

        return recommendations

# Advanced monitoring setup
monitoring = VLLMAdvancedMonitoring({
    "customer_service": vllm_manager,
    # Add other managers as needed
})

async def advanced_monitoring_loop():
    """Advanced monitoring with auto-scaling recommendations."""
    # Count collections separately: len(metrics_history) stops growing once the
    # history is trimmed to 1000 entries, so it cannot drive the schedules below.
    collections = 0
    while True:
        try:
            # Collect metrics
            metrics = await monitoring.collect_comprehensive_metrics()
            monitoring.metrics_history.append(metrics)
            collections += 1

            # Keep only last 1000 entries
            if len(monitoring.metrics_history) > 1000:
                monitoring.metrics_history = monitoring.metrics_history[-1000:]

            # Generate recommendations every 5 minutes (every 10th collection at 30s intervals)
            if collections % 10 == 0:
                recommendations = await monitoring.auto_scaling_recommendations()
                if recommendations:
                    logging.info(f"Auto-scaling recommendations: {recommendations}")

            # Generate performance report every hour (every 120th collection)
            if collections % 120 == 0:
                report = monitoring.generate_performance_report(60)
                logging.info(f"Performance report: {json.dumps(report, indent=2)}")

        except Exception as e:
            logging.error(f"Advanced monitoring error: {e}")

        await asyncio.sleep(30)  # Collect metrics every 30 seconds

# Start advanced monitoring
# asyncio.create_task(advanced_monitoring_loop())

Production VLLM Configuration Templates:
from enum import Enum
from typing import Any, Dict, List

class DeploymentScenario(Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION_LOW = "production_low"
    PRODUCTION_HIGH = "production_high"
    ENTERPRISE = "enterprise"

class VLLMConfigTemplates:
    """Production-ready VLLM configuration templates."""

    @staticmethod
    def get_config_template(scenario: DeploymentScenario) -> Dict[str, Any]:
        """Get optimized configuration for deployment scenario."""
        templates = {
            DeploymentScenario.DEVELOPMENT: {
                "gpu_memory_utilization": 0.6,
                "max_model_len": 2048,
                "tensor_parallel_size": 1,
                "pipeline_parallel_size": 1,
                "quantization": None,
                "enable_prefix_caching": False,
                "max_num_seqs": 32,
                "max_num_batched_tokens": 2048
            },
            DeploymentScenario.STAGING: {
                "gpu_memory_utilization": 0.8,
                "max_model_len": 4096,
                "tensor_parallel_size": 1,
                "pipeline_parallel_size": 1,
                "quantization": "awq",
                "enable_prefix_caching": True,
                "max_num_seqs": 64,
                "max_num_batched_tokens": 4096
            },
            DeploymentScenario.PRODUCTION_LOW: {
                "gpu_memory_utilization": 0.85,
                "max_model_len": 4096,
                "tensor_parallel_size": 1,
                "pipeline_parallel_size": 1,
                "quantization": "awq",
                "enable_prefix_caching": True,
                "max_num_seqs": 128,
                "max_num_batched_tokens": 8192,
                "enable_chunked_prefill": True
            },
            DeploymentScenario.PRODUCTION_HIGH: {
                "gpu_memory_utilization": 0.9,
                "max_model_len": 8192,
                "tensor_parallel_size": 2,
                "pipeline_parallel_size": 1,
                "quantization": "awq",
                "enable_prefix_caching": True,
                "max_num_seqs": 256,
                "max_num_batched_tokens": 16384,
                "enable_chunked_prefill": True,
                "speculative_model": "small_draft_model"
            },
            DeploymentScenario.ENTERPRISE: {
                "gpu_memory_utilization": 0.95,
                "max_model_len": 16384,
                "tensor_parallel_size": 4,
                "pipeline_parallel_size": 2,
                "quantization": "awq",
                "enable_prefix_caching": True,
                "max_num_seqs": 512,
                "max_num_batched_tokens": 32768,
                "enable_chunked_prefill": True,
                "speculative_model": "optimized_draft_model",
                "guided_decoding_backend": "outlines"
            }
        }
        return templates[scenario]

    @staticmethod
    def generate_vllm_command(model_name: str,
                              scenario: DeploymentScenario,
                              port: int = 8000,
                              host: str = "0.0.0.0") -> List[str]:
        """Generate optimized VLLM command for deployment scenario."""
        config = VLLMConfigTemplates.get_config_template(scenario)

        cmd = [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", model_name,
            "--host", host,
            "--port", str(port),
            "--gpu-memory-utilization", str(config["gpu_memory_utilization"]),
            "--max-model-len", str(config["max_model_len"]),
            "--tensor-parallel-size", str(config["tensor_parallel_size"]),
            "--max-num-seqs", str(config["max_num_seqs"]),
            "--max-num-batched-tokens", str(config["max_num_batched_tokens"]),
            "--trust-remote-code",
            "--disable-log-requests"
        ]

        # Add optional parameters
        if config.get("quantization"):
            cmd.extend(["--quantization", config["quantization"]])
        if config.get("enable_prefix_caching"):
            cmd.append("--enable-prefix-caching")
        if config.get("enable_chunked_prefill"):
            cmd.append("--enable-chunked-prefill")
        if config.get("pipeline_parallel_size", 1) > 1:
            cmd.extend(["--pipeline-parallel-size", str(config["pipeline_parallel_size"])])
        if config.get("speculative_model"):
            cmd.extend(["--speculative-model", config["speculative_model"]])

        return cmd

# Usage examples
dev_cmd = VLLMConfigTemplates.generate_vllm_command(
    "microsoft/Phi-3.5-mini-instruct",
    DeploymentScenario.DEVELOPMENT,
    port=8000
)

prod_cmd = VLLMConfigTemplates.generate_vllm_command(
    "microsoft/Phi-3.5-mini-instruct",
    DeploymentScenario.PRODUCTION_HIGH,
    port=8001
)

print(f"Development command: {' '.join(dev_cmd)}")
print(f"Production command: {' '.join(prod_cmd)}")

Production Deployment Checklist for VLLM:
✅ Hardware Optimization:
- Configure tensor parallelism for multi-GPU setups
- Enable quantization (AWQ/GPTQ) for memory efficiency
- Set optimal GPU memory utilization (85-95%)
- Configure appropriate batch sizes for throughput
✅ Performance Tuning:
- Enable prefix caching for repeated queries
- Configure chunked prefill for long sequences
- Set up speculative decoding for faster inference
- Optimize max_num_seqs based on hardware
✅ Production Features:
- Set up health monitoring and metrics collection
- Configure automatic restart and failover
- Implement request queuing and load balancing
- Set up comprehensive logging and alerting
✅ Security and Reliability:
- Configure firewall rules and access controls
- Set up API rate limiting and authentication
- Implement graceful shutdown and cleanup
- Configure backup and disaster recovery
✅ Integration Testing:
- Test Microsoft Agent Framework integration
- Validate high-throughput scenarios
- Test failover and recovery procedures
- Benchmark performance under load
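Some of the checklist items can be checked mechanically before a server is launched. The sketch below is a hypothetical preflight helper: `validate_config` and its messages are assumptions made for illustration and are not part of VLLM. It assumes configuration dictionaries shaped like the templates above and checks three things: that GPU memory utilization is a fraction in (0, 1], that tensor parallelism multiplied by pipeline parallelism does not exceed the available GPUs, and that, when chunked prefill is off, `max_num_batched_tokens` is at least `max_model_len`.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any], available_gpus: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems: List[str] = []

    utilization = config.get("gpu_memory_utilization", 0.9)
    if not 0 < utilization <= 1:
        problems.append(f"gpu_memory_utilization must be in (0, 1], got {utilization}")

    # Each tensor-parallel shard in each pipeline stage needs its own GPU.
    gpus_needed = config.get("tensor_parallel_size", 1) * config.get("pipeline_parallel_size", 1)
    if gpus_needed > available_gpus:
        problems.append(f"config needs {gpus_needed} GPUs but only {available_gpus} are available")

    # Without chunked prefill, a full-length prompt has to fit in one batch.
    if not config.get("enable_chunked_prefill", False):
        if config["max_num_batched_tokens"] < config["max_model_len"]:
            problems.append("max_num_batched_tokens is smaller than max_model_len")

    return problems


# Values copied from the ENTERPRISE template (only the keys this check reads).
enterprise_like = {
    "gpu_memory_utilization": 0.95,
    "max_model_len": 16384,
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 2,
    "max_num_batched_tokens": 32768,
    "enable_chunked_prefill": True,
}

print(validate_config(enterprise_like, available_gpus=4))
```

Running a check like this in a deployment script catches a template that asks for more GPUs than the host has before the model starts loading.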
Comparison with Other Solutions:
| Feature | VLLM | Foundry Local | Ollama |
|---|---|---|---|
| Target Use Case | High-throughput production | Enterprise ease-of-use | Development & community |
| Performance | Maximum throughput | Balanced | Good |
| Memory Efficiency | PagedAttention optimization | Automatic optimization | Standard |
| Setup Complexity | High (many parameters) | Low (automatic) | Low (simple) |
| Scalability | Excellent (tensor/pipeline parallel) | Good | Limited |
| Quantization | Advanced (AWQ, GPTQ, FP8) | Automatic | Standard GGUF |
| Enterprise Features | Custom implementation needed | Built-in | Community tools |
| Best For | High-scale production agents | Enterprise production | Development |
When to Choose VLLM:
- High-Throughput Requirements: Processing hundreds of requests per second
- Large-Scale Deployments: Multi-GPU, multi-node deployments
- Performance Critical: Sub-second response times at scale
- Advanced Optimization: Need for custom quantization and batching
- Resource Efficiency: Maximum utilization of expensive GPU hardware
Customer Service Agents:
- SLM capabilities: Account lookups, password resets, order status checks
- Cost benefits: 10x reduction in inference costs compared to LLM agents
- Performance: Faster response times with consistent quality for routine queries
Business Process Agents:
- Invoice processing agents: Extract data, validate information, route for approval
- Email management agents: Categorize, prioritize, draft responses automatically
- Scheduling agents: Coordinate meetings, manage calendars, send reminders
Personal Digital Assistants:
- Task management agents: Create, update, organize to-do lists efficiently
- Information gathering agents: Research topics, summarize findings locally
- Communication agents: Draft emails, messages, social media posts privately
Trading and Financial Agents:
- Market monitoring agents: Track prices, identify trends in real-time
- Report generation agents: Create daily/weekly summaries automatically
- Risk assessment agents: Evaluate portfolio positions using local data
Healthcare Support Agents:
- Patient scheduling agents: Coordinate appointments, send automated reminders
- Documentation agents: Generate medical summaries, reports locally
- Prescription management agents: Track refills, check interactions privately
Microsoft Agent Framework provides a comprehensive, enterprise-grade platform for building, deploying, and managing AI agents that can operate in both cloud and offline edge environments. The framework is designed to work with Small Language Models and edge computing scenarios, which makes it well suited to privacy-sensitive and resource-constrained deployments.
Core Framework Components:
- Agent Runtime: Lightweight execution environment optimized for edge devices
- Tool Integration System: Extensible plugin architecture for connecting external services and APIs
- State Management: Persistent agent memory and context handling across sessions
- Security Layer: Built-in security controls for enterprise deployment
- Orchestration Engine: Multi-agent coordination and workflow management
Offline-First Architecture: Microsoft Agent Framework is designed with offline-first principles, enabling agents to operate effectively without constant internet connectivity. This includes local model inference, cached knowledge bases, offline tool execution, and graceful degradation when cloud services are unavailable.
Resource Optimization: The framework provides intelligent resource management with automatic memory optimization for SLMs, CPU/GPU load balancing for edge devices, adaptive model selection based on available resources, and power-efficient inference patterns for mobile deployment.
Security and Privacy: Enterprise-grade security features include local data processing to maintain privacy, encrypted agent communication channels, role-based access controls for agent capabilities, and audit logging for compliance requirements.
Microsoft Agent Framework seamlessly integrates with Foundry Local to provide a complete edge AI solution:
Automatic Model Discovery: The framework automatically detects and connects to Foundry Local instances, discovers available SLM models, and selects optimal models based on agent requirements and hardware capabilities.
Dynamic Model Loading: Agents can dynamically load different SLMs for specific tasks. This enables multi-model agent systems in which different models handle different types of requests, with automatic failover between models based on availability and performance.
Performance Optimization: Integrated caching mechanisms reduce model loading times, connection pooling optimizes API calls to Foundry Local, and intelligent batching improves throughput for multiple agent requests.
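The failover behavior described above can be pictured as an ordered list of backends that is tried until one answers. The following is a minimal sketch under stated assumptions, not the framework's implementation: `run_with_failover` and the two stand-in backends are hypothetical names, and a real deployment would call the local inference endpoint where the stand-ins return canned values.

```python
from typing import Callable, Dict, List, Tuple


def run_with_failover(prompt: str,
                      backends: List[Tuple[str, Callable[[str], str]]]) -> Dict[str, str]:
    """Try each (name, call) pair in order and return the first successful reply."""
    errors = []
    for name, call in backends:
        try:
            return {"model": name, "reply": call(prompt)}
        except Exception as exc:  # Any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-in backends (hypothetical); they simulate one unavailable and one healthy model.
def unavailable_model(prompt: str) -> str:
    raise ConnectionError("model not loaded")


def small_model(prompt: str) -> str:
    return f"echo: {prompt}"


result = run_with_failover("reset my password", [
    ("phi-4-mini", unavailable_model),
    ("qwen2.5-0.5b", small_model),
])
print(result["model"])  # prints "qwen2.5-0.5b"
```

Ordering the list from preferred to fallback keeps the policy explicit: the preferred model is used whenever it responds, and the smaller model only takes over on failure.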
from microsoft_agent_framework import Agent, Tool, Config
from foundry_local import FoundryLocalManager

# Configure agent with Foundry Local integration
config = Config(
    name="customer-service-agent",
    model_provider="foundry-local",
    model_alias="phi-4-mini",
    max_tokens=512,
    temperature=0.1,
    offline_mode=True
)

# Initialize Foundry Local connection
foundry = FoundryLocalManager("phi-4-mini")

# Create agent instance
agent = Agent(
    config=config,
    model_endpoint=foundry.endpoint,
    api_key=foundry.api_key
)

# Define tools for offline operation
@agent.tool
def lookup_customer_info(customer_id: str) -> dict:
    """Look up customer information from local database."""
    # Local database query - works offline
    return local_db.get_customer(customer_id)

@agent.tool
def create_support_ticket(issue: str, priority: str) -> str:
    """Create a support ticket in local system."""
    # Local ticket creation with sync when online
    ticket_id = local_system.create_ticket(issue, priority)
    return f"Ticket {ticket_id} created successfully"

@agent.tool
def schedule_callback(customer_id: str, preferred_time: str) -> str:
    """Schedule a callback for the customer."""
    # Local scheduling with calendar integration
    return local_calendar.schedule(customer_id, preferred_time)

from microsoft_agent_framework import AgentOrchestrator

# Create specialized agents for different domains
scheduling_agent = Agent(
    config=Config(
        name="scheduling-agent",
        model_alias="qwen2.5-0.5b",  # Lightweight for simple tasks
        specialized_for="scheduling"
    )
)

technical_support_agent = Agent(
    config=Config(
        name="technical-agent",
        model_alias="phi-4-mini",  # More capable for complex issues
        specialized_for="technical_support"
    )
)

# Orchestrate multiple agents
orchestrator = AgentOrchestrator([
    scheduling_agent,
    technical_support_agent
])

# Route requests based on intent
result = orchestrator.process_request(
    "I need to schedule a callback for a technical issue",
    routing_strategy="intent-based"
)

Local Agent Clusters: Deploy multiple specialized SLM agents on edge devices, each optimized for specific tasks. Use lightweight models like Qwen2.5-0.5B for simple routing and scheduling, medium models like Phi-4-Mini for customer service and documentation, and larger models for complex reasoning when resources allow.
Edge-to-Cloud Coordination: Implement intelligent escalation patterns where local agents handle routine tasks, cloud agents provide complex reasoning when connectivity allows, and seamless handoff between edge and cloud processing maintains continuity.
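The escalation pattern can be reduced to a small routing decision. The sketch below is illustrative only: `EscalationRouter`, the 0 to 1 complexity score, and the 0.7 threshold are assumptions introduced for this example and do not come from any framework API. Routine requests stay on the edge, complex requests go to the cloud when connectivity allows, and complex requests fall back to the local agent when the device is offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EscalationRouter:
    local_agent: Callable[[str], str]
    cloud_agent: Callable[[str], str]
    is_online: Callable[[], bool]
    complexity_threshold: float = 0.7  # Hypothetical cut-off on a 0..1 complexity score

    def handle(self, request: str, complexity: float) -> dict:
        """Route routine work locally and escalate complex work when connectivity allows."""
        if complexity < self.complexity_threshold:
            return {"handled_by": "edge", "reply": self.local_agent(request)}
        if self.is_online():
            return {"handled_by": "cloud", "reply": self.cloud_agent(request)}
        # Offline and complex: degrade gracefully instead of failing.
        return {"handled_by": "edge-degraded", "reply": self.local_agent(request)}


# Stand-in agents; the connectivity probe is fixed to "offline" for the demo.
router = EscalationRouter(
    local_agent=lambda r: f"[local] {r}",
    cloud_agent=lambda r: f"[cloud] {r}",
    is_online=lambda: False,
)
print(router.handle("Plan a multi-site migration", complexity=0.9)["handled_by"])  # prints "edge-degraded"
```

Returning the `handled_by` label with every reply makes the handoff visible to callers, so a degraded answer can be flagged for follow-up once connectivity returns.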
Single Device Deployment:
deployment:
  type: single-device
  hardware: edge-device
  models:
    - alias: "phi-4-mini"
      primary: true
      tasks: ["conversation", "reasoning"]
    - alias: "qwen2.5-0.5b"
      secondary: true
      tasks: ["routing", "classification"]
  agents:
    - name: "primary-agent"
      model: "phi-4-mini"
      tools: ["database", "calendar", "email"]

Distributed Edge Deployment:
deployment:
  type: distributed-edge
  nodes:
    - id: "edge-1"
      agents: ["customer-service", "scheduling"]
      models: ["phi-4-mini"]
    - id: "edge-2"
      agents: ["technical-support", "documentation"]
      models: ["qwen2.5-coder-0.5b"]
  coordination:
    load_balancing: true
    failover: automatic

Task-Based Model Assignment: Microsoft Agent Framework enables intelligent model selection based on task complexity and requirements:
- Simple Tasks (Q&A, routing): Qwen2.5-0.5B (500MB, <100ms response)
- Moderate Tasks (customer service, scheduling): Phi-4-Mini (2.4GB, 200-500ms response)
- Complex Tasks (technical analysis, planning): Phi-4 (7GB, 1-3s response when resources allow)
Dynamic Model Switching: Agents can switch between models based on current system load, task complexity assessment, user priority levels, and available hardware resources.
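One way to picture task-based assignment combined with resource-aware switching is a selection function that takes a task tier and the memory currently free. This is a sketch under stated assumptions: `pick_model`, the numeric tiers, and the `phi-4` alias are hypothetical, and the sizes are the approximate figures quoted in the list above rather than measured values.

```python
from typing import Optional

# Sizes follow the figures quoted above; entries are ordered smallest to largest.
MODELS = [
    {"alias": "qwen2.5-0.5b", "size_gb": 0.5, "tier": 1},  # simple tasks
    {"alias": "phi-4-mini", "size_gb": 2.4, "tier": 2},    # moderate tasks
    {"alias": "phi-4", "size_gb": 7.0, "tier": 3},         # complex tasks
]


def pick_model(task_tier: int, free_memory_gb: float) -> Optional[str]:
    """Pick the smallest model that meets the task tier and fits in memory.

    If no sufficiently capable model fits, fall back to the largest model that
    does fit; return None when nothing fits at all.
    """
    fitting = [m for m in MODELS if m["size_gb"] <= free_memory_gb]
    if not fitting:
        return None
    capable = [m for m in fitting if m["tier"] >= task_tier]
    chosen = capable[0] if capable else fitting[-1]
    return chosen["alias"]


print(pick_model(task_tier=1, free_memory_gb=8.0))  # qwen2.5-0.5b
print(pick_model(task_tier=3, free_memory_gb=4.0))  # phi-4-mini, because phi-4 does not fit
```

Choosing the smallest capable model first keeps latency and memory use low for routine requests, while the fallback branch trades some quality for availability when memory is tight.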
from microsoft_agent_framework import ResourceConfig

# Configure resource constraints for edge deployment
resource_config = ResourceConfig(
    max_memory_usage="4GB",
    max_concurrent_agents=3,
    model_cache_size="2GB",
    auto_unload_idle_models=True,
    power_management=True
)

agent = Agent(
    config=config,
    resource_limits=resource_config
)

Local Data Processing: All agent processing occurs locally, so sensitive data does not leave the edge device. This supports customer information protection, HIPAA compliance for healthcare agents, financial data security for banking agents, and GDPR compliance for European deployments.
Access Control: Role-based permissions control which tools agents can access, user authentication gates agent interactions, and audit trails record all agent actions and decisions.
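A role-based tool gate with an audit trail can be sketched in a few lines. Everything named here (`ROLE_PERMISSIONS`, `call_tool`, the `kiosk` and `support` roles) is hypothetical and only illustrates the idea; it is not the framework's access-control API. Every attempt is logged, whether or not it is allowed, so denied calls also appear in the audit trail.

```python
import time
from typing import Any, Callable, Dict, List, Set

# Hypothetical role-to-tool mapping for illustration.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "kiosk": {"check_inventory"},
    "support": {"check_inventory", "create_support_ticket"},
}

audit_log: List[Dict[str, Any]] = []


def call_tool(role: str, tool_name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
    """Run a tool only if the role permits it, recording every attempt."""
    allowed = tool_name in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({"time": time.time(), "role": role, "tool": tool_name, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"role '{role}' may not call '{tool_name}'")
    return tool(**kwargs)


def create_support_ticket(issue: str) -> str:
    return f"ticket created: {issue}"


print(call_tool("support", "create_support_ticket", create_support_ticket, issue="screen flicker"))
try:
    call_tool("kiosk", "create_support_ticket", create_support_ticket, issue="screen flicker")
except PermissionError as exc:
    print(exc)
print([entry["allowed"] for entry in audit_log])  # [True, False]
```

Keeping the log in memory is enough for a demonstration; an edge deployment would more likely append entries to local persistent storage so the trail survives restarts.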
from microsoft_agent_framework import AgentMonitor

# Set up monitoring for edge agents
monitor = AgentMonitor(
    metrics=["response_time", "success_rate", "resource_usage"],
    alerts=[
        {"metric": "response_time", "threshold": "2s", "action": "scale_down_model"},
        {"metric": "memory_usage", "threshold": "80%", "action": "unload_idle_agents"}
    ],
    local_storage=True  # Store metrics locally for offline operation
)

agent.add_monitor(monitor)

# Retail kiosk agent for in-store customer assistance
retail_agent = Agent(
    config=Config(
        name="retail-assistant",
        model_alias="phi-4-mini",
        context="You are a helpful retail assistant in an electronics store."
    )
)

@retail_agent.tool
def check_inventory(product_sku: str) -> dict:
    """Check local inventory for a product."""
    return local_inventory.lookup(product_sku)

@retail_agent.tool
def find_alternatives(product_category: str) -> list:
    """Find alternative products in the same category."""
    return local_catalog.find_similar(product_category)

@retail_agent.tool
def create_price_quote(items: list) -> dict:
    """Generate a price quote for multiple items."""
    return pricing_engine.calculate_quote(items)

# HIPAA-compliant patient support agent
healthcare_agent = Agent(
    config=Config(
        name="patient-support",
        model_alias="phi-4-mini",
        privacy_mode=True,  # Enhanced privacy for healthcare
        compliance=["HIPAA"]
    )
)

@healthcare_agent.tool
def check_appointment_availability(provider_id: str, date_range: str) -> list:
    """Check appointment slots with healthcare provider."""
    return local_scheduling.get_availability(provider_id, date_range)

@healthcare_agent.tool
def access_patient_portal(patient_id: str, auth_token: str) -> dict:
    """Secure access to patient information."""
    if security.validate_token(auth_token):
        return patient_portal.get_summary(patient_id)
    return {"error": "Authentication failed"}

- Start Simple: Begin with single-agent scenarios before building complex multi-agent systems
- Model Right-Sizing: Choose the smallest model that meets your accuracy requirements
- Tool Design: Create focused, single-purpose tools rather than complex multi-function tools
- Error Handling: Implement graceful degradation for offline scenarios and model failures
- Testing: Test agents extensively in offline conditions and resource-constrained environments
- Gradual Rollout: Deploy to small user groups initially, monitor performance metrics closely
- Resource Monitoring: Set up alerts for memory, CPU, and response time thresholds
- Fallback Strategies: Always have backup plans for model failures or resource exhaustion
- Security First: Implement security controls from the beginning, not as an afterthought
- Documentation: Maintain clear documentation of agent capabilities and limitations
Microsoft Agent Framework continues to evolve with enhanced SLM optimization, improved edge deployment tools, better resource management for constrained environments, and expanded tool ecosystem for common enterprise scenarios.
Upcoming Features:
- AutoML for Agent Optimization: Automatic fine-tuning of SLMs for specific agent tasks
- Edge Mesh Networking: Coordination between multiple edge agent deployments
- Advanced Telemetry: Enhanced monitoring and analytics for agent performance
- Visual Agent Builder: Low-code/no-code agent development tools
When selecting SLMs for agent deployment, consider the following factors:
Model Size Considerations: Choose ultra-compressed models like Q2_K for extreme mobile agent applications, balanced models such as Q4_K_M for general agent scenarios, and higher precision models like Q8_0 for quality-critical agent applications.
Agent Use Case Alignment: Match SLM capabilities to specific agent requirements, considering factors like accuracy preservation for agent decisions, inference speed for real-time agent interactions, memory constraints for edge agent deployment, and offline operation requirements for privacy-focused agents.
Quantization Approach for Agents: Select appropriate quantization levels based on agent quality requirements and hardware constraints. Consider Q4_0 for aggressive compression in mobile agents, Q5_1 for balanced quality-compression in general agents, and Q8_0 for near-original quality in critical agent applications.
Framework Selection for Agent Deployment: Choose optimization frameworks based on target hardware and agent requirements. Use Llama.cpp for CPU-optimized agent deployment, Apple MLX for Apple Silicon agent applications, and ONNX for cross-platform agent compatibility.
Mobile Agent Applications: Q4_K formats excel in smartphone agent applications with minimal memory footprint, while Q8_0 provides balanced performance for tablet-based agent systems. Q5_K formats offer superior quality for mobile productivity agents.
Desktop and Edge Agent Computing: Q5_K delivers optimal performance for desktop agent applications, Q8_0 provides high-quality inference for workstation agent environments, and Q4_K enables efficient processing on edge agent devices.
Research and Experimental Agents: Advanced quantization formats enable exploration of ultra-low precision agent inference for academic research and proof-of-concept agent applications requiring extreme resource constraints.
Agent Inference Speed: Q4_K achieves fastest agent response times on mobile CPUs, Q5_K provides balanced speed-quality ratio for general agent applications, Q8_0 offers superior quality for complex agent tasks, and experimental formats deliver maximum throughput for specialized agent hardware.
Agent Memory Requirements: Quantization levels for agents range from Q2_K (under 500MB for small agent models) to Q8_0 (approximately 50% of the FP16 model size), with experimental configurations achieving further compression for resource-constrained agent environments.
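The memory figures follow from simple arithmetic on bits per weight. As a rough sketch, the block layouts of three legacy GGUF formats give 4.5, 6.0, and 8.5 bits per weight; the 3.8B parameter count is used only as an example, and the estimate ignores file metadata, any tensors stored at a different precision, and the KV cache, so real files will differ somewhat.

```python
# Bits per weight for three legacy GGUF block formats (32 weights per block):
# Q4_0: 16 bytes of 4-bit values + 2-byte scale                   = 18 bytes -> 4.5 bits/weight
# Q5_1: 16 bytes + 4 bytes of high bits + 2-byte scale + 2-byte min = 24 bytes -> 6.0 bits/weight
# Q8_0: 32 bytes of 8-bit values + 2-byte scale                   = 34 bytes -> 8.5 bits/weight
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_1": 6.0, "Q4_0": 4.5}


def estimate_size_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {estimate_size_gb(3.8, fmt):.2f} GB")
```

On these numbers Q8_0 comes to about 53% of the FP16 size, in line with the roughly 50% figure above, and Q4_0 to about 28%.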
SLM agent deployment involves careful consideration of trade-offs between model size, agent response speed, and output quality. While Q4_K offers exceptional speed and efficiency for mobile agents, Q8_0 provides superior quality for complex agent tasks. Q5_K strikes a middle ground suitable for most general agent applications.
Different edge devices have varying capabilities for SLM agent deployment. Q4_K runs efficiently on basic processors for simple agents, Q5_K requires moderate computational resources for balanced agent performance, and Q8_0 benefits from higher-end hardware for advanced agent capabilities.
While SLM agents enable local processing for enhanced privacy, proper security measures must be implemented to protect agent models and data in edge environments. This is particularly important when deploying high-precision agent formats in enterprise environments or compressed agent formats in applications handling sensitive data.
The SLM agent landscape continues to evolve with advances in compression techniques, optimization methods, and edge deployment strategies. Future developments include more efficient quantization algorithms for agent models, improved compression methods for agent workflows, and better integration with edge hardware accelerators for agent processing.
Market Predictions for SLM Agents: According to recent research, agent-powered automation could eliminate 40–60% of repetitive cognitive tasks in enterprise workflows by 2027, with SLMs leading this transformation due to their cost efficiency and deployment flexibility.
Technology Trends in SLM Agents:
- Specialized SLM Agents: Domain-specific models trained for particular agent tasks and industries
- Edge Agent Computing: Enhanced on-device agent capabilities with improved privacy and reduced latency
- Agent Orchestration: Better coordination between multiple SLM agents with dynamic routing and load balancing
- Democratization: SLM flexibility enables broader participation in agent development across organizations
Install Dependencies:
# Install Microsoft Agent Framework
pip install microsoft-agent-framework

# Install Foundry Local SDK for edge deployment
pip install foundry-local-sdk

# Install additional dependencies for edge agents
# (asyncio ships with the Python standard library and does not need installing)
pip install openai

Initialize Foundry Local:
# Start Foundry Local service
foundry service start

# Load default model for agent development
foundry model run phi-4-mini

Popular options for Microsoft Agent Framework:
- Microsoft Phi-4 Mini (3.8B): Excellent for general agent tasks with balanced performance
- Qwen2.5-0.5B (0.5B): Ultra-efficient for simple routing and classification agents
- Qwen2.5-Coder-0.5B (0.5B): Specialized for code-related agent tasks
- Phi-4 (14B): Advanced reasoning for complex edge scenarios when resources allow
Basic Agent Setup:
from datetime import datetime

from microsoft_agent_framework import Agent, Config
from foundry_local import FoundryLocalManager

# Initialize Foundry Local connection
foundry = FoundryLocalManager("phi-4-mini")

# Create agent configuration
config = Config(
    name="my-first-agent",
    model_provider="foundry-local",
    model_alias="phi-4-mini",
    offline_mode=True
)

# Create and configure agent
agent = Agent(
    config=config,
    model_endpoint=foundry.endpoint,
    api_key=foundry.api_key
)

# Define a simple tool
@agent.tool
def get_current_time() -> str:
    """Get the current time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Test the agent
response = agent.chat("What time is it?")
print(response)

Start with focused, well-defined agent applications using Microsoft Agent Framework:
- Single domain agents: Customer service OR scheduling OR research
- Clear agent objectives: Specific, measurable goals for agent performance
- Limited tool integration: 3-5 tools maximum for initial agent deployment
- Defined agent boundaries: Clear escalation paths for complex scenarios
- Edge-first design: Prioritize offline functionality and local processing
Resource Configuration:
from microsoft_agent_framework import ResourceConfig

# Configure for edge deployment
resource_config = ResourceConfig(
    max_memory_usage="2GB",
    max_concurrent_agents=2,
    model_cache_size="1GB",
    auto_unload_idle_models=True,
    power_management=True
)

agent = Agent(
    config=config,
    resource_limits=resource_config
)

Deploy Safety Measures for Edge Agents:
- Local input validation: Check requests without cloud dependencies
- Offline output filtering: Ensure responses meet quality standards locally
- Edge security controls: Implement security without requiring internet connectivity
- Local monitoring: Track performance and flag issues using edge telemetry
Measure Edge Agent Success:
- Agent task completion rates: Monitor success rates in offline scenarios
- Agent response times: Ensure sub-second response times for edge deployment
- Resource utilization: Track memory, CPU, and battery usage on edge devices
- Cost efficiency: Compare edge deployment costs to cloud-based alternatives
- Offline reliability: Measure agent performance during network outages
Key Takeaways:
- SLMs are sufficient for agents: For most agent tasks, small models perform as well as large ones while offering significant advantages
- Cost efficiency in agents: 10-30x cheaper to run SLM agents, making them economically viable for widespread deployment
- Specialization works for agents: Fine-tuned SLMs often outperform general-purpose LLMs in specific agent applications
- Hybrid agent architecture: Use SLMs for routine agent tasks, LLMs for complex reasoning when necessary
- Microsoft Agent Framework enables production deployment: Provides enterprise-grade tools for building, deploying, and managing edge agents
- Edge-first design principles: Offline-capable agents with local processing ensure privacy and reliability
- Foundry Local integration: Seamless connection between Microsoft Agent Framework and local model inference
- The future is SLM agents: Small language models with production frameworks are the future of agentic AI, enabling democratized and efficient agent deployment
- "Language Agents as Optimizable Graphs" (2024) - Fundamental research on agent architecture and optimization
  - Authors: Mingchen Zhuge, Wenyi Wang, et al.
  - Link: https://arxiv.org/abs/2402.16823
  - Key Insights: Graph-based agent design and optimization strategies
- "The Rise and Potential of Large Language Model Based Agents" (2023)
  - Authors: Zhiheng Xi, Wenxiang Chen, et al.
  - Link: https://arxiv.org/abs/2309.07864
  - Key Insights: Comprehensive survey of LLM-based agent capabilities and applications
- "Cognitive Architectures for Language Agents" (2024)
  - Authors: Theodore Sumers, Shunyu Yao, et al.
  - Link: https://arxiv.org/abs/2309.02427
  - Key Insights: Cognitive frameworks for designing intelligent agents
- "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (2024)
  - Authors: Microsoft Research Team
  - Link: https://arxiv.org/abs/2404.14219
  - Key Insights: SLM design principles and mobile deployment strategies
- "Qwen2.5 Technical Report" (2024)
  - Authors: Alibaba Cloud Team
  - Link: https://arxiv.org/abs/2412.15115
  - Key Insights: Advanced SLM training techniques and performance optimization
- "TinyLlama: An Open-Source Small Language Model" (2024)
  - Authors: Peiyuan Zhang, Guangtao Zeng, et al.
  - Link: https://arxiv.org/abs/2401.02385
  - Key Insights: Ultra-compact model design and training efficiency
- Microsoft Agent Framework documentation: https://docs.microsoft.com/en-us/azure/ai-services/agents/
- Microsoft Agent Framework repository: https://github.com/microsoft/agent-framework
- Foundry Local repository: https://github.com/microsoft/foundry-local
- Foundry Local documentation: https://github.com/microsoft/foundry-local/blob/main/docs/README.md
- VLLM repository: https://github.com/vllm-project/vllm
- VLLM documentation: https://docs.vllm.ai/
- Ollama website: https://ollama.ai/
- Ollama repository: https://github.com/ollama/ollama
- Llama.cpp repository: https://github.com/ggml-org/llama.cpp
- Microsoft Olive documentation: https://microsoft.github.io/Olive/
- Microsoft Olive repository: https://github.com/microsoft/Olive
- OpenVINO documentation: https://docs.openvino.ai/
- Apple MLX repository: https://github.com/ml-explore/mlx
- "The State of AI Agents 2025" - McKinsey Global Institute
  - Link: https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/ai-agents-2025
  - Key Insights: Market trends and enterprise adoption patterns
- "Edge AI Inference Benchmarks" - MLPerf
  - Link: https://mlcommons.org/en/inference-edge/
  - Key Insights: Standardized performance metrics for edge deployment
- ONNX (Open Neural Network Exchange): https://onnx.ai/
  - Cross-platform model format for interoperability
- GGUF Specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
  - Quantized model format for CPU inference
- OpenAI API Specification: https://platform.openai.com/docs/api-reference
  - Standard API format for language model integration
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- ISO/IEC 23053:2022: Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML)
- IEEE Standards for AI: https://standards.ieee.org/industry-connections/ai/
The shift toward SLM-powered agents represents a fundamental change in how we approach AI deployment. Microsoft Agent Framework, combined with local platforms and efficient Small Language Models, provides a complete solution for building production-ready agents that operate effectively in edge environments. By focusing on efficiency, specialization, and practical utility, this technology stack makes AI agents more accessible, affordable, and effective for real-world applications across every industry and edge computing environment.
As we advance through 2025, the combination of increasingly capable small models, sophisticated agent frameworks like Microsoft Agent Framework, and robust edge deployment platforms will unlock new possibilities for autonomous systems that can operate efficiently on edge devices while maintaining privacy, reducing costs, and delivering exceptional user experiences.
Next Steps for Implementation:
- Explore Function Calling: Learn how SLMs handle tool integration and structured outputs
- Master Model Context Protocol (MCP): Understand advanced agent communication patterns
- Build Production Agents: Use Microsoft Agent Framework for enterprise-grade deployments
- Optimize for Edge: Apply advanced optimization techniques for resource-constrained environments