Treat AI models as modular, customizable tools that run directly on-device with Foundry Local. This session emphasizes practical workflows for privacy-preserving, low-latency inference and how to integrate these tools via SDKs, APIs, or CLI. You'll also learn how to scale to Azure AI Foundry when needed.
🔄 Updated for Modern SDK: This module has been aligned with the latest Microsoft Foundry-Local repository patterns and matches the intelligent routing implementation in samples/06/. The examples now use the modern foundry-local-sdk and advanced model selection strategies.
🏗️ Architecture Highlights:
- Intelligent Model Routing: Keyword-based selection between general, reasoning, code, and creative models
- Modern SDK Integration: Uses FoundryLocalManager with automatic service discovery
- Environment Configuration: Flexible model assignment via environment variables
- Health Monitoring: Service validation and model availability checking
- Production Ready: Comprehensive error handling and fallback mechanisms
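The fallback idea in the list above can be illustrated with a small helper that resolves a tool to a model and drops back to the general model when the preferred one is not loaded. This is a minimal sketch, not the code in samples/06/: `resolve_model` is a hypothetical name, and it assumes the caller compares like with like (aliases against aliases, or full model ids against full model ids).

```python
from typing import Dict, Iterable


def resolve_model(tool_key: str,
                  tools: Dict[str, Dict[str, str]],
                  available: Iterable[str],
                  fallback_key: str = "general") -> str:
    """Return the model for tool_key, or the fallback tool's model if it is not available."""
    available_set = set(available)
    wanted = tools.get(tool_key, {}).get("model")
    if wanted in available_set:
        return wanted
    fallback = tools[fallback_key]["model"]
    if fallback in available_set:
        return fallback
    raise RuntimeError(f"Neither {wanted!r} nor fallback {fallback!r} is available")


tools = {
    "general": {"model": "phi-4-mini"},
    "code": {"model": "qwen2.5-7b"},
}
# Only the general model is loaded, so the code tool falls back to it
print(resolve_model("code", tools, ["phi-4-mini"]))
```

Raising when even the fallback is missing keeps the failure visible instead of sending a request that is bound to fail.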
📁 Local Implementation:
- samples/06/router.py: Intelligent model router with keyword-based selection
- samples/06/model_router.ipynb: Interactive examples and benchmarks
- samples/06/README.md: Configuration and usage instructions
References:
- Foundry Local docs: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
- Integrate with inference SDKs: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-integrate-with-inference-sdks
- Compile Hugging Face models: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models
6: Foundry Local – Models as Tools
- Design model-as-a-tool patterns on-device
- Integrate via OpenAI-compatible REST API or SDKs
- Customize models to domain-specific use cases
- Plan for hybrid scaling to Azure AI Foundry
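The OpenAI-compatible REST integration named in the objectives above comes down to a JSON POST against the chat completions route. The sketch below only builds the request, using the standard library, so it can be inspected without a running service. It assumes the `http://localhost:8000/v1` base URL and `phi-4-mini` alias used elsewhere in this module; `build_chat_request` is a hypothetical helper name, and the actual port may differ on your machine.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "local-key") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("http://localhost:8000/v1", "phi-4-mini", "Hello")
print(req.get_method(), req.full_url)

# To send it (requires a running local service):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```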
Goal: Implement intelligent model selection with automatic routing based on query content.
📋 Note: This implementation matches the patterns used in samples/06/router.py, with advanced keyword-based model selection.
Step 1) Define modern model router with FoundryLocalManager
# router/intelligent_router.py
from foundry_local import FoundryLocalManager
from openai import OpenAI
from typing import Dict, Any, Optional
import os
import json


class ModelRouter:
    """Intelligent model router that selects appropriate models for different task types."""

    def __init__(self):
        self.client = None
        self.base_url = None
        self.tools = self._load_tool_registry()
        self._initialize_client()

    def _load_tool_registry(self) -> Dict[str, Dict[str, Any]]:
        """Load tool registry from environment or use defaults."""
        default_tools = {
            "general": {
                "model": os.environ.get("GENERAL_MODEL", "phi-4-mini"),
                "notes": "Fast general-purpose chat and Q&A",
                "temperature": 0.7
            },
            "reasoning": {
                "model": os.environ.get("REASONING_MODEL", "deepseek-r1-7b"),
                "notes": "Step-by-step analysis and logical reasoning",
                "temperature": 0.3
            },
            "code": {
                "model": os.environ.get("CODE_MODEL", "qwen2.5-7b"),
                "notes": "Code generation, debugging, and technical tasks",
                "temperature": 0.2
            },
            "creative": {
                "model": os.environ.get("CREATIVE_MODEL", "phi-4-mini"),
                "notes": "Creative writing and storytelling",
                "temperature": 0.9
            }
        }
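        # Check for environment override
        tools_env = os.environ.get("TOOL_REGISTRY")
        if tools_env:
            try:
                overrides = json.loads(tools_env)
                # Merge per tool so a partial registry keeps the default
                # "notes" and "temperature" fields that route_and_run reads
                merged = {key: dict(config) for key, config in default_tools.items()}
                for key, config in overrides.items():
                    merged.setdefault(key, {}).update(config)
                return merged
            except (ValueError, AttributeError, TypeError):
                print("Warning: Invalid TOOL_REGISTRY JSON, using defaults")
        return default_tools

Step 2) Initialize client with modern SDK and service discovery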
    def _initialize_client(self):
        """Initialize OpenAI client with Foundry Local or fallback configuration."""
        try:
            from foundry_local import FoundryLocalManager
            # Try to use any available model for client initialization
            first_model = next(iter(self.tools.values()))["model"]
            manager = FoundryLocalManager(first_model)
            self.client = OpenAI(
                base_url=manager.endpoint,
                api_key=manager.api_key
            )
            self.base_url = manager.endpoint
            print("✅ Foundry Local SDK initialized")
        except Exception as e:
            print(f"Warning: Could not use Foundry SDK ({e}), falling back to manual configuration")
            # Fallback to manual configuration
            self.base_url = os.environ.get("BASE_URL", "http://localhost:8000")
            api_key = os.environ.get("API_KEY", "")
            self.client = OpenAI(
                base_url=f"{self.base_url}/v1",
                api_key=api_key
            )
            print(f"Initialized manual configuration at {self.base_url}")

    def select_tool(self, user_query: str) -> str:
        """Select the most appropriate tool based on the user query."""
        query_lower = user_query.lower()
        # Code-related keywords
        code_keywords = ["code", "python", "function", "class", "method", "bug", "debug",
                         "programming", "script", "algorithm", "implementation", "refactor"]
        if any(keyword in query_lower for keyword in code_keywords):
            return "code"
        # Reasoning keywords
        reasoning_keywords = ["why", "how", "explain", "step-by-step", "reason", "analyze",
                              "think", "logic", "because", "cause", "compare", "evaluate"]
        if any(keyword in query_lower for keyword in reasoning_keywords):
            return "reasoning"
        # Creative keywords
        creative_keywords = ["story", "poem", "creative", "imagine", "write", "tale",
                             "narrative", "fiction", "character", "plot"]
        if any(keyword in query_lower for keyword in creative_keywords):
            return "creative"
        # Default to general
        return "general"

    def chat(self, model: str, content: str, max_tokens: int = 300,
             temperature: Optional[float] = None) -> str:
        """Send chat completion request to the specified model."""
        try:
            params = {
                "model": model,
                "messages": [{"role": "user", "content": content}],
                "max_tokens": max_tokens
            }
            if temperature is not None:
                params["temperature"] = temperature
            response = self.client.chat.completions.create(**params)
            return response.choices[0].message.content
        except Exception as e:
            return f"Error generating response with model {model}: {str(e)}"

Step 3) Implement intelligent routing and execution (see samples/06/router.py)
    def route_and_run(self, prompt: str) -> Dict[str, Any]:
        """Route the prompt to the appropriate model and generate response."""
        tool_key = self.select_tool(prompt)
        tool_config = self.tools[tool_key]
        model = tool_config["model"]
        temperature = tool_config.get("temperature", 0.7)
        print(f"🎯 Selected tool: {tool_key} (model: {model})")
        answer = self.chat(
            model=model,
            content=prompt,
            max_tokens=400,
            temperature=temperature
        )
        return {
            "tool": tool_key,
            "model": model,
            "tool_description": tool_config["notes"],
            "temperature": temperature,
            "answer": answer
        }

    def check_service_health(self) -> Dict[str, Any]:
        """Check Foundry Local service health and available models."""
        try:
            models_response = self.client.models.list()
            available_models = [model.id for model in models_response.data]
            return {
                "status": "healthy",
                "base_url": self.base_url,
                "available_models": available_models,
                "tools_configured": list(self.tools.keys())
            }
        except Exception as e:
            return {
                "status": "error",
                "base_url": self.base_url,
                "error": str(e)
            }


if __name__ == "__main__":
    # Ensure: foundry model run phi-4-mini
    router = ModelRouter()
    # Check health
    health = router.check_service_health()
    print(f"Service Health: {json.dumps(health, indent=2)}")
    # Test different query types
    queries = [
        "Write a Python function to calculate fibonacci numbers",  # -> code
        "Explain step-by-step why the sky is blue",  # -> reasoning
        "Tell me a creative story about AI",  # -> creative
        "What's the weather like today?"  # -> general
    ]
    for query in queries:
        result = router.route_and_run(query)
        print(f"\nQuery: {query}")
        print(f"Selected: {result['tool']} -> {result['model']}")
        print(f"Answer: {result['answer'][:100]}...")

Goal: Use the Foundry Local SDK with OpenAI Python SDK for seamless integration.
Step 1) Install dependencies
cd Module08
.\.venv\Scripts\activate
pip install foundry-local-sdk openai
Step 2) Configure environment (optional - see samples/06/README.md)
REM Override default models per tool
set GENERAL_MODEL=phi-4-mini
set REASONING_MODEL=deepseek-r1-7b
set CODE_MODEL=qwen2.5-7b
REM Or provide a full JSON registry
set TOOL_REGISTRY={"general":{"model":"phi-4-mini"},"reasoning":{"model":"deepseek-r1-7b"}}
Step 3) Modern SDK integration
# modern_sdk_demo.py
from foundry_local import FoundryLocalManager
from openai import OpenAI
import sys


def main():
    """Demonstrate modern SDK integration."""
    try:
        # Initialize with FoundryLocalManager
        alias = "phi-4-mini"
        manager = FoundryLocalManager(alias)
        # Create OpenAI client using Foundry Local endpoint
        client = OpenAI(
            base_url=manager.endpoint,
            api_key=manager.api_key
        )
        # Get model info
        model_info = manager.get_model_info(alias)
        print(f"Using model: {model_info.id}")
        # Make request with streaming
        stream = client.chat.completions.create(
            model=model_info.id,
            messages=[{"role": "user", "content": "Explain edge AI benefits in one paragraph."}],
            stream=True,
            max_tokens=200
        )
        print("Response: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error: {e}")
        print("Ensure Foundry Local is running with: foundry model run phi-4-mini")
        sys.exit(1)


if __name__ == "__main__":
    main()

Goal: Tailor outputs for a domain using prompt templates and JSON schema.
Step 1) Create a domain prompt template
# domain/templates.py
BUSINESS_ANALYST_SYSTEM = """
You are a senior business analyst. Provide:
1) Key insights
2) Risks
3) Next steps
Respond in valid JSON with fields: insights, risks, next_steps.
"""
Step 2) Enforce JSON output
# domain/analyst.py
import json
import os

import requests

from domain.templates import BUSINESS_ANALYST_SYSTEM

BASE_URL = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "local-key")
HEADERS = {"Content-Type": "application/json", "Authorization": f"Bearer {API_KEY}"}


def analyze(text: str) -> dict:
    messages = [
        {"role": "system", "content": BUSINESS_ANALYST_SYSTEM},
        {"role": "user", "content": f"Analyze this business text:\n{text}"}
    ]
    r = requests.post(f"{BASE_URL}/chat/completions", json={
        "model": "phi-4-mini",
        "messages": messages,
        "response_format": {"type": "json_object"},
        "temperature": 0.3
    }, headers=HEADERS, timeout=60)
    r.raise_for_status()
    # Parse JSON content
    content = r.json()["choices"][0]["message"]["content"]
    return json.loads(content)


if __name__ == "__main__":
    print(analyze("Sales dipped 12% in Q3 due to supply constraints and marketing cuts."))

Goal: Ensure privacy and resilience when running models as tools locally.
Step 1) Pre-warm and validate local endpoint
foundry model run phi-4-mini
curl http://localhost:8000/v1/models
Step 2) Sanitize inputs
# security/sanitize.py
import re

EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")


def sanitize(text: str) -> str:
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = PHONE_RE.sub("[REDACTED_PHONE]", text)
    return text

Step 3) Local-only flag and logging
# security/local_only.py
import os, json, time

LOG = os.getenv("MODELS_AS_TOOLS_LOG", "./tools_logs.jsonl")


def record(event: dict):
    with open(LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


# Usage before each call
def before_call(tool_name, payload):
    record({"ts": time.time(), "tool": tool_name, "event": "before_call"})


# After each call
def after_call(tool_name, result):
    record({"ts": time.time(), "tool": tool_name, "event": "after_call"})

Goal: Deploy the intelligent router with monitoring and Azure AI Foundry integration.
📋 Note: The local implementation in samples/06/model_router.ipynb includes comprehensive examples of production deployment patterns.
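For the hybrid part of this goal, one simple pattern is to prefer the local endpoint whenever it is healthy and switch to a cloud endpoint only when it is not. The sketch below shows that decision in isolation. `BASE_URL` and `API_KEY` are the variables the router's manual fallback already reads; `CLOUD_BASE_URL`, `CLOUD_API_KEY`, and `choose_endpoint` are hypothetical names for illustration, not part of Foundry Local or Azure AI Foundry.

```python
import os
from typing import Dict, Mapping, Optional


def choose_endpoint(local_healthy: bool,
                    env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick connection settings: local when healthy, otherwise a configured cloud endpoint."""
    env = os.environ if env is None else env
    if local_healthy:
        # BASE_URL and API_KEY mirror the router's manual fallback configuration
        base = env.get("BASE_URL", "http://localhost:8000").rstrip("/")
        return {"target": "local", "base_url": f"{base}/v1", "api_key": env.get("API_KEY", "")}
    # CLOUD_BASE_URL and CLOUD_API_KEY are hypothetical names for this sketch
    cloud_url = env.get("CLOUD_BASE_URL")
    if not cloud_url:
        raise RuntimeError("Local service is unavailable and CLOUD_BASE_URL is not set")
    return {"target": "cloud", "base_url": cloud_url, "api_key": env.get("CLOUD_API_KEY", "")}


settings = choose_endpoint(local_healthy=True, env={})
print(settings["target"], settings["base_url"])
```

The `local_healthy` flag could be derived from the router's health check, for example `check_service_health()["status"] == "healthy"`. Keeping the decision in a pure function makes it easy to test without either service running.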
Step 1) Production router with monitoring (see samples/06/router.py)
# production/router.py
from router.intelligent_router import ModelRouter
from typing import Any, Dict  # required by the type annotations below
import json
import time
import sys


class ProductionModelRouter(ModelRouter):
    """Production-ready model router with monitoring and logging."""

    def __init__(self):
        super().__init__()
        self.request_count = 0
        self.error_count = 0
        self.start_time = time.time()

    def route_and_run_with_monitoring(self, prompt: str) -> Dict[str, Any]:
        """Route with comprehensive monitoring and error handling."""
        start_time = time.time()
        self.request_count += 1
        try:
            result = self.route_and_run(prompt)
            processing_time = time.time() - start_time
            # Log successful request
            self._log_request({
                "status": "success",
                "tool": result["tool"],
                "model": result["model"],
                "processing_time": processing_time,
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
            })
            result["processing_time"] = processing_time
            return result
        except Exception as e:
            self.error_count += 1
            error_result = {
                "status": "error",
                "error": str(e),
                "processing_time": time.time() - start_time,
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
            }
            self._log_request(error_result)
            return error_result

    def _log_request(self, data: Dict[str, Any]):
        """Log request data for monitoring."""
        print(f"📊 {json.dumps(data)}")

    def get_stats(self) -> Dict[str, Any]:
        """Get router statistics."""
        uptime = time.time() - self.start_time
        return {
            "uptime_seconds": uptime,
            "total_requests": self.request_count,
            "error_count": self.error_count,
            "success_rate": (self.request_count - self.error_count) / max(1, self.request_count),
            "requests_per_minute": self.request_count / max(1, uptime / 60)
        }


def main():
    """Production router demo."""
    router = ProductionModelRouter()
    # Health check
    health = router.check_service_health()
    if health["status"] == "error":
        print(f"❌ Service health check failed: {health['error']}")
        sys.exit(1)
    print(f"✅ Service healthy with {len(health['available_models'])} models")
    # Process user query
    user_prompt = " ".join(sys.argv[1:]) or "Write three benefits of on-device AI in JSON format."
    print(f"\n🎯 Processing: {user_prompt}")
    result = router.route_and_run_with_monitoring(user_prompt)
    if result.get("status") == "error":
        print(f"❌ Error: {result['error']}")
    else:
        print("\n📋 Result:")
        print(f"Tool: {result['tool']} -> Model: {result['model']}")
        print(f"Processing Time: {result['processing_time']:.2f}s")
        print(f"Answer: {result['answer']}")
    # Show stats
    stats = router.get_stats()
    print(f"\n📊 Statistics: {json.dumps(stats, indent=2)}")


if __name__ == "__main__":
    main()

- Implement intelligent model router with keyword-based selection (samples/06/router.py)
- Configure multiple specialized models (general, reasoning, code, creative)
- Test the interactive Jupyter notebook (samples/06/model_router.ipynb)
- Set up environment-based model configuration
- Implement service health monitoring and error handling
- Deploy production router with comprehensive logging
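When testing the keyword-based selection from the checklist above, note that plain substring checks can misfire: "show" contains "how", so a weather query can be routed to the reasoning model. The following is a minimal sketch of a word-boundary variant, not the implementation in samples/06/router.py. The keyword lists are shortened for illustration, and the module-level `select_tool` stands in for the router method of the same name.

```python
import re
from typing import Dict, List

# Shortened keyword lists; checked in this order (code, reasoning, creative)
KEYWORDS: Dict[str, List[str]] = {
    "code": ["code", "python", "function", "bug", "debug", "refactor"],
    "reasoning": ["why", "how", "explain", "step-by-step", "compare"],
    "creative": ["story", "poem", "imagine", "fiction"],
}

# One compiled pattern per tool; \b anchors each keyword to whole words
PATTERNS = {
    tool: re.compile(r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b",
                     re.IGNORECASE)
    for tool, words in KEYWORDS.items()
}


def select_tool(query: str) -> str:
    """Return the first tool whose keywords appear as whole words, else 'general'."""
    for tool, pattern in PATTERNS.items():
        if pattern.search(query):
            return tool
    return "general"


print(select_tool("Show me the forecast"))        # general: "how" inside "show" is ignored
print(select_tool("How does bubble sort work?"))  # reasoning
```

The trade-off is that whole-word matching no longer catches inflected forms such as "functions" or "explaining", so the keyword lists may need those variants added explicitly.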
Run the complete implementation:
cd Module08
.\.venv\Scripts\activate
REM Start required models
foundry model run phi-4-mini
foundry model run qwen2.5-7b
foundry model run deepseek-r1-7b
REM Test the intelligent router
python samples\06\router.py "Write a Python function to sort a list"
python samples\06\router.py "Explain step-by-step how bubble sort works"
python samples\06\router.py "Tell me a creative story about robots"
REM Explore the interactive notebook
jupyter notebook samples/06/model_router.ipynb
- Local Implementation: samples/06/ - Complete intelligent router with multiple model support
- Microsoft Samples: Hello Foundry Local
- Integration Docs: Integrate with Inference SDKs
- Advanced Patterns: Explore function calling and multi-agent orchestration in Module 5
Foundry Local enables robust on-device AI where models become intelligent, specialized tools. With automatic model selection, comprehensive monitoring, and production-ready patterns, teams can ship sophisticated AI applications that adapt to different task types while maintaining privacy and performance. The intelligent router pattern demonstrated here provides a foundation for building complex AI systems that can scale from local development to production deployment.