A lightweight FastAPI service that exposes a simple, safe endpoint to query any OpenAI-compatible backend (e.g., Ollama, vLLM, LM Studio). It supports optional inline text or file uploads (.txt, .pdf, .docx) that the model can use to answer the prompt.
- Tech stack: Python 3.11, FastAPI, Uvicorn
- Backends: Anything that speaks the OpenAI Chat Completions API
- File types: .txt, .pdf (pypdf), .docx (python-docx)
Prerequisites: Docker and Docker Compose.
```bash
# Start Ollama + API (compose will pull the base model and create the `gpt-oss` alias automatically)
docker compose up -d

# Verify models are present inside Ollama
docker exec -it ollama ollama list
# You should see: gpt-oss:latest and llama3:latest

# Check the API health
curl http://localhost:8000/healthz
# => {"status":"ok"}

# Open interactive docs
open http://localhost:8000/docs
```

Notes:
- The compose file configures:
  - `OPENAI_BASE_URL=http://ollama:11434/v1`
  - `OPENAI_API_KEY=ollama` (Ollama ignores the key; it just must be non-empty)
  - `MODEL_ID=gpt-oss` (default). Change if you prefer a different model name.
- To change the model, edit `docker-compose.yml` and update `MODEL_ID` under the `api` service.
```bash
# Build the image
docker build -t fastapi-oss-llm .

# Run with environment configured for an OpenAI-compatible backend
# Example: point to a locally running Ollama at localhost:11434
docker run --rm -p 8000:8000 \
  -e OPENAI_BASE_URL=http://host.docker.internal:11434/v1 \
  -e OPENAI_API_KEY=ollama \
  -e MODEL_ID=gpt-oss \
  fastapi-oss-llm
```

Prerequisites: Python 3.11.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure your backend
export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama example
export OPENAI_API_KEY=ollama                       # any non-empty string
export MODEL_ID=gpt-oss

# Run the server
uvicorn app.main:app --reload --port 8000
```

- `OPENAI_BASE_URL`: Required. Base URL of an OpenAI-compatible API (e.g., `http://localhost:11434/v1` for Ollama, `http://localhost:1234/v1` for LM Studio, or your vLLM endpoint).
- `OPENAI_API_KEY`: API key; required by the client library but may be ignored by some backends (use any non-empty string for local Ollama/LM Studio).
- `MODEL_ID`: Default model used if not overridden per request. Examples: `llama3`, `mistral`, `qwen2`, or a custom vLLM served name.
- `MAX_ATTACHED_CHARS`: Soft limit for combined attached content (inline text + file). Default: `120000` characters. Longer inputs are truncated, with `truncated=true` in the response.
- `UVICORN_WORKERS`: Number of Uvicorn worker processes (the Dockerfile defaults to `1`; compose sets `2`).
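As a rough illustration of how these variables fit together, the sketch below reads them at startup with the defaults documented above. `load_settings` is a hypothetical helper, not the actual code in `app/main.py`:

```python
import os

def load_settings() -> dict:
    """Read backend configuration from the environment (hypothetical helper)."""
    base_url = os.environ.get("OPENAI_BASE_URL")
    if not base_url:
        # Mirrors the "OPENAI_BASE_URL is not set" error described in Troubleshooting.
        raise RuntimeError("OPENAI_BASE_URL is not set")
    return {
        "base_url": base_url,
        "api_key": os.environ.get("OPENAI_API_KEY", "ollama"),  # local backends only need a non-empty string
        "model_id": os.environ.get("MODEL_ID", "gpt-oss"),
        "max_attached_chars": int(os.environ.get("MAX_ATTACHED_CHARS", "120000")),
    }
```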
CORS is enabled for all origins by default. Adjust the middleware in `app/main.py` if deploying publicly.
- `GET /healthz` → status check. Response: `{ "status": "ok" }`.
- `POST /v1/ask` → query the model.
  - Form fields:
    - `prompt` (string, required): Instruction or question for the model.
    - `text` (string, optional): Inline text content to include.
    - `file` (file, optional): `.txt`, `.pdf`, or `.docx` upload.
    - `model` (string, optional): Override the model id per request.
    - `temperature` (float, optional, default `0.2`): Sampling temperature.
    - `max_tokens` (int, optional, default `512`): Output length cap.
  - Response JSON schema:
    - `model` (string): The model used for the call.
    - `response` (string): The model’s answer.
    - `input_characters` (int): Count of attached content characters after truncation.
    - `truncated` (bool): Whether attached content was truncated.
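The `input_characters` and `truncated` fields follow from a simple character cap on the attached content. A minimal sketch of that behavior, using the default limit (`clamp_attachment` is a hypothetical helper, not the service's actual code):

```python
def clamp_attachment(text: str, max_chars: int = 120_000) -> tuple[str, int, bool]:
    """Cap attached content at max_chars; mirrors input_characters/truncated in the response."""
    truncated = len(text) > max_chars
    clamped = text[:max_chars]
    return clamped, len(clamped), truncated
```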
Interactive docs: open http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc.
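The upload examples pass explicit MIME types on the command line. From code, the standard library can usually guess them; note that `.docx` may not be mapped on every platform, which is one reason to state the type explicitly as the curl examples do. A small sketch:

```python
import mimetypes

def guess_upload_type(filename: str) -> str:
    """Guess a Content-Type for an upload; fall back to a generic binary type."""
    ctype, _ = mimetypes.guess_type(filename)
    return ctype or "application/octet-stream"
```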
Basic prompt:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Explain what an API gateway is.' | jq
```

Inline text:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize the content below.' \
  -F $'text=APIs let services communicate. Gateways centralize auth, routing, and rate limiting.' | jq
```

File uploads:

```bash
# TXT
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Extract the key bullet points from the attached file.' \
  -F 'file=@notes.txt;type=text/plain' | jq

# PDF
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Create a concise summary of the attached PDF.' \
  -F 'file=@report.pdf;type=application/pdf' | jq

# DOCX
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize this document.' \
  -F 'file=@proposal.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document' | jq
```

Per-request model override:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Write a haiku about microservices.' \
  -F 'model=llama3' | jq
```

Sampling controls:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=List 5 best practices for API versioning.' \
  -F 'temperature=0.7' \
  -F 'max_tokens=300' | jq
```

Python (requests):

```python
import requests

url = "http://localhost:8000/v1/ask"
data = {
    "prompt": "Summarize the attached text.",
    "temperature": "0.3",
}
files = {
    # Optionally attach a file; text and file are both optional, only prompt is required.
    # "file": ("notes.txt", open("notes.txt", "rb"), "text/plain"),
}
resp = requests.post(url, data=data, files=files)
resp.raise_for_status()
print(resp.json())
```

JavaScript (fetch):

```js
const form = new FormData();
form.append("prompt", "Summarize the following content.");
form.append("text", "Microservices split systems into smaller, deployable services...");
// Or attach a file: form.append("file", fileInput.files[0]);

const res = await fetch("http://localhost:8000/v1/ask", {
  method: "POST",
  body: form,
});
const data = await res.json();
console.log(data);
```

- Ollama (local):
  - Base URL: `http://localhost:11434/v1`
  - Model names: `llama3`, `mistral`, `qwen2`, etc. Pull with `ollama pull <model>`.
- LM Studio (local):
  - Enable the server and copy the base URL from the app (usually `http://localhost:1234/v1`).
- vLLM:
  - Point `OPENAI_BASE_URL` to your vLLM endpoint (e.g., `http://<host>:<port>/v1`); use the served model name for `MODEL_ID`.
Update environment variables accordingly and restart the service.
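All of these backends speak the same Chat Completions API, which is why only `OPENAI_BASE_URL` and `MODEL_ID` change between them. For reference, a sketch of the kind of request body such a service typically sends upstream (the actual prompt assembly in `app/main.py` may differ):

```python
def chat_request(model: str, prompt: str, context: str = "",
                 temperature: float = 0.2, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style Chat Completions payload (illustrative sketch)."""
    # Attached content, if any, is appended after the user's instruction.
    content = f"{prompt}\n\n{context}" if context else prompt
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```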
- Error `OPENAI_BASE_URL is not set`: Set it to your backend’s `/v1` URL and restart.
- Model not found / backend returns 404: Follow these steps.
  - Check whether `gpt-oss` exists in Ollama:

    ```bash
    docker exec -it ollama ollama list
    ```

    If you do not see `gpt-oss:latest`, continue.
  - Create or recreate the alias via the compose jobs:

    ```bash
    docker compose run --rm ollama-pull-base
    docker compose run --rm ollama-create
    ```

    These commands pull `llama3` and create a local alias `gpt-oss` using `ollama/Modelfile`.
  - Restart the API service:

    ```bash
    docker compose up -d api
    ```

  - Try a request again:

    ```bash
    curl -s -X POST http://localhost:8000/v1/ask -F 'prompt=ping' | jq
    ```

  - Manual fallback (no compose jobs):

    ```bash
    docker exec -it ollama ollama pull llama3
    docker exec -it ollama ollama create gpt-oss -f /models/Modelfile
    ```

    Ensure the repo’s `ollama/Modelfile` is mounted, or adjust the path accordingly.
- Large files get cut off: Input is truncated to `MAX_ATTACHED_CHARS` (default 120k). Increase the env var if needed.
- PDF or DOCX extraction missing: The app relies on `pypdf` and `python-docx` (installed via `requirements.txt`). If running locally, ensure they are installed.
- CORS issues: CORS is open by default. For stricter settings, adjust the middleware in `app/main.py`.
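Extraction is dispatched by file extension, which is why a missing `pypdf` or `python-docx` only affects the corresponding file types. A sketch of that routing (`extractor_for` is a hypothetical helper; the real logic lives in `app/main.py`):

```python
from pathlib import Path

# Extraction strategy per supported extension, as listed in this README.
EXTRACTORS = {".txt": "plain text", ".pdf": "pypdf", ".docx": "python-docx"}

def extractor_for(filename: str) -> str:
    """Return the extraction strategy for an upload, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {ext or filename}")
    return EXTRACTORS[ext]
```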
```
app/
  __init__.py
  main.py            # FastAPI app, /healthz and /v1/ask endpoints
Dockerfile           # Containerized server
docker-compose.yml   # Local stack: Ollama + API
requirements.txt     # Python dependencies
```
Add your preferred license here.