FastAPI OSS LLM Gateway

A lightweight FastAPI service that exposes a simple, safe endpoint to query any OpenAI-compatible backend (e.g., Ollama, vLLM, LM Studio). It supports optional inline text or file uploads (.txt, .pdf, .docx) that the model can use to answer the prompt.

  • Tech stack: Python 3.11, FastAPI, Uvicorn
  • Backends: Anything that speaks the OpenAI Chat Completions API
  • File types: .txt, .pdf (pypdf), .docx (python-docx)

Quick start

Option A — docker-compose (recommended)

Prerequisites: Docker and Docker Compose.

# Start Ollama + API (compose will pull the base model and create the `gpt-oss` alias automatically)
docker compose up -d

# Verify models are present inside Ollama
docker exec -it ollama ollama list
# You should see: gpt-oss:latest and llama3:latest

# Check the API health
curl http://localhost:8000/healthz
# => {"status":"ok"}

# Open interactive docs
open http://localhost:8000/docs

Notes:

  • The compose file configures:
    • OPENAI_BASE_URL=http://ollama:11434/v1
    • OPENAI_API_KEY=ollama (Ollama ignores the key; it just must be non-empty)
    • MODEL_ID=gpt-oss (default). Change if you prefer a different model name.
  • To change the model, edit docker-compose.yml and update MODEL_ID under the api service.
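The relevant compose settings look roughly like this (an illustrative fragment; check docker-compose.yml for the authoritative values):

```yaml
services:
  api:
    environment:
      OPENAI_BASE_URL: http://ollama:11434/v1
      OPENAI_API_KEY: ollama   # ignored by Ollama, but must be non-empty
      MODEL_ID: gpt-oss        # change this to use a different model
```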

Option B — Docker only

# Build the image
docker build -t fastapi-oss-llm .

# Run with environment configured for an OpenAI-compatible backend
# Example: point to a locally running Ollama at localhost:11434
docker run --rm -p 8000:8000 \
  -e OPENAI_BASE_URL=http://host.docker.internal:11434/v1 \
  -e OPENAI_API_KEY=ollama \
  -e MODEL_ID=gpt-oss \
  fastapi-oss-llm

Option C — Local Python

Prerequisites: Python 3.11.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure your backend
export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama example
export OPENAI_API_KEY=ollama                       # any non-empty string
export MODEL_ID=gpt-oss

# Run the server
uvicorn app.main:app --reload --port 8000

Configuration

  • OPENAI_BASE_URL: Required. Base URL to an OpenAI-compatible API (e.g., http://localhost:11434/v1 for Ollama, http://localhost:1234/v1 for LM Studio, or your vLLM endpoint).
  • OPENAI_API_KEY: API key. The client library requires it, but some backends ignore it (use any non-empty string for local Ollama/LM Studio).
  • MODEL_ID: Default model used if not overridden per request. Examples: llama3, mistral, qwen2, or a custom vLLM served name.
  • MAX_ATTACHED_CHARS: Soft limit for combined attached content (inline text + file). Default: 120000 characters. Longer inputs are truncated with truncated=true in the response.
  • UVICORN_WORKERS: Number of Uvicorn worker processes (Dockerfile defaults to 1; compose sets 2).

CORS is enabled for all origins by default. Adjust in app/main.py if deploying publicly.


Endpoints

  • GET /healthz → status check. Response: { "status": "ok" }.
  • POST /v1/ask → query the model.
    • Form fields:
      • prompt (string, required): Instruction or question for the model.
      • text (string, optional): Inline text content to include.
      • file (file, optional): .txt, .pdf, or .docx upload.
      • model (string, optional): Override model id per request.
      • temperature (float, optional, default 0.2): Sampling temperature.
      • max_tokens (int, optional, default 512): Output length cap.
    • Response JSON schema:
      • model (string): The model used for the call.
      • response (string): The model’s answer.
      • input_characters (int): Count of attached content characters after truncation.
      • truncated (bool): Whether attached content was truncated.
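An example response body (illustrative values only):

```json
{
  "model": "gpt-oss",
  "response": "An API gateway is a single entry point that routes client requests to backend services...",
  "input_characters": 0,
  "truncated": false
}
```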

Interactive docs: open http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc.


Step-by-step use cases

1) Ask a simple question (no attachments)

curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Explain what an API gateway is.' | jq

2) Ask with inline text

curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize the content below.' \
  -F $'text=APIs let services communicate. Gateways centralize auth, routing, and rate limiting.' | jq

3) Ask with a .txt file

curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Extract the key bullet points from the attached file.' \
  -F 'file=@notes.txt;type=text/plain' | jq

4) Ask with a PDF or DOCX

# PDF
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Create a concise summary of the attached PDF.' \
  -F 'file=@report.pdf;type=application/pdf' | jq

# DOCX
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize this document.' \
  -F 'file=@proposal.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document' | jq

5) Override the model per request

curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Write a haiku about microservices.' \
  -F 'model=llama3' | jq

6) Control temperature and max_tokens

curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=List 5 best practices for API versioning.' \
  -F 'temperature=0.7' \
  -F 'max_tokens=300' | jq

Client examples

Python (requests)

import requests

url = "http://localhost:8000/v1/ask"

data = {
    "prompt": "Summarize the attached text.",
    "temperature": "0.3",
}
files = {
    # Optionally attach a file here (inline text can go in data["text"] instead):
    # "file": ("notes.txt", open("notes.txt", "rb"), "text/plain"),
}

resp = requests.post(url, data=data, files=files)
resp.raise_for_status()
print(resp.json())

JavaScript (fetch + FormData)

const form = new FormData();
form.append("prompt", "Summarize the following content.");
form.append("text", "Microservices split systems into smaller, deployable services...");
// Or attach a file: form.append("file", fileInput.files[0]);

const res = await fetch("http://localhost:8000/v1/ask", {
  method: "POST",
  body: form,
});
const data = await res.json();
console.log(data);

Switching backends

  • Ollama (local):
    • Base URL: http://localhost:11434/v1
    • Model names: llama3, mistral, qwen2, etc. Pull with ollama pull <model>.
  • LM Studio (local):
    • Enable the server and copy the base URL from the app (usually http://localhost:1234/v1).
  • vLLM:
    • Point OPENAI_BASE_URL to your vLLM endpoint (e.g., http://<host>:<port>/v1); use the served model name for MODEL_ID.

Update environment variables accordingly and restart the service.
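A quick preflight check of these variables can be sketched in Python (a hypothetical helper, not part of the repo):

```python
import os

REQUIRED = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "MODEL_ID")

def check_backend_config(env=os.environ) -> list[str]:
    """Return a list of likely configuration problems before starting the service."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    base = env.get("OPENAI_BASE_URL", "")
    # OpenAI-compatible backends typically serve under a /v1 prefix.
    if base and not base.rstrip("/").endswith("/v1"):
        problems.append("OPENAI_BASE_URL should usually end with /v1")
    return problems
```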


Troubleshooting

  • Error: OPENAI_BASE_URL is not set: Set it to your backend’s /v1 URL and restart.
  • Model not found / backend returns 404: Follow these steps.
    1. Check whether gpt-oss exists in Ollama:
      docker exec -it ollama ollama list
      If you do not see gpt-oss:latest, continue.
    2. Create or recreate the alias via compose jobs:
      docker compose run --rm ollama-pull-base
      docker compose run --rm ollama-create
      These commands pull llama3 and create a local alias gpt-oss using ollama/Modelfile.
    3. Restart the API service:
      docker compose up -d api
    4. Try a request again:
      curl -s -X POST http://localhost:8000/v1/ask -F 'prompt=ping' | jq
    5. Manual fallback (no compose jobs):
      docker exec -it ollama ollama pull llama3
      docker exec -it ollama ollama create gpt-oss -f /models/Modelfile
      Ensure the repo's ollama/Modelfile is mounted at /models/Modelfile inside the container, or adjust the path accordingly.
  • Large files get cut off: Input is truncated to MAX_ATTACHED_CHARS (default 120k). Increase the env var if needed.
  • PDF or DOCX extraction missing: The app relies on pypdf and python-docx (installed via requirements.txt). If running locally, ensure they are installed.
  • CORS issues: CORS is open by default. For stricter settings, adjust middleware in app/main.py.
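The file-extraction path described above can be sketched as follows (a simplified sketch; the real code in app/main.py may differ, and the lazy imports assume pypdf and python-docx from requirements.txt):

```python
import io
from pathlib import Path

def extract_text(filename: str, data: bytes) -> str:
    """Extract plain text from an uploaded .txt, .pdf, or .docx payload."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".txt":
        return data.decode("utf-8", errors="replace")
    if suffix == ".pdf":
        from pypdf import PdfReader  # installed via requirements.txt
        reader = PdfReader(io.BytesIO(data))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document  # python-docx
        doc = Document(io.BytesIO(data))
        return "\n".join(p.text for p in doc.paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")
```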

Project layout

app/
  __init__.py
  main.py           # FastAPI app, /healthz and /v1/ask endpoints
Dockerfile          # Containerized server
docker-compose.yml  # Local stack: Ollama + API
requirements.txt    # Python dependencies

License

Add your preferred license here.
