A lightweight FastAPI service that exposes a simple, safe endpoint to query any OpenAI-compatible backend (e.g., Ollama, vLLM, LM Studio). It supports optional inline text or file uploads (.txt, .pdf, .docx) that the model can use to answer the prompt.
- Tech stack: Python 3.11, FastAPI, Uvicorn
- Backends: Anything that speaks the OpenAI Chat Completions API
- File types: .txt, .pdf (pypdf), .docx (python-docx)
Prerequisites: Docker and Docker Compose.
```bash
# Start Ollama + API (compose will pull the base model and create the `gpt-oss` alias automatically)
docker compose up -d

# Verify models are present inside Ollama
docker exec -it ollama ollama list
# You should see: gpt-oss:latest and llama3:latest

# Check the API health
curl http://localhost:8000/healthz
# => {"status":"ok"}

# Open interactive docs
open http://localhost:8000/docs
```

Notes:
- The compose file configures:
  - `OPENAI_BASE_URL=http://ollama:11434/v1`
  - `OPENAI_API_KEY=ollama` (Ollama ignores the key; it just must be non-empty)
  - `MODEL_ID=gpt-oss` (default). Change if you prefer a different model name.
- To change the model, edit `docker-compose.yml` and update `MODEL_ID` under the `api` service.
```bash
# Build the image
docker build -t fastapi-oss-llm .

# Run with environment configured for an OpenAI-compatible backend
# Example: point to a locally running Ollama at localhost:11434
docker run --rm -p 8000:8000 \
  -e OPENAI_BASE_URL=http://host.docker.internal:11434/v1 \
  -e OPENAI_API_KEY=ollama \
  -e MODEL_ID=gpt-oss \
  fastapi-oss-llm
```

Prerequisites: Python 3.11.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure your backend
export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama example
export OPENAI_API_KEY=ollama                       # any non-empty string
export MODEL_ID=gpt-oss

# Run the server
uvicorn app.main:app --reload --port 8000
```

- `OPENAI_BASE_URL`: Required. Base URL of an OpenAI-compatible API (e.g., `http://localhost:11434/v1` for Ollama, `http://localhost:1234/v1` for LM Studio, or your vLLM endpoint).
- `OPENAI_API_KEY`: API key; required by the client library but may be ignored by some backends (use any non-empty string for local Ollama/LM Studio).
- `MODEL_ID`: Default model used if not overridden per request. Examples: `llama3`, `mistral`, `qwen2`, or a custom vLLM served name.
- `MAX_ATTACHED_CHARS`: Soft limit for combined attached content (inline text + file). Default: `120000` characters. Longer inputs are truncated, with `truncated=true` in the response.
- `UVICORN_WORKERS`: Number of Uvicorn worker processes (the Dockerfile defaults to `1`; compose sets `2`).
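As a rough illustration of how these variables fit together, the sketch below reads them at startup with the defaults documented above. `load_settings` is a hypothetical helper, not the actual code in `app/main.py`:

```python
import os

def load_settings() -> dict:
    """Read backend configuration from the environment (hypothetical helper)."""
    base_url = os.environ.get("OPENAI_BASE_URL")
    if not base_url:
        # Mirrors the "OPENAI_BASE_URL is not set" error described in Troubleshooting.
        raise RuntimeError("OPENAI_BASE_URL is not set")
    return {
        "base_url": base_url,
        "api_key": os.environ.get("OPENAI_API_KEY", "ollama"),  # local backends only need a non-empty string
        "model_id": os.environ.get("MODEL_ID", "gpt-oss"),
        "max_attached_chars": int(os.environ.get("MAX_ATTACHED_CHARS", "120000")),
    }
```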
CORS is enabled for all origins by default. Adjust the middleware in `app/main.py` if deploying publicly.
- `GET /healthz` → status check. Response: `{ "status": "ok" }`.
- `POST /v1/ask` → query the model.
  - Form fields:
    - `prompt` (string, required): Instruction or question for the model.
    - `text` (string, optional): Inline text content to include.
    - `file` (file, optional): `.txt`, `.pdf`, or `.docx` upload.
    - `model` (string, optional): Override the model id per request.
    - `temperature` (float, optional, default `0.2`): Sampling temperature.
    - `max_tokens` (int, optional, default `512`): Output length cap.
  - Response JSON schema:
    - `model` (string): The model used for the call.
    - `response` (string): The model’s answer.
    - `input_characters` (int): Count of attached content characters after truncation.
    - `truncated` (bool): Whether attached content was truncated.
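The `input_characters` and `truncated` fields follow from a simple character cap on the attached content. A minimal sketch of that behavior, using the default limit (`clamp_attachment` is a hypothetical helper, not the service's actual code):

```python
def clamp_attachment(text: str, max_chars: int = 120_000) -> tuple[str, int, bool]:
    """Cap attached content at max_chars; mirrors input_characters/truncated in the response."""
    truncated = len(text) > max_chars
    clamped = text[:max_chars]
    return clamped, len(clamped), truncated
```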
Interactive docs: open http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc.
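The upload examples pass explicit MIME types on the command line. From code, the standard library can usually guess them; note that `.docx` may not be mapped on every platform, which is one reason to state the type explicitly as the curl examples do. A small sketch:

```python
import mimetypes

def guess_upload_type(filename: str) -> str:
    """Guess a Content-Type for an upload; fall back to a generic binary type."""
    ctype, _ = mimetypes.guess_type(filename)
    return ctype or "application/octet-stream"
```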
Basic prompt:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Explain what an API gateway is.' | jq
```

Inline text:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize the content below.' \
  -F $'text=APIs let services communicate. Gateways centralize auth, routing, and rate limiting.' | jq
```

File uploads:

```bash
# TXT
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Extract the key bullet points from the attached file.' \
  -F 'file=@notes.txt;type=text/plain' | jq

# PDF
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Create a concise summary of the attached PDF.' \
  -F 'file=@report.pdf;type=application/pdf' | jq

# DOCX
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Summarize this document.' \
  -F 'file=@proposal.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document' | jq
```

Per-request model override:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=Write a haiku about microservices.' \
  -F 'model=llama3' | jq
```

Sampling controls:

```bash
curl -s -X POST http://localhost:8000/v1/ask \
  -F 'prompt=List 5 best practices for API versioning.' \
  -F 'temperature=0.7' \
  -F 'max_tokens=300' | jq
```

Python (requests):

```python
import requests

url = "http://localhost:8000/v1/ask"
data = {
    "prompt": "Summarize the attached text.",
    "temperature": "0.3",
}
files = {
    # Optionally attach a file; text and file are both optional, only prompt is required.
    # "file": ("notes.txt", open("notes.txt", "rb"), "text/plain"),
}
resp = requests.post(url, data=data, files=files)
resp.raise_for_status()
print(resp.json())
```

JavaScript (fetch):

```js
const form = new FormData();
form.append("prompt", "Summarize the following content.");
form.append("text", "Microservices split systems into smaller, deployable services...");
// Or attach a file: form.append("file", fileInput.files[0]);

const res = await fetch("http://localhost:8000/v1/ask", {
  method: "POST",
  body: form,
});
const data = await res.json();
console.log(data);
```

- Ollama (local):
  - Base URL: `http://localhost:11434/v1`
  - Model names: `llama3`, `mistral`, `qwen2`, etc. Pull with `ollama pull <model>`.
- LM Studio (local):
  - Enable the server and copy the base URL from the app (usually `http://localhost:1234/v1`).
- vLLM:
  - Point `OPENAI_BASE_URL` to your vLLM endpoint (e.g., `http://<host>:<port>/v1`); use the served model name for `MODEL_ID`.
Update environment variables accordingly and restart the service.
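All of these backends speak the same Chat Completions API, which is why only `OPENAI_BASE_URL` and `MODEL_ID` change between them. For reference, a sketch of the kind of request body such a service typically sends upstream (the actual prompt assembly in `app/main.py` may differ):

```python
def chat_request(model: str, prompt: str, context: str = "",
                 temperature: float = 0.2, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style Chat Completions payload (illustrative sketch)."""
    # Attached content, if any, is appended after the user's instruction.
    content = f"{prompt}\n\n{context}" if context else prompt
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```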
- Error `OPENAI_BASE_URL is not set`: Set it to your backend’s `/v1` URL and restart.
- Model not found / backend returns 404: Follow these steps.
  - Check whether `gpt-oss` exists in Ollama:

    ```bash
    docker exec -it ollama ollama list
    ```

    If you do not see `gpt-oss:latest`, continue.
  - Create or recreate the alias via the compose jobs:

    ```bash
    docker compose run --rm ollama-pull-base
    docker compose run --rm ollama-create
    ```

    These commands pull `llama3` and create a local alias `gpt-oss` using `ollama/Modelfile`.
  - Restart the API service:

    ```bash
    docker compose up -d api
    ```

  - Try a request again:

    ```bash
    curl -s -X POST http://localhost:8000/v1/ask -F 'prompt=ping' | jq
    ```

  - Manual fallback (no compose jobs):

    ```bash
    docker exec -it ollama ollama pull llama3
    docker exec -it ollama ollama create gpt-oss -f /models/Modelfile
    ```

    Ensure the repo’s `ollama/Modelfile` is mounted, or adjust the path accordingly.
- Large files get cut off: Input is truncated to `MAX_ATTACHED_CHARS` (default 120k). Increase the env var if needed.
- PDF or DOCX extraction missing: The app relies on `pypdf` and `python-docx` (installed via `requirements.txt`). If running locally, ensure they are installed.
- CORS issues: CORS is open by default. For stricter settings, adjust the middleware in `app/main.py`.
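Extraction is dispatched by file extension, which is why a missing `pypdf` or `python-docx` only affects the corresponding file types. A sketch of that routing (`extractor_for` is a hypothetical helper; the real logic lives in `app/main.py`):

```python
from pathlib import Path

# Extraction strategy per supported extension, as listed in this README.
EXTRACTORS = {".txt": "plain text", ".pdf": "pypdf", ".docx": "python-docx"}

def extractor_for(filename: str) -> str:
    """Return the extraction strategy for an upload, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {ext or filename}")
    return EXTRACTORS[ext]
```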
```
app/
  __init__.py
  main.py            # FastAPI app, /healthz and /v1/ask endpoints
Dockerfile           # Containerized server
docker-compose.yml   # Local stack: Ollama + API
requirements.txt     # Python dependencies
```
Add your preferred license here.