| Endpoint | URL | Auth |
|---|---|---|
| Chat completions | POST /chat/completions |
Bearer $LITELLM_MASTER_KEY |
| Embeddings | POST /embeddings |
Bearer $LITELLM_MASTER_KEY |
| Image generation | POST /images/generations |
Bearer $LITELLM_MASTER_KEY |
| Audio transcription | POST /audio/transcriptions |
Bearer $LITELLM_MASTER_KEY |
| Text-to-speech | POST /audio/speech |
Bearer $LITELLM_MASTER_KEY |
| Models list | GET /models |
Bearer $LITELLM_MASTER_KEY |
| Health check | GET /health/liveliness |
none |
| MCP server (all tools) | POST /mcp/ |
Bearer $LITELLM_MASTER_KEY |
| Admin UI | GET /ui/ |
optional basic auth |
The admin UI at /ui/ is rate-limited to 30 requests/minute by default (configurable via RATELIMIT_ADMIN in .env). Set LITELLM_UI_BASIC_AUTH=user:password in .env to enable HTTP basic auth on top of that.
Use claudebox models through the standard LiteLLM chat completions endpoint. Pass workspace via extra headers:
curl http://localhost:4000/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claudebox-sonnet",
"messages": [{"role": "user", "content": "analyze data.csv and summarize it"}],
"extra_headers": {"X-Claude-Workspace": "myproject"}
}'Available models: claudebox-haiku, claudebox-sonnet, claudebox-opus, pibox-zai-glm-4.5-air, pibox-zai-glm-4.7, pibox-zai-glm-5.1
Base URLs: http://localhost:4000/claudebox/ (Claude Code via OAuth/API key) and http://localhost:4000/pibox-zai/ (pi-coding-agent via z.ai/GLM).
claudebox/* requires Authorization: Bearer $CLAUDEBOX_API_TOKEN. pibox-zai/* requires Authorization: Bearer $PIBOX_ZAI_API_TOKEN. Health endpoints are open. Pibox uses /healthz (not /health); the rest of the path shape (/run, /run/{id}, /v1/chat/completions, /mcp, /files/{path}) is the same except /files/ is workspace-rooted on pibox vs. workspace-prefixed on claudebox.
| Method | Path | Description |
|---|---|---|
GET |
/claudebox/health |
Health check — no auth required |
GET |
/claudebox/status |
Returns which workspaces currently have running Claude processes |
POST |
/claudebox/run |
Run a prompt through Claude Code |
POST |
/claudebox/run/cancel?workspace=<x> |
Kill the running Claude process in a workspace |
PUT |
/claudebox/files/<workspace>/<path> |
Upload a file to a workspace |
GET |
/claudebox/files/<workspace>/<path> |
Download a file from a workspace |
GET |
/claudebox/files/<workspace> |
List files in a workspace |
GET |
/claudebox/files |
List files in the root workspace directory |
DELETE |
/claudebox/files/<workspace>/<path> |
Delete a file from a workspace |
| Field | Type | Description | Default |
|---|---|---|---|
prompt |
string | The prompt to send to Claude Code | (required) |
workspace |
string | Subpath under /workspaces for isolation |
default workspace |
model |
string | haiku, sonnet, opus, or full model name |
account default |
systemPrompt |
string | Replace the default system prompt entirely | (none) |
appendSystemPrompt |
string | Append to the default system prompt without replacing it | (none) |
jsonSchema |
string | JSON Schema string — Claude returns JSON matching this schema | (none) |
effort |
string | Reasoning effort: low, medium, high, max |
(none) |
outputFormat |
string | json or json-verbose (includes full tool call history) |
json |
noContinue |
bool | Start a fresh session instead of continuing the previous one | false |
resume |
string | Resume a specific session by session ID | (none) |
fireAndForget |
bool | Keep the Claude process running even if the HTTP client disconnects | false |
Returns 409 Conflict if the workspace already has a running Claude process.
{
"type": "result",
"subtype": "success",
"isError": false,
"result": "the response text",
"numTurns": 3,
"durationMs": 12400,
"totalCostUsd": 0.049,
"sessionId": "abc123-...",
"usage": {
"inputTokens": 312,
"outputTokens": 87,
"cacheReadInputTokens": 1024
}
}Same as json but includes a turns array with every tool call, tool result, and assistant message:
{
"type": "result",
"subtype": "success",
"result": "Done. I created data_summary.md with statistics.",
"turns": [
{
"role": "assistant",
"content": [
{"type": "tool_use", "id": "toolu_abc", "name": "Bash", "input": {"command": "head data.csv"}}
]
},
{
"role": "tool_result",
"content": [
{"type": "toolResult", "toolUseId": "toolu_abc", "isError": false, "content": "id,name,value\n1,foo,42\n..."}
]
}
],
"numTurns": 5,
"totalCostUsd": 0.089,
"sessionId": "abc123-..."
}Claudebox also speaks OpenAI's chat/completions protocol directly. This is what LiteLLM uses internally, but you can also hit it directly:
| Method | Path | Description |
|---|---|---|
GET |
/claudebox/openai/v1/models |
List available models |
POST |
/claudebox/openai/v1/chat/completions |
Chat completions (streaming + non-streaming) |
Custom headers for workspace control:
| Header | Description |
|---|---|
X-Claude-Workspace |
Workspace subpath to run in |
X-Claude-Continue |
Set to 1, true, or yes to continue the previous session |
X-Claude-Append-System-Prompt |
Text to append to the system prompt for this request |
Note: temperature, max_tokens, tools, and other standard OpenAI fields are accepted but silently ignored — Claude Code manages these internally.
Claudebox exposes an MCP server at /claudebox/mcp/. Tools: claude_run, read_file, write_file, list_files, delete_file. See mcp-tools.md for full parameter reference.
Each workspace subpath gets its own directory, file context, and conversation history. Only one Claude process can run per workspace at a time — concurrent requests return 409. Use different workspace names for parallel work:
# these run concurrently without conflicting
curl -X POST http://localhost:4000/claudebox/run \
-H "Authorization: Bearer $CLAUDEBOX_API_TOKEN" \
-d '{"prompt": "write a Go HTTP server", "workspace": "go-project"}'
curl -X POST http://localhost:4000/claudebox/run \
-H "Authorization: Bearer $CLAUDEBOX_API_TOKEN" \
-d '{"prompt": "write pytest tests", "workspace": "py-tests"}'Base URL: http://localhost:4000/storage/
| Method | Path | Auth | Description |
|---|---|---|---|
GET |
/storage/health |
none | Returns {"status":"ok"} |
GET |
/storage/ |
master or bucket key | List buckets (master sees all, bucket key sees own) |
GET |
/storage/<bucket> |
public: none / private: key | List objects (supports ?prefix= and ?max-keys=) |
PUT |
/storage/<bucket>/<key> |
bucket key or master key | Upload object (MIME auto-detected) |
GET |
/storage/<bucket>/<key> |
public: none / private: key | Download object |
HEAD |
/storage/<bucket>/<key> |
public: none / private: key | Object metadata — no body |
DELETE |
/storage/<bucket>/<key> |
bucket key or master key | Delete object — 204 even if it doesn't exist |
POST |
/storage/presign/<bucket>/<key> |
bucket key or master key | Generate presigned URL |
POST |
/storage/mcp/ |
per-tool auth_key |
MCP endpoint |
Authentication: pass Authorization: Bearer <key> where <key> is the bucket's private key or $HYBRIDS3_MASTER_KEY.
The uploads bucket is configured as public-read — GET/LIST require no auth. PUT/DELETE always require the bucket key.
# generate a presigned URL (expires in 1 hour by default, max 7 days)
curl -X POST "http://localhost:4000/storage/presign/uploads/photo.jpg?expires=3600" \
-H "Authorization: Bearer $HYBRIDS3_UPLOADS_KEY"
# response for public bucket — plain URL, no expiry
{"url": "http://localhost:4000/storage/uploads/photo.jpg", "expires": null}
# response for private bucket — signed URL with expiry
{"url": "http://localhost:4000/storage/private/doc.pdf?X-Amz-Algorithm=...&X-Amz-Signature=...", "expires": 3600}
# use the presigned URL — no auth header needed
curl "http://localhost:4000/storage/uploads/photo.jpg"import boto3
from botocore.config import Config
s3 = boto3.client(
"s3",
endpoint_url="http://localhost:4000/storage",
aws_access_key_id="uploads", # bucket name (public_key)
aws_secret_access_key=HYBRIDS3_UPLOADS_KEY,
region_name="us-east-1",
config=Config(signature_version="s3v4"),
)
s3.upload_file("image.png", "uploads", "image.png")
s3.download_file("uploads", "image.png", "local.png")
s3.list_objects_v2(Bucket="uploads", Prefix="images/")
s3.delete_object(Bucket="uploads", Key="image.png")
# generate presigned URL via boto3
url = s3.generate_presigned_url(
"get_object",
Params={"Bucket": "uploads", "Key": "image.png"},
ExpiresIn=3600,
)Every response includes X-Request-Id for log correlation and X-Content-Type-Options: nosniff. Upload responses include ETag (MD5 of content). GET/HEAD responses include ETag, Last-Modified, Content-Length, and Content-Type (auto-detected from content).
Each object key has its own async read-write lock. Multiple concurrent reads are allowed. Writes are exclusive — a write blocks all other readers and writers on that key. Requests that can't acquire the lock within 30 seconds, or that hold it for more than 300 seconds, get 503.
The uploads bucket has TTL configured (default: HYBRIDS3_UPLOADS_TTL, typically 168h / 7 days). Uploading a file resets its expiry clock. A background sweep runs every minute and deletes expired objects.
| Endpoint | URL | Auth |
|---|---|---|
| Browser API | POST /stealthy-auto-browse/ |
optional bearer token |
| Screenshot (browser) | GET /stealthy-auto-browse/screenshot/browser |
optional bearer token |
| Screenshot (desktop) | GET /stealthy-auto-browse/screenshot/desktop |
optional bearer token |
| MCP server | POST /stealthy-auto-browse/mcp/ |
optional bearer token |
| Queue health | GET /stealthy-auto-browse/__queue/health |
none |
| Cluster status | GET /stealthy-auto-browse/__queue/status |
none |
Set STEALTHY_AUTO_BROWSE_AUTH_TOKEN in .env to set the bearer auth token. Defaults to lulz-4-security if unset — always change this in production.
- 5 browser replicas by default — set
STEALTHY_AUTO_BROWSE_NUM_REPLICASto change - Each replica: 256 MB RAM, up to 1 GB swap
- HAProxy routes requests to replicas and enforces session stickiness:
- MCP requests: pinned by
Mcp-Session-Idheader - All other requests: pinned by
INSTANCEIDcookie, max 1 concurrent request per replica
- MCP requests: pinned by
{
"action": "goto",
"url": "https://example.com"
}Atomic actions: goto, get_text, get_html, get_interactive_elements, screenshot, system_click, system_type, send_key, click, fill, scroll, mouse_move, wait_for_element, wait_for_text, eval_js, browser_action.
run_script composes multiple actions into a single request — executes them sequentially on the same replica in a single HTTP round-trip:
{
"action": "run_script",
"steps": [
{"action": "goto", "url": "https://example.com"},
{"action": "wait_for_element", "selector": "h1", "timeout": 5},
{"action": "get_text"}
]
}Local image generation via stable-diffusion.cpp with a Go wrapper. CPU variant runs with SDCPP=1, CUDA variant with SDCPP_CUDA=1. Both expose an OpenAI-compatible /v1/images/generations endpoint proxied through LiteLLM.
| Endpoint | URL | Description |
|---|---|---|
| Image generation | POST /v1/images/generations |
OpenAI-compatible, proxied through LiteLLM |
| Load model | POST /sdcpp/v1/load?model=<key> |
Pre-load a model without generating |
| Unload model | POST /sdcpp/v1/unload |
Free VRAM/RAM |
| Cancel generation | POST /sdcpp/v1/cancel |
Kill in-progress generation |
| Status | GET /sdcpp/v1/status |
Current state: loaded model, generating, process info |
| Models list | GET /v1/models |
Available models |
| Health | GET /sdcpp/v1/health |
Wrapper health check |
CPU (SDCPP=1): sd-turbo, sdxl-turbo
CUDA (SDCPP_CUDA=1): sd-turbo, sdxl-turbo, sdxl-lightning, flux-schnell, juggernaut-xi
- Auto-load: sending a generation request loads the model automatically if not loaded
- Model hot-swap: requesting a different model stops the current sd-server, starts a new one
- Idle timeout: unloads model after 5 minutes of inactivity (configurable)
- Non-blocking: concurrent requests get 503 immediately instead of queuing. The LiteLLM resource manager semaphore handles scheduling.
- CUDA resource manager: only one CUDA job (LLM, image gen, TTS, STT) runs at a time. Competing services are unloaded before the request proceeds.
| Variable | Default | Description |
|---|---|---|
SDCPP_IDLE_TIMEOUT |
5m |
CPU idle timeout before auto-unload |
SDCPP_CUDA_IDLE_TIMEOUT |
5m |
CUDA idle timeout before auto-unload |
SDCPP_MEM_LIMIT |
12g |
CPU container memory limit |
SDCPP_MEMSWAP_LIMIT |
24g |
CPU container memory + swap limit |
SDCPP_CPUS |
4.0 |
CPU container CPU limit |
SDCPP_CUDA_MEM_LIMIT |
12g |
CUDA container memory limit |
SDCPP_CUDA_MEMSWAP_LIMIT |
24g |
CUDA container memory + swap limit |
SDCPP_CUDA_CPUS |
4.0 |
CUDA container CPU limit |
SDCPP_LOAD_TIMEOUT / SDCPP_CUDA_LOAD_TIMEOUT |
10m |
Max time to wait for model load |
SDCPP_VERBOSE / SDCPP_CUDA_VERBOSE |
false |
Debug logging |
SDCPP_LOG_LEVEL / SDCPP_CUDA_LOG_LEVEL |
info |
Log level |
External image: psyb0t/talkies (pinned to v0.4.0 / v0.4.0-cuda). One container, both endpoints: POST /v1/audio/transcriptions (whisper + canary + parakeet) and POST /v1/audio/speech (Kokoro-82M TTS, plus Qwen3-TTS voice cloning on CUDA). CPU image ships the four ASR models that run reasonably without a GPU (three Whisper variants + canary-180m-flash) plus Kokoro. CUDA image adds Parakeet-TDT, Canary-1B-Flash, Canary-Qwen-2.5B SALM, and Qwen3-TTS-0.6B (voice cloning, 17 languages) on top. Kokoro stays CPU-bound in both images.
| Endpoint | URL | Description |
|---|---|---|
| Transcribe | POST /v1/audio/transcriptions |
OpenAI-compatible multipart upload (file, model, language, response_format, timestamp_granularities[], diarization). Supports json, text, verbose_json, srt, vtt. Stereo channel-split diarization via diarization=true — segments + words get a "channel": "L"/"R" field. |
| Speech | POST /v1/audio/speech |
OpenAI-compatible TTS — Kokoro-82M behind kokoro-82m slug. JSON body with model, input, voice, response_format (mp3/opus/aac/flac/wav/pcm). |
| List models | GET /v1/models |
Configured model_ids |
| Loaded models | GET /api/ps |
Currently loaded backends + idle_seconds |
| Unload one | DELETE /api/ps/{model_id} |
Evict one model from RAM/VRAM (URL-encoded) |
| Unload all | POST /unload |
Evict every loaded backend |
| List voices | GET /v1/audio/voices |
Available Kokoro voices |
| Health | GET /healthz |
Liveness + device + configured model_ids |
CPU (TALKIES=1):
whisper-large-v3,whisper-large-v3-turbo— multilingual ASRcanary-180m-flash— English ASR (FastConformer)kokoro-82m— TTS, ~41 voices across en/es/fr/hi/it/pt
CUDA (TALKIES_CUDA=1): all of the above plus
parakeet-tdt-0.6b-v3— 25 European languages (NeMo RNNT)canary-1b-flash— EN/DE/FR/ES + EN↔X translation (NeMo multitask)canary-qwen-2.5b— English, NeMo SALM hybrid ASR+LLM (text-only; no per-word timestamps)qwen3-tts-0.6b— voice cloning, 17 languages (en, zh, ja, ko, fr, de, es, it, pt, ru, vi, th, id, ar, tr, pl, nl). Drop a<name>.wav(10-30s clean speech) into${DATA_DIR_TALKIES}/custom-voices/on the host (mounted as/data/custom-voicesinside the container) and usevoice=<name>on/v1/audio/speech. Samplesalloy/echo/fablebaked in. Nested paths supported (voice=clients/acme/jane→${DATA_DIR_TALKIES}/custom-voices/clients/acme/jane.wav).
- Lazy load + idle TTL unload: weights download on first request, sit on disk in
${DATA_DIR_TALKIES}(HF cache layout). A background sweeper unloads any model idle longer thanTALKIES_MODEL_TTL(default10m); next request warm-reloads from disk. - Sibling eviction: only one model resident per talkies container at a time. When request N arrives for a different model, talkies evicts the prior one before loading.
- Resource-manager aware:
local-talkies-cuda-*participates in thecuda-stt-talkiesgroup,local-talkies-*incpu-stt-talkies. A competing job (LLM, image gen, TTS, other STT) triggersDELETE /api/ps/{model_id}for every model before its own load. - VAD chunking: long audio is sliced via Silero VAD into ≤28-second speech regions before each backend forward pass, then results are stitched into one Whisper-shape timeline. Backends that don't support timeline assembly (the SALM
canary-qwen-2.5b) concatenate per-chunk text without timestamps. - Audio preprocessing: any container/codec is ffmpeg-converted to 16 kHz mono WAV before the backend sees it. Stereo
diarization=truesplits L/R into two mono streams, transcribes each, and time-interleaves the segments with channel tags. - OpenAI parity: every response_format returns the correct Content-Type body —
text/plainfortext,application/x-subripforsrt,text/vttforvtt,application/jsonforjson/verbose_json.verbose_jsoncarriestext,language,duration,segments[{id,start,end,text,channel?,…}],words[{word,start,end,channel?}].
| Variable | Default | Description |
|---|---|---|
TALKIES_MODEL_TTL / TALKIES_CUDA_MODEL_TTL |
10m |
Idle duration before unload (-1 disables). Accepts bare seconds or Go-style strings (3h30m5s, 45m, 90s). |
TALKIES_SWEEPER_INTERVAL / TALKIES_CUDA_SWEEPER_INTERVAL |
1m |
Idle sweeper poll interval |
TALKIES_LOAD_TIMEOUT / TALKIES_CUDA_LOAD_TIMEOUT |
5m |
Max wait for model load before the request errors |
TALKIES_MAX_UPLOAD_BYTES / TALKIES_CUDA_MAX_UPLOAD_BYTES |
104857600 |
Max audio upload size (bytes) |
TALKIES_LOG_LEVEL / TALKIES_CUDA_LOG_LEVEL |
INFO |
Log level |
TALKIES_PRELOAD / TALKIES_CUDA_PRELOAD |
empty | Comma-separated model_ids to load at boot |
TALKIES_VAD_CHUNK_THRESHOLD / TALKIES_CUDA_VAD_CHUNK_THRESHOLD |
30 |
Audio length (seconds) above which VAD chunking kicks in |
TALKIES_VAD_MAX_SPEECH / TALKIES_CUDA_VAD_MAX_SPEECH |
28 |
Max chunk length fed to a single forward pass |
TALKIES_MEM_LIMIT / TALKIES_CUDA_MEM_LIMIT |
8g / 12g |
Container memory limit |
TALKIES_CPUS / TALKIES_CUDA_CPUS |
4.0 |
Container CPU limit |
DATA_DIR_TALKIES |
${DATA_DIR}/talkies |
Bind-mount root for talkies' /data dir. Contains hf/hub/models--*/ (HF cache, shared by CPU + CUDA) and — for CUDA — custom-voices/<name>.wav (Qwen3-TTS reference voices). |
Auto-enabled when any image or TTS provider is active (HuggingFace, OpenAI, Speaches, SDCPP, CUDA). Runs as an internal service — no direct nginx route, accessed only through LiteLLM's aggregated MCP endpoint at /mcp/.
| Endpoint | URL | Auth |
|---|---|---|
| MCP server (via proxy) | POST /mcp/ |
Bearer $LITELLM_MASTER_KEY |
| Health (internal only) | GET :8000/health |
none (not exposed via nginx) |
generate_image— create images from text prompts (FLUX, DALL-E, Stable Diffusion depending on enabled providers)generate_tts— generate speech audio from text (Kokoro, Qwen3-TTS, OpenAI TTS depending on enabled providers)
Both tools return structured JSON with persistent HybridS3 URLs — no base64 blobs sent to the LLM.
See mcp-tools.md for full parameter reference.
| Variable | Default | Description |
|---|---|---|
MCP_TOOLS_AUTH_TOKEN |
— | Bearer token for MCP auth (required) |
MCP_MEM_LIMIT |
256m |
Container memory limit |
MCP_MEMSWAP_LIMIT |
512m |
Container memory + swap limit |
MCP_CPUS |
0.5 |
CPU limit |
Web UI for LLM interaction at /librechat/. Pre-configured with all LiteLLM models and MCP tools. Uses MongoDB for conversation storage.
| Endpoint | URL | Auth |
|---|---|---|
| Web UI | GET /librechat/ |
email/password (own auth) |
| API | /librechat/api/* |
JWT (managed by LibreChat) |
LibreChat has its own email/password authentication — no basic auth (SPAs and basic auth are incompatible due to Authorization header collision). The first registered user automatically becomes admin. After creating your account, set LIBRECHAT_ALLOW_REGISTRATION=false in .env and restart to lock registration.
All MCP tools from the LiteLLM aggregated endpoint are available in LibreChat conversations. Connected via streamable-http with apiKey.source: admin (bypasses LibreChat's OAuth detection probe). Configuration in librechat/librechat.yaml.
| Variable | Default | Description |
|---|---|---|
LIBRECHAT_DOMAIN_CLIENT |
http://librechat:3080/librechat |
Public URL for client (sets <base href>) |
LIBRECHAT_DOMAIN_SERVER |
http://librechat:3080/librechat |
Public URL for server API |
LIBRECHAT_CREDS_KEY |
— | Encryption key for stored credentials (64 hex chars) |
LIBRECHAT_CREDS_IV |
— | Encryption IV (32 hex chars) |
LIBRECHAT_JWT_SECRET |
— | JWT signing secret |
LIBRECHAT_TITLE_MODEL |
groq-llama-3.3-70b |
Model for auto-titling conversations |
LIBRECHAT_ALLOW_REGISTRATION |
true |
Set to false after creating admin account |
LIBRECHAT_ALLOW_EMAIL_LOGIN |
true |
Enable email/password login |
LIBRECHAT_ALLOW_SOCIAL_LOGIN |
false |
Enable social login providers |
LIBRECHAT_ALLOW_UNVERIFIED_EMAIL_LOGIN |
true |
Allow login without email verification |
LIBRECHAT_DEBUG_LOGGING |
true |
Enable debug-level logging |
LIBRECHAT_DEBUG_CONSOLE |
false |
Log to console (in addition to file) |
LIBRECHAT_MEM_LIMIT |
512m |
LibreChat container memory limit |
LIBRECHAT_MEMSWAP_LIMIT |
1g |
LibreChat container memory + swap limit |
LIBRECHAT_CPUS |
1.0 |
LibreChat CPU limit |
LIBRECHAT_MONGO_MEM_LIMIT |
512m |
MongoDB container memory limit |
LIBRECHAT_MONGO_MEMSWAP_LIMIT |
1g |
MongoDB container memory + swap limit |
LIBRECHAT_MONGO_CPUS |
0.5 |
MongoDB CPU limit |
RATELIMIT_LIBRECHAT |
500r/m |
Nginx rate limit |
RATELIMIT_LIBRECHAT_BURST |
100 |
Burst allowance |
TIMEOUT_LIBRECHAT |
600s |
Nginx proxy timeout |
LIBRECHAT_MAX_BODY_SIZE |
25m |
Max upload size |
DATA_DIR_LIBRECHAT |
${DATA_DIR}/librechat |
Data directory (MongoDB + uploads) |
Self-hosted meta-search engine at /searxng/. Aggregates results from Google, Bing, DuckDuckGo, and Wikipedia. Protected by nginx admin auth (LITELLM_UI_BASIC_AUTH). Rate-limited to 60 req/min by default.
Also exposed to the MCP search_web tool — when SEARXNG=1, the MCP tools server gains a search_web tool that any function-calling model can invoke.
| Endpoint | URL | Auth |
|---|---|---|
| Search UI | GET /searxng/ |
nginx basic auth |
| JSON API | GET /searxng/search?q=...&format=json |
nginx basic auth |
| Variable | Default | Description |
|---|---|---|
RATELIMIT_SEARXNG |
60r/m |
Nginx rate limit |
RATELIMIT_SEARXNG_BURST |
20 |
Burst allowance |
TIMEOUT_SEARXNG |
60s |
Nginx proxy timeout |
SEARXNG_MEM_LIMIT |
256m |
Container memory limit |
SEARXNG_MEMSWAP_LIMIT |
512m |
Container memory + swap limit |
SEARXNG_CPUS |
0.5 |
CPU limit |
Settings are in searxng/settings.yml (mounted read-only into the container). The default config enables HTML and JSON output formats and activates Google, Bing, DuckDuckGo, and Wikipedia engines with no rate limiter.
Telegram client at /telethon/. Backed by docker-telethon-plus — exposes a REST API and MCP server for sending/reading messages, managing dialogs, forwarding content, and group operations.
| Endpoint | URL | Auth |
|---|---|---|
| Health | GET /telethon/healthz |
none |
| REST API | /telethon/api/* |
Bearer token |
| MCP server | /telethon/mcp |
Bearer token |
- Get API credentials at my.telegram.org/apps
- Generate a string session:
docker run -it --rm \ -e TELETHON_API_ID=your_id \ -e TELETHON_API_HASH=your_hash \ psyb0t/telethon-plus:v0.2.0 login
- Add to
.env:TELETHON=1,TELETHON_API_ID,TELETHON_API_HASH,TELETHON_SESSION,TELETHON_AUTH_KEY
| Variable | Default | Description |
|---|---|---|
TELETHON_API_ID |
— | Telegram app ID (required) |
TELETHON_API_HASH |
— | Telegram app hash (required) |
TELETHON_SESSION |
— | String session from login.py (required) |
TELETHON_AUTH_KEY |
lulz-4-security |
Bearer token for REST + MCP auth |
TELETHON_PROXY |
— | SOCKS5 proxy (socks5://user:pass@host:port) |
TELETHON_REQUEST_TIMEOUT |
60 |
Telegram API request timeout (seconds) |
TELETHON_FLOOD_SLEEP_THRESHOLD |
60 |
Auto-sleep on flood wait up to this many seconds |
RATELIMIT_TELETHON |
30r/m |
Nginx rate limit |
RATELIMIT_TELETHON_BURST |
10 |
Burst allowance |
TIMEOUT_TELETHON |
60s |
Nginx proxy timeout |
TELETHON_MEM_LIMIT |
256m |
Container memory limit |
TELETHON_MEMSWAP_LIMIT |
512m |
Container memory + swap limit |
TELETHON_CPUS |
0.2 |
CPU limit |
Foundation time-series forecasting at /predictalot/. Backed by docker-predictalot — five foundation models (chronos-2, timesfm-2.5, moirai-2, toto-1, sundial-base-128m) exposed via a type-routed API. Each forecast modality has its own URL prefix; a model only appears under a type if it implements that modality. Per-type weighted-mean ensembles parallelize across all type members.
Not registered with LiteLLM (no chat/completion surface) — accessed directly via nginx. MCP is exposed by default through LiteLLM's /mcp/ aggregator with 26 tools (one per (type, model) cell + per-type ensemble + per-type listing).
| Type | Base URL | Members | Request shape | Response shape |
|---|---|---|---|---|
| univariate | /v1/univariate |
chronos-2, timesfm-2.5, moirai-2, toto-1, sundial-base-128m | context: float[series][time] |
quantiles per series |
| multivariate | /v1/multivariate |
chronos-2, moirai-2, toto-1 | context: float[series][channel][time] |
quantiles per (series, channel) |
| covariates — past only | /v1/covariates/past |
chronos-2, moirai-2 | univariate target + pastCovariates |
quantiles per series |
| covariates — future only | /v1/covariates/future |
chronos-2 | univariate target + futureCovariates |
quantiles per series |
| covariates — past + future | /v1/covariates |
chronos-2 | univariate target + both | quantiles per series |
| samples | /v1/samples |
toto-1, sundial-base-128m | univariate target + numSamples |
raw sample paths |
Every base URL exposes the same three sub-paths: <base>/forecast, <base>/forecast/ensemble, <base>/models.
| Endpoint | URL | Auth |
|---|---|---|
| Liveness probe | GET /predictalot/healthz |
none (open) |
| Per-type model listing | GET /predictalot/v1/<type>/models |
Bearer token |
| Per-type single-model forecast | POST /predictalot/v1/<type>/forecast |
Bearer token |
| Per-type weighted ensemble | POST /predictalot/v1/<type>/forecast/ensemble |
Bearer token |
| MCP (direct) | /predictalot/mcp |
Bearer token |
| MCP (aggregated) | /mcp/ (via LiteLLM master key) |
Master key |
curl -s http://localhost:4000/predictalot/v1/univariate/forecast \
-H "Authorization: Bearer $PREDICTALOT_AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "chronos-2",
"context": [[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]],
"config": {"horizon": 5}
}' | jq
# weighted ensemble — weight 0 disables a model, omitted entries default to 1
curl -s http://localhost:4000/predictalot/v1/univariate/forecast/ensemble \
-H "Authorization: Bearer $PREDICTALOT_AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"context": [[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]],
"config": {"horizon": 5},
"weights": {"chronos-2": 2.0, "moirai-2": 1.0, "timesfm-2.5": 0}
}' | jqThe ensemble response carries the weighted-mean forecast plus an individual map containing each contributing model's own forecast and applied weight, so dissent / post-processing is possible.
Models lazy-load on first request (each downloads ~50-800MB into the mounted models dir, subsequent calls are fast). Idle models are unloaded after PREDICTALOT_MODEL_IDLE_TIMEOUT (default 30m). The Sundial model runs in its own sidecar venv (it pins transformers==4.40.1 for upstream compatibility) — transparent over the wire.
CPU and CUDA variants are mutually exclusive (both bind the predictalot network alias) — pick one. CUDA variant requires nvidia-container-toolkit + --gpus configuration on the host.
| Variable | Default | Description |
|---|---|---|
PREDICTALOT_AUTH_TOKEN |
lulz-4-security-predictalot |
Bearer token for /predictalot/* + MCP |
PREDICTALOT_DEVICE |
auto |
auto / cpu / cuda / cuda:N |
PREDICTALOT_PREFETCH |
— | Comma-separated slugs (or all) to fetch weights at boot |
PREDICTALOT_PRELOAD |
— | Comma-separated slugs to load into memory at boot |
PREDICTALOT_MODEL_IDLE_TIMEOUT |
30m |
Idle time before a loaded model is unloaded (Go-style) |
PREDICTALOT_MAX_BODY_SIZE |
32mb |
Max request body |
PREDICTALOT_LOG_LEVEL |
INFO |
Python log level |
DATA_DIR_PREDICTALOT |
${DATA_DIR}/predictalot |
Where HF snapshots are stored (~1.4GB for all five models) |
RATELIMIT_PREDICTALOT |
60r/m |
Nginx rate limit |
RATELIMIT_PREDICTALOT_BURST |
20 |
Burst allowance |
TIMEOUT_PREDICTALOT |
600s |
Nginx proxy timeout (cold model loads can be slow) |
PREDICTALOT_MEM_LIMIT |
6g (CPU) / 12g (CUDA) |
Container memory limit |
PREDICTALOT_CPUS |
4.0 |
CPU limit |
See docker-predictalot README for full per-model quirks, request/response shapes, and accuracy benchmarks.
Stateless IMAP+SMTP gateway at /mailbox/. Backed by docker-mailbox — point it at N email accounts via a single YAML config and out the other end you get one HTTP API + one MCP server (streamable-HTTP at /mailbox/mcp), both on the same bearer token. Unified inbox across all accounts, per-account CRUD, and SMTP send. No database; every read hits the upstream IMAP server live.
Not registered with LiteLLM (no chat/completion surface). MCP is enabled by default via LiteLLM's /mcp/ aggregator with a flat tool set (mailbox-mailboxes, mailbox-inbox, mailbox-list_messages, mailbox-get_message, mailbox-search, mailbox-send, mailbox-mark_seen, mailbox-move, mailbox-delete, …) — every per-account op takes mailbox as a parameter, so 100 inboxes ship the same handful of tools.
| Endpoint | URL | Auth |
|---|---|---|
| Health | GET /mailbox/health |
open |
| List mailboxes | GET /mailbox/mailboxes |
Bearer token |
| Unified inbox | GET /mailbox/inbox |
Bearer token |
| Per-mailbox ops | GET/POST/DELETE /mailbox/<name>/… |
Bearer token |
| Send | POST /mailbox/<name>/send |
Bearer token |
| MCP (direct) | /mailbox/mcp |
Bearer token |
| MCP (aggregated) | /mcp/ (via LiteLLM master key) |
Master key |
- Copy
mailbox/config.example.yamlto a host path outside git history (recommended:.data/mailbox/config.yaml—.data/**is gitignored). - Fill in the IMAP/SMTP creds for each account.
- Put at least one bearer token in
auth.tokens:and mirror it asMAILBOX_AUTH_TOKENin.env(the MCP aggregator uses this to reach/mailbox/mcp). - Set
MAILBOX_CONFIGin.envto the absolute or repo-relative path of the YAML. MAILBOX=1in.env, thenmake run-bg.
curl -s http://localhost:4000/mailbox/mailboxes -H "Authorization: Bearer $MAILBOX_AUTH_TOKEN" | jq
curl -s "http://localhost:4000/mailbox/inbox?limit=5" -H "Authorization: Bearer $MAILBOX_AUTH_TOKEN" | jqThe config file holds plaintext IMAP/SMTP passwords + bearer tokens — treat it like a private key. Never commit. Makefile's check_file_vars validates MAILBOX_CONFIG resolves to an existing file before bringing the stack up.
| Variable | Default | Description |
|---|---|---|
MAILBOX_CONFIG |
— (required) | Host path to the mailbox YAML config |
MAILBOX_AUTH_TOKEN |
change-me-mailbox-auth |
Bearer token; MUST also appear in the YAML's auth.tokens: |
RATELIMIT_MAILBOX |
60r/m |
Nginx rate limit |
RATELIMIT_MAILBOX_BURST |
20 |
Burst allowance |
TIMEOUT_MAILBOX |
120s |
Nginx proxy timeout (IMAP fetches can be slow on large folders) |
MAILBOX_MEM_LIMIT |
256m |
Container memory limit |
MAILBOX_CPUS |
0.5 |
CPU limit |
See docker-mailbox README for the full config schema, query params (mailbox=, unseen=, from=, reader-mode HTML stripping, etc.), and the complete MCP tool list.
Disabled by default. Enable by setting CLOUDFLARED=1 in .env.
CLOUDFLARED=1Cloudflare assigns a random *.trycloudflare.com URL and logs it on startup:
docker compose up -d
docker compose logs cloudflared | grep trycloudflareCLOUDFLARED=1
CLOUDFLARED_CONFIG=/absolute/path/to/config.yml
CLOUDFLARED_CREDS=/absolute/path/to/credentials.jsonExample config.yml:
tunnel: <your-tunnel-id>
credentials-file: /etc/cloudflared/credentials.json
ingress:
- hostname: aigate.yourdomain.com
service: http://nginx:4000
- service: http_status:404Get your tunnel ID and credentials: Cloudflare Tunnel guide
Disabled by default. Runs the official tailscale/tailscale image with tailscale serve configured for L4 TCP forwarding to nginx on port 4000. Access is tailnet-only — no public exposure, no port forwarding, no Cloudflare in the middle.
L4 mode means tailscale forwards the raw TCP stream straight to nginx without inspecting the Host header. nginx sees the original request — including Host, paths, everything — exactly as the client sent it. No FQDN config needed on the tailscale side.
State is bind-mounted at ${DATA_DIR_TAILSCALE:-${DATA_DIR:-.data}/tailscale} so the node identity survives container recreates. After the first auth, the node stays logged in even if TS_AUTHKEY is rotated.
-
Generate an auth key at login.tailscale.com/admin/settings/keys (reusable + ephemeral both work).
-
Set:
TAILSCALE=1 TS_AUTHKEY=tskey-auth-... TS_HOSTNAME=aigate
-
make run-bg. The node joins your tailnet underTS_HOSTNAME. -
From any tailnet-joined device:
http://aigate.tailXXXX.ts.net→ aigate's nginx.
Find your tailnet name with docker compose exec tailscale tailscale status after first connect.
Add --login-server to TS_EXTRA_ARGS:
TAILSCALE=1
TS_AUTHKEY=hskey-auth-...
TS_HOSTNAME=aigate
TS_EXTRA_ARGS=--login-server=https://your-headscale.example.comThe FQDN your tailnet exposes (aigate.<base_domain>) is determined by your Headscale's base_domain setting — nothing to configure on the aigate side.
Default forward port is 80 (http://<host>.<tailnet>/). Change with TS_SERVE_PORT=8080 etc.
- L4 forwarding means HTTPS auto-cert (Tailscale's hosted ACME proxy) is not in play here — TLS termination, if you want it, lives in nginx. Easier to keep it as plain HTTP over the tailnet, which is already encrypted by WireGuard.
- The container needs
NET_ADMIN,NET_RAW, and/dev/net/tunfor kernel networking. - Forwarding sysctls (
net.ipv4.ip_forward=1,net.ipv6.conf.all.forwarding=1) are set so subnet-routing and exit-node modes work if you add--advertise-routes=...or--advertise-exit-nodeviaTS_EXTRA_ARGS. - Stays on the
aigate-publicnetwork so thenginx:4000upstream resolves via Docker DNS.
Local services (Ollama, sd.cpp, Speaches, Qwen3-TTS) share limited hardware. The platform coordinates them automatically — no manual model management needed.
Every local service unloads models after a period of inactivity:
| Service | Default idle timeout | Configurable via |
|---|---|---|
| Ollama (CPU/CUDA) | 5 minutes | Ollama's built-in keep_alive |
| sd.cpp CPU | 5 minutes | SDCPP_IDLE_TIMEOUT |
| sd.cpp CUDA | 5 minutes | SDCPP_CUDA_IDLE_TIMEOUT |
| Speaches | On-demand unload | Resource manager triggers DELETE /api/ps/{model} |
| Qwen3 CUDA TTS | On-demand unload | Resource manager triggers POST /unload |
Models load automatically when a request arrives. Send a chat completion to local-ollama-cuda-qwen3-8b and Ollama pulls/loads it. Send an image generation to local-sdcpp-cuda-flux-schnell and the sd.cpp wrapper spawns sd-server with that model. No pre-loading required.
A LiteLLM callback (resource_manager.py) enforces mutual exclusion per hardware class:
- CUDA semaphore — one CUDA job at a time across all groups: LLM (
cuda-llm), image gen (cuda-img), TTS (cuda-tts), STT (cuda-stt) - CPU semaphore — one CPU job at a time across: LLM (
cpu-llm), image gen (cpu-img), TTS (cpu-tts), STT (cpu-stt)
When a request arrives for a local model:
- The resource manager identifies which group it belongs to (e.g.
local-sdcpp-cuda-flux-schnell→cuda-img) - It acquires the hardware semaphore (waits if another job is running)
- It unloads all competing groups on the same hardware (e.g. unloads
cuda-llm,cuda-tts,cuda-stt) - The request proceeds
- On completion (success or failure), the semaphore is released
Each service has its own unload API:
| Service | Unload method |
|---|---|
| Ollama | POST /api/generate {"model": "...", "keep_alive": 0} |
| sd.cpp | POST /sdcpp/v1/unload |
| Speaches | DELETE /api/ps/{model_id} |
| Qwen3 CUDA TTS | POST /unload |
The sd.cpp wrapper uses TryLock — if a generation or model swap is in progress, new requests get 503 immediately instead of queuing. Scheduling happens at the LiteLLM layer via the semaphore, not inside individual services.