diff --git a/docs.json b/docs.json
index 49168dd..f75fe5d 100644
--- a/docs.json
+++ b/docs.json
@@ -93,6 +93,7 @@
"documentation/serverless/architecture",
"documentation/serverless/overview",
"documentation/serverless/SDKoverview",
+ "documentation/serverless/openai-compatible-api",
"documentation/serverless/automatedperformancetesting"
]
},
@@ -110,6 +111,7 @@
"group": "Monitoring and Debug",
"pages": [
"documentation/serverless/worker-states",
+ "documentation/serverless/zero-downtime-worker-update",
"documentation/serverless/logging"
]
},
diff --git a/documentation/serverless/openai-compatible-api.mdx b/documentation/serverless/openai-compatible-api.mdx
new file mode 100644
index 0000000..1973e05
--- /dev/null
+++ b/documentation/serverless/openai-compatible-api.mdx
@@ -0,0 +1,150 @@
+---
+title: OpenAI API-compatible Interface
+description: Use Vast.ai Serverless endpoints with the standard OpenAI API client by swapping your API key and base URL.
+canonical: "/documentation/serverless/openai-compatible-api"
+---
+
+
+
+Vast provides an OpenAI API-compatible proxy service that lets you point any application or library that works with the OpenAI API at a Vast Serverless vLLM endpoint instead. If your code already uses the OpenAI Python client (or any OpenAI-compatible HTTP client), you can switch to Vast by changing two values: the **API key** and the **base URL**.
+
+## Prerequisites
+
+- A Vast.ai account with a valid **API key**. You can find your key on the [Account page](https://cloud.vast.ai/account/).
+- An active Serverless endpoint running the **vLLM** template. See the [Quickstart](/documentation/serverless/quickstart) guide to create one.
+
+## How It Works
+
+Vast runs a lightweight proxy at `openai.vast.ai` that accepts requests in the OpenAI API format and routes them to your Serverless vLLM endpoint. Your client sends a standard OpenAI request, the proxy translates it into a Vast Serverless call, and the response is returned in the OpenAI format your client expects.
+
+This means frameworks and tools built on the OpenAI SDK — such as LangChain, LlamaIndex, or custom chat applications — can use Vast Serverless without any code changes beyond updating credentials.
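+
+For example, a LangChain application typically needs nothing more than new credentials. The snippet below is a sketch, assuming the `langchain-openai` package and a hypothetical endpoint named `my-endpoint`:
+
+```python
+# Sketch: pointing LangChain's OpenAI chat wrapper at the Vast proxy.
+# "my-endpoint" and the API key are placeholders for your own values.
+from langchain_openai import ChatOpenAI
+
+llm = ChatOpenAI(
+    api_key="<YOUR_VAST_API_KEY>",
+    base_url="https://openai.vast.ai/my-endpoint",
+    model="",  # ignored by the proxy; the endpoint decides the model
+)
+
+# LangChain sends a standard OpenAI chat request; the proxy routes it
+# to the vLLM worker behind the endpoint.
+print(llm.invoke("Say hello in one word.").content)
+```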
+
+## Migrating from OpenAI (or Another Provider)
+
+If you already have an application that calls the OpenAI API (or another OpenAI-compatible provider such as Together AI, Anyscale, or a self-hosted vLLM instance), migration requires only two changes:
+
+| Setting | Before | After |
+|---|---|---|
+| API Key | Your OpenAI / provider key | Your [Vast API key](https://cloud.vast.ai/account/) |
+| Base URL | `https://api.openai.com/v1` (or provider URL) | `https://openai.vast.ai/<ENDPOINT_NAME>` |
+
+Replace `<ENDPOINT_NAME>` with the name of your Serverless endpoint. No other code changes are required — the proxy accepts the same request and response schema for the supported endpoints.
+
+
+<Note>
+  The `model` field is required by the OpenAI SDK but is **ignored** by the proxy. The model served is determined entirely by the `MODEL_NAME` environment variable set in your vLLM endpoint configuration. You can pass any string (including an empty string) for this field.
+</Note>
+
+
+<Tabs>
+<Tab title="Python">
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="<YOUR_VAST_API_KEY>",
+    base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
+)
+
+response = client.chat.completions.create(
+    model="",  # model is determined by your endpoint configuration
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Explain serverless computing in two sentences."},
+    ],
+    max_tokens=256,
+    temperature=0.7,
+)
+
+print(response.choices[0].message.content)
+```
+
+</Tab>
+<Tab title="JavaScript">
+
+```javascript
+import OpenAI from "openai";
+
+const client = new OpenAI({
+  apiKey: "<YOUR_VAST_API_KEY>",
+  baseURL: "https://openai.vast.ai/<ENDPOINT_NAME>",
+});
+
+const response = await client.chat.completions.create({
+  model: "", // model is determined by your endpoint configuration
+  messages: [
+    { role: "system", content: "You are a helpful assistant." },
+    { role: "user", content: "Explain serverless computing in two sentences." },
+  ],
+  max_tokens: 256,
+  temperature: 0.7,
+});
+
+console.log(response.choices[0].message.content);
+```
+
+</Tab>
+<Tab title="cURL">
+
+```bash
+curl https://openai.vast.ai/<ENDPOINT_NAME>/v1/chat/completions \
+  -H "Authorization: Bearer <YOUR_VAST_API_KEY>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "",
+    "messages": [
+      {"role": "system", "content": "You are a helpful assistant."},
+      {"role": "user", "content": "Explain serverless computing in two sentences."}
+    ],
+    "max_tokens": 256,
+    "temperature": 0.7
+  }'
+```
+
+</Tab>
+</Tabs>
+
+## Supported Endpoints
+
+The proxy supports the following OpenAI-compatible endpoints exposed by vLLM:
+
+| Endpoint | Description |
+|---|---|
+| `/v1/chat/completions` | Multi-turn conversational completions |
+| `/v1/completions` | Single-prompt text completions |
+
+Both endpoints support streaming (`"stream": true`).
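+
+As a sketch of streaming through the proxy (placeholder endpoint name and key; the empty-delta guard reflects the chunked-prefill behavior noted under Limitations):
+
+```python
+# Sketch: streaming a chat completion token-by-token via the proxy.
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="<YOUR_VAST_API_KEY>",
+    base_url="https://openai.vast.ai/my-endpoint",
+)
+
+stream = client.chat.completions.create(
+    model="",  # ignored; the endpoint decides the model
+    messages=[{"role": "user", "content": "Count from 1 to 5."}],
+    stream=True,
+)
+
+for chunk in stream:
+    # Some chunks may carry no choices or an empty delta; guard before printing.
+    if chunk.choices and chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end="", flush=True)
+print()
+```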
+
+For detailed request/response schemas and parameters, see the [vLLM template documentation](/documentation/serverless/vllm).
+
+## Limitations
+
+
+<Warning>
+  The OpenAI-compatible proxy is designed for **text-in, text-out** workloads only. Review the limitations below before integrating.
+</Warning>
+
+### Text only
+
+The proxy supports **text inputs and text outputs** only. The following OpenAI features are **not** supported:
+
+- **Vision / image inputs** — Passing images via `image_url` in message content is not supported.
+- **Audio inputs and outputs** — The `/v1/audio` endpoints (speech, transcription, translation) are not available.
+- **Image generation** — The `/v1/images` endpoint is not available.
+- **Embeddings** — The `/v1/embeddings` endpoint is not available through the proxy.
+
+### vLLM-specific differences from the OpenAI specification
+
+Because the proxy routes to a vLLM backend rather than OpenAI's own service, there are inherent differences between the two:
+
+- **Tokenization** — Token counts may differ from OpenAI models because vLLM uses the tokenizer bundled with the open-source model (e.g., Qwen, Llama). This can affect billing estimates and `max_tokens` behavior.
+- **Streaming chunk boundaries** — While the proxy uses the same Server-Sent Events (SSE) format, the exact boundaries of streamed chunks may differ. Some chunks may contain empty strings when chunked prefill is enabled.
+- **Tool / function calling** — Tool calling is supported on models that are fine-tuned for it, but behavior may differ from OpenAI's implementation. The `parallel_tool_calls` parameter is not supported. See the [vLLM template documentation](/documentation/serverless/vllm) for details.
+- **Unsupported parameters** — The following request parameters are accepted but ignored: `user`, `suffix`, and `image_url.detail`.
+- **Response fields** — vLLM may return additional fields not present in the OpenAI specification (e.g., `kv_transfer_params`). Standard OpenAI client libraries will safely ignore these.
+- **Moderation** — No content moderation layer is applied. OpenAI's `/v1/moderations` endpoint is not available.
diff --git a/documentation/serverless/zero-downtime-worker-update.md b/documentation/serverless/zero-downtime-worker-update.md
new file mode 100644
index 0000000..103ab8a
--- /dev/null
+++ b/documentation/serverless/zero-downtime-worker-update.md
@@ -0,0 +1,63 @@
+---
+title: Zero Downtime Worker Update
+description: Update your Serverless template or model without dropping in-flight requests.
+canonical: "/documentation/serverless/zero-downtime-worker-update"
+---
+
+
+
+When you need to change the model or template behind a live Serverless endpoint, Vast can perform a **rolling update** that transitions every worker to the new configuration without dropping in-flight requests. From your users' perspective there is no downtime — existing requests complete normally, and new requests are routed to updated workers as they become available.
+
+## When to Use This
+
+A zero downtime update is useful any time you need to change the backend of a live endpoint, for example:
+
+- Upgrading to a newer version of your template.
+- Switching to a different model (e.g., moving from `Qwen/Qwen3-8B` to `Qwen/Qwen3-14B`).
+- Adjusting vLLM launch arguments or other environment variables in the template.
+- Adding or changing the search filter on your worker group.
+
+## How to Trigger an Update
+
+The process requires two steps:
+
+### 1. Update your template
+
+Modify the template that your endpoint uses. This could involve changing the `MODEL_NAME`, updating `VLLM_ARGS`, or selecting an entirely new template version.
+
+### 2. Update the worker group configuration
+
+Once the template is saved, update your worker group to reference the new template. This signals Vast to begin the rolling update.
+
+
+<Note>
+  After you complete these two steps, Vast handles the rest automatically. No additional action is required on your part.
+</Note>
+
+## What Happens During the Update
+
+Vast orchestrates the transition across your worker group in the following sequence:
+
+1. **Inactive workers become active and update** — Any inactive workers are brought into an active state, updated to the new template and model configuration, and made available for requests.
+2. **Active workers finish existing tasks first** — Workers that are currently active and handling requests are allowed to complete all of their in-flight tasks before updating. Once an active worker finishes its current work, it updates to the new configuration and rejoins the pool.
+3. **New requests route to updated workers** — As updated workers come online, incoming requests are directed to them. This continues until every worker in the group is running the new configuration.
+
+Because active workers are never interrupted mid-request, no responses are dropped or truncated during the rollout.
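+
+The ordering above can be sketched as a simple state transition. This is illustrative pseudologic only, not Vast's actual scheduler:
+
+```python
+# Illustrative sketch of the rolling-update ordering: inactive workers
+# update first; active workers drain their in-flight work, then update.
+def rolling_update(workers):
+    """Each worker is a dict with 'state' ('active' or 'inactive') and
+    'in_flight' (count of requests still being processed)."""
+    # Step 1: inactive workers update immediately and join the active pool.
+    for w in workers:
+        if w["state"] == "inactive":
+            w["state"] = "active"
+            w["version"] = "new"
+    # Step 2: remaining (active) workers finish in-flight requests, then update.
+    for w in workers:
+        if w.get("version") != "new":
+            w["in_flight"] = 0  # drain: existing requests complete normally
+            w["version"] = "new"
+    return workers
+
+pool = [
+    {"state": "inactive", "in_flight": 0},
+    {"state": "active", "in_flight": 3},
+]
+rolling_update(pool)
+assert all(w["version"] == "new" and w["in_flight"] == 0 for w in pool)
+```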
+
+## Best Practices
+
+- **Schedule updates during low-traffic periods** — While the update process is designed to be seamless, performing it during a period of stable, low traffic reduces the number of in-flight requests that need to drain and shortens the overall transition window.
+- **Verify the new template independently** — Before triggering a rolling update on a production endpoint, consider testing the new template on a separate endpoint to confirm that the model loads correctly and produces the expected output.
+- **Monitor during the rollout** — Keep an eye on your endpoint's request latency and error rate while the update is in progress. A brief increase in latency is normal as the worker pool transitions, but errors may indicate a problem with the new configuration.