2 changes: 2 additions & 0 deletions docs.json
@@ -93,6 +93,7 @@
"documentation/serverless/architecture",
"documentation/serverless/overview",
"documentation/serverless/SDKoverview",
"documentation/serverless/openai-compatible-api",
"documentation/serverless/automatedperformancetesting"
]
},
@@ -110,6 +111,7 @@
"group": "Monitoring and Debug",
"pages": [
"documentation/serverless/worker-states",
"documentation/serverless/zero-downtime-worker-update",
"documentation/serverless/logging"
]
},
150 changes: 150 additions & 0 deletions documentation/serverless/openai-compatible-api.mdx
@@ -0,0 +1,150 @@
---
title: OpenAI API-compatible Interface
description: Use Vast.ai Serverless endpoints with the standard OpenAI API client by swapping your API key and base URL.
"canonical": "/documentation/serverless/openai-compatible-api"
---

<script type="application/ld+json" dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "OpenAI API-compatible Interface for Vast.ai Serverless",
"description": "Use Vast.ai Serverless endpoints with the standard OpenAI API client. Swap your API key and base URL to route requests through Vast's OpenAI-compatible proxy for vLLM endpoints.",
"author": {
"@type": "Organization",
"name": "Vast.ai"
},
"articleSection": "Serverless Documentation",
"keywords": ["OpenAI", "API", "compatible", "proxy", "vLLM", "serverless", "vast.ai", "LLM", "inference"]
})
}} />

Vast provides an OpenAI API-compatible proxy service that lets you point any application or library that works with the OpenAI API at a Vast Serverless vLLM endpoint instead. If your code already uses the OpenAI Python client (or any OpenAI-compatible HTTP client), you can switch to Vast by changing two values: the **API key** and the **base URL**.

## Prerequisites

- A Vast.ai account with a valid **API key**. You can find your key on the [Account page](https://cloud.vast.ai/account/).
- An active Serverless endpoint running the **vLLM** template. See the [Quickstart](/documentation/serverless/quickstart) guide to create one.

## How It Works

Vast runs a lightweight proxy at `openai.vast.ai` that accepts requests in the OpenAI API format and routes them to your Serverless vLLM endpoint. Your client sends a standard OpenAI request, the proxy translates it into a Vast Serverless call, and the response is returned in the OpenAI format your client expects.

This means frameworks and tools built on the OpenAI SDK — such as LangChain, LlamaIndex, or custom chat applications — can use Vast Serverless without any code changes beyond updating credentials.

## Migrating from OpenAI (or Another Provider)

If you already have an application that calls the OpenAI API (or another OpenAI-compatible provider such as Together AI, Anyscale, or a self-hosted vLLM instance), migration requires only two changes:

| Setting | Before | After |
|---|---|---|
| API Key | Your OpenAI / provider key | Your [Vast API key](https://cloud.vast.ai/account/) |
| Base URL | `https://api.openai.com/v1` (or provider URL) | `https://openai.vast.ai/<ENDPOINT_NAME>` |

Replace `<ENDPOINT_NAME>` with the name of your Serverless endpoint. No other code changes are required — the proxy accepts the same request and response schema for the supported endpoints.

<Note>
The `model` field is required by the OpenAI SDK but is **ignored** by the proxy. The model served is determined entirely by the `MODEL_NAME` environment variable set in your vLLM endpoint configuration. You can pass any string (including an empty string) for this field.
</Note>

<Tabs>
<Tab title="Python (OpenAI SDK)">
```python
from openai import OpenAI

client = OpenAI(
api_key="<YOUR_VAST_API_KEY>",
base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
)

response = client.chat.completions.create(
model="", # model is determined by your endpoint configuration
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain serverless computing in two sentences."},
],
max_tokens=256,
temperature=0.7,
)

print(response.choices[0].message.content)
```
</Tab>
<Tab title="JavaScript (OpenAI SDK)">
```javascript
import OpenAI from "openai";

const client = new OpenAI({
apiKey: "<YOUR_VAST_API_KEY>",
baseURL: "https://openai.vast.ai/<ENDPOINT_NAME>",
});

const response = await client.chat.completions.create({
model: "", // model is determined by your endpoint configuration
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain serverless computing in two sentences." },
],
max_tokens: 256,
temperature: 0.7,
});

console.log(response.choices[0].message.content);
```
</Tab>
<Tab title="cURL">
```bash
curl https://openai.vast.ai/<ENDPOINT_NAME>/v1/chat/completions \
-H "Authorization: Bearer <YOUR_VAST_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain serverless computing in two sentences."}
],
"max_tokens": 256,
"temperature": 0.7
}'
```
</Tab>
</Tabs>

## Supported Endpoints

The proxy supports the following OpenAI-compatible endpoints exposed by vLLM:

| Endpoint | Description |
|---|---|
| `/v1/chat/completions` | Multi-turn conversational completions |
| `/v1/completions` | Single-prompt text completions |

Both endpoints support streaming (`"stream": true`).
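Streaming uses the same client setup as the examples above. The sketch below is illustrative (the API key and endpoint name are placeholders): it defines a small helper that accumulates streamed text deltas while skipping chunks that carry no content, such as the empty chunks emitted when chunked prefill is enabled.

```python
def collect_stream(stream):
    """Accumulate text deltas from a chat-completion stream.

    Skips chunks whose delta has no content (role-only chunks, or the
    empty-string chunks that can appear when chunked prefill is enabled).
    """
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key="<YOUR_VAST_API_KEY>",
        base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
    )
    stream = client.chat.completions.create(
        model="",  # ignored by the proxy; MODEL_NAME on the endpoint decides
        messages=[{"role": "user", "content": "Explain serverless computing in two sentences."}],
        stream=True,
    )
    print(collect_stream(stream))
```

For interactive use you would typically print each delta as it arrives rather than collecting them, but the filtering logic is the same.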

For detailed request/response schemas and parameters, see the [vLLM template documentation](/documentation/serverless/vllm).

## Limitations

<Warning>
The OpenAI-compatible proxy is designed for **text-in, text-out** workloads only. Review the limitations below before integrating.
</Warning>

### Text only

The proxy supports **text inputs and text outputs** only. The following OpenAI features are **not** supported:

- **Vision / image inputs** — Passing images via `image_url` in message content is not supported.
- **Audio inputs and outputs** — The `/v1/audio` endpoints (speech, transcription, translation) are not available.
- **Image generation** — The `/v1/images` endpoint is not available.
- **Embeddings** — The `/v1/embeddings` endpoint is not available through the proxy.

### vLLM-specific differences from the OpenAI specification

Because the proxy routes to a vLLM backend rather than OpenAI's own service, there are inherent differences between the two:

- **Tokenization** — Token counts may differ from OpenAI models because vLLM uses the tokenizer bundled with the open-source model (e.g., Qwen, Llama). This can affect billing estimates and `max_tokens` behavior.
- **Streaming chunk boundaries** — While the proxy uses the same Server-Sent Events (SSE) format, the exact boundaries of streamed chunks may differ. Some chunks may contain empty strings when chunked prefill is enabled.
- **Tool / function calling** — Tool calling is supported on models that are fine-tuned for it, but behavior may differ from OpenAI's implementation. The `parallel_tool_calls` parameter is not supported. See the [vLLM template documentation](/documentation/serverless/vllm) for details.
- **Unsupported parameters** — The following request parameters are accepted but ignored: `user`, `suffix`, and `image_url.detail`.
- **Response fields** — vLLM may return additional fields not present in the OpenAI specification (e.g., `kv_transfer_params`). Standard OpenAI client libraries will safely ignore these.
- **Moderation** — No content moderation layer is applied. OpenAI's `/v1/moderations` endpoint is not available.
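If you are migrating an existing application, it can help to catch unsupported inputs before they reach the proxy. The helper below is a sketch, not part of the Vast API: the function name and the exact checks are derived only from the limitations listed above (image inputs rejected; `user`, `suffix`, and `parallel_tool_calls` stripped because the proxy ignores or does not support them).

```python
# Parameters the proxy ignores (`user`, `suffix`) or does not support
# (`parallel_tool_calls`), per the limitations above.
IGNORED_PARAMS = {"user", "suffix", "parallel_tool_calls"}

def preflight(messages, **params):
    """Client-side sanity check against the proxy's limitations.

    Raises ValueError on image inputs, strips parameters the proxy
    will not honor, and returns (kept_params, dropped_param_names).
    """
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # OpenAI multi-part message content
            for part in content:
                if part.get("type") == "image_url":
                    raise ValueError("image inputs are not supported by the proxy")
    kept = {k: v for k, v in params.items() if k not in IGNORED_PARAMS}
    dropped = sorted(set(params) - set(kept))
    return kept, dropped
```

You could log `dropped` during migration to see which parts of your existing request payloads will silently stop having an effect.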
63 changes: 63 additions & 0 deletions documentation/serverless/zero-downtime-worker-update.md
@@ -0,0 +1,63 @@
---
title: Zero Downtime Worker Update
description: Update your Serverless template or model without dropping in-flight requests.
"canonical": "/documentation/serverless/zero-downtime-worker-update"
---

<script type="application/ld+json" dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "Zero Downtime Worker Update for Vast.ai Serverless",
"description": "Learn how to update your Serverless template or model configuration without dropping in-flight requests. Vast orchestrates a graceful rolling update across your worker group automatically.",
"author": {
"@type": "Organization",
"name": "Vast.ai"
},
"articleSection": "Serverless Documentation",
"keywords": ["zero downtime", "rolling update", "worker", "template", "model", "serverless", "vast.ai", "graceful", "migration"]
})
}} />

When you need to change the model or template behind a live Serverless endpoint, Vast can perform a **rolling update** that transitions every worker to the new configuration without dropping in-flight requests. From your users' perspective there is no downtime — existing requests complete normally, and new requests are routed to updated workers as they become available.

## When to Use This

A zero downtime update is useful any time you need to change the backend of a live endpoint, for example:

- Upgrading to a newer version of your template.
- Switching to a different model (e.g., moving from `Qwen/Qwen3-8B` to `Qwen/Qwen3-14B`).
- Adjusting vLLM launch arguments or other environment variables in the template.
- Adding or changing the search filter on your worker group.

## How to Trigger an Update

The process requires two steps:

### 1. Update your template

Modify the template that your endpoint uses. This could involve changing the `MODEL_NAME`, updating `VLLM_ARGS`, or selecting an entirely new template version.

### 2. Update the worker group configuration

Once the template is saved, update your worker group to reference the new template. This signals Vast to begin the rolling update.

<Note>
After you complete these two steps, Vast handles the rest automatically. No additional action is required on your part.
</Note>

## What Happens During the Update

Vast orchestrates the transition across your worker group in the following sequence:

1. **Inactive workers become active and update** — Any inactive workers are brought into an active state, updated to the new template and model configuration, and made available for requests.
2. **Active workers finish existing tasks first** — Workers that are currently active and handling requests are allowed to complete all of their in-flight tasks before updating. Once an active worker finishes its current work, it updates to the new configuration and rejoins the pool.
3. **New requests route to updated workers** — As updated workers come online, incoming requests are directed to them. This continues until every worker in the group is running the new configuration.

Because active workers are never interrupted mid-request, no responses are dropped or truncated during the rollout.

## Best Practices

- **Schedule updates during low-traffic periods** — While the update process is designed to be seamless, performing it during a period of stable, low traffic reduces the number of in-flight requests that need to drain and shortens the overall transition window.
- **Verify the new template independently** — Before triggering a rolling update on a production endpoint, consider testing the new template on a separate endpoint to confirm that the model loads correctly and produces the expected output.
- **Monitor during the rollout** — Keep an eye on your endpoint's request latency and error rate while the update is in progress. A brief increase in latency is normal as the worker pool transitions, but errors may indicate a problem with the new configuration.
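The monitoring advice above can be automated with a simple probe loop. The sketch below is an assumption-laden example, not an official tool: the endpoint URL and API key are placeholders, and the probe count and cadence are arbitrary. It samples request latency and success against the endpoint and summarizes the error rate and 95th-percentile latency.

```python
def summarize(samples):
    """Summarize probe results.

    samples: list of (latency_seconds, ok) tuples, one per probe request.
    Returns the error rate and the 95th-percentile latency.
    """
    if not samples:
        return {"error_rate": 0.0, "p95_latency": 0.0}
    latencies = sorted(lat for lat, _ in samples)
    errors = sum(1 for _, ok in samples if not ok)
    # Nearest-rank index for the 95th percentile.
    idx = min(len(latencies) - 1, int(round(0.95 * (len(latencies) - 1))))
    return {
        "error_rate": errors / len(samples),
        "p95_latency": latencies[idx],
    }

if __name__ == "__main__":
    import time
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key="<YOUR_VAST_API_KEY>",
        base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
    )
    samples = []
    for _ in range(20):  # arbitrary probe count and cadence
        start = time.monotonic()
        try:
            client.chat.completions.create(
                model="",  # ignored by the proxy
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=8,
            )
            ok = True
        except Exception:
            ok = False
        samples.append((time.monotonic() - start, ok))
        time.sleep(5)
    print(summarize(samples))
```

A rising but bounded p95 during the rollout is consistent with workers draining and updating; a nonzero error rate that persists after the rollout completes suggests a problem with the new template.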