2 changes: 2 additions & 0 deletions docs.json
@@ -93,6 +93,7 @@
"documentation/serverless/architecture",
"documentation/serverless/overview",
"documentation/serverless/SDKoverview",
"documentation/serverless/openai-compatible-api",
"documentation/serverless/automatedperformancetesting"
]
},
@@ -110,6 +111,7 @@
"group": "Monitoring and Debug",
"pages": [
"documentation/serverless/worker-states",
"documentation/serverless/zero-downtime-worker-update",
"documentation/serverless/logging"
]
},
150 changes: 150 additions & 0 deletions documentation/serverless/openai-compatible-api.mdx
@@ -0,0 +1,150 @@
---
title: OpenAI API-compatible Interface
description: Use Vast.ai Serverless endpoints with the standard OpenAI API client by swapping your API key and base URL.
"canonical": "/documentation/serverless/openai-compatible-api"
---

<script type="application/ld+json" dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "OpenAI API-compatible Interface for Vast.ai Serverless",
"description": "Use Vast.ai Serverless endpoints with the standard OpenAI API client. Swap your API key and base URL to route requests through Vast's OpenAI-compatible proxy for vLLM endpoints.",
"author": {
"@type": "Organization",
"name": "Vast.ai"
},
"articleSection": "Serverless Documentation",
"keywords": ["OpenAI", "API", "compatible", "proxy", "vLLM", "serverless", "vast.ai", "LLM", "inference"]
})
}} />

Vast provides an OpenAI API-compatible proxy service that lets you point any application or library that works with the OpenAI API at a Vast Serverless vLLM endpoint instead. If your code already uses the OpenAI Python client (or any OpenAI-compatible HTTP client), you can switch to Vast by changing two values: the **API key** and the **base URL**.

## Prerequisites

- A Vast.ai account with a valid **API key**. You can find your key on the [Account page](https://cloud.vast.ai/account/).
- An active Serverless endpoint running the **vLLM** template. See the [Quickstart](/documentation/serverless/quickstart) guide to create one.

## How It Works

Vast runs a lightweight proxy at `openai.vast.ai` that accepts requests in the OpenAI API format and routes them to your Serverless vLLM endpoint. Your client sends a standard OpenAI request, the proxy translates it into a Vast Serverless call, and the response is returned in the OpenAI format your client expects.

This means frameworks and tools built on the OpenAI SDK — such as LangChain, LlamaIndex, or custom chat applications — can use Vast Serverless without any code changes beyond updating credentials.

## Migrating from OpenAI (or Another Provider)

If you already have an application that calls the OpenAI API (or another OpenAI-compatible provider such as Together AI, Anyscale, or a self-hosted vLLM instance), migration requires only two changes:

| Setting | Before | After |
|---|---|---|
| API Key | Your OpenAI / provider key | Your [Vast API key](https://cloud.vast.ai/account/) |
| Base URL | `https://api.openai.com/v1` (or provider URL) | `https://openai.vast.ai/<ENDPOINT_NAME>` |

Replace `<ENDPOINT_NAME>` with the name of your Serverless endpoint. No other code changes are required — the proxy accepts the same request and response schema for the supported endpoints.

<Note>
The `model` field is required by the OpenAI SDK but is **ignored** by the proxy. The model served is determined entirely by the `MODEL_NAME` environment variable set in your vLLM endpoint configuration. You can pass any string (including an empty string) for this field.
</Note>

<Tabs>
<Tab title="Python (OpenAI SDK)">
```python
from openai import OpenAI

client = OpenAI(
api_key="<YOUR_VAST_API_KEY>",
base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
)

response = client.chat.completions.create(
model="", # model is determined by your endpoint configuration
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain serverless computing in two sentences."},
],
max_tokens=256,
temperature=0.7,
)

print(response.choices[0].message.content)
```
</Tab>
<Tab title="JavaScript (OpenAI SDK)">
```javascript
import OpenAI from "openai";

const client = new OpenAI({
apiKey: "<YOUR_VAST_API_KEY>",
baseURL: "https://openai.vast.ai/<ENDPOINT_NAME>",
});

const response = await client.chat.completions.create({
model: "", // model is determined by your endpoint configuration
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain serverless computing in two sentences." },
],
max_tokens: 256,
temperature: 0.7,
});

console.log(response.choices[0].message.content);
```
</Tab>
<Tab title="cURL">
```bash
curl https://openai.vast.ai/<ENDPOINT_NAME>/v1/chat/completions \
-H "Authorization: Bearer <YOUR_VAST_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain serverless computing in two sentences."}
],
"max_tokens": 256,
"temperature": 0.7
}'
```
</Tab>
</Tabs>

## Supported Endpoints

The proxy supports the following OpenAI-compatible endpoints exposed by vLLM:

| Endpoint | Description |
|---|---|
| `/v1/chat/completions` | Multi-turn conversational completions |
| `/v1/completions` | Single-prompt text completions |

Both endpoints support streaming (`"stream": true`).
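Streaming uses the same client setup as the examples above. The sketch below is illustrative (the API key and endpoint name are placeholders): it defines a small helper that accumulates streamed text deltas while skipping chunks that carry no content, such as the empty chunks emitted when chunked prefill is enabled.

```python
def collect_stream(stream):
    """Accumulate text deltas from a chat-completion stream.

    Skips chunks whose delta has no content (role-only chunks, or the
    empty-string chunks that can appear when chunked prefill is enabled).
    """
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key="<YOUR_VAST_API_KEY>",
        base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
    )
    stream = client.chat.completions.create(
        model="",  # ignored by the proxy; MODEL_NAME on the endpoint decides
        messages=[{"role": "user", "content": "Explain serverless computing in two sentences."}],
        stream=True,
    )
    print(collect_stream(stream))
```

For interactive use you would typically print each delta as it arrives rather than collecting them, but the filtering logic is the same.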

For detailed request/response schemas and parameters, see the [vLLM template documentation](/documentation/serverless/vllm).

## Limitations

<Warning>
The OpenAI-compatible proxy is designed for **text-in, text-out** workloads only. Review the limitations below before integrating.
</Warning>

### Text only

The proxy supports **text inputs and text outputs** only. The following OpenAI features are **not** supported:

- **Vision / image inputs** — Passing images via `image_url` in message content is not supported.
- **Audio inputs and outputs** — The `/v1/audio` endpoints (speech, transcription, translation) are not available.
- **Image generation** — The `/v1/images` endpoint is not available.
- **Embeddings** — The `/v1/embeddings` endpoint is not available through the proxy.

### vLLM-specific differences from the OpenAI specification

Because the proxy routes to a vLLM backend rather than OpenAI's own service, there are inherent differences between the two:

- **Tokenization** — Token counts may differ from OpenAI models because vLLM uses the tokenizer bundled with the open-source model (e.g., Qwen, Llama). This can affect billing estimates and `max_tokens` behavior.
- **Streaming chunk boundaries** — While the proxy uses the same Server-Sent Events (SSE) format, the exact boundaries of streamed chunks may differ. Some chunks may contain empty strings when chunked prefill is enabled.
- **Tool / function calling** — Tool calling is supported on models that are fine-tuned for it, but behavior may differ from OpenAI's implementation. The `parallel_tool_calls` parameter is not supported. See the [vLLM template documentation](/documentation/serverless/vllm) for details.
- **Unsupported parameters** — The following request parameters are accepted but ignored: `user`, `suffix`, and `image_url.detail`.
- **Response fields** — vLLM may return additional fields not present in the OpenAI specification (e.g., `kv_transfer_params`). Standard OpenAI client libraries will safely ignore these.
- **Moderation** — No content moderation layer is applied. OpenAI's `/v1/moderations` endpoint is not available.
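If you are migrating an existing application, it can help to catch unsupported inputs before they reach the proxy. The helper below is a sketch, not part of the Vast API: the function name and the exact checks are derived only from the limitations listed above (image inputs rejected; `user`, `suffix`, and `parallel_tool_calls` stripped because the proxy ignores or does not support them).

```python
# Parameters the proxy ignores (`user`, `suffix`) or does not support
# (`parallel_tool_calls`), per the limitations above.
IGNORED_PARAMS = {"user", "suffix", "parallel_tool_calls"}

def preflight(messages, **params):
    """Client-side sanity check against the proxy's limitations.

    Raises ValueError on image inputs, strips parameters the proxy
    will not honor, and returns (kept_params, dropped_param_names).
    """
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # OpenAI multi-part message content
            for part in content:
                if part.get("type") == "image_url":
                    raise ValueError("image inputs are not supported by the proxy")
    kept = {k: v for k, v in params.items() if k not in IGNORED_PARAMS}
    dropped = sorted(set(params) - set(kept))
    return kept, dropped
```

You could log `dropped` during migration to see which parts of your existing request payloads will silently stop having an effect.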
63 changes: 63 additions & 0 deletions documentation/serverless/zero-downtime-worker-update.md
@@ -0,0 +1,63 @@
---
title: Zero Downtime Worker Update
description: Update your Serverless template or model without dropping in-flight requests.
"canonical": "/documentation/serverless/zero-downtime-worker-update"
---

<script type="application/ld+json" dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "Zero Downtime Worker Update for Vast.ai Serverless",
"description": "Learn how to update your Serverless template or model configuration without dropping in-flight requests. Vast orchestrates a graceful rolling update across your worker group automatically.",
"author": {
"@type": "Organization",
"name": "Vast.ai"
},
"articleSection": "Serverless Documentation",
"keywords": ["zero downtime", "rolling update", "worker", "template", "model", "serverless", "vast.ai", "graceful", "migration"]
})
}} />

When you need to change the model or template behind a live Serverless endpoint, Vast can perform a **rolling update** that transitions every worker to the new configuration without dropping in-flight requests. From your users' perspective there is no downtime — existing requests complete normally, and new requests are routed to updated workers as they become available.

## When to Use This

A zero downtime update is useful any time you need to change the backend of a live endpoint, for example:

- Upgrading to a newer version of your template.
- Switching to a different model (e.g., moving from `Qwen/Qwen3-8B` to `Qwen/Qwen3-14B`).
- Adjusting vLLM launch arguments or other environment variables in the template.
- Adding or changing the search filter on your worker group.

## How to Trigger an Update

The process requires two steps:

### 1. Update your template

Modify the template that your endpoint uses. This could involve changing the `MODEL_NAME`, updating `VLLM_ARGS`, or selecting an entirely new template version.

### 2. Update the worker group configuration

Once the template is saved, update your worker group to reference the new template. This signals Vast to begin the rolling update.

<Note>
After you complete these two steps, Vast handles the rest automatically. No additional action is required on your part.
</Note>

## What Happens During the Update

Vast orchestrates the transition across your worker group in the following sequence:

1. **Inactive workers become active and update** — Any inactive workers are brought into an active state, updated to the new template and model configuration, and made available for requests.
2. **Active workers finish existing tasks first** — Workers that are currently active and handling requests are allowed to complete all of their in-flight tasks before updating. Once an active worker finishes its current work, it updates to the new configuration and rejoins the pool.
3. **New requests route to updated workers** — As updated workers come online, incoming requests are directed to them. This continues until every worker in the group is running the new configuration.

Because active workers are never interrupted mid-request, no responses are dropped or truncated during the rollout.

## Best Practices

- **Schedule updates during low-traffic periods** — While the update process is designed to be seamless, performing it during a period of stable, low traffic reduces the number of in-flight requests that need to drain and shortens the overall transition window.
- **Verify the new template independently** — Before triggering a rolling update on a production endpoint, consider testing the new template on a separate endpoint to confirm that the model loads correctly and produces the expected output.
- **Monitor during the rollout** — Keep an eye on your endpoint's request latency and error rate while the update is in progress. A brief increase in latency is normal as the worker pool transitions, but errors may indicate a problem with the new configuration.
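The monitoring advice above can be automated with a simple probe loop. The sketch below is an assumption-laden example, not an official tool: the endpoint URL and API key are placeholders, and the probe count and cadence are arbitrary. It samples request latency and success against the endpoint and summarizes the error rate and 95th-percentile latency.

```python
def summarize(samples):
    """Summarize probe results.

    samples: list of (latency_seconds, ok) tuples, one per probe request.
    Returns the error rate and the 95th-percentile latency.
    """
    if not samples:
        return {"error_rate": 0.0, "p95_latency": 0.0}
    latencies = sorted(lat for lat, _ in samples)
    errors = sum(1 for _, ok in samples if not ok)
    # Nearest-rank index for the 95th percentile.
    idx = min(len(latencies) - 1, int(round(0.95 * (len(latencies) - 1))))
    return {
        "error_rate": errors / len(samples),
        "p95_latency": latencies[idx],
    }

if __name__ == "__main__":
    import time
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key="<YOUR_VAST_API_KEY>",
        base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
    )
    samples = []
    for _ in range(20):  # arbitrary probe count and cadence
        start = time.monotonic()
        try:
            client.chat.completions.create(
                model="",  # ignored by the proxy
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=8,
            )
            ok = True
        except Exception:
            ok = False
        samples.append((time.monotonic() - start, ok))
        time.sleep(5)
    print(summarize(samples))
```

A rising but bounded p95 during the rollout is consistent with workers draining and updating; a nonzero error rate that persists after the rollout completes suggests a problem with the new template.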