3 changes: 2 additions & 1 deletion docs.json
Expand Up @@ -229,7 +229,8 @@
"quantized-gguf-models-cloned",
"vllm-llm-inference-and-serving",
"examples/text-generation/minimax-m2",
"examples/text-generation/glm-47-flash"
"examples/text-generation/glm-47-flash",
"examples/text-generation/nemotron-3-super"
]
},
{
321 changes: 321 additions & 0 deletions examples/text-generation/nemotron-3-super.mdx
@@ -0,0 +1,321 @@
---
title: NVIDIA Nemotron 3 Super
slug: nemotron-3-super-deployment-vast
createdAt: Wed Mar 12 2026 20:00:00 GMT+0000 (Coordinated Universal Time)
updatedAt: Wed Mar 12 2026 20:00:00 GMT+0000 (Coordinated Universal Time)
---

<script type="application/ld+json" dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "HowTo",
"name": "Deploy NVIDIA Nemotron 3 Super on Vast.ai",
"description": "Learn how to deploy the NVIDIA Nemotron 3 Super 120B hybrid Mamba-Transformer-MoE model on Vast.ai using SGLang with toggleable reasoning.",
"step": [
{
"@type": "HowToStep",
"name": "Set up your Vast.ai account",
"text": "Create a Vast.ai account and configure your API key using the CLI."
},
{
"@type": "HowToStep",
"name": "Find a suitable instance",
"text": "Search for 2x H100 SXM instances with 80GB VRAM, CUDA 12.4+, and 150GB+ disk space."
},
{
"@type": "HowToStep",
"name": "Deploy with SGLang",
"text": "Launch an instance with the lmsysorg/sglang:v0.5.9 Docker image and the FP8 model checkpoint."
},
{
"@type": "HowToStep",
"name": "Wait for model loading",
"text": "Monitor logs until the server reports ready."
},
{
"@type": "HowToStep",
"name": "Query the API",
"text": "Send requests to the OpenAI-compatible API endpoint with reasoning mode control via chat_template_kwargs."
}
],
"author": {
"@type": "Organization",
"name": "Vast.ai Team"
},
"datePublished": "2026-03-12",
"dateModified": "2026-03-12"
})
}} />

# Running NVIDIA Nemotron 3 Super on Vast.ai

NVIDIA Nemotron 3 Super is a 120B parameter model that only activates 12B parameters per token, which means you get the quality of a much larger model at a fraction of the compute. It uses a novel hybrid architecture — Mamba-2 for fast sequence processing, Transformer attention where precision matters, and a Latent Mixture-of-Experts layer for efficient routing — and supports context windows up to 1M tokens.

The model is particularly interesting because it ships with a built-in reasoning toggle. You can turn reasoning on for complex tasks like math and coding, switch to a low-effort mode for lighter thinking, or turn it off entirely for fast direct answers — all from the same deployment, controlled per request. It also supports Multi-Token Prediction for faster inference through speculative decoding, and performs well on agentic benchmarks involving tool use and multi-step task execution.

This guide deploys the FP8 variant on Vast.ai using SGLang and queries it via the OpenAI-compatible API.

## Prerequisites

Before getting started, you'll need:

- A Vast.ai account with credits ([Sign up here](https://cloud.vast.ai))
- Vast.ai CLI installed (`pip install vastai`)
- Your Vast.ai API key configured
- Python 3.8+ (for the OpenAI SDK examples)

<Note>
Get your API key from the [Vast.ai account page](https://cloud.vast.ai/account/) and set it with `vastai set api-key YOUR_API_KEY`.
</Note>

## Understanding Nemotron 3 Super

Key capabilities:

- **Efficient MoE Architecture**: 120B total parameters, only 12B active per token
- **Hybrid Layers**: Mamba-2 (linear-time) + Transformer attention + Latent MoE
- **Reasoning Toggle**: On, off, or low-effort modes via `chat_template_kwargs`
- **Long Context**: Up to 1M tokens (256K default)
- **Commercial License**: NVIDIA Nemotron Open Model License

## Hardware Requirements

The FP8 variant requires:

- **GPUs**: 2× H100 SXM (80GB each) with NVLink for tensor parallelism
- **Disk Space**: 200GB minimum (model is ~120GB)
- **CUDA Version**: 12.4 or higher
- **Docker Image**: `lmsysorg/sglang:v0.5.9`

<Note>
H100 SXM GPUs are required (not PCIe) because NVLink is needed for efficient tensor parallelism across 2 GPUs.
</Note>

## Instance Configuration

### Step 1: Search for Suitable Instances

```bash Bash
vastai search offers \
"gpu_name=H100_SXM num_gpus=2 gpu_ram>=80 cuda_vers>=12.4 \
disk_space>=200 direct_port_count>1 inet_down>=500 rentable=true" \
--order "dph_base" --limit 10
```

This searches for:
- 2× H100 SXM GPUs with at least 80GB VRAM each
- CUDA 12.4 or higher
- At least 200GB disk space
- Direct port access for the API endpoint
- High download speed for faster model loading
- Sorted by price (lowest first)
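
If you'd rather pick an offer programmatically, the CLI's `--raw` flag emits JSON you can filter yourself. A minimal sketch is below; the `dph_total` and `id` field names are assumptions based on typical Vast.ai CLI output and may differ in your CLI version, so inspect the raw JSON first.

```python icon="python" Python
import json

def cheapest_offer(raw_json: str):
    """Return the id of the lowest-priced offer from
    `vastai search offers ... --raw` output (a JSON array).
    Field names `dph_total` and `id` are illustrative assumptions."""
    offers = json.loads(raw_json)
    if not offers:
        return None
    # dph_total = dollars per hour for the whole offer
    return min(offers, key=lambda o: o["dph_total"])["id"]
```

You can feed this the output of `vastai search offers "..." --raw` via a subprocess call or a saved file, then pass the returned id to `vastai create instance`.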

### Step 2: Create the Instance

Select an instance ID from the search results and deploy:

```bash Bash
vastai create instance <INSTANCE_ID> \
--image lmsysorg/sglang:v0.5.9 \
--env '-p 5000:5000' \
--disk 200 \
--onstart-cmd "python3 -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
--served-model-name nvidia/nemotron-3-super \
--host 0.0.0.0 \
--port 5000 \
--trust-remote-code \
--tp 2 \
--kv-cache-dtype fp8_e4m3 \
--reasoning-parser nano_v3"
```

**Key parameters explained**:
- `--image lmsysorg/sglang:v0.5.9` — SGLang stable release with Nemotron 3 Super support
- `--env '-p 5000:5000'` — Expose port 5000 for the API endpoint
- `--disk 200` — 200GB for the ~120GB model weights plus overhead
- `--tp 2` — Tensor parallelism across both H100 GPUs
- `--kv-cache-dtype fp8_e4m3` — FP8 KV cache for efficient memory usage
- `--reasoning-parser nano_v3` — Enables reasoning content parsing for thinking mode
- `--trust-remote-code` — Required for the custom Nemotron architecture

## Monitoring Deployment

### Check Deployment Status

```bash Bash
vastai logs <INSTANCE_ID>
```

Look for this message indicating the server is ready:

```text Text
The server is fired up and ready to roll!
```
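
Tailing logs works, but for scripting it can be handier to poll the API itself. The sketch below probes the OpenAI-compatible `/v1/models` route until it answers; this assumes that route is reachable at your mapped endpoint once SGLang finishes loading, which is how SGLang's OpenAI-compatible server typically behaves.

```python icon="python" Python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: int = 1800, interval_s: int = 15) -> bool:
    """Poll base_url until the server responds, or give up after timeout_s.
    base_url is e.g. "http://<PUBLIC_IP>:<EXTERNAL_PORT>" (no trailing /v1)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # The OpenAI-compatible /v1/models route answers once the model is loaded
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; weights may still be downloading
        time.sleep(interval_s)
    return False
```

Model loading can take a while on first boot since the ~120GB of weights must be downloaded, so a generous timeout is sensible.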

### Get Your Endpoint

Once deployment completes, get your instance details:

```bash Bash
vastai show instance <INSTANCE_ID> --raw
```

Look for the `ports` field — it maps internal port 5000 to an external port. Your API endpoint will be:

```text Text
http://<PUBLIC_IP>:<EXTERNAL_PORT>/v1
```
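
Extracting the endpoint by hand from the raw JSON gets tedious; a small helper can build it for you. This is a sketch under the assumption that the instance JSON exposes a `public_ipaddr` field and a Docker-style `ports` mapping (`"5000/tcp": [{"HostPort": ...}]`); check your actual `--raw` output, as field names may vary.

```python icon="python" Python
import json

def get_endpoint(raw_json: str, internal_port: str = "5000/tcp") -> str:
    """Build the API base URL from `vastai show instance <ID> --raw` output.
    Assumes `public_ipaddr` and a Docker-style `ports` mapping; both field
    names are illustrative and should be verified against real output."""
    info = json.loads(raw_json)
    mapping = info["ports"][internal_port][0]
    return f"http://{info['public_ipaddr']}:{mapping['HostPort']}/v1"
```

The returned string can be passed directly as `base_url` to the OpenAI SDK client shown below.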

## Using the Nemotron 3 Super API

### Quick Test with cURL

Verify the server is responding:

```bash Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-super",
"messages": [{"role": "user", "content": "What is 25 * 37?"}],
"max_tokens": 500,
"temperature": 1.0,
"top_p": 0.95
}'
```

<Note>
NVIDIA requires `temperature=1.0` and `top_p=0.95` for all inference with this model.
</Note>

### Python Integration

Using the OpenAI Python SDK:

```python icon="python" Python
from openai import OpenAI

client = OpenAI(
base_url="http://<IP>:<PORT>/v1",
api_key="EMPTY" # SGLang doesn't require an API key
)

response = client.chat.completions.create(
model="nvidia/nemotron-3-super",
messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
max_tokens=300,
temperature=1.0,
top_p=0.95
)

print(response.choices[0].message.content)
```

## Reasoning Modes

Nemotron 3 Super supports three reasoning modes, controlled via `chat_template_kwargs`. By default, reasoning is enabled.

### Reasoning ON (Default)

The model shows its thinking in `reasoning_content` before giving the final answer in `content`:

```python icon="python" Python
response = client.chat.completions.create(
model="nvidia/nemotron-3-super",
messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
max_tokens=300,
temperature=1.0,
top_p=0.95,
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)
```

### Reasoning OFF

Disable reasoning for faster, direct responses:

```python icon="python" Python
response = client.chat.completions.create(
model="nvidia/nemotron-3-super",
messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
max_tokens=300,
temperature=1.0,
top_p=0.95,
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

msg = response.choices[0].message
# With reasoning OFF, the answer is in reasoning_content
print("Answer:", msg.reasoning_content)
```

<Note>
When reasoning is disabled via SGLang's `nano_v3` parser, the response text is returned in `reasoning_content` instead of `content` (which will be `None`). Make sure to read from the correct field based on the mode you're using.
</Note>
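
If your code switches reasoning modes at runtime, a small helper avoids reading the wrong field. This sketch simply prefers `content` and falls back to `reasoning_content`, matching the parser behavior described in the note above:

```python icon="python" Python
def final_answer(message) -> str:
    """Return the model's answer text regardless of reasoning mode.
    With SGLang's nano_v3 parser, `content` is None when thinking is
    disabled and the answer lands in `reasoning_content` instead."""
    if getattr(message, "content", None):
        return message.content
    return getattr(message, "reasoning_content", None) or ""
```

Call it as `final_answer(response.choices[0].message)` after any of the requests in this section.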

### Low-Effort Reasoning

A middle ground — brief reasoning with fast responses:

```python icon="python" Python
response = client.chat.completions.create(
model="nvidia/nemotron-3-super",
messages=[{"role": "user", "content": "What is 25 * 37?"}],
max_tokens=300,
temperature=1.0,
top_p=0.95,
extra_body={"chat_template_kwargs": {"enable_thinking": True, "low_effort": True}}
)

msg = response.choices[0].message
print("Thinking:", msg.reasoning_content) # Brief reasoning
print("Answer:", msg.content)
```

### Reasoning with cURL

Pass `chat_template_kwargs` at the top level of the JSON body:

```bash Bash
curl -X POST http://<IP>:<PORT>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/nemotron-3-super",
"messages": [{"role": "user", "content": "What is 25 * 37?"}],
"max_tokens": 500,
"temperature": 1.0,
"top_p": 0.95,
"chat_template_kwargs": {"enable_thinking": false}
}'
```

## Cleanup

When you're done, destroy the instance to stop billing:

```bash Bash
vastai destroy instance <INSTANCE_ID>
```

<Note>
Always destroy your instance when you're finished to avoid unnecessary charges.
</Note>

## Additional Resources

- [NVIDIA Nemotron 3 Super Blog Post](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/) — Architecture details and benchmarks
- [HuggingFace Model Card (FP8)](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) — Model card and usage instructions
- [SGLang Documentation](https://docs.sglang.ai/) — SGLang configuration and usage
- [Vast.ai CLI Guide](/cli/get-started) — Learn more about the Vast.ai CLI
- [GPU Instance Guide](/documentation/instances/overview) — Understanding Vast.ai instances

## Conclusion

Nemotron 3 Super delivers frontier-class reasoning performance by activating only 12B of its 120B parameters per token. With SGLang and Vast.ai, you can deploy the model on 2× H100 GPUs and start querying it via the OpenAI-compatible API.

The reasoning toggle is particularly useful: enable it for complex tasks like math, coding, and analysis, or disable it for fast direct answers in production pipelines.