docs(spark): add local Ollama inference setup section #678
paritoshd-nv wants to merge 4 commits into NVIDIA:main
Conversation
Add step-by-step instructions for setting up local inference with Ollama on DGX Spark, covering NVIDIA runtime verification, Ollama install and model pre-load, OLLAMA_HOST=0.0.0.0 configuration, and sandbox connection with verification. Fixes NVIDIA#314, NVIDIA#385
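The `OLLAMA_HOST=0.0.0.0` step described above is normally applied as a systemd drop-in. A minimal sketch (the unit name `ollama.service` and the drop-in path assume the standard Linux install from Ollama's official script):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```

After writing the override, `sudo systemctl daemon-reload && sudo systemctl restart ollama` picks it up, and the service listens on all interfaces instead of only `127.0.0.1`.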
📝 Walkthrough

Added a new "Setup Local Inference (Ollama)" documentation section to `spark-install.md`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Host
    participant Docker
    participant Ollama
    participant Systemd
    participant OpenShell
    participant NemoClaw
    User->>Host: run `docker run --gpus all --rm nvidia/cuda:... nvidia-smi`
    Host->>Docker: attempt GPU runtime
    alt GPU runtime missing
        Host->>Host: run `nvidia-ctk` to configure runtime & restart Docker
    end
    User->>Host: install Ollama (official script)
    Host->>Ollama: start service (default localhost:11434)
    User->>Ollama: curl http://127.0.0.1:11434 (verify)
    User->>Ollama: `ollama pull nemotron-3-super:120b` (preload)
    User->>Systemd: create override to set `OLLAMA_HOST=0.0.0.0`
    Systemd->>Ollama: restart service (listen 0.0.0.0:11434)
    User->>OpenShell: install (script) & choose "Local Ollama" + model
    User->>NemoClaw: install (script), run `nemoclaw ... connect`
    NemoClaw->>Host: curl -sf https://inference.local/v1/models (validate routing)
    User->>OpenShell: run `openclaw agent ... --local` (start agent using local inference)
```
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@spark-install.md`:
- Around line 169-177: Add an explicit check for the local inference proxy in
Step 6: after running the "nemoclaw my-assistant connect" command and before
"openclaw agent --agent main --local ...", run a curl GET against
http://inference.local/api/tags, capture the response body to
/tmp/inference_tags.json and assert the HTTP status is 200 so the documentation
verifies the non-403 fallback path is working; reference the existing step
commands ("nemoclaw my-assistant connect" and "openclaw agent --agent main
--local -m ...") so the check is placed inside the sandbox and fails the doc
verification if inference.local returns 403 or non-200.
- Line 157: Replace the netstat-based listener check "sudo netstat -nap | grep
11434" with an ss-based check: update the line that mentions netstat to use ss
to list listening TCP sockets with numeric ports and process info (for example
using ss with listen, tcp, numeric and process flags and filtering for port
11434) so the doc uses the standard iproute2 tool present on Ubuntu 24.04.
- Line 105: Replace the failing runtime verification command that uses the plain
"ubuntu" image; update the Docker command string "docker run --rm
--runtime=nvidia --gpus all ubuntu nvidia-smi" to use an NVIDIA CUDA image (for
example an nvidia/cuda:<tag>-runtime image such as
nvidia/cuda:11.8-runtime-ubuntu20.04) so that nvidia-smi is present in the
container; keep the same flags (--rm --runtime=nvidia --gpus all) and the final
command (nvidia-smi) but swap the image name to a CUDA runtime image.
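The `ss`-based listener check recommended in the comments above boils down to filtering `ss -tlnp` output for the port. A small sketch of that logic (the helper name `listening_on` is illustrative, not from the docs):

```shell
# Illustrative helper: succeeds if stdin (expected: `sudo ss -tlnp` output)
# contains a LISTEN line mentioning the given port.
listening_on() {
  grep "LISTEN" | grep -q ":$1 "
}

# Real usage on the DGX Spark host:
#   sudo ss -tlnp | listening_on 11434 && echo "ollama is listening"
```

Keeping the check as a pipeline filter makes it easy to drop into either the docs or a CI verification script.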
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6ab3c5b3-d63d-4252-9a95-e07ac7cddea3
📒 Files selected for processing (1)
spark-install.md
netstat requires net-tools which is not installed by default on Ubuntu 24.04. ss from iproute2 is available by default and is more reliable for verifying listening sockets.

Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
Add explicit curl to https://inference.local/v1/models inside the sandbox to validate the proxy route before running the agent. This prevents fallback paths from masking regressions in the fix for NVIDIA#314.
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark-install.md (1)
142-152: **Add a hardening note when binding Ollama to `0.0.0.0`.** Line 148 intentionally exposes Ollama on all interfaces; add a short warning to restrict network access (trusted LAN only / firewall), since Ollama is typically unauthenticated by default.
Suggested wording

```diff
 ### Step 4: Configure Ollama to Listen on All Interfaces

 By default Ollama binds to `127.0.0.1`, which is not reachable from inside the sandbox container. Configure it to listen on all interfaces:
+
+> Security note: `OLLAMA_HOST=0.0.0.0` exposes Ollama on your network. Restrict access with host firewall rules or trusted-network isolation.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-install.md` around lines 142 - 152, Add a short hardening warning to Step 4 near the OLLAMA_HOST=0.0.0.0 instruction: note that binding Ollama to 0.0.0.0 exposes the service to all network interfaces and should only be done on a trusted LAN or behind a firewall, and recommend restricting access via firewall rules or local network-only interfaces if Ollama is unauthenticated by default; reference the OLLAMA_HOST=0.0.0.0 override.conf instruction so readers know where to apply the caution.
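One way to act on that hardening note is a host firewall rule limiting port 11434 to a trusted subnet. A hedged sketch using `ufw` (the subnet `192.168.1.0/24` is a placeholder, not from the docs; these are privileged commands shown for illustration only):

```bash
# Illustrative only: allow Ollama's port from a trusted LAN, deny it elsewhere.
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
sudo ufw status numbered   # verify rule order: the allow rule must precede the deny
```

Since ufw evaluates rules in order, the subnet allow must be inserted before the blanket deny for sandbox traffic to keep working.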
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@spark-install.md`:
- Around line 176-181: Update the probe command that currently reads `curl -s
https://inference.local/v1/models` so it fails fast on non-2xx responses;
replace it with a curl invocation that returns non-zero on non-success (for
example `curl -sSf https://inference.local/v1/models`) or explicitly assert HTTP
200 (for example `curl -s -o /dev/null -w '%{http_code}'
https://inference.local/v1/models | grep -q '^200$'`) so the step gates success
when `inference.local` does not return 200.
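The exact-200 gating suggested above can be factored into a tiny helper. A sketch (the name `require_200` is made up for illustration; the status code would come from `curl -s -o /dev/null -w '%{http_code}' …`):

```shell
# Illustrative: succeed only on an exact HTTP 200, so a 403 fallback fails the step.
require_200() {
  [ "$1" = "200" ]
}

# Real usage in the doc's verification step:
#   require_200 "$(curl -s -o /dev/null -w '%{http_code}' https://inference.local/v1/models)"
```

Unlike `curl -sSf`, which accepts any 2xx/3xx, this form pins the check to exactly 200, which is the stricter of the two options the review proposes.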
---
Nitpick comments:
In `@spark-install.md`:
- Around line 142-152: Add a short hardening warning to Step 4 near the
OLLAMA_HOST=0.0.0.0 instruction: note that binding Ollama to 0.0.0.0 exposes the
service to all network interfaces and should only be done on a trusted LAN or
behind a firewall, and recommend restricting access via firewall rules or local
network-only interfaces if Ollama is unauthenticated by default; reference the
OLLAMA_HOST=0.0.0.0 override.conf instruction so readers know where to apply the
caution.
Actionable comments posted: 1
🧹 Nitpick comments (3)
spark-install.md (3)
157-157: **Add `sudo` to the `ss` command for complete process information.** The `-p` flag requires elevated privileges to display process information. While the command will work without `sudo`, it won't show the full process details that help verify Ollama is the service listening on port 11434.

📝 Suggested fix

```diff
-ss -tlnp | grep 11434
+sudo ss -tlnp | grep 11434
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-install.md` at line 157, update the command that checks listeners to run with elevated privileges so process info is shown; change the existing "ss -tlnp | grep 11434" invocation to run under sudo (i.e., prefix with sudo) so the -p flag can return full process details and confirm Ollama is the process on port 11434.
138-140: **Consider providing a non-interactive alternative for model preloading.** The current step requires users to manually type `/bye` to exit, which breaks automation. Consider adding a note about a non-interactive approach, or mention that this step is optional (the model will be loaded on first actual use).

📝 Suggested documentation improvement

````diff
 Run it briefly to pre-load weights into unified memory, then exit:

 ```bash
 ollama run nemotron-3-super:120b
 # type /bye to exit
 ```
+
+> Note: This step is optional. The model will be loaded automatically on first use, but pre-loading can reduce initial inference latency.
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-install.md` around lines 138-140, update the "ollama run nemotron-3-super:120b" step to note that it is optional and that the model will be loaded on first use, and add a short non-interactive alternative so automation isn't blocked; reference the interactive shutdown token "/bye" and describe using a one-shot or piped/timeout-based invocation as the non-interactive approach and include a brief example sentence explaining reduced latency from preloading.

173-173: **Clarify the sandbox name reference.** The command references `my-assistant` as the sandbox name, but this name isn't defined in the "Setup Local Inference (Ollama)" section. Consider adding a note that this is the default sandbox name created during onboarding, or reference where users should have created this sandbox.

📝 Suggested clarification

````diff
+Connect to your sandbox (the default name is `my-assistant` if created during onboarding):
+
 ```bash
 # Connect to the sandbox
 nemoclaw my-assistant connect
 ```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-install.md` at line 173, clarify that the sandbox name "my-assistant" used in the command "nemoclaw my-assistant connect" is the default sandbox created during onboarding (or point to where users should create it). Update the "Setup Local Inference (Ollama)" section to either mention that onboarding creates a sandbox named "my-assistant" or add a brief note/instruction telling users how to create/choose a sandbox before running "nemoclaw my-assistant connect" so the reference is explicit and not ambiguous.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@spark-install.md`:
- Line 164: Update the curl command string so the NemoClaw install URL matches
other docs by adding the "www." prefix; replace the existing
"https://nvidia.com/nemoclaw.sh" occurrence in the curl invocation (the line
containing "curl -fsSL https://nvidia.com/nemoclaw.sh | bash") with
"https://www.nvidia.com/nemoclaw.sh" so the URL pattern is consistent with
README.md and docs/index.md.
Nitpick comments:
In `@spark-install.md`:
- Line 157: Update the command that checks listeners to run with elevated privileges so process info is shown; change the existing "ss -tlnp | grep 11434" invocation to run under sudo (i.e., prefix with sudo) so the -p flag can return full process details and confirm Ollama is the process on port 11434.
- Around line 138-140: Update the "ollama run nemotron-3-super:120b" step to note that it is optional and that the model will be loaded on first use, and add a short non-interactive alternative so automation isn't blocked; reference the interactive shutdown token "/bye" and describe using a one-shot or piped/timeout-based invocation as the non-interactive approach and include a brief example sentence explaining reduced latency from preloading.
- Line 173: Clarify that the sandbox name "my-assistant" used in the command "nemoclaw my-assistant connect" is the default sandbox created during onboarding (or point to where users should create it). Update the "Setup Local Inference (Ollama)" section to either mention that onboarding creates a sandbox named "my-assistant" or add a brief note/instruction telling users how to create/choose a sandbox before running "nemoclaw my-assistant connect" so the reference is explicit and not ambiguous.

📥 Commits: Reviewing files that changed between a9dbc13e8855c50e06595bbd6295ee5102983a7f and 909f98e94bcddb60745c43e0238c5a5ed9161004.
📒 Files selected for processing (1): spark-install.md
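For the non-interactive pre-load alternative discussed in the review, Ollama's HTTP API loads a model into memory when `/api/generate` is called with no prompt. A sketch (assumes the service is listening on 127.0.0.1:11434 and uses the model name from the doc; shown for illustration, not run here):

```bash
# Pull without interaction, then load weights via the API (no /bye needed).
ollama pull nemotron-3-super:120b
curl -s http://127.0.0.1:11434/api/generate -d '{"model": "nemotron-3-super:120b"}'
```

This keeps the pre-load scriptable, which matters if the doc's steps are ever wrapped in CI verification.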
(force-pushed 909f98e to 0ff8614)
Use curl -sf so the check exits non-zero on HTTP errors (403, 503, etc.), preventing a silent 403 from masking a proxy routing regression. Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
(force-pushed 0ff8614 to 8d02c4d)
Add step-by-step instructions for setting up local inference with Ollama on DGX Spark, covering NVIDIA runtime verification, Ollama install and model pre-load, OLLAMA_HOST=0.0.0.0 configuration, and sandbox connection with verification.
Fixes #314, #385
Summary
Related Issue
Changes
Type of Change
Testing
- `make check` passes.
- `npm test` passes.
- `make docs` builds without warnings. (for doc-only changes)

Checklist

General

Code Changes
- `make format` applied (TypeScript and Python).

Doc Changes
- Use the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."