diff --git a/README.md b/README.md index ed3571b3..dee0a76a 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,7 @@ - [Quickstart](#quickstart) - [Generator with Diffusers](#generator-with-diffusers) - [Generator with vLLM-Omni](#generator-with-vllm-omni) + - [Generator with SGLang](#generator-with-sglang) - [Reasoner with Transformers](#reasoner-with-transformers) - [Reasoner with vLLM](#reasoner-with-vllm) - [Troubleshooting](#troubleshooting) @@ -61,7 +62,7 @@ Cosmos 3 exposes two runtime surfaces: - **World understanding:** Analyze videos and images for captions, temporal events, next actions, spatial grounding, physical plausibility, and causal outcomes. - **World generation:** Produce images, videos, synchronized sound, and action-conditioned rollouts from text, image, video, or action inputs. - **Action modeling:** Predict policy actions, inverse dynamics, and forward dynamics for robotics, camera motion, egocentric motion, and autonomous-driving settings. -- **Research and production paths:** Use Diffusers and Transformers for Python-first development, then vLLM-Omni and vLLM for OpenAI-compatible serving. +- **Research and production paths:** Use Diffusers and Transformers for Python-first development, then vLLM-Omni/vLLM or SGLang for OpenAI-compatible serving. - **Post-training recipes:** Adapt vision, action, and reasoner workflows with Cosmos Framework training recipes and task-specific evaluation [Coming Soon]. ### Model Architecture @@ -413,6 +414,169 @@ References: +#### Generator with SGLang + +
+Expand SGLang generator setup, endpoints, and request reference + +Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths. + +Supported checkpoints: + +| Model | Status | Notes | +| --- | --- | --- | +| `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video | +| `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint | +| `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint | +| `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint | +| `nvidia/Cosmos3-Nano-Policy-DROID` | Supported | Action/policy checkpoint | + +Install SGLang from the main branch with diffusion extras: + +```shell +git clone --branch main https://github.com/sgl-project/sglang.git +cd sglang +python -m venv .venv +source .venv/bin/activate +python -m pip install --upgrade pip +pip install -e "python[diffusion]" +pip install "cosmos-guardrail==0.3.1" +``` + +> **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there. + +Start a Nano server: + +```shell +sglang serve --model-path nvidia/Cosmos3-Nano +``` + +For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs: + +```shell +sglang serve \ + --model-path nvidia/Cosmos3-Super-Image2Video \ + --num-gpus 4 +``` + +This is the performance-mode setup. 
If it runs out of memory, switch to SGLang Diffusion's memory preset: + +```shell +sglang serve \ + --model-path nvidia/Cosmos3-Super-Image2Video \ + --num-gpus 4 \ + --performance-mode memory +``` + +Vision endpoints: + +| Mode | Endpoint | Notes | +| --- | --- | --- | +| Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 | +| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` | +| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` | +| Video to video | `POST /v1/videos` | Upload the conditioning video with `video_reference` and choose which frames stay as clean conditioning | +| Video with sound | `POST /v1/videos` | Add `generate_sound=true` to produce a soundtrack alongside the video | + +Action modes use Cosmos 3 as a world model: they condition on an embodiment (`domain_name`) and exchange video and action sequences. Policy and inverse dynamics return a predicted action chunk, which you read from the completed job result; forward dynamics returns only video. + +| Mode | `action_mode` | Input | Output | +| --- | --- | --- | --- | +| Policy | `policy` | Image + instruction | Video + predicted action chunk | +| Inverse dynamics | `inverse_dynamics` | Video + instruction | Video + predicted action chunk | +| Forward dynamics | `forward_dynamics` | Image + action chunk | Video | + +Pass embodiment settings through `extra_params`: `action_mode`, `domain_name` (for example `bridge_orig_lerobot`, `av`, or `camera_pose`), `raw_action_dim`, and optionally `action_view_point`. SGLang derives the action chunk length from `num_frames - 1`, so set `num_frames` to `action_chunk_size + 1`. +For forward dynamics, pass the action trajectory directly in `extra_params["action"]` as a JSON array of shape `[action_chunk_size, raw_action_dim]`. 
SGLang does not use `action_path` for HTTP requests, so no `--allowed-local-media-path` setup is needed for action files. + +Text-to-video example: + +```shell +# Submit an async video generation job and capture its ID. +job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \ + --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \ + --form-string "negative_prompt=blurry, distorted, low quality" \ + --form-string "size=1280x720" \ + --form-string "num_frames=81" \ + --form-string "fps=24" \ + --form-string "num_inference_steps=35" \ + --form-string "guidance_scale=4.0" \ + --form-string "flow_shift=10.0" \ + --form-string "seed=42" \ + --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \ + | jq -r .id) + +# Poll until the job completes. Cosmos 3 video generation can take several minutes. +status="" +until [ "$status" = "completed" ]; do + status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status) + [ "$status" = "failed" ] && exit 1 + sleep 5 +done + +# Download the completed MP4. +curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \ + -o cosmos3_t2v_output.mp4 +``` + +Text-to-image example: + +```shell +curl -sS -X POST http://localhost:30000/v1/images/generations \ + -H "Content-Type: application/json" \ + -d '{ + "prompt": "A warehouse robot folds a blue cloth on a clean workbench.", + "size": "1280x720", + "n": 1, + "num_inference_steps": 35, + "guidance_scale": 6.0, + "flow_shift": 10.0, + "seed": 0, + "extra_args": { + "use_resolution_template": false, + "guardrails": true + } + }' +``` + +Video-to-video-with-sound example: + +```shell +job_id=$(curl -sS --fail-with-body -X POST "http://localhost:30000/v1/videos" \ + -H "Accept: application/json" \ + --form-string 'prompt=A small warehouse robot moves a blue box across a clean floor.' 
\ + --form-string 'negative_prompt=blurry, distorted, low quality' \ + --form-string 'size=1280x720' \ + --form-string 'num_frames=61' \ + --form-string 'fps=24' \ + --form-string 'num_inference_steps=30' \ + --form-string 'guidance_scale=4.0' \ + --form-string 'flow_shift=10.0' \ + --form-string 'seed=1234' \ + --form-string 'generate_sound=true' \ + --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \ + -F 'video_reference=@/path/to/video.mp4;type=video/mp4' \ + | jq -r .id) + +# Poll until the job completes. Cosmos 3 video generation can take several minutes. +status="" +until [ "$status" = "completed" ]; do + status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status) + [ "$status" = "failed" ] && exit 1 + sleep 5 +done + +# Download the completed MP4. +curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \ + -o cosmos3_v2vs_output.mp4 +``` + +SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models. + +For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3). + +
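The forward-dynamics request rules described in this section (`num_frames` equal to `action_chunk_size + 1`, and the trajectory passed inline as `extra_params["action"]`) can be sketched in Python as follows. This is a minimal illustration, not SGLang's own client code: the helper name `build_forward_dynamics_form` is hypothetical, the zero-filled trajectory is placeholder data, and the 60-step, 9-dimensional AV shape is an assumption borrowed from the forward-dynamics cookbook.

```python
import json


def build_forward_dynamics_form(prompt, action, domain_name, raw_action_dim, fps=10, seed=0):
    """Hypothetical helper: build form fields for a forward-dynamics POST /v1/videos.

    `action` is a nested list of shape [action_chunk_size, raw_action_dim].
    """
    if not action:
        raise ValueError("action must contain at least one step")
    for index, step in enumerate(action):
        if len(step) != raw_action_dim:
            raise ValueError(f"step {index} has {len(step)} values, expected {raw_action_dim}")

    action_chunk_size = len(action)
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        # The trajectory travels inline in the request, not as a file path.
        "action": action,
    }
    return {
        "prompt": prompt,
        # The server derives the chunk length from num_frames - 1.
        "num_frames": str(action_chunk_size + 1),
        "fps": str(fps),
        "seed": str(seed),
        "extra_params": json.dumps(extra_params),
    }


# Placeholder trajectory: 60 steps of 9 zeros (assumed AV layout).
form = build_forward_dynamics_form(
    prompt="You are an autonomous vehicle planning system.",
    action=[[0.0] * 9 for _ in range(60)],
    domain_name="av",
    raw_action_dim=9,
)
print(form["num_frames"])  # 61
```

The returned dictionary is meant to be sent as the form data of a multipart `POST /v1/videos` request, with the conditioning image attached as `input_reference`.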
+ #### Reasoner with Transformers Coming soon! @@ -483,7 +647,7 @@ The Cosmos Framework requires `uv >= 0.11.3` (enforced via its `pyproject.toml`) | Goal | Use | Notes | | --- | --- | --- | | Generator research or model development | Diffusers | Python-first path for inspecting and modifying generator behavior | -| Generator production inference | vLLM-Omni | API path for image, video, sound, and action outputs | +| Generator production inference | vLLM-Omni/SGLang | API path for image, video, sound, and action outputs | | Reasoner research or model development | Transformers (coming soon) | Python-first path for prompts, processors, and model behavior | | Reasoner production inference | vLLM | OpenAI-compatible endpoint for text outputs from text and vision inputs | | Runnable setup, training, or evaluation | Cosmos Framework | Full workflow docs for setup, inference, omni-model training, and evaluation | @@ -497,10 +661,13 @@ We are building examples that show Cosmos 3 capabilities end to end, including w | Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | | Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. 
| [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | | Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | +| Generator (audiovisual) with SGLang | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb) | | Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. 
| [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | | Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | +| Forward dynamics with SGLang | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb) | | Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | | Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. 
| [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | +| Inverse dynamics with SGLang | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb) | | Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | | Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). 
| [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | diff --git a/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb b/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb new file mode 100644 index 00000000..a3988033 --- /dev/null +++ b/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb @@ -0,0 +1,1318 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "license-header", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-title", + "metadata": {}, + "source": [ + "# Cosmos3 Nano Action: Forward Dynamics with SGLang\n", + "\n", + "This notebook runs Cosmos3 Nano **action forward-dynamics** inference through the SGLang OpenAI-compatible video API:\n", + "\n", + "```text\n", + "POST /v1/videos\n", + "```\n", + "\n", + "Forward dynamics predicts future visual observations from an initial image and an action trajectory. 
This notebook contains separate AV and robotics sections that each build their own input spec, run inference, and visualize generated videos.\n", + "\n", + "Start the SGLang server:\n", + "\n", + "```bash\n", + "docker rm -f cosmos3-sglang-notebook 2>/dev/null || true\n", + "\n", + "docker run -d --name cosmos3-sglang-notebook \\\n", + " --runtime nvidia --gpus '\"device=0\"' \\\n", + " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", + " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", + " -v \"$PWD:/workspace\" \\\n", + " -p 30000:30000 --ipc=host \\\n", + " lmsysorg/sglang:dev \\\n", + " sglang serve \\\n", + " --model-path nvidia/Cosmos3-Nano \\\n", + " --host 0.0.0.0\n", + "\n", + "# Wait until this returns model metadata before running the inference cell.\n", + "curl http://localhost:30000/v1/models\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-vars-md", + "metadata": {}, + "source": [ + "## Configure Notebook Variables\n", + "\n", + "Run this cell after the SGLang server is available. 
It resolves local input/output paths and stores generated outputs under `outputs/cosmos3_action_sglang/` by default.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-vars-code", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "\n", + "def find_repo_root(start: Path) -> Path:\n", + " for path in [start, *start.parents]:\n", + " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", + " return path\n", + "\n", + " return start\n", + "\n", + "\n", + "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", + "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n", + "COSMOS3_OUTPUT_ROOT = Path(\n", + " os.environ.get(\"COSMOS3_SGLANG_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_sglang\")\n", + ").resolve()\n", + "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", + "SGLANG_BASE_URL = os.environ.get(\"COSMOS3_SGLANG_BASE_URL\", \"http://localhost:30000\").rstrip(\"/\")\n", + "\n", + "\n", + "def resolve_input(rel_path: str) -> str:\n", + " path = (COSMOS_ROOT / rel_path).resolve()\n", + " assert path.exists(), f\"missing input: {path}\"\n", + " return str(path)\n", + "\n", + "\n", + "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", + "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n", + "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n", + "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n", + "print(\"COSMOS3_SGLANG_BASE_URL:\", SGLANG_BASE_URL)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-av-md", + "metadata": {}, + "source": [ + "## AV\n", + "\n", + "In this example, we show how to provide a set of ego poses of an autonomous vehicle and an image to generate driving videos using Cosmos3-Nano.\n" + ] + }, + { + "cell_type": "markdown", + "id": 
"fdvl-av-spec-md", + "metadata": {}, + "source": [ + "### Create the AV Forward-Dynamics Input Spec\n", + "\n", + "AV forward-dynamics inference is driven by a JSONL spec, one line per run. Each line shares the same start frame (`vision_path`) but uses a different ego trajectory (`action_path`), so we get one generated video per trajectory.\n", + "\n", + "The action input is prepared in a JSON file, which can be converted from camera poses (camera-to-world transformation, OpenCV convention, unit in meter) via `pose_abs_to_rel`:\n", + "\n", + "```python\n", + "if str(COSMOS3_REPO) not in sys.path:\n", + " sys.path.insert(0, str(COSMOS3_REPO))\n", + "from cosmos_framework.data.vfm.action.pose_utils import pose_abs_to_rel\n", + "\n", + "poses_abs = np.array([...]) # [T, 4, 4], camera-to-world transformation in opencv convention, unit in meter\n", + "poses_rel = pose_abs_to_rel(\n", + " poses_abs,\n", + " rotation_format=\"rot6d\",\n", + " pose_convention=\"backward_framewise\",\n", + " translation_scale=1.35,\n", + ") # [T-1, 9], translation(3), rot6d(6), framewise relative transformation\n", + "\n", + "with open(\"custom_traj.json\", \"w\") as f:\n", + " json.dump(poses_rel, f)\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-av-spec-code", + "metadata": {}, + "outputs": [], + "source": [ + "# `resolve_input` and the COSMOS3_* paths come from the variables cell.\n", + "import json\n", + "\n", + "# Local AV inputs, relative to the cosmos repo root.\n", + "av_input_image = \"cookbooks/cosmos3/generator/action/assets/images/av_0.jpg\"\n", + "av_input_actions = {\n", + " \"av_forward\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json\",\n", + " \"av_left\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json\",\n", + " \"av_right\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_right.json\",\n", + "}\n", + "\n", + "av_vision_path = resolve_input(av_input_image)\n", + 
"av_records = [\n", + " {\n", + " \"action_chunk_size\": 60,\n", + " \"action_path\": resolve_input(action_rel),\n", + " \"domain_name\": \"av\",\n", + " \"fps\": 10,\n", + " \"image_size\": 480,\n", + " \"view_point\": \"ego_view\",\n", + " \"model_mode\": \"forward_dynamics\",\n", + " \"name\": name,\n", + " \"prompt\": \"You are an autonomous vehicle planning system.\",\n", + " \"seed\": 0,\n", + " \"vision_path\": av_vision_path,\n", + " }\n", + " for name, action_rel in av_input_actions.items()\n", + "]\n", + "\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "av_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_av_custom.jsonl\"\n", + "av_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in av_records))\n", + "av_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_av_custom\"\n", + "\n", + "os.environ[\"COSMOS3_AV_FD_INPUT\"] = str(av_fd_input_path)\n", + "os.environ[\"COSMOS3_AV_FD_OUTPUT\"] = str(av_fd_output_dir)\n", + "\n", + "print(\"wrote AV spec:\", av_fd_input_path)\n", + "print(\"AV runs:\", list(av_input_actions))\n", + "print(av_fd_input_path.read_text())\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-av-traj-md", + "metadata": {}, + "source": [ + "### Visualize AV Input Trajectories\n", + "\n", + "Before generating any video, plot each input ego trajectory as a 3D camera path with frustums and a top-down bird's-eye view.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-av-traj-code", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import json\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.collections import LineCollection\n", + "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n", + "import os\n", + "\n", + "# The notebook kernel may differ from the framework venv, so put the repo on the\n", + "# path before importing `cosmos_framework`.\n", + "COSMOS3_FRAMEWORK_PATH = 
os.environ.get(\"COSMOS3_FRAMEWORK_PATH\")\n", + "if str(COSMOS3_FRAMEWORK_PATH) not in sys.path:\n", + " sys.path.insert(0, str(COSMOS3_FRAMEWORK_PATH))\n", + "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n", + "\n", + "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n", + "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n", + "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]\n", + "\n", + "\n", + "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n", + " fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n", + " title=None, save_path=None, show=True):\n", + " \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\"\"\"\n", + " poses_abs = np.asarray(poses_abs)\n", + " pos = poses_abs[:, :3, 3]\n", + " fwd = poses_abs[:, :3, 2]\n", + " T = len(pos)\n", + " colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n", + " scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n", + " step = max(1, T // max(n_frustums, 1))\n", + " xzy = [0, 2, 1]\n", + "\n", + " fig = plt.figure(figsize=(14, 6))\n", + "\n", + " ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n", + " path = pos[:, xzy]\n", + " ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n", + " lines, lcolors, allpts = [], [], [path]\n", + " for i in range(0, T, step):\n", + " cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n", + " @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy]\n", + " allpts.append(cw)\n", + " lines += [[cw[a], cw[b]] for a, b in _EDGES]\n", + " lcolors += [colors[i]] * len(_EDGES)\n", + " ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n", + " ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", + " ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", + " rng = 
np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n", + " ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n", + " ax.set_xlabel(\"X (m)\", labelpad=12)\n", + " ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n", + " ax.set_zlabel(\"Y up (m)\", labelpad=10)\n", + " ax.set_zticks([])\n", + " ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n", + " ax.legend(loc=\"upper left\")\n", + " ax.view_init(elev=22, azim=-70)\n", + "\n", + " ax2 = fig.add_subplot(1, 2, 2)\n", + " seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], axis=1)\n", + " lc = LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n", + " lc.set_array(np.arange(T - 1))\n", + " ax2.add_collection(lc)\n", + " ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n", + " color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n", + " ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", + " ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", + " ax2.set_xlabel(\"X (m)\")\n", + " ax2.set_ylabel(\"Z forward (m)\")\n", + " ax2.set_title(\"Top-down (bird's-eye view)\")\n", + " ax2.set_aspect(\"equal\", adjustable=\"datalim\")\n", + " ax2.autoscale_view()\n", + " ax2.legend()\n", + " fig.colorbar(lc, ax=ax2, label=\"frame index\")\n", + "\n", + " plt.tight_layout(w_pad=6)\n", + " if save_path:\n", + " fig.savefig(save_path, dpi=120, bbox_inches=\"tight\")\n", + " print(\"saved\", save_path)\n", + " if show:\n", + " plt.show()\n", + "\n", + "\n", + "for record in av_records:\n", + " name = record[\"name\"]\n", + " with open(record[\"action_path\"]) as f:\n", + " poses_rel = np.array(json.load(f))\n", + "\n", + " # AV action convention: rot6d rotation, backward_framewise, translation_scale = 1.35.\n", + " poses_abs = pose_rel_to_abs(\n", + " poses_rel,\n", + " rotation_format=\"rot6d\",\n", + " 
pose_convention=\"backward_framewise\",\n", + " translation_scale=1.35,\n", + " )\n", + " print(name, poses_rel.shape, poses_abs.shape)\n", + " visualize_pose(poses_abs, title=f\"{name}: camera trajectory + frustums ({len(poses_abs)} frames)\", show=True)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-av-run-md", + "metadata": {}, + "source": [ + "### Run AV Forward-Dynamics Inference\n", + "\n", + "Runs `Cosmos3-Nano` on every line of the AV spec through SGLang. Each run writes its video to:\n", + "\n", + "```text\n", + "/action_forward_dynamics_av_custom//vision.mp4\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-av-run-code", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import mimetypes\n", + "import time\n", + "from pathlib import Path\n", + "from PIL import Image\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "\n", + "def check_sglang_server(timeout_s: int = 600, interval_s: int = 10) -> None:\n", + " deadline = time.time() + timeout_s\n", + " last_error: Exception | None = None\n", + " while time.time() < deadline:\n", + " try:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/models\", timeout=10)\n", + " response.raise_for_status()\n", + " print(response.json())\n", + " return\n", + " except requests.RequestException as exc:\n", + " last_error = exc\n", + " print(f\"Waiting for SGLang server at {SGLANG_BASE_URL}: {exc}\")\n", + " time.sleep(interval_s)\n", + " raise RuntimeError(\n", + " f\"SGLang server did not become ready at {SGLANG_BASE_URL} within {timeout_s}s. 
\"\n", + " \"Check `docker logs -f cosmos3-sglang-notebook`.\"\n", + " ) from last_error\n", + "\n", + "\n", + "def submit_forward_dynamics(record: dict, fd_output_dir: Path) -> dict:\n", + " run_dir = fd_output_dir / record[\"name\"]\n", + " run_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " vision_path = Path(record[\"vision_path\"])\n", + " input_width, input_height = Image.open(vision_path).size\n", + " mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", + " extra_params = {\n", + " \"action_mode\": \"forward_dynamics\",\n", + " \"domain_name\": record[\"domain_name\"],\n", + " \"action_view_point\": record[\"view_point\"],\n", + " \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", + " \"guardrails\": False,\n", + " }\n", + " prompt = str(record.get(\"prompt\") or \"\").strip() or \"A robot manipulates an object.\"\n", + " form = {\n", + " \"prompt\": prompt,\n", + " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", + " \"fps\": record[\"fps\"],\n", + " \"size\": f\"{input_width}x{input_height}\",\n", + " \"num_inference_steps\": 30,\n", + " \"guidance_scale\": 1.0,\n", + " \"flow_shift\": 10.0,\n", + " \"seed\": record[\"seed\"],\n", + " \"extra_params\": json.dumps(extra_params),\n", + " }\n", + "\n", + " with vision_path.open(\"rb\") as image_file:\n", + " response = requests.post(\n", + " f\"{SGLANG_BASE_URL}/v1/videos\",\n", + " data={key: str(value) for key, value in form.items()},\n", + " files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", + " timeout=120,\n", + " )\n", + " if not response.ok:\n", + " (run_dir / \"error_response.txt\").write_text(response.text)\n", + " print(\"SGLang request failed:\", response.status_code)\n", + " print(response.text)\n", + " print(\"form:\", json.dumps(form, indent=2))\n", + " print(\"extra_params keys:\", sorted(extra_params))\n", + " print(\"action shape:\", [len(extra_params[\"action\"]), len(extra_params[\"action\"][0]) 
if extra_params[\"action\"] else 0])\n", + " response.raise_for_status()\n", + " initial = response.json()\n", + " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", + "\n", + " while True:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", + " response.raise_for_status()\n", + " final = response.json()\n", + " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", + " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + " raise RuntimeError(json.dumps(final, indent=2))\n", + " time.sleep(2)\n", + "\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", + " response.raise_for_status()\n", + " video_path = run_dir / \"vision.mp4\"\n", + " video_path.write_bytes(response.content)\n", + "\n", + " action = final.get(\"action\")\n", + " if action is not None:\n", + " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", + "\n", + " print(\"saved\", video_path)\n", + " if action is not None:\n", + " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", + " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path, \"action\": action}\n", + "\n", + "\n", + "check_sglang_server()\n", + "av_results = []\n", + "for record in av_records:\n", + " print(f\"\\nSubmitting {record['name']}\")\n", + " av_results.append(submit_forward_dynamics(record, av_fd_output_dir))\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-av-preview-md", + "metadata": {}, + "source": [ + "### Visualize AV Generated Videos\n", + "\n", + "\n", + "`Video(..., embed=True)` base64-inlines a file into the notebook, and embedding full-resolution runs can freeze the front-end. 
This cell first transcodes each video to a small preview using the ffmpeg binary bundled with `imageio-ffmpeg`, then embeds the previews. The full-resolution `vision.mp4` files are left untouched.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-av-preview-code", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "import imageio_ffmpeg\n", + "from IPython.display import Video, display\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "\n", + "def make_preview(src: Path, crf: int = 28) -> Path:\n", + " \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached; rebuilt when `src` is newer).\"\"\"\n", + " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", + " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", + " subprocess.run(\n", + " [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", + " \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", + " \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(preview)],\n", + " check=True,\n", + " )\n", + " return preview\n", + "\n", + "\n", + "for record in av_records:\n", + " name = record[\"name\"]\n", + " src = av_fd_output_dir / name / \"vision.mp4\"\n", + " assert src.exists(), f\"missing: {src}\"\n", + " preview = make_preview(src)\n", + " print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", + " display(Video(str(preview), embed=True))\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-robotics-md", + "metadata": {}, + "source": [ + "## Robotics\n", + "\n", + "This example starts from a DROID dataset in LeRobot format and runs **multiview** forward-dynamics generation for robot manipulation **autoregressively**.\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-robotics-spec-md", + "metadata": {}, + "source": [ + "### Create the Robotics Autoregressive Forward-Dynamics Plan\n", + "\n", + "Robotics forward dynamics runs autoregressively over five contiguous 16-action DROID chunks. 
This cell writes the ground-truth (GT) first frame as the conditioning image for chunk 0 and one action JSON per chunk. Later chunks receive their conditioning image during the inference loop, taken from the last generated frame of the previous chunk.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-robotics-spec-code", + "metadata": {}, + "outputs": [], + "source": [ + "# `resolve_input` and the COSMOS3_* paths come from the variables cell.\n", + "import json\n", + "import os\n", + "import sys\n", + "\n", + "from PIL import Image\n", + "\n", + "# The notebook kernel may differ from the framework venv, so put the repo on the\n", + "# path before importing `cosmos_framework`. Skip this when the variable is unset.\n", + "COSMOS3_FRAMEWORK_PATH = os.environ.get(\"COSMOS3_FRAMEWORK_PATH\")\n", + "if COSMOS3_FRAMEWORK_PATH and COSMOS3_FRAMEWORK_PATH not in sys.path:\n", + " sys.path.insert(0, COSMOS3_FRAMEWORK_PATH)\n", + "\n", + "from cosmos_framework.data.vfm.action.datasets import DROIDLeRobotDataset\n", + "\n", + "import av\n", + "import torch\n", + "import numpy as np\n", + "import cosmos_framework.data.vfm.action.datasets.droid_lerobot_dataset as droid_ds\n", + "\n", + "# PyAV-based frame decoder: for each requested timestamp, return the nearest\n", + "# decoded frame as a float tensor (N, C, H, W) scaled to [0, 1].\n", + "def decode_video_frames_av(video_path, timestamps, tolerance_s, backend=None):\n", + " loaded_ts = []\n", + " loaded_frames = []\n", + "\n", + " with av.open(str(video_path)) as container:\n", + " stream = container.streams.video[0]\n", + " for frame in container.decode(stream):\n", + " ts = float(frame.pts * frame.time_base) if frame.pts is not None else float(frame.time)\n", + " loaded_ts.append(ts)\n", + " loaded_frames.append(frame.to_ndarray(format=\"rgb24\"))\n", + "\n", + " loaded_ts = torch.tensor(loaded_ts, dtype=torch.float32)\n", + " query_ts = torch.tensor([float(t) for t in timestamps], dtype=torch.float32)\n", + "\n", + " dist = torch.cdist(query_ts[:, None], loaded_ts[:, None], p=1)\n", + " min_dist, idx = dist.min(dim=1)\n", + " if not bool((min_dist < tolerance_s).all()):\n", + " raise ValueError(f\"No frame within tolerance: 
{min_dist.tolist()} > {tolerance_s}\")\n", + "\n", + " frames = np.stack([loaded_frames[int(i)] for i in idx], axis=0)\n", + " return torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0\n", + "\n", + "droid_ds.decode_video_frames = decode_video_frames_av\n", + "\n", + "robotics_dataset_root = resolve_input(\"cookbooks/cosmos3/generator/action/assets/droid_lerobot_example\")\n", + "robotics_dataset = DROIDLeRobotDataset(root=robotics_dataset_root)\n", + "robotics_num_chunks = 5\n", + "robotics_chunk_length = 16\n", + "robotics_chunk_starts = [chunk_idx * robotics_chunk_length for chunk_idx in range(robotics_num_chunks)]\n", + "assert robotics_chunk_starts[-1] < len(robotics_dataset)\n", + "\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "robotics_initial_vision_path = COSMOS3_INPUT_DIR / \"robotics_droid_autoregressive_input_chunk_00.png\"\n", + "robotics_records = []\n", + "\n", + "for chunk_idx, sample_idx in enumerate(robotics_chunk_starts):\n", + " robotics_sample = robotics_dataset[sample_idx]\n", + " assert int(robotics_sample[\"action\"].shape[0]) == robotics_chunk_length\n", + "\n", + " chunk_name = f\"robotics_action_cond_chunk_{chunk_idx:02d}\"\n", + " robotics_action_path = COSMOS3_INPUT_DIR / f\"robotics_droid_action_chunk_{chunk_idx:02d}.json\"\n", + " robotics_action_path.write_text(json.dumps(robotics_sample[\"action\"].cpu().tolist()))\n", + "\n", + " if chunk_idx == 0:\n", + " first_frame = robotics_sample[\"video\"][:, 0].permute(1, 2, 0).cpu().numpy()\n", + " Image.fromarray(first_frame).save(robotics_initial_vision_path)\n", + " vision_path = robotics_initial_vision_path\n", + " else:\n", + " vision_path = COSMOS3_INPUT_DIR / f\"robotics_droid_autoregressive_input_chunk_{chunk_idx:02d}.png\"\n", + "\n", + " robotics_records.append(\n", + " {\n", + " \"action_chunk_size\": robotics_chunk_length,\n", + " \"action_path\": str(robotics_action_path),\n", + " \"domain_name\": \"droid_lerobot\",\n", + " \"fps\": 
int(robotics_sample[\"conditioning_fps\"]),\n", + " \"image_size\": 480,\n", + " \"view_point\": robotics_sample[\"viewpoint\"],\n", + " \"model_mode\": \"forward_dynamics\",\n", + " \"name\": chunk_name,\n", + " \"prompt\": robotics_sample[\"ai_caption\"],\n", + " \"seed\": 0,\n", + " \"vision_path\": str(vision_path),\n", + " }\n", + " )\n", + "\n", + "robotics_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_robotics_custom.jsonl\"\n", + "robotics_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in robotics_records))\n", + "robotics_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_robotics_custom\"\n", + "\n", + "os.environ[\"COSMOS3_ROBOTICS_FD_INPUT\"] = str(robotics_fd_input_path)\n", + "os.environ[\"COSMOS3_ROBOTICS_FD_OUTPUT\"] = str(robotics_fd_output_dir)\n", + "\n", + "print(\"loaded DROID samples from:\", robotics_dataset_root)\n", + "print(\"chunk starts:\", robotics_chunk_starts)\n", + "print(\"total action frames:\", robotics_num_chunks * robotics_chunk_length)\n", + "print(\"wrote GT initial frame:\", robotics_initial_vision_path)\n", + "print(\"wrote robotics autoregressive plan:\", robotics_fd_input_path)\n", + "print(robotics_fd_input_path.read_text())\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-robotics-run-md", + "metadata": {}, + "source": [ + "### Run Robotics Autoregressive Forward-Dynamics Inference\n", + "\n", + "Runs `Cosmos3-Nano` once per robotics chunk through SGLang. Chunk 0 uses the DROID GT first frame. After each chunk finishes, the cell extracts that chunk's last generated frame and uses it as the conditioning image for the next chunk. 
Guardrails are disabled for this robotics run via `extra_params={\"guardrails\": false}`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-robotics-run-code", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import mimetypes\n", + "import subprocess\n", + "import time\n", + "from pathlib import Path\n", + "\n", + "import imageio_ffmpeg\n", + "from PIL import Image\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "\n", + "def check_sglang_server(timeout_s: int = 600, interval_s: int = 10) -> None:\n", + " deadline = time.time() + timeout_s\n", + " last_error: Exception | None = None\n", + " while time.time() < deadline:\n", + " try:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/models\", timeout=10)\n", + " response.raise_for_status()\n", + " print(response.json())\n", + " return\n", + " except requests.RequestException as exc:\n", + " last_error = exc\n", + " print(f\"Waiting for SGLang server at {SGLANG_BASE_URL}: {exc}\")\n", + " time.sleep(interval_s)\n", + " raise RuntimeError(\n", + " f\"SGLang server did not become ready at {SGLANG_BASE_URL} within {timeout_s}s. 
\"\n", + " \"Check `docker logs -f cosmos3-sglang-notebook`.\"\n", + " ) from last_error\n", + "\n", + "\n", + "def submit_forward_dynamics(record: dict, fd_output_dir: Path, *, disable_guardrails: bool = False) -> dict:\n", + " run_dir = fd_output_dir / record[\"name\"]\n", + " run_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " vision_path = Path(record[\"vision_path\"])\n", + " input_width, input_height = Image.open(vision_path).size\n", + " target_width = (input_width // 16) * 16\n", + " target_height = (input_height // 16) * 16\n", + " mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", + " extra_params = {\n", + " \"action_mode\": \"forward_dynamics\",\n", + " \"domain_name\": record[\"domain_name\"],\n", + " \"action_view_point\": record[\"view_point\"],\n", + " \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", + " \"guardrails\": False,\n", + " }\n", + " if disable_guardrails:\n", + " extra_params[\"guardrails\"] = False\n", + "\n", + " prompt = str(record.get(\"prompt\") or \"\").strip() or \" \"\n", + " form = {\n", + " \"prompt\": prompt,\n", + " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", + " \"fps\": record[\"fps\"],\n", + " \"size\": f\"{target_width}x{target_height}\",\n", + " \"num_inference_steps\": 30,\n", + " \"guidance_scale\": 1.0,\n", + " \"flow_shift\": 10.0,\n", + " \"seed\": record[\"seed\"],\n", + " \"extra_params\": json.dumps(extra_params),\n", + " }\n", + "\n", + " with vision_path.open(\"rb\") as image_file:\n", + " response = requests.post(\n", + " f\"{SGLANG_BASE_URL}/v1/videos\",\n", + " data={key: str(value) for key, value in form.items()},\n", + " files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", + " timeout=120,\n", + " )\n", + " if not response.ok:\n", + " (run_dir / \"error_response.txt\").write_text(response.text)\n", + " print(\"SGLang request failed:\", response.status_code)\n", + " print(response.text)\n", + " 
print(\"form:\", json.dumps(form, indent=2))\n", + " print(\"extra_params keys:\", sorted(extra_params))\n", + " print(\"action shape:\", [len(extra_params[\"action\"]), len(extra_params[\"action\"][0]) if extra_params[\"action\"] else 0])\n", + " response.raise_for_status()\n", + " initial = response.json()\n", + " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", + "\n", + " while True:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", + " response.raise_for_status()\n", + " final = response.json()\n", + " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", + " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + " raise RuntimeError(json.dumps(final, indent=2))\n", + " time.sleep(2)\n", + "\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", + " response.raise_for_status()\n", + " video_path = run_dir / \"vision.mp4\"\n", + " video_path.write_bytes(response.content)\n", + "\n", + " action = final.get(\"action\")\n", + " if action is not None:\n", + " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", + "\n", + " print(\"saved\", video_path)\n", + " if action is not None:\n", + " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", + " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path, \"action\": action}\n", + "\n", + "\n", + "check_sglang_server()\n", + "robotics_results = []\n", + "robotics_actual_records = []\n", + "current_vision_path = Path(robotics_records[0][\"vision_path\"])\n", + "assert current_vision_path.exists(), f\"missing initial conditioning image: {current_vision_path}\"\n", + "\n", + "for chunk_idx, base_record in 
enumerate(robotics_records):\n", + " record = dict(base_record)\n", + " record[\"vision_path\"] = str(current_vision_path)\n", + " robotics_records[chunk_idx][\"vision_path\"] = str(current_vision_path)\n", + " robotics_actual_records.append(record)\n", + "\n", + " print(f\"\\nSubmitting {record['name']}\")\n", + " print(\"conditioning image:\", current_vision_path)\n", + " result = submit_forward_dynamics(record, robotics_fd_output_dir, disable_guardrails=True)\n", + " robotics_results.append(result)\n", + "\n", + " if chunk_idx + 1 < len(robotics_records):\n", + " next_vision_path = COSMOS3_INPUT_DIR / f\"robotics_droid_autoregressive_input_chunk_{chunk_idx + 1:02d}.png\"\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(result[\"video_path\"]),\n", + " \"-vf\",\n", + " fr\"select=eq(n\\,{record['action_chunk_size']})\",\n", + " \"-frames:v\",\n", + " \"1\",\n", + " str(next_vision_path),\n", + " ],\n", + " check=True,\n", + " )\n", + " assert next_vision_path.exists(), f\"failed to extract next conditioning image: {next_vision_path}\"\n", + " current_vision_path = next_vision_path\n", + "\n", + "robotics_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in robotics_actual_records))\n", + "print(\"wrote autoregressive run spec:\", robotics_fd_input_path)\n", + "print(\"completed chunks:\", [record[\"name\"] for record in robotics_actual_records])\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-robotics-stitch-md", + "metadata": {}, + "source": [ + "### Stitch Robotics Generated Chunks\n", + "\n", + "Each autoregressive chunk video includes its conditioning frame at frame 0. 
This cell drops that first frame from every chunk and concatenates the remaining 16 generated frames per chunk into one 80-frame video.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-robotics-stitch-code", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "import imageio_ffmpeg\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "robotics_stitch_dir = robotics_fd_output_dir / \"_stitched_segments\"\n", + "robotics_stitch_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + "segment_paths = []\n", + "for record in robotics_records:\n", + " src = robotics_fd_output_dir / record[\"name\"] / \"vision.mp4\"\n", + " assert src.exists(), f\"missing: {src}\"\n", + "\n", + " segment = robotics_stitch_dir / f\"{record['name']}_generated.mp4\"\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(src),\n", + " \"-vf\",\n", + " r\"select=gte(n\\,1),setpts=N/FRAME_RATE/TB\",\n", + " \"-an\",\n", + " \"-r\",\n", + " str(record[\"fps\"]),\n", + " \"-c:v\",\n", + " \"libx264\",\n", + " \"-crf\",\n", + " \"18\",\n", + " \"-preset\",\n", + " \"veryfast\",\n", + " \"-pix_fmt\",\n", + " \"yuv420p\",\n", + " str(segment),\n", + " ],\n", + " check=True,\n", + " )\n", + " segment_paths.append(segment)\n", + "\n", + "concat_file = robotics_stitch_dir / \"concat.txt\"\n", + "concat_file.write_text(\"\".join(f\"file '{path.as_posix()}'\\n\" for path in segment_paths))\n", + "\n", + "robotics_stitched_video_path = robotics_fd_output_dir / \"robotics_action_cond_stitched.mp4\"\n", + "subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-f\",\n", + " \"concat\",\n", + " \"-safe\",\n", + " \"0\",\n", + " \"-i\",\n", + " str(concat_file),\n", + " \"-c\",\n", + " \"copy\",\n", + " str(robotics_stitched_video_path),\n", + " ],\n", + " check=True,\n", + ")\n", + "\n", + "print(\"stitched 
robotics video:\", robotics_stitched_video_path)\n", + "print(\"expected generated frames:\", len(robotics_records) * robotics_chunk_length)\n", + "print(\"size KB:\", robotics_stitched_video_path.stat().st_size // 1024)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fdvl-robotics-preview-md", + "metadata": {}, + "source": [ + "### Visualize Robotics Generated Videos\n", + "\n", + "`Video(..., embed=True)` base64-inlines a file into the notebook, and embedding full-resolution runs can freeze the front-end. This cell first displays a compact preview of the stitched 80-frame video when available, then previews each per-chunk video. The full-resolution `vision.mp4` files and stitched mp4 are left untouched.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdvl-robotics-preview-code", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "import imageio_ffmpeg\n", + "from IPython.display import Video, display\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "\n", + "def make_preview(src: Path, crf: int = 28) -> Path:\n", + " \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n", + " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", + " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", + " subprocess.run(\n", + " [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", + " \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", + " \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(preview)],\n", + " check=True,\n", + " )\n", + " return preview\n", + "\n", + "\n", + "if \"robotics_stitched_video_path\" in globals():\n", + " assert robotics_stitched_video_path.exists(), f\"missing: {robotics_stitched_video_path}\"\n", + " stitched_preview = make_preview(robotics_stitched_video_path)\n", + " print(\n", + " f\"stitched ({robotics_stitched_video_path.stat().st_size // 1024} KB -> \"\n", + " f\"{stitched_preview.stat().st_size // 
1024} KB preview)\"\n", + " )\n", + " display(Video(str(stitched_preview), embed=True))\n", + "\n", + "for record in robotics_records:\n", + " name = record[\"name\"]\n", + " src = robotics_fd_output_dir / name / \"vision.mp4\"\n", + " assert src.exists(), f\"missing: {src}\"\n", + " preview = make_preview(src)\n", + " print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", + " display(Video(str(preview), embed=True))\n" + ] + }, + { + "cell_type": "markdown", + "id": "bd2edde3", + "metadata": {}, + "source": [ + "## UMI\n", + "\n", + "This example runs UMI forward dynamics through SGLang autoregressively over all 16-action chunks in `assets/actions/umi.json`. The action file stores the raw UMI 10D action representation, so the setup cell validates the row dimension, writes one action JSON per chunk, and prepares a run plan." + ] + }, + { + "cell_type": "markdown", + "id": "5abea6a3", + "metadata": {}, + "source": [ + "### Create the UMI Autoregressive Forward-Dynamics Plan\n", + "\n", + "The UMI action file is stored as one JSON array with `16 * n` action rows. SGLang receives one request per 16-action chunk. Chunk 0 uses the checked-in UMI conditioning image; later chunks use conditioning images extracted from the previous generated video." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14cab52f", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "\n", + "umi_input_image = \"cookbooks/cosmos3/generator/action/assets/images/umi.png\"\n", + "umi_input_action = \"cookbooks/cosmos3/generator/action/assets/actions/umi.json\"\n", + "umi_prompt = \"mouse arrangement\"\n", + "umi_fps = 20\n", + "umi_action_chunk_size = 16\n", + "umi_raw_action_dim = 10\n", + "\n", + "umi_initial_vision_path = Path(resolve_input(umi_input_image))\n", + "umi_source_action_path = Path(resolve_input(umi_input_action))\n", + "umi_action = json.loads(umi_source_action_path.read_text())\n", + "assert len(umi_action) % umi_action_chunk_size == 0, (\n", + " f\"expected action count to be divisible by {umi_action_chunk_size}, got {len(umi_action)}\"\n", + ")\n", + "assert all(len(row) == umi_raw_action_dim for row in umi_action), \"UMI action rows must be 10D\"\n", + "\n", + "umi_num_chunks = len(umi_action) // umi_action_chunk_size\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "umi_records = []\n", + "\n", + "for chunk_idx in range(umi_num_chunks):\n", + " chunk_name = f\"umi_action_cond_chunk_{chunk_idx:02d}\"\n", + " start = chunk_idx * umi_action_chunk_size\n", + " end = start + umi_action_chunk_size\n", + " action_chunk_10d = umi_action[start:end]\n", + " umi_action_path = COSMOS3_INPUT_DIR / f\"umi_action_chunk_{chunk_idx:02d}_10d.json\"\n", + " umi_action_path.write_text(json.dumps(action_chunk_10d, indent=2) + \"\\n\")\n", + "\n", + " if chunk_idx == 0:\n", + " vision_path = umi_initial_vision_path\n", + " else:\n", + " vision_path = COSMOS3_INPUT_DIR / f\"umi_autoregressive_input_chunk_{chunk_idx:02d}.png\"\n", + "\n", + " umi_records.append(\n", + " {\n", + " \"action_chunk_size\": umi_action_chunk_size,\n", + " \"action_path\": str(umi_action_path),\n", + " \"domain_name\": \"umi\",\n", + " \"fps\": umi_fps,\n", + " 
\"image_size\": 256,\n", + " \"view_point\": \"ego_view\",\n", + " \"model_mode\": \"forward_dynamics\",\n", + " \"name\": chunk_name,\n", + " \"prompt\": umi_prompt,\n", + " \"seed\": chunk_idx,\n", + " \"vision_path\": str(vision_path),\n", + " }\n", + " )\n", + "\n", + "umi_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_umi_custom.jsonl\"\n", + "umi_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in umi_records))\n", + "umi_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_umi_custom\"\n", + "\n", + "print(\"UMI chunks:\", umi_num_chunks)\n", + "print(\"wrote UMI spec:\", umi_fd_input_path)\n", + "print(umi_fd_input_path.read_text())" + ] + }, + { + "cell_type": "markdown", + "id": "50abf709", + "metadata": {}, + "source": [ + "### Run UMI Autoregressive Forward-Dynamics Inference\n", + "\n", + "Runs one SGLang video request per UMI action chunk. After each chunk completes, the cell extracts that chunk's last generated frame and uses it as the conditioning image for the next chunk." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d8f37f5", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import mimetypes\n", + "import subprocess\n", + "import time\n", + "from pathlib import Path\n", + "\n", + "import imageio_ffmpeg\n", + "from PIL import Image\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "\n", + "def check_sglang_server_for_umi(timeout_s: int = 600, interval_s: int = 10) -> None:\n", + " deadline = time.time() + timeout_s\n", + " last_error: Exception | None = None\n", + " while time.time() < deadline:\n", + " try:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/models\", timeout=10)\n", + " response.raise_for_status()\n", + " print(response.json())\n", + " return\n", + " except requests.RequestException as exc:\n", + " last_error = exc\n", + " print(f\"Waiting for SGLang server at {SGLANG_BASE_URL}: {exc}\")\n", + " time.sleep(interval_s)\n", + " raise RuntimeError(\n", + " f\"SGLang server did not become ready at {SGLANG_BASE_URL} within {timeout_s}s. 
\"\n", + " \"Check `docker logs -f cosmos3-sglang-notebook`.\"\n", + " ) from last_error\n", + "\n", + "\n", + "def submit_umi_forward_dynamics(record: dict, fd_output_dir: Path) -> dict:\n", + " run_dir = fd_output_dir / record[\"name\"]\n", + " run_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " vision_path = Path(record[\"vision_path\"])\n", + " input_width, input_height = Image.open(vision_path).size\n", + " mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", + " extra_params = {\n", + " \"action_mode\": \"forward_dynamics\",\n", + " \"domain_name\": record[\"domain_name\"],\n", + " \"action_view_point\": record[\"view_point\"],\n", + " \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", + " \"guardrails\": False,\n", + " }\n", + " prompt = str(record.get(\"prompt\") or \"\").strip() or \"A robot manipulates an object.\"\n", + " form = {\n", + " \"prompt\": prompt,\n", + " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", + " \"fps\": record[\"fps\"],\n", + " \"size\": f\"{input_width}x{input_height}\",\n", + " \"num_inference_steps\": 30,\n", + " \"guidance_scale\": 1.0,\n", + " \"flow_shift\": 10.0,\n", + " \"seed\": record[\"seed\"],\n", + " \"extra_params\": json.dumps(extra_params),\n", + " }\n", + "\n", + " with vision_path.open(\"rb\") as image_file:\n", + " response = requests.post(\n", + " f\"{SGLANG_BASE_URL}/v1/videos\",\n", + " data={key: str(value) for key, value in form.items()},\n", + " files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", + " timeout=120,\n", + " )\n", + " if not response.ok:\n", + " (run_dir / \"error_response.txt\").write_text(response.text)\n", + " print(\"SGLang request failed:\", response.status_code)\n", + " print(response.text)\n", + " print(\"form:\", json.dumps(form, indent=2))\n", + " print(\"extra_params keys:\", sorted(extra_params))\n", + " print(\"action shape:\", [len(extra_params[\"action\"]), 
len(extra_params[\"action\"][0]) if extra_params[\"action\"] else 0])\n", + " response.raise_for_status()\n", + "\n", + " initial = response.json()\n", + " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", + "\n", + " while True:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", + " response.raise_for_status()\n", + " final = response.json()\n", + " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", + " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + " raise RuntimeError(json.dumps(final, indent=2))\n", + " time.sleep(2)\n", + "\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", + " response.raise_for_status()\n", + " video_path = run_dir / \"vision.mp4\"\n", + " video_path.write_bytes(response.content)\n", + " print(\"saved\", video_path)\n", + " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path}\n", + "\n", + "\n", + "check_sglang_server_for_umi()\n", + "umi_results = []\n", + "umi_actual_records = []\n", + "current_vision_path = Path(umi_records[0][\"vision_path\"])\n", + "assert current_vision_path.exists(), f\"missing initial conditioning image: {current_vision_path}\"\n", + "\n", + "for chunk_idx, base_record in enumerate(umi_records):\n", + " record = dict(base_record)\n", + " record[\"vision_path\"] = str(current_vision_path)\n", + " umi_records[chunk_idx][\"vision_path\"] = str(current_vision_path)\n", + " umi_actual_records.append(record)\n", + "\n", + " print(f\"\\nSubmitting {record['name']}\")\n", + " print(\"conditioning image:\", current_vision_path)\n", + " result = submit_umi_forward_dynamics(record, umi_fd_output_dir)\n", + " umi_results.append(result)\n", + "\n", + " 
if chunk_idx + 1 < len(umi_records):\n", + " next_vision_path = COSMOS3_INPUT_DIR / f\"umi_autoregressive_input_chunk_{chunk_idx + 1:02d}.png\"\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(result[\"video_path\"]),\n", + " \"-vf\",\n", + " fr\"select=eq(n\\,{record['action_chunk_size']})\",\n", + " \"-frames:v\",\n", + " \"1\",\n", + " str(next_vision_path),\n", + " ],\n", + " check=True,\n", + " )\n", + " assert next_vision_path.exists(), f\"failed to extract next conditioning image: {next_vision_path}\"\n", + " current_vision_path = next_vision_path\n", + "\n", + "umi_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in umi_actual_records))\n", + "print(\"wrote autoregressive UMI run spec:\", umi_fd_input_path)\n", + "print(\"completed UMI chunks:\", [record[\"name\"] for record in umi_actual_records])" + ] + }, + { + "cell_type": "markdown", + "id": "cc77c77a", + "metadata": {}, + "source": [ + "### Stitch and Visualize UMI Generated Chunks\n", + "\n", + "Each autoregressive chunk video includes its conditioning frame at frame 0. This cell drops that first frame from every chunk, concatenates the generated frames into one rollout video, transcodes a compact preview, and embeds it in the notebook." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93e49e34", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "import imageio_ffmpeg\n", + "from IPython.display import Video, display\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "umi_video_paths = [umi_fd_output_dir / record[\"name\"] / \"vision.mp4\" for record in umi_records]\n", + "for path in umi_video_paths:\n", + " assert path.exists(), f\"missing UMI chunk video: {path}\"\n", + "\n", + "umi_stitch_dir = umi_fd_output_dir / \"_stitched_segments\"\n", + "umi_stitch_dir.mkdir(parents=True, exist_ok=True)\n", + "segment_paths = []\n", + "for record, src in zip(umi_records, umi_video_paths, strict=True):\n", + " segment = umi_stitch_dir / f\"{record['name']}_generated.mp4\"\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(src),\n", + " \"-vf\",\n", + " r\"select=gte(n\\,1),setpts=N/FRAME_RATE/TB\",\n", + " \"-an\",\n", + " \"-r\",\n", + " str(record[\"fps\"]),\n", + " \"-c:v\",\n", + " \"libx264\",\n", + " \"-crf\",\n", + " \"18\",\n", + " \"-preset\",\n", + " \"veryfast\",\n", + " \"-pix_fmt\",\n", + " \"yuv420p\",\n", + " str(segment),\n", + " ],\n", + " check=True,\n", + " )\n", + " segment_paths.append(segment)\n", + "\n", + "concat_file = umi_stitch_dir / \"umi_concat.txt\"\n", + "concat_file.write_text(\"\".join(f\"file '{path.as_posix()}'\\n\" for path in segment_paths))\n", + "umi_stitched_video_path = umi_fd_output_dir / \"umi_action_cond_stitched.mp4\"\n", + "subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-f\",\n", + " \"concat\",\n", + " \"-safe\",\n", + " \"0\",\n", + " \"-i\",\n", + " str(concat_file),\n", + " \"-c\",\n", + " \"copy\",\n", + " str(umi_stitched_video_path),\n", + " ],\n", + " check=True,\n", + ")\n", + "\n", + "\n", + "def make_umi_preview(src: Path, crf: int = 28) 
-> Path:\n", + " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", + " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(src),\n", + " \"-c:v\",\n", + " \"libx264\",\n", + " \"-crf\",\n", + " str(crf),\n", + " \"-preset\",\n", + " \"veryfast\",\n", + " \"-an\",\n", + " \"-pix_fmt\",\n", + " \"yuv420p\",\n", + " str(preview),\n", + " ],\n", + " check=True,\n", + " )\n", + " return preview\n", + "\n", + "umi_preview_path = make_umi_preview(umi_stitched_video_path)\n", + "print(\"stitched UMI video:\", umi_stitched_video_path)\n", + "print(\"expected generated frames:\", len(umi_records) * umi_action_chunk_size)\n", + "print(f\"UMI preview: {umi_preview_path}\")\n", + "display(Video(str(umi_preview_path), embed=True))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a543c008", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb b/cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb new file mode 100644 index 00000000..5d9e6197 --- /dev/null +++ b/cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb @@ -0,0 +1,449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "license-header", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cosmos3 Nano Action: Inverse Dynamics with SGLang\n", + "\n", + "This notebook runs 
Cosmos3 Nano **action inverse-dynamics** inference through the SGLang OpenAI-compatible video API:\n", + "\n", + "```text\n", + "POST /v1/videos\n", + "```\n", + "\n", + "Inverse dynamics is the reverse of forward dynamics: given a video, it predicts the ego-motion (action) trajectory that produced it. This notebook builds the same custom input spec as [`run_id_with_cosmos_framework.ipynb`](./run_id_with_cosmos_framework.ipynb), keeps the same input-video preview and predicted-trajectory visualization, and only changes the environment setup plus the inference call.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start SGLang Server\n", + "\n", + "Start the server in a terminal from the `cosmos` repo root.\n", + "\n", + "```bash\n", + "docker rm -f cosmos3-sglang-notebook 2>/dev/null || true\n", + "\n", + "docker run -d --name cosmos3-sglang-notebook \\\n", + " --runtime nvidia --gpus '\"device=0\"' \\\n", + " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", + " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", + " -v \"$PWD:/workspace\" \\\n", + " -p 30000:30000 --ipc=host \\\n", + " lmsysorg/sglang:dev \\\n", + " sglang serve \\\n", + " --model-path nvidia/Cosmos3-Nano \\\n", + " --host 0.0.0.0\n", + "\n", + "# Wait until this returns model metadata before running the inference cell.\n", + "curl http://localhost:30000/v1/models\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "\n", + "def find_repo_root(start: Path) -> Path:\n", + " for path in [start, *start.parents]:\n", + " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", + " return path\n", + " return start\n", + "\n", + "\n", + "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", + "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n", + "COSMOS3_OUTPUT_ROOT = 
Path(\n", + " os.environ.get(\"COSMOS3_SGLANG_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_sglang\")\n", + ").resolve()\n", + "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", + "SGLANG_BASE_URL = os.environ.get(\"COSMOS3_SGLANG_BASE_URL\", \"http://localhost:30000\").rstrip(\"/\")\n", + "\n", + "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", + "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n", + "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n", + "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n", + "print(\"COSMOS3_SGLANG_BASE_URL:\", SGLANG_BASE_URL)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create the Inverse-Dynamics Input Spec\n", + "\n", + "Inverse-dynamics inference is driven by a JSONL spec, one line per run. Unlike forward dynamics, each line provides only an input video (`vision_path`) and **no** `action_path` — the action is what the model predicts.\n", + "\n", + "This cell builds that spec from local AV videos, writing it under:\n", + "\n", + "```text\n", + "outputs/cosmos3_action_sglang/inputs/action_inverse_dynamics_av_custom.jsonl\n", + "```\n", + "\n", + "It mirrors the native PyTorch notebook's spec format. 
The `vision_path` is written as an absolute path because the SGLang request cell reads the spec records directly.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd4b3ff8", + "metadata": {}, + "outputs": [], + "source": [ + "# `os` and the COSMOS3_* paths come from the configuration cell.\n", + "import json\n", + "\n", + "# Local inputs, relative to the cosmos repo root.\n", + "input_videos = {\n", + " \"av_inverse_0\": \"cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4\",\n", + " \"av_inverse_1\": \"cookbooks/cosmos3/generator/action/assets/videos/av_1.mp4\",\n", + "}\n", + "\n", + "def resolve_input(rel_path: str) -> str:\n", + " path = (COSMOS_ROOT / rel_path).resolve()\n", + " assert path.exists(), f\"missing input: {path}\"\n", + " return str(path)\n", + "\n", + "records = [\n", + " {\n", + " \"action_chunk_size\": 60,\n", + " \"domain_name\": \"av\",\n", + " \"fps\": 10,\n", + " \"image_size\": 480,\n", + " \"view_point\": \"ego_view\",\n", + " \"model_mode\": \"inverse_dynamics\",\n", + " \"name\": name,\n", + " \"prompt\": \"You are an autonomous vehicle planning system.\",\n", + " \"seed\": 0,\n", + " \"vision_path\": resolve_input(video_rel),\n", + " }\n", + " for name, video_rel in input_videos.items()\n", + "]\n", + "\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "id_input_path = COSMOS3_INPUT_DIR / \"action_inverse_dynamics_av_custom.jsonl\"\n", + "id_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in records))\n", + "id_output_dir = COSMOS3_OUTPUT_ROOT / \"action_inverse_dynamics_av_custom\"\n", + "\n", + "# Also export the spec and output paths so shell commands and subprocesses can read them.\n", + "os.environ[\"COSMOS3_ID_INPUT\"] = str(id_input_path)\n", + "os.environ[\"COSMOS3_ID_OUTPUT\"] = str(id_output_dir)\n", + "\n", + "print(\"wrote spec:\", id_input_path)\n", + "print(\"runs:\", list(input_videos))\n", + "print(id_input_path.read_text())" + ] + }, + { + "cell_type": 
"markdown", + "id": "0f17af65", + "metadata": {}, + "source": [ + "## Preview the Input Video(s)\n", + "\n", + "Preview each input video before running inference. `Video(..., embed=True)` base64-inlines the file, and these AV clips are several MB each, so we first transcode a small preview (~150 KB) with the ffmpeg binary bundled in `imageio-ffmpeg` (installed by `uv sync`), then embed it. The original input videos are left untouched." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "293b1dfb", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "import imageio_ffmpeg\n", + "from IPython.display import Video, display\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "def make_preview(src: Path, dst: Path, crf: int = 28) -> Path:\n", + " \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n", + " if not dst.exists():\n", + " subprocess.run(\n", + " [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", + " \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", + " \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(dst)],\n", + " check=True,\n", + " )\n", + " return dst\n", + "\n", + "# `records` comes from the prepare cell; preview each input video.\n", + "for record in records:\n", + " name = record[\"name\"]\n", + " src = Path(record[\"vision_path\"])\n", + " preview = make_preview(src, COSMOS3_INPUT_DIR / f\"{name}_input_preview.mp4\")\n", + " print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", + " display(Video(str(preview), embed=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run Inverse-Dynamics Inference\n", + "\n", + "Runs `Cosmos3-Nano` on every line of the spec through SGLang. 
Inverse dynamics predicts an action, and this cell writes a PyTorch-compatible result file for each run:\n", + "\n", + "```text\n", + "//sample_outputs.json\n", + "```\n", + "\n", + "The predicted action trajectory is stored under `outputs[0].content[\"action\"]`, matching the native notebook's visualization cell. The cell also saves `response.json`, `final.json`, and `action.json` in each run directory for API debugging.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import time\n", + "from pathlib import Path\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "\n", + "def check_sglang_server() -> None:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/models\", timeout=10)\n", + " response.raise_for_status()\n", + " print(response.json())\n", + "\n", + "\n", + "def submit_inverse_dynamics(record: dict) -> dict:\n", + " run_dir = id_output_dir / record[\"name\"]\n", + " run_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " video_path = Path(record[\"vision_path\"])\n", + " extra_params = {\n", + " \"action_mode\": \"inverse_dynamics\",\n", + " \"domain_name\": record[\"domain_name\"],\n", + " \"view_point\": record[\"view_point\"],\n", + " \"raw_action_dim\": 9,\n", + " \"guardrails\": False,\n", + " }\n", + " form = {\n", + " \"prompt\": record[\"prompt\"],\n", + " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", + " \"fps\": record[\"fps\"],\n", + " \"num_inference_steps\": 30,\n", + " \"guidance_scale\": 1.0,\n", + " \"flow_shift\": 10.0,\n", + " \"seed\": record[\"seed\"],\n", + " \"extra_params\": json.dumps(extra_params),\n", + " }\n", + "\n", + " with video_path.open(\"rb\") as video_file:\n", + " response = requests.post(\n", + " f\"{SGLANG_BASE_URL}/v1/videos\",\n", + " data={key: str(value) for key, value in form.items()},\n", + 
" files={\"input_reference\": (video_path.name, video_file, \"video/mp4\")},\n", + " timeout=120,\n", + " )\n", + " response.raise_for_status()\n", + " initial = response.json()\n", + " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", + "\n", + " while True:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", + " response.raise_for_status()\n", + " final = response.json()\n", + " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", + " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + " raise RuntimeError(json.dumps(final, indent=2))\n", + " time.sleep(2)\n", + "\n", + " action = final.get(\"action\")\n", + " if not action or \"data\" not in action:\n", + " raise RuntimeError(f\"SGLang response did not include action data: {json.dumps(final, indent=2)}\")\n", + " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", + "\n", + " sample_outputs = {\"outputs\": [{\"content\": {\"action\": action[\"data\"]}}]}\n", + " (run_dir / \"sample_outputs.json\").write_text(json.dumps(sample_outputs, indent=2))\n", + "\n", + " print(\"saved\", run_dir / \"sample_outputs.json\")\n", + " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", + " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"action\": action}\n", + "\n", + "\n", + "check_sglang_server()\n", + "results = []\n", + "for record in records:\n", + " print(f\"\\nSubmitting {record['name']}\")\n", + " results.append(submit_inverse_dynamics(record))\n" + ] + }, + { + "cell_type": "markdown", + "id": "324e6378", + "metadata": {}, + "source": [ + "## Visualize the Predicted Action\n", + "\n", + "Plot the action the model predicted from each input video, as a 3D camera path (with 
frustums) and a top-down bird's-eye view. The action is read from each run's `sample_outputs.json` and interpreted with the AV pose convention." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7808372", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import json\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.collections import LineCollection\n", + "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n", + "\n", + "# The notebook kernel may differ from the framework venv, so put the repo on the\n", + "# path before importing `cosmos_framework`.\n", + "COSMOS3_FRAMEWORK_PATH = os.environ.get(\"COSMOS3_FRAMEWORK_PATH\")\n", + "if COSMOS3_FRAMEWORK_PATH and COSMOS3_FRAMEWORK_PATH not in sys.path:\n", + " sys.path.insert(0, COSMOS3_FRAMEWORK_PATH)\n", + "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n", + "\n", + "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n", + "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n", + "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]\n", + "\n", + "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n", + " fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n", + " title=None, save_path=None, show=True):\n", + " \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\n", + "\n", + " AV convention: world Y is up, world +Z is the heading. `vertical_exaggeration`\n", + " stretches only the up-axis box (uniform world scaling, so frustums never skew);\n", + " 1.0 = geometrically faithful. 
The 3D plot reorders world (X, Y, Z) -> (X, Z, Y)\n", + " so Y points up on screen.\n", + " \"\"\"\n", + " poses_abs = np.asarray(poses_abs)\n", + " pos = poses_abs[:, :3, 3] # camera centers [T, 3]\n", + " fwd = poses_abs[:, :3, 2] # heading (+Z) [T, 3]\n", + " T = len(pos)\n", + " colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n", + " scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n", + " step = max(1, T // max(n_frustums, 1))\n", + " xzy = [0, 2, 1] # world (X,Y,Z) -> plot (X, Z, Y-up)\n", + "\n", + " fig = plt.figure(figsize=(14, 6))\n", + "\n", + " # (1) 3D perspective with frustums\n", + " ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n", + " path = pos[:, xzy]\n", + " ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n", + " lines, lcolors, allpts = [], [], [path]\n", + " for i in range(0, T, step):\n", + " cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n", + " @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy] # frustum in plot coords\n", + " allpts.append(cw)\n", + " lines += [[cw[a], cw[b]] for a, b in _EDGES]\n", + " lcolors += [colors[i]] * len(_EDGES)\n", + " ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n", + " ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", + " ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", + " rng = np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n", + " ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n", + " ax.set_xlabel(\"X (m)\", labelpad=12); ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n", + " ax.set_zlabel(\"Y up (m)\", labelpad=10); ax.set_zticks([])\n", + " ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n", + " ax.legend(loc=\"upper left\"); ax.view_init(elev=22, azim=-70)\n", + "\n", + " # (2) top-down bird's-eye view (X-Z ground plane)\n", + " ax2 = fig.add_subplot(1, 2, 2)\n", + " 
seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], axis=1)\n", + " lc = LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n", + " lc.set_array(np.arange(T - 1)); ax2.add_collection(lc)\n", + " ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n", + " color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n", + " ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", + " ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", + " ax2.set_xlabel(\"X (m)\"); ax2.set_ylabel(\"Z forward (m)\")\n", + " ax2.set_title(\"Top-down (bird's-eye view)\")\n", + " ax2.set_aspect(\"equal\", adjustable=\"datalim\"); ax2.autoscale_view(); ax2.legend()\n", + " fig.colorbar(lc, ax=ax2, label=\"frame index\")\n", + "\n", + " plt.tight_layout(w_pad=6)\n", + " if save_path:\n", + " fig.savefig(save_path, dpi=120, bbox_inches=\"tight\"); print(\"saved\", save_path)\n", + " if show:\n", + " plt.show()\n", + "\n", + "# `records` and `id_output_dir` come from the prepare cell; read each run's\n", + "# predicted action from its sample_outputs.json.\n", + "for record in records:\n", + " name = record[\"name\"]\n", + " outputs = json.loads((id_output_dir / name / \"sample_outputs.json\").read_text())\n", + " poses_rel = np.array(outputs[\"outputs\"][0][\"content\"][\"action\"][0]) # [T-1, 9] = [translation(3), rot6d(6)]\n", + "\n", + " # AV action convention (see cosmos_framework/data/vfm/action/av_dataset.py):\n", + " # rot6d rotation, backward_framewise, translation_scale = 1.35.\n", + " poses_abs = pose_rel_to_abs(\n", + " poses_rel,\n", + " rotation_format=\"rot6d\",\n", + " pose_convention=\"backward_framewise\",\n", + " translation_scale=1.35,\n", + " ) # [T, 4, 4] camera-to-world\n", + " print(name, poses_rel.shape, poses_abs.shape)\n", + " visualize_pose(poses_abs, title=f\"{name}: predicted camera trajectory + frustums 
({len(poses_abs)} frames)\", show=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59665c59", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/cookbooks/cosmos3/generator/action/run_policy_with_sglang.ipynb b/cookbooks/cosmos3/generator/action/run_policy_with_sglang.ipynb new file mode 100644 index 00000000..dcd29869 --- /dev/null +++ b/cookbooks/cosmos3/generator/action/run_policy_with_sglang.ipynb @@ -0,0 +1,417 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "license-header", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "id": "policy-title", + "metadata": {}, + "source": [ + "# Cosmos3 Nano Action: Policy with SGLang\n", + "\n", + "## Prerequisites\n", + "\n", + "Generator requires the Guardrail. Request access to the gated [nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail) HF repository before running guarded examples. This notebook disables guardrails for the policy request with `guardrails: false` in `extra_params`.\n", + "\n", + "## Overview\n", + "\n", + "This notebook runs Cosmos3 Nano **action policy** inference through SGLang using the checked-in DROID LeRobot sample under `assets/droid_lerobot_example`.\n", + "\n", + "It sends `POST /v1/videos` requests with a first frame and instruction, then retrieves a rollout video plus top-level `action` metadata." 
+ ] + }, + { + "cell_type": "markdown", + "id": "policy-server-md", + "metadata": {}, + "source": [ + "## Start SGLang Policy Server\n", + "\n", + "Start the server in a terminal from the `cosmos` repo root.\n", + "\n", + "```bash\n", + "docker rm -f cosmos3-sglang-policy-notebook 2>/dev/null || true\n", + "\n", + "docker run -d --name cosmos3-sglang-policy-notebook \\\n", + " --runtime nvidia --gpus '\"device=0\"' \\\n", + " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", + " -e PYTHONPATH=/workspace/cosmos-framework \\\n", + " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", + " -v \"$PWD:/workspace\" \\\n", + " -p 30000:30000 --ipc=host \\\n", + " lmsysorg/sglang:dev \\\n", + " sglang serve \\\n", + " --model-path nvidia/Cosmos3-Nano-Policy-DROID \\\n", + " --host 0.0.0.0\n", + "\n", + "# Wait until this returns model metadata before running the inference cells.\n", + "curl http://localhost:30000/v1/models\n", + "```\n", + "\n", + "To inspect startup logs:\n", + "\n", + "```bash\n", + "docker logs -f cosmos3-sglang-policy-notebook\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "policy-vars-code", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "\n", + "def find_repo_root(start: Path) -> Path:\n", + " for path in [start, *start.parents]:\n", + " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", + " return path\n", + " return start\n", + "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", + "COSMOS3_OUTPUT_ROOT = Path(\n", + " os.environ.get(\"COSMOS3_SGLANG_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_sglang\")\n", + ").resolve()\n", + "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", + "COSMOS3_POLICY_OUTPUT_DIR = COSMOS3_OUTPUT_ROOT / \"action_policy_droid\"\n", + "COSMOS3_ACTION_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"generator\" / \"action\"\n", + "DROID_ASSET_ROOT = COSMOS3_ACTION_ROOT / \"assets\" / 
\"droid_lerobot_example\"\n", + "SGLANG_BASE_URL = os.environ.get(\"COSMOS3_SGLANG_BASE_URL\", \"http://localhost:30000\").rstrip(\"/\")\n", + "SGLANG_MODEL = os.environ.get(\"COSMOS3_SGLANG_MODEL\", \"nvidia/Cosmos3-Nano-Policy-DROID\")\n", + "\n", + "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "COSMOS3_POLICY_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", + "print(\"DROID_ASSET_ROOT:\", DROID_ASSET_ROOT)\n", + "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n", + "print(\"COSMOS3_POLICY_OUTPUT_DIR:\", COSMOS3_POLICY_OUTPUT_DIR)\n", + "print(\"COSMOS3_SGLANG_BASE_URL:\", SGLANG_BASE_URL)\n", + "print(\"COSMOS3_SGLANG_MODEL:\", SGLANG_MODEL)" + ] + }, + { + "cell_type": "markdown", + "id": "policy-input-md", + "metadata": {}, + "source": [ + "## Prepare a DROID Policy Input\n", + "\n", + "This cell extracts the first frame from each checked-in DROID camera video and creates a 640x540 multiview conditioning image for the video API." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "policy-input-code", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "import numpy as np\n", + "from PIL import Image, ImageOps\n", + "from IPython.display import display\n", + "\n", + "try:\n", + " import imageio_ffmpeg\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install imageio-ffmpeg in this notebook kernel: pip install imageio-ffmpeg\") from exc\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "CAMERA_VIDEO_PATHS = {\n", + " \"observation/wrist_image_left\": DROID_ASSET_ROOT / \"videos\" / \"observation.image.wrist_image_left\" / \"chunk-000\" / \"file-000.mp4\",\n", + " \"observation/exterior_image_1_left\": DROID_ASSET_ROOT / \"videos\" / \"observation.image.exterior_image_1_left\" / \"chunk-000\" / \"file-000.mp4\",\n", + " \"observation/exterior_image_2_left\": DROID_ASSET_ROOT / \"videos\" / \"observation.image.exterior_image_2_left\" / \"chunk-000\" / \"file-000.mp4\",\n", + "}\n", + "for key, video_path in CAMERA_VIDEO_PATHS.items():\n", + " assert video_path.exists(), f\"missing {key}: {video_path}\"\n", + "\n", + "\n", + "def extract_first_frame(video_path: Path, out_path: Path) -> Path:\n", + " if not out_path.exists() or out_path.stat().st_mtime < video_path.stat().st_mtime:\n", + " subprocess.run(\n", + " [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(video_path), \"-frames:v\", \"1\", str(out_path)],\n", + " check=True,\n", + " )\n", + " return out_path\n", + "\n", + "\n", + "frame_paths = {\n", + " key: extract_first_frame(video_path, COSMOS3_INPUT_DIR / f\"policy_{key.split('/')[-1]}.png\")\n", + " for key, video_path in CAMERA_VIDEO_PATHS.items()\n", + "}\n", + "frames = {key: Image.open(path).convert(\"RGB\") for key, path in frame_paths.items()}\n", + "\n", + "# DROID policy uses a concatenated multi-view frame for the /v1/videos path.\n", + "target_w, target_h = 640, 540\n", + "top_h = target_h // 
2\n", + "bottom_h = target_h - top_h\n", + "half_w = target_w // 2\n", + "wrist = ImageOps.fit(frames[\"observation/wrist_image_left\"], (target_w, top_h), method=Image.Resampling.BICUBIC)\n", + "left = ImageOps.fit(frames[\"observation/exterior_image_1_left\"], (half_w, bottom_h), method=Image.Resampling.BICUBIC)\n", + "right = ImageOps.fit(frames[\"observation/exterior_image_2_left\"], (half_w, bottom_h), method=Image.Resampling.BICUBIC)\n", + "policy_image = Image.new(\"RGB\", (target_w, target_h))\n", + "policy_image.paste(wrist, (0, 0))\n", + "policy_image.paste(left, (0, top_h))\n", + "policy_image.paste(right, (half_w, top_h))\n", + "policy_image_path = COSMOS3_INPUT_DIR / \"droid_policy_first_frame.png\"\n", + "policy_image.save(policy_image_path)\n", + "\n", + "policy_prompt = os.environ.get(\n", + " \"COSMOS3_POLICY_PROMPT\",\n", + " \"Pick up the object and place it in the target container.\",\n", + ")\n", + "print(\"policy prompt:\", policy_prompt)\n", + "print(\"video API conditioning image:\", policy_image_path, policy_image.size)\n", + "display(policy_image)" + ] + }, + { + "cell_type": "markdown", + "id": "policy-video-md", + "metadata": {}, + "source": [ + "## Run Policy Inference Through `/v1/videos`\n", + "\n", + "This path behaves like the forward/inverse SGLang action notebooks: it sends a multipart request to the OpenAI-compatible video API, polls the async job, writes the generated rollout video, and saves the predicted action from the response metadata." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "policy-video-code", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import time\n", + "from pathlib import Path\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "\n", + "def check_sglang_server(timeout_s: int = 600, interval_s: int = 10) -> None:\n", + " deadline = time.time() + timeout_s\n", + " last_error: Exception | None = None\n", + " while time.time() < deadline:\n", + " try:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/models\", timeout=10)\n", + " response.raise_for_status()\n", + " print(response.json())\n", + " return\n", + " except requests.RequestException as exc:\n", + " last_error = exc\n", + " print(f\"Waiting for SGLang server at {SGLANG_BASE_URL}: {exc}\")\n", + " time.sleep(interval_s)\n", + " raise RuntimeError(\n", + " f\"SGLang server did not become ready at {SGLANG_BASE_URL} within {timeout_s}s. 
\"\n", + " \"Check `docker logs -f cosmos3-sglang-policy-notebook`.\"\n", + " ) from last_error\n", + "\n", + "\n", + "ACTION_VIDEO_RES_SIZE_INFO = {\n", + " \"480\": {\n", + " \"1,1\": (640, 640),\n", + " \"4,3\": (736, 544),\n", + " \"3,4\": (544, 736),\n", + " \"16,9\": (832, 480),\n", + " \"9,16\": (480, 832),\n", + " }\n", + "}\n", + "\n", + "\n", + "def closest_action_size(height: int, width: int, resolution: str = \"480\") -> tuple[int, int]:\n", + " input_ratio = height / width\n", + " candidates = ACTION_VIDEO_RES_SIZE_INFO[resolution].values()\n", + " return min(candidates, key=lambda size: abs(input_ratio - size[1] / size[0]))\n", + "\n", + "\n", + "def submit_policy_video() -> dict:\n", + " run_dir = COSMOS3_POLICY_OUTPUT_DIR / \"video_api\"\n", + " run_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " input_width, input_height = Image.open(policy_image_path).size\n", + " target_width, target_height = closest_action_size(input_height, input_width)\n", + " extra_params = {\n", + " \"action_mode\": \"policy\",\n", + " \"domain_name\": \"droid_lerobot\",\n", + " \"raw_action_dim\": 10,\n", + " \"action_view_point\": \"concat_view\",\n", + " \"guardrails\": False,\n", + " }\n", + " form = {\n", + " \"prompt\": policy_prompt,\n", + " \"num_frames\": 17,\n", + " \"fps\": 15,\n", + " \"size\": f\"{target_width}x{target_height}\",\n", + " \"num_inference_steps\": 30,\n", + " \"guidance_scale\": 1.0,\n", + " \"flow_shift\": 5.0,\n", + " \"seed\": 0,\n", + " \"extra_params\": json.dumps(extra_params),\n", + " }\n", + "\n", + " with policy_image_path.open(\"rb\") as image_file:\n", + " response = requests.post(\n", + " f\"{SGLANG_BASE_URL}/v1/videos\",\n", + " data={key: str(value) for key, value in form.items()},\n", + " files={\"input_reference\": (policy_image_path.name, image_file, \"image/png\")},\n", + " timeout=120,\n", + " )\n", + " if not response.ok:\n", + " (run_dir / \"error_response.txt\").write_text(response.text)\n", + " print(\"SGLang request 
failed:\", response.status_code)\n", + " print(response.text)\n", + " print(\"form:\", json.dumps(form, indent=2))\n", + " response.raise_for_status()\n", + "\n", + " initial = response.json()\n", + " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", + "\n", + " while True:\n", + " response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", + " response.raise_for_status()\n", + " final = response.json()\n", + " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", + " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + " raise RuntimeError(json.dumps(final, indent=2))\n", + " time.sleep(2)\n", + "\n", + " action = final.get(\"action\")\n", + " if not action or \"data\" not in action:\n", + " raise RuntimeError(f\"SGLang response did not include action data: {json.dumps(final, indent=2)}\")\n", + " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", + " sample_outputs = {\"outputs\": [{\"content\": {\"action\": action[\"data\"]}}]}\n", + " (run_dir / \"sample_outputs.json\").write_text(json.dumps(sample_outputs, indent=2))\n", + "\n", + " content_response = requests.get(f\"{SGLANG_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", + " content_response.raise_for_status()\n", + " video_path = run_dir / \"policy_rollout.mp4\"\n", + " if content_response.content:\n", + " video_path.write_bytes(content_response.content)\n", + " print(\"saved\", video_path)\n", + " else:\n", + " video_path = None\n", + " print(\"video content endpoint returned an empty body\")\n", + "\n", + " print(\"saved\", run_dir / \"action.json\")\n", + " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"), \"domain_id:\", action.get(\"domain_id\"))\n", + " return {\"initial\": initial, \"final\": final, 
\"run_dir\": run_dir, \"video_path\": video_path, \"action\": action}\n", + "\n", + "\n", + "check_sglang_server()\n", + "policy_video_result = submit_policy_video()" + ] + }, + { + "cell_type": "markdown", + "id": "policy-preview-md", + "metadata": {}, + "source": [ + "## Inspect Video API Outputs\n", + "\n", + "Preview the rollout video if the server returned one, and print the first few predicted action rows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "policy-preview-code", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "import imageio_ffmpeg\n", + "from IPython.display import Video, display\n", + "\n", + "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", + "\n", + "\n", + "def make_preview(src: Path, crf: int = 28) -> Path:\n", + " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", + " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", + " subprocess.run(\n", + " [\n", + " FFMPEG,\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(src),\n", + " \"-c:v\",\n", + " \"libx264\",\n", + " \"-crf\",\n", + " str(crf),\n", + " \"-preset\",\n", + " \"veryfast\",\n", + " \"-an\",\n", + " \"-pix_fmt\",\n", + " \"yuv420p\",\n", + " str(preview),\n", + " ],\n", + " check=True,\n", + " )\n", + " return preview\n", + "\n", + "\n", + "action = policy_video_result[\"action\"]\n", + "action_array = np.asarray(action[\"data\"], dtype=np.float32)\n", + "print(\"action array:\", action_array.shape, action_array.dtype)\n", + "print(action_array[: min(5, len(action_array))])\n", + "\n", + "video_path = policy_video_result.get(\"video_path\")\n", + "if video_path is not None:\n", + " preview = make_preview(video_path)\n", + " print(f\"preview: {preview}\")\n", + " display(Video(str(preview), embed=True))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b987850", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": 
{ + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb b/cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb new file mode 100644 index 00000000..e310a030 --- /dev/null +++ b/cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb @@ -0,0 +1,1133 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "license-header", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "id": "d88fe9a8", + "metadata": {}, + "source": [ + "# Cosmos3 Generator Audiovisual with SGLang\n", + "\n", + "This notebook calls already-running SGLang Cosmos3 servers with direct `curl` requests from Python.\n", + "\n", + "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint.\n" + ] + }, + { + "cell_type": "markdown", + "id": "49df4e61", + "metadata": {}, + "source": [ + "## 1. Prerequisites\n", + "\n", + "Use a running SGLang server and set endpoint environment variables before the setup cell if you are not using the local default. Text-to-image uses `/v1/images/generations`; video modes use `/v1/videos`.\n", + "\n", + "```bash\n", + "export COSMOS3_SGLANG_BASE_URL=http://localhost:30000\n", + "export COSMOS3_SGLANG_NANO_BASE_URL=http://localhost:30000\n", + "export COSMOS3_SGLANG_SUPER_BASE_URL=http://localhost:30000\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "26776c50", + "metadata": {}, + "source": [ + "## 2. Start the Server\n", + "\n", + "Run the SGLang server before running the request cells. 
Use the Docker image for every modality on this page. Mount any directory that contains local media or action files you want the server to read.\n", + "\n", + "### Docker Image: Cosmos3-Nano\n", + "\n", + "```bash\n", + "docker run --runtime nvidia --gpus all \\\n", + " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", + " -v \"$(pwd):/workspace\" \\\n", + " -p 30000:30000 \\\n", + " --ipc=host \\\n", + " lmsysorg/sglang:dev \\\n", + " sglang serve \\\n", + " --model-path nvidia/Cosmos3-Nano \\\n", + " --host 0.0.0.0\n", + "```\n", + "\n", + "### Docker Image: Cosmos3-Super\n", + "\n", + "`Cosmos3-Super` is the larger 64B model, so it usually needs more GPU memory than `Cosmos3-Nano`. `--tp-size` splits model weights across multiple GPUs and reduces per-GPU memory use. Use the `--performance-mode memory` preset to reduce memory use further.\n", + "\n", + "For example, on four GPUs:\n", + "\n", + "```bash\n", + "docker run --runtime nvidia --gpus all \\\n", + " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", + " -v \"$(pwd):/workspace\" \\\n", + " -p 30000:30000 \\\n", + " --ipc=host \\\n", + " lmsysorg/sglang:dev \\\n", + " sglang serve \\\n", + " --model-path nvidia/Cosmos3-Super \\\n", + " --host 0.0.0.0 \\\n", + " --num-gpus 4 \\\n", + " --performance-mode memory\n", + "```\n", + "\n", + "### CFG Parallel\n", + "\n", + "Use `--enable-cfg-parallel` with `--cfg-parallel-size 2` to run the positive and negative CFG branches in parallel on two GPUs:\n", + "\n", + "```bash\n", + "sglang serve \\\n", + " --model-path nvidia/Cosmos3-Nano \\\n", + " --host 0.0.0.0 \\\n", + " --num-gpus 2 \\\n", + " --enable-cfg-parallel \\\n", + " --cfg-parallel-size 2\n", + "```\n", + "\n", + "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n" + ] + }, + { + "cell_type": "markdown", + "id": "4412f2f9", + "metadata": {}, + "source": [ + "## 3. 
Configure Paths and Endpoints\n", + "\n", + "This setup cell only configures repo/output paths and SGLang endpoint settings.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23f04a90", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "\n", + "def find_repo_root(start: Path) -> Path:\n", + " for path in [start, *start.parents]:\n", + " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", + " return path\n", + " return start\n", + "\n", + "\n", + "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", + "COSMOS3_AUDIOVISUAL_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"generator\" / \"audiovisual\"\n", + "COSMOS3_AUDIOVISUAL_OUTPUT_ROOT = Path(\n", + " os.environ.get(\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\", COSMOS3_AUDIOVISUAL_ROOT / \"outputs\" / \"notebooks\")\n", + ").resolve()\n", + "DEFAULT_SGLANG_BASE_URL = os.environ.get(\"COSMOS3_SGLANG_BASE_URL\", \"http://localhost:30000\")\n", + "SGLANG_ENDPOINTS = {\n", + " \"Cosmos3-Nano\": os.environ.get(\"COSMOS3_SGLANG_NANO_BASE_URL\", DEFAULT_SGLANG_BASE_URL),\n", + " \"Cosmos3-Super\": os.environ.get(\"COSMOS3_SGLANG_SUPER_BASE_URL\", DEFAULT_SGLANG_BASE_URL),\n", + "}\n", + "\n", + "os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"] = str(COSMOS3_AUDIOVISUAL_OUTPUT_ROOT)\n", + "os.environ.setdefault(\"COSMOS3_SGLANG_API_KEY\", \"\")\n", + "\n", + "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", + "print(\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT:\", COSMOS3_AUDIOVISUAL_OUTPUT_ROOT)\n", + "for model, endpoint in SGLANG_ENDPOINTS.items():\n", + " print(f\"{model} endpoint: {endpoint}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "73369e7f", + "metadata": {}, + "source": [ + "## 4. 
Verify Endpoint Configuration\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c50e183", + "metadata": {}, + "outputs": [], + "source": [ + "from urllib.parse import urlparse\n", + "\n", + "for model, base_url in SGLANG_ENDPOINTS.items():\n", + " api_root = base_url.rstrip(\"/\")\n", + " if not api_root.endswith(\"/v1\"):\n", + " api_root = f\"{api_root}/v1\"\n", + " parsed = urlparse(api_root)\n", + " print(model)\n", + " print(\" api root:\", api_root)\n", + " print(\" images generations:\", f\"{api_root}/images/generations\")\n", + " print(\" videos async:\", f\"{api_root}/videos\")\n", + " print(\" scheme:\", parsed.scheme)\n", + " print(\" host:\", parsed.netloc)\n" + ] + }, + { + "cell_type": "markdown", + "id": "c1d161a6", + "metadata": {}, + "source": [ + "## 5. Preview Available Inputs\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "973ea472", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import json\n", + "from IPython.display import Image, display\n", + "\n", + "assets_dir = COSMOS3_AUDIOVISUAL_ROOT / \"assets\"\n", + "for prompt_dir in sorted((assets_dir / \"prompts\").iterdir()):\n", + " if not prompt_dir.is_dir():\n", + " continue\n", + " print(f\"{prompt_dir.relative_to(assets_dir)}:\")\n", + " for prompt_path in sorted(prompt_dir.glob(\"*.json\")):\n", + " data = json.loads(prompt_path.read_text())\n", + " caption = (\n", + " data.get(\"temporal_caption\")\n", + " or data.get(\"comprehensive_t2i_caption\")\n", + " or data.get(\"extra\", {}).get(\"prompt\", \"\")\n", + " )\n", + " print(f\" {prompt_path.name}: {caption[:180]}{'...' 
if len(caption) > 180 else ''}\")\n", + " print()\n", + "\n", + "for image_dir in sorted((assets_dir / \"images\").iterdir()):\n", + " if not image_dir.is_dir():\n", + " continue\n", + " print(f\"{image_dir.relative_to(assets_dir)}:\")\n", + " for image_path in sorted(image_dir.iterdir()):\n", + " if image_path.suffix.lower() in {\".jpg\", \".jpeg\", \".png\", \".webp\", \".bmp\"}:\n", + " print(f\" {image_path.name}\")\n", + " display(Image(filename=str(image_path), width=420))\n" + ] + }, + { + "cell_type": "markdown", + "id": "cfa34351", + "metadata": {}, + "source": [ + "## 6. Define Asset Sets, Payload Helpers, Request Helpers, and Viewer Helpers\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af5dd1c8", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "from pathlib import Path\n", + "from IPython.display import Image, display\n", + "\n", + "IMAGE_EXTENSIONS = {\".jpg\", \".jpeg\", \".png\", \".webp\", \".bmp\"}\n", + "\n", + "FIXED_SAMPLING = {\n", + " \"num_steps\": 35,\n", + " \"guidance\": 6.0,\n", + " \"shift\": 10.0,\n", + " \"fps\": 24,\n", + " \"num_frames\": 189,\n", + " \"resolution\": \"720\",\n", + " \"aspect_ratio\": \"16,9\",\n", + " \"seed\": 0,\n", + "}\n", + "\n", + "# All asset paths are repo-relative under cookbooks/cosmos3/generator/audiovisual.\n", + "# Model and sound choices live in this manifest; folders are organized only by modality.\n", + "ASSET_SETS = {\n", + " \"t2i\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"text2image\",\n", + " \"prompt\": \"assets/prompts/text2image/robot_draping.json\",\n", + " \"enable_sound\": False,\n", + " },\n", + " \"t2i_super\": {\n", + " \"model\": \"Cosmos3-Super\",\n", + " \"mode\": \"text2image\",\n", + " \"prompt\": \"assets/prompts/text2image/robot_draping.json\",\n", + " \"enable_sound\": False,\n", + " },\n", + " \"t2v_nano_noaudio\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"text2video\",\n", + " 
\"prompt\": \"assets/prompts/text2video/robot_kitchen.json\",\n", + " \"enable_sound\": False,\n", + " },\n", + " \"t2vs\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"text2video\",\n", + " \"prompt\": \"assets/prompts/text2video/robot_pouring_water_audio.json\",\n", + " \"enable_sound\": True,\n", + " },\n", + " \"i2v_nano_noaudio\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"image2video\",\n", + " \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", + " \"image\": \"assets/images/image2video/car_driving.jpg\",\n", + " \"enable_sound\": False,\n", + " },\n", + " \"i2vs\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"image2video\",\n", + " \"prompt\": \"assets/prompts/image2video/coastal_road_audio.json\",\n", + " \"image\": \"assets/images/image2video/coastal_road_audio.jpg\",\n", + " \"enable_sound\": True,\n", + " },\n", + " \"t2v_super_noaudio\": {\n", + " \"model\": \"Cosmos3-Super\",\n", + " \"mode\": \"text2video\",\n", + " \"prompt\": \"assets/prompts/text2video/robot_kitchen.json\",\n", + " \"enable_sound\": False,\n", + " },\n", + " \"i2v_super_noaudio\": {\n", + " \"model\": \"Cosmos3-Super\",\n", + " \"mode\": \"image2video\",\n", + " \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", + " \"image\": \"assets/images/image2video/car_driving.jpg\",\n", + " \"enable_sound\": False,\n", + " },\n", + "}\n", + "\n", + "\n", + "def asset_path(relative_path: str) -> Path:\n", + " path = COSMOS3_AUDIOVISUAL_ROOT / relative_path\n", + " if not path.exists():\n", + " raise FileNotFoundError(path)\n", + " return path.resolve()\n", + "\n", + "\n", + "def compact_json_file(path: Path) -> str:\n", + " return json.dumps(json.loads(path.read_text()), ensure_ascii=True, separators=(\",\", \":\"))\n", + "\n", + "\n", + "def normalize_negative_prompt(value) -> str:\n", + " if value is None:\n", + " return \"\"\n", + " if isinstance(value, str):\n", + " return value\n", + " if isinstance(value, dict):\n", 
+ " parts = []\n", + " for v in value.values():\n", + " if isinstance(v, str):\n", + " parts.append(v)\n", + " elif isinstance(v, list):\n", + " parts.extend(str(x) for x in v)\n", + " return \"\\n\".join(parts)\n", + " return str(value)\n", + "\n", + "\n", + "def payload_dimensions(payload: dict) -> tuple[int, int]:\n", + " if payload.get(\"resolution\") == \"720\" and payload.get(\"aspect_ratio\") == \"16,9\":\n", + " return 720, 1280\n", + " if payload.get(\"resolution\") == \"256\" and payload.get(\"aspect_ratio\") == \"16,9\":\n", + " return 192, 320\n", + " raise ValueError(f\"Unsupported payload resolution/aspect ratio: {payload.get('resolution')} {payload.get('aspect_ratio')}\")\n", + "\n", + "\n", + "def resolve_payload_path(payload_path: Path, value: str) -> Path:\n", + " path = Path(value)\n", + " if path.is_absolute():\n", + " return path\n", + " return (payload_path.parent / path).resolve()\n", + "\n", + "\n", + "def create_payload(use_case: str, *, backend: str) -> tuple[Path, Path, str]:\n", + " spec = ASSET_SETS[use_case]\n", + " payload_dir = Path(os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"]) / backend / \"payloads\" / use_case\n", + " output_dir = Path(os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"]) / backend / use_case\n", + " payload_dir.mkdir(parents=True, exist_ok=True)\n", + " output_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + " prompt_path = asset_path(spec[\"prompt\"])\n", + " negative_prompt = \"\"\n", + " if spec[\"mode\"] != \"text2image\":\n", + " negative_prompt_path = asset_path(f\"assets/negative_prompts/{spec['mode']}/neg_prompt.json\")\n", + " # Parse the file first; passing the Path itself would send the file path as the negative prompt.\n", + " negative_prompt = normalize_negative_prompt(json.loads(negative_prompt_path.read_text()))\n", + " payload_path = payload_dir / f\"{use_case}.json\"\n", + " payload = {\n", + " \"model_mode\": spec[\"mode\"],\n", + " \"name\": use_case,\n", + " \"prompt\": compact_json_file(prompt_path),\n", + " \"negative_prompt\": negative_prompt,\n", + " \"enable_sound\": spec[\"enable_sound\"],\n", + "
**FIXED_SAMPLING,\n", + " }\n", + " if spec[\"mode\"] == \"image2video\":\n", + " image_path = asset_path(spec[\"image\"])\n", + " payload[\"vision_path\"] = os.path.relpath(image_path, payload_path.parent)\n", + "\n", + " payload_path.write_text(json.dumps(payload, indent=2) + \"\\n\")\n", + "\n", + " os.environ[f\"COSMOS3_{backend.upper()}_{use_case.upper()}_INPUT\"] = str(payload_path)\n", + " os.environ[f\"COSMOS3_{backend.upper()}_{use_case.upper()}_OUTPUT\"] = str(output_dir)\n", + "\n", + " print(f\"model: {spec['model']}\")\n", + " print(f\"payload: {payload_path}\")\n", + " print(f\"output: {output_dir}\")\n", + " print(f\"prompt: {prompt_path.relative_to(COSMOS_ROOT)}\")\n", + " if \"vision_path\" in payload:\n", + " image_display_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", + " print(f\"image: {image_display_path.relative_to(COSMOS_ROOT)}\")\n", + " display(Image(filename=str(image_display_path), width=420))\n", + " print(json.dumps({k: payload[k] for k in [\"model_mode\", \"name\", \"enable_sound\", \"num_steps\", \"guidance\", \"shift\", \"fps\", \"num_frames\", \"resolution\", \"aspect_ratio\", \"seed\"]}, indent=2))\n", + " return payload_path, output_dir, spec[\"model\"]\n", + "\n", + "\n", + "import base64\n", + "import html\n", + "import json\n", + "import os\n", + "import subprocess\n", + "import time\n", + "from pathlib import Path\n", + "from IPython.display import HTML, display\n", + "\n", + "\n", + "def api_root_url(base_url: str) -> str:\n", + " normalized = base_url.rstrip(\"/\")\n", + " if not normalized.endswith(\"/v1\"):\n", + " normalized = f\"{normalized}/v1\"\n", + " return normalized\n", + "\n", + "\n", + "def video_api_url(base_url: str) -> str:\n", + " return f\"{api_root_url(base_url)}/videos\"\n", + "\n", + "\n", + "def image_api_url(base_url: str) -> str:\n", + " return f\"{api_root_url(base_url)}/images/generations\"\n", + "\n", + "def build_sglang_video_form(payload: dict) -> dict[str, str]:\n", + " 
height, width = payload_dimensions(payload)\n", + " extra_params = {\n", + " \"use_resolution_template\": False,\n", + " \"use_duration_template\": False,\n", + " \"guardrails\": True,\n", + " }\n", + " form = {\n", + " \"prompt\": payload[\"prompt\"],\n", + " \"negative_prompt\": payload[\"negative_prompt\"],\n", + " \"size\": f\"{width}x{height}\",\n", + " \"num_frames\": str(payload[\"num_frames\"]),\n", + " \"fps\": str(payload[\"fps\"]),\n", + " \"num_inference_steps\": str(payload[\"num_steps\"]),\n", + " \"guidance_scale\": str(payload[\"guidance\"]),\n", + " \"flow_shift\": str(payload[\"shift\"]),\n", + " \"seed\": str(payload[\"seed\"]),\n", + " \"extra_params\": json.dumps(extra_params, separators=(\",\", \":\")),\n", + " }\n", + " if payload[\"enable_sound\"]:\n", + " form[\"generate_sound\"] = \"true\"\n", + " form[\"sound_duration\"] = f\"{payload['num_frames'] / payload['fps']:.3f}\"\n", + " return form\n", + "\n", + "\n", + "def build_sglang_image_body(payload: dict) -> dict:\n", + " height, width = payload_dimensions(payload)\n", + " return {\n", + " \"prompt\": payload[\"prompt\"],\n", + " \"negative_prompt\": payload.get(\"negative_prompt\", \"\"),\n", + " \"size\": f\"{width}x{height}\",\n", + " \"n\": 1,\n", + " \"num_inference_steps\": payload[\"num_steps\"],\n", + " \"guidance_scale\": payload[\"guidance\"],\n", + " \"flow_shift\": payload[\"shift\"],\n", + " \"seed\": payload[\"seed\"],\n", + " \"response_format\": \"b64_json\",\n", + " \"extra_params\": {\n", + " \"use_resolution_template\": False,\n", + " \"guardrails\": True,\n", + " },\n", + " }\n", + "\n", + "\n", + "def post_video(*, payload_path: Path, payload: dict, output_path: Path, model: str) -> None:\n", + " url = video_api_url(SGLANG_ENDPOINTS[model])\n", + " api_key = os.environ.get(\"COSMOS3_SGLANG_API_KEY\") or None\n", + " tmp_path = Path(f\"{output_path}.tmp\")\n", + " error_path = Path(f\"{output_path}.error.txt\")\n", + " if tmp_path.exists():\n", + " 
tmp_path.unlink()\n", + " if error_path.exists():\n", + " error_path.unlink()\n", + "\n", + " cmd = [\n", + " \"curl\",\n", + " \"-sS\",\n", + " \"--fail-with-body\",\n", + " \"-X\",\n", + " \"POST\",\n", + " url,\n", + " \"-H\",\n", + " \"Accept: video/mp4\",\n", + " ]\n", + " if api_key is not None:\n", + " cmd += [\"-H\", f\"Authorization: Bearer {api_key}\"]\n", + "\n", + " for key, value in build_sglang_video_form(payload).items():\n", + " cmd += [\"--form-string\", f\"{key}={value}\"]\n", + "\n", + " if payload[\"model_mode\"] == \"image2video\":\n", + " image_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", + " cmd += [\"-F\", f\"input_reference=@{image_path}\"]\n", + "\n", + " if payload[\"model_mode\"] == \"video2video\":\n", + " video_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", + " cmd += [\"-F\", f\"video_reference=@{video_path};type=video/mp4\"]\n", + "\n", + " result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", + " if result.returncode != 0:\n", + " error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", + " raise RuntimeError(f\"SGLang request failed with exit code {result.returncode}; see {error_path}\")\n", + "\n", + " # The POST returns a job record; poll its status until the video is ready.\n", + " initial = json.loads(result.stdout)\n", + " video_id = initial[\"id\"]\n", + " # Send the same bearer token on status polls when an API key is configured.\n", + " auth_args = [\"-H\", f\"Authorization: Bearer {api_key}\"] if api_key is not None else []\n", + "\n", + " while True:\n", + " poll = subprocess.run(\n", + " [\"curl\", \"-sS\", \"--fail-with-body\", *auth_args, f\"{url}/{video_id}\"],\n", + " text=True,\n", + " stdout=subprocess.PIPE,\n", + " stderr=subprocess.PIPE,\n", + " )\n", + " if poll.returncode != 0:\n", + " error_path.write_text((poll.stdout or \"\") + (poll.stderr or \"\"))\n", + " raise RuntimeError(f\"SGLang video poll failed; see {error_path}\")\n", + "\n", + " final = json.loads(poll.stdout)\n", + " if final.get(\"status\") == \"completed\":\n", + " break\n", + " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", + "
error_path.write_text(json.dumps(final, indent=2))\n", + " raise RuntimeError(f\"SGLang video failed; see {error_path}\")\n", + " time.sleep(5)\n", + "\n", + " # Send the bearer token on the download request too when an API key is configured.\n", + " download_auth = [\"-H\", f\"Authorization: Bearer {api_key}\"] if api_key is not None else []\n", + " download = subprocess.run(\n", + " [\"curl\", \"-sS\", \"--fail-with-body\", *download_auth, f\"{url}/{video_id}/content\", \"-o\", str(tmp_path)],\n", + " text=True,\n", + " stdout=subprocess.PIPE,\n", + " stderr=subprocess.PIPE,\n", + " )\n", + " if download.returncode != 0:\n", + " error_path.write_text(json.dumps(final, indent=2) + \"\\n\" + (download.stderr or \"\"))\n", + " raise RuntimeError(f\"SGLang video download failed; see {error_path}\")\n", + "\n", + " tmp_path.replace(output_path)\n", + "\n", + "\n", + "def post_image(*, payload: dict, output_path: Path, model: str) -> None:\n", + " url = image_api_url(SGLANG_ENDPOINTS[model])\n", + " api_key = os.environ.get(\"COSMOS3_SGLANG_API_KEY\") or None\n", + " tmp_path = Path(f\"{output_path}.tmp\")\n", + " error_path = Path(f\"{output_path}.error.txt\")\n", + " if tmp_path.exists():\n", + " tmp_path.unlink()\n", + " if error_path.exists():\n", + " error_path.unlink()\n", + "\n", + " cmd = [\n", + " \"curl\",\n", + " \"-sS\",\n", + " \"--fail-with-body\",\n", + " \"-X\",\n", + " \"POST\",\n", + " url,\n", + " \"-H\",\n", + " \"Content-Type: application/json\",\n", + " ]\n", + " if api_key is not None:\n", + " cmd += [\"-H\", f\"Authorization: Bearer {api_key}\"]\n", + " cmd += [\"-d\", json.dumps(build_sglang_image_body(payload), separators=(\",\", \":\"))]\n", + "\n", + " result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", + " if result.returncode != 0:\n", + " error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", + " raise RuntimeError(f\"SGLang image request failed with exit code {result.returncode}; see {error_path}\")\n", + " try:\n", + " response = json.loads(result.stdout)\n", + " b64_json = response[\"data\"][0][\"b64_json\"]\n", + " tmp_path.write_bytes(base64.b64decode(b64_json))\n", + " except 
Exception as exc:\n", + " error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", + " raise RuntimeError(f\"Could not decode SGLang image response; see {error_path}\") from exc\n", + " tmp_path.replace(output_path)\n", + "\n", + "\n", + "def run_sglang_payload(payload_path: Path, output_dir: str | Path, *, model: str) -> Path:\n", + " payload_path = Path(payload_path)\n", + " output_dir = Path(output_dir)\n", + " output_dir.mkdir(parents=True, exist_ok=True)\n", + " payload = json.loads(payload_path.read_text())\n", + " output_ext = \".png\" if payload[\"model_mode\"] == \"text2image\" else \".mp4\"\n", + " output_path = output_dir / f\"{payload['name']}{output_ext}\"\n", + " endpoint = image_api_url(SGLANG_ENDPOINTS[model]) if payload[\"model_mode\"] == \"text2image\" else video_api_url(SGLANG_ENDPOINTS[model])\n", + " print(\"endpoint:\", endpoint)\n", + " print(\"payload:\", payload_path)\n", + " print(\"output:\", output_path)\n", + " if payload[\"model_mode\"] == \"image2video\":\n", + " print(\"input image:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n", + " t0 = time.time()\n", + " if payload[\"model_mode\"] == \"text2image\":\n", + " post_image(payload=payload, output_path=output_path, model=model)\n", + " else:\n", + " post_video(payload_path=payload_path, payload=payload, output_path=output_path, model=model)\n", + " print(f\"wrote {output_path} in {time.time() - t0:.1f}s\")\n", + " return output_path\n", + "\n", + "\n", + "def display_video(path: Path, *, width: int = 720) -> None:\n", + " data = base64.b64encode(path.read_bytes()).decode(\"ascii\")\n", + " label = html.escape(str(path))\n", + " markup = f\"\"\"\n", + "\n", + "
<video width=\"{width}\" controls src=\"data:video/mp4;base64,{data}\"></video><div>{label}</div>
\n", + "\"\"\"\n", + " display(HTML(markup))\n", + "\n", + "\n", + "def view_run(output_dir: str | Path) -> None:\n", + " output_dir = Path(output_dir)\n", + " videos = [\n", + " path\n", + " for path in sorted(output_dir.rglob(\"*.mp4\"))\n", + " if not path.name.endswith((\"_preview.mp4\", \"_browser.mp4\"))\n", + " ]\n", + " images = sorted(output_dir.rglob(\"*.png\"))\n", + " if not videos and not images:\n", + " print(f\"No generated media found under {output_dir}\")\n", + " return\n", + " for src in videos:\n", + " print(f\"source: {src} ({src.stat().st_size // 1024} KB)\")\n", + " display_video(src)\n", + " for src in images:\n", + " print(f\"source: {src} ({src.stat().st_size // 1024} KB)\")\n", + " display(Image(filename=str(src), width=720))\n" + ] + }, + { + "cell_type": "markdown", + "id": "9ffaab0e", + "metadata": {}, + "source": [ + "Run each use case top-to-bottom: create the JSON payload, run inference, then view the generated media. The examples are grouped by model size below. The Cosmos3-Nano and Cosmos3-Super sections are independent, so you can run just one.\n" + ] + }, + { + "cell_type": "markdown", + "id": "e7d284db", + "metadata": {}, + "source": [ + "# Cosmos3-Nano Examples\n", + "\n", + "Use cases for the `Cosmos3-Nano` model. 
This section is self-contained; you can run it without the Cosmos3-Super section below.\n" + ] + }, + { + "cell_type": "markdown", + "id": "27bf6dce", + "metadata": {}, + "source": [ + "## Nano: Text to Image\n", + "\n", + "Nano text-to-image generation using a structured JSON prompt.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "383f6351", + "metadata": {}, + "outputs": [], + "source": [ + "t2i_payload, t2i_output, t2i_model = create_payload(\"t2i\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d3c59500", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97b4b312", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(t2i_payload, t2i_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "c027b956", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fe6e3ed", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(t2i_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "0e7d3092", + "metadata": {}, + "source": [ + "## Nano: Text to Video Without Audio\n", + "\n", + "Nano text-to-video generation with audio disabled.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2e85424", + "metadata": {}, + "outputs": [], + "source": [ + "t2v_nano_noaudio_payload, t2v_nano_noaudio_output, t2v_nano_noaudio_model = create_payload(\"t2v_nano_noaudio\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "e4fa0cdd", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9db2a1a5", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(t2v_nano_noaudio_payload, t2v_nano_noaudio_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": 
"markdown", + "id": "1bc8a331", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87917755", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(t2v_nano_noaudio_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "816e9b09", + "metadata": {}, + "source": [ + "## Nano: Text to Video with Audio\n", + "\n", + "Nano text-to-video generation with generated audio.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8450feb", + "metadata": {}, + "outputs": [], + "source": [ + "t2vs_payload, t2vs_output, t2vs_model = create_payload(\"t2vs\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b6bd6ec8", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "685d18d6", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(t2vs_payload, t2vs_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "93743159", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4bc049a", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(t2vs_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "1cf05248", + "metadata": {}, + "source": [ + "## Nano: Image to Video Without Audio\n", + "\n", + "Nano image-to-video generation using its paired image asset, with audio disabled.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91e28e16", + "metadata": {}, + "outputs": [], + "source": [ + "i2v_nano_noaudio_payload, i2v_nano_noaudio_output, i2v_nano_noaudio_model = create_payload(\"i2v_nano_noaudio\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "ef42f84d", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "4ec848e8", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(i2v_nano_noaudio_payload, i2v_nano_noaudio_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "dea78891", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae7a16ed", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(i2v_nano_noaudio_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "1860e911", + "metadata": {}, + "source": [ + "## Nano: Image to Video with Audio\n", + "\n", + "Nano image-to-video generation using its paired image asset and generated audio.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37c40d8e", + "metadata": {}, + "outputs": [], + "source": [ + "i2vs_payload, i2vs_output, i2vs_model = create_payload(\"i2vs\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "637e5e87", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b4a4c308", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(i2vs_payload, i2vs_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "62994915", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58a28e11", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(i2vs_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fe62931e", + "metadata": {}, + "source": [ + "# Cosmos3-Super Examples\n", + "\n", + "The same use cases for the larger `Cosmos3-Super` model. 
This section is self-contained; you can run it without the Cosmos3-Nano section above.\n" + ] + }, + { + "cell_type": "markdown", + "id": "466c6bac", + "metadata": {}, + "source": [ + "## Super: Text to Image\n", + "\n", + "Super text-to-image generation using the same structured JSON prompt.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "575653c1", + "metadata": {}, + "outputs": [], + "source": [ + "t2i_super_payload, t2i_super_output, t2i_super_model = create_payload(\"t2i_super\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "42776c8b", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "824783c8", + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(t2i_super_payload, t2i_super_output, model=\"Cosmos3-Super\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "1ebf0b15", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0fba2bb", + "metadata": {}, + "outputs": [], + "source": [ + "view_run(t2i_super_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "f608eaa7", + "metadata": {}, + "source": [ + "## Super: Text to Video Without Audio\n", + "\n", + "Super text-to-video generation with audio disabled.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t2v_super_noaudio_payload, t2v_super_noaudio_output, t2v_super_noaudio_model = create_payload(\"t2v_super_noaudio\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(t2v_super_noaudio_payload, t2v_super_noaudio_output, model=\"Cosmos3-Super\")\n" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_run(t2v_super_noaudio_output)\n" + ] + }, + { + "cell_type": "markdown", + "id": "367f4161", + "metadata": {}, + "source": [ + "## Super: Image to Video Without Audio\n", + "\n", + "Super image-to-video generation using its paired image asset, with audio disabled.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "i2v_super_noaudio_payload, i2v_super_noaudio_output, i2v_super_noaudio_model = create_payload(\"i2v_super_noaudio\", backend=\"sglang\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_sglang_payload(i2v_super_noaudio_payload, i2v_super_noaudio_output, model=\"Cosmos3-Super\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_run(i2v_super_noaudio_output)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}