Conversation
**LaTeX Rendering Fixes:**
- Add KaTeX CDN fallback to ensure CSS loads on GitHub Pages
- Configure rehype-katex with error-tolerant settings (strict: false, throwOnError: false)
- Reorder CSS imports (KaTeX before globals) to allow overrides
- Add explicit .katex-display styling for proper spacing and centering

**Typography & Readability Improvements:**
- Increase base font size to 18px with line-height 1.75
- Enhance prose configuration with better spacing and hierarchy
- Improve heading styles with proper weights, spacing, and borders
- Remove backticks from inline code styling for cleaner appearance

**Layout Enhancements:**
- Add subtle background gradient (slate-50 to slate-100)
- Increase article max-width from 4xl to 5xl for a wider reading area
- Add white background with shadow to article for visual depth
- Improve sidebar design with gradient background
- Enhance responsive padding and spacing

**Content Styling:**
- Professional dark-themed code blocks with syntax highlighting support
- Enhanced blockquotes with gradient backgrounds and border accents
- Better list spacing and visual hierarchy
- Improved table, link, image, and horizontal rule styling
- Clear visual separation between content sections

**Audit Header Redesign:**
- Larger, bolder title (text-5xl, extrabold) for better hierarchy
- Enhanced tag styling with borders and improved colors
- Add author section with icon
- Add bottom border to separate header from content
- Improve banner designs for review/staging modes with icons

This addresses double-rendering issues on GitHub Pages and significantly improves the readability and visual appeal of audit pages.

Fixes: LaTeX rendering twice (raw + formatted)
Improves: Overall page aesthetics, typography, spacing, and user experience
Pull request overview
Adds a new Manipulation technical audit (staging) and updates MDX rendering/styling (including KaTeX + prose typography) to improve audit readability, alongside a small staging deploy workflow tweak.
Changes:
- Add new staging audit MDX content for “Manipulation: A technical audit”.
- Update prose/typography styling and KaTeX display handling (Tailwind typography + global CSS).
- Adjust audit page rendering (banner/header styling, KaTeX rehype options) and staging deploy host.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Summary per file:

| File | Description |
|---|---|
| tailwind.config.ts | Extends typography defaults for prose content (links, headings, code, KaTeX display spacing). |
| content/textbook/audits/staging/jdvakil_yi_shiuan_tung.mdx | Adds the new manipulation audit content (with math/KaTeX). |
| components/KatexStyles.tsx | Introduces a client-side KaTeX CSS injector via CDN. |
| app/textbook/audits/[...slug]/page.tsx | Updates audit page layout/styling and configures KaTeX rendering options. |
| app/layout.tsx | Reorders KaTeX CSS import ahead of globals.css. |
| app/globals.css | Adds global prose + KaTeX styling rules. |
| .github/workflows/deploy-staging.yml | Updates staging deploy SSH host. |
| .continueignore | Adds ignore patterns for Continue tooling. |
| 1. Compute quantiles: $q_{0.01}, q_{0.99}$ | ||
| 2. Uniformly divide: $\Delta_d = \frac{q_{0.99} - q_{0.01}}{256}$ | ||
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q*{0.01}}{\Delta_d} \right\rfloor$ |
LaTeX typo in the binning formula: q*{0.01} looks like it should be q_{0.01} (matching the quantile notation used just above). As written, KaTeX will render an unexpected q* term.
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q*{0.01}}{\Delta_d} \right\rfloor$ | |
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q_{0.01}}{\Delta_d} \right\rfloor$ |
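The corrected three-step binning scheme can be sketched in TypeScript. This is an illustrative reconstruction only; `discretizeAction` and its signature are hypothetical, not taken from any of the papers' code:

```typescript
// Step 1 (computing q_{0.01} and q_{0.99}) is assumed done upstream.
function discretizeAction(a: number, q01: number, q99: number, bins = 256): number {
  const delta = (q99 - q01) / bins;            // step 2: uniform bin width
  const bin = Math.floor((a - q01) / delta);   // step 3: map action to a bin index
  return Math.min(bins - 1, Math.max(0, bin)); // clamp values outside the 1%-99% range
}
```

The clamp handles the outliers that quantile clipping deliberately excludes.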
| <AuditLayout | ||
| chapters={chapters} | ||
| isReviewMode={isReviewMode} | ||
| prNumber={prNumber} | ||
| > |
AuditLayout is being called with isReviewMode and prNumber props, but components/audit/AuditLayout.tsx currently defines AuditLayoutProps with only children and chapters. This will fail type-checking/build unless the layout component (and its props interface) is updated to accept these new props, or the extra props are removed here.
| prNumber={prNumber} | ||
| > | ||
| <KatexStyles /> | ||
| <Link href="/textbook/audits" className="text-sm text-blue-600 hover:text-blue-800 mb-8 inline-block"> |
<KatexStyles /> injects a CDN stylesheet at runtime, but KaTeX CSS is already imported globally in app/layout.tsx. Since Next's bundled CSS won’t match link[href*="katex"], this effect will still append an extra external stylesheet, adding an unnecessary network request and potentially overriding local CSS unexpectedly. Consider removing KatexStyles (preferred) and relying on the bundled KaTeX CSS, or make stylesheet loading consistent in a single place.
| [rehypeKatex, { | ||
| strict: false, // Don't fail on unknown LaTeX commands | ||
| trust: true, // Allow some advanced LaTeX features | ||
| throwOnError: false, // Gracefully handle errors | ||
| }] |
rehype-katex is configured with trust: true. For KaTeX this enables “trusted” commands (e.g., HTML/URL-related macros) and can open the door to XSS if any MDX content is ever untrusted (student submissions / PR content). Unless you have a strong reason + additional sanitization, keep trust disabled (default) and allow only the subset of macros you need.
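If some trusted command is genuinely needed, KaTeX's `trust` option also accepts a predicate, which allows an explicit allowlist instead of blanket trust. A sketch (the allowed command here is a placeholder, not something this audit content necessarily uses):

```typescript
const katexOptions = {
  strict: false,
  throwOnError: false,
  // Allowlist instead of `trust: true`; `context.command` is the macro under
  // evaluation. '\\href' is only an example -- list what the content requires.
  trust: (context: { command: string }) => ['\\href'].includes(context.command),
};
```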
| import { useEffect } from 'react'; | ||
| export function KatexStyles() { | ||
| useEffect(() => { | ||
| // Ensure KaTeX CSS is loaded | ||
| const link = document.createElement('link'); | ||
| link.rel = 'stylesheet'; | ||
| link.href = 'https://cdn.jsdelivr.net/npm/katex@0.16.27/dist/katex.min.css'; | ||
| link.integrity = 'sha384-yp+jpRNKIa0xGrYaVtwImDXkFq7ZOCV5kJZVDg/uAFfYPmtFcKr0sxhVJy1HqnWD'; | ||
| link.crossOrigin = 'anonymous'; | ||
| // Check if already loaded | ||
| const existing = document.querySelector('link[href*="katex"]'); | ||
| if (!existing) { | ||
| document.head.appendChild(link); | ||
| } | ||
| }, []); | ||
This component injects KaTeX CSS from a third-party CDN at runtime. The app already imports katex/dist/katex.min.css in app/layout.tsx, and the current existing check won’t detect Next’s bundled CSS link, so you’ll likely end up loading KaTeX twice. Prefer a single, build-time import (or a single <link> in the document head) to avoid extra requests and reliance on an external CDN.
| import { useEffect } from 'react'; | |
| export function KatexStyles() { | |
| useEffect(() => { | |
| // Ensure KaTeX CSS is loaded | |
| const link = document.createElement('link'); | |
| link.rel = 'stylesheet'; | |
| link.href = 'https://cdn.jsdelivr.net/npm/katex@0.16.27/dist/katex.min.css'; | |
| link.integrity = 'sha384-yp+jpRNKIa0xGrYaVtwImDXkFq7ZOCV5kJZVDg/uAFfYPmtFcKr0sxhVJy1HqnWD'; | |
| link.crossOrigin = 'anonymous'; | |
| // Check if already loaded | |
| const existing = document.querySelector('link[href*="katex"]'); | |
| if (!existing) { | |
| document.head.appendChild(link); | |
| } | |
| }, []); | |
| export function KatexStyles() { | |
| // KaTeX styles are loaded via a build-time import in app/layout.tsx. | |
| // This component is kept for compatibility but no longer injects CSS at runtime. |
| /* Prevent double-rendering by hiding any raw LaTeX that might leak through */ | ||
| .katex-html { | ||
| /* KaTeX should render, not raw LaTeX */ |
.katex-html has an empty rule block, but the comment suggests it prevents “double-rendering” / hides raw LaTeX. As written it has no effect and is confusing for future maintainers. Either remove this selector/comment or add the actual styles needed to address the issue you observed.
| /* KaTeX should render, not raw LaTeX */ | |
| /* KaTeX should render, not raw LaTeX, so hide the underlying HTML/LaTeX layer */ | |
| display: none; |
| // Ensure math displays properly | ||
| '.katex-display': { | ||
| margin: '1.5rem 0', | ||
| padding: '1.5rem 0', | ||
| }, |
.katex-display spacing is defined here inside the typography theme, but very similar .katex-display styling is also added globally in app/globals.css. Keeping the same styles in two places is easy to let drift and makes it unclear which source of truth to edit. Consider consolidating KaTeX display styling into either the typography config or globals (not both).
| // Ensure math displays properly | |
| '.katex-display': { | |
| margin: '1.5rem 0', | |
| padding: '1.5rem 0', | |
| }, |
| ## Summary | ||
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specificiation. |
Typo in the summary: “task specificiation” should be “task specification”.
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specificiation. | |
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specification. |
| @@ -0,0 +1,245 @@ | |||
It is missing references.
| * **Dexterous manipulation** involves in-hand repositioning of objects using multi-fingered hands. This requires coordinated control of many degrees of freedom with continuous contact state estimation. $\pi_0$ and $\pi_{0.5}$ have demonstrated progress in this area, though the task remains unsolved in general. | ||
| * **Contact-rich manipulation** involves tasks where the robot must make and maintain complex contact with the environment (e.g., insertion, assembly, screwing, polishing, wiping). These tasks require force modulation and compliance control, which are currently absent from the action spaces of standard VLAs. | ||
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). |
I don't see where Moravec's paradox is introduced. Is this something that's helpful for the reader to understand/is it part of the core analysis of manipulation?
Moravec's paradox is essentially that higher-level reasoning is relatively less computationally intensive than lower-level sensorimotor skill. This ties into vision and reasoning having far more resources (data) than sensorimotor interaction.
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). | ||
| ## 2. Embodiment Gap and Data Scaling | ||
| The primary barrier to generalized manipulation is the Embodiment Gap. LLMs are trained on passive, internet-scale datasets exceeding $15 \times 10^{12}$ tokens, and vision models on pixel datasets exceeding $10 \times 10^9$ image-text pairs. Conversely, robot action data must be physically generated through methods like kinesthetic teaching or teleoperation, yielding dataset sizes closer to $2 \times 10^6$ trajectories. |
There is also synthetic data generation through simulation for robotics.
| $$ | ||
| They overwrite the 256 least-used tokens in the LLaMA vocabulary with action tokens. | ||
Discretizing actions into bins makes training easier, but I’m wondering how much precision is lost for fine manipulation tasks.
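A back-of-envelope check of that precision concern: with uniform bins, the worst-case rounding error is half a bin width. Sketch with hypothetical numbers (not reported in the papers):

```typescript
// Worst-case quantization error of uniform binning is delta / 2.
function maxQuantizationError(q01: number, q99: number, bins = 256): number {
  return (q99 - q01) / bins / 2;
}
// A 1 m action range (after 1%/99% clipping) at 256 bins leaves ~2 mm of
// irreducible error per dimension -- nontrivial for insertion-style tasks.
```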
| ## Abstract | ||
| Robotic manipulation in unstructured environments remains an open problem in embodied AI. Despite progress in vision-language foundation models, translating internet-scale semantic understanding into contact-rich motor control is challenging. This review examines the current landscape of Vision-Language-Action (VLA) models for manipulation through the lens of the data bottleneck. While language models train on trillions of tokens and vision-language models consume billions of image-text pairs, the robot manipulation community has produced roughly two million trajectories, fragmented across embodiments, labs, and task distributions. The field utilizes three competing strategies to address this limitation: (1) transferring internet-scale pretraining into action spaces, (2) multiplying limited real-world data through augmentation and cross-embodiment pooling, and (3) redesigning the action representation itself. This review traces the architectural lineage from RT-1 through $\pi_{0.5}$, formalizes the mathematical trade-offs at each model's critical interfaces, and identifies the limitations of each strategy. | ||
| **Index terms:** Robotic manipulation, Foundational models, Vision-Language-Action (VLA) Model |
| * **Prehensile manipulation** involves grasping and transporting objects. This is the primary focus of current VLA research (e.g., pick-and-place, bin picking, tabletop rearrangement). The key challenge is grasp planning: selecting a grasp pose that is kinematically reachable, stable under the object's mass distribution, and achievable with the robot's gripper geometry. | ||
| * **Non-prehensile manipulation** involves moving objects without grasping them (e.g., pushing, sliding, tilting, or toppling). The physics are dominated by friction and inertia, and control is typically open-loop or quasi-static. | ||
| * **Dexterous manipulation** involves in-hand repositioning of objects using multi-fingered hands. This requires coordinated control of many degrees of freedom with continuous contact state estimation. $\pi_0$ and $\pi_{0.5}$ have demonstrated progress in this area, though the task remains unsolved in general. | ||
| * **Contact-rich manipulation** involves tasks where the robot must make and maintain complex contact with the environment (e.g., insertion, assembly, screwing, polishing, wiping). These tasks require force modulation and compliance control, which are currently absent from the action spaces of standard VLAs. |
All of the examples you gave involve a form of prehensile manipulation, making it seem that this is a subset of that task (somehow involving "complex contact," which isn't defined).
What might help is a taxonomy, or some more strict definitions of these other tasks (dextrous, contact-rich) that set them apart or are key differentiators from the fairly intuitive explanations of prehensile and non-prehensile manipulation.
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). | ||
| ## 2. Embodiment Gap and Data Scaling | ||
| The primary barrier to generalized manipulation is the Embodiment Gap. LLMs are trained on passive, internet-scale datasets exceeding $15 \times 10^{12}$ tokens, and vision models on pixel datasets exceeding $10 \times 10^9$ image-text pairs. Conversely, robot action data must be physically generated through methods like kinesthetic teaching or teleoperation, yielding dataset sizes closer to $2 \times 10^6$ trajectories. |
And yet, some leading roboticists are not convinced of this. This is a conversation to watch.
| Two models were trained on this aggregated data: RT-1-X (the RT-1 architecture trained on the pooled dataset) and RT-2-X (RT-2 fine-tuned on the pooled dataset). Positive transfer exists across embodiments; both models outperformed their single-embodiment counterparts, with RT-2-X showing a 3x improvement on emergent skill evaluations compared to RT-2 trained on Google Robot data alone. | ||
| The pooled training does not explicitly model embodiment differences; there is no embodiment embedding or dynamics adapter. The model must implicitly learn to factor its representations into embodiment-invariant (task semantics, object properties) and embodiment-specific (workspace geometry, joint limits, gripper type) components. |
OK, so we all believe the "embodiment gap" exists and we need lots of data to fill it.
So, the solution is to: ignore the embodiment differences and fill the gap with aggregated data across morphologies?
Is this brilliance or stupidity?
| **Taxonomy Positioning:** | ||
| * **Policy type:** Generalist robot policy (vision + optional language/goal) | ||
| * **Training paradigm:** Large-scale imitation learning (OXE mixture) |
What's the breakdown of the data used for the OXE mixture? Why not use the whole dataset? I'm interested in the breakdown of data chosen, and what kind of data diversity they have.
| * Observation tokens at time $t$ attend only to task tokens and observation tokens up to $t$ ($T_{o, 0:t}$) plus language instructions ($T_{\ell}$). | ||
| * Missing modalities are fully masked. | ||
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. |
This is fairly unique to the Octo paper as far as I know. I think it deserves a little more description of how this enables the adding or removing of components. Using one of the figures from the paper to illustrate this might help.
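One way to make the read-only pathway concrete is the block attention mask it implies. A toy sketch (my own illustration, not Octo's actual code):

```typescript
type TokenKind = 'task' | 'obs' | 'readout';

// true = query may attend to key. Readout tokens read everything, but nothing
// attends back to them, so heads hanging off readout tokens can be added or
// removed without disturbing the task/observation representations.
function mayAttend(query: TokenKind, key: TokenKind): boolean {
  if (key === 'readout') return query === 'readout';
  return true; // causal masking over time omitted for brevity
}
```

Because task and observation tokens never see readout tokens, swapping an action head only changes what is decoded *from* the readout stream, not the backbone's computation.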
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. | ||
| #### 3.3.3 Diffusion Action Head | ||
| Octo predicts actions using a conditional diffusion decoder. It performs one transformer forward pass per action, then runs multi-step denoising inside the diffusion head. They train with the standard DDPM objective (adding Gaussian noise to dataset actions and training $\epsilon_\theta$ to reconstruct the original action). |
And what are they conditioning on? Is it multi-modal data? A single RGB frame?
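For reference, the standard DDPM objective being alluded to has this shape, with a generic conditioning variable $e$ standing in for whatever the head is conditioned on (notation is mine, not transcribed from the Octo paper):

```latex
\mathcal{L}_{\mathrm{DDPM}}
  = \mathbb{E}_{a,\;\epsilon \sim \mathcal{N}(0, I),\;k}
    \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_k}\,a
      + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\; k,\; e\right) \right\|^2
```

so the reviewer's question amounts to asking what goes into $e$: a readout embedding summarizing the multi-modal context, a single frame's features, or something else.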
| * Runs at ~6 Hz on RTX 4090 (15GB bfloat16). | ||
| ### 4.2 Problem Domain & Taxonomy | ||
| OpenVLA operates in generalist robotic manipulation across multiple embodiments, diverse scenes, multi-task environments, and end-effector control. The OpenX-Embodiment dataset is filtered to single-arm setups with at least one third-person camera. |
I'm curious if the third person camera view has to be a specific view, or if any angle/orientation of the camera helps. For example, does the third person view have to be directly in front of the robot? Can it be a side angle view and still provide value? I'm not sure if they report any of that information in the paper but it would be interesting to know about the data set.
| * Observation tokens at time $t$ attend only to task tokens and observation tokens up to $t$ ($T_{o, 0:t}$) plus language instructions ($T_{\ell}$). | ||
| * Missing modalities are fully masked. | ||
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. |
Still a little unsure on significance of adding readout tokens, and what they do exactly. Could help elaborate a little? I.e. more info on why add this on top of traditional action/observation/task tokens?
Edit: I just didn't read into it enough. I think I have a better understanding now - this part could probably use a little emphasis, it seems like a pretty important trick within Octo to make the model more generalist.
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. | ||
| ## 6. Limitations and Future Outlook | ||
| The trajectory from behavior cloning to flow-matching VLAs demonstrates progress in closing the embodiment gap. However, physical robotics cannot strictly replicate the internet-scale passive scraping of LLMs. The field is approaching an asymptote on raw physical data collection. Future methods will likely depend on sample-efficient architectures capable of implicit physics understanding. No newline at end of file |
I think this audit is missing a larger discussion on limitations / load-bearing walls. You could even add a section after each paper to discuss the approach-specific limitations.
We understand from this audit that there is an embodiment gap but how close does each approach get to closing this gap?
| ### 5.4 Scaling and Experiments | ||
| * **Data scaling:** Mobile manipulation consists of ~400 hours in ~100 homes, yet 97.6% of Phase 1 examples are from other domains, including web-scale captioning and VQA. | ||
| * **Training scaling:** 280k gradient steps in pre-training, followed by 80k post-training steps for flow-matching. | ||
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. |
What are some examples of the tasks that require broad semantic reasoning where performance declined? Is there any theme amongst those tasks that could be a clue to what's missing in the data mixture for training?
| * **Quantization:** 8-bit quantization increases inference latency, dropping control frequency to 1.2 Hz on an A5000. 4-bit quantization reduces memory usage and yields higher throughput, achieving ~3 Hz control frequency on an A5000 GPU with rollout performance comparable to bfloat16. | ||
| ### 4.5 Experiments | ||
| OpenVLA (7B) outperforms RT-2-X (55B parameters) and Diffusion Policy baselines across 29 tasks on BridgeData V2. LoRA-based parameter-efficient fine-tuning achieves performance close to full fine-tuning with lower memory and compute costs. |
When you say 'diffusion policy baselines', do you mean JUST diffusion, or do the authors also compare to Octo which uses a diffusion action head?
| * **Web Data (WD):** Improves semantic reasoning and object grounding, extended with bounding box annotations. | ||
| **Stage 2: Post-training with an action expert for flow matching** | ||
| Post-training adds a separate 300M parameter action expert (on top of the 2B PaliGemma backbone) that predicts continuous action chunks via flow matching. |
Definitely would help to have a brief description of flow matching so that we know something about it before looking at it's loss in the following section. Also, because it's different than the prior methods, it would be helpful to have a compare and contrast on these action output methods (diffusion vs. flow matching) and when/why to use one or the other.
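For context, flow matching under one common convention (rectified flow; the exact parameterization in the paper may differ) interpolates noisy actions linearly between Gaussian noise and the data, and regresses the constant velocity between them:

```latex
x^{\tau} = \tau\,a + (1-\tau)\,\epsilon, \qquad
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{a,\;\epsilon \sim \mathcal{N}(0,I),\;\tau}
    \left\| v_\theta(x^{\tau}, \tau, o) - (a - \epsilon) \right\|^2
```

At inference, a few Euler steps integrate from $\tau = 0$ (noise) to $\tau = 1$ (action chunk), typically far fewer than the denoising steps a DDPM head needs; that speed/precision trade-off is one axis for the compare-and-contrast requested here.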
crheckman
left a comment
mostly gushing, some requests for changes
| #### 3.1.1 Key Results | ||
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. |
I looked into the remainder of your document. There are no details about how the fine-tuning was executed. Example: What is the data mix with Internet-scale and robot data? Corollary: Can the model complete user: what is the capital of France? with agent: Paris or agent: a_121, a_214, ...?
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. | ||
| * **Accessibility:** Pretraining (Octo-Base / ViT-B-sized backbone) takes 300k steps at batch size 2048 on a TPU v4-128 pod (~14 hours); finetuning on a single 24GB NVIDIA A5000 takes ~5 hours. | ||
| * **Tokenization:** Language instructions are embedded via a pre-trained T5-base model (111M parameters), and images (wrist and 3rd-person cameras) are processed through a shallow CNN into patches. |
I am very confused by the two above points. Accessibility says they train on Octo-Base for 5 hours. But the tokenization says they link this up with a 111M parameter model and process images through a shallow CNN. These two are not compatible architectures without some significant scaffolding. Your later section on the architecture is basically a rehash of these two bullets and does not help grant understanding of what is being done, leaving aside the "why" which is even more important.
| #### 3.1.1 Key Results | ||
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. | ||
| * **Accessibility:** Pretraining (Octo-Base / ViT-B-sized backbone) takes 300k steps at batch size 2048 on a TPU v4-128 pod (~14 hours); finetuning on a single 24GB NVIDIA A5000 takes ~5 hours. |
What was the pretraining mix? What is the fine-tuning data? Why did they split this up? Why did they need to pretrain rather than starting with a ViT-B for image tokenization?
| OpenVLA fine-tunes a pretrained VLM composed of: | ||
| * **Visual encoder (600M params):** DINOv2 for geometric and spatial features and SigLIP for semantic alignment features. Given image patches $x$: | ||
| $$ |
This architecture makes a lot of sense. Good work
| The authors finetuned the vision encoder during VLA training to capture spatial details for precise robotic control. | ||
| ### 4.4 Scaling and Efficiency | ||
| * **Data and Compute Scaling:** The final OpenVLA model is trained on 970k episodes using 64 A100 GPUs for 14 days (21,500 A100-hours) with a batch size of 2048. |
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. | ||
| ## 6. Limitations and Future Outlook | ||
| The trajectory from behavior cloning to flow-matching VLAs demonstrates progress in closing the embodiment gap. However, physical robotics cannot strictly replicate the internet-scale passive scraping of LLMs. The field is approaching an asymptote on raw physical data collection. Future methods will likely depend on sample-efficient architectures capable of implicit physics understanding. No newline at end of file |
What are the primary modes of failure? Do you think the collection of raw physical data collection will solve all the problems of robotic manipulation with VLAs?
| The authors finetuned the vision encoder during VLA training to capture spatial details for precise robotic control. | ||
| ### 4.4 Scaling and Efficiency |
IMO, this would be a good place to be more detailed and more opinionated. What design decisions contributed to these scaling behaviors? Do you agree with them?
For instance - OpenVLA uses single-image observations; there's no observation history. (IIRC, they note this as a limitation in the paper.) Why do you think they made this choice? How does it affect the model scaling and performance?
| #### 5.3.2 Two-Stage Training: Discrete-Token Pretraining and Flow-Matching Post-Training | ||
| **Stage 1: Pre-training with discrete tokens (FAST)** | ||
| During pretraining, all tasks (including robot actions) are represented as discrete tokens, enabling next-token prediction via FAST. The pretraining data mixture includes: |
I haven't read the pi0.5 paper in detail, but my understanding is that they provide some ablations on this data mixture and argue why each component is important. I'd like to see some more detailed analysis about the reasons they gave for using this mixture. What failure modes would arise if any of these categories were left out? How do you think they were balanced (if there's any telling)?
You say "the paper posits that in-the-wild generalization requires knowledge transfer from heterogeneous sources" - here's the place to explain why.
| Given $(x_t, t)$, the model is trained to predict the flow field $v_t$, which is used for integrating from noise to actions at inference. | ||
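The noise-to-action integration mentioned here can be sketched as a few Euler steps (hypothetical signature; in practice `v` is the learned flow network):

```typescript
// Euler integration of a velocity field v(x, tau) from noise (tau = 0)
// toward an action chunk (tau = 1).
function integrateFlow(
  v: (x: number[], tau: number) => number[],
  noise: number[],
  steps = 10,
): number[] {
  let x = noise.slice();
  const dt = 1 / steps;
  for (let i = 0; i < steps; i++) {
    const vel = v(x, i * dt);
    x = x.map((xi, d) => xi + dt * vel[d]);
  }
  return x;
}
```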
| ### 5.4 Scaling and Experiments |
This audit is missing a thesis, which I suspect others have pointed out, but the individual sections should also have theses. In particular, pi0.5 comes with some serious claims of generality - which (as the professor has pointed out) you shouldn't necessarily believe, so take a stance. What's new in this paper compared to e.g. pi0, why does it help with manipulation specifically, and do you think it was a worthy addition? Is pi0.5 really generalizable?
| **Taxonomy Positioning:** | ||
| * **Policy type:** Hierarchical VLA (high-level semantic subtask + low-level control) | ||
| * **Training paradigm:** Large-scale co-training on heterogeneous robot + web + semantic data | ||
| * **Action representation:** Hybrid: discrete FAST tokens (pretraining) + flow matching (inference) |
How do you think this compares to the diffusion-based and discrete token prediction based action representations in Octo and OpenVLA?
I'd like to understand the cost/benefit analysis of treating actions as continuous vs discrete (especially for fine motor skills required for manipulation).
Also how does multi-step action representation compare to single-step actions? Doesn't manipulation require higher frequency feedback (observe-act-observe-act...)? If so, what are the limitations of committing to an open-loop trajectory of actions?
I don't think the description of the training data is well covered. Trajectories can be tracked in either joint space or task (end-effector pose) space. Which do these models use? Has anyone explored training between these two types (any ablation study)? Are both joint and pose trajectories required in training?