Conversation
**LaTeX Rendering Fixes:**
- Add KaTeX CDN fallback to ensure CSS loads on GitHub Pages
- Configure rehype-katex with error-tolerant settings (strict: false, throwOnError: false)
- Reorder CSS imports (KaTeX before globals) to allow overrides
- Add explicit .katex-display styling for proper spacing and centering

**Typography & Readability Improvements:**
- Increase base font size to 18px with line-height 1.75
- Enhance prose configuration with better spacing and hierarchy
- Improve heading styles with proper weights, spacing, and borders
- Remove backticks from inline code styling for cleaner appearance

**Layout Enhancements:**
- Add subtle background gradient (slate-50 to slate-100)
- Increase article max-width from 4xl to 5xl for a wider reading area
- Add white background with shadow to article for visual depth
- Improve sidebar design with gradient background
- Enhance responsive padding and spacing

**Content Styling:**
- Professional dark-themed code blocks with syntax highlighting support
- Enhanced blockquotes with gradient backgrounds and border accents
- Better list spacing and visual hierarchy
- Improved table, link, image, and horizontal rule styling
- Clear visual separation between content sections

**Audit Header Redesign:**
- Larger, bolder title (text-5xl, extrabold) for better hierarchy
- Enhanced tag styling with borders and improved colors
- Add author section with icon
- Add bottom border to separate header from content
- Improve banner designs for review/staging modes with icons

This addresses double-rendering issues on GitHub Pages and significantly improves the readability and visual appeal of audit pages.

Fixes: LaTeX rendering twice (raw + formatted)
Improves: Overall page aesthetics, typography, spacing, and user experience
Pull request overview
Adds a new Manipulation technical audit (staging) and updates MDX rendering/styling (including KaTeX + prose typography) to improve audit readability, alongside a small staging deploy workflow tweak.
Changes:
- Add new staging audit MDX content for “Manipulation: A technical audit”.
- Update prose/typography styling and KaTeX display handling (Tailwind typography + global CSS).
- Adjust audit page rendering (banner/header styling, KaTeX rehype options) and staging deploy host.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Summary per file:

| File | Description |
|---|---|
| tailwind.config.ts | Extends typography defaults for prose content (links, headings, code, KaTeX display spacing). |
| content/textbook/audits/staging/jdvakil_yi_shiuan_tung.mdx | Adds the new manipulation audit content (with math/KaTeX). |
| components/KatexStyles.tsx | Introduces a client-side KaTeX CSS injector via CDN. |
| app/textbook/audits/[...slug]/page.tsx | Updates audit page layout/styling and configures KaTeX rendering options. |
| app/layout.tsx | Reorders KaTeX CSS import ahead of globals.css. |
| app/globals.css | Adds global prose + KaTeX styling rules. |
| .github/workflows/deploy-staging.yml | Updates staging deploy SSH host. |
| .continueignore | Adds ignore patterns for Continue tooling. |
| 1. Compute quantiles: $q_{0.01}, q_{0.99}$ | ||
| 2. Uniformly divide: $\Delta_d = \frac{q_{0.99} - q_{0.01}}{256}$ | ||
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q*{0.01}}{\Delta_d} \right\rfloor$ |
LaTeX typo in the binning formula: q*{0.01} looks like it should be q_{0.01} (matching the quantile notation used just above). As written, KaTeX will render an unexpected q* term.
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q*{0.01}}{\Delta_d} \right\rfloor$ | |
| 3. Map continuous action to bin: $\hat{a}_d = \left\lfloor \frac{a_d - q_{0.01}}{\Delta_d} \right\rfloor$ |
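The corrected three-step binning scheme can be sketched in TypeScript. This is an illustrative reconstruction only; `discretizeAction` and its signature are hypothetical, not taken from any of the papers' code:

```typescript
// Step 1 (computing q_{0.01} and q_{0.99}) is assumed done upstream.
function discretizeAction(a: number, q01: number, q99: number, bins = 256): number {
  const delta = (q99 - q01) / bins;            // step 2: uniform bin width
  const bin = Math.floor((a - q01) / delta);   // step 3: map action to a bin index
  return Math.min(bins - 1, Math.max(0, bin)); // clamp values outside the 1%-99% range
}
```

The clamp handles the outliers that quantile clipping deliberately excludes.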
| <AuditLayout | ||
| chapters={chapters} | ||
| isReviewMode={isReviewMode} | ||
| prNumber={prNumber} | ||
| > |
AuditLayout is being called with isReviewMode and prNumber props, but components/audit/AuditLayout.tsx currently defines AuditLayoutProps with only children and chapters. This will fail type-checking/build unless the layout component (and its props interface) is updated to accept these new props, or the extra props are removed here.
| prNumber={prNumber} | ||
| > | ||
| <KatexStyles /> | ||
| <Link href="/textbook/audits" className="text-sm text-blue-600 hover:text-blue-800 mb-8 inline-block"> |
<KatexStyles /> injects a CDN stylesheet at runtime, but KaTeX CSS is already imported globally in app/layout.tsx. Since Next's bundled CSS won’t match link[href*="katex"], this effect will still append an extra external stylesheet, adding an unnecessary network request and potentially overriding local CSS unexpectedly. Consider removing KatexStyles (preferred) and relying on the bundled KaTeX CSS, or make stylesheet loading consistent in a single place.
| [rehypeKatex, { | ||
| strict: false, // Don't fail on unknown LaTeX commands | ||
| trust: true, // Allow some advanced LaTeX features | ||
| throwOnError: false, // Gracefully handle errors | ||
| }] |
rehype-katex is configured with trust: true. For KaTeX this enables “trusted” commands (e.g., HTML/URL-related macros) and can open the door to XSS if any MDX content is ever untrusted (student submissions / PR content). Unless you have a strong reason + additional sanitization, keep trust disabled (default) and allow only the subset of macros you need.
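If some trusted command is genuinely needed, KaTeX's `trust` option also accepts a predicate, which allows an explicit allowlist instead of blanket trust. A sketch (the allowed command here is a placeholder, not something this audit content necessarily uses):

```typescript
const katexOptions = {
  strict: false,
  throwOnError: false,
  // Allowlist instead of `trust: true`; `context.command` is the macro under
  // evaluation. '\\href' is only an example -- list what the content requires.
  trust: (context: { command: string }) => ['\\href'].includes(context.command),
};
```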
| import { useEffect } from 'react'; | ||
| export function KatexStyles() { | ||
| useEffect(() => { | ||
| // Ensure KaTeX CSS is loaded | ||
| const link = document.createElement('link'); | ||
| link.rel = 'stylesheet'; | ||
| link.href = 'https://cdn.jsdelivr.net/npm/katex@0.16.27/dist/katex.min.css'; | ||
| link.integrity = 'sha384-yp+jpRNKIa0xGrYaVtwImDXkFq7ZOCV5kJZVDg/uAFfYPmtFcKr0sxhVJy1HqnWD'; | ||
| link.crossOrigin = 'anonymous'; | ||
| // Check if already loaded | ||
| const existing = document.querySelector('link[href*="katex"]'); | ||
| if (!existing) { | ||
| document.head.appendChild(link); | ||
| } | ||
| }, []); | ||
This component injects KaTeX CSS from a third-party CDN at runtime. The app already imports katex/dist/katex.min.css in app/layout.tsx, and the current existing check won’t detect Next’s bundled CSS link, so you’ll likely end up loading KaTeX twice. Prefer a single, build-time import (or a single <link> in the document head) to avoid extra requests and reliance on an external CDN.
| import { useEffect } from 'react'; | |
| export function KatexStyles() { | |
| useEffect(() => { | |
| // Ensure KaTeX CSS is loaded | |
| const link = document.createElement('link'); | |
| link.rel = 'stylesheet'; | |
| link.href = 'https://cdn.jsdelivr.net/npm/katex@0.16.27/dist/katex.min.css'; | |
| link.integrity = 'sha384-yp+jpRNKIa0xGrYaVtwImDXkFq7ZOCV5kJZVDg/uAFfYPmtFcKr0sxhVJy1HqnWD'; | |
| link.crossOrigin = 'anonymous'; | |
| // Check if already loaded | |
| const existing = document.querySelector('link[href*="katex"]'); | |
| if (!existing) { | |
| document.head.appendChild(link); | |
| } | |
| }, []); | |
| export function KatexStyles() { | |
| // KaTeX styles are loaded via a build-time import in app/layout.tsx. | |
| // This component is kept for compatibility but no longer injects CSS at runtime. |
| /* Prevent double-rendering by hiding any raw LaTeX that might leak through */ | ||
| .katex-html { | ||
| /* KaTeX should render, not raw LaTeX */ |
.katex-html has an empty rule block, but the comment suggests it prevents “double-rendering” / hides raw LaTeX. As written it has no effect and is confusing for future maintainers. Either remove this selector/comment or add the actual styles needed to address the issue you observed.
| /* KaTeX should render, not raw LaTeX */ | |
| /* KaTeX should render, not raw LaTeX, so hide the underlying HTML/LaTeX layer */ | |
| display: none; |
| // Ensure math displays properly | ||
| '.katex-display': { | ||
| margin: '1.5rem 0', | ||
| padding: '1.5rem 0', | ||
| }, |
.katex-display spacing is defined here inside the typography theme, but very similar .katex-display styling is also added globally in app/globals.css. Keeping the same styles in two places is easy to let drift and makes it unclear which source of truth to edit. Consider consolidating KaTeX display styling into either the typography config or globals (not both).
| // Ensure math displays properly | |
| '.katex-display': { | |
| margin: '1.5rem 0', | |
| padding: '1.5rem 0', | |
| }, |
| ## Summary | ||
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specificiation. |
Typo in the summary: “task specificiation” should be “task specification”.
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specificiation. | |
| Trained on 800k trajectories from the Open X-Embodiment dataset. Octo can be effectively finetuned to new observations and action spaces. Released model checkpoints with 27M and 93M parameters, out of the box, support multiple RGB camera inputs as well as both language and goal image task specification. |
| @@ -0,0 +1,245 @@ | |||
It is missing references.
| * **Dexterous manipulation** involves in-hand repositioning of objects using multi-fingered hands. This requires coordinated control of many degrees of freedom with continuous contact state estimation. $\pi_0$ and $\pi_{0.5}$ have demonstrated progress in this area, though the task remains unsolved in general. | ||
| * **Contact-rich manipulation** involves tasks where the robot must make and maintain complex contact with the environment (e.g., insertion, assembly, screwing, polishing, wiping). These tasks require force modulation and compliance control, which are currently absent from the action spaces of standard VLAs. | ||
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). |
I don't see where Moravec's paradox is introduced. Is this something that's helpful for the reader to understand/is it part of the core analysis of manipulation?
Moravec's paradox is essentially that higher-level reasoning is relatively less computationally intensive than lower-level sensorimotor skill. This ties into vision and reasoning having far more resources (data) than sensorimotor interaction.
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). | ||
| ## 2. Embodiment Gap and Data Scaling | ||
| The primary barrier to generalized manipulation is the Embodiment Gap. LLMs are trained on passive, internet-scale datasets exceeding $15 \times 10^{12}$ tokens, and vision models on pixel datasets exceeding $10 \times 10^9$ image-text pairs. Conversely, robot action data must be physically generated through methods like kinesthetic teaching or teleoperation, yielding dataset sizes closer to $2 \times 10^6$ trajectories. |
There is also synthetic data generation through simulation for robotics.
| $$ | ||
| They overwrite the 256 least-used tokens in the LLaMA vocabulary with action tokens. | ||
Discretizing actions into bins makes training easier, but I’m wondering how much precision is lost for fine manipulation tasks.
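A back-of-envelope check of that precision concern: with uniform bins, the worst-case rounding error is half a bin width. Sketch with hypothetical numbers (not reported in the papers):

```typescript
// Worst-case quantization error of uniform binning is delta / 2.
function maxQuantizationError(q01: number, q99: number, bins = 256): number {
  return (q99 - q01) / bins / 2;
}
// A 1 m action range (after 1%/99% clipping) at 256 bins leaves ~2 mm of
// irreducible error per dimension -- nontrivial for insertion-style tasks.
```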
| ## Abstract | ||
| Robotic manipulation in unstructured environments remains an open problem in embodied AI. Despite progress in vision-language foundation models, translating internet-scale semantic understanding into contact-rich motor control is challenging. This review examines the current landscape of Vision-Language-Action (VLA) models for manipulation through the lens of the data bottleneck. While language models train on trillions of tokens and vision-language models consume billions of image-text pairs, the robot manipulation community has produced roughly two million trajectories, fragmented across embodiments, labs, and task distributions. The field utilizes three competing strategies to address this limitation: (1) transferring internet-scale pretraining into action spaces, (2) multiplying limited real-world data through augmentation and cross-embodiment pooling, and (3) redesigning the action representation itself. This review traces the architectural lineage from RT-1 through $\pi_{0.5}$, formalizes the mathematical trade-offs at each model's critical interfaces, and identifies the limitations of each strategy. | ||
| **Index terms:** Robotic manipulation, Foundational models, Vision-Language-Action (VLA) Model |
| * **Prehensile manipulation** involves grasping and transporting objects. This is the primary focus of current VLA research (e.g., pick-and-place, bin picking, tabletop rearrangement). The key challenge is grasp planning: selecting a grasp pose that is kinematically reachable, stable under the object's mass distribution, and achievable with the robot's gripper geometry. | ||
| * **Non-prehensile manipulation** involves moving objects without grasping them (e.g., pushing, sliding, tilting, or toppling). The physics are dominated by friction and inertia, and control is typically open-loop or quasi-static. | ||
| * **Dexterous manipulation** involves in-hand repositioning of objects using multi-fingered hands. This requires coordinated control of many degrees of freedom with continuous contact state estimation. $\pi_0$ and $\pi_{0.5}$ have demonstrated progress in this area, though the task remains unsolved in general. | ||
| * **Contact-rich manipulation** involves tasks where the robot must make and maintain complex contact with the environment (e.g., insertion, assembly, screwing, polishing, wiping). These tasks require force modulation and compliance control, which are currently absent from the action spaces of standard VLAs. |
All of the examples you gave involve a form of prehensile manipulation, making it seem that this is a subset of that task (somehow involving "complex contact," which isn't defined).
What might help is a taxonomy, or some more strict definitions of these other tasks (dextrous, contact-rich) that set them apart or are key differentiators from the fairly intuitive explanations of prehensile and non-prehensile manipulation.
| While computer vision and natural language processing have scaled to billions of parameters, manipulation remains bottlenecked. This discrepancy illustrates Moravec’s Paradox. Foundational vision models achieve zero-shot semantic understanding, and VLMs exhibit advanced logical reasoning. However, translating this reasoning into physical action is difficult because manipulation requires sub-millimeter geometric perception, multi-step causal reasoning, and the management of continuous contact dynamics (forces, compliance, constraints). | ||
| ## 2. Embodiment Gap and Data Scaling | ||
| The primary barrier to generalized manipulation is the Embodiment Gap. LLMs are trained on passive, internet-scale datasets exceeding $15 \times 10^{12}$ tokens, and vision models on pixel datasets exceeding $10 \times 10^9$ image-text pairs. Conversely, robot action data must be physically generated through methods like kinesthetic teaching or teleoperation, yielding dataset sizes closer to $2 \times 10^6$ trajectories. |
And yet, some leading roboticists are not convinced of this. This is a conversation to watch.
| Two models were trained on this aggregated data: RT-1-X (the RT-1 architecture trained on the pooled dataset) and RT-2-X (RT-2 fine-tuned on the pooled dataset). Positive transfer exists across embodiments; both models outperformed their single-embodiment counterparts, with RT-2-X showing a 3x improvement on emergent skill evaluations compared to RT-2 trained on Google Robot data alone. | ||
| The pooled training does not explicitly model embodiment differences; there is no embodiment embedding or dynamics adapter. The model must implicitly learn to factor its representations into embodiment-invariant (task semantics, object properties) and embodiment-specific (workspace geometry, joint limits, gripper type) components. |
OK, so we all believe the "embodiment gap" exists and we need lots of data to fill it.
So, the solution is to: ignore the embodiment differences and fill the gap with aggregated data across morphologies?
Is this brilliance or stupidity?
| **Taxonomy Positioning:** | ||
| * **Policy type:** Generalist robot policy (vision + optional language/goal) | ||
| * **Training paradigm:** Large-scale imitation learning (OXE mixture) |
What's the breakdown of the data used for the OXE mixture? Why not use the whole dataset? I'm interested in the breakdown of data chosen, and what kind of data diversity they have.
| * Observation tokens at time $t$ attend only to task tokens and observation tokens up to $t$ ($T_{o, 0:t}$) plus language instructions ($T_{\ell}$). | ||
| * Missing modalities are fully masked. | ||
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. |
This is fairly unique to the Octo paper as far as I know. I think it deserves a little more description of how this enables the adding or removing of components. Using one of the figures from the paper to illustrate this might help.
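One way to make the read-only pathway concrete is the block attention mask it implies. A toy sketch (my own illustration, not Octo's actual code):

```typescript
type TokenKind = 'task' | 'obs' | 'readout';

// true = query may attend to key. Readout tokens read everything, but nothing
// attends back to them, so heads hanging off readout tokens can be added or
// removed without disturbing the task/observation representations.
function mayAttend(query: TokenKind, key: TokenKind): boolean {
  if (key === 'readout') return query === 'readout';
  return true; // causal masking over time omitted for brevity
}
```

Because task and observation tokens never see readout tokens, swapping an action head only changes what is decoded *from* the readout stream, not the backbone's computation.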
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. | ||
| #### 3.3.3 Diffusion Action Head | ||
| Octo predicts actions using a conditional diffusion decoder. It performs one transformer forward pass per action, then runs multi-step denoising inside the diffusion head. They train with the standard DDPM objective (adding Gaussian noise to dataset actions and training $\epsilon_\theta$ to reconstruct the original action). |
And what are they conditioning on? Is it multi-modal data? A single RGB frame?
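For reference, the standard DDPM objective being alluded to has this shape, with a generic conditioning variable $e$ standing in for whatever the head is conditioned on (notation is mine, not transcribed from the Octo paper):

```latex
\mathcal{L}_{\mathrm{DDPM}}
  = \mathbb{E}_{a,\;\epsilon \sim \mathcal{N}(0, I),\;k}
    \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_k}\,a
      + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\; k,\; e\right) \right\|^2
```

so the reviewer's question amounts to asking what goes into $e$: a readout embedding summarizing the multi-modal context, a single frame's features, or something else.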
| * Runs at ~6 Hz on RTX 4090 (15GB bfloat16). | ||
| ### 4.2 Problem Domain & Taxonomy | ||
| OpenVLA operates in generalist robotic manipulation across multiple embodiments, diverse scenes, multi-task environments, and end-effector control. The OpenX-Embodiment dataset is filtered to single-arm setups with at least one third-person camera. |
I'm curious if the third person camera view has to be a specific view, or if any angle/orientation of the camera helps. For example, does the third person view have to be directly in front of the robot? Can it be a side angle view and still provide value? I'm not sure if they report any of that information in the paper but it would be interesting to know about the data set.
| * Observation tokens at time $t$ attend only to task tokens and observation tokens up to $t$ ($T_{o, 0:t}$) plus language instructions ($T_{\ell}$). | ||
| * Missing modalities are fully masked. | ||
| Octo inserts learned readout tokens ($T_{R,t}$) that attend to preceding task and observation tokens, but are not attended to by task and observation tokens (forming a read-only pathway). This enables adding or removing observation channels or action heads without reinitializing the transformer. |
Still a little unsure on significance of adding readout tokens, and what they do exactly. Could help elaborate a little? I.e. more info on why add this on top of traditional action/observation/task tokens?
Edit: I just didn't read into it enough. I think I have a better understanding now - this part could probably use a little emphasis, it seems like a pretty important trick within Octo to make the model more generalist.
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. | ||
| ## 6. Limitations and Future Outlook | ||
| The trajectory from behavior cloning to flow-matching VLAs demonstrates progress in closing the embodiment gap. However, physical robotics cannot strictly replicate the internet-scale passive scraping of LLMs. The field is approaching an asymptote on raw physical data collection. Future methods will likely depend on sample-efficient architectures capable of implicit physics understanding. No newline at end of file |
I think this audit is missing a larger discussion on limitations / load-bearing walls. You could even add a section after each paper to discuss the approach-specific limitations.
We understand from this audit that there is an embodiment gap but how close does each approach get to closing this gap?
| ### 5.4 Scaling and Experiments | ||
| * **Data scaling:** Mobile manipulation consists of ~400 hours in ~100 homes, yet 97.6% of Phase 1 examples are from other domains, including web-scale captioning and VQA. | ||
| * **Training scaling:** 280k gradient steps in pre-training, followed by 80k post-training steps for flow-matching. | ||
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. |
What are some examples of the tasks that require broad semantic reasoning where performance declined? Is there any theme amongst those tasks that could be a clue to what's missing in the data mixture for training?
| * **Quantization:** 8-bit quantization increases inference latency, dropping control frequency to 1.2 Hz on an A5000. 4-bit quantization reduces memory usage and yields higher throughput, achieving ~3 Hz control frequency on an A5000 GPU with rollout performance comparable to bfloat16. | ||
| ### 4.5 Experiments | ||
| OpenVLA (7B) outperforms RT-2-X (55B parameters) and Diffusion Policy baselines across 29 tasks on BridgeData V2. LoRA-based parameter-efficient fine-tuning achieves performance close to full fine-tuning with lower memory and compute costs. |
When you say 'diffusion policy baselines', do you mean JUST diffusion, or do the authors also compare to Octo which uses a diffusion action head?
| * **Web Data (WD):** Improves semantic reasoning and object grounding, extended with bounding box annotations. | ||
| **Stage 2: Post-training with an action expert for flow matching** | ||
| Post-training adds a separate 300M parameter action expert (on top of the 2B PaliGemma backbone) that predicts continuous action chunks via flow matching. |
Definitely would help to have a brief description of flow matching so that we know something about it before looking at it's loss in the following section. Also, because it's different than the prior methods, it would be helpful to have a compare and contrast on these action output methods (diffusion vs. flow matching) and when/why to use one or the other.
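For context, flow matching under one common convention (rectified flow; the exact parameterization in the paper may differ) interpolates noisy actions linearly between Gaussian noise and the data, and regresses the constant velocity between them:

```latex
x^{\tau} = \tau\,a + (1-\tau)\,\epsilon, \qquad
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{a,\;\epsilon \sim \mathcal{N}(0,I),\;\tau}
    \left\| v_\theta(x^{\tau}, \tau, o) - (a - \epsilon) \right\|^2
```

At inference, a few Euler steps integrate from $\tau = 0$ (noise) to $\tau = 1$ (action chunk), typically far fewer than the denoising steps a DDPM head needs; that speed/precision trade-off is one axis for the compare-and-contrast requested here.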
crheckman
left a comment
mostly gushing, some requests for changes
| #### 3.1.1 Key Results | ||
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. |
I looked into the remainder of your document. There are no details about how the fine-tuning was executed. Example: What is the data mix with Internet-scale and robot data? Corollary: Can the model complete user: what is the capital of France? with agent: Paris or agent: a_121, a_214, ...?
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. | ||
| * **Accessibility:** Pretraining (Octo-Base / ViT-B-sized backbone) takes 300k steps at batch size 2048 on a TPU v4-128 pod (~14 hours); finetuning on a single 24GB NVIDIA A5000 takes ~5 hours. | ||
| * **Tokenization:** Language instructions are embedded via a pre-trained T5-base model (111M parameters), and images (wrist and 3rd-person cameras) are processed through a shallow CNN into patches. |
I am very confused by the two above points. Accessibility says they train on Octo-Base for 5 hours. But the tokenization says they link this up with a 111M parameter model and process images through a shallow CNN. These two are not compatible architectures without some significant scaffolding. Your later section on the architecture is basically a rehash of these two bullets and does not help grant understanding of what is being done, leaving aside the "why" which is even more important.
| #### 3.1.1 Key Results | ||
| * **Zero-shot:** On environments from its pretraining distribution, Octo achieves ~33% higher average success than RT-1-X and performs similarly to RT-2-X on tested WidowX tasks. | ||
| * **Finetuning:** With ~100 demos per domain, Octo finetunes across real-robot domains, including new observation inputs (force-torque) and new action spaces (joint position control), reaching 72% average success compared to 20% for ResNet+Transformers and 15% for VC-1. | ||
| * **Accessibility:** Pretraining (Octo-Base / ViT-B-sized backbone) takes 300k steps at batch size 2048 on a TPU v4-128 pod (~14 hours); finetuning on a single 24GB NVIDIA A5000 takes ~5 hours. |
What was the pretraining mix? What is the fine-tuning data? Why did they split this up? Why did they need to pretrain rather than starting with a ViT-B for image tokenization?
| OpenVLA fine-tunes a pretrained VLM composed of: | ||
| * **Visual encoder (600M params):** DINOv2 for geometric and spatial features and SigLIP for semantic alignment features. Given image patches $x$: | ||
| $$ |
This architecture makes a lot of sense. Good work
| The authors finetuned the vision encoder during VLA training to capture spatial details for precise robotic control. | ||
| ### 4.4 Scaling and Efficiency | ||
| * **Data and Compute Scaling:** The final OpenVLA model is trained on 970k episodes using 64 A100 GPUs for 14 days (21,500 A100-hours) with a batch size of 2048. |
| * **Generalization:** Demonstrated behaviors include 10-15 minute continuous sequences. Performance declines when web data or cross-embodiment data is ablated for tasks requiring broad semantic reasoning. | ||
| ## 6. Limitations and Future Outlook | ||
| The trajectory from behavior cloning to flow-matching VLAs demonstrates progress in closing the embodiment gap. However, physical robotics cannot strictly replicate the internet-scale passive scraping of LLMs. The field is approaching an asymptote on raw physical data collection. Future methods will likely depend on sample-efficient architectures capable of implicit physics understanding. No newline at end of file |
What are the primary modes of failure? Do you think the collection of raw physical data collection will solve all the problems of robotic manipulation with VLAs?
| The authors finetuned the vision encoder during VLA training to capture spatial details for precise robotic control. | ||
| ### 4.4 Scaling and Efficiency |
IMO, this would be a good place to be more detailed and more opinionated. What design decisions contributed to these scaling behaviors? Do you agree with them?
For instance - OpenVLA uses single-image observations; there's no observation history. (IIRC, they note this as a limitation in the paper.) Why do you think they made this choice? How does it affect the model scaling and performance?
| #### 5.3.2 Two-Stage Training: Discrete-Token Pretraining and Flow-Matching Post-Training | ||
| **Stage 1: Pre-training with discrete tokens (FAST)** | ||
| During pretraining, all tasks (including robot actions) are represented as discrete tokens, enabling next-token prediction via FAST. The pretraining data mixture includes: |
I haven't read the pi0.5 paper in detail, but my understanding is that they provide some ablations on this data mixture and argue why each component is important. I'd like to see some more detailed analysis about the reasons they gave for using this mixture. What failure modes would arise if any of these categories were left out? How do you think they were balanced (if there's any telling)?
You say "the paper posits that in-the-wild generalization requires knowledge transfer from heterogeneous sources" - here's the place to explain why.
| Given $(x_t, t)$, the model is trained to predict the flow field $v_t$, which is used for integrating from noise to actions at inference. | ||
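The noise-to-action integration mentioned here can be sketched as a few Euler steps (hypothetical signature; in practice `v` is the learned flow network):

```typescript
// Euler integration of a velocity field v(x, tau) from noise (tau = 0)
// toward an action chunk (tau = 1).
function integrateFlow(
  v: (x: number[], tau: number) => number[],
  noise: number[],
  steps = 10,
): number[] {
  let x = noise.slice();
  const dt = 1 / steps;
  for (let i = 0; i < steps; i++) {
    const vel = v(x, i * dt);
    x = x.map((xi, d) => xi + dt * vel[d]);
  }
  return x;
}
```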
| ### 5.4 Scaling and Experiments |
This audit is missing a thesis, which I suspect others have pointed out, but the individual sections should also have theses. In particular, pi0.5 comes with some serious claims of generality - which (as the professor has pointed out) you shouldn't necessarily believe, so take a stance. What's new in this paper compared to e.g. pi0, why does it help with manipulation specifically, and do you think it was a worthy addition? Is pi0.5 really generalizable?
| **Taxonomy Positioning:** | ||
| * **Policy type:** Hierarchical VLA (high-level semantic subtask + low-level control) | ||
| * **Training paradigm:** Large-scale co-training on heterogeneous robot + web + semantic data | ||
| * **Action representation:** Hybrid: discrete FAST tokens (pretraining) + flow matching (inference) |
How do you think this compares to the diffusion-based and discrete token prediction based action representations in Octo and OpenVLA?
I'd like to understand the cost/benefit analysis of treating actions as continuous vs discrete (especially for fine motor skills required for manipulation).
Also how does multi-step action representation compare to single-step actions? Doesn't manipulation require higher frequency feedback (observe-act-observe-act...)? If so, what are the limitations of committing to an open-loop trajectory of actions?
I don't think the description of the training data is well covered. Trajectories can be tracked in either joint space or task (end-effector pose) space. Which do these models use? Has anyone explored training between these two types (any ablation study)? Are both joint and pose trajectories required in training?