Autonomous Driving Audit - Aritra Chakrabarty and Zack Allen#62
Conversation
…ut sections but content complete.
…nces to flesh out narrative for it.
…cessary, added a list of references to alphaDrive.mdx. TODO: finish citing for both mdx files.
…echnical paper audit sections and Ari's summary to final submission.
…rive and alpamayo.
…-driving' into Zaaler-aritrach-autonomous-driving
| - The trace must link these factors to the decision in a minimal, behavior-consistent way. | ||
|
|
||
| --- | ||
|
|
Since CoC relies on a predefined taxonomy of causal factors, how robust is the approach to previously unseen causal structures or novel interaction patterns not captured in the labeling schema?
The driving decisions form a closed-set taxonomy, while the causal factors form an open-set taxonomy. The model can therefore adapt to unseen and novel interaction patterns not captured in the labeling schema.
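To make the closed-set/open-set split concrete, a minimal sketch of a trace validator (the decision vocabulary below is hypothetical, not the paper's actual taxonomy):

```python
# Sketch of the closed-set / open-set split described above.
# DRIVING_DECISIONS is an illustrative vocabulary, not AR1's real one.
DRIVING_DECISIONS = {"stop", "yield", "proceed", "lane_change_left", "lane_change_right"}

def validate_coc_trace(decision: str, causal_factors: list[str]) -> bool:
    """A trace is valid if its decision is in the closed set; causal
    factors are open-set free text, so any non-empty string passes."""
    if decision not in DRIVING_DECISIONS:
        return False
    return all(isinstance(f, str) and f.strip() for f in causal_factors)

# A novel causal factor unseen at labeling time still validates:
assert validate_coc_trace("yield", ["e-scooter swerving from bike lane"])
# But an out-of-vocabulary decision is rejected:
assert not validate_coc_trace("teleport", ["construction zone"])
```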
| Self-Supervision requires only future ego locations from driving logs — no human annotation, perception labels, or HD map labels. | ||
| This is a critical scalability property enabling training on Waymo's internal fleet data (24M sequences, 203K hours). | ||
|
|
||
| **Open-loop training, closed-loop gap:** Training is imitation learning on logged trajectories, so the model is never exposed to the distribution shift caused by its own compounding errors. |
Have the authors explored any synthetic closed-loop augmentation (e.g., simulation rollouts or trajectory perturbations) to partially address this distribution shift during training?
The authors of this paper performed no closed-loop evaluation of their model, so they currently have no way of addressing this potential distribution shift. Since the paper was released in 2024, they do mention the large strides being made in world simulators that provide full sensor-suite inputs for closed-loop evaluation.
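For reference, one common form of the trajectory-perturbation augmentation the question mentions is to jitter the ego's current state while keeping the logged future as the target, so the model learns recovery behavior. A sketch under assumed names (`perturb_trajectory` is ours, not the paper's):

```python
import random

def perturb_trajectory(waypoints, sigma=0.5, seed=0):
    """Jitter the current ego position laterally while keeping the
    future waypoints fixed, so the supervised target becomes a
    recovery back toward the logged trajectory."""
    rng = random.Random(seed)
    x0, y0 = waypoints[0]
    offset = rng.gauss(0.0, sigma)  # lateral perturbation in meters
    return [(x0, y0 + offset)] + waypoints[1:]

logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
augmented = perturb_trajectory(logged)
assert augmented[1:] == logged[1:]   # future targets unchanged
assert augmented[0][1] != 0.0        # start state is jittered
```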
|
|
||
| 2) **LLM/LRM critics are calibrated** | ||
| Reasoning reward is computed by a large reasoning model judge; the approach assumes the judge scores correlate with true causal fidelity and not superficial templates. | ||
|
|
how sensitive is training to critic miscalibration, and have the authors tested robustness to alternative judging prompts or scoring criteria?
Using an LLM/LRM to label the chain of causation and then feeding it into a human-in-the-loop scorer is an attempt to prevent miscalibration. The algorithm's success in reducing close-collision events suggests the miscalibration has been mitigated, but the authors provide no testing that quantifies the robustness of their solution. We suggest an approach in our "next 10,000 GPU-hours" section below.
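One cheap robustness probe would be to score the same reasoning trace under several judging prompts and flag high score spread as prompt sensitivity rather than true causal fidelity. A sketch with a stubbed judge (`judge_score` is a placeholder for the LRM critic call; the scoring rule inside it is invented):

```python
from statistics import pstdev

def judge_score(trace: str, prompt_style: str) -> float:
    """Stub standing in for an LLM-judge call; in practice this would
    query the reasoning-model critic with the given prompt template."""
    base = min(1.0, len(set(trace.split())) / 20)  # toy proxy for richness
    bonus = 0.05 if prompt_style == "strict" else 0.0
    return round(base - bonus, 3)

def robustness_probe(trace: str, styles=("default", "strict", "rubric")) -> float:
    """Std-dev of scores across judging prompts: a high spread flags
    sensitivity to the prompt rather than to the trace itself."""
    scores = [judge_score(trace, s) for s in styles]
    return pstdev(scores)

spread = robustness_probe("slowing because pedestrian entering crosswalk ahead")
assert 0.0 <= spread < 0.1
```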
| ## Problem Statement | ||
| Modern autonomy systems increasingly explore **VLM/MLLM-based planners** that map perception (images/video) plus context (routing/intent/ego state) into **driving decisions**. | ||
| Across real-world driving, (i) **multiple actions can be valid** for the same scene, (ii) decisions must satisfy **real-time constraints**, and (iii) developers often want **human-interpretable rationales**—ideally with some form of **consistency** between the rationale and the executed plan. | ||
| These three papers share that motivation, but differ in **action representation**, **reasoning representation**, and **how training enforces correctness vs diversity vs causal consistency**. |
Saying "correctness vs diversity vs causal consistency" seems to imply that training is pitting these three things against each other. Is that the take-away here? Or are these just three related parts of training rather than 3 warring factions of it?
Not necessarily "warring factions", just different approaches to solving the problem. Each group focused on different aspects during training, with causal consistency being the major selling point of Alpamayo-R1.
crheckman
left a comment
first 10 minutes of review period
|
|
||
| # Features (Inputs / Outputs / What “Action” Means) | ||
|
|
||
| | Model | Primary Inputs | Primary Outputs | What “Action” is | |
Primary inputs should include exactly what the framerate and window of history is passed along. These models have fundamental differences in their context length and the multimodal ingest that aren't clear based on this table. For instance, EMMA has a text representation of history, but does it only provide the t=0 image, or the t=-k image, or some subset of them?
|
|
||
| --- | ||
|
|
||
| # Training & Supervision |
We need a section on training data here. What is the volume, how was the training data constructed, what does that implicitly emphasize.
| |---|---|---|---| | ||
| | **AlphaDrive** | (1) Distill reasoning from a larger teacher → **SFT** warm-start; (2) **GRPO RL** refinement | GT meta-actions + reward shaping | **Multimodal planning** (diversity), **safety-critical weighting**, and structured output constraints | | ||
| | **EMMA** | Multitask training with a unified language formulation; adds **CoT** prompting/training | **Future ego locations** from logs for planning; plus task-specific labels (detection/road-graph) | **Shared interface across tasks**; co-training yields cross-task gains | | ||
| | **Alpamayo-R1** | Multi-stage: add action modality → SFT for reasoning → **RL post-training**; plus **CoC dataset/pipeline** | Structured **Chain-of-Causation** + trajectory objectives | **Causal structure**, **reasoning/action consistency**, and high-quality multimodal trajectories under runtime constraints | |
Get as specific as you can about this one - what is a trajectory objective? Is it any different from the "future ego locations from logs" of EMMA?
| | Model | Training Stages | Key Supervision Signal | What the objective emphasizes | | ||
| |---|---|---|---| | ||
| | **AlphaDrive** | (1) Distill reasoning from a larger teacher → **SFT** warm-start; (2) **GRPO RL** refinement | GT meta-actions + reward shaping | **Multimodal planning** (diversity), **safety-critical weighting**, and structured output constraints | | ||
| | **EMMA** | Multitask training with a unified language formulation; adds **CoT** prompting/training | **Future ego locations** from logs for planning; plus task-specific labels (detection/road-graph) | **Shared interface across tasks**; co-training yields cross-task gains | |
Future ego locations from logs as a supervisory signal implies that the driver's decision was the right one. It also relies on an enormous corpus of human expert driving data.
Aside: who has the largest corpus of human expert driving data?
| |---|---|---| | ||
| | **AlphaDrive** | Structured “planning reasoning” text (format explicitly rewarded) | Improves planning quality via distillation + RL; reasoning is trained as part of the output distribution | | ||
| | **EMMA** | Chain-of-thought rationales (text) | Primarily an accompanying rationale paired with predicted outputs; leverages MLLM capabilities and unified prompting | | ||
| | **Alpamayo-R1** | **Chain-of-Causation (CoC)** (decision-grounded causal links) | Intended to provide *structured* decision grounding and improved alignment between reasoning and action generation | |
Did they not demonstrate a performance improvement when introducing CoC? All you mention here are "structure enforcement" (teacher forcing) and "alignment" (frictionless reasoning->action)
| ## 1. Summary | ||
|
|
||
| EMMA is a **Gemini-powered end-to-end multimodal model** for autonomous driving that directly maps raw surround-view camera images into driving-specific outputs: **future ego trajectories**, 3D object detections, road graph elements, and scene understanding predictions. | ||
| All non-sensor inputs (navigation commands, ego history) and all outputs (trajectory waypoints, bounding boxes) are represented as **plain text**, unifying every task within a single language space and allowing task-specific behavior to be selected at inference time via prompt variation. |
Is there any discussion of when this type of input translation to text might fail? It seems like this opens the door for some information degradation if all inputs of all types are being transformed into a text representation.
| - the continuous waypoint output space is heterogeneous from the natural language space used for all other tasks | ||
|
|
||
| The paper argues modular pipelines with fixed symbolic interfaces are brittle at the long tail, and end-to-end imitation approaches trained on limited datasets fail to generalize. | ||
| EMMA's resolution: leverage the pre-training scale and world knowledge of Gemini, fine-tuned within a unified language output space. |
"Pre-training scale and world knowledge" of has become the norm for justifying the use of these large models. I say the same thing, but I do think it's worth investigating just how much we really gain from these models. For example, can we conclusively point to the "world knowledge of Gemini" and say that it has discovered long-tail scenarios for autonomous driving that we couldn't have predicted? I suppose this really isn't an actionable comment for this audit unless they explicitly disclose interesting findings.
| |---|---|---| | ||
| | Modular stacks + LLM augmentation | Specialized modules with LLMs for explainability/command | Fixed interfaces brittle to novel environments | | ||
| | End-to-end imitation planners | Direct sensor-to-trajectory mapping | Prone to ego-status shortcuts; limited generalization | | ||
| | **VLM-primary generalist (EMMA)** | MLLM as core compute; all tasks as VQA in unified language space | Closed-loop stability unvalidated | |
What would it look like to validate the closed-loop stability? I'm having trouble understanding whether this critique refers to the outputs to the car's controls or to the stability of the model's VQA performance. Specifically, what is the stability criterion for validation?
I think we should include an explanation of open-loop vs. closed-loop testing. Open-loop evaluation scores predictions against logged data without feeding them back into the system, while closed-loop evaluation lets the model's actions affect the simulator or vehicle, so the model must react to the consequences of its own decisions.
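A toy numeric illustration of why the distinction matters: the same small per-step error stays constant in open-loop evaluation but compounds in closed-loop rollout (the policy and numbers here are invented for illustration):

```python
def policy(state):
    """Toy policy that overshoots slightly: intended step is +1.0."""
    return state + 1.1

logged_states = [0.0, 1.0, 2.0, 3.0]

# Open-loop: each prediction starts from the *logged* state,
# so the per-step error never compounds.
open_loop_err = [abs(policy(s) - t) for s, t in zip(logged_states, logged_states[1:])]

# Closed-loop: the policy consumes its own previous output,
# so the same 0.1 per-step error accumulates.
state, closed_loop_err = logged_states[0], []
for target in logged_states[1:]:
    state = policy(state)
    closed_loop_err.append(abs(state - target))

assert all(abs(e - 0.1) < 1e-9 for e in open_loop_err)   # flat error
assert closed_loop_err[-1] > open_loop_err[-1]           # compounding error
```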
|
|
||
| **Experiment B — "Multi-Task Output Consistency Verification"** <br /> | ||
| Define a consistency oracle: if a bounding box is predicted at $(x, y, \theta)$, the trajectory should respect that agent's right-of-way; if the road graph predicts a merge in 40m, the trajectory should respond within that horizon. Run on the full nuScenes validation set. | ||
| - **Success**: Consistency failure rate < 5% for safety-critical classes, or inconsistencies used to build a consistency-regularized training objective. |
What is the <5% metric based on? Is this an industry standard of some kind? It feels a bit arbitrary.
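For the merge rule in the quoted experiment, the oracle could be implemented along these lines; the 40 m horizon comes from the text above, but the thresholds and function names are illustrative, not a proposed standard:

```python
def consistency_check(merge_distance_m, speeds_mps, dt=0.5):
    """Flag inconsistency if a merge is predicted within the horizon but
    the planned speed profile never responds (no deceleration) before
    reaching it. Names and thresholds are illustrative."""
    dist, responded = 0.0, False
    for v_prev, v_next in zip(speeds_mps, speeds_mps[1:]):
        dist += v_prev * dt
        if dist > merge_distance_m:
            break
        if v_next < v_prev:  # any deceleration counts as a response
            responded = True
    return responded

# Plan that slows before a merge 40 m ahead -> consistent
assert consistency_check(40.0, [15, 15, 12, 10, 10])
# Plan that never slows -> inconsistent
assert not consistency_check(40.0, [15, 15, 15, 15, 15])
```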
| ### 7.4 The Next 10,000 GPU-hour Experiment | ||
|
|
||
| **Experiment A — "Closed-Loop Stability via Sensor Simulation"** <br /> | ||
| Leverage neural scene reconstruction (NeRF/Gaussian Splatting) to generate novel camera observations from perturbed ego states, enabling reactive closed-loop evaluation without a full physical simulator. |
Why does it have to be NeRF/Gaussian Splats?
| - disorganized reasoning with weak causal links, | ||
| - overly long and ineffective reasoning. | ||
|
|
||
| So they use a stronger cloud model (e.g., GPT-4o) to generate concise reasoning conditioned on real actions + state + nav, manually filter errors, and distill via SFT. |
How many traces were generated and filtered?
|
|
||
| ## 1. Summary | ||
|
|
||
| Alpamayo-R1 (AR1) is a **vision–language–action (VLA)** based driving policy designed to improve **generalization** in safety-critical long-tail scenarios where pure imitation learning is brittle. |
Should "generalization" be defined here? It's a term we throw around a lot, but what does it really mean in the context of this paper?
|
|
||
| - **Vision input ($V$):** Surround-view camera images (6–8 cameras, 360° coverage). No LiDAR, radar, or HD maps. | ||
| - **Language input ($T$):** A high-level intent command (e.g., "go straight") and historical ego waypoints as plain-text BEV coordinate pairs: $T_{ego} = \{(x_t, y_t)\}_{t=-T_h}^{-1}$. | ||
| - **Output format:** All outputs are natural language text: trajectory waypoints as floating-point pairs, bounding boxes as `[x, y, z, l, w, h, θ, class]`, road graph polylines as semicolon-delimited waypoint sequences with `valid`/`invalid` tagging. |
Are outputs validated before downstream use?
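A sketch of the kind of validation gate one might put between the text output and downstream consumers; the regex, waypoint count, and plausibility threshold are our assumptions, not the paper's:

```python
import re

WAYPOINT_RE = re.compile(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)")

def parse_waypoints(text: str, expected: int, max_step_m: float = 5.0):
    """Parse '(x, y)' pairs from model text and reject outputs with the
    wrong count or physically implausible jumps between waypoints.
    Returns None on failure so downstream code can fall back."""
    pts = [(float(x), float(y)) for x, y in WAYPOINT_RE.findall(text)]
    if len(pts) != expected:
        return None
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 > max_step_m:
            return None
    return pts

assert parse_waypoints("(0.0, 0.0) (1.2, 0.1) (2.4, 0.3)", expected=3) is not None
assert parse_waypoints("(0.0, 0.0) (90.0, 0.0)", expected=2) is None  # 90 m jump
```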
| The paper’s central claim is that “reasoning” only helps driving if it is **(i) causally grounded, (ii) decision-aligned, and (iii) behavior-consistent**, and that you need both *data* (via a reasoning specific dataset) and *training* to make it possible. | ||
|
|
||
| AR1 couples two outputs: | ||
| a structured **Chain of Causation (CoC)** reasoning trace, and a **6.4s future ego trajectory** (controls/trajectory), so the model is trained to jointly predict the *action* and the *thought process* in one step. |
Though a reader might infer what it means to do "Chain of Causation" versus "Chain of Thought", I do think it's worth explicitly describing how they differ. Sorry if that's been done later on in the audit...
| Self-Supervision requires only future ego locations from driving logs — no human annotation, perception labels, or HD map labels. | ||
| This is a critical scalability property enabling training on Waymo's internal fleet data (24M sequences, 203K hours). | ||
|
|
||
| **Open-loop training, closed-loop gap:** Training is imitation learning on logged trajectories, so the model is never exposed to the distribution shift caused by its own compounding errors. |
nit: maybe add more info on closed-loop vs. open-loop
|
|
||
| Because high-quality driving “chain-of-thought” data is scarce, they use a multi-stage reasoning strategy: generate a small batch of reasoning traces using a stronger cloud model (e.g., GPT-4o), manually filter it, SFT warm-up on that reasoning data for stability, then run **RL on the full dataset**. | ||
|
|
||
| On MetaAD (120k 3-second clips; 110k train / 10k val), AlphaDrive reports **77.12 overall planning accuracy**, outperforming fine-tuned baselines including a larger Qwen2VL-7B result (61.44). |
what does 77.12 mean?
is it good enough to drive?
|
|
||
| On MetaAD (120k 3-second clips; 110k train / 10k val), AlphaDrive reports **77.12 overall planning accuracy**, outperforming fine-tuned baselines including a larger Qwen2VL-7B result (61.44). | ||
|
|
||
| They further claim **+25.52%** planning accuracy vs an SFT-trained model, and that with only 20% training data they outperform SFT by **35.31%**, emphasizing data-efficiency. |
Strange wording. Does this mean that reasoning improves over an already-SFT'd model by 25.52%? Where does the 35.31% number come from? Do they ablate SFT to 20% of the original fine-tuning data and then run their standard reasoning-RL improvement?
| The paper argues that naive “correctness reward” used in math/programming applications does not transfer cleanly to planning because there often isn't a single verifiable solution in driving; you need a reward that is robust early in training and resistant to shortcut solutions. | ||
|
|
||
| ### 2.2 Context | ||
| - **End-to-end driving models** can output trajectories/controls directly from sensors, but they are “black-box” systems that struggle with the long-tail of driving cases because they lack explicit reasoning. |
but... the reasoning systems are black box too? (evidence: alpamayo-r1 acknowledges this by saying they will improve their system's direct alignment between reasoning->action through CoC)
|
|
||
| AlphaDrive’s “architecture” is best described as a **training + inference pipeline**. | ||
|
|
||
| <img src="https://raw.githubusercontent.com/arpg/vla-foundations/audit/Zaaler-aritrach-autonomous-driving/content/textbook/audits/staging/figures/alphadrive/alphadrive_architecture.png" alt="AlphaDrive Architecture" width="900" /> |
this... is a very interesting example. Why does the "User" prompt exist at all? Isn't this a driver, not a conversational agent? Also, what about the pedestrian over-yielding that Answer 1 suggests? I would agree that the pedestrians might move, and if we aren't running at a high enough frequency, the predictions will misalign with the action timing.
This seems like a useful example to walk through as a deep dive, if they provide more details on it within the paper.
|
|
||
| ### 5.1 Dataset | ||
|
|
||
| They adopt **MetaAD** \[*NOTE: Could not find this dataset anywhere, neither could their reviewers at ICLR 2026*\] as the benchmark: |
wow, amazing that they have essentially no extra details on this. Was it human drivers? How did they "balance the distribution" over environments and planning actions.
|
|
||
| **Experiment B — “Multimodal plan selection” in closed-loop** | ||
| - Motivation: they claim multimodal planning emerges post-RL. | ||
| - Proposal: generate K plans, run a safety/rule feasibility filter, select, then evaluate closed-loop safety proxies (hard-brake rate, time-to-collision proxy, rule violations). |
thereby multiplying your model footprint by K?
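Worth noting the K samples would come from a single model, so the cost is K inference passes, not K sets of weights. A sketch of the filter-then-select step (the `clearance` field and threshold are illustrative stand-ins for a real rule/kinematics/collision check):

```python
def select_plan(plans, min_clearance_m=1.5):
    """Filter K sampled plans by a cheap feasibility check, then pick
    the one with the largest clearance. A real filter would also check
    traffic rules, kinematic limits, and collisions."""
    feasible = [p for p in plans if p["clearance"] >= min_clearance_m]
    if not feasible:
        return None  # caller falls back to a safe default (e.g., brake)
    return max(feasible, key=lambda p: p["clearance"])

plans = [
    {"id": "swerve", "clearance": 0.8},
    {"id": "slow",   "clearance": 2.4},
    {"id": "stop",   "clearance": 3.0},
]
assert select_plan(plans)["id"] == "stop"
assert select_plan([{"id": "swerve", "clearance": 0.8}]) is None
```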
|
|
||
| ### 7.5 Sign-Off Criteria | ||
|
|
||
| **Technical recommendation:** |
Research sure, but what about the data? Can't even see where to improve on this one without knowing what the complement of their data is
|
|
||
| ### 4.1 Self-Supervised Planning Objective | ||
|
|
||
| EMMA's core training objective is **autoregressive next-token prediction** over text-encoded trajectory waypoints, conditioned on visual and language tokens. |
text-encoded trajectory waypoints lol these researchers were just rushing out the first VLM for driving paper
|
|
||
| ### 4.2 Chain-of-Thought Reasoning Structure | ||
|
|
||
| EMMA predicts a four-component **driving rationale** before outputting waypoints, structured coarse-to-fine: |
hugely important, since this is also what their SFT dataset consisted of
| *Table 2: End-to-end motion planning on nuScenes. | ||
| EMMA and EMMA+ achieve state-of-the-art, outperforming all prior supervised and self-supervised methods.* | ||
|
|
||
| **WOMD** — EMMA+ (w/ CoT): **0.027 / 0.203 / 0.543m at 1s/3s/5s**, outperforming MotionLM (0.696m at 5s) and Wayformer (0.628m at 5s) by 13.5–22.5%, while those baselines consume LiDAR-derived agent boxes, HD maps, and traffic light states as input. |
but they did not consume images.
|
|
||
| 2. **Diversity reward producing “diverse but unsafe” plans** | ||
| Diversity is rewarded by penalizing frequency among sampled answers. | ||
| - Risk: incentivize disagreement without feasibility grounding, making downstream selection harder. |
I think some more detail about the consequences of the diversity reward would be appropriate, since it's a major component of the paper. What kind of behaviors does it incentivize / disincentivize? How might it affect training behavior?
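One reading of a frequency-penalizing diversity reward, sketched to make the failure mode visible (this is an illustration of the idea, not AlphaDrive's exact formula):

```python
from collections import Counter

def diversity_rewards(sampled_answers):
    """Each sample earns 1/count(answer), so repeated answers split the
    credit and rare answers are rewarded in full. Note there is no
    feasibility term: rarity alone is what gets rewarded."""
    counts = Counter(sampled_answers)
    return [1.0 / counts[a] for a in sampled_answers]

rewards = diversity_rewards(["slow", "slow", "stop", "swerve"])
assert rewards == [0.5, 0.5, 1.0, 1.0]
```

Here "swerve" earns the maximal reward purely for being rare, which is exactly the "diverse but unsafe" risk the quoted text flags.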
| ### 7.1 Load-Bearing Assumptions | ||
|
|
||
| 1. **Reward alignment assumption** | ||
| The 4-reward design (F1 accuracy + action weights + diversity + format) must correlate with “better driving,” not just better label matching. |
Relatedly, there's an assumption that the safety breakdown via action weighting actually captures safety. How exactly are different behaviors assigned different safety importance? (And if this is done at the action level - which seems to be the case - isn't that a very low fidelity approach to safety, given that their actions are pretty general?)
|
|
||
| 3. **Format-induced brittleness** | ||
| Format reward is hard-zero when tags fail. | ||
| - Risk: rare formatting drift can be catastrophic in a production parser unless you robustify extraction. |
I don't totally understand this point. Is the argument that the format reward should be more aggressive (or more informative), because a format failure in production would be catastrophic? What do you mean by "robustify[ing] extraction"?
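One interpretation of "robustifying extraction" is degrading gracefully instead of hard-failing when tags drift. A sketch (the `<answer>` tag name is illustrative, not AlphaDrive's actual schema):

```python
import re

def extract_answer(text: str):
    """Try strict <answer>...</answer> tags first, then fall back to
    lenient patterns instead of returning a hard failure."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fallback 1: tolerate an unclosed tag
    m = re.search(r"<answer>(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fallback 2: last non-empty line as a best-effort guess
    lines = [l for l in text.splitlines() if l.strip()]
    return lines[-1].strip() if lines else None

assert extract_answer("<answer>slow down</answer>") == "slow down"
assert extract_answer("<answer>slow down") == "slow down"   # drift-tolerant
assert extract_answer("I think:\nslow down") == "slow down"
```

During training the format reward can stay strict; the point is only that a production parser should not share the reward's hard-zero behavior.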
| 2) **Causal understanding gap** | ||
| Many “reasoning datasets” for AVs have explanations that are: | ||
| - vague (“be cautious”), | ||
| - not decision-committing (no explicit maneuver), |
Aren't there times when not making an explicit maneuver is desirable? The act of not making a maneuver is a decision in itself, right?
|
|
||
| The paper is addressing a concrete deployment failure pattern: | ||
|
|
||
| > A policy can look good in open-loop trajectory metrics, yet still fail in closed-loop, interactive, long-tail scenarios. |
An example of this would be really helpful.
|
|
||
| #### Outputs | ||
| - **CoC reasoning trace** | ||
| A structured explanation aligned to a **closed-set driving decision** that is anchored to an *explicit* decision category. |
What are the explicit decision categories?
| - **CoC reasoning trace** | ||
| A structured explanation aligned to a **closed-set driving decision** that is anchored to an *explicit* decision category. | ||
| - **Continuous future trajectory** | ||
| The model predicts a **future trajectory over a fixed horizon (6.4s)**. |
Was 6.4 seconds the right balance between usefulness and inference time latency?
|
|
||
| For post-training, the paper uses a **GRPO-style** (Group Relative Policy Optimization) approach: | ||
|
|
||
| - Sample multiple rollouts per prompt/context. |
How can we (as the reader) understand what's necessary for recreating this method? For example, how many rollouts are needed? It says score them with "reward models / critics" so how many of those are needed? What's the group size? How often is KL regularization applied (periodically or every iteration)? Maybe these details are coming later in the audit...
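For reference, the group-relative part of GRPO is just normalizing each rollout's reward against its group's statistics, which removes the need for a learned value function. Group size, critic count, and KL schedule are exactly the unreported details asked about above; this sketch shows only the advantage computation:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize each rollout's reward by the group mean and std; the
    resulting advantages are zero-mean within the group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
assert abs(sum(adv)) < 1e-6          # zero-mean within the group
assert adv[0] > 0 and adv[1] < 0     # above/below group average
```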
| 3. Apply rule-based matching across both axes. | ||
|
|
||
| The reward is assigned as: | ||
| - $r_\text{consistency} = 1$ if the reasoning-implied meta-actions match the trajectory-derived meta-actions **for both longitudinal and lateral behavior**, |
What determines whether these match? Is it direct text comparison?
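Presumably the comparison is over a small set of meta-action labels rather than raw text. A sketch of what rule-based matching across both axes could look like; the thresholds and label names are illustrative, not the paper's:

```python
def trajectory_meta_actions(speeds, lateral_offsets, tol=0.2):
    """Derive (longitudinal, lateral) meta-actions from a trajectory
    with simple threshold rules; thresholds here are illustrative."""
    dv = speeds[-1] - speeds[0]
    lon = "decelerate" if dv < -tol else "accelerate" if dv > tol else "maintain"
    dy = lateral_offsets[-1] - lateral_offsets[0]
    lat = "left" if dy > tol else "right" if dy < -tol else "straight"
    return lon, lat

def consistency_reward(reasoning_actions, speeds, lateral_offsets):
    """1 only if BOTH axes match: a categorical comparison, not a raw
    text comparison of the reasoning trace."""
    return 1 if reasoning_actions == trajectory_meta_actions(speeds, lateral_offsets) else 0

# Reasoning implies "decelerate, straight"; trajectory agrees -> reward 1
assert consistency_reward(("decelerate", "straight"), [10, 8, 6], [0.0, 0.05, 0.1]) == 1
# Trajectory actually drifts left -> reward 0
assert consistency_reward(("decelerate", "straight"), [10, 8, 6], [0.0, 0.5, 1.0]) == 0
```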
| 3) **Closed-set decision taxonomy is expressive enough** | ||
| CoC enforces decisions from a predefined set; AR1 assumes this is enough to capture the key maneuver choices relevant to long-tail safety. | ||
|
|
||
| 4) **Diffusion decoding produces plans that are controller-compatible** |
Is this an assumption or do they enforce controller-compatibility on the outputs to make them stable?
Draft of Autonomous Driving Audit including discussion of the following three papers: