Autonomous Driving Audit - Aritra Chakrabarty and Zack Allen#62
Conversation
…ut sections but content complete.
…nces to flesh out narrative for it.
…cessary, added a list of references to alphaDrive.mdx. TODO: finish citing for both mdx files.
…echnical paper audit sections and Ari's summary to final submission.
…rive and alpamayo.
…-driving' into Zaaler-aritrach-autonomous-driving
| - The trace must link these factors to the decision in a minimal, behavior-consistent way. | ||
|
|
||
| --- | ||
|
|
Since CoC relies on a predefined taxonomy of causal factors, how robust is the approach to previously unseen causal structures or novel interaction patterns not captured in the labeling schema?
The driving decisions form a closed-set taxonomy, while the causal factors form an open-set taxonomy. The model can therefore adapt to unseen and novel interaction patterns not captured in the labeling schema.
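To make the closed-set/open-set split concrete, a minimal sketch of a trace validator (the decision vocabulary below is hypothetical, not the paper's actual taxonomy):

```python
# Sketch of the closed-set / open-set split described above.
# DRIVING_DECISIONS is an illustrative vocabulary, not AR1's real one.
DRIVING_DECISIONS = {"stop", "yield", "proceed", "lane_change_left", "lane_change_right"}

def validate_coc_trace(decision: str, causal_factors: list[str]) -> bool:
    """A trace is valid if its decision is in the closed set; causal
    factors are open-set free text, so any non-empty string passes."""
    if decision not in DRIVING_DECISIONS:
        return False
    return all(isinstance(f, str) and f.strip() for f in causal_factors)

# A novel causal factor unseen at labeling time still validates:
assert validate_coc_trace("yield", ["e-scooter swerving from bike lane"])
# But an out-of-vocabulary decision is rejected:
assert not validate_coc_trace("teleport", ["construction zone"])
```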
| Self-Supervision requires only future ego locations from driving logs — no human annotation, perception labels, or HD map labels. | ||
| This is a critical scalability property enabling training on Waymo's internal fleet data (24M sequences, 203K hours). | ||
|
|
||
| **Open-loop training, closed-loop gap:** Training is imitation learning on logged trajectories, so the model is never exposed to the distribution shift caused by its own compounding errors. |
Have the authors explored any synthetic closed-loop augmentation (e.g., simulation rollouts or trajectory perturbations) to partially address this distribution shift during training?
The authors of this paper performed no closed-loop evaluation of their model, so they currently have no way of addressing this potential distribution shift. Since the paper was released in 2024, they do mention the large strides being made in world simulators that provide full sensor-suite inputs for closed-loop evaluation.
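For reference, one common form of the trajectory-perturbation augmentation the question mentions is to jitter the ego's current state while keeping the logged future as the target, so the model learns recovery behavior. A sketch under assumed names (`perturb_trajectory` is ours, not the paper's):

```python
import random

def perturb_trajectory(waypoints, sigma=0.5, seed=0):
    """Jitter the current ego position laterally while keeping the
    future waypoints fixed, so the supervised target becomes a
    recovery back toward the logged trajectory."""
    rng = random.Random(seed)
    x0, y0 = waypoints[0]
    offset = rng.gauss(0.0, sigma)  # lateral perturbation in meters
    return [(x0, y0 + offset)] + waypoints[1:]

logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
augmented = perturb_trajectory(logged)
assert augmented[1:] == logged[1:]   # future targets unchanged
assert augmented[0][1] != 0.0        # start state is jittered
```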
|
|
||
| 2) **LLM/LRM critics are calibrated** | ||
| Reasoning reward is computed by a large reasoning model judge; the approach assumes the judge scores correlate with true causal fidelity and not superficial templates. | ||
|
|
how sensitive is training to critic miscalibration, and have the authors tested robustness to alternative judging prompts or scoring criteria?
Using an LLM/LRM to label the chain of causation and then feeding it into a human-in-the-loop scorer is an attempt to prevent miscalibration. The algorithm's success in reducing close-collision events suggests the miscalibration has been mitigated, but the authors provide no testing that quantifies the robustness of their solution. We suggest an approach in our "next 10,000 GPU-hours" section below.
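One cheap robustness probe would be to score the same reasoning trace under several judging prompts and flag high score spread as prompt sensitivity rather than true causal fidelity. A sketch with a stubbed judge (`judge_score` is a placeholder for the LRM critic call; the scoring rule inside it is invented):

```python
from statistics import pstdev

def judge_score(trace: str, prompt_style: str) -> float:
    """Stub standing in for an LLM-judge call; in practice this would
    query the reasoning-model critic with the given prompt template."""
    base = min(1.0, len(set(trace.split())) / 20)  # toy proxy for richness
    bonus = 0.05 if prompt_style == "strict" else 0.0
    return round(base - bonus, 3)

def robustness_probe(trace: str, styles=("default", "strict", "rubric")) -> float:
    """Std-dev of scores across judging prompts: a high spread flags
    sensitivity to the prompt rather than to the trace itself."""
    scores = [judge_score(trace, s) for s in styles]
    return pstdev(scores)

spread = robustness_probe("slowing because pedestrian entering crosswalk ahead")
assert 0.0 <= spread < 0.1
```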
| ## Problem Statement | ||
| Modern autonomy systems increasingly explore **VLM/MLLM-based planners** that map perception (images/video) plus context (routing/intent/ego state) into **driving decisions**. | ||
| Across real-world driving, (i) **multiple actions can be valid** for the same scene, (ii) decisions must satisfy **real-time constraints**, and (iii) developers often want **human-interpretable rationales**—ideally with some form of **consistency** between the rationale and the executed plan. | ||
| These three papers share that motivation, but differ in **action representation**, **reasoning representation**, and **how training enforces correctness vs diversity vs causal consistency**. |
Saying "correctness vs diversity vs causal consistency" seems to imply that training is pitting these three things against each other. Is that the take-away here? Or are these just three related parts of training rather than 3 warring factions of it?
Not necessarily "warring factions", just different approaches to solving the problem. Each group focused on different aspects during training, with causal consistency being the major selling point of Alpamayo-R1.
crheckman
left a comment
first 10 minutes of review period
|
|
||
| # Features (Inputs / Outputs / What “Action” Means) | ||
|
|
||
| | Model | Primary Inputs | Primary Outputs | What “Action” is | |
Primary inputs should include exactly what the framerate and window of history is passed along. These models have fundamental differences in their context length and the multimodal ingest that aren't clear based on this table. For instance, EMMA has a text representation of history, but does it only provide the t=0 image, or the t=-k image, or some subset of them?
|
|
||
| --- | ||
|
|
||
| # Training & Supervision |
We need a section on training data here. What is the volume, how was the training data constructed, what does that implicitly emphasize.
| |---|---|---|---| | ||
| | **AlphaDrive** | (1) Distill reasoning from a larger teacher → **SFT** warm-start; (2) **GRPO RL** refinement | GT meta-actions + reward shaping | **Multimodal planning** (diversity), **safety-critical weighting**, and structured output constraints | | ||
| | **EMMA** | Multitask training with a unified language formulation; adds **CoT** prompting/training | **Future ego locations** from logs for planning; plus task-specific labels (detection/road-graph) | **Shared interface across tasks**; co-training yields cross-task gains | | ||
| | **Alpamayo-R1** | Multi-stage: add action modality → SFT for reasoning → **RL post-training**; plus **CoC dataset/pipeline** | Structured **Chain-of-Causation** + trajectory objectives | **Causal structure**, **reasoning/action consistency**, and high-quality multimodal trajectories under runtime constraints | |
Get as specific as you can about this one - what is a trajectory objective? Is it any different from the "future ego locations from logs" of EMMA?
| | Model | Training Stages | Key Supervision Signal | What the objective emphasizes | | ||
| |---|---|---|---| | ||
| | **AlphaDrive** | (1) Distill reasoning from a larger teacher → **SFT** warm-start; (2) **GRPO RL** refinement | GT meta-actions + reward shaping | **Multimodal planning** (diversity), **safety-critical weighting**, and structured output constraints | | ||
| | **EMMA** | Multitask training with a unified language formulation; adds **CoT** prompting/training | **Future ego locations** from logs for planning; plus task-specific labels (detection/road-graph) | **Shared interface across tasks**; co-training yields cross-task gains | |
Future ego locations from logs as a supervisory signal implies that the driver's decision was the right one. It also relies on an enormous corpus of human expert driving data.
Aside: who has the largest corpus of human expert driving data?
| |---|---|---| | ||
| | **AlphaDrive** | Structured “planning reasoning” text (format explicitly rewarded) | Improves planning quality via distillation + RL; reasoning is trained as part of the output distribution | | ||
| | **EMMA** | Chain-of-thought rationales (text) | Primarily an accompanying rationale paired with predicted outputs; leverages MLLM capabilities and unified prompting | | ||
| | **Alpamayo-R1** | **Chain-of-Causation (CoC)** (decision-grounded causal links) | Intended to provide *structured* decision grounding and improved alignment between reasoning and action generation | |
Did they not demonstrate a performance improvement when introducing CoC? All you mention here are "structure enforcement" (teacher forcing) and "alignment" (frictionless reasoning->action)
| ## 1. Summary | ||
|
|
||
| EMMA is a **Gemini-powered end-to-end multimodal model** for autonomous driving that directly maps raw surround-view camera images into driving-specific outputs: **future ego trajectories**, 3D object detections, road graph elements, and scene understanding predictions. | ||
| All non-sensor inputs (navigation commands, ego history) and all outputs (trajectory waypoints, bounding boxes) are represented as **plain text**, unifying every task within a single language space and allowing task-specific behavior to be selected at inference time via prompt variation. |
Is there any discussion of when this type of input translation to text might fail? It seems like this opens the door for some information degradation if all inputs of all types are being transformed into a text representation.
| - the continuous waypoint output space is heterogeneous from the natural language space used for all other tasks | ||
|
|
||
| The paper argues modular pipelines with fixed symbolic interfaces are brittle at the long tail, and end-to-end imitation approaches trained on limited datasets fail to generalize. | ||
| EMMA's resolution: leverage the pre-training scale and world knowledge of Gemini, fine-tuned within a unified language output space. |
"Pre-training scale and world knowledge" of has become the norm for justifying the use of these large models. I say the same thing, but I do think it's worth investigating just how much we really gain from these models. For example, can we conclusively point to the "world knowledge of Gemini" and say that it has discovered long-tail scenarios for autonomous driving that we couldn't have predicted? I suppose this really isn't an actionable comment for this audit unless they explicitly disclose interesting findings.
| |---|---|---| | ||
| | Modular stacks + LLM augmentation | Specialized modules with LLMs for explainability/command | Fixed interfaces brittle to novel environments | | ||
| | End-to-end imitation planners | Direct sensor-to-trajectory mapping | Prone to ego-status shortcuts; limited generalization | | ||
| | **VLM-primary generalist (EMMA)** | MLLM as core compute; all tasks as VQA in unified language space | Closed-loop stability unvalidated | |
What would it look like to validate the closed-loop stability? I'm having trouble understanding whether this critique refers to the outputs to the car's controls or to the stability of the model's VQA performance. Specifically, what is the stability criterion for validation?
I think we should include an explanation of open-loop vs. closed-loop testing. Open-loop evaluation scores predictions against logged data without feeding them back into the system, while closed-loop evaluation lets the model's actions affect the simulator or vehicle, so the model must react to the consequences of its own decisions.
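A toy numeric illustration of why the distinction matters: the same small per-step error stays constant in open-loop evaluation but compounds in closed-loop rollout (the policy and numbers here are invented for illustration):

```python
def policy(state):
    """Toy policy that overshoots slightly: intended step is +1.0."""
    return state + 1.1

logged_states = [0.0, 1.0, 2.0, 3.0]

# Open-loop: each prediction starts from the *logged* state,
# so the per-step error never compounds.
open_loop_err = [abs(policy(s) - t) for s, t in zip(logged_states, logged_states[1:])]

# Closed-loop: the policy consumes its own previous output,
# so the same 0.1 per-step error accumulates.
state, closed_loop_err = logged_states[0], []
for target in logged_states[1:]:
    state = policy(state)
    closed_loop_err.append(abs(state - target))

assert all(abs(e - 0.1) < 1e-9 for e in open_loop_err)   # flat error
assert closed_loop_err[-1] > open_loop_err[-1]           # compounding error
```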
|
|
||
| **Experiment B — "Multi-Task Output Consistency Verification"** <br /> | ||
| Define a consistency oracle: if a bounding box is predicted at $(x, y, \theta)$, the trajectory should respect that agent's right-of-way; if the road graph predicts a merge in 40m, the trajectory should respond within that horizon. Run on the full nuScenes validation set. | ||
| - **Success**: Consistency failure rate < 5% for safety-critical classes, or inconsistencies used to build a consistency-regularized training objective. |
What is the <5% metric based on? Is this an industry standard of some kind? It feels a bit arbitrary.
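For the merge rule in the quoted experiment, the oracle could be implemented along these lines; the 40 m horizon comes from the text above, but the thresholds and function names are illustrative, not a proposed standard:

```python
def consistency_check(merge_distance_m, speeds_mps, dt=0.5):
    """Flag inconsistency if a merge is predicted within the horizon but
    the planned speed profile never responds (no deceleration) before
    reaching it. Names and thresholds are illustrative."""
    dist, responded = 0.0, False
    for v_prev, v_next in zip(speeds_mps, speeds_mps[1:]):
        dist += v_prev * dt
        if dist > merge_distance_m:
            break
        if v_next < v_prev:  # any deceleration counts as a response
            responded = True
    return responded

# Plan that slows before a merge 40 m ahead -> consistent
assert consistency_check(40.0, [15, 15, 12, 10, 10])
# Plan that never slows -> inconsistent
assert not consistency_check(40.0, [15, 15, 15, 15, 15])
```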
| ### 7.4 The Next 10,000 GPU-hour Experiment | ||
|
|
||
| **Experiment A — "Closed-Loop Stability via Sensor Simulation"** <br /> | ||
| Leverage neural scene reconstruction (NeRF/Gaussian Splatting) to generate novel camera observations from perturbed ego states, enabling reactive closed-loop evaluation without a full physical simulator. |
Why does it have to be NeRF/Gaussian Splats?
| - disorganized reasoning with weak causal links, | ||
| - overly long and ineffective reasoning. | ||
|
|
||
| So they use a stronger cloud model (e.g., GPT-4o) to generate concise reasoning conditioned on real actions + state + nav, manually filter errors, and distill via SFT. |
How many traces were generated and filtered?
|
|
||
| ## 1. Summary | ||
|
|
||
| Alpamayo-R1 (AR1) is a **vision–language–action (VLA)** based driving policy designed to improve **generalization** in safety-critical long-tail scenarios where pure imitation learning is brittle. |
Should "generalization" be defined here? It's a term we throw around a lot, but what does it really mean in the context of this paper?
|
|
||
| - **Vision input ($V$):** Surround-view camera images (6–8 cameras, 360° coverage). No LiDAR, radar, or HD maps. | ||
| - **Language input ($T$):** A high-level intent command (e.g., "go straight") and historical ego waypoints as plain-text BEV coordinate pairs: $T_{ego} = \{(x_t, y_t)\}_{t=-T_h}^{-1}$. | ||
| - **Output format:** All outputs are natural language text: trajectory waypoints as floating-point pairs, bounding boxes as `[x, y, z, l, w, h, θ, class]`, road graph polylines as semicolon-delimited waypoint sequences with `valid`/`invalid` tagging. |
Are outputs validated before downstream use?
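A sketch of the kind of validation gate one might put between the text output and downstream consumers; the regex, waypoint count, and plausibility threshold are our assumptions, not the paper's:

```python
import re

WAYPOINT_RE = re.compile(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)")

def parse_waypoints(text: str, expected: int, max_step_m: float = 5.0):
    """Parse '(x, y)' pairs from model text and reject outputs with the
    wrong count or physically implausible jumps between waypoints.
    Returns None on failure so downstream code can fall back."""
    pts = [(float(x), float(y)) for x, y in WAYPOINT_RE.findall(text)]
    if len(pts) != expected:
        return None
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 > max_step_m:
            return None
    return pts

assert parse_waypoints("(0.0, 0.0) (1.2, 0.1) (2.4, 0.3)", expected=3) is not None
assert parse_waypoints("(0.0, 0.0) (90.0, 0.0)", expected=2) is None  # 90 m jump
```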
| The paper’s central claim is that “reasoning” only helps driving if it is **(i) causally grounded, (ii) decision-aligned, and (iii) behavior-consistent**, and that you need both *data* (via a reasoning specific dataset) and *training* to make it possible. | ||
|
|
||
| AR1 couples two outputs: | ||
| a structured **Chain of Causation (CoC)** reasoning trace, and a **6.4s future ego trajectory** (controls/trajectory), so the model is trained to jointly predict the *action* and the *thought process* in one step. |
Though a reader might infer what it means to do "Chain of Causation" versus "Chain of Thought", I do think it's worth explicitly describing how they differ. Sorry if that's been done later on in the audit...
| Self-Supervision requires only future ego locations from driving logs — no human annotation, perception labels, or HD map labels. | ||
| This is a critical scalability property enabling training on Waymo's internal fleet data (24M sequences, 203K hours). | ||
|
|
||
| **Open-loop training, closed-loop gap:** Training is imitation learning on logged trajectories, so the model is never exposed to the distribution shift caused by its own compounding errors. |
nit: maybe add more info on closed-loop vs. open-loop
|
|
||
| Because high-quality driving “chain-of-thought” data is scarce, they use a multi-stage reasoning strategy: generate a small batch of reasoning traces using a stronger cloud model (e.g., GPT-4o), manually filter it, SFT warm-up on that reasoning data for stability, then run **RL on the full dataset**. | ||
|
|
||
| On MetaAD (120k 3-second clips; 110k train / 10k val), AlphaDrive reports **77.12 overall planning accuracy**, outperforming fine-tuned baselines including a larger Qwen2VL-7B result (61.44). |
what does 77.12 mean?
is it good enough to drive?
|
|
||
| On MetaAD (120k 3-second clips; 110k train / 10k val), AlphaDrive reports **77.12 overall planning accuracy**, outperforming fine-tuned baselines including a larger Qwen2VL-7B result (61.44). | ||
|
|
||
| They further claim **+25.52%** planning accuracy vs an SFT-trained model, and that with only 20% training data they outperform SFT by **35.31%**, emphasizing data-efficiency. |
Strange wording. Does this mean that reasoning improves over an already-SFT'd model by 25.52%? Where does the 35.31% number come from? Do they ablate SFT to 20% of the original fine-tuning data and then run their standard reasoning-RL improvement?
| The paper argues that naive “correctness reward” used in math/programming applications does not transfer cleanly to planning because there often isn't a single verifiable solution in driving; you need a reward that is robust early in training and resistant to shortcut solutions. | ||
|
|
||
| ### 2.2 Context | ||
| - **End-to-end driving models** can output trajectories/controls directly from sensors, but they are “black-box” systems that struggle with the long-tail of driving cases because they lack explicit reasoning. |
but... the reasoning systems are black box too? (evidence: alpamayo-r1 acknowledges this by saying they will improve their system's direct alignment between reasoning->action through CoC)
|
|
||
| AlphaDrive’s “architecture” is best described as a **training + inference pipeline**. | ||
|
|
||
| <img src="https://raw.githubusercontent.com/arpg/vla-foundations/audit/Zaaler-aritrach-autonomous-driving/content/textbook/audits/staging/figures/alphadrive/alphadrive_architecture.png" alt="AlphaDrive Architecture" width="900" /> |
this... is a very interesting example. Why does the "User" prompt exist at all? Isn't this a driver, not a conversational agent? Also, what about the pedestrian over-yielding that Answer 1 suggests? I would agree that the pedestrians might move, and if we aren't running at a high enough frequency, the predictions will misalign with the action timing.
This seems like a useful example to walk through as a deep dive, if they provide more details on it within the paper.
|
|
||
| ### 5.1 Dataset | ||
|
|
||
| They adopt **MetaAD** \[*NOTE: Could not find this dataset anywhere, neither could their reviewers at ICLR 2026*\] as the benchmark: |
wow, amazing that they have essentially no extra details on this. Was it human drivers? How did they "balance the distribution" over environments and planning actions.
|
|
||
| **Experiment B — “Multimodal plan selection” in closed-loop** | ||
| - Motivation: they claim multimodal planning emerges post-RL. | ||
| - Proposal: generate K plans, run a safety/rule feasibility filter, select, then evaluate closed-loop safety proxies (hard-brake rate, time-to-collision proxy, rule violations). |
thereby multiplying your model footprint by K?
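Worth noting the K samples would come from a single model, so the cost is K inference passes, not K sets of weights. A sketch of the filter-then-select step (the `clearance` field and threshold are illustrative stand-ins for a real rule/kinematics/collision check):

```python
def select_plan(plans, min_clearance_m=1.5):
    """Filter K sampled plans by a cheap feasibility check, then pick
    the one with the largest clearance. A real filter would also check
    traffic rules, kinematic limits, and collisions."""
    feasible = [p for p in plans if p["clearance"] >= min_clearance_m]
    if not feasible:
        return None  # caller falls back to a safe default (e.g., brake)
    return max(feasible, key=lambda p: p["clearance"])

plans = [
    {"id": "swerve", "clearance": 0.8},
    {"id": "slow",   "clearance": 2.4},
    {"id": "stop",   "clearance": 3.0},
]
assert select_plan(plans)["id"] == "stop"
assert select_plan([{"id": "swerve", "clearance": 0.8}]) is None
```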
|
|
||
| ### 7.5 Sign-Off Criteria | ||
|
|
||
| **Technical recommendation:** |
Research sure, but what about the data? Can't even see where to improve on this one without knowing what the complement of their data is
|
|
||
| ### 4.1 Self-Supervised Planning Objective | ||
|
|
||
| EMMA's core training objective is **autoregressive next-token prediction** over text-encoded trajectory waypoints, conditioned on visual and language tokens. |
text-encoded trajectory waypoints lol these researchers were just rushing out the first VLM for driving paper
|
|
||
| ### 4.2 Chain-of-Thought Reasoning Structure | ||
|
|
||
| EMMA predicts a four-component **driving rationale** before outputting waypoints, structured coarse-to-fine: |
hugely important, since this is also what their SFT dataset consisted of
| *Table 2: End-to-end motion planning on nuScenes. | ||
| EMMA and EMMA+ achieve state-of-the-art, outperforming all prior supervised and self-supervised methods.* | ||
|
|
||
| **WOMD** — EMMA+ (w/ CoT): **0.027 / 0.203 / 0.543m at 1s/3s/5s**, outperforming MotionLM (0.696m at 5s) and Wayformer (0.628m at 5s) by 13.5–22.5%, while those baselines consume LiDAR-derived agent boxes, HD maps, and traffic light states as input. |
but they did not consume images.
|
|
||
| 2. **Diversity reward producing “diverse but unsafe” plans** | ||
| Diversity is rewarded by penalizing frequency among sampled answers. | ||
| - Risk: incentivize disagreement without feasibility grounding, making downstream selection harder. |
I think some more detail about the consequences of the diversity reward would be appropriate, since it's a major component of the paper. What kind of behaviors does it incentivize / disincentivize? How might it affect training behavior?
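One reading of a frequency-penalizing diversity reward, sketched to make the failure mode visible (this is an illustration of the idea, not AlphaDrive's exact formula):

```python
from collections import Counter

def diversity_rewards(sampled_answers):
    """Each sample earns 1/count(answer), so repeated answers split the
    credit and rare answers are rewarded in full. Note there is no
    feasibility term: rarity alone is what gets rewarded."""
    counts = Counter(sampled_answers)
    return [1.0 / counts[a] for a in sampled_answers]

rewards = diversity_rewards(["slow", "slow", "stop", "swerve"])
assert rewards == [0.5, 0.5, 1.0, 1.0]
```

Here "swerve" earns the maximal reward purely for being rare, which is exactly the "diverse but unsafe" risk the quoted text flags.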
| ### 7.1 Load-Bearing Assumptions | ||
|
|
||
| 1. **Reward alignment assumption** | ||
| The 4-reward design (F1 accuracy + action weights + diversity + format) must correlate with “better driving,” not just better label matching. |
Relatedly, there's an assumption that the safety breakdown via action weighting actually captures safety. How exactly are different behaviors assigned different safety importance? (And if this is done at the action level - which seems to be the case - isn't that a very low fidelity approach to safety, given that their actions are pretty general?)
|
|
||
| 3. **Format-induced brittleness** | ||
| Format reward is hard-zero when tags fail. | ||
| - Risk: rare formatting drift can be catastrophic in a production parser unless you robustify extraction. |
I don't totally understand this point. Is the argument that the format reward should be more aggressive (or more informative), because a format failure in production would be catastrophic? What do you mean by "robustify[ing] extraction"?
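One interpretation of "robustifying extraction" is degrading gracefully instead of hard-failing when tags drift. A sketch (the `<answer>` tag name is illustrative, not AlphaDrive's actual schema):

```python
import re

def extract_answer(text: str):
    """Try strict <answer>...</answer> tags first, then fall back to
    lenient patterns instead of returning a hard failure."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fallback 1: tolerate an unclosed tag
    m = re.search(r"<answer>(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fallback 2: last non-empty line as a best-effort guess
    lines = [l for l in text.splitlines() if l.strip()]
    return lines[-1].strip() if lines else None

assert extract_answer("<answer>slow down</answer>") == "slow down"
assert extract_answer("<answer>slow down") == "slow down"   # drift-tolerant
assert extract_answer("I think:\nslow down") == "slow down"
```

During training the format reward can stay strict; the point is only that a production parser should not share the reward's hard-zero behavior.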
| 2) **Causal understanding gap** | ||
| Many “reasoning datasets” for AVs have explanations that are: | ||
| - vague (“be cautious”), | ||
| - not decision-committing (no explicit maneuver), |
Aren't there times when not making an explicit maneuver is desirable? The act of not making a maneuver is a decision in itself, right?
|
|
||
| The paper is addressing a concrete deployment failure pattern: | ||
|
|
||
| > A policy can look good in open-loop trajectory metrics, yet still fail in closed-loop, interactive, long-tail scenarios. |
An example of this would be really helpful.
|
|
||
| #### Outputs | ||
| - **CoC reasoning trace** | ||
| A structured explanation aligned to a **closed-set driving decision** that is anchored to an *explicit* decision category. |
What are the explicit decision categories?
| - **CoC reasoning trace** | ||
| A structured explanation aligned to a **closed-set driving decision** that is anchored to an *explicit* decision category. | ||
| - **Continuous future trajectory** | ||
| The model predicts a **future trajectory over a fixed horizon (6.4s)**. |
Was 6.4 seconds the right balance between usefulness and inference time latency?
|
|
||
| For post-training, the paper uses a **GRPO-style** (Group Relative Policy Optimization) approach: | ||
|
|
||
| - Sample multiple rollouts per prompt/context. |
How can we (as the reader) understand what's necessary for recreating this method? For example, how many rollouts are needed? It says score them with "reward models / critics" so how many of those are needed? What's the group size? How often is KL regularization applied (periodically or every iteration)? Maybe these details are coming later in the audit...
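For reference, the group-relative part of GRPO is just normalizing each rollout's reward against its group's statistics, which removes the need for a learned value function. Group size, critic count, and KL schedule are exactly the unreported details asked about above; this sketch shows only the advantage computation:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize each rollout's reward by the group mean and std; the
    resulting advantages are zero-mean within the group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
assert abs(sum(adv)) < 1e-6          # zero-mean within the group
assert adv[0] > 0 and adv[1] < 0     # above/below group average
```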
| 3. Apply rule-based matching across both axes. | ||
|
|
||
| The reward is assigned as: | ||
| - $r_\text{consistency} = 1$ if the reasoning-implied meta-actions match the trajectory-derived meta-actions **for both longitudinal and lateral behavior**, |
What determines whether these match? Is it direct text comparison?
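Presumably the comparison is over a small set of meta-action labels rather than raw text. A sketch of what rule-based matching across both axes could look like; the thresholds and label names are illustrative, not the paper's:

```python
def trajectory_meta_actions(speeds, lateral_offsets, tol=0.2):
    """Derive (longitudinal, lateral) meta-actions from a trajectory
    with simple threshold rules; thresholds here are illustrative."""
    dv = speeds[-1] - speeds[0]
    lon = "decelerate" if dv < -tol else "accelerate" if dv > tol else "maintain"
    dy = lateral_offsets[-1] - lateral_offsets[0]
    lat = "left" if dy > tol else "right" if dy < -tol else "straight"
    return lon, lat

def consistency_reward(reasoning_actions, speeds, lateral_offsets):
    """1 only if BOTH axes match: a categorical comparison, not a raw
    text comparison of the reasoning trace."""
    return 1 if reasoning_actions == trajectory_meta_actions(speeds, lateral_offsets) else 0

# Reasoning implies "decelerate, straight"; trajectory agrees -> reward 1
assert consistency_reward(("decelerate", "straight"), [10, 8, 6], [0.0, 0.05, 0.1]) == 1
# Trajectory actually drifts left -> reward 0
assert consistency_reward(("decelerate", "straight"), [10, 8, 6], [0.0, 0.5, 1.0]) == 0
```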
| 3) **Closed-set decision taxonomy is expressive enough** | ||
| CoC enforces decisions from a predefined set; AR1 assumes this is enough to capture the key maneuver choices relevant to long-tail safety. | ||
|
|
||
| 4) **Diffusion decoding produces plans that are controller-compatible** |
Is this an assumption or do they enforce controller-compatibility on the outputs to make them stable?
Draft of Autonomous Driving Audit including discussion of the following three papers: