Here's a concrete numerical example with actual matrix values for Algorithm 2:
- Batch size: 1
- Future tokens: 1 frame × 5 positions ($N=5$)
- Vocabulary: IDs {0,1,2,3,4,5,6,7}, Mask ID = 8
- Iterations: $K=4$ (steps $k=3,2,1,0$)
- Schedule: $\gamma(t) = \cos(\frac{\pi t}{2})$
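The keep counts quoted at each iteration of this walkthrough follow from the schedule. A minimal sketch, assuming the count is $\lceil \gamma(k/K) \cdot N \rceil$ as the iteration headings suggest (the helper name `keep_count` is made up for illustration):

```python
import math

def keep_count(k: int, K: int, N: int) -> int:
    """Tokens left unmasked after step k: ceil(cos(pi * k / (2K)) * N)."""
    return math.ceil(math.cos(math.pi * k / (2 * K)) * N)

K, N = 4, 5
# Steps run from k = K-1 down to k = 0.
counts = [keep_count(k, K, N) for k in range(K - 1, -1, -1)]
print(counts)  # [2, 4, 5, 5]
```

The count reaches $N$ one step before the end here, so the last step only re-predicts tokens that are already fixed.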
All tokens start masked:
x^4 = [8, 8, 8, 8, 8] # [MASK, MASK, MASK, MASK, MASK]
is_masked = [True, True, True, True, True]
Iteration 1: $k=3$ (Keep $M = \lceil\cos(\frac{3\pi}{8}) \times 5\rceil = \lceil0.383 \times 5\rceil = 2$ tokens)
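Model predicts logits for all 5 positions (random example values; the probabilities are rough, e.g. an exact softmax of the position 0 and position 2 logits puts about 0.44 on the top token):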
| Position | Logits [0-7] | Softmax Prob | Sampled Token |
|---|---|---|---|
| 0 | [2.0, 1.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] | [0.60, 0.22, ...] | 0 |
| 1 | [0.1, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1] | [0.05, 0.05, 0.75, ...] | 2 |
| 2 | [1.0, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] | [0.27, 0.73, ...] | 1 |
| 3 | [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1] | [0.02, 0.02, 0.02, 0.88, ...] | 3 |
| 4 | [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] | [0.125, 0.125, ...] | 6 |
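For readers who want to reproduce the per-position step, here is a small sketch of softmax followed by categorical sampling, using only the standard library. The seed and the helper name `softmax` are arbitrary choices, and exact softmax outputs need not match the rounded, illustrative probabilities in the table.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits for position 3, copied from the table.
logits_pos3 = [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1]
probs = softmax(logits_pos3)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
token = rng.choices(range(8), weights=probs, k=1)[0]
log_prob = math.log(probs[token])  # feeds into the confidence score

print(round(probs[3], 3), token, round(log_prob, 3))
```

Position 4, with uniform logits, gives 0.125 for every token, which is why its sampled token carries the lowest log probability in the walkthrough.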
x_tilde_0 = [0, 2, 1, 3, 6] # Sampled values
Confidence = Log probability + Gumbel(0,1) × (3/4):
| Pos | Token | Log Prob | Gumbel Noise | Weighted Noise (× 3/4) | Confidence |
|---|---|---|---|---|---|
| 0 | 0 | -0.51 | 0.2 | 0.15 | -0.36 |
| 1 | 2 | -0.29 | -0.1 | -0.075 | -0.365 |
| 2 | 1 | -0.31 | 0.5 | 0.375 | +0.065 |
| 3 | 3 | -0.13 | 0.8 | 0.6 | +0.47 |
| 4 | 6 | -2.08 | -0.5 | -0.375 | -2.455 |
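The confidence column can be recomputed from the sampled-token probabilities and the noise values. The sketch below assumes the noise weight is $k/K$ (3/4 at this step) and takes the probabilities and noise from the tables. `sample_gumbel` shows one standard way to draw Gumbel(0,1) noise, although the walkthrough itself uses fixed example values.

```python
import math
import random

def sample_gumbel(rng: random.Random) -> float:
    """Draw Gumbel(0,1) noise via the inverse CDF; clamp u away from 0."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))

# Values copied from the walkthrough tables (not freshly sampled).
probs_of_sampled = [0.60, 0.75, 0.73, 0.88, 0.125]
gumbel = [0.2, -0.1, 0.5, 0.8, -0.5]
k, K = 3, 4

confidence = [math.log(p) + g * (k / K) for p, g in zip(probs_of_sampled, gumbel)]
order = sorted(range(5), key=lambda i: confidence[i], reverse=True)

print([round(c, 2) for c in confidence])
print(order)  # [3, 2, 0, 1, 4]
```

Because the weight shrinks as $k$ decreases, the noise matters most in early steps and the ranking approaches a pure log-probability ordering near the end.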
All positions are currently masked (is_masked = [T,T,T,T,T]), so no confidences are overridden.
Sort by confidence: Pos 3 (0.47) > Pos 2 (0.065) > Pos 0 (-0.36) > Pos 1 (-0.365) > Pos 4 (-2.455)
Keep top 2: Positions 3, 2
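# After remasking
x^3 = [8, 8, 1, 3, 8]
# positions 0, 1, 4: remasked; positions 2, 3: kept
is_masked = [True, True, False, False, True]
Result after iteration 1: positions 2 and 3 hold tokens 1 and 3; positions 0, 1, and 4 are masked again.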
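The keep-and-remask step of iteration 1 can be sketched as a top-$M$ selection. The function name `remask` is hypothetical, and the inputs are the sampled tokens and confidences from the tables.

```python
MASK_ID = 8

def remask(sampled, confidence, keep):
    """Keep the `keep` most confident positions; set the rest to MASK_ID."""
    order = sorted(range(len(sampled)), key=lambda i: confidence[i], reverse=True)
    kept = set(order[:keep])
    tokens = [t if i in kept else MASK_ID for i, t in enumerate(sampled)]
    is_masked = [i not in kept for i in range(len(sampled))]
    return tokens, is_masked

x3, masked3 = remask(
    sampled=[0, 2, 1, 3, 6],
    confidence=[-0.36, -0.365, 0.065, 0.47, -2.455],
    keep=2,
)
print(x3)       # [8, 8, 1, 3, 8]
print(masked3)  # [True, True, False, False, True]
```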
Iteration 2: $k=2$ (Keep $M = \lceil\cos(\frac{\pi}{4}) \times 5\rceil = \lceil0.707 \times 5\rceil = 4$ tokens)
Model now sees [8,8,1,3,8] and predicts again (new random samples):
| Pos | Sampled |
|---|---|
| 0 | 5 |
| 1 | 2 |
| 2 | 1 (same as before, model is confident) |
| 3 | 3 (same as before) |
| 4 | 7 |
x_tilde_0 = [5, 2, 1, 3, 7]
| Pos | Log Prob | Gumbel | Weight (2/4=0.5) | Raw Conf |
|---|---|---|---|---|
| 0 | -0.8 | 0.3 | 0.15 | -0.65 |
| 1 | -0.4 | 0.1 | 0.05 | -0.35 |
| 2 | -0.1 | 0.9 | 0.45 | +0.35 |
| 3 | -0.2 | 0.4 | 0.2 | 0.0 |
| 4 | -1.5 | -0.2 | -0.1 | -1.6 |
Positions 2 and 3 were unmasked in $x^3$, so Step 5 sets their confidence to +∞:
| Pos | Was Masked? | Confidence After Step 5 |
|---|---|---|
| 0 | Yes | -0.65 |
| 1 | Yes | -0.35 |
| 2 | No | +∞ |
| 3 | No | +∞ |
| 4 | Yes | -1.6 |
Sorted: Pos 2 (∞), Pos 3 (∞), Pos 1 (-0.35), Pos 0 (-0.65), Pos 4 (-1.6)
Keep top 4: Positions 2, 3, 1, 0
# After remasking
x^2 = [5, 2, 1, 3, 8]
# position 4: remasked (not in top 4)
is_masked = [False, False, False, False, True]
Key observation: Positions 2 and 3 kept their values (1 and 3) from the previous step, even though the model sampled new values (1 and 3 again) and calculated new confidences. The +∞ confidence assigned in Step 5 guarantees they rank first and cannot be remasked.
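The lock described in the key observation amounts to overwriting the confidence of every already-unmasked position before ranking. A minimal sketch using the iteration 2 numbers (the helper name `lock_unmasked` is an assumption for illustration):

```python
import math

def lock_unmasked(confidence, is_masked):
    """Give already-unmasked positions +inf so they always rank first."""
    return [c if m else math.inf for c, m in zip(confidence, is_masked)]

raw = [-0.65, -0.35, 0.35, 0.0, -1.6]          # raw confidences at k=2
was_masked = [True, True, False, False, True]  # mask state entering k=2

locked = lock_unmasked(raw, was_masked)
# Python's sort is stable, so the two +inf entries keep their index order.
order = sorted(range(5), key=lambda i: locked[i], reverse=True)
top4 = order[:4]

print(order)  # [2, 3, 1, 0, 4]
print(top4)   # [2, 3, 1, 0]
```

Only position 4 falls outside the top 4, so it is the one position remasked at this step.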
Iteration 3: $k=1$ (Keep $M = \lceil\cos(\frac{\pi}{8}) \times 5\rceil = \lceil0.924 \times 5\rceil = 5$ tokens)
Model sees [5,2,1,3,8]:
| Pos | Sampled |
|---|---|
| 0 | 5 |
| 1 | 2 |
| 2 | 4 (model changed its mind, but...) |
| 3 | 3 |
| 4 | 0 |
Raw confidence calculation, then set positions 0,1,2,3 to +∞:
| Pos | Confidence | After Lock |
|---|---|---|
| 0 | -0.2 | +∞ |
| 1 | -0.5 | +∞ |
| 2 | -0.1 | +∞ |
| 3 | -0.3 | +∞ |
| 4 | -0.8 | -0.8 |
Top 5 of 5 positions = all positions.
x^1 = [5, 2, 1, 3, 0]
is_masked = [False, False, False, False, False]
Note: Position 2 stayed as 1 (from $x^2$), not the newly sampled 4, because already-unmasked positions keep their tokens.
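The behaviour in the note, where a fresh sample is ignored at positions that are already unmasked, can be sketched as an element-wise merge. The name `merge_tokens` is hypothetical, and the values are those of iteration 3.

```python
def merge_tokens(previous, sampled, is_masked):
    """Take the new sample only where the position is still masked."""
    return [s if m else p for p, s, m in zip(previous, sampled, is_masked)]

x2 = [5, 2, 1, 3, 8]                        # state entering k=1 (8 = mask)
sampled = [5, 2, 4, 3, 0]                   # model's new samples
masked2 = [False, False, False, False, True]

x1 = merge_tokens(x2, sampled, masked2)
print(x1)  # [5, 2, 1, 3, 0]
```

Position 2 keeps token 1 even though the model proposed 4, and position 4 takes its first real token, 0.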
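Iteration 4: $k=0$ (Keep $M = \lceil\cos(0) \times 5\rceil = \lceil1 \times 5\rceil = 5$ tokens)
All positions are already unmasked. Model predicts final refinement: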
| Pos | Final Sampled |
|---|---|
| 0 | 5 |
| 1 | 2 |
| 2 | 1 |
| 3 | 3 |
| 4 | 0 |
Since no position is masked, every token is kept:
x^0 = [5, 2, 1, 3, 0]
# Evolution over iterations
x^4 = [8, 8, 8, 8, 8] # Initial (all mask)
x^3 = [8, 8, 1, 3, 8] # Step 1: 2 unmasked (positions 2,3)
x^2 = [5, 2, 1, 3, 8] # Step 2: 4 unmasked (added 0,1), kept 2,3 locked
x^1 = [5, 2, 1, 3, 0] # Step 3: All unmasked (added 4), kept 0,1,2,3 locked
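x^0 = [5, 2, 1, 3, 0] # Step 4: Final refinement (no masks left)
Without Step 5 (the bug): At $k=2$, positions 2 and 3 would be ranked by their raw confidences (+0.35 and 0.0) rather than +∞, so an already-unmasked token could be remasked whenever enough masked positions scored higher.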
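The whole evolution can be replayed end to end by scripting the model outputs from the tables. This is a sketch of the loop as described in this walkthrough, not a reference implementation of Algorithm 2: the `script` dictionary stands in for the model, and the $k=0$ confidences are placeholders because the example does not list them.

```python
import math

MASK_ID, K, N = 8, 4, 5

# (sampled tokens, raw confidences) per step, copied from the tables.
# The k=0 confidences are not given in the example; zeros are a placeholder.
script = {
    3: ([0, 2, 1, 3, 6], [-0.36, -0.365, 0.065, 0.47, -2.455]),
    2: ([5, 2, 1, 3, 7], [-0.65, -0.35, 0.35, 0.0, -1.6]),
    1: ([5, 2, 4, 3, 0], [-0.2, -0.5, -0.1, -0.3, -0.8]),
    0: ([5, 2, 1, 3, 0], [0.0] * N),
}

def decode(lock_unmasked=True):
    x = [MASK_ID] * N
    history = [list(x)]
    for k in range(K - 1, -1, -1):
        sampled, conf = script[k]
        is_masked = [t == MASK_ID for t in x]
        # Unmasked positions keep their token; masked ones take the new sample.
        proposal = [s if m else t for t, s, m in zip(x, sampled, is_masked)]
        if lock_unmasked:  # "Step 5": already-unmasked positions get +inf
            conf = [c if m else math.inf for c, m in zip(conf, is_masked)]
        keep = math.ceil(math.cos(math.pi * k / (2 * K)) * N)
        order = sorted(range(N), key=lambda i: conf[i], reverse=True)
        kept = set(order[:keep])
        x = [t if i in kept else MASK_ID for i, t in enumerate(proposal)]
        history.append(list(x))
    return history

for step, state in zip(["x^4", "x^3", "x^2", "x^1", "x^0"], decode()):
    print(step, state)
```

With these particular numbers, `decode(lock_unmasked=False)` yields the same trajectory, because positions 2 and 3 still rank inside the top 4 on raw confidence at $k=2$. The lock only changes the outcome when an unmasked position's raw confidence falls below the cutoff.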