71 changes: 71 additions & 0 deletions content/course/submissions/scratch-1/mel-krusniak.mdx
@@ -0,0 +1,71 @@
---
title: "Scratch-1 Submission: Mel Krusniak"
student: "Mel Krusniak"
date: "2026-02-03"
---

# Scratch-1: The Transformer Backbone

## BLUF: Required Components

### Loss Curve

Training for 40 epochs provided the following loss curve:

![Training Loss](./images/loss_curve_mel_krusniak.png)

The model converged after a total of 5640 gradient steps, reaching a final training loss of 1.9982.
(There were 141 batches per epoch; I used a batch size of 64, since VRAM was not a limiting factor.)

Rarely have I seen a loss curve so smooth and ideal... smooth enough that I was somewhat suspicious.
But I held out 1000 trajectories as test data and achieved a test loss of 1.9885 over them, comparable to the final training loss.

### Attention Visualization

I visualized the attention patterns of the last causal self-attention block, using an arbitrary 20-step trajectory:

![Attention Maps](./images/attention_maps_mel_krusniak.png)

The attention patterns suggest that the most recent actions are nearly always the most salient ones.
Unfortunately, without intuition for the meaning of these action tokens, it is hard to say why particular tokens are or are not attended to.
However, we do see desirable signs of deeper temporal dependencies for some tokens (e.g., the \[85\] action at position 7).

### Removing the Causal Mask

If the causal mask is removed, we are met with a very different loss curve:

![Training Loss w/o Causal Masking](./images/causal_masking_mel_krusniak.png)

As suggested, the model "sees the future" to cheat: without causal masking, there is nonzero information flow between the tokens at positions $\geq t$ and the prediction for position $t$.
Causal masking allows us to predict (and train on) the "next" token given only the past tokens, at every position of the input simultaneously.
This is substantially faster than proceeding step by step.
When the mask is removed, the model instead predicts the "next" token given _all_ input tokens, past and future, which is clearly incorrect.
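The masking mechanism described above can be sketched in a few lines. This is a minimal, dependency-free illustration (the score matrix is a made-up toy stand-in for $QK^\intercal/\sqrt{d}$, not values from my model): future positions are set to $-\infty$ before the row-wise softmax, so their weights come out exactly zero.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # exp(-inf) == 0.0, so masked entries vanish
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask out future positions (j > i) before the row-wise softmax."""
    T = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        for i in range(T)
    ]
    return [softmax(row) for row in masked]

# Toy 4x4 score matrix for a 4-token sequence (hypothetical values).
scores = [[0.5, 1.0, -0.3, 0.2],
          [0.1, 0.4, 0.9, -0.5],
          [0.7, 0.2, 0.3, 0.6],
          [0.0, 0.8, -0.1, 0.4]]
weights = causal_attention_weights(scores)

# Every weight above the diagonal is exactly zero: position t never
# attends to positions > t, so no information flows from the future.
assert all(weights[i][j] == 0.0 for i in range(4) for j in range(4) if j > i)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
```

Removing the mask amounts to skipping the `masked` step, which is exactly what lets each position peek at its own target.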

## Mastery Challenges

### KV Caching

I implemented KV caching in a very basic way: each CausalSelfAttention block stores and uses its KV tensors if manually flagged to do so.
Generally we want this flag to be true when generating autoregressively, but not during training.

My implementation of KV caching is not particularly fast, but it still yields a marginal speed boost.
As a simple ablation, I timed autoregressive generation of 10 steps at a batch size of 64 (see `test_inference_time(...)`).
Without caching, this took on average (over steps and trials) **0.0454 seconds**, measured with `time.perf_counter()`.
With caching, it took about **0.0435 seconds**.
I suspect that the savings would be much more substantial when used for more steps with a longer sequence length.
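The idea behind the cache can be shown with a tiny single-head sketch (plain Python, hypothetical 2-d projections standing in for the real $W_q x$, $W_k x$, $W_v x$, not my actual implementation): during decoding, each step appends one new key/value pair and reuses the stored ones, and the result matches recomputing attention over the full prefix.

```python
import math

def attend(q, Ks, Vs):
    """Single-query attention over a list of cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    # Weighted sum of the value vectors.
    return [sum(wi * v[j] for wi, v in zip(w, Vs)) / z for j in range(d)]

# Hypothetical per-token projections for a 3-step trajectory.
qs = [[0.2, 0.1], [0.5, -0.3], [0.0, 0.4]]
ks = [[0.3, 0.2], [-0.1, 0.6], [0.4, 0.1]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Cached decoding: append each new K/V once, reuse the growing cache.
K_cache, V_cache, cached_out = [], [], []
for t in range(3):
    K_cache.append(ks[t])
    V_cache.append(vs[t])
    cached_out.append(attend(qs[t], K_cache, V_cache))

# Reference: recompute attention over the full prefix at every step.
full_out = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(3)]
assert all(abs(a - b) < 1e-12
           for co, fo in zip(cached_out, full_out)
           for a, b in zip(co, fo))
```

The savings come from skipping the recomputation of `ks[:t]` and `vs[:t]` at every step; with only 10 steps and a short sequence, that work is small, which is consistent with the marginal speedup I measured.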

### Sinusoidal positional encodings

Much of the RoPE scaffolding already present can be applied to the original "Attention Is All You Need"-style sinusoidal absolute encodings.
Rather than introduce unnecessary bloat, I reused that code to test this alternative (see the commented code marked `ALTERNATIVE` in the RoPE implementation).
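For reference, the encodings in question follow the original paper's formula, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$. A minimal generator (illustrative only; names and dimensions are not from my code):

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Absolute sinusoidal encodings from "Attention Is All You Need":
    even dims get sin(pos / 10000**(2i/d)), odd dims the matching cos."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):  # i counts feature pairs, stepping by 2
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        pe.append(row[:d_model])
    return pe

pe = sinusoidal_encoding(4, 8)
# Position 0 is always [sin(0), cos(0), ...] = [0, 1, 0, 1, ...].
assert pe[0] == [0.0, 1.0] * 4
# All entries are bounded in [-1, 1], so they can be added to embeddings directly.
assert all(-1.0 <= v <= 1.0 for row in pe for v in row)
```

Unlike RoPE, these are added to the token embeddings once at the input rather than rotating Q and K inside every attention block.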

Using this, I performed a simple ablation.
However, plotting the loss curves, I did not notice much of a difference:

![Training Loss w/ Sinusoidal Positions](./images/position_embeddings_mel_krusniak.png)

The final test loss for the sinusoidal encodings was **2.0044**, only very slightly worse than RoPE.
The premise of RoPE points out an issue with this approach: ideally, a positional encoding is such that $Q \cdot K^\intercal$ is a function only of the embeddings $x_m$ and $x_n$ and the _relative_ position $m-n$.
(That is, the absolute values of $m$ and $n$ should not matter, only their difference.)
Barring a fully rigorous derivation (which would essentially summarize Section 3.2 of the RoPE paper anyway), suffice it to say that the RoPE embeddings are designed to satisfy this property.
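The relative-position property is easy to check numerically in the 2-D case, since RoPE applies a plain rotation to each feature pair (toy vectors and angle below are hypothetical): rotating the query by $m\theta$ and the key by $n\theta$ gives a dot product that depends only on the offset $m-n$.

```python
import math

def rotate(vec, angle):
    """2-D rotation, the building block RoPE applies to each feature pair."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1
q, k = (0.7, -0.2), (0.3, 0.9)  # hypothetical query/key pair

def rope_score(m, n):
    """Attention score between a query at position m and a key at position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))

# The score depends only on m - n, not on m and n individually:
# R(m*theta)^T R(n*theta) = R((n - m)*theta).
assert abs(rope_score(5, 2) - rope_score(10, 7)) < 1e-12
assert abs(rope_score(3, 3) - rope_score(50, 50)) < 1e-12
```

Sinusoidal absolute encodings have no such guarantee, which is the theoretical edge RoPE holds even when, as in my ablation, the empirical difference is small.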
65 changes: 65 additions & 0 deletions grading_reports/GRADING_REPORT.md
@@ -0,0 +1,65 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @krusnim
**PR:** #36
**Branch:** `scratch-1-melkrusniak`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ✅ Code Quality

Your code imports and runs cleanly. Nice! ✨

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/mel-krusniak.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- RoPE vs Sinusoidal ablation study

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
94 changes: 94 additions & 0 deletions grading_reports/PR_34_private.md
@@ -0,0 +1,94 @@
# PRIVATE GRADING REPORT - PR #34

**Student:** jt7347
**Branch:** scratch-1-jimmy
**Graded:** 2026-02-16 20:16:28

---

## AUTOMATED SCORES (67/70 pts)

| Component | Score | Max | Details |
|-----------|-------|-----|---------|
| Causal Attention | 15 | 15 | ✅ Passed |
| RMSNorm | 10 | 10 | ✅ Passed |
| Training | 10 | 10 | ✅ Passed |
| RoPE | 15 | 15 | ✅ Passed |
| Model Architecture | 10 | 10 | ✅ Passed |
| Code Quality | 7 | 10 | ❌ Failed |

---

## MANUAL REVIEW REQUIRED

### Documentation (0-30 pts): _____ / 30

Report files found:
- content/course/submissions/scratch-1/jimmy.mdx
- README.md

Check for:
- [ ] Loss curve visualization (clear, labeled)
- [ ] Attention map visualization (interpretable)
- [ ] The Audit: causal mask removal analysis

### Mastery Components (0-10 pts): _____ / 10

Features detected:
- KV-Caching implementation

---

## FINAL SCORE

**Automated:** 67/70
**Documentation:** _____ / 30
**Mastery:** _____ / 10
**Adjustments:** _____ (e.g., -1 for programmatic fixes like missing dependencies)

**TOTAL:** _____ / 100

---

## TEST DETAILS


### Code Quality

- **test_import_success**: ✅ PASS (4/4 pts)
- ✅ Code imports successfully.
- **test_no_syntax_errors**: ✅ PASS (3/3 pts)
- ✅ Test passed.
- **test_no_todos_left**: ❌ FAIL (0/3 pts)
- ❌ Test failed.

### RMSNorm

- **test_rmsnorm_implementation**: ✅ PASS (5/5 pts)
- ✅ RMSNorm implemented correctly with proper normalization and learnable scale.
- **test_rmsnorm_numerical_stability**: ✅ PASS (5/5 pts)
- ✅ Test passed.

### Causal Attention

- **test_causal_mask_leakage**: ✅ PASS (8/8 pts)
- ✅ Perfect! Your causal mask correctly prevents future token leakage.
- **test_causal_attention_shape_preservation**: ✅ PASS (7/7 pts)
- ✅ Test passed.

### RoPE

- **test_rope_embeddings**: ✅ PASS (15/15 pts)
- ✅ RoPE correctly applied to Q and K tensors.

### Training

- **test_training_convergence**: ✅ PASS (10/10 pts)
- ✅ Excellent! Your model trains successfully and loss converges.

### Model

- **test_model_forward_pass**: ✅ PASS (5/5 pts)
- ✅ Model forward pass works end-to-end with correct output shapes.
- **test_model_has_trainable_parameters**: ✅ PASS (5/5 pts)
- ✅ Model has the expected number of trainable parameters.
69 changes: 69 additions & 0 deletions grading_reports/PR_34_public.md
@@ -0,0 +1,69 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @jt7347
**PR:** #34
**Branch:** `scratch-1-jimmy`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ❌ Code Quality

✅ Code imports successfully.

✅ Test passed.

❌ Test failed.

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/jimmy.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- KV-Caching implementation

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
95 changes: 95 additions & 0 deletions grading_reports/PR_35_private.md
@@ -0,0 +1,95 @@
# PRIVATE GRADING REPORT - PR #35

**Student:** Tr0612
**Branch:** scratch-1-thanushraam
**Graded:** 2026-02-16 20:16:28

---

## AUTOMATED SCORES (70/70 pts)

| Component | Score | Max | Details |
|-----------|-------|-----|---------|
| Causal Attention | 15 | 15 | ✅ Passed |
| RMSNorm | 10 | 10 | ✅ Passed |
| Training | 10 | 10 | ✅ Passed |
| RoPE | 15 | 15 | ✅ Passed |
| Model Architecture | 10 | 10 | ✅ Passed |
| Code Quality | 10 | 10 | ✅ Passed |

---

## MANUAL REVIEW REQUIRED

### Documentation (0-30 pts): _____ / 30

Report files found:
- content/course/submissions/scratch-1/thanushraam.mdx
- README.md

Check for:
- [ ] Loss curve visualization (clear, labeled)
- [ ] Attention map visualization (interpretable)
- [ ] The Audit: causal mask removal analysis

### Mastery Components (0-10 pts): _____ / 10

Features detected:
- KV-Caching implementation
- RoPE vs Sinusoidal ablation study

---

## FINAL SCORE

**Automated:** 70/70
**Documentation:** _____ / 30
**Mastery:** _____ / 10
**Adjustments:** _____ (e.g., -1 for programmatic fixes like missing dependencies)

**TOTAL:** _____ / 100

---

## TEST DETAILS


### Code Quality

- **test_import_success**: ✅ PASS (4/4 pts)
- ✅ Code imports successfully.
- **test_no_syntax_errors**: ✅ PASS (3/3 pts)
- ✅ Test passed.
- **test_no_todos_left**: ✅ PASS (3/3 pts)
- ✅ Test passed.

### RMSNorm

- **test_rmsnorm_implementation**: ✅ PASS (5/5 pts)
- ✅ RMSNorm implemented correctly with proper normalization and learnable scale.
- **test_rmsnorm_numerical_stability**: ✅ PASS (5/5 pts)
- ✅ Test passed.

### Causal Attention

- **test_causal_mask_leakage**: ✅ PASS (8/8 pts)
- ✅ Perfect! Your causal mask correctly prevents future token leakage.
- **test_causal_attention_shape_preservation**: ✅ PASS (7/7 pts)
- ✅ Test passed.

### RoPE

- **test_rope_embeddings**: ✅ PASS (15/15 pts)
- ✅ RoPE correctly applied to Q and K tensors.

### Training

- **test_training_convergence**: ✅ PASS (10/10 pts)
- ✅ Excellent! Your model trains successfully and loss converges.

### Model

- **test_model_forward_pass**: ✅ PASS (5/5 pts)
- ✅ Model forward pass works end-to-end with correct output shapes.
- **test_model_has_trainable_parameters**: ✅ PASS (5/5 pts)
- ✅ Model has the expected number of trainable parameters.