71 changes: 71 additions & 0 deletions content/course/submissions/scratch-1/mel-krusniak.mdx
@@ -0,0 +1,71 @@
---
title: "Scratch-1 Submission: Mel Krusniak"
student: "Mel Krusniak"
date: "2026-02-03"
---

# Scratch-1: The Transformer Backbone

## BLUF: Required Components

### Loss Curve

Training for 40 epochs provided the following loss curve:

![Training Loss](./images/loss_curve_mel_krusniak.png)

The model converged after a total of 5640 gradient steps, reaching a final training loss of 1.9982.
(There were 141 batches per epoch; I used a batch size of 64, since VRAM was not a limiting factor.)

Rarely have I seen a loss curve so smooth and ideal... smooth enough that I was somewhat suspicious.
But I held out 1000 trajectories as test data and achieved a test loss of 1.9885 over them, comparable to the final training loss.

### Attention Visualization

I visualized the attention patterns of the last causal self-attention block, using an arbitrary 20-step trajectory:

![Attention Maps](./images/attention_maps_mel_krusniak.png)

The attention patterns suggest that the most recent actions are nearly always the most salient ones.
Unfortunately, without intuition for the meaning of these action tokens, it is hard to say why particular tokens are or are not attended to.
However, we do see desirable signs of deeper temporal dependencies for some tokens (e.g., the \[85\] action at position 7).

### Removing the Causal Mask

If the causal mask is removed, we are met with a very different loss curve:

![Training Loss w/o Causal Masking](./images/causal_masking_mel_krusniak.png)

As suggested, the model "sees the future" to cheat: without causal masking, there is nonzero information flow between the tokens at positions $\geq t$ and the prediction for position $t$.
Causal masking allows us to predict (and train on) the "next" token given only the past tokens, at every position of the input simultaneously.
This is substantially faster than proceeding step by step.
When the mask is removed, the model instead predicts the "next" token given _all_ input tokens, past and future, which is clearly incorrect.
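The masking mechanism described above can be sketched in a few lines. This is a minimal, dependency-free illustration (the score matrix is a made-up toy stand-in for $QK^\intercal/\sqrt{d}$, not values from my model): future positions are set to $-\infty$ before the row-wise softmax, so their weights come out exactly zero.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # exp(-inf) == 0.0, so masked entries vanish
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask out future positions (j > i) before the row-wise softmax."""
    T = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        for i in range(T)
    ]
    return [softmax(row) for row in masked]

# Toy 4x4 score matrix for a 4-token sequence (hypothetical values).
scores = [[0.5, 1.0, -0.3, 0.2],
          [0.1, 0.4, 0.9, -0.5],
          [0.7, 0.2, 0.3, 0.6],
          [0.0, 0.8, -0.1, 0.4]]
weights = causal_attention_weights(scores)

# Every weight above the diagonal is exactly zero: position t never
# attends to positions > t, so no information flows from the future.
assert all(weights[i][j] == 0.0 for i in range(4) for j in range(4) if j > i)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
```

Removing the mask amounts to skipping the `masked` step, which is exactly what lets each position peek at its own target.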

## Mastery Challenges

### KV Caching

I implemented KV caching in a very basic way: each CausalSelfAttention block stores and uses its KV tensors if manually flagged to do so.
Generally we want this flag to be true when generating autoregressively, but not during training.

My implementation of KV caching is not particularly fast, but it still yields a marginal speed boost.
As a simple ablation, I timed autoregressive generation of 10 steps at a batch size of 64 (see `test_inference_time(...)`).
Without caching, this took on average (over steps and trials) **0.0454 seconds**, measured with `time.perf_counter()`.
With caching, it took about **0.0435 seconds**.
I suspect that the savings would be much more substantial when used for more steps with a longer sequence length.
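The idea behind the cache can be shown with a tiny single-head sketch (plain Python, hypothetical 2-d projections standing in for the real $W_q x$, $W_k x$, $W_v x$, not my actual implementation): during decoding, each step appends one new key/value pair and reuses the stored ones, and the result matches recomputing attention over the full prefix.

```python
import math

def attend(q, Ks, Vs):
    """Single-query attention over a list of cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    # Weighted sum of the value vectors.
    return [sum(wi * v[j] for wi, v in zip(w, Vs)) / z for j in range(d)]

# Hypothetical per-token projections for a 3-step trajectory.
qs = [[0.2, 0.1], [0.5, -0.3], [0.0, 0.4]]
ks = [[0.3, 0.2], [-0.1, 0.6], [0.4, 0.1]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Cached decoding: append each new K/V once, reuse the growing cache.
K_cache, V_cache, cached_out = [], [], []
for t in range(3):
    K_cache.append(ks[t])
    V_cache.append(vs[t])
    cached_out.append(attend(qs[t], K_cache, V_cache))

# Reference: recompute attention over the full prefix at every step.
full_out = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(3)]
assert all(abs(a - b) < 1e-12
           for co, fo in zip(cached_out, full_out)
           for a, b in zip(co, fo))
```

The savings come from skipping the recomputation of `ks[:t]` and `vs[:t]` at every step; with only 10 steps and a short sequence, that work is small, which is consistent with the marginal speedup I measured.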

### Sinusoidal positional encodings

Much of the RoPE scaffolding already present can be applied to the original "Attention Is All You Need"-style sinusoidal absolute encodings.
Rather than introduce unnecessary bloat, I reused that code to test this alternative (see the commented code marked `ALTERNATIVE` in the RoPE implementation).
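For reference, the encodings in question follow the original paper's formula, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$. A minimal generator (illustrative only; names and dimensions are not from my code):

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Absolute sinusoidal encodings from "Attention Is All You Need":
    even dims get sin(pos / 10000**(2i/d)), odd dims the matching cos."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):  # i counts feature pairs, stepping by 2
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        pe.append(row[:d_model])
    return pe

pe = sinusoidal_encoding(4, 8)
# Position 0 is always [sin(0), cos(0), ...] = [0, 1, 0, 1, ...].
assert pe[0] == [0.0, 1.0] * 4
# All entries are bounded in [-1, 1], so they can be added to embeddings directly.
assert all(-1.0 <= v <= 1.0 for row in pe for v in row)
```

Unlike RoPE, these are added to the token embeddings once at the input rather than rotating Q and K inside every attention block.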

Using this, I performed a simple ablation.
However, plotting the loss curves, I did not notice much of a difference:

![Training Loss w/ Sinusoidal Positions](./images/position_embeddings_mel_krusniak.png)

The final test loss for the sinusoidal encodings was **2.0044**, only very slightly worse than RoPE.
The premise of RoPE points out an issue with this approach: ideally, a positional encoding is such that $Q \cdot K^\intercal$ is a function only of the embeddings $x_m$ and $x_n$ and the _relative_ position $m-n$.
(That is, the absolute values of $m$ and $n$ should not matter, only their difference.)
Barring a fully rigorous derivation (which would essentially summarize Section 3.2 of the RoPE paper anyway), suffice it to say that the RoPE embeddings are designed to satisfy this property.
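The relative-position property is easy to check numerically in the 2-D case, since RoPE applies a plain rotation to each feature pair (toy vectors and angle below are hypothetical): rotating the query by $m\theta$ and the key by $n\theta$ gives a dot product that depends only on the offset $m-n$.

```python
import math

def rotate(vec, angle):
    """2-D rotation, the building block RoPE applies to each feature pair."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1
q, k = (0.7, -0.2), (0.3, 0.9)  # hypothetical query/key pair

def rope_score(m, n):
    """Attention score between a query at position m and a key at position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))

# The score depends only on m - n, not on m and n individually:
# R(m*theta)^T R(n*theta) = R((n - m)*theta).
assert abs(rope_score(5, 2) - rope_score(10, 7)) < 1e-12
assert abs(rope_score(3, 3) - rope_score(50, 50)) < 1e-12
```

Sinusoidal absolute encodings have no such guarantee, which is the theoretical edge RoPE holds even when, as in my ablation, the empirical difference is small.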
65 changes: 65 additions & 0 deletions grading_reports/GRADING_REPORT.md
@@ -0,0 +1,65 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @krusnim
**PR:** #36
**Branch:** `scratch-1-melkrusniak`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ✅ Code Quality

Your code imports and runs cleanly. Nice! ✨

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/mel-krusniak.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- RoPE vs Sinusoidal ablation study

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
94 changes: 94 additions & 0 deletions grading_reports/PR_34_private.md
@@ -0,0 +1,94 @@
# PRIVATE GRADING REPORT - PR #34

**Student:** jt7347
**Branch:** scratch-1-jimmy
**Graded:** 2026-02-16 20:16:28

---

## AUTOMATED SCORES (67/70 pts)

| Component | Score | Max | Details |
|-----------|-------|-----|---------|
| Causal Attention | 15 | 15 | ✅ Passed |
| RMSNorm | 10 | 10 | ✅ Passed |
| Training | 10 | 10 | ✅ Passed |
| RoPE | 15 | 15 | ✅ Passed |
| Model Architecture | 10 | 10 | ✅ Passed |
| Code Quality | 7 | 10 | ❌ Failed |

---

## MANUAL REVIEW REQUIRED

### Documentation (0-30 pts): _____ / 30

Report files found:
- content/course/submissions/scratch-1/jimmy.mdx
- README.md

Check for:
- [ ] Loss curve visualization (clear, labeled)
- [ ] Attention map visualization (interpretable)
- [ ] The Audit: causal mask removal analysis

### Mastery Components (0-10 pts): _____ / 10

Features detected:
- KV-Caching implementation

---

## FINAL SCORE

**Automated:** 67/70
**Documentation:** _____ / 30
**Mastery:** _____ / 10
**Adjustments:** _____ (e.g., -1 for programmatic fixes like missing dependencies)

**TOTAL:** _____ / 100

---

## TEST DETAILS


### Code Quality

- **test_import_success**: ✅ PASS (4/4 pts)
- ✅ Code imports successfully.
- **test_no_syntax_errors**: ✅ PASS (3/3 pts)
- ✅ Test passed.
- **test_no_todos_left**: ❌ FAIL (0/3 pts)
- ❌ Test failed.

### RMSNorm

- **test_rmsnorm_implementation**: ✅ PASS (5/5 pts)
- ✅ RMSNorm implemented correctly with proper normalization and learnable scale.
- **test_rmsnorm_numerical_stability**: ✅ PASS (5/5 pts)
- ✅ Test passed.

### Causal Attention

- **test_causal_mask_leakage**: ✅ PASS (8/8 pts)
- ✅ Perfect! Your causal mask correctly prevents future token leakage.
- **test_causal_attention_shape_preservation**: ✅ PASS (7/7 pts)
- ✅ Test passed.

### RoPE

- **test_rope_embeddings**: ✅ PASS (15/15 pts)
- ✅ RoPE correctly applied to Q and K tensors.

### Training

- **test_training_convergence**: ✅ PASS (10/10 pts)
- ✅ Excellent! Your model trains successfully and loss converges.

### Model

- **test_model_forward_pass**: ✅ PASS (5/5 pts)
- ✅ Model forward pass works end-to-end with correct output shapes.
- **test_model_has_trainable_parameters**: ✅ PASS (5/5 pts)
- ✅ Model has the expected number of trainable parameters.
69 changes: 69 additions & 0 deletions grading_reports/PR_34_public.md
@@ -0,0 +1,69 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @jt7347
**PR:** #34
**Branch:** `scratch-1-jimmy`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ❌ Code Quality

✅ Code imports successfully.

✅ Test passed.

❌ Test failed.

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/jimmy.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- KV-Caching implementation

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
95 changes: 95 additions & 0 deletions grading_reports/PR_35_private.md
@@ -0,0 +1,95 @@
# PRIVATE GRADING REPORT - PR #35

**Student:** Tr0612
**Branch:** scratch-1-thanushraam
**Graded:** 2026-02-16 20:16:28

---

## AUTOMATED SCORES (70/70 pts)

| Component | Score | Max | Details |
|-----------|-------|-----|---------|
| Causal Attention | 15 | 15 | ✅ Passed |
| RMSNorm | 10 | 10 | ✅ Passed |
| Training | 10 | 10 | ✅ Passed |
| RoPE | 15 | 15 | ✅ Passed |
| Model Architecture | 10 | 10 | ✅ Passed |
| Code Quality | 10 | 10 | ✅ Passed |

---

## MANUAL REVIEW REQUIRED

### Documentation (0-30 pts): _____ / 30

Report files found:
- content/course/submissions/scratch-1/thanushraam.mdx
- README.md

Check for:
- [ ] Loss curve visualization (clear, labeled)
- [ ] Attention map visualization (interpretable)
- [ ] The Audit: causal mask removal analysis

### Mastery Components (0-10 pts): _____ / 10

Features detected:
- KV-Caching implementation
- RoPE vs Sinusoidal ablation study

---

## FINAL SCORE

**Automated:** 70/70
**Documentation:** _____ / 30
**Mastery:** _____ / 10
**Adjustments:** _____ (e.g., -1 for programmatic fixes like missing dependencies)

**TOTAL:** _____ / 100

---

## TEST DETAILS


### Code Quality

- **test_import_success**: ✅ PASS (4/4 pts)
- ✅ Code imports successfully.
- **test_no_syntax_errors**: ✅ PASS (3/3 pts)
- ✅ Test passed.
- **test_no_todos_left**: ✅ PASS (3/3 pts)
- ✅ Test passed.

### RMSNorm

- **test_rmsnorm_implementation**: ✅ PASS (5/5 pts)
- ✅ RMSNorm implemented correctly with proper normalization and learnable scale.
- **test_rmsnorm_numerical_stability**: ✅ PASS (5/5 pts)
- ✅ Test passed.

### Causal Attention

- **test_causal_mask_leakage**: ✅ PASS (8/8 pts)
- ✅ Perfect! Your causal mask correctly prevents future token leakage.
- **test_causal_attention_shape_preservation**: ✅ PASS (7/7 pts)
- ✅ Test passed.

### RoPE

- **test_rope_embeddings**: ✅ PASS (15/15 pts)
- ✅ RoPE correctly applied to Q and K tensors.

### Training

- **test_training_convergence**: ✅ PASS (10/10 pts)
- ✅ Excellent! Your model trains successfully and loss converges.

### Model

- **test_model_forward_pass**: ✅ PASS (5/5 pts)
- ✅ Model forward pass works end-to-end with correct output shapes.
- **test_model_has_trainable_parameters**: ✅ PASS (5/5 pts)
- ✅ Model has the expected number of trainable parameters.