Progressive GPU programming exercises using Metal on Apple Silicon.
# See all available commands
make help
# Build and run any problem using its prefix
make go-00 # Mandelbrot
make go-01 # Parallel scan

Each problem follows the pattern:
- Sequential baseline - CPU reference implementation
- v1-naive - Direct GPU translation, identifies bottlenecks
- v2-optimized - Addresses memory/compute inefficiencies
- v3-advanced - Architecture-specific optimizations
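The sequential baseline is what each GPU variant gets checked against. A minimal sketch of that idea in C++, using an exclusive prefix sum as the example problem: the names `scan_reference` and `matches_reference` and the relative tolerance are assumptions made for illustration, not the repository's actual code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sequential baseline: exclusive prefix sum on the CPU.
std::vector<float> scan_reference(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of all elements before index i
        running += in[i];
    }
    return out;
}

// Hypothetical check: compare a GPU result (already copied back to the host)
// against the baseline, element by element, within a relative tolerance.
bool matches_reference(const std::vector<float>& gpu,
                       const std::vector<float>& ref,
                       float rel_tol = 1e-5f) {
    if (gpu.size() != ref.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        float scale = std::fmax(1.0f, std::fabs(ref[i]));
        if (std::fabs(gpu[i] - ref[i]) > rel_tol * scale) return false;
    }
    return true;
}
```

A tolerance is used instead of exact equality because a parallel version may add floats in a different order than the sequential loop does.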
The framework provides (in lib/):
- MetalContext: Zero-boilerplate Metal setup, device management
- Timer: GPU-aware performance measurement with bandwidth/FLOPS tracking
- Visualizer: Debug arrays as heatmaps, correctness verification, access pattern analysis
- Mandelbrot - Embarrassingly parallel warm-up, visual debugging
- Parallel Scan - Foundation for everything else
- Bitonic Sort - Fixed comparison network, visualizable parallelism
- Matrix Transpose - Bank conflicts, memory coalescing
- Reduction - SIMD-group (warp) divergence, atomic operations
- Histogram - Atomic contention, privatization
- Sparse Matrix - Irregular workloads, load balancing
- Convolution - Constant memory, texture cache
- Einstein Summation - Tensor contractions, index arithmetic
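To make one of these topics concrete, here is a CPU-side simulation of the privatization idea named under Histogram. It is a sketch under stated assumptions: `histogram_privatized`, `kBins`, and the notion of a simulated threadgroup of `group_size` elements are hypothetical, and the repository's kernels may be organized differently.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBins = 256;  // one bin per possible byte value
using Histogram = std::array<std::uint32_t, kBins>;

// Each simulated threadgroup fills its own private histogram, and the
// private copies are merged into the global one afterwards. On a GPU the
// private copy would typically live in threadgroup memory, so global
// atomics are issued once per bin per threadgroup, not once per element.
Histogram histogram_privatized(const std::vector<std::uint8_t>& data,
                               std::size_t group_size) {
    Histogram global{};
    if (group_size == 0) return global;
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        Histogram local{};  // private to this simulated threadgroup
        std::size_t end = std::min(start + group_size, data.size());
        for (std::size_t i = start; i < end; ++i) {
            ++local[data[i]];
        }
        for (std::size_t b = 0; b < kBins; ++b) {
            global[b] += local[b];  // merge step (atomic add on a GPU)
        }
    }
    return global;
}
```

The result does not depend on `group_size`; only the number of merge operations does, which is the quantity privatization trades against contention.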
The framework calls the Metal API directly from Objective-C++ rather than going through metal-cpp, so it has no external dependencies. Key abstractions:
- MetalContext: Device setup, shader loading, pipeline creation
- Timer: GPU-aware performance measurement with bandwidth tracking
- Visualizer: Array heatmaps, correctness verification, access pattern analysis
- ScopedBuffer: RAII buffer management with automatic cleanup
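The RAII pattern behind ScopedBuffer can be illustrated without Metal at all. The sketch below is not the framework's class: `ScopedHostBuffer` is a hypothetical stand-in that owns a plain host allocation, and it only shows the ownership rules (release in the destructor, move-only) that such a wrapper would plausibly follow.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical illustration of RAII buffer ownership. A Metal-backed
// version would hold a buffer object here in place of a malloc'd pointer.
class ScopedHostBuffer {
public:
    explicit ScopedHostBuffer(std::size_t bytes)
        : data_(std::malloc(bytes)), bytes_(data_ ? bytes : 0) {}

    // Cleanup happens automatically when the object goes out of scope.
    ~ScopedHostBuffer() { std::free(data_); }

    // Copying is disabled so two owners can never free the same memory.
    ScopedHostBuffer(const ScopedHostBuffer&) = delete;
    ScopedHostBuffer& operator=(const ScopedHostBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    ScopedHostBuffer(ScopedHostBuffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          bytes_(std::exchange(other.bytes_, 0)) {}

    ScopedHostBuffer& operator=(ScopedHostBuffer&& other) noexcept {
        if (this != &other) {
            std::free(data_);
            data_ = std::exchange(other.data_, nullptr);
            bytes_ = std::exchange(other.bytes_, 0);
        }
        return *this;
    }

    void* data() const { return data_; }
    std::size_t size() const { return bytes_; }

private:
    void* data_;
    std::size_t bytes_;
};
```

Because the destructor releases the allocation, an early return or an exception in benchmark code cannot leak the buffer.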
- Occupancy: Active threads vs hardware maximum (1024 threads/threadgroup on M2)
- Memory Bandwidth: Achieved vs theoretical (400 GB/s on M2 Max)
- Bank Conflicts: Threadgroup memory contention patterns
- Divergence: SIMD efficiency within simdgroups (32 threads on Apple Silicon)
- Coalescing: Sequential vs strided memory access
# Quick benchmark
make benchmark
# Detailed profiling
xcrun xctrace record --template 'Metal System Trace' --launch ./build/problems/01-parallel-scan/benchmark

Every optimization should be visible: visualize access patterns, measure everything, and understand why performance changes. The goal isn't just to make things fast, but to understand exactly why they're fast.