
namingbe/metals


GPU Experiments - Metal Compute

Progressive GPU programming exercises using Metal on Apple Silicon.

Quick Start

# See all available commands
make help

# Build and run any problem using its prefix
make go-00          # Mandelbrot
make go-01          # Parallel scan

Structure

Each problem follows the pattern:

  1. Sequential baseline - CPU reference implementation
  2. v1-naive - Direct GPU translation of the baseline, used to identify bottlenecks
  3. v2-optimized - Addresses memory and compute inefficiencies
  4. v3-advanced - Architecture-specific optimizations
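
Because every GPU variant is measured against the sequential baseline, a small comparison helper is useful. The following is a minimal sketch, not the repository's Visualizer code: the names max_abs_error and matches_baseline and the default tolerance are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of lib/): largest absolute difference between
// the CPU reference output and the output of a GPU variant.
inline float max_abs_error(const std::vector<float>& reference,
                           const std::vector<float>& candidate) {
    assert(reference.size() == candidate.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        worst = std::fmax(worst, std::fabs(reference[i] - candidate[i]));
    }
    return worst;
}

// A variant passes if every element is within `tolerance` of the baseline.
// The default tolerance is an arbitrary choice for this sketch.
inline bool matches_baseline(const std::vector<float>& reference,
                             const std::vector<float>& candidate,
                             float tolerance = 1e-5f) {
    return max_abs_error(reference, candidate) <= tolerance;
}
```

Running the same check after v1, v2, and v3 makes it harder for an optimization to silently change results.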

The framework provides (in lib/):

  • MetalContext: Zero-boilerplate Metal setup, device management
  • Timer: GPU-aware performance measurement with bandwidth/FLOPS tracking
  • Visualizer: Debug arrays as heatmaps, correctness verification, access pattern analysis
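
The bandwidth and FLOPS figures come down to simple arithmetic on a measured GPU time. The sketch below shows that arithmetic only; the function names are hypothetical and the repository's Timer may compute or report these values differently. The elapsed time is assumed to come from a GPU-side measurement, for example the GPUStartTime and GPUEndTime properties of MTLCommandBuffer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not the repository's Timer API.

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that moved
// `bytes_moved` bytes (reads plus writes) in `seconds` of GPU time.
inline double achieved_gbps(std::uint64_t bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes_moved) / seconds / 1e9;
}

// Achieved throughput in GFLOPS for `flop_count` floating-point operations.
inline double achieved_gflops(std::uint64_t flop_count, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(flop_count) / seconds / 1e9;
}

// Fraction of a theoretical peak (same units for both arguments).
inline double fraction_of_peak(double achieved, double peak) {
    assert(peak > 0.0);
    return achieved / peak;
}
```

For example, a kernel that reads and writes 1e9 bytes in 0.5 s achieves 2 GB/s; dividing by the device's theoretical peak gives the efficiency to track across versions.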

Problem Progression

  1. Mandelbrot - Embarrassingly parallel warm-up, visual debugging
  2. Parallel Scan - Foundation for everything else
  3. Bitonic Sort - Fixed comparison network, visualizable parallelism
  4. Matrix Transpose - Bank conflicts, memory coalescing
  5. Reduction - SIMD-group divergence (warp divergence in CUDA terms), atomic operations
  6. Histogram - Atomic contention, privatization
  7. Sparse Matrix - Irregular workloads, load balancing
  8. Convolution - Constant memory, texture cache
  9. Einstein Summation - Tensor contractions, index arithmetic
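
As an illustration of the baseline-then-parallel pattern for the Parallel Scan problem, here is a minimal CPU-only sketch. It pairs a sequential exclusive scan with a Hillis-Steele inclusive scan emulated on the CPU. The function names are assumptions, and Hillis-Steele is used here only as a well-known example; the repository's kernels may use a different algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive prefix sum: out[i] = in[0] + ... + in[i-1].
inline std::vector<int> exclusive_scan_sequential(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Hillis-Steele inclusive scan, emulated on the CPU. Each pass of the outer
// loop corresponds to one parallel step on the GPU; the inner loop stands in
// for the threads that would all run during that step.
inline std::vector<int> inclusive_scan_hillis_steele(std::vector<int> data) {
    std::vector<int> next(data.size());
    for (std::size_t offset = 1; offset < data.size(); offset *= 2) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            next[i] = (i >= offset) ? data[i] + data[i - offset] : data[i];
        }
        data.swap(next);  // double buffering: reads never see this step's writes
    }
    return data;
}
```

The emulated version finishes in about log2(n) steps but performs more additions in total than the sequential loop, which is the kind of trade-off the later versions of a problem are meant to examine.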

Implementation Notes

The framework calls the Metal API directly from Objective-C++ rather than going through metal-cpp, so it has no external dependencies. Key abstractions:

  • MetalContext: Device setup, shader loading, pipeline creation
  • Timer: GPU-aware performance measurement with bandwidth tracking
  • Visualizer: Array heatmaps, correctness verification, access pattern analysis
  • ScopedBuffer: RAII buffer management with automatic cleanup

Key Concepts to Track

  • Occupancy: Active threads relative to the hardware maximum (threadgroups are capped at 1024 threads on M2; a pipeline's maxTotalThreadsPerThreadgroup gives the per-kernel limit)
  • Memory Bandwidth: Achieved vs theoretical (400 GB/s on M2 Max)
  • Bank Conflicts: Threadgroup memory contention patterns
  • Divergence: SIMD efficiency within simdgroups (32 threads on Apple Silicon)
  • Coalescing: Sequential vs strided memory access
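
To make the bank-conflict and stride concepts concrete, the sketch below counts how many threads of one simdgroup land on the same threadgroup-memory bank for a given access stride. It assumes the commonly used model of 32 banks of 4-byte words; Apple does not publish this layout, so treat the bank count as an assumption of this example, not a hardware fact. The name conflict_degree is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBanks = 32;      // ASSUMED bank count, not a published figure
constexpr std::size_t kSimdWidth = 32;  // simdgroup width on Apple Silicon

// Worst-case number of threads in one simdgroup that hit the same bank when
// thread t accesses the 4-byte word at index t * stride_in_words.
// 1 means conflict-free; kSimdWidth means fully serialized.
inline std::size_t conflict_degree(std::size_t stride_in_words) {
    std::array<std::size_t, kBanks> hits{};  // zero-initialized per-bank counters
    for (std::size_t t = 0; t < kSimdWidth; ++t) {
        ++hits[(t * stride_in_words) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, reading a column of a 32x32 float tile (stride 32) gives conflict_degree(32) == 32, while padding each row to 33 words gives conflict_degree(33) == 1. That is the usual motivation for padded tiles in a matrix transpose.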

Profiling

# Quick benchmark
make benchmark

# Detailed profiling
xcrun xctrace record --template 'Metal System Trace' --launch ./build/problems/01-parallel-scan/benchmark

Philosophy

Every optimization should be visible: visualize access patterns, measure everything, and understand why performance changes. The goal isn't just to make things fast, but to understand exactly why they're fast.

About

Doing whatever with Apple Silicon thingy idk
