Skip to content

Latest commit

 

History

History
296 lines (242 loc) · 7.99 KB

File metadata and controls

296 lines (242 loc) · 7.99 KB

Performance Analysis and Optimization Guide

Introduction

This document provides comprehensive performance analysis techniques and optimization strategies for CUDA kernel development. Understanding performance characteristics is crucial for developing efficient GPU applications.

Performance Metrics

1. Throughput Metrics

FLOPS (Floating Point Operations Per Second)

GFLOPS = (Number of Operations) / (Execution Time in seconds) / 1e9

Example Calculation for Matrix Multiplication (N×N):

Operations = 2 × N³ (N³ multiplications + N³ additions)
If N=2048 and time=100ms:
GFLOPS = (2 × 2048³) / (0.1) / 1e9 = 171.8 GFLOPS

Memory Bandwidth

Bandwidth (GB/s) = (Bytes Transferred) / (Execution Time in seconds) / 1e9

Example for Vector Addition:

Bytes = 3 × N × sizeof(float) (read A, read B, write C)
If N=32M and time=10ms:
Bandwidth = (3 × 32M × 4) / (0.01) / 1e9 = 38.4 GB/s

2. Efficiency Metrics

Occupancy

Occupancy = (Active Warps per SM) / (Maximum Warps per SM)

Factors Affecting Occupancy:

  • Registers per thread
  • Shared memory per block
  • Threads per block
  • Architecture limits

Memory Efficiency

Memory Efficiency = (Requested Bandwidth) / (Actual Bandwidth) × 100%

Compute Utilization

Compute Utilization = (Actual FLOPS) / (Peak FLOPS) × 100%

Profiling Tools and Techniques

1. NVIDIA Nsight Systems

Command Line Usage:

nsys profile --output=profile.nsys-rep --trace=cuda,nvtx ./your_program

Key Metrics to Monitor:

  • Kernel execution time
  • Memory transfer time
  • CPU-GPU synchronization
  • API call overhead

2. NVIDIA Nsight Compute

Command Line Usage:

ncu --set full --target-processes all ./your_program

Important Sections:

  • GPU Speed of Light: Overall performance summary
  • Memory Workload Analysis: Memory access patterns
  • Compute Workload Analysis: Instruction throughput
  • Scheduler Statistics: Warp execution efficiency

3. nvprof (Legacy but still useful)

nvprof --print-gpu-trace --log-file profile.log ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_program

Performance Optimization Strategies

1. Memory Optimization

Memory Coalescing

Good Pattern:

// Threads access consecutive memory locations
__global__ void coalesced_access(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = some_computation(data[idx]);
}

Bad Pattern:

// Strided access pattern
__global__ void strided_access(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    data[idx] = some_computation(data[idx]);
}

Shared Memory Bank Conflicts

32-bit Access Pattern (No Conflicts):

__shared__ float shared_data[32][32];
// Each thread in warp accesses different bank
shared_data[threadIdx.y][threadIdx.x] = value;

Bank Conflict Example:

__shared__ float shared_data[32][32];
// All threads access bank 0
shared_data[threadIdx.x][0] = value; // 32-way bank conflict

Memory Hierarchy Utilization

L1/Texture Cache: ~100 GB/s, ~10 cycles latency
L2 Cache:         ~600 GB/s, ~200 cycles latency
Global Memory:    ~900 GB/s, ~400 cycles latency
Shared Memory:    ~19 TB/s,  ~20 cycles latency
Registers:        ~19 TB/s,  ~1 cycle latency

2. Compute Optimization

Warp Divergence Minimization

Problematic Divergence:

if (threadIdx.x < 16) {
    // Only half of warp executes this branch
    expensive_computation_a();
} else {
    // Other half executes this branch
    expensive_computation_b();
}

Optimized Version:

// Separate warps handle different branches
if (threadIdx.x / 32 == 0) {
    expensive_computation_a();
} else {
    expensive_computation_b();
}

Instruction Throughput

High Throughput Instructions:

  • Fused multiply-add (FMA): a * b + c
  • Simple arithmetic: +, -, *
  • Bitwise operations: &, |, ^, <<, >>

Lower Throughput Instructions:

  • Division: / (use reciprocal multiplication if possible)
  • Transcendental functions: sin, cos, exp, log
  • Integer modulo: %

3. Algorithm-Specific Optimizations

Matrix Multiplication

Optimization Progression:

  1. Naive: O(N³) with poor memory locality
  2. Tiled: Use shared memory for data reuse
  3. Blocked: Each thread computes multiple elements
  4. Tensor Cores: Use mixed-precision on modern hardware

Performance Expectations:

Naive:     ~500 GFLOPS
Tiled:     ~2000 GFLOPS
Optimized: ~5000 GFLOPS
cuBLAS:    ~15000 GFLOPS (on A100)

Reduction Optimization

Optimization Techniques:

  1. Tree Reduction: O(log n) steps
  2. Warp Primitives: Use __shfl_down_sync()
  3. Cooperative Groups: Modern synchronization
  4. Multiple Elements per Thread: Reduce kernel launch overhead

Convolution Optimization

Key Strategies:

  1. Im2col: Convert to matrix multiplication
  2. Winograd: Reduce arithmetic complexity
  3. FFT: For large kernels
  4. Separable Filters: 2D → two 1D convolutions

Benchmark Results Analysis

Expected Performance Ranges

Vector Addition (Memory-Bound)

Hardware          | Peak Bandwidth | Expected Performance
GTX 1080 Ti       | 484 GB/s      | ~400 GB/s (80-85%)
RTX 3080          | 760 GB/s      | ~650 GB/s (85-90%)
RTX 4090          | 1008 GB/s     | ~900 GB/s (85-90%)
A100              | 1935 GB/s     | ~1700 GB/s (85-90%)

Matrix Multiplication (Compute-Bound)

Hardware          | Peak FLOPS    | Expected Performance
GTX 1080 Ti       | 11.3 TFLOPS   | ~8-10 TFLOPS
RTX 3080          | 29.8 TFLOPS   | ~20-25 TFLOPS  
RTX 4090          | 83 TFLOPS     | ~60-70 TFLOPS
A100              | 156 TFLOPS    | ~120-140 TFLOPS (Tensor)

Performance Regression Detection

Automated Benchmarking

# Example benchmark script
for size in 1024 2048 4096; do
    echo "Testing matrix size: ${size}x${size}"
    ./matrix_multiplication --size=$size --iterations=10
done

Performance Thresholds

// Performance regression detection
const double EXPECTED_GFLOPS = 2000.0;
const double TOLERANCE = 0.1; // 10% tolerance

if (actual_gflops < EXPECTED_GFLOPS * (1.0 - TOLERANCE)) {
    std::cerr << "Performance regression detected!" << std::endl;
    std::cerr << "Expected: " << EXPECTED_GFLOPS << " GFLOPS" << std::endl;
    std::cerr << "Actual: " << actual_gflops << " GFLOPS" << std::endl;
}

Industry-Relevant Performance Targets

Deep Learning Workloads

  • Training: 50-80% of peak theoretical performance
  • Inference: 80-95% of peak theoretical performance
  • Memory efficiency: >90% for large models

HPC Applications

  • Dense Linear Algebra: 80-95% of peak FLOPS
  • Sparse Operations: 20-60% of peak (data-dependent)
  • Memory-bound kernels: 80-90% of peak bandwidth

Real-time Applications

  • Computer Vision: <10ms latency for HD frames
  • Graphics: 60+ FPS for real-time rendering
  • Financial: <1ms for high-frequency trading

Optimization Checklist

Pre-optimization

  • Profile with Nsight Systems/Compute
  • Identify performance bottlenecks
  • Establish baseline measurements
  • Set realistic performance targets

Memory Optimization

  • Ensure coalesced memory access
  • Minimize shared memory bank conflicts
  • Use appropriate memory types (texture, constant)
  • Optimize memory transfer patterns

Compute Optimization

  • Minimize warp divergence
  • Maximize occupancy
  • Use high-throughput instructions
  • Consider mixed-precision arithmetic

Algorithm Optimization

  • Choose appropriate parallel algorithm
  • Implement multi-level optimizations
  • Consider library alternatives (cuBLAS, cuDNN)
  • Validate correctness after optimization

Post-optimization

  • Re-profile and measure improvements
  • Document optimization techniques used
  • Set up performance regression tests
  • Consider maintainability vs performance trade-offs