Performance Analysis and Optimization Guide

Introduction

This document provides comprehensive performance analysis techniques and optimization strategies for CUDA kernel development. Understanding performance characteristics is crucial for developing efficient GPU applications.

Performance Metrics

1. Throughput Metrics

FLOPS (Floating Point Operations Per Second)

GFLOPS = (Number of Operations) / (Execution Time in seconds) / 1e9

Example Calculation for Matrix Multiplication (N×N):

Operations = 2 × N³ (N³ multiplications + N³ additions)
If N=2048 and time=100ms:
GFLOPS = (2 × 2048³) / (0.1) / 1e9 = 171.8 GFLOPS

Memory Bandwidth

Bandwidth (GB/s) = (Bytes Transferred) / (Execution Time in seconds) / 1e9

Example for Vector Addition:

Bytes = 3 × N × sizeof(float) (read A, read B, write C)
If N=32M and time=10ms:
Bandwidth = (3 × 32M × 4) / (0.01) / 1e9 = 38.4 GB/s

2. Efficiency Metrics

Occupancy

Occupancy = (Active Warps per SM) / (Maximum Warps per SM)

Factors Affecting Occupancy:

Registers per thread
Shared memory per block
Threads per block
Architecture limits

Memory Efficiency

Memory Efficiency = (Requested Bandwidth) / (Actual Bandwidth) × 100%

Compute Utilization

Compute Utilization = (Actual FLOPS) / (Peak FLOPS) × 100%

Profiling Tools and Techniques

1. NVIDIA Nsight Systems

Command Line Usage:

nsys profile --output=profile.nsys-rep --trace=cuda,nvtx ./your_program

Key Metrics to Monitor:

Kernel execution time
Memory transfer time
CPU-GPU synchronization
API call overhead

2. NVIDIA Nsight Compute

Command Line Usage:

ncu --set full --target-processes all ./your_program

Important Sections:

GPU Speed of Light: Overall performance summary
Memory Workload Analysis: Memory access patterns
Compute Workload Analysis: Instruction throughput
Scheduler Statistics: Warp execution efficiency

3. nvprof (Legacy but still useful)

nvprof --print-gpu-trace --log-file profile.log ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_program

Performance Optimization Strategies

1. Memory Optimization

Memory Coalescing

Good Pattern:

// Threads access consecutive memory locations
__global__ void coalesced_access(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = some_computation(data[idx]);
}

Bad Pattern:

// Strided access pattern
__global__ void strided_access(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    data[idx] = some_computation(data[idx]);
}

Shared Memory Bank Conflicts

32-bit Access Pattern (No Conflicts):

__shared__ float shared_data[32][32];
// Each thread in warp accesses different bank
shared_data[threadIdx.y][threadIdx.x] = value;

Bank Conflict Example:

__shared__ float shared_data[32][32];
// All threads access bank 0
shared_data[threadIdx.x][0] = value; // 32-way bank conflict

Memory Hierarchy Utilization

L1/Texture Cache: ~100 GB/s, ~10 cycles latency
L2 Cache:         ~600 GB/s, ~200 cycles latency
Global Memory:    ~900 GB/s, ~400 cycles latency
Shared Memory:    ~19 TB/s,  ~20 cycles latency
Registers:        ~19 TB/s,  ~1 cycle latency

2. Compute Optimization

Warp Divergence Minimization

Problematic Divergence:

if (threadIdx.x < 16) {
    // Only half of warp executes this branch
    expensive_computation_a();
} else {
    // Other half executes this branch
    expensive_computation_b();
}

Optimized Version:

// Separate warps handle different branches
if (threadIdx.x / 32 == 0) {
    expensive_computation_a();
} else {
    expensive_computation_b();
}

Instruction Throughput

High Throughput Instructions:

Fused multiply-add (FMA): a * b + c
Simple arithmetic: +, -, *
Bitwise operations: &, |, ^, <<, >>

Lower Throughput Instructions:

Division: / (use reciprocal multiplication if possible)
Transcendental functions: sin, cos, exp, log
Integer modulo: %

3. Algorithm-Specific Optimizations

Matrix Multiplication

Optimization Progression:

Naive: O(N³) with poor memory locality
Tiled: Use shared memory for data reuse
Blocked: Each thread computes multiple elements
Tensor Cores: Use mixed-precision on modern hardware

Performance Expectations:

Naive:     ~500 GFLOPS
Tiled:     ~2000 GFLOPS
Optimized: ~5000 GFLOPS
cuBLAS:    ~15000 GFLOPS (on A100)

Reduction Optimization

Optimization Techniques:

Tree Reduction: O(log n) steps
Warp Primitives: Use __shfl_down_sync()
Cooperative Groups: Modern synchronization
Multiple Elements per Thread: Reduce kernel launch overhead

Convolution Optimization

Key Strategies:

Im2col: Convert to matrix multiplication
Winograd: Reduce arithmetic complexity
FFT: For large kernels
Separable Filters: 2D → two 1D convolutions

Benchmark Results Analysis

Expected Performance Ranges

Vector Addition (Memory-Bound)

Hardware          | Peak Bandwidth | Expected Performance
GTX 1080 Ti       | 484 GB/s      | ~400 GB/s (80-85%)
RTX 3080          | 760 GB/s      | ~650 GB/s (85-90%)
RTX 4090          | 1008 GB/s     | ~900 GB/s (85-90%)
A100              | 1935 GB/s     | ~1700 GB/s (85-90%)

Matrix Multiplication (Compute-Bound)

Hardware          | Peak FLOPS    | Expected Performance
GTX 1080 Ti       | 11.3 TFLOPS   | ~8-10 TFLOPS
RTX 3080          | 29.8 TFLOPS   | ~20-25 TFLOPS  
RTX 4090          | 83 TFLOPS     | ~60-70 TFLOPS
A100              | 156 TFLOPS    | ~120-140 TFLOPS (Tensor)

Performance Regression Detection

Automated Benchmarking

# Example benchmark script
for size in 1024 2048 4096; do
    echo "Testing matrix size: ${size}x${size}"
    ./matrix_multiplication --size=$size --iterations=10
done

Performance Thresholds

// Performance regression detection
const double EXPECTED_GFLOPS = 2000.0;
const double TOLERANCE = 0.1; // 10% tolerance

if (actual_gflops < EXPECTED_GFLOPS * (1.0 - TOLERANCE)) {
    std::cerr << "Performance regression detected!" << std::endl;
    std::cerr << "Expected: " << EXPECTED_GFLOPS << " GFLOPS" << std::endl;
    std::cerr << "Actual: " << actual_gflops << " GFLOPS" << std::endl;
}

Industry-Relevant Performance Targets

Deep Learning Workloads

Training: 50-80% of peak theoretical performance
Inference: 80-95% of peak theoretical performance
Memory efficiency: >90% for large models

HPC Applications

Dense Linear Algebra: 80-95% of peak FLOPS
Sparse Operations: 20-60% of peak (data-dependent)
Memory-bound kernels: 80-90% of peak bandwidth

Real-time Applications

Computer Vision: <10ms latency for HD frames
Graphics: 60+ FPS for real-time rendering
Financial: <1ms for high-frequency trading

Optimization Checklist

Pre-optimization

Profile with Nsight Systems/Compute
Identify performance bottlenecks
Establish baseline measurements
Set realistic performance targets

Memory Optimization

Ensure coalesced memory access
Minimize shared memory bank conflicts
Use appropriate memory types (texture, constant)
Optimize memory transfer patterns

Compute Optimization

Minimize warp divergence
Maximize occupancy
Use high-throughput instructions
Consider mixed-precision arithmetic

Algorithm Optimization

Choose appropriate parallel algorithm
Implement multi-level optimizations
Consider library alternatives (cuBLAS, cuDNN)
Validate correctness after optimization

Post-optimization

Re-profile and measure improvements
Document optimization techniques used
Set up performance regression tests
Consider maintainability vs performance trade-offs

FilesExpand file tree

optimization-guide.md

Latest commit

History