This project implements FAST (Features from Accelerated Segment Test) corner detection using three distinct GPU programming models: OpenACC, CUDA Python (Numba), and Native CUDA C++. The goal is to demonstrate the performance trade-offs between development ease and raw hardware control, achieving real-time video processing speeds that far exceed naive CPU implementations.
- Hybrid Pipeline: Python handles video I/O (using
imageio) while C++/CUDA handles heavy computation. - Three Acceleration Methods:
- OpenACC: Directive-based acceleration for rapid C++ parallelization.
- CUDA Python: High-level kernel implementation using Numba JIT.
- CUDA C++: Low-level implementation using the CUDA Runtime API.
- Scalable Kernels: Implements Grid-Stride Loops to handle arbitrary image resolutions (1080p, 4K) safely.
- Performance: Achieved up to ~1958 FPS with OpenACC and a 176x speedup for the Python pipeline.
| File | Description |
|---|---|
main.py |
Primary entry point. Implements the Numba/CUDA Python kernels and visualizer. |
fast_both.cpp |
Contains the OpenACC GPU implementation and the CPU baseline logic. |
fast_both.cu |
Contains the Native CUDA C++ implementation using the Runtime API. |
run_cpu.py |
Python wrapper to benchmark and visualize the CPU Baseline (via C++ shared library). |
run_gpu.py |
Python wrapper to benchmark and visualize the OpenACC GPU implementation. |
run_fast.py |
Helper script for video I/O and data preprocessing (flattening frames). |
- Hardware: NVIDIA GPU (Compute Capability 6.0+ recommended).
- Software:
- NVIDIA HPC SDK (for
nvc++OpenACC compiler). - CUDA Toolkit (for
nvcc). - Python 3.8+ with libraries:
numpy,imageio,numba,opencv-python.
- NVIDIA HPC SDK (for
-
Compile Native CUDA (fast_both.cu) nvcc -Xcompiler -fPIC -shared -o libfast_cuda.so fast_both.cu
-
Install Python Dependencies pip install numpy imageio numba opencv-python
๐ป Usage Running the CUDA Python (Numba) Implementation python main.py
Running the OpenACC Implementation
Ensure libfast_both.so is in the same directory:
python run_gpu.py
Running the CPU Baseline
python run_cpu.py
๐ Performance Results Implementation Frame Rate (FPS) Speedup vs Baseline OpenACC ~1958.7 7.92ร (vs Optimized C++) CUDA Python ~92.0 176.35ร (vs Python CPU) CUDA C++ ~1267.9 3.76ร (vs Optimized C++)
๐ฅ Authors
Fahad Aloraini
Tariq Alshaya
Ziyad Hussein
Before running the Python scripts, you must compile the C++ and CUDA files into shared libraries (.so files).