This repository was archived by the owner on Apr 22, 2026. It is now read-only.

libHPC

libHPC is a high-performance computing library focused on Linux and Windows environments. It provides SIMD-optimized kernels, concurrent data structures, GPU utilities, and HPC-oriented memory management components.

Project Status

This public archive preserves the state of libHPC at the point where its core HPC primitives — GPU radix sort, ABA-safe lock-free queue, SIMD kernels, and cache-hierarchy benchmarks — reached a stable, validated milestone.

Active development continues privately. The archive is retained for study, reference, and portfolio purposes; commercial use, redistribution, or derivative proprietary work without explicit permission is not permitted.

Platform support status, known limitations, and benchmarks are documented in the sections below.

0x00 Platform Support

Platform                        Status
Linux (x86_64 / CUDA)           ✓ Supported
Windows (MSVC / CUDA)           ✓ Supported
macOS (Intel)                   ✓ Supported, limited
macOS (Apple Silicon / ARM64)   Not supported

0x01 macOS Apple Silicon Notice

libHPC does not support macOS ARM (Apple Silicon).

The reason is simple:

Apple’s recent macOS / Xcode toolchain updates introduced ABI changes in libc++, causing oneTBB and other HPC components to fail at link time.
Specifically, the symbol std::__1::__hash_memory, which oneTBB depends on, has been removed or hidden at the SDK level. These issues do not occur on Linux or Windows, and they did not occur on older macOS versions.

Since the goal of libHPC is stable, reproducible high-performance computing, macOS ARM is excluded to avoid degraded reliability or performance.
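
As an illustration of how such an exclusion can be expressed in code, here is a minimal sketch of a compile-time platform check. The namespace libhpc_sketch and the names kIsAppleSilicon, platform_excluded, and platform_name are hypothetical and are not taken from libHPC's sources; the sketch relies only on the predefined compiler macros __APPLE__, __aarch64__, __arm64__, _WIN32, and __linux__.

```cpp
// Hypothetical sketch: none of these names come from libHPC itself.
namespace libhpc_sketch {

// Apple's clang defines __APPLE__ together with __aarch64__ / __arm64__
// when targeting Apple Silicon.
#if defined(__APPLE__) && (defined(__aarch64__) || defined(__arm64__))
constexpr bool kIsAppleSilicon = true;
#else
constexpr bool kIsAppleSilicon = false;
#endif

// True only for the one target the platform table lists as "Not supported".
constexpr bool platform_excluded() { return kIsAppleSilicon; }

// Human-readable label for the build target, mirroring the platform table.
constexpr const char* platform_name() {
#if defined(__APPLE__)
    return kIsAppleSilicon ? "macOS (Apple Silicon / ARM64)" : "macOS (Intel)";
#elif defined(_WIN32)
    return "Windows";
#elif defined(__linux__)
    return "Linux";
#else
    return "unknown";
#endif
}

// Uncommenting the next line would turn the exclusion into a hard build error
// instead of a later link-time failure:
// static_assert(!platform_excluded(), "macOS ARM64 is not supported");

}  // namespace libhpc_sketch
```

Failing at compile time with a clear message is one possible design; whether libHPC enforces the exclusion this way or through its build system is not stated here.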


0x02 GPU Performance Optimization Highlights

libHPC includes GPU-accelerated kernels optimized for high-throughput computation on NVIDIA CUDA-compatible devices:

  • Radix-Sort Kernel: Processes 500M elements in ~360 ms on an RTX 3080 Ti (laptop), sustaining ~1.39B elements/sec throughput.
  • Warp-Synchronous & Tiled Memory Layouts: Maximize shared memory utilization and minimize global memory latency.
  • Concurrent GPU Pipelines: Supports asynchronous kernel launches and stream-based scheduling for overlapping compute and memory operations.
  • Profiling & Validation: Includes tools for warp efficiency, memory access analysis, and synchronization correctness across GPU architectures.
  • Realistic HPC Throughput: Designed for bulk-parallel computation and scientific workloads, not real-time ultra-low-latency trading systems.
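
To make the radix-sort bullet more concrete, the following is a CPU-side reference sketch of a least-significant-digit (LSD) radix sort on 32-bit keys, plus the arithmetic behind the quoted throughput figure. It is not libHPC's GPU kernel: the function names radix_sort_u32 and throughput_elems_per_sec and the choice of 8 bits per pass are assumptions made for illustration. A GPU implementation typically runs the same histogram, prefix-sum, and scatter phases in parallel across thread blocks.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: LSD radix sort over 32-bit unsigned keys.
// Each pass sorts by one 8-bit digit, from least to most significant.
void radix_sort_u32(std::vector<std::uint32_t>& keys) {
    constexpr int kBitsPerPass = 8;  // assumption: 8-bit digits, 4 passes
    constexpr std::size_t kBuckets = std::size_t{1} << kBitsPerPass;
    std::vector<std::uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kBitsPerPass) {
        std::array<std::size_t, kBuckets> offsets{};

        // Phase 1: histogram of the current digit.
        for (std::uint32_t k : keys) {
            ++offsets[(k >> shift) & (kBuckets - 1)];
        }

        // Phase 2: exclusive prefix sum turns counts into start offsets.
        std::size_t running = 0;
        for (std::size_t& o : offsets) {
            const std::size_t count = o;
            o = running;
            running += count;
        }

        // Phase 3: stable scatter into the scratch buffer.
        for (std::uint32_t k : keys) {
            scratch[offsets[(k >> shift) & (kBuckets - 1)]++] = k;
        }

        // The scratch buffer becomes the input of the next pass.
        keys.swap(scratch);
    }
}

// Elements per second for a given element count and elapsed milliseconds.
double throughput_elems_per_sec(double elements, double milliseconds) {
    return elements / (milliseconds / 1000.0);
}
```

With the figures quoted above, throughput_elems_per_sec(500e6, 360.0) evaluates to roughly 1.39e9, which matches the stated ~1.39B elements/sec.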