A lightweight C++ / OpenCL backend for training and inference of custom
TensorFlow / Keras models on AMD GPUs that do not support ROCm — specifically
older GCN parts where only the open-source amdgpu + Mesa / rusticl OpenCL stack
is available.
Implemented as a set of TF custom ops (forward + backward), each backed by
hand-written OpenCL 1.2 kernels. Gradients are registered through
tf.RegisterGradient so tf.GradientTape and model.fit() flow through
transparently. Weights saved locally can be loaded on any TF environment
(CUDA, CPU) without changes — see Portable training.
Status: Phase 3 complete. All seven ops are implemented and tested. The full
build_model() graph trains end-to-end on the OpenCL backend; see examples/train_full_model.py. Phase 4 (performance: buffer pool, tiled GEMM, tree-reduction BN) is next; see docs/ROADMAP.md.
| Layer / op | Forward | dL/dx | dL/dw | Notes |
|---|---|---|---|---|
| Conv2D | ✅ | ✅ | ✅ | NHWC, SAME/VALID, any stride |
| DepthwiseConv2D | ✅ | ✅ | ✅ | depth_multiplier supported |
| BatchNormalization | ✅ | ✅ | ✅ | train + inference, EMA in Keras |
| ReLU | ✅ | ✅ | n/a | elementwise |
| Dense | ✅ | ✅ | ✅ | GEMM: one work-item per output element |
| UpSampling2D (bilinear) | ✅ | ✅ | n/a | half-pixel centres; float atomic add |
| Sigmoid | ✅ | ✅ | n/a | backward uses y (not x), skips recompute |
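
The Sigmoid note in the table relies on the identity dσ/dx = σ(x)(1 − σ(x)): once the forward output y is saved, the backward pass needs no second exponential. The snippet below is a minimal pure-Python sketch of that identity, not the project's OpenCL kernel; the function names are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Reference forward pass."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_from_output(y: float, dy: float) -> float:
    """Backward pass written in terms of the saved output y, not the input x."""
    return dy * y * (1.0 - y)


x = 0.3
y = sigmoid(x)
analytic = sigmoid_grad_from_output(y, 1.0)

# Cross-check against a central finite difference of the forward pass.
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
assert abs(analytic - numeric) < 1e-8
```

Reusing y trades one extra tensor kept alive between forward and backward for one fewer exp() per element.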
71 tests pass — run pytest tests/ -v to verify.
| Item | Requirement |
|---|---|
| GPU | Any that exposes OpenCL 1.2 via clinfo — AMD GCN (Topaz, Tonga, Fiji, Polaris, Vega), Intel integrated, or Mesa CPU fallback |
| OS | Ubuntu 22.04 / 24.04; other distros work if clinfo -l sees your device |
| OpenCL | 1.2 target; works on 2.0+ ICDs too |
| TF | 2.10 – 2.15 CPU build (ops are registered on DEVICE_CPU; GPU compute happens via OpenCL, not TF's CUDA/ROCm path) |
| Python | 3.8 – 3.13 |
The CLBackend automatically probes each OpenCL platform by compiling a trivial
test kernel and picks the first one that succeeds. Broken ICDs (e.g. Mesa rusticl
without the LLVM SPIR-V backend) are silently skipped. If your preferred platform
is not selected, set OCL_ICD_VENDORS=/etc/OpenCL/vendors/your.icd before running.
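
The probing strategy described above can be sketched in a few lines. This is only an illustration of the selection logic in plain Python: the real backend is C++, and `pick_first_working_platform` and `fake_try_compile` are hypothetical names standing in for the actual build of a trivial test kernel.

```python
from typing import Callable, Iterable, Optional


def pick_first_working_platform(
    platforms: Iterable[str],
    try_compile: Callable[[str], None],
) -> Optional[str]:
    """Return the first platform on which the probe kernel builds, else None."""
    for name in platforms:
        try:
            try_compile(name)  # expected to raise if the ICD cannot build a kernel
        except Exception:
            continue  # broken ICD: skip it silently and try the next one
        return name
    return None


def fake_try_compile(name: str) -> None:
    # Stand-in for building a trivial kernel on a real OpenCL platform.
    if name == "rusticl-without-spirv":
        raise RuntimeError("program build failed")


chosen = pick_first_working_platform(
    ["rusticl-without-spirv", "Intel(R) OpenCL Graphics"], fake_try_compile
)
print(chosen)
```

Because the first success wins, platform order matters, which is why restricting the visible ICDs changes the outcome.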
sudo apt update
sudo apt install -y build-essential cmake \
opencl-headers ocl-icd-opencl-dev \
mesa-opencl-icd clinfo \
python3-dev python3-pip

Verify your device is visible:
clinfo -l
# Example output:
# Platform #0: rusticl
# `-- Device #0: AMD Radeon R5 M445 Series (radeonsi, ...)
# Platform #1: Intel(R) OpenCL Graphics
# `-- Device #0: Intel(R) UHD Graphics 620

pip install .

This compiles the C++ shared library via CMake, bundles the OpenCL kernel
sources inside the package, and installs everything into your Python
environment. After this, import opencl_tf works from any directory.
For a development / editable install (the .so lives in the repo, no
copying):
pip install -e .

Or compile the .so in-place without installing:
python setup.py build_ext --inplace

Make:
pip install -r requirements.txt
make
# Produces opencl_tf/opencl_tf_ops.so; run from the repo root

CMake:
cmake -S . -B build
cmake --build build -j$(nproc)

The Makefile auto-detects TF include and link flags from the active Python
environment (python3 -c 'import tensorflow as tf; ...').
pytest tests/ -v
# Expected: 71 passed

import opencl_tf as ocl

y = ocl.conv2d(x, w, strides=(1, 2, 2, 1), padding="SAME")
z = ocl.depthwise_conv2d(y, dw, strides=(1, 1, 1, 1), padding="SAME")
b, mean, var = ocl.batch_norm_training(z, gamma, beta, epsilon=1e-3)
a = ocl.relu(b)
s = ocl.sigmoid(a)
o = ocl.dense(s_flat, W, bias)

Every layer is a drop-in for its tf.keras.layers counterpart:
import tensorflow as tf

from opencl_tf.layers import (
    OpenCLConv2D, OpenCLDepthwiseConv2D,
    OpenCLBatchNormalization, OpenCLReLU,
    OpenCLDense, OpenCLUpSampling2D, OpenCLSigmoid,
)

def conv_bn_relu(x, filters, kernel_size, strides=1):
    # Conv -> BatchNorm -> ReLU
    x = OpenCLConv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = OpenCLBatchNormalization()(x)
    return OpenCLReLU()(x)

def dsconv_block(x, filters, strides):
    # Depthwise-separable block: 3x3 depthwise, then 1x1 pointwise
    x = OpenCLDepthwiseConv2D(3, strides=strides, padding="same")(x)
    x = OpenCLBatchNormalization()(x)
    x = OpenCLReLU()(x)
    x = OpenCLConv2D(filters, 1)(x)
    x = OpenCLBatchNormalization()(x)
    return OpenCLReLU()(x)

inp = tf.keras.Input(shape=(64, 64, 3))
x = conv_bn_relu(inp, 48, (3, 7), strides=(1, 2))
x = dsconv_block(x, 96, strides=(2, 2))
x = dsconv_block(x, 144, strides=(2, 2))
x = OpenCLDense(128)(tf.keras.layers.Flatten()(x))
x = OpenCLReLU()(x)
x = OpenCLUpSampling2D((2, 2))(tf.keras.layers.Reshape((4, 4, 8))(x))
out = OpenCLSigmoid()(x)
model = tf.keras.Model(inp, out)

Gradients are auto-registered, so model.fit() and tf.GradientTape just work:
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-3),
loss="mse",
)
model.fit(x_train, y_train, epochs=10, batch_size=8)

See examples/train_full_model.py for a
complete end-to-end training loop on the uwb_loc_v3 model.
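
The Reshape((4, 4, 8)) in the Keras model above only works because Dense(128) produces exactly 4 × 4 × 8 values, and the Flatten before it depends on how each strided convolution shrinks the feature map. The sketch below reproduces that arithmetic with the standard TensorFlow SAME / VALID output-size rules; `conv_output_size` is a hypothetical helper for illustration, not part of opencl_tf.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: str) -> int:
    """Spatial output size under TensorFlow's SAME / VALID conventions."""
    mode = padding.upper()
    if mode == "SAME":
        return -(-size // stride)  # ceil(size / stride)
    if mode == "VALID":
        return -(-(size - kernel + 1) // stride)  # ceil((size - kernel + 1) / stride)
    raise ValueError(f"unknown padding: {padding!r}")


# (kernel_h, kernel_w, stride_h, stride_w) of the strided layers in the model above.
# The 1x1 pointwise convolutions use stride 1 and leave the size unchanged.
strided_layers = [
    (3, 7, 1, 2),  # conv_bn_relu(inp, 48, (3, 7), strides=(1, 2))
    (3, 3, 2, 2),  # first dsconv_block, depthwise 3x3
    (3, 3, 2, 2),  # second dsconv_block, depthwise 3x3
]

h, w = 64, 64
for kh, kw, sh, sw in strided_layers:
    h = conv_output_size(h, kh, sh, "same")
    w = conv_output_size(w, kw, sw, "same")

flat = h * w * 144  # 144 channels after the last block
print(h, w, flat)
```

Running the same arithmetic before changing input size or strides is a quick way to catch a Reshape mismatch without building the model.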
Looking to port an existing Keras model? See
docs/MIGRATING_FROM_KERAS.md for a mapping
table, limitations, and before/after examples.
Train locally on your AMD GPU, then load weights on any other machine (Colab, a cloud CUDA box, a friend's laptop) with zero code changes.
# Local — uses OpenCL if opencl_tf is importable, otherwise falls back to
# standard TensorFlow automatically.
python examples/train_portable.py train --epochs 30 --out my_model.weights.h5
# Print the Colab paste snippet
python examples/train_portable.py colab

The key: save with model.save_weights() (weights only, no graph) and load
with model.load_weights(). Keras matches weights by layer name= string, so
OpenCLDense(name="fc1") and tf.keras.layers.Dense(name="fc1") share the
same weight slots. See examples/train_portable.py
for the full two-sided implementation.
# Default: 60 s per phase on whichever GPU OpenCL selects
python tools/stress_gpu.py
# Longer run, bigger matrices
python tools/stress_gpu.py --duration 120 --size 2048
# Force a specific platform
python tools/stress_gpu.py --platform rusticl # AMD via Mesa
python tools/stress_gpu.py --platform Intel # Intel integrated GPU

Three phases: FP32 SGEMM (compute), buffer copy (bandwidth), and SGEMM + LUT
reads (mixed). Reports GFLOPS and GB/s. Reads GPU temperature via rocm-smi
if available.
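
For readers interpreting the numbers, here is how GFLOPS and GB/s figures are conventionally derived from a timed run. This is a sketch under stated assumptions, not the code in tools/stress_gpu.py: it assumes the usual 2·N³ operation count for a square SGEMM and counts a copy as bytes read plus bytes written. Both function names are hypothetical.

```python
def sgemm_gflops(n: int, seconds: float, iterations: int = 1) -> float:
    """GFLOPS for `iterations` square N x N x N SGEMMs finished in `seconds`."""
    flops = 2.0 * n**3 * iterations  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


def copy_gbps(num_bytes: int, seconds: float, iterations: int = 1) -> float:
    """GB/s for buffer copies, counting each byte once as read and once as written."""
    moved = 2.0 * num_bytes * iterations
    return moved / seconds / 1e9


print(f"{sgemm_gflops(2048, 0.1):.1f} GFLOPS")
print(f"{copy_gbps(256 * 1024 * 1024, 0.05):.1f} GB/s")
```

If the tool counts copy traffic one-way instead, its GB/s figure would be half of what this sketch reports for the same timing.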
OpenCL-Tensorflow/
├── kernels/ # OpenCL .cl source files
│ ├── conv2d_kernels.cl
│ ├── depthwise_conv2d_kernels.cl
│ ├── batchnorm_kernels.cl
│ ├── relu_kernels.cl
│ ├── sigmoid_kernels.cl
│ ├── dense_kernels.cl
│ └── upsampling_bilinear_kernels.cl
├── src/ # C++ TF custom-op sources
│ ├── cl_backend.{h,cc} # CLBackend singleton + RAII helpers
│ ├── padding_utils.h # SAME / VALID resolver
│ └── ops/
│ ├── conv2d_ops.cc
│ ├── depthwise_conv2d_ops.cc
│ ├── batchnorm_ops.cc
│ ├── relu_ops.cc
│ ├── sigmoid_ops.cc
│ ├── dense_ops.cc
│ └── upsampling_ops.cc
├── opencl_tf/ # Python package
│ ├── __init__.py
│ ├── _library.py # loads opencl_tf_ops.so + sets KERNELS_PATH
│ ├── gradients.py # @RegisterGradient for all 7 ops
│ ├── layers.py # OpenCL{Conv2D,DepthwiseConv2D,BN,ReLU,
│ │ # Dense,UpSampling2D,Sigmoid}
│ └── ops/
│ ├── conv2d.py
│ ├── depthwise_conv2d.py
│ ├── batchnorm.py
│ ├── relu.py
│ ├── sigmoid.py
│ ├── dense.py
│ └── upsampling.py
├── tests/ # pytest suite — parity vs tf.nn.*
│ ├── test_conv2d.py
│ ├── test_depthwise_conv2d.py
│ ├── test_batchnorm.py
│ ├── test_relu.py
│ ├── test_sigmoid.py # includes concurrent + sequential stress
│ ├── test_dense.py
│ └── test_upsampling.py
├── examples/
│ ├── train_full_model.py # full uwb_loc_v3 end-to-end training
│ ├── train_portable.py # local AMD GPU → Colab CUDA weight transfer
│ └── train_simple_cnn.py
├── tools/
│ └── stress_gpu.py # OpenCL compute / bandwidth stress test
├── docs/
│ ├── ARCHITECTURE.md # internals deep-dive
│ ├── ADDING_NEW_OPS.md # step-by-step guide for new ops
│ ├── TROUBLESHOOTING.md # common issues and fixes
│ └── ROADMAP.md
├── CMakeLists.txt
├── Makefile
├── requirements.txt
└── README.md
The C++ ops are registered on DEVICE_CPU — TF gives us host-pointer tensors,
we copy them into cl_mem device buffers, launch an OpenCL kernel, and
blocking-read the result back. A singleton CLBackend owns the
platform/device/context/queue and caches compiled programs and kernels.
The command queue is serialized with a mutex because TF calls Compute()
concurrently from its inter-op thread pool and the ops share cached kernels.
The OpenCL spec makes clSetKernelArg unsafe for concurrent calls on the same
cl_kernel, so both clSetKernelArg and the clEnqueue* call that follows must be
inside that mutex; otherwise another thread could overwrite the arguments
between the two calls.
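
The locking rule can be illustrated without any OpenCL at all. In the sketch below, `FakeKernel` is a hypothetical stand-in for a cached cl_kernel whose arguments are shared mutable state; the point is only that setting the argument and enqueueing sit under one lock, which mirrors the rule above rather than reproducing the C++ implementation.

```python
import threading


class FakeKernel:
    """Stand-in for a cached cl_kernel: its argument is shared mutable state."""

    def __init__(self) -> None:
        self.arg = None

    def set_arg(self, value: int) -> None:  # plays the role of clSetKernelArg
        self.arg = value

    def enqueue(self) -> int:  # plays the role of clEnqueueNDRangeKernel
        return self.arg * 2


queue_lock = threading.Lock()
kernel = FakeKernel()
results = {}


def compute(value: int) -> None:
    # Set-arg and enqueue form one critical section, so no other thread can
    # overwrite the argument between the two calls.
    with queue_lock:
        kernel.set_arg(value)
        results[value] = kernel.enqueue()


threads = [threading.Thread(target=compute, args=(i,)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(results[i] == 2 * i for i in range(32))
```

Locking only the enqueue would not be enough: a second thread could still change the argument after the first thread set it and before the first thread launched.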
Gradients are registered in opencl_tf/gradients.py and chain forward ops to
their backward counterparts, all sharing the same CLBackend instance.
See docs/ARCHITECTURE.md for the full technical
breakdown and docs/ADDING_NEW_OPS.md for the
step-by-step guide to adding a new layer.
See docs/TROUBLESHOOTING.md for common issues
(OpenCL platform selection, GLIBCXX version errors, kernel compile failures).
MIT — see LICENSE.