Describe the bug
As observed in NVIDIA/cccl#4248:
Consider an extern "C" device function accepting two structs as inputs:
struct MyStruct {
char a;
char b;
char c;
};
extern "C" __device__ void foo(MyStruct a, MyStruct b) {
}
Here's the corresponding PTX:
.visible .func foo(
.param .align 1 .b8 foo_param_0[3],
.param .align 1 .b8 foo_param_1[3]
)
{
ret;
}
Now, consider defining this device function using numba.cuda, using @gpu_struct to define the type of the arguments, and keeping in mind that gpu_struct uses StructModel to define the underlying numba type:
import numba
from numba import cuda
import numpy as np
from cuda.parallel.experimental.struct import gpu_struct
@gpu_struct
class MyStruct:
a: np.int8
b: np.int8
c: np.int8
def op(a, b):
pass
tp = numba.typeof(MyStruct(1, 2, 3))
ptx, _ = cuda.compile(op, (tp, tp))
print(ptx)
Here's the output of the above script:
.visible .func (.param .b64 func_retval0) op(
.param .b32 op_param_0,
.param .b32 op_param_1,
.param .b32 op_param_2,
.param .b32 op_param_3,
.param .b32 op_param_4,
.param .b32 op_param_5
)
{
.reg .b64 %rd<2>;
mov.u64 %rd1, 0;
st.param.b64 [func_retval0+0], %rd1;
ret;
}
Comparing the two PTXs, we see an ABI difference: on the C++ side, struct arguments are of type .b8[N] where N is the size of the struct. On the numba side however, we note that structs have been decomposed into their members, and the function accepts a .b32 for each member. (reason for promotion to 32-bit).
This is because the C ABI calling convention is packing struct value arguments according to the Numba calling convention, not the C one. It just uses the default argument packer:
|
arginfo = self._get_arg_packer(argtys) |
Steps/Code to reproduce bug
Reproducer in code above.
Expected behavior
The C ABI calling convention should pack struct values according to the C ABI calling convention, yielding the same function prototypes as CUDA C/C++ code compiled with NVCC.
This will probably require the implementation of a new arg packer for the C calling convention.
Environment details (please complete the following information):
All environments.
Additional context
No other context.
Describe the bug
As observed in NVIDIA/cccl#4248:
Consider an extern "C" device function accepting two structs as inputs:
Here's the corresponding PTX:
Now, consider defining this device function using numba.cuda, using
@gpu_structto define the type of the arguments, and keeping in mind that gpu_struct uses StructModel to define the underlying numba type:Here's the output of the above script:
Comparing the two PTXs, we see an ABI difference: on the C++ side, struct arguments are of type
.b8[N]whereNis the size of the struct. On the numba side however, we note that structs have been decomposed into their members, and the function accepts a.b32for each member. (reason for promotion to 32-bit).This is because the C ABI calling convention is packing struct value arguments according to the Numba calling convention, not the C one. It just uses the default argument packer:
numba-cuda/numba_cuda/numba/cuda/target.py
Line 306 in fbbc040
Steps/Code to reproduce bug
Reproducer in code above.
Expected behavior
The C ABI calling convention should pack struct values according to the C ABI calling convention, yielding the same function prototypes as CUDA C/C++ code compiled with NVCC.
This will probably require the implementation of a new arg packer for the C calling convention.
Environment details (please complete the following information):
All environments.
Additional context
No other context.