Update main.cu

asimay · web-flow · commit 05628c449b18 · 2023-08-14T11:32:57.000+08:00
The kernel launch already passes N/2 as the N argument of reduceFinal. Dividing by 2 again inside the kernel makes the result wrong: it appears to come out as half of the expected value.
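A minimal host-side sketch of the effect, not the kernel itself. It emulates only the first load step on the CPU and assumes one thread per id in [0, n), where n is the value passed as N (half the element count). The helper names loadBefore, loadAfter and sumLoads are hypothetical and do not appear in main.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the first load step in reduceFinal.
// `n` plays the role of the kernel's N argument, which the launch site
// is assumed to set to half the number of input elements.

// Load as written before the change: n is halved a second time.
float loadBefore(const std::vector<float>& input, std::size_t id, std::size_t n) {
    float v = id < n / 2 ? input[id] : 0.0f;
    v += id + n / 2 < n ? input[id + n / 2] : 0.0f;
    return v;
}

// Load as written after the change: n is used directly as the half-length.
float loadAfter(const std::vector<float>& input, std::size_t id, std::size_t n) {
    float v = id < n ? input[id] : 0.0f;
    v += id + n < 2 * n ? input[id + n] : 0.0f;
    return v;
}

// Sum the per-thread loads for hypothetical thread ids 0..n-1.
template <typename Load>
float sumLoads(const std::vector<float>& input, std::size_t n, Load load) {
    assert(input.size() >= 2 * n);  // the emulation needs 2*n elements
    float total = 0.0f;
    for (std::size_t id = 0; id < n; ++id) {
        total += load(input, id, n);
    }
    return total;
}
```

With 1024 inputs all equal to 1.0f and n = 512, sumLoads with loadBefore only reads the first 512 elements, while loadAfter reads all 1024, which matches the "half of the expected value" symptom described above.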
diff --git a/08_Reductions/src/main.cu b/08_Reductions/src/main.cu
@@ -201,8 +201,8 @@ __global__ void reduceFinal(const float* __restrict input, int N)
 
     __shared__ float data[BLOCK_SIZE];
     // Already combine two values upon load from global memory.
-    data[threadIdx.x] = id < N / 2 ? input[id] : 0;
-    data[threadIdx.x] += id + N/2 < N ? input[id + N / 2] : 0;
+    data[threadIdx.x] = id < N ? input[id] : 0;
+    data[threadIdx.x] += (id + N < 2*N) ? input[id + N] : 0;
 
     for (int s = blockDim.x / 2; s > 16; s /= 2)
     {
@@ -312,4 +312,4 @@ Can you observe any difference in terms of speed / computed results?
 2) Do you have any other ideas how the reduction could be improved?
 Making it even faster should be quite challenging, but if you have 
 some suggestions, try them out and see how they affect performance! 
-*/
+*/