Commit 05628c4

Update main.cu
The kernel launch already passes N/2 as the N argument. Dividing by 2 again inside the kernel makes the result always wrong: it appears to be half of the expected result.
1 parent 9fc2507 commit 05628c4

1 file changed

Lines changed: 3 additions & 3 deletions

08_Reductions/src/main.cu

@@ -201,8 +201,8 @@ __global__ void reduceFinal(const float* __restrict input, int N)
 
 __shared__ float data[BLOCK_SIZE];
 // Already combine two values upon load from global memory.
-data[threadIdx.x] = id < N / 2 ? input[id] : 0;
-data[threadIdx.x] += id + N/2 < N ? input[id + N / 2] : 0;
+data[threadIdx.x] = id < N ? input[id] : 0;
+data[threadIdx.x] += (id + N < 2*N) ? input[id + N] : 0;
 
 for (int s = blockDim.x / 2; s > 16; s /= 2)
 {
@@ -312,4 +312,4 @@ Can you observe any difference in terms of speed / computed results?
 2) Do you have any other ideas how the reduction could be improved?
 Making it even faster should be quite challenging, but if you have
 some suggestions, try them out and see how they affect performance!
-*/
+*/
