28:18 Each thread uses temp[threadIdx.x : threadIdx.x + 2*RADIUS]. That means each thread should load A[base + threadIdx.x - RADIUS] into temp[threadIdx.x] and A[base + threadIdx.x + RADIUS] into temp[threadIdx.x + 2*RADIUS]. The code on the slide works differently, so it seems it does not work correctly.
@frankj6650 1 year ago
I think nothing is wrong: array A also starts at index 0. You can see this by left-aligning the two sequences in the PPT.
@weizhang5424 6 months ago
Yes, the PPT illustration is a little misleading: Input (A) and Output (B) are supposed to be left-aligned. Several invariants: len(Output)=n, len(Input)=n+2*RADIUS, and Output[k]=\sum_{i=k}^{k+2*RADIUS}Input[i]. The "cooperative fetching" logic simply asks each thread i to load Input[base+i], and asks the first 2*RADIUS threads in each thread block to each load 1 extra input element, so that the block's shared buffer holds the THREADS_PER_BLOCK+2*RADIUS inputs its outputs need. Personally, I think it would be more natural to have the last 2*RADIUS threads each load the extra element. But the code as it is works fine.
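To check the left-aligned reading, here is a minimal Python sketch that simulates the loading pattern block by block and compares it with the direct sum. It is not the slide's code: the function names, the `THREADS_PER_BLOCK` constant, the bounds checks, and the sample sizes are assumptions chosen for illustration, and the sketch assumes 2*RADIUS <= THREADS_PER_BLOCK.

```python
RADIUS = 2
THREADS_PER_BLOCK = 4  # assumed block size; must be >= 2 * RADIUS here


def window_sum_shared(A, n):
    """Simulate the cooperative fetch, one thread block at a time."""
    assert len(A) == n + 2 * RADIUS  # len(Input) = n + 2*RADIUS
    B = [0] * n                      # len(Output) = n
    num_blocks = (n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    for block in range(num_blocks):
        base = block * THREADS_PER_BLOCK
        temp = [0] * (THREADS_PER_BLOCK + 2 * RADIUS)  # stands in for shared memory

        # Load phase: every thread loads one element, left-aligned with A.
        for tid in range(THREADS_PER_BLOCK):
            if base + tid < n + 2 * RADIUS:
                temp[tid] = A[base + tid]
            # The first 2*RADIUS threads also load one extra element each.
            extra = base + THREADS_PER_BLOCK + tid
            if tid < 2 * RADIUS and extra < n + 2 * RADIUS:
                temp[tid + THREADS_PER_BLOCK] = A[extra]

        # On the GPU a __syncthreads() barrier would sit here.

        # Compute phase: thread tid reads temp[tid : tid + 2*RADIUS + 1].
        for tid in range(THREADS_PER_BLOCK):
            out_idx = base + tid
            if out_idx < n:
                B[out_idx] = sum(temp[tid:tid + 2 * RADIUS + 1])
    return B


def window_sum_reference(A, n):
    """Direct definition: Output[k] = sum of Input[k .. k + 2*RADIUS]."""
    return [sum(A[k:k + 2 * RADIUS + 1]) for k in range(n)]
```

With `A = list(range(14))` and `n = 10`, both functions give the same list, which is what the left-aligned picture predicts: no thread ever needs `A[base + threadIdx.x - RADIUS]`.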