57

edits
# Changes

## Group 6

,→Assignment 3 - Optimize

=== Assignment 3 - Optimize ===

The main kernel optimization did on 2 parts

The new '''third kernel''' is to sum up the PI results which generated in each block, by applying serial reduction algorithm which could reduce its Big-O classification is O(log n). Beside, it also sets the limitation to check whether the calculation is out of the range. As it can stop the sum-up to those useless thread value. And the total value of that block is passed to the first element of that block.

<pre>

// kernel 3

// the third kernel sum the result in each block

__global__ void sumPi(float* d_a, float*d_b, const int n) {

int i = blockIdx.x * blockDim.x + threadIdx.x;

int t = threadIdx.x;

__shared__ float s[ntpb];

s[t] = d_a[i];

__syncthreads();

// sum the data in shared memory

for (int stride = 1; stride < blockDim.x; stride <<= 1) {

if ((t % (2 * stride) == 0) && (i + stride < n)) {

s[t] += s[t + stride];

}

__syncthreads();

}

// store the sum in d_b;

if (t == 0) {

d_b[blockIdx.x] = s[0];

}

}

</pre>

To reduce the data transfer from device to host, the new '''forth kernel''' is created. It uses to sum up all blocks results from passing back to the host, which also applies the reduction algorithm.

<pre>

// kernel 4

// the forth kernel sum the result of all blocks

__global__ void accumulate(float* c, const int nblocks) {

// store the elements of c[] in shared memory

int i = blockIdx.x * blockDim.x + threadIdx.x;

int t = threadIdx.x;

__shared__ float s[ntpb];

s[t] = c[i];

__syncthreads();

// sum the data in shared memory

for (int stride = 1; stride < blockDim.x; stride <<= 1) {

if ((t % (2 * stride) == 0) && (i + stride < nblocks)) {

s[t] += s[t + stride];

}

__syncthreads();

}

// store the sum in c[0]

if (t == 0) {

c[blockIdx.x] = s[0];

}

}

</pre>

Even though the runtime is closed after implementing the optimization, the kernels has a heavy load than before. It also finishes the accumulation of all randomly generated PI. I consider this optimization performs well.

Here is my final source code