
= Group 6 =


# [mailto:xhuang110@myseneca.ca?subject=gpu610 Xiaowei Huang]

# [mailto:yyuan34@myseneca.ca?subject=gpu610 Yihang Yuan]

[mailto:xhuang110@myseneca.ca,yyuan34@myseneca.ca,zzhou33@myseneca.ca?subject=dps901-gpu610 Email All]

== Progress ==

==== The Monte Carlo Simulation (PI Calculation) ====

Got the code from here:

https://rosettacode.org/wiki/Monte_Carlo_methods#C.2B.2B

It uses random sampling to define constraints on the value and then makes a sort of "best guess."


As this algorithm is based on random sampling, there is only one function that does all the work.

Flat profile:

<pre>
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
101.05      0.02     0.02                             calculatePI(int, float*)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z11calculatePIiPf
</pre>

Call graph:

<pre>
granularity: each sample hit covers 2 byte(s) for 0.47% of 2.11 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    2.11    0.00                 calculatePI(int, float*) [1]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [17]
[9]      0.0    0.00    0.00       1         _GLOBAL__sub_I__Z11calculatePIiPf [9]
-----------------------------------------------

Index by function name
</pre>

'''Results for different scale of calculation'''

[[File:Yihang.JPG]]

=== Assignment 2 - Parallelize ===

[[File:yihang_p2_profile.png]]

In conclusion, by parallelizing the serial version of the algorithm, we see an immediate performance improvement.

=== Assignment 3 - Optimize ===

The main kernel optimization was done in two parts.

The new '''third kernel''' sums up the PI results generated in each block by applying a parallel reduction, which brings the Big-O classification of the summation down to O(log n). It also checks whether an index falls outside the valid range, so out-of-range threads do not contribute garbage values to the sum. Each block's total is written out at that block's index in d_b.

<pre>
// kernel 3
// the third kernel sums the results within each block
__global__ void sumPi(float* d_a, float* d_b, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float s[ntpb];
    // guard the load so out-of-range threads contribute 0
    s[t] = (i < n) ? d_a[i] : 0.0f;
    __syncthreads();
    // sum the data in shared memory (tree reduction)
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if ((t % (2 * stride) == 0) && (i + stride < n)) {
            s[t] += s[t + stride];
        }
        __syncthreads();
    }
    // store this block's sum in d_b
    if (t == 0) {
        d_b[blockIdx.x] = s[0];
    }
}
</pre>
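The stride loop above is the classic tree reduction: each pass halves the number of active elements, so only log2(blockDim.x) passes are needed. A plain C++ sketch of the same access pattern (my own illustration, not the assignment code; the outer loop plays the role of the strides, the inner loop plays the role of the threads between two __syncthreads() calls) makes the O(log n) structure visible:

```cpp
#include <vector>

// Tree-reduce a block of values in place, mirroring the kernel's stride
// loop: after pass k, every element whose index is a multiple of 2^k
// holds the sum of its 2^k-wide segment; v[0] ends up with the total.
float blockReduce(std::vector<float>& v) {
    std::size_t n = v.size();
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        // one pass = what all threads do between __syncthreads() calls
        for (std::size_t t = 0; t < n; ++t) {
            if (t % (2 * stride) == 0 && t + stride < n)
                v[t] += v[t + stride];
        }
    }
    return v[0];
}
```

The `t + stride < n` check here plays the same role as the kernel's `i + stride < n` guard: it lets the reduction handle a partially filled last block without reading past the valid data.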

To reduce data transfer from device to host, a new '''fourth kernel''' is created. Instead of passing all the per-block results back to the host, it sums them on the device, again applying the reduction algorithm.

<pre>
// kernel 4
// the fourth kernel sums the results of all blocks
__global__ void accumulate(float* c, const int nblocks) {
    // store the elements of c[] in shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float s[ntpb];
    // guard the load so out-of-range threads contribute 0
    s[t] = (i < nblocks) ? c[i] : 0.0f;
    __syncthreads();
    // sum the data in shared memory (tree reduction)
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if ((t % (2 * stride) == 0) && (i + stride < nblocks)) {
            s[t] += s[t + stride];
        }
        __syncthreads();
    }
    // store the final sum in c[0]
    if (t == 0) {
        c[blockIdx.x] = s[0];
    }
}
</pre>

Even though the runtime is close to that of the previous version after implementing the optimization, the kernels now carry a heavier load than before: they also finish the accumulation of all the randomly generated PI samples on the device. I consider this optimization to perform well.

[[File:Optimized PI calculation and accumulation.jpg]]