# Changes

## Avengers

, 6 April
Assignment 2
In [https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesArray Blaise Barney's Notes on Array Processing], an example of Array Processing is discussed. The example "demonstrates calculations on 2-dimensional array element; a function is evaluated on each array element."
Using the pseudo-code provided, I wrote a program that creates a 2-dimensional array of size n (provided by the user) and populates it with random numbers. In this case, the function applied to each element of the array is 'std::rand()/100'. The code is available in the link below:
Profiling the program with '''gprof''' generated a call graph with this content:
[[File:Array_Processing_-_Call_Graph.jpeg]]
The call graph shows that the generateRandom() function is the hotspot: it accounts for 100% of the execution time. The function consists of 2 for loops, one nested in the other, which gives it a time complexity of O(n^2). It visits each element of the array serially and assigns a random number to it.
The computation for each element in the array is independent of the other elements, so this function is a good candidate for parallelization. Additionally, the array elements can be evenly distributed into sub-arrays, and a process can be assigned to each sub-array.
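The independence argument above can be illustrated with a minimal C++ sketch. It is not the original program: the names generateRandomRows and splitRows are hypothetical, and the row-wise split is only one possible way to form the sub-arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Fill rows [rowBegin, rowEnd) of the matrix. Each element is assigned
// without reading any other element, so row ranges can be processed
// independently of each other.
void generateRandomRows(Matrix& a, std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t i = rowBegin; i < rowEnd; ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            a[i][j] = std::rand() / 100;
}

// Split n rows into `parts` contiguous ranges whose sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>> splitRows(std::size_t n,
                                                           std::size_t parts) {
    assert(parts > 0);  // at least one worker is required
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        // The first (n % parts) ranges receive one extra row.
        std::size_t len = n / parts + (p < n % parts ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

Each range returned by splitRows could be handed to a separate process or thread. Note that std::rand() is not guaranteed to be thread-safe, so a real parallel version would likely need a separate generator per worker.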
=== Assignment 2 ===
For Assignment 2, we decided to parallelize the application selected by Bruno.
In the code, the function that took up a significant amount of time was the calculateDimensions() function. The flat profile indicates that this function takes 97.67% of the execution time.

Link to code on GitHub: https://github.com/brucremo/DPS915

==== Identifying Parallelizable Code ====

calculateDimensions() has 3 nested for loops. Each for loop sets the value of one of the triangle's sides. The inner-most for loop squares the two shorter sides and adds the squared results together. A condition then checks whether this sum is equal to the squared value of the hypotenuse. The results are printed when the condition is true.

[[File:NestedLoops.PNG]]

The nested for loops represent the serial way of calculating the dimensions.
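As a rough illustration of the serial search described above, here is a minimal C++ sketch. It is an assumption about the general shape of the loops, not the code shown in the image: the function name findTriplesSerial and the Triple struct are hypothetical, the triples are collected in a vector rather than printed, and integer multiplication is used for the squares.

```cpp
#include <cassert>
#include <vector>

struct Triple { int a, b, c; };

// Serial O(n^3) search: one loop per triangle side.
std::vector<Triple> findTriplesSerial(int maxHypotenuse) {
    // Keeps a*a + b*b within range, assuming a 32-bit int.
    assert(maxHypotenuse >= 0 && maxHypotenuse <= 32768);
    std::vector<Triple> found;
    for (int c = 1; c <= maxHypotenuse; ++c)   // hypotenuse
        for (int a = 1; a < c; ++a)            // first shorter side
            for (int b = a; b < c; ++b)        // second side; b >= a skips mirrored duplicates
                if (a * a + b * b == c * c)
                    found.push_back({a, b, c});
    return found;
}
```

With a maximum hypotenuse of 20, this sketch finds six triples, starting with (3, 4, 5).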

To parallelize the code mentioned above, we did the following:

1. Used CUDA device properties to design the grid and blocks.

2. Adjusted the number of threads used in the grid according to the value passed in by the user (the maximum hypotenuse value).

3. Allocated 2 arrays on the device, each initialized with the values from 1 to the maximum hypotenuse (given by the user as an argument):
* 1 array represents the hypotenuse side
* 1 array represents one of the sides of the triangle
* This was done by using CUDA's Thrust library.

[[File:allocInit.PNG]]

4. Calculated the number of blocks required and iterated through the thrust vector, passing each individual element to the kernel launch along with the two previously allocated arrays.

[[File:NBandLaunch.PNG]]

5. The kernel contains the instructions for verifying whether the value passed in is part of a Pythagorean triple. If a Pythagorean triple is found, the values are printed out.

[[File:kernel.PNG]]
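The steps above can be approximated on the CPU to show how the work is divided. The following C++ sketch is not the kernel from the image: it emulates a one-dimensional grid with ordinary loops, and the names ntpb, blocksFor, kernelBody and findTriplesGridStyle, the value 256, and the choice of which array each emulated thread indexes are all assumptions made for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

constexpr int ntpb = 256;  // hypothetical number of threads per block

struct Triple { int a, b, c; };

// Ceiling division: the smallest block count with nblocks * ntpb >= n.
int blocksFor(int n) { return (n + ntpb - 1) / ntpb; }

// Work of one emulated thread: it owns one hypotenuse value and scans the
// side array for a matching second side.
void kernelBody(int idx, int a, const std::vector<int>& hyp,
                const std::vector<int>& side, int n, std::vector<Triple>& out) {
    if (idx >= n) return;  // guard: the grid may contain more threads than elements
    int c = hyp[idx];
    for (int k = 0; k < n; ++k) {
        int b = side[k];
        if (b >= a && a * a + b * b == c * c) out.push_back({a, b, c});
    }
}

std::vector<Triple> findTriplesGridStyle(int n) {
    assert(n >= 0 && n <= 32767);  // keeps a*a + b*b within range, assuming a 32-bit int
    std::vector<int> hyp(n), side(n);
    std::iota(hyp.begin(), hyp.end(), 1);    // 1 .. n
    std::iota(side.begin(), side.end(), 1);  // 1 .. n
    std::vector<Triple> out;
    int nblocks = blocksFor(n);
    for (int a : side)                                  // one emulated launch per element
        for (int block = 0; block < nblocks; ++block)   // emulated block index
            for (int t = 0; t < ntpb; ++t)              // emulated thread index
                kernelBody(block * ntpb + t, a, hyp, side, n, out);
    return out;
}
```

On a GPU the two inner loops over block and thread indices would run concurrently, which is where the speed-up over the serial version would come from.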

==== Time Logging ====
To compare the timings of the serial version and the parallel version, we modified the original file to have 2 functions: calculateCUDA() and calculateSerial(). The execution of both of these functions was timed to see which function was quicker.

[[File:Timings.PNG]]

calculateSerial() contains the initial version of the application. It has 3 nested for loops and takes a serial approach to finding the Pythagorean triples. The time taken to find the triples is printed out after execution.

calculateCUDA() contains the parallelized version of the application. It sets the properties of a grid and its blocks, and launches a kernel to find the Pythagorean triples. The time taken to find the triples is printed out after execution.
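One way to time both functions consistently is a small helper built on std::chrono. This is a sketch under assumptions, not the timing code shown in the image; the helper name timeMs is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` once and return the elapsed wall-clock time in milliseconds.
double timeMs(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // a callable must be supplied
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling timeMs once with calculateSerial() and once with calculateCUDA() wrapped in lambdas would give comparable numbers. Because kernel launches are asynchronous, the CUDA function would need to synchronize with the device (for example with cudaDeviceSynchronize()) before returning, otherwise the measured time may exclude the kernel's execution.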

==== Results ====
Below is a graph that shows the time taken for execution of both the serial approach and the parallel approach.

[[File:TimeComparison.PNG]]

=== Assignment 3 ===
==== Using Shared Memory ====
To optimize our code, we used shared memory inside the kernel. For our purposes, declaring statically sized arrays in shared memory required the number of threads per block to be a compile-time constant. This meant that the number of threads per block could not be calculated at run time. Instead, we set the number of threads per block to 1024 and declared it as a constant at the beginning of the application. This allowed us to use shared memory inside the kernel and optimize our application.

Below is a picture of the optimized kernel:

[[File:QuickerKernel.PNG]]

==== Results ====

Averaged over the 7 problem sizes, the run time was reduced by approximately 216 milliseconds.

Below is a graph that illustrates the time saved on all 7 different problem sizes with the optimized kernel:

[[File:OptimizationGraph.PNG]]

==== Other Optimizations ====

Another optimization would be to improve the original algorithm itself. The pow() function is expensive because it is implemented in software. A better alternative is an operation that the hardware performs directly. In this case, it is quicker to use the (*) operator and multiply each value by itself than to raise it to the power of 2.

[[File:OptimizedAlgorithm.PNG]]

After modifying the code to reflect this change, the GPU and CPU methods take roughly the same execution time when the value of n is 800.
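The change from pow() to multiplication can be sketched in C++ as follows. The function names isTriplePow, isTripleMul and checkAgreement are hypothetical and are not taken from the project code; the sketch only shows that the two forms of the test give the same answers for small side lengths.

```cpp
#include <cassert>
#include <cmath>

// Test using pow(): every square goes through a general-purpose library call.
bool isTriplePow(int a, int b, int c) {
    return std::pow(a, 2) + std::pow(b, 2) == std::pow(c, 2);
}

// Test using the * operator: three integer multiplications and one addition.
bool isTripleMul(int a, int b, int c) {
    return a * a + b * b == c * c;
}

// Confirm that both versions agree for all a <= b <= c <= maxSide.
void checkAgreement(int maxSide) {
    for (int c = 1; c <= maxSide; ++c)
        for (int a = 1; a <= c; ++a)
            for (int b = a; b <= c; ++b)
                assert(isTriplePow(a, b, c) == isTripleMul(a, b, c));
}
```

The integer version also avoids converting the operands to floating point before comparing them.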