Avengers

=== Assignment 2 ===
In [https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesArray Blaise Barney's Notes on Array Processing], an example of Array Processing is discussed. The example "demonstrates calculations on 2-dimensional array element; a function is evaluated on each array element."
I used the pseudo-code provided to write a program that creates and populates a 2-dimensional array of size n (supplied by the user) with random numbers. In this case, the function applied to each array element is 'std::rand()/100'. The code is available at the link below:
[https://github.com/jsidhu26/DPS915 Link to Array Processing code]
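For illustration, here is a minimal serial sketch of that approach, assuming the array is stored as one contiguous block; the actual implementation is in the repository linked above:

<pre>
// Minimal serial sketch (illustrative; see the repository above for the real code)
#include <cstdlib>
#include <ctime>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    int n = std::atoi(argv[1]);              // dimension n supplied by the user
    std::srand((unsigned)std::time(nullptr));

    std::vector<int> a(n * n);               // n x n array, stored contiguously

    // apply the function to each element of the 2-D array
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = std::rand() / 100;

    return 0;
}
</pre>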
[[File:Array_Processing_-_Call_Graph.jpeg]]
From the call graph, it is evident that the generateRandom() function is an obvious hotspot: it consumes 100% of the execution time. The function consists of two for loops, one nested inside the other, which gives it a time complexity of O(n^2).
The computation for each element of the array is independent of the rest of the elements, which makes this function a deserving candidate for parallelization. Additionally, the array elements can be evenly distributed into sub-arrays, with a process assigned to each sub-array; a sketch of this idea follows.
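As a hedged sketch of what that could look like in CUDA (the kernel name fillRandom, the seed handling, and the use of cuRAND are assumptions for illustration, not the team's actual code), each array element can be mapped to one device thread:

<pre>
// Illustrative CUDA sketch: one thread per element of the n x n array.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void fillRandom(int* a, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // flat element index
    if (i < n * n) {
        // std::rand() is host-only, so each thread draws its own
        // random number with the cuRAND device API instead
        curandState s;
        curand_init(seed, i, 0, &s);
        a[i] = (int)(curand(&s) % 10000) / 100;
    }
}

// launch: one thread per element
// int ntpb = 256;
// fillRandom<<<((n * n + ntpb - 1) / ntpb), ntpb>>>(d_a, n, 1234ULL);
</pre>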
For Assignment 2, we decided to parallelize the application selected by Bruno.

==== Identifying Parallelize-able Code ====

In the code, the function that took up a significant amount of time was calculateDimensions(). The flat profile indicates that this function takes 97.67% of the execution time.
 
Link to code on GitHub: https://github.com/brucremo/DPS915
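As an aside, a flat profile like the one referenced above can be produced with gprof; the commands below are illustrative, and the source and executable names are assumptions:

<pre>
# compile with profiling instrumentation, run once, then print the flat profile
g++ -O2 -pg -o app main.cpp
./app
gprof -p ./app gmon.out > flat_profile.txt
</pre>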
=== Assignment 3 ===
==== Using Shared Memory ====

To optimize our code, we used shared memory inside the kernel. For our purposes, allocating arrays in the kernel using shared memory required a constant value for the number of threads per block, which meant that the number of threads per block could not be calculated at run time. Instead, we set the number of threads per block to 1024 and declared it as a constant at the beginning of the application. This allowed us to use shared memory inside the kernel and optimize our application.
Below is a picture of the optimized kernel:
[[File:QuickerKernel.PNG]]
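Because the optimized kernel appears above only as a screenshot, here is a minimal sketch of the pattern it describes, with a hypothetical kernel body; the point is only the shared-memory declaration sized by a compile-time constant:

<pre>
// Hypothetical sketch of the shared-memory pattern described above.
const int NTPB = 1024;   // threads per block, fixed at compile time

__global__ void kernel(const float* in, float* out, int n) {
    // a statically sized shared array requires a compile-time constant,
    // which is why NTPB could not be computed at run time
    __shared__ float s[NTPB];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage data in shared memory
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x];               // per-block work on s[] goes here
}

// launch: kernel<<<(n + NTPB - 1) / NTPB, NTPB>>>(d_in, d_out, n);
</pre>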
 
==== Results ====
 
On average, the run time for all 7 different problem sizes was reduced by approximately 216 milliseconds.
Below is a graph that illustrates the time saved on all 7 different problem sizes with the optimized kernel:
[[File:OptimizationGraph.PNG]]
 
==== Other Optimizations ====
 
Another optimization would be to improve the original algorithm itself. The pow() function is very expensive because it is implemented in software. A better alternative is to use hardware instructions: in this case, it is quicker to use the multiplication operator (*) and multiply the values by themselves rather than raising them to the power of 2.
 
[[File:OptimizedAlgorithm.PNG]]
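A minimal before/after sketch of this change (the variable names are illustrative, not taken from the actual code):

<pre>
// before: pow() is a software routine, expensive just to square a value
out[i] = powf(in[i], 2.0f);

// after: a plain multiply compiles to a single hardware instruction
out[i] = in[i] * in[i];
</pre>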
 
After modifying the code to reflect this change, the GPU and CPU implementations have roughly equal execution times when the value of n is 800.