Avengers

=== Assignment 2 ===
In [https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesArray Blaise Barney's Notes on Array Processing], an example of Array Processing is discussed. The example "demonstrates calculations on 2-dimensional array element; a function is evaluated on each array element."
I used the pseudo-code provided to write a program that creates and populates a 2-dimensional array of size n (supplied by the user) with random numbers. In this case, the function applied to each array element is 'std::rand()/100'. The code is available at the link below:
[https://github.com/jsidhu26/DPS915 Link to Array Processing code]
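For illustration, here is a minimal serial sketch of that approach, assuming the array is stored as one contiguous block; the actual implementation is in the repository linked above:

<pre>
// Minimal serial sketch (illustrative; see the repository above for the real code)
#include <cstdlib>
#include <ctime>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    int n = std::atoi(argv[1]);              // dimension n supplied by the user
    std::srand((unsigned)std::time(nullptr));

    std::vector<int> a(n * n);               // n x n array, stored contiguously

    // apply the function to each element of the 2-D array
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = std::rand() / 100;

    return 0;
}
</pre>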
[[File:Array_Processing_-_Call_Graph.jpeg]]
From the call graph, it is evident that the generateRandom() function is an obvious hotspot: it consumes 100% of the execution time. The function consists of two for loops, one nested inside the other, which gives it a time complexity of O(n^2).
The computation for each element of the array is independent of the rest of the elements, which makes this function a deserving candidate for parallelization. Additionally, the array elements can be evenly distributed into sub-arrays, with a process assigned to each sub-array; a sketch of this idea follows.
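As a hedged sketch of what that could look like in CUDA (the kernel name fillRandom, the seed handling, and the use of cuRAND are assumptions for illustration, not the team's actual code), each array element can be mapped to one device thread:

<pre>
// Illustrative CUDA sketch: one thread per element of the n x n array.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void fillRandom(int* a, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // flat element index
    if (i < n * n) {
        // std::rand() is host-only, so each thread draws its own
        // random number with the cuRAND device API instead
        curandState s;
        curand_init(seed, i, 0, &s);
        a[i] = (int)(curand(&s) % 10000) / 100;
    }
}

// launch: one thread per element
// int ntpb = 256;
// fillRandom<<<((n * n + ntpb - 1) / ntpb), ntpb>>>(d_a, n, 1234ULL);
</pre>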
For Assignment 2, we decided to parallelize the application selected by Bruno.

==== Identifying Parallelize-able Code ====

In the code, the function that took up a significant amount of time was calculateDimensions(). The flat profile indicates that this function takes 97.67% of the execution time.
 
Link to code on GitHub: https://github.com/brucremo/DPS915
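As an aside, a flat profile like the one referenced above can be produced with gprof; the commands below are illustrative, and the source and executable names are assumptions:

<pre>
# compile with profiling instrumentation, run once, then print the flat profile
g++ -O2 -pg -o app main.cpp
./app
gprof -p ./app gmon.out > flat_profile.txt
</pre>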
=== Assignment 3 ===
==== Using Shared Memory ====

To optimize our code, we used shared memory inside the kernel. For our purposes, allocating arrays in the kernel using shared memory required a constant value for the number of threads per block, which meant that the number of threads per block could not be calculated at run time. Instead, we set the number of threads per block to 1024 and declared it as a constant at the beginning of the application. This allowed us to use shared memory inside the kernel and optimize our application.
Below is a picture of the optimized kernel:
[[File:QuickerKernel.PNG]]
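Because the optimized kernel appears above only as a screenshot, here is a minimal sketch of the pattern it describes, with a hypothetical kernel body; the point is only the shared-memory declaration sized by a compile-time constant:

<pre>
// Hypothetical sketch of the shared-memory pattern described above.
const int NTPB = 1024;   // threads per block, fixed at compile time

__global__ void kernel(const float* in, float* out, int n) {
    // a statically sized shared array requires a compile-time constant,
    // which is why NTPB could not be computed at run time
    __shared__ float s[NTPB];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage data in shared memory
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x];               // per-block work on s[] goes here
}

// launch: kernel<<<(n + NTPB - 1) / NTPB, NTPB>>>(d_in, d_out, n);
</pre>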
 
==== Results ====
 
On average, the run time for all 7 different problem sizes was reduced by approximately 216 milliseconds.
Below is a graph that illustrates the time saved on all 7 different problem sizes with the optimized kernel:
[[File:OptimizationGraph.PNG]]
 
==== Other Optimizations ====
 
Another optimization would be to improve the original algorithm itself. The pow() function is very expensive because it is implemented in software. A better alternative is to use hardware instructions: in this case, it is quicker to use the multiplication operator (*) and multiply the values by themselves rather than raising them to the power of 2.
 
[[File:OptimizedAlgorithm.PNG]]
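A minimal before/after sketch of this change (the variable names are illustrative, not taken from the actual code):

<pre>
// before: pow() is a software routine, expensive just to square a value
out[i] = powf(in[i], 2.0f);

// after: a plain multiply compiles to a single hardware instruction
out[i] = in[i] * in[i];
</pre>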
 
After modifying the code to reflect this change, the GPU and CPU implementations have roughly equal execution times when the value of n is 800.