GPUSquad
POTENTIAL FOR PARALLELIZATION:
The compress() function performs similar operations across a collection of text; however, it relies on a dictionary and an expanding string that is tokenized against that dictionary. This could potentially be parallelized through a divide-and-conquer strategy where GPU blocks with shared caches maintain their own dictionaries and iterate over their own block of text. This, however, would not be particularly useful, because LZW compression relies on a globally accessible dictionary that is continuously modified. Having multiple threads try to update the same dictionary tokens in global memory would create race conditions, and creating block-local dictionaries in shared memory would reduce the efficacy of having a large, globally available dictionary to tokenize strings. In short, LZW compression is probably a poor candidate for parallelization because of the way parallel threads would have to update the token dictionary.
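For reference, a minimal serial sketch of the kind of dictionary-based compress() described above (a hypothetical illustration, not the code we benchmarked): every emitted token depends on dictionary entries added by earlier input, which is what makes the dictionary hard to split across GPU blocks.
<pre>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical serial LZW compress(): the dictionary grows continuously and
// each emitted token may depend on any entry added earlier.
std::vector<int> compress(const std::string& text) {
    std::unordered_map<std::string, int> dict;
    for (int c = 0; c < 256; ++c)                      // seed with single characters
        dict[std::string(1, static_cast<char>(c))] = c;

    std::vector<int> out;
    std::string w;                                     // the expanding string being tokenized
    int nextCode = 256;
    for (char c : text) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                                    // keep growing the current match
        } else {
            out.push_back(dict[w]);                    // emit token for the longest known match
            dict[wc] = nextCode++;                     // later input may reuse this new entry
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}
</pre>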
==== Idea 3 - MergeSort ====
The iters loop launches a kernel for each iteration, and we use double buffering (choosing to launch the kernel with either (d_a, d_b) or (d_b, d_a) as arguments) since we can't simply swap pointers as in the serial code.
 
Double buffering is needed so that there is a static version of the matrix while the current set of calculations is done. Since an element of the matrix is calculated from the elements on all four sides of it, if one of those neighbours were updated while another element was being calculated, you would end up with race conditions and the wrong answer.
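A minimal sketch of the double-buffered launch pattern described above (jacobiStep and runIterations are hypothetical names; d_a, d_b and iters correspond to the buffers and iteration count mentioned in the text):
<pre>
#include <cuda_runtime.h>

// Placeholder 1D stencil kernel: reads only from b, writes only to a.
__global__ void jacobiStep(float* a, const float* b, int n) {
    int j = blockDim.x * blockIdx.x + threadIdx.x + 1;
    if (j < n - 1)
        a[j] = 0.5f * (b[j - 1] + b[j + 1]);   // placeholder update from neighbours in b
}

void runIterations(float* d_a, float* d_b, int n, int iters) {
    const int ntpb = 32;
    const int nBlocks = (n + ntpb - 1) / ntpb;
    for (int i = 0; i < iters; ++i) {
        // Kernels on the default stream run in order, so each launch sees the
        // previous iteration's output; the argument order does the "swap".
        if (i % 2 == 0)
            jacobiStep<<<nBlocks, ntpb>>>(d_a, d_b, n);   // read d_b, write d_a
        else
            jacobiStep<<<nBlocks, ntpb>>>(d_b, d_a, n);   // read d_a, write d_b
    }
    cudaDeviceSynchronize();   // wait for the final iteration before using the result
}
</pre>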
<nowiki>****************</nowiki>
int ymin, int ymax, float dx, float dy, float dxxinv, float dyyinv) {//MODIFY to suit algorithm
int j = blockDim.x * blockIdx.x + threadIdx.x + 1; 
//above: we are using block and thread indexes to replace some of the iteration logic
if (j < n - 1) {
=== Assignment 3 ===
Optimization techniques used/attempted:
* Get rid of the for loop in the kernel and use 2D threading within blocks
* Use gpu constant memory for jacobi calculation constants
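A minimal sketch of both techniques, using the constant names that appear in the kernel excerpt further down this page (d_m, d_nBlocks, d_ntpbXY, d_dxxinv, d_dyyinv, d_dcent, etc.); the kernel and host function names here are assumptions for illustration only.
<pre>
#include <cuda_runtime.h>

// Jacobi constants kept in __constant__ memory so every thread reads them
// through the constant cache (names match the kernel excerpt below).
__constant__ int   d_m, d_nBlocks, d_ntpbXY;
__constant__ float d_xmin, d_ymin, d_dx, d_dy, d_dxxinv, d_dyyinv, d_dcent;

const int ntpbXY = 32;   // 32x32 threads per block

// Host-side setup, done once before the iteration loop.
void setJacobiConstants(int m, float dxxinv, float dyyinv, float dcent) {
    cudaMemcpyToSymbol(d_m, &m, sizeof(int));
    cudaMemcpyToSymbol(d_dxxinv, &dxxinv, sizeof(float));
    cudaMemcpyToSymbol(d_dyyinv, &dyyinv, sizeof(float));
    cudaMemcpyToSymbol(d_dcent, &dcent, sizeof(float));
    // ...and similarly for the remaining constants
}

// 2D threading inside each block replaces the kernel's inner for loop:
// x still comes from the 1D grid of blocks, y entirely from threadIdx.y.
__global__ void jacobi2D(float* a, const float* b) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = threadIdx.y;
    int ij = i + d_m * j;            // flattened index, as in the kernel excerpt below
    // ... stencil update as shown in the kernel excerpt further down ...
    (void)a; (void)b; (void)ij;      // body elided in this sketch
}
</pre>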
The ghost cell pattern is a technique that allows threads in a particular block to access data that would normally reside inside another block.
We make a shared memory array larger than the number of threads per block in a particular dimension; then, when we are at the first or last thread in that dimension, we copy a neighbouring cell's data, based on its global index, into our local ghost cell.
The following info and images are taken from: http://people.csail.mit.edu/fred/ghost_cell.pdf
Blocks at either end of the grid only need a ghost cell on one side, while all the blocks in the middle need ghost cells on both the left and right sides.
In A2 and A3 we only scaled the columns (or rather, we would have, had we not mixed up the indexing: the 2D version scales along the m dimension, while the 1D versions scale properly along the n dimension); the other dimension always stayed at 32. This made it easy for us to keep the grid of blocks one dimensional.
If we scaled the rows as well as the columns, we would most likely need to use a two dimensional grid of blocks.
This would make the problem more difficult: we would then need not only ghost columns but also ghost rows on the top and bottom of blocks, and it would likely cause synchronization problems, because we would need to make sure that block execution was synchronized along two dimensions instead of just one.
[[File:dps915_gpusquad_2Dghostcells.png]]
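A rough sketch of what the shared tile and its loads might look like with ghost cells in both dimensions (something we did not implement); jacobi2DGhost is a hypothetical name, while ntpbXY, d_m, b and ij are the same names used in the kernel excerpt below, and boundary checks are omitted.
<pre>
// Hypothetical kernel with ghost cells in BOTH dimensions: one extra row and
// column on every side of the shared tile.
const int ntpbXY = 32;
__constant__ int d_m;

__global__ void jacobi2DGhost(float* a, const float* b) {
    int ti = threadIdx.x, tj = threadIdx.y;
    int i = blockDim.x * blockIdx.x + ti;
    int j = blockDim.y * blockIdx.y + tj;   // y now spans multiple blocks too
    int ij = i + d_m * j;

    __shared__ float bShared[ntpbXY + 2][ntpbXY + 2];
    bShared[ti + 1][tj + 1] = b[ij];                          // this thread's own cell

    // Edge threads also fill the ghost cells beside them:
    if (ti == 0)          bShared[0][tj + 1]          = b[ij - 1];     // left ghost column
    if (ti == ntpbXY - 1) bShared[ntpbXY + 1][tj + 1] = b[ij + 1];     // right ghost column
    if (tj == 0)          bShared[ti + 1][0]          = b[ij - d_m];   // bottom ghost row
    if (tj == ntpbXY - 1) bShared[ti + 1][ntpbXY + 1] = b[ij + d_m];   // top ghost row
    __syncthreads();

    // ... stencil update reads only bShared from here on ...
    (void)a;
}
</pre>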
<nowiki>****************</nowiki>
CODE FOR 1D GRID WITH 2D THREAD SETUP (kernel contains shared and global memory versions)
<nowiki>****************</nowiki>
int ti = threadIdx.x;
int tj = threadIdx.y;//thread's y index within the block, used by the shared-memory indexing below
int tij = threadIdx.x + ntpbXY*threadIdx.y;
//__shared__ float bShared[ntpbXY*(ntpbXY+2)];//add a couple of ghost columns __shared__ float bShared[ntpbXY+2][ntpbXY];
int ij = i + d_m*j;
 
//GLOBAL MEMORY VERSION (COMMENT OUT SHARED SECTION BELOW TO RUN):==================================
// if (!(blockIdx.x == 0 && threadIdx.x == 0 || blockIdx.x == (d_nBlocks - 1) && threadIdx.x == (d_ntpbXY - 1)) && threadIdx.y != 0 && threadIdx.y != (d_ntpbXY - 1)) {
// float x = d_xmin + i*d_dx, y = d_ymin + j*d_dy;
// float input = abs(x) > 0.5 || abs(y) > 0.5 ? 0 : 1;
// a[ij] = (input + d_dxxinv*(b[ij - 1] + b[ij + 1])
// + d_dyyinv*(b[ij - d_m] + b[ij + d_m]))*d_dcent;
// }
 
 
//SHARED MEMORY SETUP (COMMENT OUT FOR GLOBAL VERSION)
__shared__ float bShared[ntpbXY+2][ntpbXY];
//NOTES ON SHARED INDEX OFFSET:
//x offset is -1, so the shared cell holding this thread's global element b[ij] is bShared[ti+1][tj]
//set right ghost cells
if (threadIdx.x == ntpbXY - 1 && blockIdx.x != d_nBlocks - 1)
bShared[ti+2][tj] = b[ij + 1];//right neighbour from the next block becomes this block's right ghost cell
//set the shared cell for the current thread from global memory (ti+1 accounts for the left ghost column)
bShared[ti+1][tj] = b[ij];
__syncthreads();
 
//SHARED MEMORY VERSION:==========
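//A sketch of what the shared-memory update might look like, given the offset notes above:
//this thread's element is bShared[ti+1][tj], so its x-neighbours are bShared[ti][tj] and
//bShared[ti+2][tj], while the y-neighbours stay inside the tile at tj-1 and tj+1. It assumes
//the left ghost cells were filled the same way as the right ones above.
if (!(blockIdx.x == 0 && threadIdx.x == 0 || blockIdx.x == (d_nBlocks - 1) && threadIdx.x == (d_ntpbXY - 1)) && threadIdx.y != 0 && threadIdx.y != (d_ntpbXY - 1)) {
float x = d_xmin + i*d_dx, y = d_ymin + j*d_dy;
float input = abs(x) > 0.5 || abs(y) > 0.5 ? 0 : 1;
a[ij] = (input + d_dxxinv*(bShared[ti][tj] + bShared[ti+2][tj])
+ d_dyyinv*(bShared[ti+1][tj - 1] + bShared[ti+1][tj + 1]))*d_dcent;
}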
<nowiki>****************</nowiki>
A NOTE ON SCALABILITY: In our attempts to make the kernel scalable with ghost cells, we scaled along one dimension. However, we were inconsistent in our scaling: the 1D kernel scaled along the n (y) dimension, while the 2D kernels scaled along the m (x) dimension. Scaling along the x dimension, while allowing results to be testable between the serial and 2D parallelized versions of the code, produced distributions that were strangely banded and skewed. In other words, we made the code render weird things faster:

[[File:MDimensionScale.png]]

FINAL TIMINGS
<pre style="color: red"> THE GRAPH IMMEDIATELY BELOW IS INCORRECT: there was an error recording the 1D runtimes for assignment 2</pre>
<nowiki>****************</nowiki>
[[File:dps915_gpusquad_a3chart.png]]
 
<nowiki>****************</nowiki>
 
PROPER TIMINGS:
 
[[File:Code_timings2.png]]
 
Blue = Serial (A1)
Orange = Parallel (A2)
Grey = Optimized Global (A3)
Yellow = Optimized Shared (A3)
 
The above graph includes the total run times for the serial code, the 1D kernel from assignment 2, a kernel with global and constant memory with a 2D thread arrangement, and the same 2D arrangement but with shared memory utilizing ghost cells.
 
We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. The shared memory version of the kernel had to set up its shared tile and synchronize its threads every time the kernel ran, and the kernel was run 5000 times for each version of our code, so this extra per-launch overhead actually made execution slower than the version that read directly from global memory. In addition, the if statements required to set up the ghost cells for shared memory may have created a certain amount of warp divergence, further slowing down each individual kernel.
 
Below are two images that show four consecutive kernel runs for both the global and shared versions of the code. It is apparent that the shared kernel runs actually take more time than the global memory versions.
 
 
TIMES FOR THE GLOBAL KERNEL
[[File:kernelGlobalTimes.png]]
 
 
TIMES FOR THE SHARED KERNEL
 
[[File:sharedKernelTimes.png]]
 
Note how the run times for each kernel with shared memory are significantly longer than those with global.
 
To try to determine whether this issue was one of warp divergence, we timed a kernel that initialized shared memory in the same way but referenced global memory when carrying out the actual calculations:
 
[[File:GlobalInitSharedKernelTimes.png]]
 
The figure above shows the run of a kernel that allocated shared memory using the same series of if statements, but executed its instructions using global memory. While slightly slower than the purely global run where no ghost cells are initialized, it still takes less time than the full shared memory version. It is likely that our group's attempts to employ shared memory failed because we did not adequately schedule or partition the shared memory, and the kernel was slowed as a result. The shared memory footprint per block was 34x32 (the dimensions of the shared array) x 4 bytes (the size of a float) = 4,352 bytes, well under the roughly 48 KB per-block maximum for a device of compute capability 5.0 (on which this series of individual kernel timings was performed). With this in mind, it is still unclear why the shared memory implementation performed more poorly than the global memory implementation.
 
Unfortunately, our group's inability to use the profiling tools effectively has left this discrepancy a mystery.
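One way to at least rule out a shared memory capacity or occupancy limit without the visual profiler would be to query the kernel's resource usage directly from the CUDA runtime API. A small sketch, where jacobiShared is a hypothetical stand-in for our shared memory kernel:
<pre>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real shared-memory Jacobi kernel.
__global__ void jacobiShared(float* a, const float* b) { }

void reportKernelResources() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, jacobiShared);
    std::printf("static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
    std::printf("registers per thread:        %d\n", attr.numRegs);

    int maxBlocksPerSM = 0;
    const int threadsPerBlock = 32 * 32;   // the 2D 32x32 block used on this page
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, jacobiShared,
                                                  threadsPerBlock, 0 /* dynamic smem */);
    std::printf("max resident blocks per SM:  %d\n", maxBlocksPerSM);
}
</pre>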
 
In conclusion, while it may be possible to parallelize our chosen algorithm effectively, doing so would involve ensuring that shared memory is properly synchronized across two block dimensions (two dimensions of ghost cells rather than the one we implemented) and that shared memory is allocated appropriately so that maximum occupancy is achieved on the GPU. Unfortunately, our attempts fell short: while implementing constant memory did seem to speed up the kernel a bit, our solution was not fully scalable in both dimensions, and shared memory was not implemented in a way that improved kernel efficiency.