Changes

GPUSquad

1,127 bytes added, 14:26, 11 April 2018

→‎Assignment 3

We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. The shared memory version of the kernel has to synchronize its threads while setting up shared memory every time the kernel runs, and the kernel runs 5000 times for each version of our code, so this added setup overhead made execution slower than the version with global memory.

~~One of the issues encountered when trying to profile was the fact that different group members were trying to work with different hardware. The hardware changed based on the rooms we ended up profiling in (open lab vs lab computers vs laptops with different video cards). The above graph was done on an open lab computer with a QuadroK620 card.~~

We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. The shared memory version of the kernel has to fill its shared memory tile and synchronize its threads every time the kernel runs, and the kernel runs 5000 times for each version of our code. The if statements required to set up the ghost cells for shared memory may also have created a certain amount of warp divergence, slowing down each individual kernel run.

Below are two images that show 4 consecutive kernel runs for the global and shared versions of the code. The shared kernel runs take more time than the global memory runs.

TIMES FOR THE GLOBAL KERNEL

[[File:kernelGlobalTimes.png]]

TIMES FOR THE SHARED KERNEL

[[File:sharedKernelTimes.png]]

Note how the run times for each kernel with shared memory are significantly longer than those with global memory.

To test whether this is an issue of warp divergence, here is another diagram with timings where the kernel sets up shared memory, using if statements to determine when to initialize ghost cells, but runs the Jacobi calculations using global memory:

[[File:GlobalInitSharedKernelTimes.png]]

It turns out that this version does not run as slowly as the shared kernel, so the issue is probably with resource allocation (trying to allocate more shared memory than a block can handle) rather than with warp divergence. Possible next step: try reducing the size of shared memory to 32x16?
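The two suspects named above, ghost-cell branches and shared memory size, can be roughed out with host-side arithmetic before changing the kernel. The following is a minimal sketch in plain C++, not the group's kernel. The function names `sharedTileBytes` and `countDivergentWarps` are hypothetical, and the sketch assumes a tile the same size as the thread block, a one-cell ghost border, 4-byte elements, 32-thread warps, and one if statement per tile side that only the edge threads enter.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of shared memory for one tile plus a ghost-cell border on every side.
// tileX, tileY: interior tile size (assumed equal to the block dimensions).
// halo: ghost-cell width per side (assumed 1 for a nearest-neighbour Jacobi stencil).
// elemSize: bytes per element (assumption: 4 for float, 8 for double).
std::size_t sharedTileBytes(int tileX, int tileY, int halo, std::size_t elemSize) {
    assert(tileX > 0 && tileY > 0 && halo >= 0);
    return static_cast<std::size_t>(tileX + 2 * halo) *
           static_cast<std::size_t>(tileY + 2 * halo) * elemSize;
}

// Counts the warps of a blockX x blockY block in which at least one ghost-cell
// test (left, right, top or bottom edge) is true for some lanes but not all.
// Threads are linearized as tid = y * blockX + x and grouped into consecutive
// runs of warpSize threads, which is how CUDA forms warps.
int countDivergentWarps(int blockX, int blockY, int warpSize = 32) {
    assert(blockX > 0 && blockY > 0 && warpSize > 0);
    const int threads = blockX * blockY;
    int divergent = 0;
    for (int first = 0; first < threads; first += warpSize) {
        int lanes = 0;
        int hits[4] = {0, 0, 0, 0};  // lanes on the left, right, top, bottom edge
        for (int tid = first; tid < first + warpSize && tid < threads; ++tid) {
            const int x = tid % blockX;
            const int y = tid / blockX;
            hits[0] += (x == 0);
            hits[1] += (x == blockX - 1);
            hits[2] += (y == 0);
            hits[3] += (y == blockY - 1);
            ++lanes;
        }
        bool mixed = false;
        for (int h : hits) {
            if (h > 0 && h < lanes) mixed = true;  // branch taken by only part of the warp
        }
        if (mixed) ++divergent;
    }
    return divergent;
}
```

Under these assumptions, `sharedTileBytes(32, 32, 1, 4)` gives 4624 bytes and `sharedTileBytes(32, 16, 1, 4)` gives 2448 bytes. These figures can be compared with the `sharedMemPerBlock` value that `cudaGetDeviceProperties` reports for the card being profiled. `countDivergentWarps(32, 32)` returns 32: with a 32-wide block every warp contains a left-edge and a right-edge thread, so every warp has a mixed branch, although each such branch only covers a single ghost-cell load. The actual tile size and element type of the group's kernel are not stated here, so the numbers are illustrative only.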

[TODO: INCLUDE PROFILING BREAKDOWNS OF INDIVIDUAL (NOT 5000) KERNEL RUNS TO SEE SPECIFIC TIMELINE FEATURES. EXPLAIN THE DIFFERENCES IN RUN TIMES]

Moverall (41 edits)
