We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. The shared memory version has to copy its data into shared memory and synchronize the threads of each block every time the kernel runs. Because the kernel was launched 5000 times for each version of our code, this per-launch setup overhead made the shared memory version slower than the version that reads from global memory.
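To make the comparison concrete, here is a minimal sketch of the two variants. The report does not show the project's kernels, so everything here is an assumption for illustration: the 3x3 averaging filter, the names `filterGlobal`, `filterShared`, and `d_filter`, the 16x16 block size, and the grid dimensions. Both variants keep the coefficients in constant memory; only the second stages the input in shared memory, which is where the extra load loop and the `__syncthreads()` barrier come from.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical block edge (16x16 threads)
#define R 1      // hypothetical filter radius, i.e. a 3x3 filter
#define FW (2 * R + 1)

// Filter coefficients live in constant memory (read-only, cached).
__constant__ float d_filter[FW * FW];

// Variant A: constant memory for the filter, input read directly from global memory.
__global__ void filterGlobal(const float *in, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int sx = min(max(x + dx, 0), w - 1);  // clamp at the borders
            int sy = min(max(y + dy, 0), h - 1);
            acc += d_filter[(dy + R) * FW + (dx + R)] * in[sy * w + sx];
        }
    out[y * w + x] = acc;
}

// Variant B: same result, but each block first stages its tile plus halo in
// shared memory. Assumes the kernel is launched with TILE x TILE threads.
__global__ void filterShared(const float *in, float *out, int w, int h) {
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];
    int x0 = blockIdx.x * TILE - R;  // global coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - R;
    // Cooperative load: every thread copies one or more tile elements.
    for (int ty = threadIdx.y; ty < TILE + 2 * R; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2 * R; tx += TILE) {
            int sx = min(max(x0 + tx, 0), w - 1);
            int sy = min(max(y0 + ty, 0), h - 1);
            tile[ty][tx] = in[sy * w + sx];
        }
    __syncthreads();  // the tile must be complete before any thread reads it
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            acc += d_filter[(dy + R) * FW + (dx + R)] *
                   tile[threadIdx.y + R + dy][threadIdx.x + R + dx];
    out[y * w + x] = acc;
}

int main() {
    const int w = 100, h = 70;  // deliberately not multiples of TILE
    const size_t bytes = (size_t)w * h * sizeof(float);
    float h_filter[FW * FW];
    for (int i = 0; i < FW * FW; ++i) h_filter[i] = 1.0f / (FW * FW);

    float *h_in = (float *)malloc(bytes);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    for (int i = 0; i < w * h; ++i) h_in[i] = (float)(i % 17);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter));

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);

    filterGlobal<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(h_a, d_out, bytes, cudaMemcpyDeviceToHost);
    filterShared<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(h_b, d_out, bytes, cudaMemcpyDeviceToHost);

    // Both variants should compute the same values; only the memory path differs.
    float maxDiff = 0.0f;
    for (int i = 0; i < w * h; ++i) maxDiff = fmaxf(maxDiff, fabsf(h_a[i] - h_b[i]));
    printf("max difference between variants: %g\n", maxDiff);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_a); free(h_b);
    return 0;
}
```

With a filter this small each input element is reused only a few times, so the staging loop and barrier in `filterShared` may cost more than the global reads they save. Whether that holds for the project's real kernels depends on their access pattern and on the card.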
One issue we encountered when profiling was that group members were working on different hardware. The hardware depended on where we ended up profiling (open lab computers, lab computers, or laptops with different video cards). The graph above was produced on an open lab computer with a Quadro K620 card.
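Since results from different machines are not directly comparable, it may help to record the device alongside each profile. The following is a small sketch, not part of the original project, that prints the properties most relevant to this comparison using the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("device %d: %s\n", i, p.name);
        printf("  compute capability:      %d.%d\n", p.major, p.minor);
        printf("  multiprocessors:         %d\n", p.multiProcessorCount);
        printf("  shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("  constant memory:         %zu bytes\n", p.totalConstMem);
        printf("  global memory:           %zu MiB\n", p.totalGlobalMem >> 20);
    }
    return 0;
}
```

Saving this output next to each set of timings makes it clear which card a given graph came from.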
[TODO: INCLUDE PROFILING BREAKDOWNS OF INDIVIDUAL (NOT 5000) KERNEL RUNS TO SEE SPECIFIC TIMELINE FEATURES. EXPLAIN THE DIFFERENCES IN RUN TIMES]
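One possible way to time a single launch, rather than the full 5000-launch loop, is with CUDA events. The sketch below uses a stand-in kernel named `placeholderKernel` and an arbitrary problem size; both are hypothetical and would be replaced by the version being profiled.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whichever version is being profiled.
__global__ void placeholderKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;  // arbitrary size for illustration
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed warm-up launch so one-time setup costs are not counted.
    placeholderKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    // Time exactly one launch.
    cudaEventRecord(start);
    placeholderKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("single launch: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Event timing gives only the elapsed time of the launch; a timeline view from a profiler is still needed to see where that time goes inside the kernel.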