Note that the run times for each kernel using shared memory are significantly longer than for those using global memory.
To test whether this is an issue of warp divergence, here is another diagram with timings for a kernel that sets up shared memory, using if statements to decide when to initialize ghost cells, but runs the Jacobi calculations using global memory:
It turns out that this does not run as slowly either, so the issue is probably with resource allocation (trying to allocate more shared memory than a block can handle). Try reducing the size of shared memory to 32x16?
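One way to check the resource-allocation guess is to ask the runtime how much shared memory the kernel uses and how many blocks fit on a multiprocessor. The sketch below is not the kernel from these notes. `jacobiHybrid`, `tileBytes`, `blocksPerSM`, the `float` data type, the 4-point averaging stencil, and the one-cell ghost border are all assumptions made for illustration; only the 32x16 tile size comes from the text. A request over the per-block limit would normally fail at compile or launch time rather than run slowly, so reduced occupancy may be the more useful thing to look at.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed tile size: 32x16 is the size the notes suggest trying.
constexpr int TILE_X = 32;
constexpr int TILE_Y = 16;

// Bytes of shared memory for one tile plus an assumed one-cell ghost border.
constexpr size_t tileBytes(int tx, int ty) {
    return static_cast<size_t>(tx + 2) * (ty + 2) * sizeof(float);
}

// Hypothetical hybrid kernel: fills a shared tile, using if statements to
// pick the threads that load ghost cells, but computes the Jacobi update
// from global memory. Expects blockDim = (TILE_X, TILE_Y).
__global__ void jacobiHybrid(const float* in, float* out, int nx, int ny) {
    // volatile keeps the compiler from dropping the otherwise unused stores.
    __shared__ volatile float tile[TILE_Y + 2][TILE_X + 2];

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int lx = threadIdx.x + 1;
    int ly = threadIdx.y + 1;

    if (gx < nx && gy < ny) {
        tile[ly][lx] = in[gy * nx + gx];
        // Threads on the tile edge also load the neighbouring ghost cell.
        if (threadIdx.x == 0 && gx > 0)
            tile[ly][0] = in[gy * nx + gx - 1];
        if (threadIdx.x == TILE_X - 1 && gx < nx - 1)
            tile[ly][lx + 1] = in[gy * nx + gx + 1];
        if (threadIdx.y == 0 && gy > 0)
            tile[0][lx] = in[(gy - 1) * nx + gx];
        if (threadIdx.y == TILE_Y - 1 && gy < ny - 1)
            tile[ly + 1][lx] = in[(gy + 1) * nx + gx];
    }
    __syncthreads();

    // Assumed 4-point Jacobi update on interior points, read from global memory.
    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1) {
        out[gy * nx + gx] = 0.25f * (in[gy * nx + gx - 1] + in[gy * nx + gx + 1] +
                                     in[(gy - 1) * nx + gx] + in[(gy + 1) * nx + gx]);
    }
}

// Prints the kernel's static shared memory use against the per-block limit
// and returns the number of resident blocks per SM, or -1 on a CUDA error.
int blocksPerSM() {
    int device = 0, maxSharedPerBlock = 0, numBlocks = 0;
    cudaFuncAttributes attr;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&maxSharedPerBlock,
                               cudaDevAttrMaxSharedMemoryPerBlock,
                               device) != cudaSuccess) return -1;
    if (cudaFuncGetAttributes(&attr, jacobiHybrid) != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, jacobiHybrid, TILE_X * TILE_Y, 0) != cudaSuccess) return -1;

    printf("static shared memory: %zu of %d bytes per block\n",
           attr.sharedSizeBytes, maxSharedPerBlock);
    printf("resident blocks per SM: %d\n", numBlocks);
    return numBlocks;
}
```

Changing `TILE_X` and `TILE_Y` and calling `blocksPerSM()` again shows whether a smaller tile lets more blocks share a multiprocessor, which could then be compared against the measured run times.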
[TODO: INCLUDE PROFILING BREAKDOWNS OF INDIVIDUAL (NOT 5000) KERNEL RUNS TO SEE SPECIFIC TIMELINE FEATURES. EXPLAIN THE DIFFERENCES IN RUN TIMES]