Changes

Solo Act

990 bytes added, 16:25, 18 April 2018

→‎Assignment 3

The graph above compares the performance of the global and shared memory kernels. It shows no significant improvement from using shared memory.

Although shared memory is likely faster to read, implementing it required adding one instruction. By comparison, the global kernel 1) reads from global memory and then 2) writes to global memory. The shared kernel 1) reads from shared memory, 2) writes to global memory, and then 3) writes to shared memory. This extra instruction reduces the benefit from using shared memory.
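The difference between the two kernels can be sketched roughly as follows. This is not the assignment's actual code: the `combine` rule, the wrap-around neighbour, the names `roundsGlobal` and `roundsShared`, and the one-thread-per-leaf layout in a single block are all assumptions made for illustration. It only shows where the extra shared memory write ends up in each round.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int MAX_LEAVES = 32;  // hypothetical round size, one thread per leaf

// Hypothetical rule for building a leaf from two leaves of the previous round.
__device__ unsigned combine(unsigned left, unsigned right) { return left + right; }

// Global version: per round, 1) read global memory, 2) write global memory.
__global__ void roundsGlobal(unsigned* rounds, int n, int numRounds) {
    int i = threadIdx.x;
    for (int r = 1; r < numRounds; ++r) {
        if (i < n) {
            const unsigned* prev = rounds + (r - 1) * n;
            unsigned value = combine(prev[(i + n - 1) % n], prev[i]);  // 1) global reads
            rounds[r * n + i] = value;                                  // 2) global write
        }
        __syncthreads();  // the next round reads what neighbouring threads just wrote
    }
}

// Shared version: per round, 1) read shared, 2) write global, 3) write shared.
// Assumes n <= MAX_LEAVES so the whole round fits in the shared array.
__global__ void roundsShared(unsigned* rounds, int n, int numRounds) {
    __shared__ unsigned prev[MAX_LEAVES];
    int i = threadIdx.x;
    if (i < n) prev[i] = rounds[i];  // stage round 0 once
    __syncthreads();
    for (int r = 1; r < numRounds; ++r) {
        unsigned value = 0;
        if (i < n) {
            value = combine(prev[(i + n - 1) % n], prev[i]);  // 1) shared reads
            rounds[r * n + i] = value;                         // 2) global write
        }
        __syncthreads();             // every thread has finished reading prev
        if (i < n) prev[i] = value;  // 3) the extra shared write
        __syncthreads();
    }
}

// Runs both kernels on the same input and checks that they produce the same rounds.
bool kernelsAgree() {
    const int n = MAX_LEAVES, numRounds = 8;
    const size_t count = static_cast<size_t>(n) * numRounds;
    const size_t bytes = count * sizeof(unsigned);
    std::vector<unsigned> init(count, 0u), fromGlobal(count), fromShared(count);
    for (int i = 0; i < n; ++i) init[i] = i + 1;  // round 0

    unsigned *dGlobal = nullptr, *dShared = nullptr;
    bool ok = cudaMalloc(&dGlobal, bytes) == cudaSuccess &&
              cudaMalloc(&dShared, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dGlobal, init.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dShared, init.data(), bytes, cudaMemcpyHostToDevice);
        roundsGlobal<<<1, n>>>(dGlobal, n, numRounds);
        roundsShared<<<1, n>>>(dShared, n, numRounds);
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(fromGlobal.data(), dGlobal, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(fromShared.data(), dShared, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dGlobal);
    cudaFree(dShared);
    return ok && fromGlobal == fromShared;
}

int main() {
    std::printf("kernels agree: %s\n", kernelsAgree() ? "yes" : "no");
    return 0;
}
```

In this sketch the shared version also needs a second `__syncthreads()` per round, so its overhead is slightly more than the single extra write.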

One other likely cause for such similar results is the effect of coalescing on memory access. The leaf nodes of a round are stored contiguously in global memory. This means that neighbouring threads access consecutive addresses, and the hardware is likely merging these global read requests, which reduces the penalty of global access. The extra instruction reduces the gain from shared memory, and the coalesced access speeds up the global memory reads. Despite both of these factors, the results are close to the same. This suggests that shared memory reads are considerably faster, but also shows that shared memory is not very effective in this situation. I only tested this up to x 32 elements to be sure I wouldn't run out of shared memory. For larger data sets, the work would have to be partitioned across multiple blocks according to the shared memory per block and the threads per block. These values constrain the maximum number of leaves.
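As a rough illustration of how those two limits bound the leaf count, the sketch below queries the device and takes the smaller of the two. The helper `maxLeavesPerBlock` is hypothetical, and it assumes one thread per leaf and one fixed-size element per leaf held in shared memory, which may not match the actual kernel.

```cuda
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: largest round that fits in one block, assuming one thread
// per leaf and `bytesPerLeaf` bytes of shared memory per leaf.
size_t maxLeavesPerBlock(size_t sharedMemPerBlock, int maxThreadsPerBlock,
                         size_t bytesPerLeaf) {
    size_t byMemory = sharedMemPerBlock / bytesPerLeaf;  // limit from shared memory
    return std::min(byMemory, static_cast<size_t>(maxThreadsPerBlock));
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("max leaves per block:    %zu\n",
                maxLeavesPerBlock(prop.sharedMemPerBlock, prop.maxThreadsPerBlock,
                                  sizeof(unsigned)));
    return 0;
}
```

With small elements the thread limit is usually the tighter of the two. With larger per-leaf data, shared memory becomes the binding constraint.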

Njsimas

