Analyzing False Sharing
False sharing is a well-known problem that can occur when taking advantage of parallelization in modern multi-processor computers with shared memory. Although the results will be correct, frequent false sharing results in significant performance loss and reduction in scalability, effectively nullifying the benefits of parallel computing.
 
== The Cache ==
At any given moment, only a small piece of data is needed. Even if everything were brought in from memory, most of it would go unused. By utilizing the hierarchy of memory storage, the most relevant data can be siphoned from secondary memory ahead of time and stored in RAM and the cache. When searching for data, the CPU first checks the cache; finding it there is a cache hit. If it is not there, a cache miss, the CPU moves down the hierarchy until it finds the data. Minimizing the number of cache misses ensures the CPU has a steady flow of data it can quickly retrieve and compute with.
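
As a rough illustration of why this matters, the following sketch sums the same array twice, once sequentially and once with a large stride. Both passes do the same amount of work, but the strided pass touches a new cache line on almost every access, so it typically runs noticeably slower. The array size and the 16-element stride (16 * 4 bytes = one 64-byte line) are illustrative assumptions, not figures from this article.
<pre>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;      // ~16 million ints (size is arbitrary)
    const std::size_t stride = 16;      // 16 * 4 bytes = one assumed 64-byte cache line
    std::vector<int> data(n, 1);

    // Sum every element of the array, visiting it in steps of 'step'.
    auto time_sum = [&](std::size_t step) {
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t offset = 0; offset < step; ++offset)
            for (std::size_t i = offset; i < n; i += step)
                sum += data[i];
        auto end = std::chrono::steady_clock::now();
        std::cout << "step " << step << ": sum = " << sum << ", "
                  << std::chrono::duration<double, std::milli>(end - start).count()
                  << " ms\n";
    };

    time_sum(1);       // sequential pass: mostly cache hits
    time_sum(stride);  // strided pass: same work, far more cache misses
}
</pre>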
 
=== Cache Coherence and Cache Line ===
[[File:cache_coherence.png|right]]
Each processor has its own local cache. When data is needed, a fixed-size block of memory is transferred to the cache; this block is known as a cache line.
What happens when several processors cache the same memory and one of them modifies it? The answer is that we need cache coherence. '''Cache coherence''' is defined as the '''uniformity of shared resource data that ends up stored in multiple local caches'''. In other words, we must keep all the local caches synchronized. The challenge of doing so is known as the Cache Coherence Problem. To solve this problem, multi-processor systems rely on cache coherence protocols to manage and maintain the cache.
 
 
 
 
 
 
 
 
 
== False Sharing ==
False sharing occurs when multiple threads modify data that happens to reside on the same cache line. The key thing to note is that the threads do not need to modify the same piece of data: if the data they modify belongs to the same cache line, the line is invalidated, forcing a memory update to maintain cache coherence.
 
=== Example ===
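A minimal sketch of a naive implementation that exhibits the problem: several threads each repeatedly increment their own element of a small, contiguous int array. The thread count, iteration count, and use of std::thread here are illustrative assumptions rather than the article's original listing.
<pre>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_threads = 8;                 // illustrative thread count
    const long iterations = 100000000;         // illustrative iteration count
    std::vector<int> counts(num_threads, 0);   // 8 ints = 32 contiguous bytes

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counts, iterations, t] {
            // Each thread writes only its own element, but neighbouring
            // ints sit on the same cache line, so every write invalidates
            // that line in the other threads' caches.
            for (long i = 0; i < iterations; ++i)
                counts[t]++;
        });
    }
    for (auto& th : threads) th.join();

    long total = 0;
    for (int c : counts) total += c;
    std::cout << "total = " << total << "\n";   // correct result, poor scaling
}
</pre>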
 
==== Results ====
[[File:naive_implementation.png|600px|thumb|Execution time of naive implementation without any optimization levels (Od)]]
The algorithm calculates the correct answer, but the performance is terrible due to false sharing. An array is a contiguous block of memory, with each integer taking up 4 bytes. Assuming a 64-byte cache line, the entire array takes up only half of a single cache line, which lets multiple threads share the same cache line and results in false sharing. Although there were cases where a higher thread count produced better results, many cases performed worse than a single thread. This comes down to the scheduling of thread execution, which is out of the programmer's hands: it is possible for the selected schedule to minimize the frequency of false sharing and give better performance, but this is extremely unreliable, so we need a better solution to false sharing.
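
As a quick check of those numbers, the sketch below prints the cache line size reported by the standard library (where the C++17 interference-size constant is available; otherwise 64 bytes is simply assumed, as in the text) along with the footprint of an 8-element int array, which covers only half of a 64-byte line.
<pre>
#include <iostream>
#include <new>      // std::hardware_destructive_interference_size (C++17)

int main() {
#ifdef __cpp_lib_hardware_interference_size
    std::cout << "cache line size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#else
    std::cout << "cache line size: assuming 64 bytes\n";
#endif
    // 8 ints * 4 bytes each = 32 bytes: the whole array fits in half a line,
    // so every element shares a single cache line.
    std::cout << "array footprint: " << 8 * sizeof(int) << " bytes\n";
}
</pre>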
== Solutions to False Sharing ==
=== Padding ===
Since the problem arises because data used by different threads shares a cache line, one solution is to pad each element so that it occupies its own cache line. A write by one thread then no longer invalidates the line holding another thread's data, eliminating the false sharing at the cost of some extra memory, as sketched below.
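
A sketch of the padding approach, assuming a 64-byte cache line and C++17 (so that std::vector can allocate the over-aligned element type): each counter is wrapped in a struct that is aligned, and therefore padded, to a full cache line. The thread and iteration counts remain illustrative.
<pre>
#include <iostream>
#include <thread>
#include <vector>

// alignas(64) forces each element onto its own (assumed 64-byte) cache line,
// so sizeof(PaddedCount) becomes 64 instead of 4.
struct alignas(64) PaddedCount {
    int value = 0;
};

int main() {
    const int num_threads = 8;            // illustrative thread count
    const long iterations = 100000000;    // illustrative iteration count
    std::vector<PaddedCount> counts(num_threads);

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counts, iterations, t] {
            // Each write now touches a cache line no other thread uses,
            // so the caches no longer invalidate one another.
            for (long i = 0; i < iterations; ++i)
                counts[t].value++;
        });
    }
    for (auto& th : threads) th.join();

    long total = 0;
    for (const auto& c : counts) total += c.value;
    std::cout << "total = " << total << "\n";
}
</pre>
The trade-off is memory: each counter now occupies a full cache line (64 bytes here) instead of 4 bytes, which is usually negligible for a handful of per-thread values.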
=== Synchronization ===