Changes

← Older edit

GPU621/Analyzing False Sharing

12,545 bytes added, 15:14, 4 December 2022

→‎Use local variables for each thread

=Group Members=

~~ ~~

# [mailto:rleong4@myseneca.ca?subject=GPU621 Ryan Leong]

# [mailto:yppadsala@myseneca.ca?subject=GPU621 Yash Padsala]

# [mailto:sgpatel22@myseneca.ca?subject=GPU621 Shani Patel]

= '''Preface''' =

In multicore concurrent programming, if the mutual exclusion is a "performance killer", then false sharing is a worthy "performance assassin". The difference between "killer" and "assassin" is that a "killer" can be easily detected, and we always have many ways to deal with a "killer" when we fight it, such as fighting, running away, detouring, and begging for mercy, but an "assassin" is completely different. Assassins are good at hiding in the shadows and can give a fatal blow at any time, making people defensive.

In our concurrent development, when we encounter situations that affect concurrency performance, we always have many ways to find and improve the concurrency performance of the program. But false sharing leaves no trace in our code that it is in the "dark" and is seriously slowing down concurrency performance, making it hard to find the root cause of the problem, let alone improve concurrency performance.

= '''What you need to know before understanding false sharing''' =

== Cache ==

Basically, the cache is a location where data is stored for CPU use. But there are many types of memory for the CPU. Ideally, CPU can not access every data with lightning speed. So, let's talk about it in brief. In our PC and laptops, we have hard drives and SSDs which are slow in speed, but they can store big data. So, when the CPU tries to access this data it will take some time, so this process takes more computational time.

[[File:cache.jpg|right|400px]]

Moreover, DRMS is quite expensive. However, it is fast but still, DRMS is not enough for CPUs. DRMS works on electricity. Hence if power is cut that time data is not accessible. Furthermore, SRAM is very fast and smaller but expensive. SRAM memory used in cache. So, the first CPU finds data from the cache and if data is not in the cache, then the CPU starts to find it in the main memory and tries until it finds that data. So, this data transfers into the cache. This transfer works in two ways. The first way is that the same data is transferred into the cache for a short period of time and this is known as temporal locality. Another one is a spatial locality and this method also accessed nearby locations if it will be needed.

== Cache Coherence ==

[[File:cacheCoherence.jpg|center|800px]]

When the CPU accesses the data at that time data goes into the cache and this memory block is known as the cache line. But it is not possible that modify the original data when changes in the cache line. Here cache coherence helps that stored in multiple local caches. Cache coherence connects all cache.

== MESI ==

First, we know that the cache line has four different states. Like, exclusive, shared, modified, and invalid. So MESI is an acronym for all four states. To understand it well we throw light on this example.

[[File:MESI1.jpg|400px]]

This picture shows that the core A read value a from the main memory. So this core also fetches some nearby values from the main memory and stores these values in the cache line and makes it exclusive because only core A operates this cache line.

[[File:MESI2.jpg|400px]]

Later, core B also reads the value of b from memory. Now, a and b both read the same cache line and close so, both cores are shared.

[[File:MESI3.jpg|400px]]

Moreover, core A change the value of a so, this change is stored in the buffer and cache line tagged as modified. In the end, it communicates with core B and is known as an invalid cache line.

= '''What is False Sharing?''' =

False Sharing is sharing pattern that happens when multiple applications run and share the same memory. When more than one thread read or updates the same data in logical memory it ends up in the same catch line. Each application’s cache is a copy of a common source. So, when we modify one cache it will cause others to reload from the common source. In other words, when one program makes a change in one cache and it does not change the data which is used by the second program, this also forces another cache to reload so, in this scenario reload cache is a useless system resource and it may create a negative impact on the performance of the program. It is not easy to catch false sharing and stop it. However, there are some ways which help us to overcome false sharing.

The foremost reason why false sharing happens can be found in how operating systems read and write data. When a program reads or writes data from the hard drive or other sources at that time this data loads into a temporary cache. Because it is a very fast way to access it. This knows as a cache line.

= '''False sharing example''' =

#include <thread>

#include <vector>

#include <iostream>

#include <chrono>

using namespace std;

const int sizeOfNumbers = 10000; // Size of vector numbers

const int sizeOfSum = 2; // Size of vector sum

void sumUp(const vector<int> numbers, vector<long>& sum, int id) {

for (int i = 0; i < numbers.size(); i++) {

if (i % sizeOfSum == id) {

sum[id] += i;

}

// The output of this sum

cout << "sum " << id << " = " << sum[id] << endl;

}

int main() {

vector<int> numbers;

for (int i = 0; i < sizeOfNumbers; i++) { //Initialize vector numbers to 0 to 100

numbers.push_back(i);

}

cout << "-----Thread-----" << endl;

// Thread block

{

vector<long> sum(sizeOfSum, 0); //Set size=sizeOfSum and all to 0

auto start = chrono::steady_clock::now();

vector<thread> thread;

for (int i = 0; i < sizeOfSum; i++) {

thread.emplace_back(sumUp, numbers, ref(sum), i);

}

for (int i = 0; i < sizeOfSum; i++) {

thread[i].join();

}

auto end = chrono::steady_clock::now();

cout << "Thread time consumption: " << chrono::duration_cast<chrono::microseconds>(end - start).count() << "ms" << endl;

}

cout << endl << "-----Serial-----" << endl;

// Serial block

{

vector<long> sum(sizeOfSum, 0);

auto start = chrono::steady_clock::now();

for (int i = 0; i < sizeOfSum; i++) {

sumUp(numbers, sum, i);

}

auto end = chrono::steady_clock::now();

cout << "Serial time consumption: " << chrono::duration_cast<chrono::microseconds>(end - start).count() << "ms" << endl;

}

In this program code, we can see that at the beginning of the main function we declare a vector numbers and initialize it from 0 to 10000. We can also see in the main function that the code is divided into two main blocks. The first block is the Thread block, which is executed concurrently using multiple threads. While the second block is the Serial block, it is executed concurrently using normal serial logic.

The main purpose of the sumUp function is to calculate the sum of odd elements or even elements based on the data in the first vector argument in the argument list. Also, the sum will be recorded in the corresponding position of the second vector argument using the int argument as the index.

Which block of code do you think will take less time?

Theoretically, this code should be executed faster on a multicore machine with a Thread block than a serial block. But the result is.

[[File:exampleOutput1.jpg|400px]]

Or

[[File:exampleOutput2.jpg|400px]]

Or this

[[File:exampleOutput3.jpg|400px]]

To our surprise, the serial block took much less time, no matter how many times I ran it. This turned our existing knowledge upside down, but don't worry, it's because you don't understand False Sharing yet.

= '''Solutions Of False Sharing''' =

== Use local variables for each thread ==

As we said above, the smallest unit of CPU operation on the cache is the size of a cache line, which is 64 bytes. As you can see in our program code, the sum is a vector that stores two data with long type. The two long data are in fact located on the same cache line. When two threads read and write sum[0] and sum[1] separately, it looks like they are not using the same variable, but in fact, they are affecting each other.For example, if a thread updates sum[0] in CPU0, this causes the cache line of sum[1] in CPU1 to be invalidated. This causes the program to take longer to run, as it needs to re-read the data.

We try to change the sizeOfSum to 8, which means that changing the size of the vector sum to 8 and the if condition in the sumUp function to i%8==id will give the program more data to process.

[[File:sizeOfSum8.jpg|400px]]

Nevertheless, the result still does not bring the time between the Serial block and Thread block closer and the Serial block still takes less time.

What would it take to show the advantages of multicore? As you can see in our program, multiple threads are not operating on the same variable, but only a single thread is reading and writing a lot of variables in the same cache line, so you can introduce a local variable.

We implement another function newSumUp in the program code:

void newSumUP(const vector<int> numbers, vector<long>& sum, int id) {

long thisSum = 0;

for (int i = 0; i < numbers.size(); ++i) {

if (i % sizeOfSum == id) {

thisSum += i;

}

sum[id] = thisSum;

cout << "sum " << id << " = " << sum[id] << endl;

}

Also, we add the following code to our main function:

cout << endl << "----Local variable block---" << endl;

// Local variable block

{

vector<long> sum(sizeOfSum, 0);

auto start = chrono::steady_clock::now();

vector<thread> thread;

for (int i = 0; i < sizeOfSum; ++i) {

thread.emplace_back(newSumUP, numbers, ref(sum), i);

}

for (int i = 0; i < sizeOfSum; ++i) {

thread[i].join();

}

auto end = chrono::steady_clock::now();

cout << "Local variable block consumption: " << chrono::duration_cast<chrono::microseconds>(end - start).count() << "ms" << endl;

}

When we ran the program again, we came to this conclusion:

[[File:newBlockOutput(2).jpg|400px]]

We can see that the Local variable block is already much faster than the Thread block, and even comparable to the Serial block is almost the same. But overall the Serial block is still the least time-consuming because the Local variable block still needs to incur extra overhead for thread creation and scheduling.

We could try increasing sizeOfNumbers to 1000000 as well, which would allow the program to process more data, thus compensating for the extra overhead of thread creation and scheduling.

[[File:newBlockOutput1000000(new).jpg|400px]]

Now we can already see the advantage of multi-threading. Even when the vector numbers reach the size of 1000000, the Thread block even runs faster than the Serial block.

So the problem of false sharing sometimes cannot be generalized, and the occurrence of false sharing does not necessarily make performance lower. But no matter how the Local variable block is the optimal solution.

== Byte alignment ==

If you think the above example is too limiting, not all cases can be abstracted to such a local variable. So let's let the two variables exist on different cache lines!

#include <thread>

#include <iostream>

#include <chrono>

using namespace std;

struct Number {

int num1;

int num2;

};

void increment1(Number& number) {

for (int i = 0; i < 1000000; ++i) {

number.num1++;

}

cout << "number.num1:" << number.num1 << endl;

}

void increment2(Number& number) {

for (int i = 0; i < 1000000; ++i) {

number.num2++;

}

cout << "number.num2:" << number.num2 << endl;

}

int main() {

Number number = { 0, 0 };

auto start = chrono::steady_clock::now();

thread thread1(increment1, ref(number));

thread thread2(increment2, ref(number));

thread1.join();

thread2.join();

auto end = chrono::steady_clock::now();

cout << "Consumption: " << chrono::duration_cast<chrono::microseconds>(end - start).count() << "ms" << endl;

}

In this program code, we can easily see that the Number struct has two member variables (num1 and num2). They are for loop in two different threads to execute the increment operator 1000000 times. This time we use Linux with multiple cores to compile (note that here we are not using any optimization).

g++ -std=c++11 -pthread false_sharing.cpp

This is the result after running:

[[File:LinuxOutput.jpg|400px]]

[[File:LinuxOutput2.jpg|400px]]

~~= '''Preface''' =~~[[File:LinuxOutput3.jpg|400px]] We can see from this result that the runtime is actually unstable and takes a long time. Although the fields within struct have their own default byte alignment, they are generally only aligned to the word length. In ~~multicore concurrent programming~~this code example, if that is, 8-byte alignment, which does not result in num1 and num2 being in separate cache lines. So we ~~compare~~ need to use the gcc extension syntax to align the program to 64 bytes. We can define a macro at the ~~contention~~ beginning of ~~mutually exclusive locks~~ the program code to ~~"performance killers", then pseudo-sharing is~~ use the ~~equivalent of "performance assassins"~~gcc attribute function. ~~The difference between~~ #define ALIGN __attribute__((aligned(64))) We also need to redefine the Number struct: struct Number { int ALIGN num1; int ALIGN num2; }; This allows a member variable with a ~~"killer" and an "assassin" is that~~ macro name added to the ~~killer is visible and we~~ Number struct to be 64 bytes.If you don't believe me you can ~~choose~~ try to ~~fight~~test Number with sizeof, ~~run~~the result is 128 bytes. But you can see that sizeof num1 and num2 are still 4 bytes respectively, ~~detour, and beg for mercy when we encounter~~ that's because the ~~killer, but~~ padded bytes are not counted in the ~~"assassin"~~ field. [[File:testBytes.jpg|400px]] ''Here is ~~different~~a reminder: not all machines are 64 bytes. ~~The "assassin"~~ If your CPU is ~~always hiding in~~ a three-level cache, you should look at the size of the ~~shadows~~L3 cache line, ~~waiting for an opportunity to give you a fatal blow~~not the L1! If your CPU does not have L3, ~~which is impossible to prevent~~then it will be based on L2. ~~In our concurrent programming, when~~ '' When we ~~encounter lock contention that affects concurrency performance~~execute the program code again after the bytes are aligned, we get this result: [[File:LinuxOutput4.jpg|400px]] [[File:LinuxOutput5.jpg|400px]] [[File:LinuxOutput6.jpg|400px]] We can notice that the time taken for multiple executions is significantly more stable and shorter than before. Let's take ~~various measures~~ another look at what happens when O2 is turned on. g++ -std=c++11 -pthread -O2 false_sharing.cpp Output(~~such as shortening the critical area, atomic operations, etc~~'''with''' byte alignment): [[File:with1.jpg|400px]] [[File:with2.jpg|400px]] Output('''without''' byte alignment) ~~to improve~~ : [[File:without1.jpg|400px]] [[File:without2.jpg|400px]] After running it, I found that the ~~performance of~~ difference between the ~~program~~runtime with and without byte alignment is very small and there is no instability, ~~but pseudo-sharing~~ the runtime is around 200ms. But the unexpected thing is ~~something~~ that ~~we cannot see from the~~ when I execute increment1() and increment2() in serial code ~~we write~~only, it only consumes 16ms! [[File:mySerial.jpg|400px]] Have you learned? The so we -called false sharing can make performance worse in fact this cannot ~~find~~ be generalized. We can see that with O2 on, the ~~problem~~ CPU and ~~cannot solve it. This leads~~ compiler are able to ~~pseudo~~optimize the serial version of the code very well, while the multi-~~sharing in the "dark"~~threaded code is not well optimized, which is leads to higher performance of the serial version of the code. Even using a ~~serious drag on concurrency performance~~byte-aligned cache line size does not help. In the first example, ~~but~~ we ~~can't do anything about it~~increased the number of vector numbers, which caused the serial code to slow down significantly. The differences in the two examples are related to the specific execution logic.

= So don'~~''What to know before understanding~~ t treat false sharing~~''' === Cache Lines ==[[File:Pyramid Model.png|800px]] ~~In order to carry out the following discussion, we need to first familiarize ourselves with the concept of cache lines. Students who have studied this part of the OS course on storage architecture should be impressed by the pyramid model of the memory hierarchy, where the pyramid from top to bottom represents as a ~~reduction in the cost and larger capacity of the storage medium~~beast, and ~~from bottom to top represents an increase in access speed~~don't treat 64-byte alignment as a panacea. ~~The top of the pyramid model is located in the CPU registers, followed by the CPU cache (L1, L2, L3), then down~~ Everything has to ~~the memory, the bottom is the disk, the operating system uses this storage hierarchy model is mainly to solve the contradiction between the CPU's high speed and memory disk low speed, the CPU will~~ be ~~recently used data read in advance~~ actually tested to ~~the Cache, the next time to access the same data, the CPU can be directly from the faster CPU~~know. ~~The next time~~ Don't throw out the ~~same data is accessed,~~ conclusion under O0 and then take it ~~can be read directly from the faster CPU cache~~as a guideline, ~~avoiding slowing down the overall speed by reading from memory or disk~~that makes no sense ......

~~The smallest unit of CPU cache is the cache line~~= '''Conclusion''' =In conclusion, the ~~cache line size varies depending on~~ term false sharing refers to the ~~architecture,~~ use of shared memory by multiple threads at the ~~most~~ same time. A multiprocessor environment decreases performance. Shared data is modified by multiple processors simultaneously. This is a very common ~~ones are 64Byte and 32Byte~~occurrence in the loop. In many cases, false sharing can be overlooked, ~~the CPU cache accesses data from within the cache line unit~~resulting in a program's inability to scale. In parallel programming, ~~each time taking the entire cache line~~ where ~~the data needs to be read~~performance is fundamental, ~~even if the adjacent data~~ keeping an eye out for problems and recognizing them quickly is ~~not used will also be cached in the CPU cache~~essential.

~~[[File:CPUCacheLines~~Finally, we discussed some solutions to minimize false sharing by making as much private data as possible to reduce how often shared data needs to be updated. Making use of the compiler's optimization features to reduce memory loads and stores. Also, we can improve performance by byte alignment.~~png|800px]]~~

~~For example, double is 8 bytes in C++ and our cache line~~ = References =https://www.easytechjunkie.com/what-is ~~64 bytes, when we read one of the elements of the double array from memory, the CPU will read the eight elements around that element into the cache line~~-false-sharing.htm

~~== Cache Consistency ==~~In https://levelup.gitconnected.com/false-sharing-the ~~case of single~~-~~core CPUs, the above method works fine and ensures that the data cached in the CPU cache is always "clean" because no other CPU will change the data in memory, but in the case of multi~~lesser-known-performance-~~core CPUs, the situation becomes a bit more complicated. In multi~~killer-CPU, each CPU has its own private cache (possibly shared L3 cache), and when one CPU1 operates on the cached data in the Cache, if CPU2 has changed the data before, the data in CPU1 is no longer "clean", i.e., it should be invalid. Cache coherency is to ensure that the cache is consistent across multiple CPUs.bbb6c1354f07

~~The MESI protocol is used in Linux systems to handle cache coherency, by which is meant the four states of the CPU cache~~https://wiki.cdot.senecacollege.ca/wiki/GPU621/False_Sharing

* M (Modified)https: ~~the local processor has modified the cache line, i~~//www.~~e., it is a dirty line, its contents are not the same as those in memory, and there is only one local copy of this cache (proprietary)~~baeldung.* E (Exclusive)com/java-false-sharing-contended#: ~~the contents of the cache line are the same as in memory and the line is not available to any other processor.~~* S (Shared)~: ~~the contents of the cache line are the same as in memory~~text=In%20the%20MESI%20protocol%2C%20each, ~~and it is possible that a copy of~~ the ~~cache line exists in other processors.~~* I (Invalid): The cache line is invalid, and cannot be used%20acronym%20of%20these%20states.

~~[[File~~https:~~CacheConsistency~~//learn.~~jpg]]~~microsoft.com/en-us/archive/msdn-magazine/2008/october/net-matters-false-sharing

~~Each CPU cache line transitions between four states to determine whether the CPU cache is invalidated~~http://www. ~~For example, if CPU1 performs a write operation on a cache line, this operation will cause the cache line of other CPUs to enter the Invalid state, and the CPU will need to re~~nic.uoregon.edu/~khuck/ts/acumem-~~read the cache line from memory when it needs to use it. This solves the problem of cache coherency among multiple CPUs~~report/manual_html/multithreading_problems.html

Ryan Leong

118

edits

CDOT Wiki β

Changes

GPU621/Analyzing False Sharing

CDOT Wiki ^β