== Analyzing False Sharing ==
[[File:False Sahring 2023.pdf]]
== Team Members ==
== Introduction: What is False Sharing? ==
False sharing occurs when threads on different cores modify independent variables that happen to reside on the same cache line. Although the threads never touch each other's data, the cache coherence hardware tracks the line as a single unit, so each write invalidates the other core's copy of the line and forces costly re-fetches, degrading performance.
== Cache Consistency ==
[[File:cacheCons.png|500px|center]]
<br style="clear:both" />
== The MESI protocol ==
A cache line can be in one of four states:

'''Modified''' - The line has been changed and no longer matches main memory; this cache holds the only valid copy.

'''Exclusive''' - The line matches main memory and no other cache holds a copy.

'''Shared''' - The line matches main memory, but other caches may also hold copies.

'''Invalid''' - The line's data is not valid and must be re-fetched before use.

Consider two cores working with data that lies on the same cache line. Core 1 reads a value from main memory and stores it in its cache; for efficiency, it will now read the cached copy instead of going back to memory whenever possible, saving time and memory bandwidth. Core 2 then reads a second value that happens to live on the same cache line. Because both cores now hold copies of that line, both tag their copies as '''shared'''. If core 2 modifies its value, it updates its local cache and changes the line's state from '''shared''' to '''modified'''. Finally, core 2 notifies core 1 of the change, and core 1 marks its copy of the line as '''invalid'''.
[[File:MESIDIAGRAM123.PNG]]
== Example ==
In this example, two threads run in parallel, each incrementing its own counter variable. The volatile keyword keeps the compiler from optimizing the increments away, and we define the cache line size as 64 bytes. The two counters are plain, unpadded, and declared side by side, so they almost certainly land on the same cache line.
<pre>
#include <iostream>
#include <thread>
#include <chrono>

using namespace std;
using namespace chrono;

// Constants to control the program
const int NUM_THREADS = 2;
const int NUM_ITERATIONS = 100000000;
const int CACHE_LINE_SIZE = 64;

// Two plain counters with no padding; declared back to back, they are
// very likely to share a cache line, which is what triggers false sharing.
struct Counter {
    volatile long long value; // the counter value
};

// Define two counter variables
Counter counter1, counter2;

// Function to increment counter 1
void increment1() {
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter1.value++;
    }
}

// Function to increment counter 2
void increment2() {
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter2.value++;
    }
}

// Function to report the runtime of a section of the program
void checkRuntime(high_resolution_clock::time_point start_time, high_resolution_clock::time_point end_time) {
    auto duration = duration_cast<milliseconds>(end_time - start_time).count();
    cout << "Runtime: " << duration << " ms" << endl;
}

int main() {
    // Print the cache line size
    cout << "Cache line size: " << CACHE_LINE_SIZE << " bytes" << endl;

    // Run the program using a single thread
    cout << "Running program using a single thread" << endl;
    high_resolution_clock::time_point start_time = high_resolution_clock::now();
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter1.value++;
        counter2.value++;
    }
    high_resolution_clock::time_point end_time = high_resolution_clock::now();
    checkRuntime(start_time, end_time);
    cout << "Counter 1: " << counter1.value << endl;
    cout << "Counter 2: " << counter2.value << endl;

    // Run the program using multiple threads
    cout << "Running program using " << NUM_THREADS << " threads" << endl;
    counter1.value = 0;
    counter2.value = 0;
    start_time = high_resolution_clock::now();
    thread t1(increment1);
    thread t2(increment2);
    t1.join();
    t2.join();
    end_time = high_resolution_clock::now();
    checkRuntime(start_time, end_time);
    cout << "Counter 1: " << counter1.value << endl;
    cout << "Counter 2: " << counter2.value << endl;
    return 0;
}
</pre>
This example provides both a single-threaded and a multi-threaded solution.
[[File:False Analyzing Test 1.PNG]]
Based on the results, the single-threaded solution runs faster than the multi-threaded one. But why?
Even though the execution is split between two threads, each increment invalidates the other core's copy of the shared cache line. This cache invalidation and coherence traffic outweighs the benefit of splitting the work, making the multi-threaded solution slower.
However, if the number of iterations is increased, at some point the serial solution becomes slower than the multi-threaded one. Why is that?
[[File:False Analyzing Test 2.PNG]]
Increasing NUM_ITERATIONS slows the serial solution because its loop executes sequentially: all of the work lands on a single thread, which becomes the bottleneck.
The multi-threaded solution avoids this bottleneck because the work is split between two threads, each handling '''fewer''' iterations at the '''same''' time. Once the savings from running in parallel outweigh the cache invalidation overhead, the multi-threaded solution takes less time to get through the iterations.
== Solution ==
When two threads update different variables that share the same cache line, this is known as false sharing. Each thread's write invalidates the cache line for the other thread, causing a performance hit. We can fix this by aligning and padding the Counter struct so that each variable occupies its own cache line. With each thread updating a different cache line, false sharing is prevented.
We can further improve the code by replacing the volatile variables with atomic variables. Note that volatile does not make the increments thread-safe; it only prevents the compiler from optimizing them away. std::atomic guarantees that each increment is performed as a single, indivisible read-modify-write, so concurrent updates to the same variable are never lost.
The updated code, which addresses false sharing and optimization, is provided below:
<pre>
#include <iostream>
#include <thread>
#include <chrono>
#include <atomic>

using namespace std;
using namespace chrono;

// Constants to control the program
const int NUM_THREADS = 2;
const int NUM_ITERATIONS = 100000000;
const int CACHE_LINE_SIZE = 64;

// Define a counter struct aligned and padded to its own cache line
struct alignas(CACHE_LINE_SIZE) Counter {
    atomic<long long> value; // the counter value
    char padding[CACHE_LINE_SIZE - sizeof(atomic<long long>)]; // pad the struct out to a full cache line
};

// Define two counter variables
Counter counter1, counter2;

// Function to increment counter 1
void increment1() {
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter1.value++;
    }
}

// Function to increment counter 2
void increment2() {
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter2.value++;
    }
}

// Function to report the runtime of a section of the program
void checkRuntime(high_resolution_clock::time_point start_time, high_resolution_clock::time_point end_time) {
    auto duration = duration_cast<milliseconds>(end_time - start_time).count();
    cout << "Runtime: " << duration << " ms" << endl;
}

int main() {
    // Print the cache line size
    cout << "Cache line size: " << CACHE_LINE_SIZE << " bytes" << endl;

    // Run the program using a single thread
    cout << "Running program using a single thread" << endl;
    high_resolution_clock::time_point start_time = high_resolution_clock::now();
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter1.value++;
        counter2.value++;
    }
    high_resolution_clock::time_point end_time = high_resolution_clock::now();
    checkRuntime(start_time, end_time);
    cout << "Counter 1: " << counter1.value << endl;
    cout << "Counter 2: " << counter2.value << endl;

    // Run the program using multiple threads
    cout << "Running program using " << NUM_THREADS << " threads" << endl;
    counter1.value = 0;
    counter2.value = 0;
    start_time = high_resolution_clock::now();
    thread t1(increment1);
    thread t2(increment2);
    t1.join();
    t2.join();
    end_time = high_resolution_clock::now();
    checkRuntime(start_time, end_time);
    cout << "Counter 1: " << counter1.value << endl;
    cout << "Counter 2: " << counter2.value << endl;
    return 0;
}
</pre>
== Sources ==
* https://wiki.cdot.senecacollege.ca/wiki/Team_False_Sharing
* https://www.cs.utexas.edu/~pingali/CS377P/2018sp/lectures/mesi.pdf