GPU621/Intel Parallel Studio VTune Amplifier

==Conclusion==
For more information on the System Overview analysis, click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/platform-analysis-group/system-overview-analysis.html here].
===Versions of the software===
*Standalone VTune Profiler Graphical Interface
*Web Server Interface
===OpenMP Prefix Scan===
The OpenMP solution computes an exclusive prefix scan; an excerpt of its driver code is shown below.
====Code====
<source>
// initial values for testing
const long long int N = 9;
const long long int in_[N]{ 3, 1, 7, 0, 1, 4, 5, 9, 2 };

// command line arguments - none for testing, 1 for large arrays
long long int n, nt{ 1 };
if (argc == 1) {
    n = N;
} else {
    n = 1LL << std::atoi(argv[1]);
    if (n < N) n = N;
}
long long int* in = new long long int[n];
long long int* out = new long long int[n];

// initialize

// Exclusive Prefix Scan
ts = std::chrono::steady_clock::now();
scan<long long int, decltype(add)>(in, out, n, add, (long long int)0);
te = std::chrono::steady_clock::now();
</source>
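The scan template called here is not part of this excerpt. Below is a minimal sketch of what an exclusive OpenMP scan with this signature might look like (an assumption for illustration, not the original course code): one pass builds per-thread partial sums, a single construct scans those sums serially, and a final pass has each thread write the prefixes of its own chunk, which matches the single and barrier behaviour discussed under Performance below.
<source>
// Hypothetical sketch of the scan template called above - assumes a
// three-phase per-thread partial-sum approach, not the original code.
#include <algorithm>
#include <vector>
#include <omp.h>

template <typename T, typename C>
void scan(const T* in, T* out, long long int n, C combine, T identity) {
    int nt = omp_get_max_threads();
    std::vector<T> partial(nt + 1, identity);
    #pragma omp parallel num_threads(nt)
    {
        int tid = omp_get_thread_num();
        long long int chunk = (n + nt - 1) / nt;
        long long int lo = tid * chunk;
        long long int hi = std::min(n, lo + chunk);
        // phase 1: each thread reduces its own chunk
        T sum = identity;
        for (long long int i = lo; i < hi; ++i)
            sum = combine(sum, in[i]);
        partial[tid + 1] = sum;
        #pragma omp barrier
        // phase 2: one thread scans the per-thread sums (serial step)
        #pragma omp single
        for (int t = 1; t <= nt; ++t)
            partial[t] = combine(partial[t - 1], partial[t]);
        // implicit barrier at the end of single
        // phase 3: each thread writes the exclusive prefixes of its chunk
        T acc = partial[tid];
        for (long long int i = lo; i < hi; ++i) {
            out[i] = acc;
            acc = combine(acc, in[i]);
        }
    }
}
</source>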
====Performance====
As can be seen from the screenshot below, in the OpenMP solution the work is spread unevenly across the 8 threads. This can be explained by the fact that the first thread is responsible for initializing the arrays and for the single construct. There is also a lot of idle time due to the barrier construct. The prefix scan itself, however, seems to be spread almost evenly and falls into the average optimal CPU utilization range.
[[File:OMP_Scan.png]]
 
===TBB Prefix Scan===
Finally, the TBB solution uses tbb::parallel_scan with a Body functor, as well as auto partitioning (or an explicit grain size when one is supplied on the command line).
====Code====
<source>
// Iurii Kondrakov
// TBB_Main.cpp
// 2021.12.07
 
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <tbb/tbb.h>
#include <tbb/parallel_scan.h>

template<typename T, typename C>
class Body {
    T accumul_;
    const T* const in_;
    T* const out_;
    const T identity_;
    const C combine_;

public:
    Body(T* out, const T* in, T i, C c) :
        accumul_(i), in_(in), out_(out),
        identity_(i), combine_(c) {}

    // splitting constructor: the body for the right subrange starts
    // from the identity, not from the partially accumulated value
    Body(Body& src, tbb::split) :
        accumul_(src.identity_), in_(src.in_), out_(src.out_),
        identity_(src.identity_), combine_(src.combine_) {}

    // called with pre_scan_tag to accumulate a subrange's sum, then
    // with final_scan_tag to write out the exclusive prefixes
    template<typename Tag>
    void operator() (const tbb::blocked_range<T>& r, Tag) {
        T temp = accumul_;
        for (auto i = r.begin(); i < r.end(); i++) {
            if (Tag::is_final_scan())
                out_[i] = temp;
            temp = combine_(temp, in_[i]);
        }
        accumul_ = temp;
    }

    T get_accumul() {
        return accumul_;
    }

    // merge the summary of the left subrange into this accumulator
    void reverse_join(Body& src) {
        accumul_ = combine_(accumul_, src.accumul_);
    }

    // copy the final summary back into the original body
    void assign(Body& src) {
        accumul_ = src.accumul_;
    }
};


// report system time
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
    std::cout << msg << " - took - " <<
        ms.count() << " milliseconds" << std::endl;
}

int main(int argc, char** argv) {
    if (argc > 3) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << "\n";
        std::cerr << "Usage: " << argv[0] << " power_of_2\n";
        std::cerr << "Usage: " << argv[0] << " power_of_2 grainsize\n";
        return 1;
    }

    unsigned grainsize{ 0 };

    if (argc == 3) {
        grainsize = (unsigned)std::atoi(argv[2]);
        std::cout << "TBB Prefix Scan - grainsize = "
            << grainsize << std::endl;
    } else {
        std::cout << "TBB Prefix Scan - auto partitioning" << std::endl;
    }

    // initial values for testing
    const int N = 9;
    const int in_[N]{ 3, 1, 7, 0, 1, 4, 5, 9, 2 };

    // command line arguments - none for testing, 1 for large arrays
    int n;
    if (argc == 1) {
        n = N;
    }
    else {
        n = 1 << std::atoi(argv[1]);
        if (n < N) n = N;
    }
    int* in = new int[n];
    int* out = new int[n];

    // initialize
    for (int i = 0; i < N; i++)
        in[i] = in_[i];
    for (int i = N; i < n; i++)
        in[i] = 1;
    auto add = [](int a, int b) { return a + b; };

    // Exclusive Prefix Scan
    std::chrono::steady_clock::time_point ts, te;
    ts = std::chrono::steady_clock::now();
    Body<int, decltype(add)> body(out, in, 0, add);
    if (argc == 3)
        tbb::parallel_scan(tbb::blocked_range<int>(0, n, grainsize), body);
    else
        tbb::parallel_scan(tbb::blocked_range<int>(0, n), body);
    te = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++)
        std::cout << out[i] << ' ';
    std::cout << out[n - 1] << std::endl;
    reportTime("Exclusive Scan", te - ts);

    delete[] in;
    delete[] out;
}
</source>
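Run without arguments, the program should print the exclusive prefixes of the nine test values followed by the last element again, that is, 0 3 4 11 11 12 16 21 30 30, and then the elapsed time reported by reportTime.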
 
====Performance====
As can be seen from the screenshot below, there is a lot of overhead due to tbb::parallel_scan scheduling. Additionally, most of the work is done by thread 1, which can be explained by the fact that the array is still initialized serially. The solution could be optimized by choosing a proper grain size, which was the first suggestion VTune gave.
 
[[File:TBB_Scan.png]]
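Since VTune attributes the imbalance to the serial initialization, one possible follow-up (a sketch of an alternative, not part of the original solution) is to initialize the input array in parallel as well, for example with tbb::parallel_for:
<source>
// Hypothetical tweak: replace the two serial initialization loops in
// main() with a tbb::parallel_for so the writes are spread across
// worker threads (tbb/tbb.h, already included, provides parallel_for).
tbb::parallel_for(tbb::blocked_range<int>(0, n),
    [&](const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i < r.end(); ++i)
            in[i] = (i < N) ? in_[i] : 1;
    });
</source>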
== Sources ==