# Team Lion F2017

# Group Members

**Intel Parallel Studio VTune Amplifier**

## What is VTune Amplifier?

- A tool created by Intel for analyzing the performance of software.
- Offers both a GUI and a command-line version on Windows and Linux.
- On OS X, only the GUI is available.
- Basic features are available on both Intel and AMD processors; advanced features require an Intel processor.

## How to use it?

- Available as a standalone product or as part of the following packages:
  - Intel Parallel Studio XE Cluster Edition and Professional Edition
  - Intel Media Server Studio Professional Edition
  - Intel System Studio

VTune Amplifier can be run on a local machine.

## Hotspots

### Basic hotspot analysis

We used our Workshop 6 code as an example to demonstrate this aspect of Intel VTune Amplifier.

The image above shows the timings for each function:

- `matmul_0`: serial version
- `matmul_1`: serial version with the j and k loops reversed
- `matmul_2`: uses `cilk_for`
- `matmul_3`: uses `cilk_for` and a reducer hyperobject
- `matmul_4`: uses `cilk_for`, a reducer, and vectorization
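As a rough cross-check on the hotspot timings, each function can also be timed by hand. The following is a minimal sketch, assuming the `matmul_0` signature used on this page. `matmul_serial`, `MatMul`, and `time_ms` are hypothetical names introduced for illustration; they are not part of VTune or of the workshop code.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Serial reference multiply with the same signature as matmul_0 on this page.
// Returns the trace (sum of the diagonal) of c = a * b.
double matmul_serial(const double* a, const double* b, double* c, int n) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}

// Any function with the workshop's matmul signature.
using MatMul = double (*)(const double*, const double*, double*, int);

// Hypothetical helper: runs f once on n x n matrices filled with constants
// and returns the elapsed wall-clock time in milliseconds.
// The trace computed by f is stored in *trace.
double time_ms(MatMul f, int n, double* trace) {
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    auto start = std::chrono::steady_clock::now();
    *trace = f(a.data(), b.data(), c.data(), n);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_ms` with each `matmul_*` function in turn gives one wall-clock figure per function. A profiler attributes time per function and per source line, so the two sets of numbers should agree only approximately.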

## Parallelism

### Concurrency

- Best for visualizing thread parallelism on the available cores, finding areas of high or low concurrency, and identifying serial bottlenecks in your code.
- Shows how many threads were running at each moment during application execution.
- Counts threads that are running or ready to run, that is, threads not waiting at a defined waiting or blocking API.
- Also shows the CPU time spent while each hotspot was executing and estimates how effectively it ran, either by CPU usage or by thread concurrency.
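To make the idea of high and low concurrency concrete, the sketch below uses standard `std::thread` rather than Cilk Plus, which not every compiler supports. It has a parallel phase followed by a serial phase. `matmul_rows` and `matmul_threads` are hypothetical names, and the row-block partitioning is an assumption chosen for simplicity, not the scheduling that `cilk_for` uses.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Multiplies rows [row_begin, row_end) of a by b into c.
// All matrices are n x n and stored row-major.
void matmul_rows(const double* a, const double* b, double* c, int n,
                 int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Parallel phase: up to num_threads workers, each owning a disjoint block
// of rows, so no synchronization is needed while multiplying.
// Serial phase: the diagonal sum runs on the calling thread only.
double matmul_threads(const double* a, const double* b, double* c, int n,
                      int num_threads) {
    assert(n > 0 && num_threads > 0);
    std::vector<std::thread> workers;
    int chunk = (n + num_threads - 1) / num_threads;  // rows per worker, rounded up
    for (int t = 0; t < num_threads; t++) {
        int begin = std::min(n, t * chunk);
        int end = std::min(n, begin + chunk);
        if (begin < end)
            workers.emplace_back(matmul_rows, a, b, c, n, begin, end);
    }
    for (auto& w : workers)
        w.join();  // every row of c is complete after this loop

    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}
```

In a Concurrency result for a program like this, the multiply phase should appear with several threads running at once, while the diagonal sum should appear as a stretch with a single running thread.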

#### Results of Concurrency tests on Workshop 6

I ran the Concurrency analysis on each of the functions in Workshop 6. To see how each one performs on its own, I isolated it by commenting out all the others and ran them one by one. Finally, I ran them all together to see how the program performs overall.

#### matmul_0 (Serial)

    double matmul_0(const double* a, const double* b, double* c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
        }
        double diag = 0.0;
        for (int i = 0; i < n; i++)
            diag += c[i * n + i];
        return diag;
    }

#### matmul_1 (Serial with j-k loops reversed)

    double matmul_1(const double* a, const double* b, double* c, int n) {
        for (int i = 0; i < n; i++) {
            // Clear row i of c before accumulating into it
            for (int j = 0; j < n; j++)
                c[i * n + j] = 0.0;
            // i-k-j order: the inner loop walks b and c with unit stride
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
        }
        double diag = 0.0;
        for (int i = 0; i < n; i++)
            diag += c[i * n + i];
        return diag;
    }

#### matmul_2 (Cilk Plus with cilk_for)

    #include <cilk/cilk.h>

    double matmul_2(const double* a, const double* b, double* c, int n) {
        // Both outer loops run in parallel; each (i, j) writes a distinct element of c
        cilk_for (int i = 0; i < n; i++) {
            cilk_for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        // The diagonal sum is still serial
        double diag = 0.0;
        for (int i = 0; i < n; i++)
            diag += c[i * n + i];
        return diag;
    }

#### matmul_3 (+reducer)

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    double matmul_3(const double* a, const double* b, double* c, int n) {
        cilk_for (int i = 0; i < n; i++) {
            cilk_for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        // The reducer lets the parallel loop add to diag without a data race
        cilk::reducer_opadd<double> diag(0.0);
        cilk_for (int i = 0; i < n; i++) {
            diag += c[i * n + i];
        }
        return diag.get_value();
    }

#### matmul_4 (+vectorization)

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    double matmul_4(const double* a, const double* b, double* c, int n) {
        cilk_for (int i = 0; i < n; i++) {
            cilk_for (int j = 0; j < n; j++) {
                double sum = 0.0;
                // sum is accumulated across iterations, so declare it as a reduction
                #pragma simd reduction(+:sum)
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        cilk::reducer_opadd<double> diag(0.0);
        cilk_for (int i = 0; i < n; i++) {
            diag += c[i * n + i];
        }
        return diag.get_value();
    }

#### Final test with all functions

### Locks & Waits

- Best for locating causes of low concurrency, such as heavily used locks and large critical sections.
- Flags locks on which threads wait too long, that is, heavily contended synchronization objects.
- Uses user-mode sampling and tracing collection to identify where threads are waiting.
- Shows the time threads spend waiting on synchronization objects.
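The kind of code this analysis is aimed at can be sketched in standard C++. The two functions below compute the same sum; the first takes a mutex on every addition, while the second takes it once per thread. `sum_contended` and `sum_local` are hypothetical names, and the example illustrates lock contention in general rather than reproducing any VTune output.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Contended version: every addition takes the same mutex, so the threads
// spend much of their time waiting for each other.
long sum_contended(const std::vector<int>& data, int num_threads) {
    assert(num_threads > 0);
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&, t] {
            // Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::lock_guard<std::mutex> lock(m);
                total += data[i];
            }
        });
    }
    for (auto& w : workers)
        w.join();
    return total;
}

// Lower-contention version: each thread accumulates a private partial sum
// and takes the mutex only once, to publish its result.
long sum_local(const std::vector<int>& data, int num_threads) {
    assert(num_threads > 0);
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&, t] {
            long local = 0;
            for (std::size_t i = t; i < data.size(); i += num_threads)
                local += data[i];
            std::lock_guard<std::mutex> lock(m);
            total += local;
        });
    }
    for (auto& w : workers)
        w.join();
    return total;
}
```

Under a Locks & Waits run, the mutex in `sum_contended` would be expected to show far more wait time than the one in `sum_local`, which is the pattern described above as a heavily used lock.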

## References

https://en.wikipedia.org/wiki/VTune

https://software.intel.com/en-us/get-started-with-vtune

https://software.intel.com/en-us/vtune-amplifier-help-analysis-types

https://software.intel.com/en-us/vtune-amplifier-help-basic-hotspots-analysis

https://software.intel.com/en-us/vtune-amplifier-help-advanced-hotspots-analysis

https://software.intel.com/en-us/vtune-amplifier-help-concurrency-analysis

https://software.intel.com/en-us/vtune-amplifier-help-locks-and-waits-analysis