Difference between revisions of "GPU621/Code"

From CDOT Wiki

Revision as of 00:31, 28 November 2018

Debugging Threads in Intel Parallel Studio

Group Members

  1. Corey James
  2. Guozhao Liang
  3. Oleksii Kozachenko

Intel® Parallel Studio XE

Intel Parallel Studio XE is a software development tool suite for compiling applications and optimizing performance with less effort.

Intel provides the following tools for thread debugging:

  • Intel Debugger Extension
  • Intel® Advisor
  • Intel® Inspector
  • Intel® VTune™ Amplifier

Let's take a quick look at each of them:

  • Intel Debugger Extension

IPSXE setup.png

Debugger Extension is not supported in VS 2017.

VS support.jpg

Intel® Advisor

Vectorization optimization and thread prototyping.

Use this tool in the vectorization and threading stages of the flow.

  • Identify where it is safe to force compiler vectorization.
  • Quickly find what's blocking vectorization capabilities.
  • Use Threading Advisor to fast-track threading design.
  • Delivers the data and tips you need to make faster design and optimization decisions.

Intel® Inspector

Memory and thread debugger.

Use this tool to find races, deadlocks, and illegal memory accesses.

  • Locate root cause errors early―before you release
  • Quickly debug intermittent races and deadlocks

Intel® VTune™ Amplifier

Performance profiler.

Use this tool in the threading and bandwidth optimization stages and for advanced vectorization optimization.

  • Save money: Locate root cause errors early―before you release
  • Save time: Quickly debug intermittent races and deadlocks

Intel Inspector

Intel Inspector is a dynamic memory and threading error checking instrument to inspect serial and multi-threaded programs.

Intel Inspector comes with Intel Parallel Studio XE along with two other debugging tools - VTune and Advisor.

Create a project

There are two ways to work with Inspector.

  • Run inspector directly from Visual Studio

This is the easiest and fastest way, and it requires no additional configuration.

Inspector VS.PNG

  • Run as a separate program.

Working with the standalone Intel Inspector application requires passing it a compiled version of your program. Additionally, you may need to link some libraries (lib, dll, etc.).

Inspector app.PNG

Configure a project

Intel suggests using small data sets and loading threads with small chunks of work.

This reduces the run time of the program and speeds up the analysis.
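One way to follow this advice is to make the problem size a run-time parameter, so that the same build can be started with a small data set when it runs under Inspector. The sketch below is only an illustration: `parseProblemSize` and its parameters are hypothetical names and are not part of Inspector or of the lab code.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: returns the problem size given as the first
// command-line argument, or defaultSize when no usable value is present.
int parseProblemSize(int argc, const char* const* argv, int defaultSize) {
    assert(defaultSize > 0);
    if (argc < 2 || argv[1] == nullptr) {
        return defaultSize;
    }
    char* end = nullptr;
    const long value = std::strtol(argv[1], &end, 10);
    // Reject non-numeric input, trailing characters, and values that are
    // non-positive or unreasonably large for an analysis run.
    if (end == argv[1] || *end != '\0' || value <= 0 || value > 1000000) {
        return defaultSize;
    }
    return static_cast<int>(value);
}
```

A `main` function could call `parseProblemSize(argc, argv, 1024)` and the analysis run could then pass a much smaller value, such as 64, as the program argument.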

Choose analysis type

Inspector allows you to choose between predefined types of analysis.

Inspector 03 Threading analysis startpage.PNG

Choose the type of Analysis using a drop-down menu (2):

  • Memory error analysis
    • Detect leaks
    • Detect Memory problems
    • Locate memory problems
  • Threading error analysis
    • Detect deadlocks
    • Detect deadlocks and data races
    • Locate deadlocks and data races
  • Custom analysis types - users can create their own types based on selected preset type.

Use the slider (3) to select a predefined analysis type.

Types at the top have a smaller scope but run faster.

Types at the bottom have a larger scope but are considerably slower.

Edit the settings of the analysis in the settings section (4).
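To make the "deadlock" category more concrete, here is a minimal sketch of the classic pattern a threading analysis looks for: two threads taking the same two locks in opposite order. The `Account` type and both `transfer` functions are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical type used only for this illustration.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Deadlock-prone pattern (shown for contrast): if one thread runs
// transferUnsafe(a, b, x) while another runs transferUnsafe(b, a, y),
// each can hold one mutex and wait forever for the other.
void transferUnsafe(Account& from, Account& to, long amount) {
    std::lock_guard<std::mutex> first(from.m);
    std::lock_guard<std::mutex> second(to.m);
    from.balance -= amount;
    to.balance += amount;
}

// Safer variant: std::lock acquires both mutexes using a deadlock-avoidance
// algorithm, and the guards adopt the locks that are already held.
void transferSafe(Account& from, Account& to, long amount) {
    assert(&from != &to);  // locking the same mutex twice is undefined behaviour
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> first(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> second(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}
```

The unsafe version only hangs when the two opposite transfers happen to interleave, which is why this kind of bug is intermittent and easier to find with a tool than by testing.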

How it works

Inspector performs the analysis in multiple steps:

1. Executes the program.

2. Identifies problems that may need to be resolved.

3. Gathers the problems.

4. Converts symbol information into filenames and line numbers.

5. Applies suppression rules.

6. Removes duplicates.

7. Creates problem sets.

8. Opens a debugging session.

Interpreting results

After the analysis completes, Inspector shows the results on several pages:

  • Collection Log

Gives general information about the execution of the program.

From there you can see the execution time, the number of threads, the caller of each thread, and whether the threads were active.

Inspector 03 Threading analysis CollectionLog.PNG

  • Summary

The Summary window is divided into 4 parts:

1) Problems section.

It shows the problems (if any were found) that we asked Inspector to look for.

For each problem it gives the name, the source file where it is located, the executable module that contains it, and its state (which changes when you do a rescan).

2) Filters (On the right side of the Problems section)

Gives a summary of all problems (source files affected, total number of problems by type, etc.).

3) Code locations

When we select a problem, Code Locations shows a preview of the source file and highlights the line on which the problem was detected.

It also shows the operation that was performed (Read, Write), including thread operations.

Source files can be opened and edited directly from Inspector by double-clicking the problem.

4) Timeline

Shows the threads that are involved at a given step.

It contains thread and timeline information for all code locations in one or all occurrences of the problem(s) highlighted in the Problems pane.

Inspector 03 Threading analysis 02 dataRace .PNG

Memory leak problem : https://www.codeproject.com/Tips/1184749/Allocating-Memory-in-C-Cplusplus-How-to-Avoid-Memo
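As a small illustration of what a memory error analysis reports, the sketch below contrasts a function that leaks a heap buffer with one that releases it automatically. Both function names are hypothetical, and the example is not taken from the linked article.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>

// Leaky version (shown for contrast): the buffer allocated with new[] is
// never released, which a leak detection run would report.
long long sumSquaresLeaky(std::size_t n) {
    long long* squares = new long long[n];
    for (std::size_t i = 0; i < n; ++i) {
        squares[i] = static_cast<long long>(i) * static_cast<long long>(i);
    }
    return std::accumulate(squares, squares + n, 0LL);  // missing delete[] squares
}

// Fixed version: std::unique_ptr<long long[]> calls delete[] when it goes
// out of scope, on every return path.
long long sumSquaresFixed(std::size_t n) {
    assert(n <= 1000000);  // keeps the sum well inside the range of long long
    std::unique_ptr<long long[]> squares(new long long[n]);
    for (std::size_t i = 0; i < n; ++i) {
        squares[i] = static_cast<long long>(i) * static_cast<long long>(i);
    }
    return std::accumulate(squares.get(), squares.get() + n, 0LL);
}
```

Both functions return the same value. Only the first one should show up as a leak in the Problems section.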

  • Source

Shows the source file and the call stack for each of the threads involved.

Inspector 03 Threading analysis 02 dataRace source .PNG
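The screenshots above come from a data race. For readers without the images, here is a minimal sketch of the kind of code that produces such a report, together with one possible fix. It assumes OpenMP, as used elsewhere on this page. The function names are hypothetical, and this is not the code that was analyzed in the screenshots.

```cpp
#include <vector>

// Racy version (shown for contrast): when built with OpenMP, all threads
// read and write `count` without synchronisation, so updates can be lost.
int countEvenRacy(const std::vector<int>& values) {
    int count = 0;
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (values[i] % 2 == 0) {
            ++count;  // unsynchronised read-modify-write on shared data
        }
    }
    return count;
}

// Fixed version: the reduction clause gives every thread a private copy of
// `count` and adds the copies together at the end of the loop.
int countEvenFixed(const std::vector<int>& values) {
    int count = 0;
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for reduction(+ : count)
    for (int i = 0; i < n; ++i) {
        if (values[i] % 2 == 0) {
            ++count;
        }
    }
    return count;
}
```

If OpenMP is not enabled in the build, the pragmas are ignored and both versions run serially, so the race only shows up in an OpenMP build.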

Intel VTune™ Amplifier

About VTune

Intel VTune Amplifier is an analysis tool that lets you measure the performance of your serial or multithreaded program.

VTune allows you to analyze the performance of your algorithms and multithreading. It can help with debugging threads by calculating overhead and finding bottlenecks or inefficiencies.

VTune can be run as a standalone program or integrated with Visual Studio.

Starting VTune

When you hover over Intel VTune Amplifier 2019 in the Tools menu, more options appear.

Select the Configure Analysis option.


To see the different analyses that can be run with VTune, click on the three little dots circled in the picture below.


The menu that appears contains the different tests that you can run against your program.

Startup options.PNG


The example comes from the matrix multiplication exercise we did in Lab 3. It contains two versions of the “matMul” function.

Change the value of the version macro inside "MatMul.cpp" to run the different versions.

  • Version 1 – the matrix multiplication logic has been put inside a parallel for statement
  • Version 2 – the matrix multiplication logic is still inside the parallel for statement, but it is being dynamically scheduled and certain variables are selected to be private or shared.
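The difference between the two versions can be sketched as follows. This is my own minimal reconstruction from the descriptions above and not the code from the linked repository. The names `matMulV1` and `matMulV2`, the flat row-major storage, and the exact clause lists are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Version 1 (sketch): the multiplication sits inside a plain parallel for,
// so the implementation's default schedule decides how rows are divided.
std::vector<double> matMulV1(const std::vector<double>& a,
                             const std::vector<double>& b, int n) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<double> c(a.size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Version 2 (sketch): same logic, but rows are handed out dynamically and
// the data-sharing of each variable is stated explicitly.
std::vector<double> matMulV2(const std::vector<double>& a,
                             const std::vector<double>& b, int n) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<double> c(a.size(), 0.0);
    const double* pa = a.data();
    const double* pb = b.data();
    double* pc = c.data();
    int j, k;
    #pragma omp parallel for schedule(dynamic) shared(pa, pb, pc, n) private(j, k)
    for (int i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            double sum = 0.0;  // declared inside the loop, so private per thread
            for (k = 0; k < n; ++k) {
                sum += pa[i * n + k] * pb[k * n + j];
            }
            pc[i * n + j] = sum;
        }
    }
    return c;
}
```

Both functions compute the same product. Only the way iterations are distributed across threads differs, and that difference is what the threading analysis makes visible.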

Example Code: https://github.com/coreyjjames/CoreyJJames/tree/Lab3_VTune_Example

To run the example code, copy the code into Visual Studio and build it.

- Run the program with the VTune threading analysis.

- The point of interest is under the Platform tab.

- In version 1 you will notice that some of the threads finish before others. The work is not being spread evenly.

- In version 2 that issue is resolved and all the threads end at the same time. When I ran version 2, the run time improved by about 0.6 s.
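Why dynamic scheduling can even out the threads is easier to see with a toy model. The sketch below does not measure anything and does not model the lab program. It assumes a hypothetical loop in which iteration i costs i + 1 units of work. It then compares a block split (roughly what a static schedule does) with handing each iteration to the least-loaded thread (an idealised view of a dynamic schedule with chunk size 1). Both function names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Work per thread when the iterations are split into contiguous blocks of
// (nearly) equal length. Iteration i is assumed to cost i + 1 units.
std::vector<long> staticLoad(int iterations, int threads) {
    assert(iterations >= 0 && threads > 0);
    std::vector<long> load(static_cast<std::size_t>(threads), 0L);
    for (int t = 0; t < threads; ++t) {
        const long long begin = static_cast<long long>(t) * iterations / threads;
        const long long end = static_cast<long long>(t + 1) * iterations / threads;
        for (long long i = begin; i < end; ++i) {
            load[static_cast<std::size_t>(t)] += static_cast<long>(i + 1);
        }
    }
    return load;
}

// Work per thread when each iteration, in order, goes to the thread that
// currently has the least work (ties go to the lowest thread index).
std::vector<long> dynamicLoad(int iterations, int threads) {
    assert(iterations >= 0 && threads > 0);
    std::vector<long> load(static_cast<std::size_t>(threads), 0L);
    for (int i = 0; i < iterations; ++i) {
        std::vector<long>::iterator least = std::min_element(load.begin(), load.end());
        *least += i + 1;
    }
    return load;
}
```

For 8 iterations and 2 threads the block split gives loads of 10 and 26, while the dynamic model gives 16 and 20. The first case resembles the uneven thread end times seen for version 1 in the Platform tab.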

Important notes:

  • Run in Release x64, using OpenMP and the Intel compiler.
  • Turn off optimization so you can see hotspots in the source code.
  • Rebuild after any changes.

Interpreting results

Interpreting the results from VTune will be a different process for your program than for mine.

To be successful, make sure to read through the results and look for anomalies.

Examples of anomalies:

  • Poor utilization of the available threads.
  • Uneven distribution of the work across the threads.
  • High spin or overhead time.
  • Threads waiting for no reason.
  • Hotspots in the code.

VTune navigation bar (Depending on the Analysis):

  • Analysis configuration
    • Main configuration page for VTune
  • Collection log
    • Logs from the analysis
  • Summary
    • Displays relevant information about the analysis
  • Bottom-up
    • Allows you to see the call stack of a function starting from the first call.
  • Caller/Callee
    • Allows you to see details on each function and see callers and callees for each function
  • Top-down tree
    • Shows the call stack of the program as a tree starting from the top.
  • Platform
    • Displays the time and the utilization of each thread.


  • When reviewing results, pay attention to any red flags that are displayed beside results in VTune. If you hover over them, you get more information.
  • Almost everything shown in red means there is a potential problem.
  • OpenMP integration: VTune has native support for OpenMP, and some results are generated specifically for OpenMP.

Intel® Advisor

Intel® Advisor gives software architects and developers the data and analysis tools they need to build well-threaded and vectorized code that exploits modern hardware capabilities.

Create a project

First of all, you need to have Intel Parallel Studio XE installed on your machine. It gives you access to tools such as Intel Advisor, VTune Amplifier, and Inspector at the top of your Visual Studio window.

Let's start by creating a project in Visual Studio. I am using sample code from the Intel Advisor folder, which provides several samples for testing its functionality. They can be found under <Your-Installed-Directory>\IntelSWTools\Advisor 2019\samples\en. I am using the "nqueens_Advisor" one. After choosing your project, build the solution and look for the icon shown in the image below.


Configure the project

Before starting to work with Intel Advisor, I encourage you to set the optimization level of your project to /O2, since it speeds up the analysis in Intel Advisor. The setting can be found under Debug > Properties > C/C++ > Optimization > Optimization > Maximize Speed (/O2).

On the left side of the Intel Advisor window you can now see the Vectorization Workflow and the Threading Workflow. This guide covers debugging threads with this tool, so click on Threading Workflow.


How it works

To debug threads we have to run two analyses in the Threading Workflow pane of Intel Advisor: Survey Target and Check Dependencies.

  • Survey Target reports how the threads behave and how much run time the different calls take, so that you can try to improve it.
  • Check Dependencies reports the errors and problems that threads cause, so that you know where they are and can fix them.

The first step is to collect the Survey Target report by clicking "Collect".


After the data for the report has been collected, the following window appears. There are three important things to pay attention to:


1 - This pane lists the most important functions and loops, the ones that took a large part of the total program run time. It gives some information about each of them along with suggestions on how to improve it.

2 - This menu bar has several options, but the two most important are Source and Top Down.

  • Source shows the actual code of the function call or loop selected in pane 1
  • Top Down shows the run time of each call and loop in your program, which lets you look at each of them in more detail

Top Down.png

3 - This dropdown list shows all the threads that ran in the program, which is important here since we are focusing on debugging threads.

The second step is running Check Dependencies. To run it, select one of the function calls or loops from the first step in the following window.


Pay attention to the black water drop, which indicates that a loop or function has been selected for annotation. This means that this part of the code will be placed inside an annotation range for the analysis. This is a new feature. Previously we had to go into the source code and add the annotation range around the part of the code that we wanted to analyze.
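For reference, the manual approach mentioned above looks roughly like the sketch below. It uses the `ANNOTATE_SITE_BEGIN`, `ANNOTATE_ITERATION_TASK` and `ANNOTATE_SITE_END` macros from the `advisor-annotate.h` header that ships with Advisor. So that the sketch also compiles on a machine without Advisor, it falls back to empty stand-in macros when the header cannot be found. The function `squareAll` and the site and task names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Use the real Advisor header when it is on the include path; otherwise
// define no-op stand-ins so the sketch still builds.
#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#    define HAVE_ADVISOR_ANNOTATIONS 1
#  endif
#endif
#ifndef HAVE_ADVISOR_ANNOTATIONS
#  define ANNOTATE_SITE_BEGIN(name)
#  define ANNOTATE_ITERATION_TASK(name)
#  define ANNOTATE_SITE_END(...)
#endif

// Hypothetical loop marked as a candidate parallel site: the site covers the
// whole loop and each iteration is modelled as one task.
std::vector<int> squareAll(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    ANNOTATE_SITE_BEGIN(squareSite);
    for (std::size_t i = 0; i < in.size(); ++i) {
        ANNOTATE_ITERATION_TASK(squareTask);
        out[i] = in[i] * in[i];
    }
    ANNOTATE_SITE_END();
    return out;
}
```

The annotations do not make the loop parallel. The program still runs serially, and the annotations only tell Advisor which region to treat as if it were parallel during Check Dependencies.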

After selecting the loop or function for annotation, start Check Dependencies.


Check Dependencies takes a bit longer than simply running the program, so be patient. When it is done, the following window should appear.

No lock.png

1 - This pane displays the loop or function selected for annotation and tells us the number of errors it contains.

2 - This is a very important pane, since it lists in detail all the errors that need to be fixed.

3 - In this case I selected the P6 error. The pane tells me which function the error is in and which variable is affected.

Fixing the problem

The Check Dependencies report identified several problems. Let's try to fix one of them.

In the "Dependencies Report - Problems and Messages" window, right-click any data row and choose Edit Source to open the source file. It takes you directly to the code of the selected problem.

In my case I selected error P6 from before. The error occurs because multiple tasks may try to increment nrOfSolutions at the same time. Adding a lock around the increment prevents simultaneous access to the variable, so that the tasks update it one at a time, as follows.
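The screenshot with the actual change is not reproduced here, so the following is a hedged sketch of what such a fix can look like. It assumes the lock is expressed with Advisor's `ANNOTATE_LOCK_ACQUIRE` and `ANNOTATE_LOCK_RELEASE` annotations around the increment of `nrOfSolutions`. The solver around it is my own simplified N-queens counter and not the code of the Intel sample. The macros fall back to no-ops when `advisor-annotate.h` is not available.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#    define HAVE_ADVISOR_ANNOTATIONS 1
#  endif
#endif
#ifndef HAVE_ADVISOR_ANNOTATIONS
#  define ANNOTATE_LOCK_ACQUIRE(addr)
#  define ANNOTATE_LOCK_RELEASE(addr)
#endif

static int nrOfSolutions = 0;  // shared counter that caused the reported problem

// Places queens row by row; cols[r] is the column of the queen in row r.
static void setQueen(std::vector<int>& cols, int row, int size) {
    if (row == size) {
        // The lock annotations tell Advisor that this update is protected,
        // so modelled tasks no longer conflict on nrOfSolutions.
        ANNOTATE_LOCK_ACQUIRE(0);
        ++nrOfSolutions;
        ANNOTATE_LOCK_RELEASE(0);
        return;
    }
    for (int col = 0; col < size; ++col) {
        bool safe = true;
        for (int r = 0; r < row; ++r) {
            // Same column or same diagonal as an earlier queen.
            if (cols[r] == col || std::abs(cols[r] - col) == row - r) {
                safe = false;
                break;
            }
        }
        if (safe) {
            cols[row] = col;
            setQueen(cols, row + 1, size);
        }
    }
}

// Counts all solutions for a size x size board.
int solve(int size) {
    assert(size > 0);
    nrOfSolutions = 0;
    std::vector<int> cols(static_cast<std::size_t>(size), 0);
    setQueen(cols, 0, size);
    return nrOfSolutions;
}
```

When the code is later converted to real threads, the annotation pair has to be replaced by an actual synchronisation construct from the chosen threading model, for example a critical section or an atomic update in OpenMP.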


After adding the lock, we build the solution again and rerun the two analyses following the previous steps. This time the Check Dependencies report has changed.


The previous error has disappeared. Notice that previously there were 2 RAW (read-after-write) errors and now there is just one. We can also verify in the "Dependencies Report - Problems and Messages" window that the previous problem is no longer listed, which means it has been fixed.