
GPU621/VTuners

== Microarchitecture and Memory Bottlenecks ==
=== Identifying Significant Hardware Issues ===
[[File:Top-Down Analysis Method.png|frame|400px|Microarchitecture Exploration Summary: the different functions used throughout the application and their respective performance, shown through microarchitecture exploration metrics such as the percentage of Front-End Bound and Back-End Memory Bound.]]
==== Main Benefits of the Microarchitecture and Memory Modules ====
The Intel VTune Profiler lets you use Microarchitecture Exploration analysis to improve the performance of your applications by pinpointing hardware issues. It can also identify memory-access-related problems, including cache misses and high-bandwidth problems.

==== Top-down Microarchitecture Analysis ====
The Intel VTune Profiler includes a tool that conducts Microarchitecture Exploration (ME) analysis using events collected during the top-down characterization, allowing users to pinpoint hardware issues in an application. Microarchitecture Exploration also records other metrics important to performance, which are displayed in the Microarchitecture Exploration viewpoint. Using the hotspot analysis from the algorithm optimization section, we can identify areas where our code takes a lot of CPU time to run. This lets us pinpoint where to apply the ME analysis tool and determine how efficiently the code runs through the core pipeline. ME analysis instructs the VTune Profiler to collect a list of events and derives metrics from them that make performance issues at the hardware level easier to identify.
== Accelerators and XPUs ==
=== Why XPUs? ===
Computing has irreversibly become heterogeneous, driven by the fast-growing development of applications such as machine learning, video editing, and gaming. This means specialized hardware architectures are now preferred over general-purpose hardware; typical examples are the separation of GPUs from CPUs and the use of FPGAs. Among these, the GPU has become critical for compute-intensive applications. It is a highly parallelized machine with many smaller processing cores that work together. While single-core serial performance on a GPU is much slower than on a CPU, applications can take advantage of the massive parallelism available in a GPU. The growth of heterogeneous computing has also led developers to discover that different types of workloads perform best on different GPU hardware architectures. Intel VTune Profiler therefore lets us evaluate and analyze the overhead of offloading work onto an Intel GPU. This feature offers three analysis types:
=== GPU Offload ===
Explore code execution on various CPU and GPU cores on your platform, estimate how your code benefits from offloading to the GPU, and identify whether your application is CPU or GPU bound.
 
=== GPU Compute/Media Hotspot (preview) ===
Analyze the most time-consuming GPU kernels, characterize GPU utilization based on GPU hardware metrics, identify performance issues caused by memory latency or inefficient kernel algorithms, and analyze GPU instruction frequency per certain instruction types.
 
=== CPU/FPGA interaction ===
Analyze CPU/FPGA interaction issues through these ways:
 
1. Focus on the kernels running on the FPGA.
 
2. Identify the most time-consuming kernels.
 
3. Look at the corresponding metrics on the device side (like Occupancy or Stalls).
 
4. Correlate with CPU and platform profiling data.
== Parallelism ==
{| class="wikitable"
| Parallelism Pattern || OpenMP, OpenMP-MPI, TBB
|}
 
[[File:Vtune Roadmap.png|400px|frame]]
'''Top Waiting Objects''': the Top Waiting Objects section provides a table listing the objects that spent the most time waiting in the application. Waits can be caused by function calls or by synchronization. The higher the wait time, the greater the reduction in parallelism.
[[File:Vtune Roadmap-2.png|frame|400px]]
[[File:Effective-gpu.png|500px]]
By default, the Profiler takes a snapshot of the whole application, but it can also be directed to focus on a particular area within the application. It provides a general program overview while highlighting specific problematic areas, which can then be analyzed further to improve performance.
== VTune Profiler Coding Exercise in Practice ==
The following is the code we used to test out the features of the VTune Profiler. The code is produced by Microsoft and demonstrates how to convert a basic OpenMP loop to use the Concurrency Runtime. The code itself computes the number of prime numbers found in an array of randomly generated numbers.
<pre>
// concrt-omp-count-primes.cpp
// compile with: /EHsc /openmp
#include <ppl.h>
#include <random>
#include <array>
#include <iostream>

using namespace concurrency;
using namespace std;

// Determines whether the input value is prime.
bool is_prime(int n)
{
   if (n < 2)
      return false;
   for (int i = 2; i < n; ++i)
   {
      if ((n % i) == 0)
         return false;
   }
   return true;
}

// Uses OpenMP to compute the count of prime numbers in an array.
void omp_count_primes(int* a, size_t size)
{
   if (size == 0)
      return;

   size_t count = 0;
   #pragma omp parallel for
   for (int i = 0; i < static_cast<int>(size); ++i)
   {
      if (is_prime(a[i]))
      {
         #pragma omp atomic
         ++count;
      }
   }

   wcout << L"found " << count << L" prime numbers." << endl;
}

// Uses the Concurrency Runtime to compute the count of prime numbers in an array.
void concrt_count_primes(int* a, size_t size)
{
   if (size == 0)
      return;

   combinable<size_t> counts;
   parallel_for<size_t>(0, size, [&](size_t i)
   {
      if (is_prime(a[i]))
      {
         counts.local()++;
      }
   });

   wcout << L"found " << counts.combine(plus<size_t>()) << L" prime numbers." << endl;
}

int wmain()
{
   // The length of the array.
   const size_t size = 1000000;

   // Create an array and initialize it with random values.
   int* a = new int[size];
   mt19937 gen(42);
   for (size_t i = 0; i < size; ++i)
   {
      a[i] = gen();
   }

   // Count prime numbers by using OpenMP and the Concurrency Runtime.

   wcout << L"Using OpenMP..." << endl;
   omp_count_primes(a, size);

   wcout << L"Using the Concurrency Runtime..." << endl;
   concrt_count_primes(a, size);

   delete[] a;
}
</pre>
The output of the code is fairly simple: it reports the number of prime numbers found using the OpenMP and Concurrency Runtime methods, and nothing else.
<pre>
Using OpenMP...
found 107254 prime numbers.
Using the Concurrency Runtime...
found 107254 prime numbers.
</pre>

==References==
*[https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html#gs.ki4qoa Intel® VTune™ Profiler]
*[https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html Code Tuning Methods]
*[https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/tuning-recipes/false-sharing.html Profile a Memory-Bound Application]
*[https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/analyzing-hot-code-paths-using-flame-graphs.html Analyzing Hot Code Paths]
*[https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/algorithm-group/basic-hotspots-analysis.html Analyze Hot Spots]
*[https://learn.microsoft.com/en-us/cpp/parallel/concrt/how-to-convert-an-openmp-parallel-for-loop-to-use-the-concurrency-runtime?view=msvc-170 How to: Convert an OpenMP parallel for Loop to Use the Concurrency Runtime]