GPU621/GPU Targeters
== Progress ==

== Difference of CPU and GPU for parallel applications (Yunseon) ==
[[File:cpuGpu.png|700px|center|]]
== Latest Flagship GPU specs ==

'''AMD RX 6900 XT vs RTX 3090: Specifications''':

[[File:latestGpuSpecs.jpg]]

'''NVIDIA GPU Spec''':

[[File:latestGpuSpecsNvidia2.jpg|900px|]]

'''AMD GPU Spec''':

[[File:latestGpuSpecsAmd.jpg|900px|]]

== Means of parallelisation on GPUs ==

'''CUDA''' (Yunseon)

CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation. CUDA and Nvidia GPUs have been improved together over the past years, and the combination dominates several application areas, including deep learning; it is a foundation for some of the fastest computers in the world. With CUDA version 9.2 and multiple P100 server GPUs, you can realize up to 50x performance improvements over CPUs.

'''HIP''' (Elena)

What is the Heterogeneous-Computing Interface for Portability (HIP)? It is a C++ dialect designed to ease the conversion of CUDA applications to portable C++ code. It provides a C-style API and a C++ kernel language. The C++ interface can use templates and classes across the host/kernel boundary. The HIPify tool automates much of the conversion work by performing a source-to-source transformation from CUDA to HIP. HIP code can run on AMD hardware (through the HCC compiler) or Nvidia hardware (through the NVCC compiler) with no performance loss compared with the original CUDA code.

[More information: https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf]

'''OpenCL (Open Computing Language)''' (Nathan)

What is OpenCL? It is a framework for developing programs that can execute across a wide variety of heterogeneous platforms. AMD, Intel and Nvidia GPUs support version 1.2 of the specification, as do x86 CPUs and other devices (including FPGAs and DSPs). OpenCL provides a C run-time API and a C99-based kernel language.

When to use OpenCL: use it when you have existing code in that language or when you need portability across multiple platforms and devices. It runs on Windows, Linux and Mac OS, as well as a wide variety of hardware platforms (described above).

[OpenCL vs OpenMP performance: https://stackoverflow.com/questions/7263193/opencl-vs-openmp-performance#7263823]

'''OpenMP (Open MultiProcessing)''' (Elena)

OpenMP is a parallel programming model based on compiler directives which allows application developers to incrementally add parallelism to their application code. The OpenMP API specification provides an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++ and Fortran, on most platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior (see the sketch below).

Benefits of OpenMP. Why choose it over a GPU kernel model?
* supports multi-core, vectorization and GPU
* allows for "teams of threads"
* portable between various platforms
* heterogeneous memory allocation and custom data mappers

[More information (comparing OpenMP syntax with CUDA, HIP and others): https://github.com/ROCm-Developer-Tools/aomp/blob/master/docs/openmp_terms.md]
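To illustrate the "directives + library routines + environment variables" model described above, here is a minimal host-side sketch (not from the original notes; GPU offloading with OpenMP is covered in detail in a later section):

<pre>
// File: ompBasics.c
// Build: clang -fopenmp ompBasics.c -o ompBasics   (or gcc -fopenmp)
// Run:   OMP_NUM_THREADS=4 ./ompBasics              <- environment variable
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel                  // compiler directive: fork a team of threads
    {
        int tid = omp_get_thread_num();   // library routine: this thread's id
        printf("Hello from thread %d of %d\n", tid, omp_get_num_threads());
    }
    return 0;
}
</pre>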
== Instructions for NVIDIA ==

'''How to set up the compiler and target offloading for Linux with a target NVIDIA GPU'''

Build Clang/LLVM 7.0.0 with OpenMP support: move the Clang and OpenMP sources into the LLVM source tree before building, and make sure your user is in the 'video' group so it can access the GPU device.

<pre>
$ mv cfe-7.0.0.src llvm-7.0.0.src/tools/clang
$ mv openmp-7.0.0.src llvm-7.0.0.src/projects/openmp
$ sudo usermod -a -G video $USER
</pre>
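Once the offloading-capable compiler is built, a small test program can confirm that code is actually executed on the GPU. This is a sketch, not part of the original instructions; the target triple ''nvptx64-nvidia-cuda'' is the one Clang commonly uses for NVIDIA offloading, but flags can differ between compiler versions:

<pre>
// File: checkOffload.c
// Build (flags may vary by Clang version):
//   clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda checkOffload.c -o checkOffload
#include <omp.h>
#include <stdio.h>

int main(void) {
    int on_host = 1;

    // omp_is_initial_device() returns non-zero on the host, 0 on a target device.
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }

    printf(on_host ? "Ran on the host - offloading NOT active\n"
                   : "Ran on the GPU - offloading works\n");
    return 0;
}
</pre>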
== Instructions for AMD ==

'''How to set up the compiler and target offloading for Linux on an AMD GPU''' (Elena)

Note: the user should be a member of the 'video' group; if this doesn't help, the user may also be added to the 'render' group.

[https://github.com/ROCm-Developer-Tools/aomp AOMP] is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.
To install the AOMP compiler on Ubuntu, see: https://github.com/ROCm-Developer-Tools/aomp/blob/master/docs/UBUNTUINSTALL.md

AOMP will install to /usr/lib/aomp. The AOMP environment variable will automatically be set to the install location. This may require a new terminal to be launched to see the change.

On Ubuntu 18.04 LTS (bionic beaver), run these commands:

<pre>
wget https://github.com/ROCm-Developer-Tools/aomp/releases/download/rel_11.11-2/aomp_Ubuntu1804_11.11-2_amd64.deb
sudo dpkg -i aomp_Ubuntu1804_11.11-2_amd64.deb
</pre>

'''Prerequisites''': AMD KFD driver. These commands are for supported Debian-based systems and target only the rock_dkms core component. More information can be found HERE.

<pre>
echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rock-dkms
sudo reboot
sudo usermod -a -G video $USER
</pre>

'''ALTERNATIVELY''': you may also decide to install the full ROCm (Radeon Open Compute) driver package before installing the AOMP package: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

More AOMP documentation: https://rocmdocs.amd.com/en/latest/Programming_Guides/aomp.html

'''Hello world compilation example:'''

<pre>
// File helloWorld.c
#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel
  {
    printf("Hello world!");
  }
}
</pre>

Make sure to export your new AOMP to the PATH:

<pre>
export AOMP="/usr/lib/aomp"
export PATH=$AOMP/bin:$PATH
clang -fopenmp helloWorld.c -o helloWorld
./helloWorld
</pre>
'''Hello world on GPU example:'''

<pre>
// File helloWorld.c
#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp target
  #pragma omp parallel
  {
    printf("Hello world from GPU! THREAD %d\n", omp_get_thread_num());
  }
}
</pre>

<pre>
export AOMP="/usr/lib/aomp"
export PATH=$AOMP/bin:$PATH
export LIBOMPTARGET_KERNEL_TRACE=1
clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx803 helloWorld.c -o helloWorld
./helloWorld
</pre>

To see the name of your device (for ''-march=gfx803'') you may run the 'rocminfo' tool:

<pre>
$ /opt/rocm/bin/rocminfo
</pre>

If you run into further problems with compiling and running, try starting with the examples: https://github.com/ROCm-Developer-Tools/aomp/tree/master/examples/openmp
== Programming GPUs with OpenMP ==

<h3>Target Region</h3>
* The target region is the offloading construct in OpenMP.
<pre>
int main() {
  // This code executes on the host (CPU)

  #pragma omp target
  {
    // This code executes on the device
  }
}
</pre>
* An OpenMP program will begin executing on the host (CPU).
* When a target region is encountered, the code within the target region will begin to execute on a device (GPU).

If no other construct is specified, for instance a construct to enable a parallelized region (''#pragma omp parallel''), the code within the target region will execute sequentially. The target region does not express parallelism; it only expresses where the contained code is going to be executed.

There is an implied synchronization between the host and the device at the end of a target region: the host thread waits for the target region to finish execution and then continues executing the next statements (see the sketch below).
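A small sketch (not from the original notes) of the implied synchronization; the ''map'' clause used here is explained in the next subsection:

<pre>
#include <stdio.h>

int main(void) {
    int x = 0;

    #pragma omp target map(tofrom: x)
    {
        x = 42;              // executes (sequentially) on the device
    }

    // The host thread only reaches this point after the device has finished,
    // so the value written on the device is visible here.
    printf("x = %d\n", x);   // prints 42
    return 0;
}
</pre>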
<h3>Mapping host and device data</h3>
* In order to access data inside the target region it must be mapped to the device.
* The host environment and device environment have separate memory.
* Data that has been mapped to the device from the host cannot be accessed by the host until the target region (device) has completed its execution.

The map clause controls how a variable is transferred between the host and the device over a target region:

''#pragma omp target map(map-type : list)''

* ''list'' specifies the data variables to be mapped from the host data environment to the target's device environment.
* ''map-type'' is one of the types '''to''', '''from''', '''tofrom''', or '''alloc'''.

'''to''' - copies the data to the device on entry.

'''from''' - copies the data to the host on exit.

'''tofrom''' - copies the data to the device on entry and back to the host on exit.

'''alloc''' - allocates an uninitialized copy on the device (without copying from the host environment).

<pre>
// Offloading to the target device, but still without parallelism.
#pragma omp target map(to:A,B), map(tofrom:sum)
{
  for (int i = 0; i < N; i++)
    sum += A[i] + B[i];
}
</pre>
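A complementary sketch (not from the original notes) of the '''from''' map-type: the result array only needs to travel device-to-host, so nothing is copied on entry:

<pre>
#include <stdio.h>
#define N 1000

int main(void) {
    int C[N];

    // C is written on the device and copied back to the host on exit;
    // no host-to-device copy happens for a 'from' mapping.
    #pragma omp target map(from: C)
    {
        for (int i = 0; i < N; i++)
            C[i] = i * i;
    }

    printf("C[10] = %d\n", C[10]); // 100, computed on the device
    return 0;
}
</pre>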
<h3>Dynamically allocated data</h3>
If we have dynamically allocated data in the host region that we'd like to map to the target region, then in the map clause we need to specify the number of elements to copy over. Otherwise all the compiler has is a pointer to some region in memory; it needs the size of the allocated memory that is to be mapped over to the target device.

<pre>
int* a = (int*)malloc(sizeof(int) * N);
#pragma omp target map(to: a[0:N]) // [start:length]
</pre>
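A fuller sketch (not from the original notes) of the same idea, using array sections for both an input and an output buffer:

<pre>
#include <stdlib.h>

int main(void) {
    const int N = 1 << 20;
    int* a = (int*)malloc(sizeof(int) * N);
    int* b = (int*)malloc(sizeof(int) * N);
    for (int i = 0; i < N; i++) a[i] = i;

    // a[0:N] is copied to the device, b[0:N] is copied back to the host.
    #pragma omp target map(to: a[0:N]) map(from: b[0:N])
    {
        for (int i = 0; i < N; i++)
            b[i] = 2 * a[i];
    }

    free(a);
    free(b);
    return 0;
}
</pre>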
<h3>Parallelism on the GPU</h3>
GPUs contain many streaming multiprocessors (SMs), each of which can run multiple threads.

OpenMP still allows us to use the traditional OpenMP constructs inside the target region to create and use threads on a device. However, a parallel region executing inside a target region will only execute on one streaming multiprocessor (SM). So parallelization will work, but it will be limited to a single SM, leaving most of the cores on the GPU idle.

Synchronization is only possible within a single SM; GPUs are not able to support a full threading model outside of a single streaming multiprocessor.

<pre>
// This will only execute on one streaming multiprocessor (SM).
// Threads are still created, but the iterations cannot be distributed across more SMs.
#pragma omp target map(to:A,B), map(tofrom:sum)
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
  sum += A[i] + B[i];
}
</pre>
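A quick way to see this limitation (a sketch, not from the original notes): query how many threads a plain parallel region gets on the device; all of them belong to a single team on one SM:

<pre>
#include <omp.h>
#include <stdio.h>

int main(void) {
    int nthreads = 0;

    #pragma omp target map(from: nthreads)
    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("The device parallel region ran with %d thread(s) in one team\n", nthreads);
    return 0;
}
</pre>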
<h3>Teams construct</h3>
In order to provide parallelization across the GPU's streaming multiprocessors, there is an additional construct known as the ''teams'' construct, which creates multiple master threads on the device.
[[File:Teams.JPG|thumb|upright=1.2|right|alt=OpenMP teams]]
Each master thread can spawn a team of its own threads within a parallel region, but threads from different teams cannot synchronize with threads outside of their own team.
[[File:Distribute.JPG|thumb|upright=1.2|right|alt=OpenMP distribute]]
<pre>
int main() {

  #pragma omp target   // Offload to the device
  #pragma omp teams    // Create teams of master threads
  #pragma omp parallel // Create a parallel region for each team
  {
    // Code to execute on the GPU
  }

}
</pre>
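A sketch (not from the original notes) showing the related clauses and runtime queries: ''num_teams'' and ''thread_limit'' request upper bounds on the league and team sizes, and the actual values can be inspected on the device:

<pre>
#include <omp.h>
#include <stdio.h>

int main(void) {
    int teams = 0, threads = 0;

    #pragma omp target map(from: teams, threads)
    #pragma omp teams num_teams(8) thread_limit(64)
    {
        if (omp_get_team_num() == 0) {
            teams = omp_get_num_teams();     // how many teams were actually created
            #pragma omp parallel
            #pragma omp single
            threads = omp_get_num_threads(); // threads inside one team
        }
    }

    printf("teams = %d, threads per team = %d\n", teams, threads);
    return 0;
}
</pre>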
<h3>Distribute construct</h3>
The ''distribute'' construct allows us to distribute iterations. This means that if we offload a parallel loop to the device, we will be able to distribute the iterations of the loop across all of the created teams, and across the threads within the teams.

It is similar to how the ''for'' construct works, but ''distribute'' assigns the iterations to different teams (streaming multiprocessors).

<pre>
// Distributes iterations to SMs, and across threads within each SM.
#pragma omp target teams distribute parallel for \
    map(to:A,B), map(tofrom:sum) reduction(+:sum)
for (int i = 0; i < N; i++) {
  sum += A[i] + B[i];
}
</pre>
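For nested loops, the ''collapse'' clause can be added to the combined construct so that the iterations of both loops are distributed together (a sketch, not from the original notes; assumes square N×N matrices stored in row-major order):

<pre>
// Offloads a nested loop; collapse(2) flattens the i/j iteration space
// before it is distributed across teams and threads.
#pragma omp target teams distribute parallel for collapse(2) \
    map(to: A[0:N*N], B[0:N*N]) map(from: C[0:N*N])
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i * N + j] = A[i * N + j] + B[i * N + j];
    }
}
</pre>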
<h3>Declare Target</h3>
''Calling functions within the scope of a target region.''

* The ''declare target'' construct will compile a version of a function that can be called on the device.
* In order to offload a function onto the target's device region, the function must first be declared on the target.

<pre>
#pragma omp declare target
int combine(int a, int b);
#pragma omp end declare target

#pragma omp target teams distribute parallel for \
    map(to: A, B), map(tofrom:sum) reduction(+:sum)
for (int i = 0; i < N; i++) {
  sum += combine(A[i], B[i]);
}
</pre>
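''Declare target'' can also be applied to global data so that device code can reference it (a sketch, not from the original notes):

<pre>
// Both the lookup table and the function that uses it get device versions.
#pragma omp declare target
static const int weights[2] = {3, 5};

int combine(int a, int b) {
    return weights[0] * a + weights[1] * b;
}
#pragma omp end declare target
</pre>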
== Results and Graphs (Nathan/Elena) ==

== Conclusions (Nathan/Elena/Yunseon) ==
== Sources ==
[https://www.pcmag.com/encyclopedia/term/core-i7 CPU picture]
 
https://rocmdocs.amd.com/en/latest/Programming_Guides/Programming-Guides.html?highlight=hip (HIP, OpenCL)