GPU621/To Be Announced

4,328 bytes added, 19:12, 3 December 2020
Sources
[[File:latestGpuSpecsAmd.jpg|900px|]]
 
[[File:latestGpuSpecsNvidia2.jpg|900px|]]
 
 
'''RX 6900 XT vs RTX 3090: Specifications''':
 
[[File:latestGpuSpecs.jpg]]
== Means of parallelisation on GPUs ==
https://stackoverflow.com/questions/7263193/opencl-vs-openmp-performance#7263823
== Instructions for NVIDIA ==
'''How to set up the compiler and target offloading for Linux with a target NVIDIA GPU'''
</pre>
Afterwards you will be able to compile an OpenMP application and offload it to a target region by running the Clang compiler with some additional flags to ensure offloading.
<pre>
$ clang -fopenmp -fopenmp-targets=nvptx64 -O2 foo.c
</pre>
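To check that a binary built with the command above really offloads, a small probe can ask the runtime where the target region executed. This is a minimal sketch: the function name <code>target_ran_on_host</code> is hypothetical, and the <code>_OPENMP</code> guard is only there so the file still builds when OpenMP is disabled. <code>omp_is_initial_device()</code> is the standard OpenMP runtime call that returns non-zero on the host.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if the target region ran on the host
   (no offloading happened), 0 if it ran on a device. */
int target_ran_on_host(void) {
    int on_host = 1; /* assume host until the target region says otherwise */
#ifdef _OPENMP
    #pragma omp target map(from: on_host)
    {
        /* Evaluated wherever the target region actually executes. */
        on_host = omp_is_initial_device();
    }
#endif
    return on_host;
}
```

Calling this from <code>main</code> and printing the result should show 0 when a GPU was used. A result of 1 suggests the runtime fell back to the host, for example because no device was found.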
== Instructions for AMD ==
'''How to set up compiler and target offloading for Linux on AMD GPU: (Elena)'''
AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples: https://github.com/ROCm-Developer-Tools/aomp
== Programming GPUs with OpenMP ==
<pre>
// Offloading to the target device, but still without parallelism.
#pragma omp target map(to:A,B), map(tofrom:sum)
{
for (int i = 0; i < N; i++)
sum += A[i] + B[i];
}
</pre>
 
<h3>Dynamically allocated data</h3>
If we have dynamically allocated data in the host region that we'd like to map to the target region, the map clause must specify the number of elements to copy over. Otherwise all the compiler has is a pointer to some region in memory, and it cannot know how much allocated memory needs to be mapped to the target device.
 
<pre>
int* a = (int*)malloc(sizeof(int) * N);
#pragma omp target map(to: a[0:N]) // [start:length]
</pre>
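A fuller version of the snippet above might look like the following sketch. The function name <code>double_and_sum</code> and the choice of <code>tofrom</code> are assumptions made for illustration: <code>tofrom</code> copies the array section to the device before the region and back to the host afterwards, so the host can read the modified values.

```c
#include <stdlib.h>

/* Hypothetical helper: fills a heap array with 0..n-1, doubles every
   element inside a target region, and returns the sum of the result.
   Returns -1 if the allocation fails. */
long double_and_sum(int n) {
    int *a = (int *)malloc(sizeof(int) * n);
    if (a == NULL) return -1;
    for (int i = 0; i < n; i++) a[i] = i;

    /* a[0:n] is an array section: start index 0, length n. */
    #pragma omp target map(tofrom: a[0:n])
    {
        for (int i = 0; i < n; i++) a[i] *= 2;
    }

    long sum = 0;
    for (int i = 0; i < n; i++) sum += a[i];
    free(a);
    return sum;
}
```

Without the <code>[0:n]</code> section, only the pointer value would be known to the compiler and the data it points at would not be transferred.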
 
<h3>Parallelism on the GPU</h3>
GPUs contain many streaming multiprocessors (SMs), each of which can run multiple threads.

OpenMP still allows us to use the traditional OpenMP constructs inside the target region to create and use threads on a device. However, a parallel region executing inside a target region will only execute on a single SM. Parallelization will work, but it leaves most of the cores on the GPU idle.

No synchronization is possible between SMs, since GPUs are not able to support a full threading model outside of a single SM.
 
<pre>
// This will only execute on one streaming multiprocessor (SM).
// Threads are still created, but the iterations cannot be distributed across more SMs.
 
#pragma omp target map(to:A,B), map(tofrom:sum)
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += A[i] + B[i];
}
</pre>
 
<h3>Teams construct</h3>
 
In order to provide parallelization within the GPU architectures there is an additional construct known as the ''teams'' construct, which creates multiple master threads on the device. [[File: Teams.JPG|thumb|upright=1.2|right|alt=OpenMP teams]]
Each master thread can spawn a team of its own threads within a parallel region. But threads from different teams cannot synchronize with other threads outside of their own team.
[[File: Distribute.JPG|thumb|upright=1.2|right|alt=OpenMP distribute]]
<pre>
int main() {
 
#pragma omp target // Offload to device
#pragma omp teams // Create teams of master threads
#pragma omp parallel // Create parallel region for each team
{
// Code to execute on GPU
}
 
}
</pre>
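One way to observe the teams that were created is to let the master thread of each team mark its own slot in an array. The sketch below assumes an upper bound of 8 teams (<code>MAX_TEAMS</code>) and uses a hypothetical helper name <code>count_teams</code>. <code>omp_get_team_num()</code> is the standard runtime call, and <code>num_teams</code> only sets an upper limit, so the runtime may create fewer teams.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_TEAMS 8 /* assumed upper bound, chosen for illustration */

/* Hypothetical helper: returns how many teams actually ran the region. */
int count_teams(void) {
    int seen[MAX_TEAMS] = {0};

    /* The region body is executed once per team, by that team's master thread. */
    #pragma omp target teams num_teams(MAX_TEAMS) map(tofrom: seen)
    {
        int team = 0; /* stays 0 when built without OpenMP */
#ifdef _OPENMP
        team = omp_get_team_num();
#endif
        seen[team] = 1; /* each team writes only its own slot */
    }

    int count = 0;
    for (int i = 0; i < MAX_TEAMS; i++) count += seen[i];
    return count;
}
```

Because every team writes a different element, no synchronization between teams is needed, which matches the restriction described above.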
 
<h3> Distribute construct </h3>
The ''distribute'' construct allows us to distribute iterations. This means if we offload a parallel loop to the device, we will be able to distribute the iterations of the loop across all of the created teams, and across the threads within the teams.
 
This is similar to how the ''for'' construct works, except that ''distribute'' assigns the iterations to different teams (streaming multiprocessors) rather than to threads within one team.
<pre>
// Distributes iterations to SMs, and across threads within each SM.
 
#pragma omp target teams distribute parallel for\
map(to: A,B), map(tofrom:sum) reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += A[i] + B[i];
}
</pre>
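The loop above can be wrapped into a self-contained sketch that also initializes its inputs. The array size <code>N</code>, the fill pattern, and the function name <code>run_vector_sum</code> are assumptions made for illustration only.

```c
#define N 1000 /* assumed problem size */

static int A[N], B[N];

/* Hypothetical helper: fills A and B, then sums A[i] + B[i] with the
   iterations distributed across teams and across threads in each team. */
long run_vector_sum(void) {
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    long sum = 0;
    #pragma omp target teams distribute parallel for \
        map(to: A, B) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += A[i] + B[i];
    }
    return sum;
}
```

The <code>reduction(+: sum)</code> clause gives each thread a private partial sum and combines them at the end, so the result should equal what a plain serial loop over the same data produces.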
''Calling functions within the scope of a target region.''
* The ''declare target'' construct will compile a version of a function that can be called on the device.
* In order to call a function inside a target region, the function must first be declared for the target device.
<pre>
#pragma omp declare target
int combine(int a, int b);
#pragma omp end declare target
 
#pragma omp target teams distribute parallel for \
map(to: A, B), map(tofrom:sum), reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += combine(A[i], B[i]);
}
</pre>
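The snippet above only declares <code>combine</code>. A complete sketch with a definition could look like this, assuming that <code>combine</code> simply adds its arguments; the array size and the wrapper name <code>combined_sum</code> are likewise illustrative.

```c
#define N 256 /* assumed problem size */

#pragma omp declare target
/* Placed inside declare target, so a device version is compiled too.
   The body is an assumption made for this example. */
int combine(int a, int b) {
    return a + b;
}
#pragma omp end declare target

/* Hypothetical wrapper: sums combine(A[i], B[i]) on the target device. */
long combined_sum(void) {
    int A[N], B[N];
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 1;
    }

    long sum = 0;
    #pragma omp target teams distribute parallel for \
        map(to: A, B) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += combine(A[i], B[i]);
    }
    return sum;
}
```

If the function were defined outside the <code>declare target</code> region, an offloading compiler would typically fail to find a device version of it when building the target region.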
== Code for tests (Nathan) ==

== Results and Graphs (Nathan/Elena) ==
https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs
 
https://www.ibm.com/support/knowledgecenter/en/SSXVZZ_16.1.0/com.ibm.xlcpp161.lelinux.doc/compiler_ref/prag_omp_teams.html
 
https://www.ibm.com/support/knowledgecenter/en/SSXVZZ_16.1.0/com.ibm.xlcpp161.lelinux.doc/compiler_ref/prag_omp_distribute.html
 
https://www.ibm.com/support/knowledgecenter/en/SSXVZZ_16.1.0/com.ibm.xlcpp161.lelinux.doc/compiler_ref/prag_omp_dis_pfor.html
http://www.nvidia.com/en-us/geforce/graphics-cards/30-series/
https://www.pcmag.com/encyclopedia/term/core-i7 AMD RX-580 GPU architecture
 
https://premiumbuilds.com/comparisons/rx-6900-xt-vs-rtx-3090/ -> comparison of flagship GPUs, 2020
[http://www.nvidia.com/en-us/geforce/graphics-cards/30-series/ nvidia]
[https://www.pcmag.com/encyclopedia/term/core-i7 CPU picture]
 
https://rocmdocs.amd.com/en/latest/Programming_Guides/Programming-Guides.html?highlight=hip <- HIP, openCl