DPS921/OpenACC vs OpenMP Comparison

- Nov 18, 2020: Successful installation of required compiler and compilation of OpenACC code
- Nov 19, 2020: Adding MPI into discussion
 
== Important notes (prof's feedback) ==
A limitation of GPU computation is that the GPU can only handle float (single) precision. When calculating in double precision, values need to be broken into single-precision values and then combined back into double precision. This is one of the primary reasons why GPUs are not used in many scientific computations.

Before using a GPU to speed up a computation, make sure you know the level of precision required for the results and the intermediate results. For AI/CV/ML workloads, where the precision requirement is low, it is safe to use a GPU to speed up your computation.
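As a small illustration of the precision concern (this example is not part of the project code), the following program sums the same series in <code>float</code> and in <code>double</code> and shows how far the single-precision result drifts:
<source>
#include <stdio.h>

int main(void) {
    float  fsum = 0.0f;
    double dsum = 0.0;

    // Accumulate a series whose small terms get lost against the
    // running sum once float's ~7 significant digits are exhausted.
    for (int i = 1; i <= 10000000; i++) {
        fsum += 1.0f / i;
        dsum += 1.0  / i;
    }

    printf("float : %.7f\n", fsum);   // noticeably smaller than the double result
    printf("double: %.7f\n", dsum);
    return 0;
}
</source>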
= OpenACC =
For example, in the following piece of code, the <code>kernels</code> directive tells the compiler that it is up to the compiler to decide how to parallelize the loop that follows:
<source>
// The compiler analyzes the loop and chooses how to parallelize it.
#pragma acc kernels
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
</source>
'''Copy vs. PCopy'''
<source>
OpenACC                                      OpenMP
int x[10],y[10];                             int x[10],y[10];
#pragma acc data copy(x) pcopy(y)            #pragma omp target data map(x,y)
{                                            {
    ...                                          ...
    #pragma acc kernels copy(x) pcopy(y)         #pragma omp target update to(x)
    {                                            #pragma omp target map(y)
        // Accelerator Code                      {
        ...                                          // Accelerator Code
    }                                                ...
}                                                }
                                             }
</source>
<code>copy</code> always transfers the listed data, while <code>pcopy</code> (present_or_copy) transfers it only if it is not already present on the accelerator.

=== OpenMP CPU Implementation ===
For comparison, this is the OpenMP implementation that runs on the CPU.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma omp parallel shared(nx, Anew, A)
    {
        #pragma omp for reduction(max:err)
        for ( int i = 1; i < nx-1; i++ ) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            err = fmax(err, fabs(Anew[i] - A[i]));
        }
        #pragma omp for
        for ( int i = 1; i < nx-1; i++ ) {
            A[i] = Anew[i];
        }
    }
    iter++;
}
</source>

=== OpenACC Basic Implementation ===
A basic OpenACC implementation looks like this.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc parallel loop reduction(max:err)
    for ( int i = 1; i < nx-1; i++ ) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc parallel loop
    for ( int i = 1; i < nx-1; i++ ) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
Or you can let the compiler handle it by using <code>kernels</code> instead of <code>parallel loop</code>. You will be notified during compilation how the compiler thinks this should be parallelized.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc kernels
    for ( int i = 1; i < nx-1; i++ ) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc kernels
    for ( int i = 1; i < nx-1; i++ ) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
=== OpenACC Proper Implementation ===
The above implementation is actually slower than the serial version; that is because data is transferred between the host and the accelerator at the end of each iteration. In order to prevent that from happening, we need to copy the data into the accelerator's memory once, and copy it out when done.
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc kernels
    for ( int i = 1; i < nx-1; i++ ) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc kernels
    for ( int i = 1; i < nx-1; i++ ) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
In the above code, we added <code>copyin(list)</code> for the original matrix of values and <code>copyout(list)</code> for the computed matrix of results. There are other related clauses, such as <code>copy(list)</code>, which is the combination of <code>copyin(list)</code> and <code>copyout(list)</code>; <code>create(list)</code>, which creates a memory region on the accelerator without copying any data into it; and <code>present(list)</code>, which indicates that the data is already on the accelerator and is often used along with <code>create(list)</code>.
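To make <code>create(list)</code> and <code>present(list)</code> concrete, here is a small sketch reusing the same <code>A</code>, <code>Anew</code> and <code>nx</code> as above (not code from the original project): <code>create</code> allocates <code>Anew</code> on the accelerator without transferring it, and <code>present</code> asserts that both arrays are already there.
<source>
#pragma acc data copyin(A[0:nx]) create(Anew[0:nx])
{
    // Both arrays are already on the accelerator, so no extra transfer happens here.
    #pragma acc kernels present(A[0:nx], Anew[0:nx])
    for ( int i = 1; i < nx-1; i++ ) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
    }
}
</source>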
 
=== OpenMP GPU Basic Implementation ===
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma omp target map(tofrom: err, A[0:nx], Anew[0:nx])
    #pragma omp parallel shared(nx, Anew, A)
    {
        #pragma omp for reduction(max:err)
        for ( int i = 1; i < nx-1; i++ ) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            err = fmax(err, fabs(Anew[i] - A[i]));
        }
        #pragma omp for
        for ( int i = 1; i < nx-1; i++ ) {
            A[i] = Anew[i];
        }
    }
    iter++;
}
</source>
Because the <code>target</code> region is entered on every pass of the while loop, the arrays are transferred to and from the GPU on every iteration.
 
=== OpenMP GPU Proper Implementation ===
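By analogy with the OpenACC proper version, the data transfers can be hoisted out of the loop with <code>target data</code>. The following is a minimal sketch assuming the same <code>A</code>, <code>Anew</code>, <code>nx</code>, <code>err</code>, <code>tol</code> and <code>iter_max</code> as above; it is not the exact code used for the measurements below.
<source>
// Keep A and Anew resident on the GPU for the whole computation;
// only the scalar err moves back each iteration for the convergence check.
#pragma omp target data map(tofrom: A[0:nx]) map(alloc: Anew[0:nx])
{
    while ( err > tol && iter < iter_max ) {
        err = 0.0f;
        #pragma omp target map(tofrom: err)
        #pragma omp parallel shared(nx, Anew, A)
        {
            #pragma omp for reduction(max:err)
            for ( int i = 1; i < nx-1; i++ ) {
                Anew[i] = 0.5f * (A[i+1] + A[i-1]);
                err = fmax(err, fabs(Anew[i] - A[i]));
            }
            #pragma omp for
            for ( int i = 1; i < nx-1; i++ ) {
                A[i] = Anew[i];
            }
        }
        iter++;
    }
}
</source>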
=== Execution time ===
Without access to a GPU, the OpenACC code runs about twice as slow as the OpenMP one when compiled with the Nvidia HPC SDK <code>nvc</code> compiler. According to other data sources, with access to a GPU the approximate relative execution times (OpenMP CPU = 1x) are:
* OpenMP CPU: 1x
* OpenACC Basic: ~2x (about twice as slow)
* OpenACC Proper: ~0.14x (about 7 times faster)
* OpenMP GPU Basic: ~10x (about 10 times slower)
* OpenMP GPU Proper: ~0.21x (about 5 times faster)
In other words, the proper OpenACC version should run about 7 times faster than the OpenMP solution that runs on the CPU, and about 2 times faster than an OpenMP version with GPU offloading.
= Collaboration =
When OpenMP and OpenACC work together, it is usually one CPU with several accelerators, since that is how OpenMP works. When there are multiple CPUs and each has access to multiple accelerators, OpenMP alone is not enough, and we can introduce MPI.
As we have learned, MPI is used to allow communication and data transfer between processes during parallel execution. In the case of multiple accelerators, one way to use the two together is to use MPI to communicate between the different accelerators.
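As a rough illustration of this pattern (a sketch, not code from the project), each MPI rank can bind itself to one GPU through the OpenACC runtime API and then combine its per-accelerator result with the other ranks; the computation itself is a placeholder:
<source>
#include <mpi.h>
#include <openacc.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each MPI rank to one of the GPUs on its node.
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    // Each rank computes a partial sum on its own accelerator ...
    double local = 0.0, total = 0.0;
    #pragma acc parallel loop reduction(+:local)
    for (int i = rank; i < 1000000; i += size) {
        local += 1.0 / (1.0 + i);
    }

    // ... and MPI combines the per-accelerator results.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
</source>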
 
The following is a screenshot taken from Nvidia's Advanced OpenACC lecture, showing how MPI works with OpenACC.
[[File: Mpiopenacc.png|800px]]
 
= References =
https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/
 
https://developer.nvidia.com/hpc-sdk
 
https://gcc.gnu.org/wiki/OpenACC
 
https://on-demand.gputechconf.com/gtc/2015/webinar/openacc-course/advanced-openacc-techniques.pdf
 
https://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf
 
https://on-demand.gputechconf.com/gtc/2016/webinar/openacc-course/Advanced-OpenACC-Course-Lecture2--Multi-GPU-20160602.pdf