DPS921/OpenACC vs OpenMP Comparison

== Progress ==
- Nov 9, 2020: Added project description
- Nov 13, 2020: Determine content sections to be discussed
- Nov 15, 2020: Investigate OpenACC learning material and tools
- Nov 18, 2020: Successful installation of required compiler and compilation of OpenACC code
- Nov 19, 2020: Adding MPI into discussion
- Dec 7, 2020: Added professor's comments on GPU computation
 
== Important notes (prof's feedback) ==
A limitation of GPU computation is that most GPUs are optimized for single-precision (float) arithmetic; double-precision throughput is much lower, and on some hardware double-precision values have to be emulated by breaking them into single-precision components and recombining the results. This is one of the main reasons GPUs are not always suitable for scientific computations that require double precision.

Before using a GPU to speed up a computation, make sure you know the level of precision required for both the results and the intermediate values. For AI/CV/ML workloads, where the precision requirement is low, it is generally safe to use a GPU to speed up the computation.
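To see why the precision level matters, here is a small host-side illustration (a minimal sketch, not GPU code): accumulating the same value in single and double precision drifts apart noticeably.
<source>
#include <stdio.h>

int main(void) {
    const int n = 10000000;          // add 0.1 ten million times; the exact answer is 1,000,000
    float  sum_f = 0.0f;
    double sum_d = 0.0;
    for (int i = 0; i < n; i++) {
        sum_f += 0.1f;               // float: rounding error accumulates quickly
        sum_d += 0.1;                // double: error stays negligible at this scale
    }
    printf("float  sum: %f\n", sum_f);   // noticeably off from 1000000
    printf("double sum: %f\n", sum_d);   // very close to 1000000
    return 0;
}
</source>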
= OpenACC =
OpenACC is a directive-based programming model for offloading compute-intensive loops to accelerators such as GPUs. For example:
<source>
#pragma acc kernels
{
    for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }
}
</source>
In this piece of code, the <code>kernels</code> directive tells the compiler that it is free to decide how to parallelize the enclosed loop and offload it to the accelerator.
<source>
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    y[i] = a * x[i] + y[i];
}
</source>
Alternatively, if you do not want the compiler to handle everything for you, you can use <code>parallel loop</code> to state explicitly that this loop should be parallelized. However, you as the programmer need to be certain of what you are doing, since this takes away some of the compiler's freedom to decide how to parallelize the code for you.
== Compiler support ==
OpenACC is supported mainly by the NVIDIA HPC SDK (formerly PGI) compilers, for example <code>nvc -acc -Minfo=accel</code>, which also report how each region was parallelized; GCC provides partial OpenACC support through the <code>-fopenacc</code> flag.
= OpenMP vs OpenACC =
 
[[File:Openaccvsopenmp.png|800px]]
We are comparing OpenACC with OpenMP for two reasons. First, OpenMP is also based on directives to parallelize code; second, OpenMP began supporting offloading to accelerators in OpenMP 4.0 through its <code>target</code> constructs. OpenACC uses directives to tell the compiler where to parallelize loops and how to manage data between host and accelerator memories. OpenMP takes a more generic approach: it allows programmers to explicitly spread the execution of loops, code regions and tasks across teams of threads.
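As a quick illustration of the two styles, here is a minimal sketch reusing the SAXPY-style loop from above (<code>N</code>, <code>a</code>, <code>x</code> and <code>y</code> are assumed to be defined as before); the same loop can be offloaded either way:
<source>
// OpenACC: let the compiler decide how to parallelize and offload the region
#pragma acc kernels
for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }

// OpenMP 4.0+: offload to the device, then create teams of threads
// and distribute the iterations across them
#pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }
</source>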
== Code comparison: Equivalent directives ==
'''Explicit conversions'''
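For explicit, one-directional data movement, OpenACC's <code>copyin</code> and <code>copyout</code> clauses correspond roughly to OpenMP's <code>map(to:)</code> and <code>map(from:)</code> clauses. A minimal sketch of the correspondence (the arrays <code>x</code> and <code>y</code> are illustrative):
<source>
// OpenACC
int x[10], y[10];
#pragma acc data copyin(x) copyout(y)
{
    // Accelerator code reads x and writes y
}

// OpenMP equivalent
int x[10], y[10];
#pragma omp target data map(to:x) map(from:y)
{
    // Accelerator code reads x and writes y
}
</source>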
'''Copy vs. PCopy'''

OpenACC's <code>pcopy</code> (short for <code>present_or_copy</code>) clause copies data only if it is not already present on the accelerator; the closest OpenMP behaviour is a <code>map</code> on an enclosing <code>target data</code> region, combined with <code>target update</code> for forced transfers.
<source>
// OpenACC
int x[10], y[10];
#pragma acc data copy(x) pcopy(y)
{
    // ...
    #pragma acc kernels copy(x) pcopy(y)
    {
        // Accelerator code
    }
}

// OpenMP equivalent
int x[10], y[10];
#pragma omp target data map(x, y)
{
    // ...
    #pragma omp target update to(x)
    #pragma omp target map(y)
    {
        // Accelerator code
    }
}
</source>
== Jacobi Iteration ==
Jacobi iteration is a common iterative algorithm in which each value is repeatedly recomputed from its neighbours' values until the solution converges. The following code sample is a serial version of a 1D Jacobi iteration.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
=== OpenMP CPU Implementation ===
An OpenMP implementation would look like the following, with shared data and a max reduction on the error value:
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma omp parallel shared(nx, Anew, A)
    {
        #pragma omp for reduction(max:err)
        for (int i = 1; i < nx-1; i++) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            err = fmax(err, fabs(Anew[i] - A[i]));
        }
        #pragma omp for
        for (int i = 1; i < nx-1; i++) {
            A[i] = Anew[i];
        }
    }
    iter++;
}
</source>
 
=== OpenACC Basic Implementation ===
A basic OpenACC implementation looks like this.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc parallel loop reduction(max:err)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc parallel loop
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
Or you can let the compiler handle it by using <code>kernels</code> instead of <code>parallel loop</code>. During compilation, the compiler will report how it decided to parallelize each loop.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
=== OpenACC Proper Implementation ===
The above implementations are actually slower than the serial version, because data is transferred between host and accelerator in every iteration. To prevent that from happening, we copy the data into the accelerator's memory once before the loop and copy it back out when done.
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
In the above code, we added a <code>copyin(list)</code> clause for the original array of values and a <code>copyout(list)</code> clause for the computed array of results. There are other related clauses: <code>copy(list)</code>, which is the combination of <code>copyin(list)</code> and <code>copyout(list)</code>; <code>create(list)</code>, which allocates a memory region on the accelerator without copying any data into it; and <code>present(list)</code>, which indicates that the data is already on the accelerator and is often used together with <code>create(list)</code>.
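For instance, here is a minimal sketch (reusing the same <code>A</code>, <code>Anew</code> and <code>nx</code> from the example above) that allocates <code>Anew</code> only on the accelerator with <code>create</code> and asserts the presence of both arrays inside the kernels regions:
<source>
#pragma acc data copy(A[0:nx]) create(Anew[0:nx])
{
    // Anew lives only in accelerator memory; it is never transferred to or from the host
    #pragma acc kernels present(A[0:nx], Anew[0:nx])
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
    }
    #pragma acc kernels present(A[0:nx], Anew[0:nx])
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
}
</source>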
 
 
=== OpenMP GPU Basic Implementation ===
Here is the basic OpenMP GPU implementation. Almost everything is the same; we just need to enclose the parallel region in an <code>omp target</code> region (and map the error value back so the host can see it).
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma omp target map(tofrom:err)
    #pragma omp parallel shared(nx, Anew, A)
    {
        #pragma omp for reduction(max:err)
        for (int i = 1; i < nx-1; i++) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            err = fmax(err, fabs(Anew[i] - A[i]));
        }
        #pragma omp for
        for (int i = 1; i < nx-1; i++) {
            A[i] = Anew[i];
        }
    }
    iter++;
}
</source>
 
=== OpenMP GPU Proper Implementation ===
Similar to OpenACC, the basic version is slow because of data transfers in every iteration. To optimize the OpenMP version, you need to keep the data on the device and explicitly tell the compiler how to create teams of threads and distribute the loop iterations across them.
<source>
#pragma omp target data map(alloc:Anew[0:nx]) map(A[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    #pragma omp target teams distribute parallel for reduction(max:err) map(tofrom:err) schedule(static,1)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    #pragma omp target teams distribute parallel for schedule(static,1)
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
 
=== Execution time ===

Approximate execution times, normalized to the OpenMP CPU version:
* OpenMP CPU: 1x (baseline)
* OpenACC Basic: ~2x (twice as slow)
* OpenACC Proper: ~0.14x (7 times faster)
* OpenMP GPU Basic: ~10x (10 times slower)
* OpenMP GPU Proper: ~0.21x (5 times faster)
= Collaboration =
== OpenACC with OpenMP ==
OpenMP and OpenACC can be used together. However, PGI has stated that there are still some issues when interoperating between OpenMP and OpenACC [https://pgroup.com/resources/openacc_faq.htm], since their runtime library is not completely thread-safe; they are looking forward to improving the interaction between the two libraries in future releases.

Using the example above, we can easily come up with something like
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
    #pragma omp parallel for shared(nx, Anew, A)
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
    #pragma omp parallel for shared(nx, Anew, A)
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
This way, for each thread created by OpenMP, the computation is offloaded to an accelerator, with the results joined back together. Combining OpenACC and OpenMP may be overkill for this 1D example; a 2D example may be a better fit.
<source>
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
    error = 0.f;

    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop gang(32), vector(16)
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i] - A[j][i]));
        }
    }

    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
</source>
Here we can insert additional instructions on the inner loop about how many gangs and vectors to use. <code>Gang</code>, <code>worker</code> and <code>vector</code> are OpenACC terminologies: a <code>vector</code> is one thread performing a single operation on multiple data (SIMD), a <code>worker</code> computes one <code>vector</code>, and a <code>gang</code> comprises multiple workers that share resources.

[[File: Gangworkervector.png]]
== OpenACC with MPI ==
When OpenMP and OpenACC work together, it is usually one CPU host with several accelerators, as that is how OpenMP works. When there are multiple CPUs, each with access to multiple accelerators, OpenMP alone is not enough, and we can introduce MPI.
As we have learned, MPI is used to allow communication and data transfer between processes during parallel execution. In the case of multiple accelerators, one way to use the two together is to use MPI to communicate between the different accelerators.
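A minimal sketch of that idea for the 1D Jacobi example, assuming one MPI rank per NVIDIA GPU (<code>acc_device_nvidia</code> is the NVIDIA-specific device type from <code>openacc.h</code>; the halo exchange of boundary values between neighbouring ranks is omitted for brevity):
<source>
#include <mpi.h>
#include <openacc.h>
#include <math.h>

void jacobi_1d(float *A, float *Anew, int nx, float tol, int iter_max) {
    int rank, ngpus;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind each MPI rank to its own accelerator
    ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);

    float err = tol + 1.0f;
    int iter = 0;

    #pragma acc data copy(A[0:nx]) create(Anew[0:nx])
    while (err > tol && iter < iter_max) {
        float local_err = 0.0f;

        #pragma acc kernels
        for (int i = 1; i < nx-1; i++) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            local_err = fmaxf(local_err, fabsf(Anew[i] - A[i]));
        }
        #pragma acc kernels
        for (int i = 1; i < nx-1; i++) {
            A[i] = Anew[i];
        }

        // Each rank works on its own slice; agree on the global error
        // so that all ranks stop at the same iteration
        MPI_Allreduce(&local_err, &err, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);
        iter++;
    }
}
</source>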
 
The following is a screenshot taken from Nvidia's Advanced OpenACC lecture, showing how MPI works with OpenACC.
[[File: Mpiopenacc.png|800px]]
 
= References =
https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/
 
https://developer.nvidia.com/hpc-sdk
 
https://gcc.gnu.org/wiki/OpenACC
 
https://on-demand.gputechconf.com/gtc/2015/webinar/openacc-course/advanced-openacc-techniques.pdf
 
https://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf
 
https://on-demand.gputechconf.com/gtc/2016/webinar/openacc-course/Advanced-OpenACC-Course-Lecture2--Multi-GPU-20160602.pdf