DPS921/OpenACC vs OpenMP Comparison
<source>
#pragma acc kernels
{
    for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }
}
</source>
For example, in this piece of code the <code>kernels</code> directive tells the compiler that it is free to decide how best to parallelize the loop that follows.
<source>
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    y[i] = a * x[i] + y[i];
}
</source>
Alternatively, if you do not want the compiler to handle everything for you, <code>parallel loop</code> explicitly marks this loop as one to parallelize. However, you as the programmer need to be certain the loop is safe to parallelize, since this directive takes away some of the compiler's freedom to make that decision for you.
= OpenMP vs OpenACC =
 
[[File:Openaccvsopenmp.png|800px]]
We are comparing OpenACC with OpenMP for two reasons. First, OpenMP is likewise based on directives for parallelizing code; second, OpenMP gained support for offloading to accelerators in OpenMP 4.0 through its <code>target</code> constructs. OpenACC uses directives to tell the compiler where to parallelize loops, and how to manage data between host and accelerator memories. OpenMP takes a more generic approach: it allows programmers to explicitly spread the execution of loops, code regions, and tasks across teams of threads.
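As a minimal sketch (not from the original page), the SAXPY loop above could be offloaded with OpenMP's <code>target</code> constructs, assuming <code>x</code>, <code>y</code>, <code>a</code>, and <code>N</code> are defined as in the earlier examples:
<source>
// OpenMP 4.0+ accelerator offload: run the loop on the device,
// mapping x in and y both ways, and splitting iterations across teams of threads
#pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
for (int i = 0; i < N; i++) {
    y[i] = a * x[i] + y[i];
}
</source>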
== Code comparison: equivalent directives ==
'''Explicit conversions'''
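As a minimal sketch of one such explicit conversion (reusing the SAXPY loop from above), the same loop can be written with each model's explicit loop directive:
<source>
// OpenMP: explicitly parallelize the loop across CPU threads
#pragma omp parallel for
for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }

// OpenACC: explicitly parallelize the loop, here handing it to the accelerator
#pragma acc parallel loop
for (int i = 0; i < N; i++) { y[i] = a * x[i] + y[i]; }
</source>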
== Jacobi Iteration ==
Jacobi iteration is a common algorithm that iteratively computes a solution by repeatedly recomputing each point from its neighbours' values until the result converges. The following code sample is a serial version of a 1D Jacobi iteration.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    // compute each new value from its neighbours and track the largest change
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    // copy the new values back into A for the next sweep
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }

    iter++;
}
</source>
 
=== OpenMP Implementation ===
An OpenMP implementation would look like the following, with shared data and a <code>max</code> reduction on the error value.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    // parallelize the update sweep across CPU threads,
    // combining each thread's local error with a max reduction
    #pragma omp parallel for shared(nx, Anew, A) reduction(max:err)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    // parallelize the copy-back sweep as well
    #pragma omp parallel for shared(nx, Anew, A)
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }

    iter++;
}
</source>
 
=== OpenACC Basic Implementation ===
A proper OpenACC implementation looks like the following. The <code>data</code> directive, placed outside the <code>while</code> loop, copies the arrays between host and accelerator memory once instead of transferring them on every iteration.
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    // parallelize the update sweep on the accelerator,
    // with a max reduction on the error
    #pragma acc parallel loop reduction(max:err)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    // parallelize the copy-back sweep
    #pragma acc parallel loop
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }

    iter++;
}
</source>
 
Alternatively, you can let the compiler handle it by using <code>kernels</code> instead of <code>parallel loop</code>. During compilation you will be notified of how the compiler decided the loops should be parallelized.
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    // let the compiler decide how to parallelize the update sweep
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    // let the compiler decide how to parallelize the copy-back sweep
    #pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }

    iter++;
}
</source>
 
=== Execution time ===
Without access to a GPU, the OpenACC code runs about twice as fast as the OpenMP one when both are built with the Nvidia HPC SDK <code>nvc</code> compiler. According to other data sources, with access to a GPU, the OpenACC version should run about 7 times faster than an OpenMP solution running on the CPU, and about 2.5 times faster than an OpenMP version with GPU offloading.
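For reference, builds along these lines could look like the following sketch. The file name <code>jacobi.c</code> is a placeholder, and the flags shown are the standard <code>nvc</code> switches for enabling each model.
<source>
# assumed build commands for the Nvidia HPC SDK nvc compiler;
# jacobi.c is a hypothetical file containing the code above
nvc -acc -Minfo=accel -o jacobi_acc jacobi.c   # OpenACC build; -Minfo=accel prints how loops were parallelized
nvc -mp -o jacobi_omp jacobi.c                 # OpenMP build for the CPU
nvc -mp=gpu -o jacobi_omp_gpu jacobi.c         # OpenMP build with GPU offloading
</source>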