DPS921/OpenACC vs OpenMP Comparison

Project Overview

The idea of this project is to introduce OpenACC as a parallel processing library, compare how parallelization is done in different libraries, and identify the benefits of using each library. According to the descriptions of both libraries, OpenACC does parallelization more automatically, whereas OpenMP allows developers to manually set regions to parallelize and assign them to threads. The deliverable of this project would be an introduction to OpenACC, along with a performance comparison between OpenMP and OpenACC and a discussion on the usage of each one.

Group Members

1. Ruiqi Yu

2. Hanlin Li

3. Le Minh Pham

Progress

OpenACC

What is OpenACC

OpenACC (Open Accelerators) is a programming standard for parallel computing on accelerators such as GPUs, mainly targeting Nvidia GPUs. OpenACC is designed to simplify GPU programming: unlike CUDA and OpenCL, where you need to rewrite your program in a different way to achieve GPU acceleration, OpenACC takes an approach similar to OpenMP, inserting directives into the code to offload computation onto GPUs and parallelize the code at the CUDA core level. It is possible for programmers to create efficient parallel OpenACC code with only minor changes to serial CPU code.

Example

#pragma acc kernels
{
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }
}

GPU offloading

[image]

Installation

Originally, OpenACC compilation was supported only by the PGI compiler, which required a subscription; new options have become available in recent years.

Nvidia HPC SDK (https://developer.nvidia.com/hpc-sdk)

It evolved from the PGI Compiler community edition. An installation guide is provided on the official website. Currently the SDK only supports Linux systems, with Windows support expected soon.

wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9_20.9_amd64.deb \

  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-2020_20.9_amd64.deb

sudo apt-get install ./nvhpc-20-9_20.9_amd64.deb ./nvhpc-2020_20.9_amd64.deb

After installation, the compilers can be found under /opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin, and OpenACC code can be compiled with nvc -acc -gpu=managed demo.c, where -acc indicates that the code includes OpenACC directives, and -gpu=managed indicates how memory should be managed (here, using CUDA managed/unified memory). nvc is used here because the source code is C; there is also nvc++ for compiling C++ code.
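
The install directory above is not added to the PATH automatically; one way to expose the compilers for the current shell session (a minimal sketch, assuming the default install prefix shown above) is:

export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin:$PATH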

The compiler can also report how the parallel regions are generated if you pass the -Minfo option:

$ nvc -acc -gpu=managed -Minfo demo.c
main:
     79, Generating implicit copyin(A[:256][:256]) [if not already present]
         Generating implicit copy(_error) [if not already present]
         Generating implicit copyout(Anew[1:254][1:254]) [if not already present]
     83, Loop is parallelizable
     85, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         83, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:_error)
         85, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     91, Generating implicit copyout(A[1:254][1:254]) [if not already present]
         Generating implicit copyin(Anew[1:254][1:254]) [if not already present]
     95, Loop is parallelizable
     97, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         95, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         97, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

This output shows which loops are parallelized, with source line numbers for reference.
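
For context, a report like the one above is typical of a Jacobi-style stencil solver. The following is a rough sketch of the kind of loop nest that produces it; the array names and 256x256 size are borrowed from the listing above, but this is not the actual demo.c used for the output:

#include <math.h>
#define N 256

float A[N][N], Anew[N][N];

/* Hypothetical Jacobi sweep resembling the compiler report above:
   both loop nests are parallelizable, and the compiler detects an
   implicit max reduction on the error variable. */
float jacobi_step(void) {
    float error = 0.0f;
    #pragma acc kernels
    {
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < N - 1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }
        }
        for (int j = 1; j < N - 1; j++) {   /* copy the new values back into A */
            for (int i = 1; i < N - 1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }
    return error;
}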

For Windows users who would like to try this SDK, WSL2 is one way to go.

GCC (https://gcc.gnu.org/wiki/OpenACC)

The latest GCC version, GCC 10, has support for OpenACC 2.6.
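
Assuming a GCC build configured with accelerator offloading, OpenACC directives are enabled with the -fopenacc flag; a minimal sketch of the compile command:

gcc -fopenacc -O2 demo.c -o demo

The actual offload target (for example nvptx) depends on how GCC itself was configured at build time.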


OpenMP vs OpenACC

We are comparing with OpenMP for two reasons. First, OpenMP is also based on directives to parallelize code; second, OpenMP started supporting offloading to accelerators in OpenMP 4.0 via target constructs. OpenACC uses directives to tell the compiler where to parallelize loops and how to manage data between host and accelerator memories. OpenMP takes a more generic approach: it allows programmers to explicitly spread the execution of loops, code regions and tasks across teams of threads.

OpenMP's directives tell the compiler to generate parallel code in that specific way, leaving little room to the discretion of the compiler and the optimizer. The compiler must do as instructed. It is up to the programmer to guarantee that the generated code is correct; parallelization and scheduling are also the responsibility of the programmer, not of the compiler or the runtime.

OpenACC's parallel directives tell the compiler only that the loop is a parallel loop. It is up to the compiler to decide how to parallelize it. For example, the compiler can generate code to run the iterations across threads, or run the iterations across SIMD lanes. The compiler gets to decide the method of parallelization based on the underlying hardware architecture, or use a mixture of different methods.

So the real difference between the two is how much freedom is given to the compiler.

Code comparison

Explicit conversions

OpenACC                                 OpenMP

#pragma acc kernels                     #pragma omp target			
{                                       {
    #pragma acc loop worker                 #pragma omp parallel for private(tmp)
    for(int i = 0; i < N; i++){             for(int i = 0; i < N; i++){
        tmp = …;                                tmp = …;
        array[i] = tmp * …;                     array[i] = tmp * …;
    }                                       }
    #pragma acc loop vector                 #pragma omp simd
    for(int i = 0; i < N; i++)                  for(int i = 0; i < N; i++)
        array2[i] = …;                              array2[i] = …;
}                                       }

ACC parallel

OpenACC                                 OpenMP

#pragma acc parallel                    #pragma omp target
{                                       #pragma omp parallel
    #pragma acc loop                    {
    for(int i = 0; i < N; i++){             #pragma omp for private(tmp) nowait
        tmp = …;                            for(int i = 0; i < N; i++){
        array[i] = tmp * …;                     tmp = …;			
    }                                           array[i] = tmp * …;
    #pragma acc loop                        }
    for(int i = 0; i < N; i++)              #pragma omp for simd
        array2[i] = …;                      for(int i = 0; i < N; i++)
}                                               array2[i] = …;
                                        }

ACC Kernels

OpenACC                                 OpenMP

#pragma acc kernels                     #pragma omp target
{                                       #pragma omp parallel
    for(int i = 0; i < N; i++){         {	
        tmp = …;                            #pragma omp for private(tmp)
        array[i] = tmp * …;                 for(int i = 0; i < N; i++){
    }                                           tmp = …; 
    for(int i = 0; i < N; i++)                  array[i] = tmp * …;
        array2[i] = …                       }	
}                                           #pragma omp for simd
                                            for(int i = 0; i < N; i++)
                                                array2[i] = …
                                        }

Copy vs. PCopy

OpenACC                                     OpenMP

int x[10],y[10];                            int x[10],y[10];
#pragma acc data copy(x) pcopy(y)           #pragma omp target data map(x,y)
{                                           {
    ...                                         ...
    #pragma acc kernels copy(x) pcopy(y)        #pragma omp target update to(x)
    {                                           #pragma omp target map(y)
        // Accelerator Code                     {
    ...                                             // Accelerator Code
    }                                               ...
    ...                                         }
}                                           }