From CDOT Wiki
Revision as of 13:13, 19 April 2013 by Rwstanica (talk | contribs) (Assignment 3)
Jump to: navigation, search

GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary


Tudy and I are working on a branch of the LHC (Large Hadron Collider) BPM (beam position measurement) analysis code to be made parallel using GPUs. The branch can be found here on GitHub. This code performs harmonic analysis of turn-by-turn data from a circular acceleration (measured data or from numerical simulation). It is heavily used in the optics re-construction for the LHC and the same tools are being exported to the lower-energy machines at CERN.

Team Members

  1. Robert Stanica
  2. Tudy Minea
  3. Email All


Assignment 1

Robert Stanica:

Analysis of all LHC BPMs is a major real time computational bottleneck. As such, an effort has been underway to decrease the real computational time of the SUSSIX codes by paralellizing it. This has been achieved using the OpenMP API to parallelize both the Fortran and C implementations of the code. We can then safely assume that the code can also be successfully paralellized to run off the GPU.

For Assignment 1, I profiled sussix4drivexxNoO.f and Drive_God_lin.c.. The most time was spent on two subroutines in the Fortran portion of the program.

time   seconds   seconds    calls  ms/call  ms/call  name    
47.04      5.80     5.80   311575     0.02     0.02  zfunr_
25.63      8.96     3.16      522     6.05     6.05  ordres_
 8.43     10.00     1.04   312021     0.00     0.00  cfft_
 3.89     10.48     0.48   312559     0.00     0.02  tunelasr_

zfunr and ordres take the majority of the total run time so we'll focus on parallelizing these two subroutines first.

Assignment 1 UPDATE

Sad news for Team Tudybert, but we were unable to successfully compile the project using CUDA libraries. The issue was the dynamic nature of the CUDA libraries. There's a "-static" flag that's being added in the makefile of Drive_God_lin that prevents dynamic libraries from being used, and CUDA libraries are all dynamic. This flag was there in the original makefile and removing it causes the program to throw undefined references to a few methods that I was unable to find in the source code or online. I'm assuming that since this flag was put in place by a coder from CERN, removing it is going to require changes to the code that are beyond my abilities. As such, I've switched projects to Natalia's image processing application.

Assignment 1 Revisted

I can't thank Natalia enough for letting me use her program! This was an extremely last minute switch so I was fortunate to find a new project so quickly. I'm now working on parallelizing a very basic image processing application. The application supports the .PGM file format - an image file that was designed to be easy to learn and write programs for. This file contains the type of image (in our case P5), the file name, the number of rows and columns, and the data in the form of a 2D array. There's no compression and all data is readable with a text editor. The main source for the application can be found here [1]. The initial source is written in C++ and contains methods for enlarging, shrinking, translating, rotating, negating, and reflecting images. Natalia has already parallelized the negate method, so it was agreed that I would pick one of the others. I chose the enlarge image method.

Here's the flat profile for 50 runs of enlarging a 512px x 512px image 4 times:

 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
49.64      1.39     1.39       50    27.80    27.80  Image::operator=(Image const&)
28.93      2.20     0.81       50    16.20    16.20  Image::Image(int, int, int)
19.64      2.75     0.55                             Image::enlargeImage(int, Image&)
 1.07      2.78     0.03  4194304     0.00     0.00  Image::getPixelVal(int, int) 

The code for enlargeImage():

 int rows, cols, gray;
 int pixel;
 int enlargeRow, enlargeCol;
 rows = oldImage.N * value;
 cols = oldImage.M * value;
 gray = oldImage.Q;    
 Image tempImage(rows, cols, gray);
 for(int i = 0; i < oldImage.N; i++)
   for(int j = 0; j < oldImage.M; j++)
     pixel = oldImage.pixelVal[i][j];
     enlargeRow = i * value;
     enlargeCol = j * value;
     for(int c = enlargeRow; c < (enlargeRow + value); c++)
       for(int d = enlargeCol; d < (enlargeCol + value); d++)
         tempImage.pixelVal[c][d] = pixel;
 oldImage = tempImage;

The four for loops look like they could be parallelized since they just serve as counters. From the flat file, it seems that the majority of the time is spent in the overloaded operator=() method. The code for this is:

   N = oldImage.N;
   M = oldImage.M;
   Q = oldImage.Q;
   if(dim1 != NULL)
       delete[] dim1;
   pixelVal = new int* [N];
   dim1 = new int[N*M];
   for(int i = 0; i < N; i++)
       pixelVal[i] = new int [M];
       for(int j = 0; j < M; j++)
           pixelVal[i][j] = oldImage.pixelVal[i][j];
           dim1[i*N + j] = oldImage.dim1[i*N + j];

The chunk of the processing is wasted on copying the two arrays over from one image to another. If I have time I might look into parallelizing this as well. It would be interesting to see if the speed of the GPU can overcome the overhead of copying to and from the device.

Assignment 2

For Assignment 2 I simply put the four for loops into a kernel and replaced the outermost loop with thread indices. I made a helper method that set up memory on the device and launched the kernel with a 1 dimension array of blocks each containing 1 thread. I launched as many blocks of 1 thread as there were rows in the image file. I figured this was the quickest way to get this method parallelized. Unfortunately I hit a wall with my data sizes. The CPU version of the enlarge image method fails when run for more than 50 loops. The error thrown is a Visual Studio debugging error so I'm think VS isn't too happy with having the CPU hogged for so long. As a result I've had to extrapolate times for larger loops by assuming a linear increase in time taken.

Here's the code for newly parallelized method:

   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   int enlargeRow, enlargeCol;
   __shared__ int pixel;
   for(int j = 0; j < nj; j++)
       pixel = work[idx * nj + j];
       enlargeRow = idx * factor;
       enlargeCol = j * factor;
       for(int c = enlargeRow; c < (enlargeRow + factor); c++)
           for(int d = enlargeCol; d < (enlargeCol + factor); d++)
               result[d + c * blockDim.x * gridDim.x * factor] = pixel;

While I did see a decrease in the time taken to run 50 loops, the decrease wasn't as significant as I had hoped. Obviously this kernel isn't optimized so I'm looking forward to some more impressive results as I update the code.

Assignment 3

After making sure memory access is coalesced and replacing the second counter loop with threads from a 2 dimensional block of 2 dimensional threads, I've achieved significant speed ups in the program. All it took was launching the kernel with an optimized 2D array of blocks each containing a 2D array of threads. For assignment 2 I had a grid with 1 thread for each column in the image. That meant each thread was running 3 nested for loops to do the necessary calculations for enlarging. Figuring out the math for calculating the correct index in the arrays proved to be tricky. Although I knew exactly what to do in concept, the two extra nested for loops threw me off. For a long time the image was being enlarged correctly but the physical dimensions of the image weren't increasing. Once I had that figured out the image was enlarging but not to the new dimensions. After some tracing and trial and error I managed to find the right formula to calculate the indices.

Here's the final, optimized enlarge method: