Difference between revisions of "GroupNumberUndefined"

From CDOT Wiki
Jump to: navigation, search
(Assignment 3)
(Assignment 3)
Line 95: Line 95:
  
 
Upon further inspection of the function and kernel, I realized that the array of pixels taken from the oldImage was never used inside the kernel, so it was removed entirely. This include the removal of its memory allocation and the copying of the array from host to device, further reducing the run time of the function.  
 
Upon further inspection of the function and kernel, I realized that the array of pixels taken from the oldImage was never used inside the kernel, so it was removed entirely. This include the removal of its memory allocation and the copying of the array from host to device, further reducing the run time of the function.  
 +
 +
[[File:hArrayRemove.jpg]]
  
 
Furthermore, I previously put the "check for bounds" calculation and the "fill in empty pixels" calculation inside two separate nested for-loops. I have combined them into one, removing one nested for loops which will increase performance dramatically.
 
Furthermore, I previously put the "check for bounds" calculation and the "fill in empty pixels" calculation inside two separate nested for-loops. I have combined them into one, removing one nested for loops which will increase performance dramatically.
 +
 +
[[File:NestedCombined.jpg]]
  
 
Overall, this is what the optimized rotateImage() function and the rotate() kernel looks like:
 
Overall, this is what the optimized rotateImage() function and the rotate() kernel looks like:
Line 102: Line 106:
  
 
Profiling with the same images gives the following result.
 
Profiling with the same images gives the following result.
 +
 +
[[File:OptimizedChart.jpg]]

Revision as of 01:00, 11 April 2017


GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary

GroupNumberUndefined Project

Team Members

  1. Andrey Bykin
  2. Darren Suhadi
Email All

Progress

Assignment 1

Image Processing: Andrey's Profile

For profiling I decided to use Image Processing. I found a simple image processing code posted on online forums that is written in C++ and has many ways it can be optimized : http://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/. This code is able to take PGM images and manipulate them. I ran my tests using a file with the size of 24mb ( Only the file size matters in the program run time, since that determines how many pixels we have to modify.) I had to modify the code in order for it to be compatible with real time results. First run was done using no optimization when compiling, and shows the follow results.

Size24NORunTime.PNG

This was the run time with 0 optimization enabled.

Size24NO.PNG

This was the profiling information, as you can see the time in the function is almost split equally, and any of the functions can be modified for efficiency and parallel programming. For example this is the implementation of rotation function :

RotateImage.PNG

I also compiled the program using the -O2 flag so that compiler can perform its own optimization and this was the result :

Size24O2RuneTime.PNG

And the Profiling information :

Size24O2.PNG

I think for second part of the assignment I will see which functions can be improved the most and focus on that, but still leaving myself time to work on other functions if necessary.

Maze Solver: Darren's Profile

For my profiling, I decided to profile a maze solver algorithm by Paul Griffiths I found at http://www.paulgriffiths.net/program/c/maze.php. This algorithm will take a maze made in text files and find the solution. Considering the maze provided was fairly small (10x 10), I decided to attach multiple mazes together to form a 6867x101 maze to show a more prominent profile.

The first profiling was done with zero optimization:

FlatProfile.JPG

The second profiling was done with the -O2 flag compiler optimization:

O2Flag FlatProfile.JPG

As you can see, the majority of the cumulative time is done in the "look" function shown here:

LookFunction.jpg

While this function is done recursively, I feel like with a simple parallelization, this single function can be made much faster. Although, other parts of the code will also need to be changed to support larger dimension of mazes. With the current supported dimension, the amount of execution time is trivial, and if larger dimensions of maze is taken as input, a segmentation fault will occur.

Assignment 2

For assignment two we will be doing image processing. Choose to do the function enlarge() which basically makes the given image bigger. This is it's current implementation :

EnlargeImage.PNG

New Implementation :

EnlargeImageOld.PNG

And this is the kernel implementation :

Snippet.PNG

Before and after parallization analysis of Enlarge function:

EnlargeFunctionAnalysis.jpg




Another function that we chose to parallelize is the rotate function, which will take the user's input as the degrees in which to rotate the picture. The original implementation is:

OldRotate.JPG

The new implementation and the kernel is:

NewRotate.jpg

Kernel:

RotateKernel.JPG

Before and after parallization analysis of Rotate function:

RotateFunctionAnalysis.jpg

Assignment 3

The optimization for the rotate function and kernel was something I had to think about. At first, I was taking the suggestion of putting the check bounds and filling empty pixels inside the kernel, but I realized that is not possible without passing the image itself inside the kernel, and removing the use of "inBounds()" function caused the program to crash even before its execution. Therefore I decided to go against putting the image inside the shared memory for the kernel calculation because of the cudaMemcpy. It had been concluded that a majority of the run time is allocated to the memory transfer between the host and device. Considering that the rotate kernel only needs to populate two arrays, passing the image into the device memory will only add to the run time, especially with larger resolution images.

Upon further inspection of the function and kernel, I realized that the array of pixels taken from the oldImage was never used inside the kernel, so it was removed entirely. This include the removal of its memory allocation and the copying of the array from host to device, further reducing the run time of the function.

HArrayRemove.jpg

Furthermore, I previously put the "check for bounds" calculation and the "fill in empty pixels" calculation inside two separate nested for-loops. I have combined them into one, removing one nested for loops which will increase performance dramatically.

NestedCombined.jpg

Overall, this is what the optimized rotateImage() function and the rotate() kernel looks like:


Profiling with the same images gives the following result.

OptimizedChart.jpg