# Team Z

## Team Members

1. Bhrugesh Patel

# Assignment 1

## Project Image Spitfire

I am using the Image Processor code that Natalia found as the source code in which to look for CUDA optimization opportunities. The image processor includes functions for displaying an image, extracting a sub-image, enlarging, shrinking, reflecting, translating, and rotating an image, and producing its negative. The application relies heavily on for loops, which provides an opportunity to use the GPU to produce the output faster and so improve the application's performance. Most of the code is O(n²), with nested loops that can be parallelized on the GPU with CUDA.

Some of the source code that can be enhanced is as follows.

Image::Image(int numRows, int numCols, int grayLevels)

/* Creates an Image of numRows x numCols and creates the arrays for it*/ {

N = numRows;
M = numCols;
Q = grayLevels;

pixelVal = new int *[N];
for(int i = 0; i < N; i++)
{
pixelVal[i] = new int [M];
for(int j = 0; j < M; j++)
pixelVal[i][j] = 0;
}

}

// Convert the unsigned characters to integers

int val;

for(i=0; i<N; i++)

for(j=0; j<M; j++)
{
val = (int)charImage[i*M+j];
image.setPixelVal(i, j, val);
}
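The conversion loop above already reads charImage through a flat row-major index, i*M+j. As a minimal sketch of why that matters for a GPU port, the helpers below show the mapping in both directions; the reverse direction is what a one-thread-per-pixel kernel would apply to its thread index. The names toFlat and fromFlat are hypothetical and are not part of the original Image class.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helpers, not part of the original Image class.

// (row, col) -> flat row-major index for an image with numCols columns.
inline int toFlat(int row, int col, int numCols)
{
    assert(row >= 0 && col >= 0 && col < numCols);
    return row * numCols + col;
}

// flat index -> (row, col): the mapping a one-thread-per-pixel kernel
// would apply to its global thread index.
inline std::pair<int, int> fromFlat(int idx, int numCols)
{
    assert(idx >= 0 && numCols > 0);
    return { idx / numCols, idx % numCols };
}
```

Because every (i, j) pair corresponds to exactly one flat index, each iteration of the nested loop can be handed to a separate thread without the iterations interfering with each other.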

Image rotation

// goes through the array of the oldImage, uses the formulas to find where the pixel should go
//  then puts the old pixel value into the new pixel position on the tempImage
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
r1 = (int) (r0 + ((r - r0) * cos(rads)) - ((c - c0) * sin(rads)));
c1 = (int) (c0 + ((r - r0) * sin(rads)) + ((c - c0) * cos(rads)));

if(inBounds(r1,c1))  // makes sure the new pixel location is in bounds,
{
tempImage.pixelVal[r1][c1] = oldImage.pixelVal[r][c];
}
}
}

for(int i = 0; i < rows; i++)
{
for(int j = 0; j < cols; j++)
{
if(j + 1 < cols && tempImage.pixelVal[i][j] == 0)  // bounds guard so that j+1 stays inside the row
tempImage.pixelVal[i][j] = tempImage.pixelVal[i][j+1];
}
}

# Assignment 2

By the end of the semester, I have realized that picking this source code was more or less a mistake. The program's greatest shortcoming is that it only accepts PPM or PGM images, which are not freely available, and of the images I did find, 90% didn't work with the program at larger sizes: the program refused to accept them as PGM or PPM. I have finally come to understand how concurrent programming works. As I noted in Assignment 1, the current code has several openings where concurrent programming on the GPU can be applied.

The current structure of the image processor holds an image object. The data structure is as follows:

class Image
{
public:
//various methods
public:
int N; // number of rows
int M; // number of columns
int Q; // number of gray levels
int **pixelVal;
};
extern Image img;

The pixelVal member, which holds the value of each pixel, is defined as int**, and this became a roadblock. For the most part I was unable to make it compatible with CUDA code. I started by transforming the constructor. The constructor or copy constructor is called in almost every method of the class to perform any action. The constructor code is as follows:

inline Image::Image(int numRows, int numCols, int grayLevels)
/* Creates an Image of numRows x numCols and creates the arrays for it*/
{

N = numRows;
M = numCols;
Q = grayLevels;
pixelVal = new int *[N];

for(int i = 0; i < N; i++)
{
pixelVal[i] = new int [M];
for(int j = 0; j < M; j++)
pixelVal[i][j] = 0;
}
}

I wrote CUDA code for the constructor that assigns 0 to all the pixels concurrently on the GPU, with the aim of reducing the execution time. The CUDA kernel code is as follows:

__global__ void init(int* a, int v, int n)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n)
a[idx] = v;
}

This is a very simple kernel that assigns the value v to each element of the array a. This is how I called the kernel from the constructor:

inline Image::Image(int numRows, int numCols, int grayLevels)

/* Creates an Image of numRows x numCols and creates the arrays for it*/
{

N = numRows;
M = numCols;
Q = grayLevels;
pixelVal = new int *[N];
int n = N * M;

//Allocate device memory
int* dpixels;

cudaMalloc((void**)&dpixels, n * sizeof(int));
int nblks = (n + NTPB - 1) / NTPB;
init<<<nblks, NTPB>>>(dpixels, 0, n);
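// Note: pixelVal is an array of N row pointers, not a block of n ints (the int* to int** mismatch).
cudaMemcpy(pixelVal, dpixels, n * sizeof(int), cudaMemcpyDeviceToHost);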
cudaFree(dpixels);
}
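The block count in the constructor, (n + NTPB - 1) / NTPB, is a ceiling division: it rounds up so that every pixel gets a thread, and the kernel's if (idx < n) guard turns away the surplus threads in the last block. A minimal host-side sketch of that arithmetic follows. The helper names are hypothetical, and the value 256 used in the note afterwards is only an assumed threads-per-block count, since the value of NTPB is not given here.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= n (ceiling division).
inline int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of launched threads that fall past the end of the array and
// must be rejected by the kernel's `if (idx < n)` guard.
inline int idleThreads(int n, int threadsPerBlock)
{
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}
```

With an assumed 256 threads per block, a 1000-pixel image needs 4 blocks, and 24 of the 1024 launched threads do nothing.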

This version of the constructor ran, but I stumbled across two problems. 1) The kernel execution time for a 257 KB PGM file was 0.157 sec, while the execution time on the CPU was near 0 sec. 2) The constructor itself completed, but the program crashed afterwards, indicating that the copy from int* to int** did not work as expected.

After testing the code with a 769 KB PGM file, the CPU time was recorded as 0.1 sec while the kernel recorded 0.16 sec, a rise of only about 0.003 sec over the 257 KB run. The program would not run bigger files on either Windows or Linux, with either the CPU or the GPU code, and I was unable to find a large file that it would accept, which blocked further research. The results so far indicate that the GPU code is extremely inefficient with small files, but that as the file size grows the kernel time stays nearly flat while the CPU time increases.
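One common way around the int* to int** mismatch is to keep all the pixels in a single contiguous block and build the row pointers on top of it. The existing pixelVal[i][j] indexing keeps working, while the start of the block is an ordinary int* that could be handed to cudaMemcpy as one transfer of N * M ints. The following is a host-only sketch of that layout, not the approach used in this assignment; FlatPixels and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical replacement for the int** storage: one contiguous block of
// numRows * numCols ints plus a table of row pointers into it.
struct FlatPixels
{
    std::vector<int>  data;  // numRows * numCols values, row-major, zeroed
    std::vector<int*> rows;  // rows[i] points at the start of row i

    FlatPixels(int numRows, int numCols)
        : data(static_cast<std::size_t>(numRows) * numCols, 0),
          rows(numRows)
    {
        assert(numRows > 0 && numCols > 0);
        for (int i = 0; i < numRows; i++)
            rows[i] = data.data() + static_cast<std::size_t>(i) * numCols;
    }

    // Copying would leave the new row table pointing at the old block,
    // so copies are disabled in this sketch.
    FlatPixels(const FlatPixels&) = delete;
    FlatPixels& operator=(const FlatPixels&) = delete;

    // Same role as the original pixelVal member: p[i][j] indexing.
    int** pixelVal() { return rows.data(); }

    // Start of the contiguous block: a valid source or destination for a
    // single copy of numRows * numCols ints.
    int* flat() { return data.data(); }
};
```

A write through pixelVal()[i][j] and a read through flat()[i * numCols + j] touch the same int, which is the property the device-to-host copy in the constructor needs.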

Afterwards I tried to port the enlargeImage functionality to CUDA, but due to the nature of the code I was unable to convert it.

I then began working on reflectImage. After a few failures and some difficulties with transforming int** to int*, I was able to port the code from the CPU to CUDA. The original code is as follows:

int rows = oldImage.N;
int cols = oldImage.M;
Image tempImage(oldImage);
if(flag == true) //horizontal reflection
{
for(int i = 0; i < rows; i++)
{
for(int j = 0; j < cols; j++)
tempImage.pixelVal[rows - (i + 1)][j] = oldImage.pixelVal[i][j];
}
}

I transformed it into CUDA as follows:

__global__ void fliph(int* a, int* b, int r, int c)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i >= r || j >= c)
return;
a[(r - 1 - i) * c + j] = b[i * c + j];
}
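In row-major form, the original assignment tempImage.pixelVal[rows - (i + 1)][j] = oldImage.pixelVal[i][j] becomes dst[(r - 1 - i) * c + j] = src[i * c + j]. A host-side reference like the sketch below can be used to check a kernel's output element by element. The function name flipRowsReference is hypothetical, and the sketch assumes the image has already been flattened into a single array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: reverses the row order of a row-major
// r x c image. Each (i, j) iteration does the work one GPU thread would do.
inline std::vector<int> flipRowsReference(const std::vector<int>& src, int r, int c)
{
    assert(r >= 0 && c >= 0);
    assert(src.size() == static_cast<std::size_t>(r) * c);
    std::vector<int> dst(src.size());
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            dst[(r - 1 - i) * c + j] = src[i * c + j];
    return dst;
}
```

For a 3 x 2 image stored as {1, 2, 3, 4, 5, 6}, the result is {5, 6, 3, 4, 1, 2}, and applying the function twice returns the original image.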

The results were similar to those for the constructor. The CUDA memory allocation, cudaMalloc, is the most time-consuming operation on the GPU path. The CPU execution time for reflecting the 257 KB image was 0.01 sec, while the CUDA version took about 0.13 sec. When the code was tested with the 769 KB image, the CPU time went up to 0.04 sec while the CUDA version took 0.16 sec. The CPU execution time went up about 4 times, but the CUDA time rose by only about a quarter. This suggests that while the CUDA code is highly inefficient at small picture sizes because of the cudaMalloc time, that fixed cost matters less as the input grows, so its relative performance improves.

According to the NVIDIA profiler, the code was not using the GPU at full performance in terms of compute utilization and memory copy efficiency. The CUDA kernel can be further enhanced with shared memory and coalesced memory access.