Changes

GPU610/TeamKappa

3,982 bytes added, 17:55, 13 December 2015

→‎Assignment 3

For the rotation code, at around 500x600 dimensions or 300,000 pixels, the speed is about the same. However, as the image size increases, the scalar code will become much slower in comparison to the GPU code. To parallelize the code, I just used the straight forward tactic of unrolling the two for loops and assigning one thread for what would be each iteration. Since each index of the array was looked at individually, there is no problem doing that. As you can see in the benchmarks, through this process the speed was reduced greatly at higher resolutions.

== Assignment 3 (Matt Jang) == === Image Rotation === This is the kernel that I had originally made for Assignment 2. To optimize this I wanted to make use of all the techniques that we learned in class like shared memory and the like. Although I wasn't able to take advantage of everything I wanted to do, I was still able to speed up this kernel to preform between 3 and 4 times faster. The following are the steps that I used to optimize.; Changed the number of threads per block to 128 from 1024. : This step was to improve the occupancy. However, the occupancy was already high on my card so this didn't make much of a difference. On the smallest image, the time went from '''293μs to 265μs''' and on the largest image, the time went from '''3995μs to 3617μs'''.; Optimized out rows/2 and cols/2 into parameters. : Since this operation was preformed every single iteration, I figured it would be quicker to pass through that value in the parameters. On the smallest image, the time went from '''265μs to 193μs''' and on the largest image, the time went from '''3617μs to 2621μs'''. ; Used device functions __cosf and __sinf. : In class we had learned that there were special device functions for trigonometry. I figured I would use them but the speed up was much more than I had expected. On the smallest image, the time went from '''193μs to 93μs''' and on the largest image, the time went from '''2621μs to 1232μs'''.; Optimize uses of __cosf and __sinf. : Since using the device functions made such a big change and since there were two identical calls to each, I stored the value of their result in a register and used the register twice. This didn't make such a big change but I am sure that if I had done it before using __cosf and __sinf, the difference would have been much bigger. On the smallest image, the tiem went from '''93μs to 91μs''' and on the largest image, the time went from '''1232μs to 1196μs'''.; Experimental division optimization. : This last optimization made almost no difference but I more wanted to try something new. I changed "'''int col = index / rows;'''" to "'''int col = (index - row) / rows;'''". The idea is that if the computer gets a whole number from division, it would be a bit faster. I don't fully understand why it worked but there was a consistent speed up of around 0.5% to 1%. On the smallest image, the time went from '''91μs to 90μs''' and on the largest image, the time went from '''1196μs to 1191μs'''. ==== Unoptimized ==== const unsigned ntpb = 1024; __global__ void kernel_rotate(int * old_image, int * temp_image, float rads, int rows, int cols) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index > rows * cols) { return; } int row = index % rows; int col = index / rows; int new_row = (int)(rows / 2 + ((row - rows / 2) * cos(rads)) - ((col - cols / 2) * sin(rads))); int new_col = (int)(cols / 2 + ((row - rows / 2) * sin(rads)) + ((col - cols / 2) * cos(rads))); if (!(new_row >= rows || new_row < 0 || new_col >= cols || new_col < 0)) { temp_image[rows * new_col + new_row] = old_image[index]; } } ==== Optimized ==== __global__ void kernel_rotate(int * old_image, int * temp_image, float rads, int rows, int cols, int half_rows, int half_cols, int rows_x_cols) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index > rows_x_cols) { return; } int row = index % rows; int col = (index - row) / rows; float cosf_rads = __cosf(rads); float sinf_rads = __sinf(rads); int new_row = (int)(half_rows + ((row - half_rows) * cosf_rads) - ((col - half_cols) * sinf_rads)); int new_col = (int)(half_cols + ((row - half_rows) * sinf_rads) + ((col - half_cols) * cosf_rads)); if (!(new_row >= rows || new_row < 0 || new_col >= cols || new_col < 0)) { temp_image[rows * new_col + new_row] = old_image[index]; } }

Matt Jang

1

edit

Changes

GPU610/TeamKappa

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools