Assignment 3 (Matt Jang)
=== Image Reflection ===
Since I wasn't able to apply many major optimizations to the image rotation, I decided to add another image manipulation function. Although I still couldn't use shared memory to make anything faster, I was able to try a couple of new things.
; Split the reflection into two kernels. : This is the most obvious of the optimizations. Since there are two distinct operations (horizontal flip and vertical flip), it makes sense to give each one its own kernel, so that each kernel can be optimized for its single operation. On the smallest image, the time went from '''59μs to 47μs''' and on the largest image, the time went from '''711μs to 629μs'''.
; Only process half the image. : This is the other obvious optimization. Instead of going through every pixel and writing its flipped copy to a temporary array, I iterate through only half of the pixels and swap each one in place with its mirror on the other side. There was one catch. Depending on the kernel, I had to populate my one-dimensional array in either column-major order (horizontal flip) or row-major order (vertical flip), so that the first half of the indices covers either the left half or the top half of the image. That way, each kernel only has to subtract from either cols or rows to find the partner pixel, and the memory access stays sequential. On the smallest image, the time went from '''47μs to 31μs''' and on the largest image, the time went from '''629μs to 416μs'''.
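The index arithmetic behind the half-image swap can be checked on the host before it goes into a kernel. The sketch below is plain C++ rather than CUDA, and the helper names (<code>partner_horizontal</code>, <code>partner_vertical</code>, <code>reflect_in_place</code>) are made up for this illustration. It assumes the layouts described above: column-major for the horizontal flip and row-major for the vertical flip.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Partner index for a horizontal flip of a column-major image
// (pixel (r, c) lives at rows * c + r). Mirrors the column, keeps the row.
int partner_horizontal(int index, int rows, int cols) {
    int row = index % rows;
    int col = index / rows;
    return rows * (cols - 1 - col) + row;
}

// Partner index for a vertical flip of a row-major image
// (pixel (r, c) lives at cols * r + c). Mirrors the row, keeps the column.
int partner_vertical(int index, int rows, int cols) {
    int row = index / cols;
    int col = index % cols;
    return cols * (rows - 1 - row) + col;
}

// Host-side stand-in for the kernels: each loop iteration plays the role of
// one thread and swaps a pixel in the first half with its mirror.
void reflect_in_place(std::vector<int>& image, int rows, int cols, bool horizontal) {
    assert(static_cast<int>(image.size()) == rows * cols);
    int count = horizontal ? rows * (cols / 2) : (rows / 2) * cols;
    for (int index = 0; index < count; index++) {
        int other = horizontal ? partner_horizontal(index, rows, cols)
                               : partner_vertical(index, rows, cols);
        std::swap(image[index], image[other]);
    }
}
```

Note the <code>- 1</code> in both partner formulas: the mirror of column 0 is column <code>cols - 1</code>, not <code>cols</code>. When a dimension is odd, the middle row or column is its own mirror and is left untouched because the half count rounds down.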
==== Kernel Speeds ====
{| class="wikitable" border="1"
! Image Size !! Before !! After
|-
| 500 x 600 || 59μs || 31μs
|-
| 800 x 800 || 110μs || 63μs
|-
| 1600 x 900 || 244μs || 141μs
|-
| 1920 x 1080 || 388μs || 219μs
|-
| 2747 x 1545 || 711μs || 416μs
|}
==== Unoptimized ====
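 __global__ void kernel_reflect(int * old_image, int * temp_image, bool flag, int rows, int cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     // guard against threads past the end of the image
     if (index < rows * cols) {
         // column-major layout: index = rows * col + row
         int row = index % rows;
         int col = index / rows;
         int new_row = 0;
         int new_col = 0;
         if (flag) {
             // horizontal flip: mirror the column (the last valid column is cols - 1)
             new_row = row;
             new_col = cols - 1 - col;
         }
         else {
             // vertical flip: mirror the row (the last valid row is rows - 1)
             new_row = rows - 1 - row;
             new_col = col;
         }
         temp_image[rows * new_col + new_row] = old_image[index];
     }
 }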
==== Optimized ====
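 const int reflect_ntpb = 128;
 // Horizontal flip. Expects column-major data (index = rows * col + row).
 __global__ void kernel_reflect_horizontal(int * old_image, int rows, int cols, int half_cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     // only threads covering the left half do work
     if (index < rows * half_cols) {
         // same row, mirrored column (the last valid column is cols - 1)
         int other_index = rows * (cols - 1 - index / rows) + index % rows;
         int temp = old_image[other_index];
         old_image[other_index] = old_image[index];
         old_image[index] = temp;
     }
 }
 // Vertical flip. Expects row-major data (index = cols * row + col).
 __global__ void kernel_reflect_vertical(int * old_image, int rows, int half_rows, int cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     // only threads covering the top half do work
     if (index < half_rows * cols) {
         // same column, mirrored row (the last valid row is rows - 1)
         int other_index = ((rows - 1 - index / cols) * cols) + index % cols;
         int temp = old_image[other_index];
         old_image[other_index] = old_image[index];
         old_image[index] = temp;
     }
 }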
 long long Image::reflectImage(bool flag, Image & source) {
     int rows = source.N;
     int cols = source.M;
     int half_cols = cols / 2;
     int half_rows = rows / 2;
     // one thread per pixel in the half that gets swapped
     int nb = 0;
     if (flag)
         nb = (rows * half_cols + reflect_ntpb - 1) / reflect_ntpb;
     else
         nb = (half_rows * cols + reflect_ntpb - 1) / reflect_ntpb;
     int * d_old_image = nullptr;
     int * h_old_image = new int[rows * cols];
     // flatten: column-major for the horizontal flip, row-major for the vertical flip
     if (flag) {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 h_old_image[rows * c + r] = source.pixelVal[r][c];
     }
     else {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 h_old_image[cols * r + c] = source.pixelVal[r][c];
     }
     cudaMalloc((void**)&d_old_image, rows * cols * sizeof(int));
     if (!d_old_image) {
         cout << "CUDA: out of memory (d_old_image)" << endl;
         delete[] h_old_image;
         return -1;
     }
     high_resolution_clock::time_point first_start;
     first_start = high_resolution_clock::now();
     cudaMemcpy(d_old_image, h_old_image, rows * cols * sizeof(int), cudaMemcpyHostToDevice);
     dim3 dGrid(nb);
     dim3 dBlock(reflect_ntpb);
     if (flag)
         kernel_reflect_horizontal<<<dGrid, dBlock>>>(d_old_image, rows, cols, half_cols);
     else
         kernel_reflect_vertical<<<dGrid, dBlock>>>(d_old_image, rows, half_rows, cols);
     // this blocking copy also waits for the kernel to finish
     cudaMemcpy(h_old_image, d_old_image, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);
     // unflatten using the same layout that was used to flatten
     if (flag) {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 source.pixelVal[r][c] = h_old_image[rows * c + r];
     }
     else {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 source.pixelVal[r][c] = h_old_image[cols * r + c];
     }
     auto duration = duration_cast<milliseconds>(high_resolution_clock::now() - first_start);
     // release the device and host buffers
     cudaFree(d_old_image);
     delete[] h_old_image;
     return duration.count();
 }
=== Conclusions ===
With this project, I had originally expected to get a bigger speed difference from one optimization or another, but it turns out that it isn't so easy to do. I was never able to get any meaningful results using shared memory in these kernels because every pixel is only looked at once. My optimization benchmarks came from NSIGHT, so they didn't include the host code that creates and reads back the 1D arrays. If I wanted to make a very fast image library, I would read and store the data in the layout the kernels expect, to avoid that overhead.
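As a rough illustration of that last point, an image type could keep its pixels in a single flat buffer from the start, so no copy between a 2D array and a 1D array is needed around each kernel call. The sketch below is plain C++ and entirely hypothetical: <code>FlatImage</code> and its members are not part of the assignment code. The layout flag exists because the two reflection kernels above expect different orderings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical image type that stores pixels in one contiguous buffer.
struct FlatImage {
    int rows;
    int cols;
    bool column_major;        // true: index = rows * c + r, false: index = cols * r + c
    std::vector<int> pixels;  // rows * cols values in the chosen layout

    FlatImage(int r, int c, bool col_major)
        : rows(r), cols(c), column_major(col_major),
          pixels(static_cast<std::size_t>(r) * c, 0) {}

    // Map a (row, col) pair to its position in the flat buffer
    std::size_t index(int r, int c) const {
        assert(r >= 0 && r < rows && c >= 0 && c < cols);
        return column_major ? static_cast<std::size_t>(rows) * c + r
                            : static_cast<std::size_t>(cols) * r + c;
    }

    int& at(int r, int c) { return pixels[index(r, c)]; }

    // Pointer and byte count in the form a copy call such as cudaMemcpy takes
    const int* data() const { return pixels.data(); }
    std::size_t bytes() const { return pixels.size() * sizeof(int); }
};
```

With a type like this, the host function would pass <code>data()</code> and <code>bytes()</code> straight to the device copy. A single stored layout only matches one of the two kernels as written, so the other kernel's index arithmetic would have to be adapted, or the image converted once.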

