CDOT Wiki β

Three-Star

1,650 bytes added, 09:54, 9 April 2018
Assignment 3
=== Assignment 1 ===
==== Image Profiling Processor Application ====
We chose to profile the image-processing program shown here: http://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ , using the sample program files (main/image.h/image.cpp), which process PGM (portable gray map) files.
We pulled PGM sample files from here: https://userpages.umbc.edu/~rostamia/2003-09-math625/images.html
[[File:Callgraphpt1.png]]
 
[[File:Callgraphpt2.png]]
The rotate-image function is one of the longer-running functions and looks like it has potential for parallelization.
There are some possible issues with combining the simple-lzw-compression-algorithm with CUDA. You cannot use the C++ <code>string</code> type in a kernel because CUDA does not provide a device version of the C++ string library that could run on the GPU. Even if it were possible to use <code>string</code> in a kernel, it is not something you would want to do, because <code>string</code> manages memory dynamically, which would likely be slow.
 
https://stackoverflow.com/questions/26993351/is-there-a-penalty-to-using-char-variables-in-cuda-kernels?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
Essentially, using chars on the GPU would require 8-bit arithmetic, with conversions to and from the 32-bit arithmetic the hardware natively performs. On top of that, the compress function relies on a map of string/int pairs to shrink the size of the file. Even if we did manage the character operations, the kernel would still need to look up the corresponding integer in that string map, which could prevent us from using device memory for optimization.
=== Assignment 2 ===
[All tests from this point on use 100 degrees as the rotation input]
Original CPU Implementation:
{| class="wikitable"
! Function !! CPU-Only !! GPU-CPU !! Speedup (%)
|-
| '''CUDA memory allocation''' || n/a || 1164 ms || n/a
|-
| '''Copy image to device memory''' || n/a || 6 ms || n/a
|-
| '''Kernel''' || n/a || 0 ms || n/a
|-
| '''Copy device image to host temp variable''' || n/a || 6 ms || n/a
|-
| '''Copy temp image to original image variable''' || n/a || 43 ms || n/a
|-
| '''Total rotation time (no allocation, with memcpy)''' || 1717 ms || 55 ms || 3021.82%
|-
| '''Total run time''' || 1775 ms || 1294 ms || 37.17%
|}
'''Shared Memory''' (Derrick Leung)

Shared memory does not really help here, because we are not performing any computation on the matrix in the kernel; the only work being done is copying memory.
'''Coalesced Memory''' (Derrick Leung)

Changed matrix access from column order to row order (16x16 block size), so that adjacent threads access adjacent memory locations.
 
[[File:Coaslescedchangepng.png]]
 
'''Block Size''' (Timothy Moy)
The code modified was line 22:
 const int ntpb = 16; // number of threads per block
The first quick way to try to improve performance was to change the block size. Playing with the block size changed the kernel run times, but it wasn't apparent exactly what caused the differences. Most likely the 16x16 block configuration does not use up all the memory of the SM, yet is still large enough to give a boost in execution times. https://devtalk.nvidia.com/default/topic/1026825/how-to-choose-how-many-threads-blocks-to-have-/
[[Media:assign3Assign3-ntpb.png]]
In the end, a block size of 16 by 16 proved to be best for run times.
I then tried merging the sinf() and cosf() calls into one via sincosf(), so that the kernel made fewer function calls. That trimmed the run times a bit, but then I noticed that sin and cos never change, since our angle never changes. This led to computing sin and cos on the host and passing them in as parameters to the kernel. The result was a much more significant run-time improvement, since the kernel is no longer calculating the same numbers in every thread.
Kernel signature changes:

Before:
 __global__ void rotateKernel(int* oldImage, int* newImage, int rows, int cols, float rads) {
After:
 __global__ void rotateKernel(int* oldImage, int* newImage, int rows, int cols, /*float rads*/ float sinRads, float cosRads) {

Kernel code changes, with the three variants tested:

Per-thread sinf()/cosf():
 float sinRads = sinf(rads);
 float cosRads = cosf(rads);
 //float sinRads, cosRads;
 //__sincosf(rads, &sinRads, &cosRads);
Per-thread __sincosf():
 //float sinRads = sinf(rads);
 //float cosRads = cosf(rads);
 float sinRads, cosRads;
 __sincosf(rads, &sinRads, &cosRads);
Host-precomputed (no trig in the kernel):
 //float sinRads = sinf(rads);
 //float cosRads = cosf(rads);
 //float sinRads, cosRads;
 //__sincosf(rads, &sinRads, &cosRads);

Host function additions:
 float cos1 = cos(rads);
 float sin1 = sin(rads);

Kernel launch changed from:
 rotateKernel<<<dGrid, dBlock>>>(d_a, d_b, rows, cols, rads);
to:
 rotateKernel<<<dGrid, dBlock>>>(d_a, d_b, rows, cols, sin1, cos1);

The graph below shows the pronounced difference between the different sin/cos methods.

[[File:assign3-sincosall.png]]
There may be other values that could be computed outside the kernel, like r0 and c0, but due to time limitations they weren't tested.
All assignments are compared in this file, under the "a1a2a3comps" sheet: [[File:assignment2 profileAssignment3_profile.xlsx.txt]]
https://github.com/dleung25/GPU610-Assignment3-Image-Profiling
 
[We used the images in the GitHub repository and 100 degrees for all our tests]