Open main menu

CDOT Wiki β

Changes

Savy Cat

3,348 bytes added, 22:14, 10 April 2018
Assignment 3
=== Assignment 3 ===
==== "Register" Index ====
For my first attempt at optimization, I thought maybe, just maybe, the index calculations were being performed from within global memory:
 
<nowiki>
__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
int k = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (k < src_w && j < src_h)
dst[(src_h - 1 - j) + k * src_h + src_w * src_h * z] = src[threadIdx.x + threadIdx.y * src_w + src_w * src_h * z];
 
}</nowiki>
 
So I declared two register variables and determined the indexes prior:
 
<nowiki>
__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
int k = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int d = (src_h - 1 - j) + k * src_h + src_w * src_h * z;
int s = threadIdx.x + threadIdx.y * src_w + src_w * src_h * z;
if (k < src_w && j < src_h)
dst[d] = src[s];
 
}</nowiki>
 
This only had a slightly negative effect. Although, such a small difference may have been due to the luck of the run:
 
[[File:Summary-4.png]]
 
==== Unsigned Char vs. Float ====
The first real improvement came from changing PX_TYPE from float back to unsigned char, as used in the serial version. Unsigned char is good enough for all .jpg colour values (255). GPUs are designed to perform operations on floating point numbers, however, we are not performing any calculations outside of the indexing. The performance of the kernel was the same for float or unsigned char. We copy the source image to device once, and back to the host 12 times, making size relevant.
 
{| class="wikitable"
|+Size Comparison
|-
|
|Unsigned Char
|Float
|-
|Tiny_Shay.jpg
|1.93 KB
|7.73 KB
|-
|Medium_Shay.jpg
|5.71 MB
|22.8 MB
|-
|Large_Shay.jpg
|22.8 MB
|91.4 MB
|-
|Huge_Shay.jpg
|91.4 MB
|365 MB
|}
 
This saves almost one second worth of latency for the largest file, bringing cudaMemcpy down to about the same time as the kernel execution:
 
[[File:Summary-5.png]]
 
==== Shared Memory ====
I could not think of how to utilize shared memory for this application. No calculations are being performed. Copying to shared memory would be an additional operation, as one write to global memory is required either way. By copying a small chunk of the source image to shared memory to improve read time, the indexing logic would no longer work.
 
==== Constant Memory ====
Utilizing constant memory for the source image was something I wanted to try. The largest unsigned char file of 91.4 MB seemed affordable, and we do not write to it.
 
Since it's required to use a constant value when declaring the size of the host variable, I needed to define the size of the largest file and use that for all files:
 
<nowiki>#define SRC_MAX_SIZE 95883264
 
__constant__ PX_TYPE d_src[SRC_MAX_SIZE];</nowiki>
 
Copy the actual number of elements over:
 
<nowiki>// Copy h_src to d_src
std::cout << "Copying source image to device ..." << std::endl;
cudaMemcpyToSymbol(d_src, h_src, w * h * sizeof(PX_TYPE) * 3);</nowiki>
 
Compiling gave an error saying 91MB is too much memory to use:
 
<nowiki>CUDACOMPILE : ptxas error : File uses too much global constant data (0x5b71000 bytes, 0x10000 max)</nowiki>
 
The only example file that fit and compiled was Tiny_Shay.jpg, which there is no point in improving.
93
edits