The bitonic sorting algorithm requires the input length n to be a power of 2.
To observe a noticeable time gap, we tested n = 2^16, 2^20, and 2^24.
<source>
void bitonicSort(int* array, int N){
    // host-side driver: launches the kernel once per (stage, round) pair;
    // the launch loop and the kernel body are shown in the figures below
}
</source>
==Assignment 2==
Let's see how this algorithm sorts, along with an explanation of several terms.
[[File:BitonicSort1.png]]
This is the launch statement. We use a nested for loop: the outer loop variable i is the stage, and it increases up to log2(n). For example, if n is 16, i runs from 1 to 4. The inner loop represents the round, and the round cannot be larger than the stage.
[[File:bitonicLaunch.png|500px|thumb|left|Bitonic sort launch]]
This is the bitonic kernel. It uses several variables.
[[File:BitonicKernel.jpg|500px|thumb|left|Bitonic sort kernel]]
'''idx''' determines which array element each thread picks. For example, in stage 1, thread 0 gets (0 / 1) * 1 * 2 + 0 % 1 = 0, thread 1 gets (1 / 1) * 1 * 2 + 1 % 1 = 2, thread 2 gets (2 / 1) * 1 * 2 + 2 % 1 = 4, and so on: 6, 8, 10, ... This is because in stage 1, index 0 is compared with 1, 2 with 3, 4 with 5, and so on. As another example, in round 1 of stage 2, thread 0 gets (0 / 2) * 2 * 2 + 0 % 2 = 0, thread 1 gets (1 / 2) * 2 * 2 + 1 % 2 = 1, thread 2 gets (2 / 2) * 2 * 2 + 2 % 2 = 4, and thread 3 gets (3 / 2) * 2 * 2 + 3 % 2 = 5, so the threads pick 0, 1, 4, 5, 8, 9, ... This is because in round 1 of stage 2, index 0 is compared with 2 and 1 with 3, then 4 with 6 and 5 with 7.
N is 65536, 2^16.
The red box shows bitonic sort on the GPU, the blue box shows quick sort on the CPU, and the green box shows bitonic sort on the CPU.
[[File:65536.png|500px|thumb|left|2^16]]
N is 1048576, 2^20.
[[File:1048576.png|500px|thumb|left|2^20]]
N is 16777216, 2^24.
[[File:16777216.png|500px|thumb|left|2^24]]
Below is the comparison between bitonic sort using the GPU, quick sort using the CPU, and bitonic sort using the CPU.
[[File:BitonicGraph.jpg|500px|thumb|left|sorting algorithms comparison result]]

==Assignment 3==
While working on the project, we discovered that cudaMemcpy(HtoD) and (DtoH) take a long time, so we decided to use pinned memory instead of pageable memory to improve performance. "cudaHostAlloc() will allocate pinned memory which is always stored in RAM and can be accessed by GPU's DMA (direct memory access) directly."

[[File:Resultone.png|500px|thumb|left|pinned memory vs pageable memory]]

Using pinned memory is about 2 times faster than using pageable memory.
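A minimal sketch of the switch from pageable to pinned memory; the buffer names and size are placeholders, not the project's actual code:

```cuda
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;
    int *h_array;   // host buffer
    int *d_array;   // device buffer

    // Pageable version would be: h_array = (int*)malloc(N * sizeof(int));
    // Pinned (page-locked) memory lets the GPU's DMA engine access the
    // buffer directly, so cudaMemcpy skips the extra staging copy.
    cudaHostAlloc((void**)&h_array, N * sizeof(int), cudaHostAllocDefault);
    cudaMalloc((void**)&d_array, N * sizeof(int));

    cudaMemcpy(d_array, h_array, N * sizeof(int), cudaMemcpyHostToDevice);
    // ... launch the bitonic kernels, then copy the sorted data back ...
    cudaMemcpy(h_array, d_array, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    cudaFreeHost(h_array);  // pinned memory is freed with cudaFreeHost, not free()
    return 0;
}
```

Note that pinned memory cannot be paged out, so allocating too much of it can degrade overall system performance; it is best reserved for buffers involved in frequent transfers.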