Open main menu

CDOT Wiki β

Changes

K2

878 bytes added, 20:20, 16 April 2018
Assignment3
The bitonic sorting algorithm is devised for the input length n being a power of 2.
To check the noticeable time gap, we put 2^1516, 2^20, 2^2524.
<source>
void bitonicSort(int* array, int N){
==Assignment2==
Let's see how this algorithm works to sortand explanation of several terms
[[File:BitonicSort1.png]]
This is launch statement.
We use nested for loop. i is stage and it increases until log2(n). For example, let say n is 16, so i is 1 to 4. Another for loop represents round and round cannot be bigger than stage.
[[File:bitonicLaunch.png|500px|thumb|left|Bitonic sort launch]]      
This is bitonic kernel. There are several variables.
[[File:BitonicKernel.jpg|500px|thumb|left|Bitonic sort kernel]]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[[File:BitonicKernel.jpg]]
'''idx''' shows which elements in array will be picked by threads. For example, thread 0 in stage 1 will have (0 / 1) * 1 * 2 + 0 % 1 = 0, and thread 1 will have (1 / 1) * 1 * 2 + 1 % 1 = 2, and thread 2 will have (2 / 1) * 1 * 2 + 2 % 1 = 4 ... 6, 8, 10. because in stage 1, index 0 will be compared with 1, 2 compared with 3, 4 compared with 5 .... another example, if thread 0 in round 1 in stage 2 will have ( 0 / 2)*2*2 + 0%2 = 0, thread 1 will have ( 1/2)*2*2 + 1%2 = 1, thread 2 will have ( 2/2)*2*2 + 2%2 = 4, thread 3 will have (3/2) * 2 * 2 + 3%2 = 5. So threads will have 0, 1, 4, 5, 8, 9 .... because in round 1 in stage 2, index 0 will be compared with 2 and 1 compared with 3, 4 compared with 6 and 5 compared with 7.
 
N is 65536, 2^16.
The red box shows Bitonic using GPU, the blue box shows quick using CPU, green box shows bitonic using CPU.
[[File:655362power to 16.pngjpg|500px|thumb|left|2^16]]                      
N is 1048576, 2^20.
[[File:1048576.png|500px|thumb|left|2^20]]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[[File:1048576.png]]
N is 16777216, 2^24
[[File:16777216.png|500px|thumb|left|2^24]]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[[File:16777216.png]]
Below is the comparison between bitonic sort using GPU, quick sort using CPU, bitonic sort using CPU.
[[File:BitonicGraph.jpg|500px|thumb|left|sorting algorithms comparison result]]                            ==Assignment3==While working on the project, we discovered that cudaMemcpy(HtoD) and (DtoH) takes long time.​ so, we decided to use pinned memory instead of pageable memory​ to improve its performance. "cudaHostAlloc() will allocate pinned memory which is always stored in RAM and can be accessed by GPU's DMA(direct memory access) directly" [[File:Resultone.png|500px|thumb|left|pinned memory vs pageable memory]]                            using pinned memory is 2 times faster than pageable memory
44
edits