Open main menu

CDOT Wiki β

Changes

K2

1,308 bytes added, 20:20, 16 April 2018
Assignment3
The bitonic sorting algorithm is devised for the input length n being a power of 2.
To check the noticeable time gap, we put 2^1516, 2^20, 2^2524.
<source>
void bitonicSort(int* array, int N){
==Assignment2==
 
Let's see how this algorithm works to sort and explanation of several terms.
 
 
[[File:BitonicSort1.png]]
First, array is unsorted array.Second box is stage 1, and the orange boxes are sorted in ascending order and green boxes are sorted in descending order.Third box is stage 2 and it has 2 rounds. In first round, there are 2 orange and green boxes that contains 4 elements. In this case, index 0 and 2 and 1 and 3 will be compared and sorted in ascending if orange box and in descending if green box. Then in round 2, index 0 and 1 , 2 and 3 ... will be compared and sorted. The third and fourth boxes are same.  This is launch statement. We use nested for loop. i is stage and it increases until log2(n). For example, let say n is 16, so i is 1 to 4. Another for loop represents round and round cannot be bigger than stage.[[File:bitonicLaunch.png|500px|thumb|left|Bitonic sort launch]]           This is bitonic kernel. There are several variables.[[File:BitonicKernel.jpg|500px|thumb|left|Bitonic sort kernel]]                                
second box is stage 1, and orange boxes are sorted in ascending order and green boxes are sorted in descending order.
Third box is stage 2 and it has 2 rounds. In first round, there are 2 orange and green boxes that contains 4 elements. in this case, index 0 and 2 and 1 and 3 will be compared and sorted in ascending if orange box and in descending if green box. then in round 2, index 0 and 1 , 2 and 3 ... will be compared and sorted. Third box and Fourth box are same.
[[File:bitonicLaunch.png]]
This is launch statement. we use nested for loop. i is stage and it increases until log2(n). for example, let say n is 16, so i is 1 to 4. Another for loop represents round and round cannot be bigger than stage.
[[File:bitonicKernel.png]]
This is bitonic kernel. There are several variables.
'''Gap''' expresses gap between 2 elements that will be compared. for For example, if stage is 2 and round is 1, gap will be 2. because index 0 will be compared with index 2 and index 1 will be compared with index 3. if stage is 2 and round is 2, gap will be 1 and index 0 will be compared with index 1 so on. 
'''Group''' shows where each threads are involved in whether ascending or descending order. for example, thread 0 in stage 1 will be 0 because (0 / ( 1 << (1-1))) % 2 = 0 and thread 1 will be 1 because (1 / (1 << (1-1))) % 2 = 1. so in stage 1, thread will have 0 1 0 1 0 1 .... so each threads will sort in ascending and in descending in order. if stage is 2, thread 0 will be 0 and thread 1 will also be 0 because ( 1 / (1 << 1)) % 2 = 0. so each threads in stage 2 will have 0 0 1 1 0 0 1 1 ....
 '''idx''' shows which elements in array will be picked by threads. for For example, thread 0 in stage 1 will have (0 / 1) * 1 * 2 + 0 % 1 = 0, and thread 1 will have (1 / 1) * 1 * 2 + 1 % 1 = 2, and thread 2 will have (2 / 1) * 1 * 2 + 2 % 1 = 4 ... 6, 8, 10. because in stage 1, index 0 will be compared with 1, 2 compared with 3, 4 compared with 5 .... another example, if thread 0 in round 1 in stage 2 will have ( 0 / 2)*2*2 + 0%2 = 0, thread 1 will have ( 1/2)*2*2 + 1%2 = 1, thread 2 will have ( 2/2)*2*2 + 2%2 = 4, thread 3 will have (3/2) * 2 * 2 + 3%2 = 5. So threads will have 0, 1, 4, 5, 8, 9 .... because in round 1 in stage 2, index 0 will be compared with 2 and 1 compared with 3, 4 compared with 6 and 5 compared with 7.    N is 65536, 2^16. The red box shows Bitonic using GPU, the blue box shows quick using CPU, green box shows bitonic using CPU.[[File:2power to 16.jpg|500px|thumb|left|2^16]]                         N is 1048576, 2^20.[[File:1048576.png|500px|thumb|left|2^20]]                         N is 16777216, 2^24[[File:16777216.png|500px|thumb|left|2^24]]                         Below is the comparison between bitonic sort using GPU, quick sort using CPU, bitonic sort using CPU. [[File:BitonicGraph.jpg|500px|thumb|left|sorting algorithms comparison result]]                            ==Assignment3==While working on the project, we discovered that cudaMemcpy(HtoD) and (DtoH) takes long time.​ so, we decided to use pinned memory instead of pageable memory​ to improve its performance. "cudaHostAlloc() will allocate pinned memory which is always stored in RAM and can be accessed by GPU's DMA(direct memory access) directly" [[File:Resultone.png|500px|thumb|left|pinned memory vs pageable memory]]                            using pinned memory is 2 times faster than pageable memory
44
edits