
Algo holics

3,812 bytes removed, 00:23, 8 April 2019
=== Assignment 2 ===
After carefully weighing the projects from assignment 1, we first decided to parallelize the neural network. However, on closer inspection of the source code, we found that the network had very low accuracy even after ten thousand iterations. Improving the algorithm first and only then parallelizing it did not seem pragmatic. So we settled on the next most interesting topic for the group, which was merge sort. Here is how the code was changed to its parallel version.
====CUDA enabled functions====
The main function was changed to copy the data from host to device, launch the kernel, copy the results back from device to host, and release all memory on both host and device.
<pre>
cudaMalloc((void**)&d_A, size * sizeof(long));
cudaMalloc((void**)&d_B, size * sizeof(long));
cudaMemcpy(d_A, h_A, size * sizeof(long), cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_ntpb, sizeof(dim3));
cudaMalloc((void**)&d_nbpg, sizeof(dim3));
cudaMemcpy(d_ntpb, &ntpb, sizeof(dim3), cudaMemcpyHostToDevice);
cudaMemcpy(d_nbpg, &nbpg, sizeof(dim3), cudaMemcpyHostToDevice);

long *d_actual = d_A;
long *d_swap = d_B;
long totalThreads = ntpb.x * nbpg.x;
float start_time = clock();

for (int width = 2; width < (size << 1); width <<= 1) {
    long slices = size / (totalThreads * width) + 1;
    merge_sort<<<nbpg, ntpb>>>(d_actual, d_swap, size, width, slices, d_ntpb, d_nbpg);
    cudaDeviceSynchronize();
    // swap the source and destination buffers for the next pass
    d_actual = d_actual == d_A ? d_B : d_A;
    d_swap = d_swap == d_A ? d_B : d_A;
}
float end_time = clock() - start_time;

// copy back the device array to the host array
cudaMemcpy(h_C, d_actual, size * sizeof(long), cudaMemcpyDeviceToHost);
</pre>
The serial code did not use recursion, which is usually how merge sort is implemented. Instead, it did a bottom-up merge, wherein the array was divided into slices and each slice was sorted, with the slice width doubling on every iteration. This was done using two functions in the serial code: merge(...) and merge_sort(...). merge_sort(...) was responsible for computing the slice boundaries for each width, while merge(...) sorted each slice, creating a local STL map in the serial version. In the CUDA code, it seemed advisable to make merge_sort the kernel. This can be justified because each thread now gets a slice of the array to sort, which means more work is done concurrently than before. It also enabled removal of the two nested for-loops from the serial code.
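The bottom-up strategy described above can be sketched in plain C++ (the names merge_runs and bottom_up_sort are illustrative, not from the course code). It shows the width-doubling passes and the buffer ping-pong that the d_actual/d_swap pair performs on the device:

```cpp
#include <algorithm>
#include <vector>

// Merge two adjacent sorted runs [b, c) and [c, e) of src into dst.
void merge_runs(const std::vector<long>& src, std::vector<long>& dst,
                size_t b, size_t c, size_t e) {
    size_t i1 = b, i2 = c;
    for (size_t k = b; k < e; k++) {
        if (i1 < c && (i2 >= e || src[i1] < src[i2]))
            dst[k] = src[i1++];
        else
            dst[k] = src[i2++];
    }
}

// Bottom-up merge sort: slice widths double each pass (2, 4, 8, ...),
// ping-ponging between two buffers like the d_actual/d_swap pair.
std::vector<long> bottom_up_sort(std::vector<long> a) {
    std::vector<long> b(a.size());
    for (size_t width = 2; width < a.size() * 2; width *= 2) {
        for (size_t beg = 0; beg < a.size(); beg += width) {
            size_t middle = std::min(beg + width / 2, a.size());
            size_t end = std::min(beg + width, a.size());
            merge_runs(a, b, beg, middle, end);
        }
        std::swap(a, b); // the merged pass becomes the next source
    }
    return a;
}
```

In the CUDA version, the inner loop over beg is what gets distributed across threads, one slice range per thread.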
Here is how the new merge_sort kernel looks:
<pre>
__global__ void merge_sort(long *sourceA, long *destA, long size, long width,
                           long slices, dim3 *threads, dim3 *blocks) {
    int index = threadIdx.x + blockIdx.x * blockDim.x; // may need to change this
    int beg = width * index * slices, middle, end;
    for (long slice = 0; slice < slices; slice++) {
        // the two nested for-loops of the serial code are gone
        if (beg < size) {
            middle = min(beg + (width >> 1), size);
            end = min(beg + width, size);
            merge(sourceA, destA, beg, middle, end);
            beg += width;
        }
    }
}
</pre>
As before, the merge_sort() function calls the merge function. In the serial code, an STL map was used to store the sliced array's values, which were later assigned back to the original array based on index or value comparisons. Now, merge is strictly a device function, called from the kernel for each slice to be sorted.
<pre>
__device__ void merge(long *sourceA, long *destA, long b, long c, long e) {
    long i1 = b;
    long i2 = c;
    for (long k = b; k < e; k++) {
        if (i1 < c && (i2 >= e || sourceA[i1] < sourceA[i2])) {
            destA[k] = sourceA[i1];
            i1++;
        } else {
            destA[k] = sourceA[i2];
            i2++;
        }
    }
}
</pre>
====Profiling results====
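Under the launch configuration above, thread index starts at offset width * index * slices and merges up to slices consecutive slices. A quick CPU simulation (illustrative only, not part of the course code; coverage is a hypothetical helper) can check that this indexing covers the whole array exactly once per pass, with no gaps or overlaps between threads:

```cpp
#include <algorithm>
#include <vector>

// Simulate one kernel pass: each of totalThreads "threads" marks the
// elements it would merge, so we can verify full, non-overlapping coverage.
std::vector<int> coverage(long size, long width, long totalThreads) {
    long slices = size / (totalThreads * width) + 1; // as in the host loop
    std::vector<int> touched(size, 0);
    for (long index = 0; index < totalThreads; index++) {
        long beg = width * index * slices; // same start offset as the kernel
        for (long slice = 0; slice < slices; slice++) {
            if (beg < size) {
                long end = std::min(beg + width, size);
                for (long k = beg; k < end; k++) touched[k]++;
                beg += width;
            }
        }
    }
    return touched; // every entry should be exactly 1
}
```

Because slices is rounded up with the + 1, the last threads may fall entirely past the end of the array; the beg < size guard in the kernel makes them do nothing, which this simulation reproduces.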
=== Assignment 3 ===
