# Changes

## Algo holics

, 01:37, 1 April 2019
Assignment 2
=== Assignment 2 ===
We After carefully weighing the projects from assignment 1, we first decided to choose parallelize the neural network . However, on further inspection of the source code to paralellize. The reason being , it gives us was found that the network had a good learning opportunity very low accuracy even after ten thousand iteration. To improve the algorithm first and more importantly that then parallelize did not sound very pragmatic. So we can potentially speed up decided with the next most interesting topic for the execution time of group, which was merge sort. Here is how the program by using CUDAcode was changed to its parallel version.
====CUDA enabled functions====
The main function was changed to perform the copying of data from host to device, launch the kernel, copy back results from the device to host and release all memory, on host and device.

cudaMalloc((void**)&d_A, (size) * sizeof(long));
cudaMalloc((void**)&d_B, (size) * sizeof(long));
cudaMemcpy(d_A, h_A, size * sizeof(long), cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_ntpb, sizeof(dim3));
cudaMalloc((void**)&d_nbpg, sizeof(dim3));
cudaMemcpy(d_ntpb, &ntpb, sizeof(dim3), cudaMemcpyHostToDevice);
cudaMemcpy(d_nbpg, &nbpg, sizeof(dim3), cudaMemcpyHostToDevice);
long *d_actual = d_A;
long *d_swap = d_B;
long totalThreads = ntpb.x * nbpg.x;
float start_time = clock();
for (int width = 2; width < (size << 1); width <<= 1) {
long slices = size / (totalThreads * width) + 1;
merge_sort << <nbpg, ntpb >> > (d_actual, d_swap, size, width, slices, d_ntpb, d_nbpg);
d_actual = d_actual == d_A ? d_B : d_A;
d_swap = d_swap == d_A ? d_B : d_A;
}
float end_time = clock() - start_time;
//copy back the device array to host array
cudaMemcpy(h_C, d_actual, size * sizeof(long), cudaMemcpyDeviceToHost);

The serial code did not use recusion, which is usually utilized for merge sort. Instead, it did a bottoms-up merge, wherein the array was sliced into pieces and each piece was sorted. The width of each slice changed with each iteration. This was done using two functions int the serial code: merge(...) and merge_sort(...). The former function was responsible for slicing the array into different widths. The latter received this slices and sorted them by creating a local STL map.
In the CUDA code for this, it seemed advisable to make merge_sort as the kernel. This can be justified because each thread now gets a slice of the array to sort, which means more work is done concurrently than before. Also, this enabled removal of the two nested for-loops from the serial code. Here is how the new merge_sort function looks like:

__global__ void merge_sort(long *sourceA, long *destA, long size, long width, long slices, dim3 *threads, dim3 *blocks) {
int index = threadIdx.x + blockIdx.x * blockDim.x; //may need to change this
int beg = width * index * slices, middle, end;
for (long slice = 0; slice < slices; slice++) { //basically removed the two for loops
if (beg < size) {
middle = min(beg + (width >> 1), size);
end = min(beg + width, size);
merge(sourceA, destA, beg, middle, end);
beg += width;
}
}
}

As previously, the merge_sort() function calls the merge function. In the serial code, an STL map is utilized to store the sliced array's values that are later used as values to be assigned to the original array based on the index comparison or value comparison.
Now, the merge function is strictly a device function that is called from the kernel for each slice to be sorted.

__device__ void merge(long *sourceA, long *destA, long b, long c, long e) {
long i1 = b;
long i2 = c;
for (long k = b; k < e; k++) {
if (i1 < c && (i2 >= e || sourceA[i1] < sourceA[i2])) {
destA[k] = sourceA[i1];
i1++;
}
else {
destA[k] = sourceA[i2];
i2++;
}
}
}
=== Parallelize ===
57
edits