Changes

BarraCUDA Boiz

950 bytes added, 02:00, 26 March 2017

→‎Problem

}

=== Analysis ===

After analyzing this block of code, we decided to parallelize it. Here is the kernel that we programmed.

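__global__ void setCenter(float* d_center, float* d_sample, int n, int dim, int randi) {

 int i = blockIdx.x * blockDim.x + threadIdx.x; // assumed 2D global indices, matching the 2D launch grid

 int j = blockIdx.y * blockDim.y + threadIdx.y;

 if (i < n && j < n) // assumed guard: skip threads past the edge of the rounded-up grid

 d_center[j * n + i] = d_sample[j * randi + i];

}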
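The kernel writes to `d_center[j * n + i]`, which is a row-major flat index. The following is a minimal host-side sketch of how a 2D launch covers such an array, assuming `i` and `j` come from the usual block and thread index computation and that the array is `n` by `n` (neither is stated in the original). The function name `emulateLaunch` and its parameters are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Host-side emulation of a 2D launch with nb x nb blocks of ntpb x ntpb threads.
// Returns, for each element of an n x n row-major array, how many threads touch it.
// Assumption: i and j are derived the way a CUDA kernel usually derives them.
std::vector<int> emulateLaunch(int n, int ntpb, int nb) {
    std::vector<int> hits(static_cast<std::size_t>(n) * n, 0);
    for (int by = 0; by < nb; ++by)
        for (int bx = 0; bx < nb; ++bx)
            for (int ty = 0; ty < ntpb; ++ty)
                for (int tx = 0; tx < ntpb; ++tx) {
                    int i = bx * ntpb + tx;  // stands in for blockIdx.x * blockDim.x + threadIdx.x
                    int j = by * ntpb + ty;  // stands in for blockIdx.y * blockDim.y + threadIdx.y
                    if (i < n && j < n)      // threads past the array edge do nothing
                        ++hits[static_cast<std::size_t>(j) * n + i];
                }
    return hits;
}
```

With `n = 5`, `ntpb = 4` and `nb = 2`, 64 threads are emulated but only 25 pass the guard, and each element is touched exactly once.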

Launching the kernel

int nb = (n + ntpb - 1) / ntpb;

dim3 dGrid(nb, nb, 1);

dim3 dBlock(ntpb, ntpb, 1);

float* d_center = nullptr;

cudaMalloc((void**)&d_center, centers.rows * centers.cols * sizeof(float));

cudaMemcpy(d_center, (float*)centers.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);

check(cudaGetLastError());

float* d_sample = nullptr;

cudaMalloc((void**)&d_sample, samples.rows * samples.cols * sizeof(float));

cudaMemcpy(d_sample, (float*)samples.data, samples.rows * samples.cols * sizeof(float), cudaMemcpyHostToDevice);

int randi = genrand_int31() % n;

setCenter<<<dGrid, dBlock>>>(d_center, d_sample, n, dim, randi);

cudaDeviceSynchronize();
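The expression `(n + ntpb - 1) / ntpb` in the launch code is a ceiling division: it rounds the block count up so that the grid covers all `n` elements even when `n` is not a multiple of `ntpb`. A small sketch of that arithmetic follows; the helper name `blocksFor` is hypothetical and not part of the original code.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n.
// Integer ceiling division, the same formula used for nb in the launch code.
int blocksFor(int n, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, `blocksFor(100, 16)` returns 7, which launches 112 threads per dimension. The 12 surplus threads per dimension are the reason a kernel launched this way generally needs a bounds check before it writes.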

After programming this kernel, we noticed an improvement in performance.

Here is a graph comparing the run times of the serial program and the parallelized version.

[[File:Assignment2Graph.png]]
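The original does not say how the run times were measured. One common approach, sketched here as an assumption rather than as the method actually used, is to wrap each version in a wall-clock timer built on `std::chrono::steady_clock`. The helper name `elapsedMs` is hypothetical.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsedMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Because kernel launches are asynchronous, the callable that wraps the GPU version should end with `cudaDeviceSynchronize()` so the timer stops only after the kernel has finished. Whether the host-to-device copies are counted inside the timed region changes the comparison, so it helps to state that choice alongside the graph.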
