Changes


GPU610/Team AGC

2,727 bytes added, 00:56, 27 November 2014
Block times
The 1D Wave Equation program is already parallelized across CPU cores using the standard MPI library, which splits the curve into sections and calculates them in parallel on as many processes as are available. However, much of this code is better left as a serial application handled by the CPU, since it would run much slower on the GPU. The CUDA cores will take advantage of the highly parallelizable code in the update function. I am hoping that having separate MPI processes will not cause complications when each of them attempts to use the device to run the kernel and access the GPU's memory, and that the GPU will only speed the program up further.
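The way the curve is split between processes is not shown on this page, so the following is only a minimal sketch of one common scheme, a contiguous block decomposition. The function `sectionFor` and the struct `Section` are hypothetical; only the names `first` and `npoints` are taken from the update function, where `first` is the global index of a process's first point and `npoints` is the number of points it owns.

```cpp
// Hypothetical description of the section owned by one process.
struct Section {
    int first;    // global index (1-based) of the first point in the section
    int npoints;  // number of points in the section
};

// Assumed scheme: split tpoints into nprocs contiguous sections.
// When tpoints is not divisible by nprocs, the lowest-numbered
// ranks each take one extra point.
Section sectionFor(int rank, int nprocs, int tpoints) {
    const int base = tpoints / nprocs;   // points every rank receives
    const int extra = tpoints % nprocs;  // leftover points to hand out
    const int npoints = base + (rank < extra ? 1 : 0);
    const int first = rank * base + (rank < extra ? rank : extra) + 1;
    return {first, npoints};
}
```

With 10 points on 4 processes this gives sections starting at points 1, 4, 7 and 9. Under this scheme only the first section satisfies `first == 1` and only the last satisfies `first + npoints - 1 == tpoints`, which matches the conditions update uses to decide whether a neighbor exchange is needed.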
 
I have included calls to clock() to determine specifically where the most time is being spent in the update function:
 
<pre>
void update(int left, int right) {
    /* startN/endN bracket block N; blockN accumulates its seconds over all steps.
       Note: clock() reports processor time used by this process, not wall-clock time. */
    clock_t start1, start2, start3, start4, end1, end2, end3, end4;
    double block1 = 0.0, block2 = 0.0, block3 = 0.0, block4 = 0.0;
    int i, j;
    double dtime, c, dx, tau, sqtau;
    MPI_Status status;

    dtime = 0.3;
    c = 1.0;
    dx = 1.0;
    tau = (c * dtime / dx);
    sqtau = tau * tau;

    /* Update values for each point along string */
    for (i = 1; i <= nsteps; i++) {
        /* Block 1: exchange data with "left-hand" neighbor */
        start1 = clock();
        if (first != 1) {
            MPI_Send(&values[1], 1, MPI_DOUBLE, left, RtoL, MPI_COMM_WORLD);
            MPI_Recv(&values[0], 1, MPI_DOUBLE, left, LtoR, MPI_COMM_WORLD,
                     &status);
        }
        end1 = clock();
        block1 += double(end1 - start1) / CLOCKS_PER_SEC;

        /* Block 2: exchange data with "right-hand" neighbor */
        start2 = clock();
        if (first + npoints - 1 != TPOINTS) {
            MPI_Send(&values[npoints], 1, MPI_DOUBLE, right, LtoR, MPI_COMM_WORLD);
            MPI_Recv(&values[npoints+1], 1, MPI_DOUBLE, right, RtoL,
                     MPI_COMM_WORLD, &status);
        }
        end2 = clock();
        block2 += double(end2 - start2) / CLOCKS_PER_SEC;

        /* Block 3: update points along line */
        start3 = clock();
        for (j = 1; j <= npoints; j++) {
            /* Global endpoints */
            if ((first + j - 1 == 1) || (first + j - 1 == TPOINTS))
                newval[j] = 0.0;
            else
                /* Use wave equation to update points */
                newval[j] = (2.0 * values[j]) - oldval[j]
                    + (sqtau * (values[j-1] - (2.0 * values[j]) + values[j+1]));
        }
        end3 = clock();
        block3 += double(end3 - start3) / CLOCKS_PER_SEC;

        /* Block 4: shift the time levels (values -> oldval, newval -> values) */
        start4 = clock();
        for (j = 1; j <= npoints; j++) {
            oldval[j] = values[j];
            values[j] = newval[j];
        }
        end4 = clock();
        block4 += double(end4 - start4) / CLOCKS_PER_SEC;
    }

    /* Report the accumulated time of each block (printed once per process) */
    std::cout << "Block 1: " << block1 << std::endl;
    std::cout << "Block 2: " << block2 << std::endl;
    std::cout << "Block 3: " << block3 << std::endl;
    std::cout << "Block 4: " << block4 << std::endl;
}
</pre>
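The start/end bookkeeping in the listing above could be reduced with a small scope-based timer. The class below is a hypothetical helper, not part of the original program, and it deliberately swaps in a different clock: `std::chrono::steady_clock` measures wall-clock time, whereas `clock()` measures processor time, so the two can report different figures for the same block.

```cpp
#include <chrono>

// Hypothetical helper: on destruction, adds the wall-clock time that
// elapsed since construction to a running total (in seconds).
class BlockTimer {
public:
    explicit BlockTimer(double& total)
        : total_(total), start_(std::chrono::steady_clock::now()) {}

    ~BlockTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        total_ += elapsed.count();
    }

    // Copying a timer would add the same interval to the total twice.
    BlockTimer(const BlockTimer&) = delete;
    BlockTimer& operator=(const BlockTimer&) = delete;

private:
    double& total_;
    std::chrono::steady_clock::time_point start_;
};
```

To time one block, wrap it in braces and declare a timer at the top, for example `{ BlockTimer t(block3); /* update loop */ }`. The total is updated when the closing brace is reached, so a block cannot be left without its matching end measurement.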
 
Since the main loop in the function runs between 1 and 10,000,000 times, depending on the number of steps chosen by the user, I have accumulated the total time spent in each of the 4 blocks:
 
 
<pre>
Block 1: 4.18654
Block 2: 0.98329
Block 3: 13.2884
Block 4: 8.3342
 
Block 1: 1.02494
Block 2: 4.53157
Block 3: 12.8947
Block 4: 8.36864
</pre>
 
As you can see, most of the time is spent in the 3rd and 4th blocks, which is where I will begin optimization.
