= Team AGC =
== Team Members ==
<s>
# [mailto:acooc@myseneca.ca?subject=gpu610 Andy Cooc], Some responsibility
# [mailto:gcastrolondono@myseneca.ca?subject=gpu610 Gabriel Castro], Some other responsibility
</s>
# [mailto:cmarkieta@myseneca.ca?subject=gpu610 Christopher Markieta], All responsibility
[mailto:acooc@myseneca.ca,gcastrolondono@myseneca.ca,cmarkieta@myseneca.ca?subject=gpu610 Email All]
The first step was to create a private repository on Bitbucket to avoid any plagiarism issues with this course for next semester's students, as well as to provide code revision and protection in case my progress is lost or corrupted.
The next step is to convert the following C file into C++ code that will be compatible with CUDA:

[https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_wave.c mpi_wave.c]

And include the following dependency in the directory:

[https://computing.llnl.gov/tutorials/mpi/samples/C/draw_wave.c draw_wave.c]

====== System Requirements ======
This project will be built and tested on <s>Windows 7 64-bit</s> Fedora 20 ([http://www.r-tutor.com/gpu-computing/cuda-installation/cuda6.5-fc20 tutorial], remember to [http://www.if-not-true-then-false.com/2011/fedora-16-nvidia-drivers-install-guide-disable-nouveau-driver/#troubleshooting blacklist nouveau in your grub config].) with an Intel Core i5-4670K Haswell CPU (overclocked to 4.9 GHz) and an Nvidia GTX 480 GPU (overclocked to 830/924/1660 MHz) manufactured by Zotac with 1.5 GB of VRAM.

mpi_wave will require the OpenMPI library to compile.
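
Before profiling, it can help to confirm what the device actually reports. Below is a minimal CUDA sketch of my own (not part of the project code; the file name and output format are assumptions) that prints the properties relied on later, such as shared memory per block and total VRAM:

<pre>
// check_device.cu -- hypothetical helper for verifying the GPU setup.
// Compile with: nvcc check_device.cu -o check_device
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }
    std::printf("Device:                %s\n", prop.name);
    std::printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    std::printf("Shared mem per block:  %zu KB\n", prop.sharedMemPerBlock / 1024);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Global memory:         %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}
</pre>
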

Here is the profiling of the original CPU application, with an increased maximum step count to make the test comparison more meaningful, calculating the curve at the given step value:
<pre>
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.04      9.62      9.62        1     9.62     9.62  update(int, int)
  0.00      9.62      0.00        2     0.00     0.00  MPI::Is_initialized()
  0.00      9.62      0.00        2     0.00     0.00  MPI::Comm::~Comm()
  0.00      9.62      0.00        2     0.00     0.00  MPI::Comm_Null::~Comm_Null()
  0.00      9.62      0.00        1     0.00     0.00  _GLOBAL__sub_I_RtoL
  0.00      9.62      0.00        1     0.00     0.00  init_master()
  0.00      9.62      0.00        1     0.00     0.00  output_master()
  0.00      9.62      0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      9.62      0.00        1     0.00     0.00  draw_wave(double*)
  0.00      9.62      0.00        1     0.00     0.00  init_line()
</pre>

As you can see, the majority of the CPU time is spent in the update function, which is where I will begin implementing my code.

The 1D Wave Equation is already optimized for multiple CPU threads using the standard MPI library, spreading the sections of the curve to be calculated in parallel across as many CPU threads as are available at a time. However, a lot of this code is better left as a serial application handled by the CPU, as GPU streams would perform much slower there. The CUDA cores will take advantage of the highly parallelizable code in the update function. I am hoping that the separation of CPU cores will not cause complications while they each attempt to use the device to run the kernel and access the GPU's memory, and that it will only optimize the program further.

I have included calls to clock() to determine specifically where the most time is being spent in the update function:

<pre>
void update(int left, int right) {
clock_t start0, start1, start2, start3, start4, end1, end2, end3, end4, end0;
start0 = clock();
double block1 = 0.0, block2 = 0.0, block3 = 0.0, block4 = 0.0;
int i, j;
double dtime, c, dx, tau, sqtau;
MPI_Status status;
dtime = 0.3;
c = 1.0;
dx = 1.0;
tau = (c * dtime / dx);
sqtau = tau * tau;
/* Update values for each point along string */
for (i = 1; i <= nsteps; i++) {
start1 = clock();
/* Exchange data with "left-hand" neighbor */
if (first != 1) {
MPI_Send(&values[1], 1, MPI_DOUBLE, left, RtoL, MPI_COMM_WORLD);
MPI_Recv(&values[0], 1, MPI_DOUBLE, left, LtoR, MPI_COMM_WORLD,
&status);
}
end1 = clock();
block1 += double(end1 - start1)/CLOCKS_PER_SEC;
start2 = clock();
/* Exchange data with "right-hand" neighbor */
if (first + npoints - 1 != TPOINTS) {
MPI_Send(&values[npoints], 1, MPI_DOUBLE, right, LtoR, MPI_COMM_WORLD);
MPI_Recv(&values[npoints+1], 1, MPI_DOUBLE, right, RtoL,
MPI_COMM_WORLD, &status);
}
end2 = clock();
block2 += double(end2 - start2)/CLOCKS_PER_SEC;
start3 = clock();
/* Update points along line */
for (j = 1; j <= npoints; j++) {
newval[j] = (2.0 * values[j]) - oldval[j]
+ (sqtau * (values[j-1] - (2.0 * values[j]) + values[j+1]));
}
end3 = clock();
block3 += double(end3 - start3)/CLOCKS_PER_SEC;
start4 = clock();
for (j = 1; j <= npoints; j++) {
oldval[j] = values[j];
values[j] = newval[j];
}
end4 = clock();
block4 += double(end4 - start4)/CLOCKS_PER_SEC;
}
end0 = clock();
std::cout << "Block 1: " << block1 << std::endl;
std::cout << "Block 2: " << block2 << std::endl;
std::cout << "Block 3: " << block3 << std::endl;
std::cout << "Block 4: " << block4 << std::endl;
}
</pre>
 
Since the function is called (1-10000000) times, depending on the number of steps chosen by the user, I have calculated the sum of the 4 different blocks:
 
 
<pre>
Block 1: 4.18654
Block 2: 0.98329
Block 3: 13.2884
Block 4: 8.3342
 
Block 1: 1.02494
Block 2: 4.53157
Block 3: 12.8947
Block 4: 8.36864
</pre>
 
As you can see, most of the time is spent in the 3rd and 4th blocks, which is where I will begin optimization.
 
Since npoints is 800 in total, divided among the separate CPU threads, we will never reach the maximum number of threads per block, which is 1024.
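
To make that mapping concrete, here is a rough sketch (the kernel name, array names and launch configuration are my own placeholders, not the final project code) of how blocks 3 and 4 of update() could run as one CUDA block with one thread per point:

<pre>
#define TPOINTS 800

// Sketch: one thread per point; blocks 3 and 4 of update() for a single time step.
__global__ void update_points(double *oldval, double *values, double *newval,
                              double sqtau)
{
    int j = threadIdx.x + 1;   // points are 1-based, matching the CPU code

    // Block 3: wave equation update (the endpoints of the string stay fixed)
    if (j == 1 || j == TPOINTS)
        newval[j] = 0.0;
    else
        newval[j] = (2.0 * values[j]) - oldval[j]
            + (sqtau * (values[j-1] - (2.0 * values[j]) + values[j+1]));

    __syncthreads();           // all of newval must be written before rotating

    // Block 4: rotate the time levels for the next step
    oldval[j] = values[j];
    values[j] = newval[j];
}

// Host side, once per time step:
//   update_points<<<1, TPOINTS>>>(d_oldval, d_values, d_newval, sqtau);
</pre>

Because the 800 threads fit in a single block, __syncthreads() is enough to order the two phases; no inter-block synchronization is needed.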
 
====== Sample Output ======
 
Steps: 1
 
[[Image:wave_output1.jpg]]
 
Steps: 500
 
[[Image:wave_output2.jpg]]
 
Steps: 1,000
[[Image:wave_output3.jpg]]

Steps: 10,000

[[Image:wave_output4.jpg]]

At this point, I am noticing the delay in constantly transferring data between the RAM and video RAM. Splitting the array into multiple sections requires constant checking of the left and right columns of those arrays. Thus, I will re-factor the entire code to use only 1 CPU thread and remove MPI.
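
To sketch what this refactor could look like on the host side (all names here, such as wave_kernel and run_wave, are placeholders of mine rather than the project's actual code): the arrays are uploaded once, the kernel does the stepping, and only the final amplitudes come back for draw_wave(). A possible implementation of such a kernel is sketched under Optimization below.

<pre>
#include <cuda_runtime.h>

#define TPOINTS 800

// Assumed kernel that advances the simulation by nsteps entirely on the device.
__global__ void wave_kernel(double *d_values, double *d_oldval, int nsteps,
                            double sqtau);

// Single-threaded host driver: one upload, one launch, one download.
void run_wave(double *values, double *oldval, int nsteps, double sqtau)
{
    const size_t bytes = (TPOINTS + 2) * sizeof(double);
    double *d_values = 0, *d_oldval = 0;

    cudaMalloc((void **)&d_values, bytes);
    cudaMalloc((void **)&d_oldval, bytes);
    cudaMemcpy(d_values, values, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_oldval, oldval, bytes, cudaMemcpyHostToDevice);

    wave_kernel<<<1, TPOINTS>>>(d_values, d_oldval, nsteps, sqtau);
    cudaDeviceSynchronize();

    // Only the final amplitudes are needed on the host for draw_wave().
    cudaMemcpy(values, d_values, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_values);
    cudaFree(d_oldval);
}
</pre>
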
====== Optimization ======
After using shared memory and prefetching values to perform operations in the kernel, my GPU no longer crashes on extreme operations involving millions of steps. It also outperforms my CPU running the MPI version of this application in 4 threads running at 4.9 GHz each.

Since my video card has 48 KB of shared memory and I am not using more than 20 KB with all of my arrays, I do not need to worry about coalescing my data, since shared memory is much faster.
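
As a rough illustration of that approach (the real code may differ; the names and exact layout here are mine), a single-block kernel can prefetch the three point arrays into shared memory, run every time step on the device with __syncthreads() between phases, and write the result back once:

<pre>
#define TPOINTS 800

// Sketch: whole time loop on the device, state held in ~19 KB of shared memory.
__global__ void wave_shared(double *g_values, double *g_oldval, int nsteps,
                            double sqtau)
{
    __shared__ double values[TPOINTS + 2];
    __shared__ double oldval[TPOINTS + 2];
    __shared__ double newval[TPOINTS + 2];

    int j = threadIdx.x + 1;          // one thread per point, 1-based

    // Prefetch from global memory once.
    values[j] = g_values[j];
    oldval[j] = g_oldval[j];
    __syncthreads();

    for (int i = 1; i <= nsteps; i++) {
        if (j == 1 || j == TPOINTS)
            newval[j] = 0.0;          // fixed endpoints of the string
        else
            newval[j] = (2.0 * values[j]) - oldval[j]
                + (sqtau * (values[j-1] - (2.0 * values[j]) + values[j+1]));
        __syncthreads();

        oldval[j] = values[j];
        values[j] = newval[j];
        __syncthreads();
    }

    // Write only the final amplitudes back to global memory.
    g_values[j] = values[j];
}
</pre>

Three arrays of 802 doubles come to roughly 19 KB, which is consistent with the shared-memory figures above and fits comfortably in the 48 KB available per block.
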

Due to operational limits, the kernel is being killed short of completion by the watchdog of the operating system. Thus I have updated the maximum step count to be 1 million; otherwise the kernel would need to be rethought or run in Tesla Compute Cluster (TCC) mode with a secondary GPU not being used for display, but I just don't have that kind of money right now.
====== Testing ======
I have written the following script for testing purposes against the MPI implementation in dual-core and quad-core modes, and the CUDA implementation using 1 block of 800 threads:
<pre>
#!/usr/bin/env bash
# 1D Wave Equation Benchmark
# output_master() must be commented out
# Author: Christopher Markieta

set -e # Exit on error

MYDIR=$(dirname $0)

if [ "$1" == "mpi" ]; then
    if [ -z $2 ]; then
        echo "Usage: $0 mpi [2-8]"
        exit 1
    fi

    # Number of threads to launch
    run="mpirun -n $2 $MYDIR/wave.o"
elif [ "$1" == "cuda" ]; then
    run="$MYDIR/wave.o"
else
    echo "Usage: $0 [cuda|mpi] ..."
    exit 1
fi

# 1 million
for steps in 1 10 100 1000 10000 100000 1000000
do
    time echo $steps | $run &> /dev/null
done
</pre>
 
The final results show that the optimization was a success:
 
[[Image:cuda_wave.jpg]]
 
Although this application might not profit from such a large number of steps, it could be useful for scientific computation. The kernel could be improved to support an arbitrarily large number of steps, but I am lacking the hardware, and for demonstration purposes this should be enough.
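
One way the kernel could be pushed toward arbitrarily large step counts without tripping the display watchdog would be to split the run into bounded chunks of steps and launch the kernel repeatedly, keeping the state in global memory between launches. This is only a sketch of the idea; wave_chunk and the chunk size below are assumptions of mine, not measured values.

<pre>
#include <cuda_runtime.h>

#define TPOINTS 800

// Assumed kernel that advances the simulation by `steps` time steps,
// reading and writing its state in global memory.
__global__ void wave_chunk(double *d_values, double *d_oldval, int steps,
                           double sqtau);

// Run nsteps in chunks small enough for each launch to finish
// well before the display watchdog fires.
void run_in_chunks(double *d_values, double *d_oldval, long long nsteps,
                   double sqtau)
{
    const long long CHUNK = 100000;   // placeholder; tune per GPU and watchdog limit

    for (long long done = 0; done < nsteps; done += CHUNK) {
        long long left = nsteps - done;
        int steps = (int)(left < CHUNK ? left : CHUNK);
        wave_chunk<<<1, TPOINTS>>>(d_values, d_oldval, steps, sqtau);
        cudaDeviceSynchronize();      // each chunk completes before the next starts
    }
}
</pre>

The cost is one extra kernel launch per chunk, but no additional host-device copies, since the arrays never leave the device between chunks.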
