GPU610/Team AGC

8,573 bytes added, 04:01, 30 November 2014

= Team AGC =

== Team Members ==

<s>

# [mailto:acooc@myseneca.ca?subject=gpu610 Andy Cooc], Some responsibility

# [mailto:gcastrolondono@myseneca.ca?subject=gpu610 Gabriel Castro], Some other responsibility</s>

# [mailto:cmarkieta@myseneca.ca?subject=gpu610 Christopher Markieta], <s>Some other</s> All responsibility

[mailto:acooc@myseneca.ca,gcastrolondono@myseneca.ca,cmarkieta@myseneca.ca?subject=gpu610 Email All]

=== Assignment 3 ===

==== Christopher's Findings ====

Due to the difficulty of Assignment 2, I have chosen a new project to parallelize.

===== 1-D Wave Equation =====

[[Image:wave2.gif]]

The first step was to create a private repository on Bitbucket, both to avoid plagiarism issues for students taking this course next semester and to provide revision history and a backup in case my progress is lost or corrupted.

The next step is to convert the following C file into C++ code that will be compatible with CUDA:

[https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_wave.c mpi_wave.c]

And include the following dependency in the directory:

[https://computing.llnl.gov/tutorials/mpi/samples/C/draw_wave.c draw_wave.c]

====== System Requirements ======

This project will be built and tested on <s>Windows 7 64-bit</s> Fedora 20 ([http://www.r-tutor.com/gpu-computing/cuda-installation/cuda6.5-fc20 tutorial]; remember to [http://www.if-not-true-then-false.com/2011/fedora-16-nvidia-drivers-install-guide-disable-nouveau-driver/#troubleshooting blacklist nouveau in your grub config]) with an Intel Core i5-4670K Haswell CPU (overclocked to 4.9 GHz) and an Nvidia GTX 480 GPU (overclocked to 830/924/1660 MHz) manufactured by Zotac with 1.5 GB of VRAM.

mpi_wave will require the OpenMPI library to compile.

Here is the profile of the original CPU application, run with an increased maximum step count to make the comparison more meaningful. The application calculates the curve at the given step value.

<pre>

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.04      9.62     9.62        1     9.62     9.62  update(int, int)
  0.00      9.62     0.00        2     0.00     0.00  MPI::Is_initialized()
  0.00      9.62     0.00        2     0.00     0.00  MPI::Comm::~Comm()
  0.00      9.62     0.00        2     0.00     0.00  MPI::Comm_Null::~Comm_Null()
  0.00      9.62     0.00        1     0.00     0.00  _GLOBAL__sub_I_RtoL
  0.00      9.62     0.00        1     0.00     0.00  init_master()
  0.00      9.62     0.00        1     0.00     0.00  output_master()
  0.00      9.62     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      9.62     0.00        1     0.00     0.00  draw_wave(double*)
  0.00      9.62     0.00        1     0.00     0.00  init_line()

</pre>

As you can see, the majority of the CPU time is spent in the update function, which is where I will begin implementing my code.

The 1D Wave Equation sample is already parallelized on the CPU using the standard MPI library, which splits the curve into sections that are calculated in parallel on as many CPU threads as are available. However, much of this code is better left as serial code on the CPU, since it would run much slower on the GPU. The CUDA cores will take advantage of the highly parallelizable code in the update function. I am hoping that splitting the work across CPU cores will not cause complications when each of them attempts to use the device to run the kernel and access the GPU's memory, and that it will only improve performance further.

I have included calls to clock() to determine specifically where the most time is being spent in the update function:

<pre>

void update(int left, int right) {
    clock_t start0, start1, start2, start3, start4, end1, end2, end3, end4, end0;
    start0 = clock();
    double block1 = 0.0, block2 = 0.0, block3 = 0.0, block4 = 0.0;
    int i, j;
    double dtime, c, dx, tau, sqtau;
    MPI_Status status;

    dtime = 0.3;
    c = 1.0;
    dx = 1.0;
    tau = (c * dtime / dx);
    sqtau = tau * tau;

    /* Update values for each point along string */
    for (i = 1; i <= nsteps; i++) {
        start1 = clock();
        /* Exchange data with "left-hand" neighbor */
        if (first != 1) {
            MPI_Send(&values[1], 1, MPI_DOUBLE, left, RtoL, MPI_COMM_WORLD);
            MPI_Recv(&values[0], 1, MPI_DOUBLE, left, LtoR, MPI_COMM_WORLD,
                     &status);
        }
        end1 = clock();
        block1 += double(end1 - start1)/CLOCKS_PER_SEC;

        start2 = clock();
        /* Exchange data with "right-hand" neighbor */
        if (first + npoints -1 != TPOINTS) {
            MPI_Send(&values[npoints], 1, MPI_DOUBLE, right, LtoR, MPI_COMM_WORLD);
            MPI_Recv(&values[npoints+1], 1, MPI_DOUBLE, right, RtoL,
                     MPI_COMM_WORLD, &status);
        }
        end2 = clock();
        block2 += double(end2 - start2)/CLOCKS_PER_SEC;

        start3 = clock();
        /* Update points along line */
        for (j = 1; j <= npoints; j++) {
            /* Global endpoints */
            if ((first + j - 1 == 1) || (first + j - 1 == TPOINTS))
                newval[j] = 0.0;
            else
                /* Use wave equation to update points */
                newval[j] = (2.0 * values[j]) - oldval[j]
                    + (sqtau * (values[j-1] - (2.0 * values[j]) + values[j+1]));
        }
        end3 = clock();
        block3 += double(end3 - start3)/CLOCKS_PER_SEC;

        start4 = clock();
        for (j = 1; j <= npoints; j++) {
            oldval[j] = values[j];
            values[j] = newval[j];
        }
        end4 = clock();
        block4 += double(end4 - start4)/CLOCKS_PER_SEC;
    }
    end0 = clock();

    std::cout << "Block 1: " << block1 << std::endl;
    std::cout << "Block 2: " << block2 << std::endl;
    std::cout << "Block 3: " << block3 << std::endl;
    std::cout << "Block 4: " << block4 << std::endl;
}

</pre>

Since the main loop in the update function runs between 1 and 10,000,000 times, depending on the number of steps chosen by the user, I have accumulated the time spent in each of 4 different blocks across all iterations:

<pre>

Block 1: 4.18654

Block 2: 0.98329

Block 3: 13.2884

Block 4: 8.3342

Block 1: 1.02494

Block 2: 4.53157

Block 3: 12.8947

Block 4: 8.36864

</pre>

As you can see, most of the time is spent in the 3rd and 4th blocks, which is where I will begin optimization.

Since the total number of points is 800, divided among the separate CPU threads, we will never reach the maximum number of threads per block, 1024.
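To make the thread-count argument concrete, here is a minimal sketch that splits the 800 points among a given number of processes and checks each share against the 1024-thread limit. The helper names <code>pointsForRank</code> and <code>fitsInOneBlock</code> are hypothetical, and the even split with the remainder handed to the lowest ranks is an assumption for illustration, not necessarily what mpi_wave.c does.

```cpp
const int TPOINTS = 800;                 // total points along the string
const int MAX_THREADS_PER_BLOCK = 1024;  // limit quoted above

// Hypothetical helper: number of points owned by `rank` out of `nprocs`.
// Assumes an even split, with any remainder given to the lowest ranks.
int pointsForRank(int rank, int nprocs) {
    int nmin = TPOINTS / nprocs;
    int nleft = TPOINTS % nprocs;
    return rank < nleft ? nmin + 1 : nmin;
}

// Hypothetical helper: true if every rank's share fits in one block
// when one thread is assigned to each point.
bool fitsInOneBlock(int nprocs) {
    for (int rank = 0; rank < nprocs; rank++) {
        if (pointsForRank(rank, nprocs) > MAX_THREADS_PER_BLOCK)
            return false;
    }
    return true;
}
```

Even with a single process, the share is 800 points, so one thread per point always fits in one block.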

====== Sample Output ======

Steps: 1

[[Image:wave_output1.jpg]]

Steps: 500

[[Image:wave_output2.jpg]]

Steps: 1,000

[[Image:wave_output3.jpg]]

Steps: 10,000

[[Image:wave_output4.jpg]]

At this point, I am noticing the delay caused by constantly transferring data between system RAM and video RAM. Splitting the array into multiple sections also requires constantly exchanging the left and right boundary values of those sections. Thus, I will refactor the entire code to use only 1 CPU thread and remove MPI.

====== Optimization ======

After using shared memory and prefetching values to perform operations in the kernel, my GPU no longer crashes on extreme operations involving millions of steps. It also outperforms my CPU running the MPI version of this application in 4 threads running at 4.9 GHz each.

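Since my video card has 48 KB of shared memory and all of my arrays together use less than 20 KB, the working data fits entirely in shared memory. Shared memory is much faster than global memory, so I do not need to worry about coalescing accesses inside the kernel once the data has been loaded.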
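As a quick check of the shared memory budget, the arithmetic can be sketched as a compile-time assertion. This assumes three arrays of doubles (oldval, values, newval), each with 800 points plus two boundary cells; the actual layout in the kernel may differ, and <code>sharedBytes</code> is a hypothetical name.

```cpp
#include <cstddef>

const std::size_t TPOINTS = 800;             // points along the string
const std::size_t NUM_ARRAYS = 3;            // assumed: oldval, values, newval
const std::size_t SHARED_LIMIT = 48 * 1024;  // 48 KB quoted above

// Bytes needed if each array holds TPOINTS + 2 doubles (assumed layout).
constexpr std::size_t sharedBytes() {
    return NUM_ARRAYS * (TPOINTS + 2) * sizeof(double);
}

static_assert(sharedBytes() <= SHARED_LIMIT,
              "arrays must fit in shared memory");
```

With 8-byte doubles this comes to 19,248 bytes, which is under 20 KB and well below the 48 KB limit.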

Due to operational limits, the kernel is killed before completion by the operating system's watchdog timer on very long runs. Thus, I have capped the maximum step count at 1 million; otherwise, the kernel would need to be rethought or be run in Tesla Compute Cluster (TCC) mode on a secondary GPU that is not used for display, but I just don't have that kind of money right now.
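One way the kernel might be rethought is to split a long run into several shorter launches, so that no single launch runs long enough to trip the watchdog. The host-side sketch below only illustrates the chunking logic and is not part of the current implementation: <code>runInChunks</code> is a hypothetical name, and <code>launch</code> stands in for one kernel launch plus synchronization. It also assumes the wave state is written back to global memory between launches, and a suitable chunk size would have to be found by experiment.

```cpp
#include <algorithm>
#include <functional>

// Hypothetical driver: run `totalSteps` in chunks of at most `chunk` steps.
// `launch(n)` stands in for launching the kernel for n steps and waiting
// for it to finish. Returns the number of launches performed.
int runInChunks(long totalSteps, long chunk,
                const std::function<void(long)>& launch) {
    int launches = 0;
    for (long done = 0; done < totalSteps; ) {
        long n = std::min(chunk, totalSteps - done);
        launch(n);
        done += n;
        launches++;
    }
    return launches;
}
```

The trade-off is that each extra launch adds overhead and forces the shared memory contents to be reloaded, so fewer, larger chunks are preferable as long as each stays under the time limit.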

====== Testing ======

I have written the following script for testing purposes against the MPI implementation in dual-core and quad-core modes, and the CUDA implementation using 1 block of 800 threads:

<pre>

#!/usr/bin/env bash

# 1D Wave Equation Benchmark
# output_master() must be commented out
# Author: Christopher Markieta

set -e # Exit on error

MYDIR=$(dirname "$0")

if [ "$1" == "mpi" ]; then
    if [ -z "$2" ]; then
        echo "Usage: $0 mpi [2-8]"
        exit 1
    fi

    # Number of threads to launch
    run="mpirun -n $2 $MYDIR/wave.o"
elif [ "$1" == "cuda" ]; then
    run="$MYDIR/wave.o"
else
    echo "Usage: $0 [cuda|mpi] ..."
    exit 1
fi

# From 1 step up to 1 million steps
for steps in 1 10 100 1000 10000 100000 1000000
do
    time echo "$steps" | $run &> /dev/null
done

</pre>

The final results show that the optimization was a success:

[[Image:cuda_wave.jpg]]

Although this application might not profit from such a large number of steps, it could be useful for scientific computation. The kernel can be improved to support an arbitrarily large number of steps, but I am lacking the hardware, and for demonstration purposes this should be enough.

Christopher Markieta
