GPU610/Turing

= Team Turing =
== Team Members ==
# [mailto:cjcampbell2@myseneca.ca?subject=gpu610 Colin Campbell], Team Leader
# [mailto:jyshin3@myseneca.ca?subject=gpu610 James Shin]
# [mailto:cbailey8@myseneca.ca?subject=gpu610 Chaddwick Bailey]
[mailto:cjcampbell2@myseneca.ca;jyshin3@myseneca.ca;cbailey8@myseneca.ca?subject=dps901-gpu610 Email All]
== Progress ==
The longest-running function is the one that copies the old image into the new output image. Below is the code for the function.
 Image::Image(const Image& oldImage)
 {
     // ... (the body copies each pixel of the old image into the new image with a nested for loop)
 }
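Only the signature of the constructor survives above, so here is a minimal sketch of what such a pixel-copying copy constructor typically looks like. The member names (N and M for the image dimensions, Q for the maximum grey value, and pixelVal for the pixel buffer) are assumptions made for the illustration, not necessarily the names used in the actual class.

 // Sketch only: assumes the Image class has members int N, M, Q and int** pixelVal.
 Image::Image(const Image& oldImage)
 {
     N = oldImage.N;                   // number of rows
     M = oldImage.M;                   // number of columns
     Q = oldImage.Q;                   // maximum grey value
     pixelVal = new int*[N];           // allocate the new pixel buffer
     for (int i = 0; i < N; i++) {     // nested loop that dominates the runtime
         pixelVal[i] = new int[M];
         for (int j = 0; j < M; j++)
             pixelVal[i][j] = oldImage.pixelVal[i][j];   // copy one pixel
     }
 }

Each pixel copy is independent of the others, which is what makes the nested loop a candidate for parallelization.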
I believe there is potential for parallelizing the nested for loop.

==== Chadd's Research ====
Data decomposition involves dividing a large chunk of data into smaller sections. A process is then performed on each of the smaller pieces of data until the entire chunk has been processed or the program ends because it has reached its goal.
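As a minimal sketch of the idea (not code from our project), the CUDA example below decomposes one large array across many threads, each of which processes a single small piece of the data.

 #include <cstdio>
 #include <cuda_runtime.h>
 
 // Each thread processes one element of the array - one "smaller section" of the decomposition.
 __global__ void scaleKernel(float* data, int n, float factor) {
     int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx < n)                               // guard threads that fall past the end of the data
         data[idx] *= factor;
 }
 
 int main() {
     const int n = 1 << 20;                     // one large chunk of data
     float* h = new float[n];
     for (int i = 0; i < n; i++) h[i] = 1.0f;
 
     float* d;
     cudaMalloc(&d, n * sizeof(float));
     cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
 
     int threadsPerBlock = 256;                 // how many small pieces each block handles
     int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
     scaleKernel<<<blocks, threadsPerBlock>>>(d, n, 2.0f);
 
     cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
     printf("h[0] = %f\n", h[0]);               // expect 2.0
     cudaFree(d);
     delete[] h;
     return 0;
 }

The guard on idx is what lets the decomposition work even when the data size is not a multiple of the block size.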
==== Chadd's Example ====
[[Image:Data Decomp.png|600px ]]
The example program reads a text file and searches it for a word entered by the user. The program also counts the number of lines in the file and the number of matches found in the file.
==== Potential for Parallelism ====

 for(int i = 0; i < numberoflines; i++){
     std::getline(fin, line);                // read the next line of the file
     std::istringstream iss{line};
     while(std::getline(iss, temp, ' '))     // split the line into words
     {
         if(input1.compare(temp) == 0){      // compare each word with the search word
             matches++;
         }
     }
 }

[[Image:loop.png|600px]]

I think that by making the nested loop above parallel I should be able to produce a more efficient program. This section of the code uses the majority of the CPU's power. This is where the program actually goes through the file line by line and then word by word. Profiling data is available below.

[[Image:profile2.png|600px]]
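A minimal sketch of one way the comparison work in the loop above could be offloaded to the GPU is shown below. It is only an illustration, not our implementation: it assumes the file has already been read on the host and its words packed into fixed-width slots, and it parallelizes only the word comparisons, since reading the file itself stays serial.

 #include <cstdio>
 #include <cstring>
 #include <cuda_runtime.h>
 
 #define WORD_LEN 32   // each word padded into a fixed-width slot (an assumption for this sketch)
 
 // Each thread compares one word slot against the search word and bumps a shared match counter.
 __global__ void countMatches(const char* words, int numWords, const char* target, int* matches) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= numWords) return;
     const char* w = words + i * WORD_LEN;
     for (int c = 0; c < WORD_LEN; c++) {
         if (w[c] != target[c]) return;   // mismatch: this thread is done
         if (w[c] == '\0') break;         // both strings ended together: it is a match
     }
     atomicAdd(matches, 1);
 }
 
 int main() {
     const char* fileWords[] = { "the", "cat", "sat", "on", "the", "mat" };
     const int numWords = 6;
     char h_words[numWords * WORD_LEN] = {};
     for (int i = 0; i < numWords; i++) strncpy(h_words + i * WORD_LEN, fileWords[i], WORD_LEN - 1);
     char h_target[WORD_LEN] = "the";
 
     char *d_words, *d_target; int *d_matches, h_matches = 0;
     cudaMalloc(&d_words, sizeof(h_words));
     cudaMalloc(&d_target, WORD_LEN);
     cudaMalloc(&d_matches, sizeof(int));
     cudaMemcpy(d_words, h_words, sizeof(h_words), cudaMemcpyHostToDevice);
     cudaMemcpy(d_target, h_target, WORD_LEN, cudaMemcpyHostToDevice);
     cudaMemcpy(d_matches, &h_matches, sizeof(int), cudaMemcpyHostToDevice);
 
     countMatches<<<1, 128>>>(d_words, numWords, d_target, d_matches);
 
     cudaMemcpy(&h_matches, d_matches, sizeof(int), cudaMemcpyDeviceToHost);
     printf("matches = %d\n", h_matches);   // expect 2
     cudaFree(d_words); cudaFree(d_target); cudaFree(d_matches);
     return 0;
 }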
==== Conclusion ====

We have decided to use the diffusion equation code for assignment 2 because it uses simple matrices, making it a good candidate for parallelism.

=== Assignment 2 ===

==== Colin's Report ====

For assignment 2 we chose the Heat Equation problem. I profiled both the serial and CUDA versions of the code by taking the average of 25 steps in milliseconds. The tests were run on a laptop with a GeForce 650 GPU. Due to memory constraints the maximum size of the matrix that could be run was 15000x15000. I've created a chart comparing the runtimes.

[[Image:GPUA2Colin.png|400px]]

===== Conclusions =====

There were no major issues converting the code to CUDA as it's a simple matrix, which made it very straightforward. When I first converted the code, however, I noticed that it was running slower than the CPU version. This was caused by inefficient block sizing. I managed to fix it by modifying the number of threads per block until it made more efficient use of the CUDA cores. In the end, without any other optimizations, it runs at around twice the speed of the CPU code.

==== Chadd's Findings ====

Profiling the diffusion equation, I noticed that the majority of the time is spent in the Evolvetimestep function. Using my home computer with a GTX 560 Ti Nvidia graphics card, I ran a 9000x9000 matrix 10 times. I've put the runtime results in the chart below.

[[Image:Runtime.png|400px]]

I used 32 threads per block in my parallelization of the nested for loop found in the Evolvetimestep function. The results were very good.
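The Evolvetimestep code itself is not reproduced on this page, so the kernel below is only a sketch of a typical explicit diffusion update, assuming a square n x n grid stored in row-major order and a single combined coefficient coef standing in for D*dt/dx^2. The variable names, the grid size, and the 32-thread block shape are assumptions made for the illustration.

 #include <cstdio>
 #include <cuda_runtime.h>
 
 // Sketch of one explicit diffusion time step: each thread updates one interior grid point.
 __global__ void evolveTimestep(const float* curr, float* next, int n, float coef) {
     int j = blockIdx.x * blockDim.x + threadIdx.x;          // column
     int i = blockIdx.y * blockDim.y + threadIdx.y;          // row
     if (i < 1 || j < 1 || i > n - 2 || j > n - 2) return;   // leave boundary values untouched
     int idx = i * n + j;
     float lap = curr[idx - 1] + curr[idx + 1] + curr[idx - n] + curr[idx + n] - 4.0f * curr[idx];
     next[idx] = curr[idx] + coef * lap;                     // coef stands in for D * dt / (dx * dx)
 }
 
 int main() {
     const int n = 1024;                              // grid size chosen for the sketch
     size_t bytes = (size_t)n * n * sizeof(float);
     float *d_curr, *d_next;
     cudaMalloc(&d_curr, bytes);
     cudaMalloc(&d_next, bytes);
     cudaMemset(d_curr, 0, bytes);
     cudaMemset(d_next, 0, bytes);
 
     dim3 block(32, 1);                               // 32 threads per block, as described above
     dim3 grid((n + block.x - 1) / block.x, n);       // one block row per grid row
     for (int step = 0; step < 25; step++) {          // 25 steps, as in the profiling runs
         evolveTimestep<<<grid, block>>>(d_curr, d_next, n, 0.1f);
         float* tmp = d_curr; d_curr = d_next; d_next = tmp;   // swap buffers between steps
     }
     cudaDeviceSynchronize();
     printf("finished 25 steps\n");
     cudaFree(d_curr);
     cudaFree(d_next);
     return 0;
 }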
=== Assignment 3 ===
The first optimization I was able to make was using thread coalescence. This led to a moderate per-step speedup, as seen in the graph below.
 
[[Image:ColinCampbellGPU610A3G1.png|600px]]
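For context, "thread coalescence" here presumably refers to memory coalescing: mapping threads to data so that consecutive threads in a warp touch consecutive addresses. The two kernels below are a generic sketch of the difference, assuming a row-major n x n grid; they are not the assignment code.

 // Coalesced: threadIdx.x walks along a row, so consecutive threads in a warp
 // read and write consecutive addresses, which the hardware combines into few transactions.
 __global__ void copyCoalesced(const float* in, float* out, int n) {
     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;
     if (row < n && col < n)
         out[row * n + col] = in[row * n + col];
 }
 
 // Not coalesced: threadIdx.x walks down a column, so consecutive threads are
 // n floats apart in memory and each access needs its own memory transaction.
 __global__ void copyStrided(const float* in, float* out, int n) {
     int row = blockIdx.x * blockDim.x + threadIdx.x;
     int col = blockIdx.y * blockDim.y + threadIdx.y;
     if (row < n && col < n)
         out[row * n + col] = in[row * n + col];
 }

Both kernels do the same copy; only the mapping of threads to elements differs.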
 
I then attempted to modify the code to use shared memory. Unfortunately, the way the algorithm accesses rows and columns out of order made this not viable. I tried to convert the problem to use tiling to get around this, but was not able to make it work correctly. Because of this I was not able to implement any further optimizations, as most of them were based around using shared memory efficiently.
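For reference, the kind of shared-memory tiling that was attempted usually looks something like the sketch below: each block stages a tile of the grid plus a one-cell halo into shared memory, then computes the 5-point stencil from the staged copy instead of from global memory. This is a generic illustration under assumed names (curr, next, coef) and an arbitrary tile width, not the assignment code.

 #define TILE 16   // arbitrary tile width for this sketch
 
 // Each block stages a (TILE+2) x (TILE+2) tile (interior plus halo) in shared memory,
 // then computes the 5-point stencil from the staged copy.
 __global__ void evolveTimestepTiled(const float* curr, float* next, int n, float coef) {
     __shared__ float tile[TILE + 2][TILE + 2];
     int col = blockIdx.x * TILE + threadIdx.x;       // this thread's global column
     int row = blockIdx.y * TILE + threadIdx.y;       // this thread's global row
     int tx = threadIdx.x + 1;                        // position inside the tile, past the halo
     int ty = threadIdx.y + 1;
     if (row < n && col < n) {
         tile[ty][tx] = curr[row * n + col];          // load this thread's own point
         if (threadIdx.x == 0 && col > 0)            tile[ty][0]        = curr[row * n + col - 1];
         if (threadIdx.x == TILE - 1 && col < n - 1) tile[ty][TILE + 1] = curr[row * n + col + 1];
         if (threadIdx.y == 0 && row > 0)            tile[0][tx]        = curr[(row - 1) * n + col];
         if (threadIdx.y == TILE - 1 && row < n - 1) tile[TILE + 1][tx] = curr[(row + 1) * n + col];
     }
     __syncthreads();                                 // make the whole tile visible to the block
     if (row >= 1 && col >= 1 && row <= n - 2 && col <= n - 2) {
         float lap = tile[ty][tx - 1] + tile[ty][tx + 1] + tile[ty - 1][tx] + tile[ty + 1][tx]
                   - 4.0f * tile[ty][tx];
         next[row * n + col] = tile[ty][tx] + coef * lap;
     }
 }
 
 // Launched with dim3 block(TILE, TILE) and dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE),
 // e.g. evolveTimestepTiled<<<grid, block>>>(d_curr, d_next, n, coef);

In this basic form the tile only helps when neighbouring reads stay inside the block's own tile; access patterns that jump across rows and columns out of order, as described above, do not map onto it cleanly.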
