GPU610/Turing

= Team Turing =
== Team Members ==
# [mailto:cjcampbell2@myseneca.ca?subject=gpu610 Colin Campbell], Team Leader
# [mailto:jyshin3@myseneca.ca?subject=gpu610 James Shin]
# [mailto:cbailey8@myseneca.ca?subject=gpu610 Chaddwick Bailey]
[mailto:cjcampbell2@myseneca.ca;jyshin3@myseneca.ca;cbailey8@myseneca.ca?subject=dps901-gpu610 Email All]
== Progress ==
The longest-running function is the one that copies the old image into the new output image. Below is the code for the function.
 Image::Image(const Image& oldImage)
 {
     // ... (the body copies each pixel of the old image into the new image with a nested for loop)
 }
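Only the signature of the constructor survives above, so here is a minimal sketch of what such a pixel-copying copy constructor typically looks like. The member names (N and M for the image dimensions, Q for the maximum grey value, and pixelVal for the pixel buffer) are assumptions made for the illustration, not necessarily the names used in the actual class.

 // Sketch only: assumes the Image class has members int N, M, Q and int** pixelVal.
 Image::Image(const Image& oldImage)
 {
     N = oldImage.N;                   // number of rows
     M = oldImage.M;                   // number of columns
     Q = oldImage.Q;                   // maximum grey value
     pixelVal = new int*[N];           // allocate the new pixel buffer
     for (int i = 0; i < N; i++) {     // nested loop that dominates the runtime
         pixelVal[i] = new int[M];
         for (int j = 0; j < M; j++)
             pixelVal[i][j] = oldImage.pixelVal[i][j];   // copy one pixel
     }
 }

Each pixel copy is independent of the others, which is what makes the nested loop a candidate for parallelization.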
I believe there is potential for parallelizing the nested for loop.

==== Chadd's Research ====
Data decomposition involves dividing a large chunk of data into smaller sections. A process is then performed on each of the smaller pieces of data until the entire chunk has been processed or the program ends because it has reached its goal.
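As a minimal sketch of the idea (not code from our project), the CUDA example below decomposes one large array across many threads, each of which processes a single small piece of the data.

 #include <cstdio>
 #include <cuda_runtime.h>
 
 // Each thread processes one element of the array - one "smaller section" of the decomposition.
 __global__ void scaleKernel(float* data, int n, float factor) {
     int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx < n)                               // guard threads that fall past the end of the data
         data[idx] *= factor;
 }
 
 int main() {
     const int n = 1 << 20;                     // one large chunk of data
     float* h = new float[n];
     for (int i = 0; i < n; i++) h[i] = 1.0f;
 
     float* d;
     cudaMalloc(&d, n * sizeof(float));
     cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
 
     int threadsPerBlock = 256;                 // how many small pieces each block handles
     int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
     scaleKernel<<<blocks, threadsPerBlock>>>(d, n, 2.0f);
 
     cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
     printf("h[0] = %f\n", h[0]);               // expect 2.0
     cudaFree(d);
     delete[] h;
     return 0;
 }

The guard on idx is what lets the decomposition work even when the data size is not a multiple of the block size.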
==== Chadd's Example ====
[[Image:Data Decomp.png|600px ]]
The example program reads a text file and searches it for a word entered by the user. The program also counts the number of lines in the file and the number of matches found in the file.
==== Potential for Parallelism ====

 for(int i = 0; i < numberoflines; i++){
     std::getline(fin, line);                // read the next line of the file
     std::istringstream iss{line};
     while(std::getline(iss, temp, ' '))     // split the line into words
     {
         if(input1.compare(temp) == 0){      // compare each word with the search word
             matches++;
         }
     }
 }

[[Image:loop.png|600px]]

I think that by making the nested loop above parallel I should be able to produce a more efficient program. This section of the code uses the majority of the CPU's power. This is where the program actually goes through the file line by line and then word by word. Profiling data is available below.

[[Image:profile2.png|600px]]
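A minimal sketch of one way the comparison work in the loop above could be offloaded to the GPU is shown below. It is only an illustration, not our implementation: it assumes the file has already been read on the host and its words packed into fixed-width slots, and it parallelizes only the word comparisons, since reading the file itself stays serial.

 #include <cstdio>
 #include <cstring>
 #include <cuda_runtime.h>
 
 #define WORD_LEN 32   // each word padded into a fixed-width slot (an assumption for this sketch)
 
 // Each thread compares one word slot against the search word and bumps a shared match counter.
 __global__ void countMatches(const char* words, int numWords, const char* target, int* matches) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= numWords) return;
     const char* w = words + i * WORD_LEN;
     for (int c = 0; c < WORD_LEN; c++) {
         if (w[c] != target[c]) return;   // mismatch: this thread is done
         if (w[c] == '\0') break;         // both strings ended together: it is a match
     }
     atomicAdd(matches, 1);
 }
 
 int main() {
     const char* fileWords[] = { "the", "cat", "sat", "on", "the", "mat" };
     const int numWords = 6;
     char h_words[numWords * WORD_LEN] = {};
     for (int i = 0; i < numWords; i++) strncpy(h_words + i * WORD_LEN, fileWords[i], WORD_LEN - 1);
     char h_target[WORD_LEN] = "the";
 
     char *d_words, *d_target; int *d_matches, h_matches = 0;
     cudaMalloc(&d_words, sizeof(h_words));
     cudaMalloc(&d_target, WORD_LEN);
     cudaMalloc(&d_matches, sizeof(int));
     cudaMemcpy(d_words, h_words, sizeof(h_words), cudaMemcpyHostToDevice);
     cudaMemcpy(d_target, h_target, WORD_LEN, cudaMemcpyHostToDevice);
     cudaMemcpy(d_matches, &h_matches, sizeof(int), cudaMemcpyHostToDevice);
 
     countMatches<<<1, 128>>>(d_words, numWords, d_target, d_matches);
 
     cudaMemcpy(&h_matches, d_matches, sizeof(int), cudaMemcpyDeviceToHost);
     printf("matches = %d\n", h_matches);   // expect 2
     cudaFree(d_words); cudaFree(d_target); cudaFree(d_matches);
     return 0;
 }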
==== Conclusion ====

We have decided to use the diffusion equation code for assignment 2 because it uses simple matrices, making it a good candidate for parallelism.

=== Assignment 2 ===

==== Colin's Report ====

For assignment 2 we chose the Heat Equation problem. I profiled both the serial and CUDA versions of the code by taking the average of 25 steps in milliseconds. The tests were run on a laptop with a GeForce 650 GPU. Due to memory constraints the maximum size of the matrix that could be run was 15000x15000. I've created a chart comparing the runtimes.

[[Image:GPUA2Colin.png|400px]]

===== Conclusions =====

There were no major issues converting the code to CUDA as it's a simple matrix, which made it very straightforward. When I first converted the code, however, I noticed that it was running slower than the CPU version. This was caused by inefficient block sizing. I managed to fix it by modifying the number of threads per block until it made more efficient use of the CUDA cores. In the end, without any other optimizations, it runs at around twice the speed of the CPU code.

==== Chadd's Findings ====

Profiling the diffusion equation, I noticed that the majority of the time is spent in the Evolvetimestep function. Using my home computer with a GTX 560 Ti Nvidia graphics card, I ran a 9000x9000 matrix 10 times. I've put the runtime results in the chart below.

[[Image:Runtime.png|400px]]

I used 32 threads per block in my parallelization of the nested for loop found in the Evolvetimestep function. The results were very good.
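The Evolvetimestep code itself is not reproduced on this page, so the kernel below is only a sketch of a typical explicit diffusion update, assuming a square n x n grid stored in row-major order and a single combined coefficient coef standing in for D*dt/dx^2. The variable names, the grid size, and the 32-thread block shape are assumptions made for the illustration.

 #include <cstdio>
 #include <cuda_runtime.h>
 
 // Sketch of one explicit diffusion time step: each thread updates one interior grid point.
 __global__ void evolveTimestep(const float* curr, float* next, int n, float coef) {
     int j = blockIdx.x * blockDim.x + threadIdx.x;          // column
     int i = blockIdx.y * blockDim.y + threadIdx.y;          // row
     if (i < 1 || j < 1 || i > n - 2 || j > n - 2) return;   // leave boundary values untouched
     int idx = i * n + j;
     float lap = curr[idx - 1] + curr[idx + 1] + curr[idx - n] + curr[idx + n] - 4.0f * curr[idx];
     next[idx] = curr[idx] + coef * lap;                     // coef stands in for D * dt / (dx * dx)
 }
 
 int main() {
     const int n = 1024;                              // grid size chosen for the sketch
     size_t bytes = (size_t)n * n * sizeof(float);
     float *d_curr, *d_next;
     cudaMalloc(&d_curr, bytes);
     cudaMalloc(&d_next, bytes);
     cudaMemset(d_curr, 0, bytes);
     cudaMemset(d_next, 0, bytes);
 
     dim3 block(32, 1);                               // 32 threads per block, as described above
     dim3 grid((n + block.x - 1) / block.x, n);       // one block row per grid row
     for (int step = 0; step < 25; step++) {          // 25 steps, as in the profiling runs
         evolveTimestep<<<grid, block>>>(d_curr, d_next, n, 0.1f);
         float* tmp = d_curr; d_curr = d_next; d_next = tmp;   // swap buffers between steps
     }
     cudaDeviceSynchronize();
     printf("finished 25 steps\n");
     cudaFree(d_curr);
     cudaFree(d_next);
     return 0;
 }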
=== Assignment 3 ===
The first optimization I was able to make was using thread coalescence. This led to a moderate per-step speedup, as seen in the graph below.
 
[[Image:ColinCampbellGPU610A3G1.png|600px]]
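For context, "thread coalescence" here presumably refers to memory coalescing: mapping threads to data so that consecutive threads in a warp touch consecutive addresses. The two kernels below are a generic sketch of the difference, assuming a row-major n x n grid; they are not the assignment code.

 // Coalesced: threadIdx.x walks along a row, so consecutive threads in a warp
 // read and write consecutive addresses, which the hardware combines into few transactions.
 __global__ void copyCoalesced(const float* in, float* out, int n) {
     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;
     if (row < n && col < n)
         out[row * n + col] = in[row * n + col];
 }
 
 // Not coalesced: threadIdx.x walks down a column, so consecutive threads are
 // n floats apart in memory and each access needs its own memory transaction.
 __global__ void copyStrided(const float* in, float* out, int n) {
     int row = blockIdx.x * blockDim.x + threadIdx.x;
     int col = blockIdx.y * blockDim.y + threadIdx.y;
     if (row < n && col < n)
         out[row * n + col] = in[row * n + col];
 }

Both kernels do the same copy; only the mapping of threads to elements differs.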
 
I then attempted to modify the code to use shared memory. Unfortunately, the way the algorithm accesses rows and columns out of order made this not viable. I tried to convert the problem to use tiling to get around this, but was not able to make it work correctly. Because of this I was not able to implement any further optimizations, as most of them were based around using shared memory efficiently.
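For reference, the kind of shared-memory tiling that was attempted usually looks something like the sketch below: each block stages a tile of the grid plus a one-cell halo into shared memory, then computes the 5-point stencil from the staged copy instead of from global memory. This is a generic illustration under assumed names (curr, next, coef) and an arbitrary tile width, not the assignment code.

 #define TILE 16   // arbitrary tile width for this sketch
 
 // Each block stages a (TILE+2) x (TILE+2) tile (interior plus halo) in shared memory,
 // then computes the 5-point stencil from the staged copy.
 __global__ void evolveTimestepTiled(const float* curr, float* next, int n, float coef) {
     __shared__ float tile[TILE + 2][TILE + 2];
     int col = blockIdx.x * TILE + threadIdx.x;       // this thread's global column
     int row = blockIdx.y * TILE + threadIdx.y;       // this thread's global row
     int tx = threadIdx.x + 1;                        // position inside the tile, past the halo
     int ty = threadIdx.y + 1;
     if (row < n && col < n) {
         tile[ty][tx] = curr[row * n + col];          // load this thread's own point
         if (threadIdx.x == 0 && col > 0)            tile[ty][0]        = curr[row * n + col - 1];
         if (threadIdx.x == TILE - 1 && col < n - 1) tile[ty][TILE + 1] = curr[row * n + col + 1];
         if (threadIdx.y == 0 && row > 0)            tile[0][tx]        = curr[(row - 1) * n + col];
         if (threadIdx.y == TILE - 1 && row < n - 1) tile[TILE + 1][tx] = curr[(row + 1) * n + col];
     }
     __syncthreads();                                 // make the whole tile visible to the block
     if (row >= 1 && col >= 1 && row <= n - 2 && col <= n - 2) {
         float lap = tile[ty][tx - 1] + tile[ty][tx + 1] + tile[ty - 1][tx] + tile[ty + 1][tx]
                   - 4.0f * tile[ty][tx];
         next[row * n + col] = tile[ty][tx] + coef * lap;
     }
 }
 
 // Launched with dim3 block(TILE, TILE) and dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE),
 // e.g. evolveTimestepTiled<<<grid, block>>>(d_curr, d_next, n, coef);

In this basic form the tile only helps when neighbouring reads stay inside the block's own tile; access patterns that jump across rows and columns out of order, as described above, do not map onto it cleanly.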
