= Team KCM =
== Team Members ==
# [mailto:bkim35@myseneca.ca?subject=gpu610 Byungho Kim], Some responsibility
# [mailto:tchung22@myseneca.ca?subject=gpu610 Taeyang Chung], Some responsibility
# [mailto:symoon@myseneca.ca?subject=gpu610 SeungYeon Moon], Some responsibility
A 3×3 convolution kernel K is applied to the input pixels IP to produce the output pixels OP:

{| class="wikitable"
| K<sub>0</sub> || K<sub>1</sub> || K<sub>2</sub>
|-
| K<sub>3</sub> || K<sub>4</sub> || K<sub>5</sub>
|-
| K<sub>6</sub> || K<sub>7</sub> || K<sub>8</sub>
|}

The output pixel OP<sub>33</sub> is given by

OP<sub>33</sub> = IP<sub>22</sub> × K<sub>0</sub> + IP<sub>23</sub> × K<sub>1</sub> + IP<sub>24</sub> × K<sub>2</sub> + IP<sub>32</sub> × K<sub>3</sub> + IP<sub>33</sub> × K<sub>4</sub> + IP<sub>34</sub> × K<sub>5</sub> + IP<sub>42</sub> × K<sub>6</sub> + IP<sub>43</sub> × K<sub>7</sub> + IP<sub>44</sub> × K<sub>8</sub>
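As a minimal sketch, the same computation in C++ might look like the following (the names <code>ip</code>, <code>kern</code>, <code>width</code>, and <code>convolvePixel</code> are illustrative assumptions, not taken from any of our code):

<syntaxhighlight lang="cpp">
// Compute one output pixel of a 3x3 convolution.
// ip points to the input image, kern to the 3x3 kernel (K0..K8),
// width is the row stride, (row, col) is the centre pixel.
float convolvePixel(const float* ip, const float* kern,
                    int width, int row, int col) {
    float sum = 0.0f;
    for (int dr = -1; dr <= 1; ++dr)       // kernel row offset
        for (int dc = -1; dc <= 1; ++dc)   // kernel column offset
            sum += ip[(row + dr) * width + (col + dc)]
                 * kern[(dr + 1) * 3 + (dc + 1)];
    return sum;   // caller stores it: op[row * width + col] = sum
}
</syntaxhighlight>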
=====Analysis=====
====Taeyang's Findings====
My program solves a one-dimensional heat-transfer problem using the explicit method. I found the source code at http://www.pengfeidu.net/Media/pdf/1dUnsteadyHeatTransferCode.pdf. As the time-step value decreases, more iterations are needed, so the processing time of the 'solve()' function increases, as shown below.

[[File:A1_3.png]]

[[File:A1_4.png]]

The 'solve()' function:

<syntaxhighlight lang="cpp">
void solve(){
    int j = 0;
    while (j < tTotal/dt) {
        solutionNew[0] = 10;              // fixed boundary temperatures
        solutionNew[xGridNum] = 120;
        for (int k = 1; k < xGridNum; k++) {
            solutionNew[k] = solutionLast[k] + dt*(K/dx/dx*
                (solutionLast[k-1] + solutionLast[k+1] - 2.*solutionLast[k]
                + source(x[k], j*dt)));
        }
        j++;
        solutionLast = solutionNew;       // this step's result feeds the next step
    }
}
</syntaxhighlight>

By implementing a parallel solution, I believe I can shorten the processing time.
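For reference, the update computed by the inner loop is the standard explicit (forward-time, centred-space) finite-difference step, with T the temperature array (solutionLast/solutionNew) and S the source term:

<math>T_k^{j+1} = T_k^j + \Delta t \, \frac{K}{\Delta x^2} \left( T_{k-1}^j + T_{k+1}^j - 2T_k^j + S(x_k,\, j\,\Delta t) \right)</math>

with the fixed boundary values <math>T_0 = 10</math> and <math>T_N = 120</math>.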
====SeungYeon's Findings====
There are many image-processing operations that can be performed; examples include rotating, resizing, and blurring.
=== Assignment 2 ===
For our Assignment 2, we decided to work on the heat-transfer problem.
 
To compare the speed of the original and parallelized code, we first profiled the original code on the matrix server and the parallelized code on a school computer (using gprof and nvvp).
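For reference, a typical workflow for the two profilers looks like this (the file names are illustrative assumptions):

<syntaxhighlight lang="bash">
# CPU version: compile with gprof instrumentation, run, then read the report
g++ -O2 -pg heat.cpp -o heat
./heat                          # writes gmon.out
gprof ./heat gmon.out > profile.txt

# GPU version: profile on the command line, or open the visual profiler
nvcc -O2 heat.cu -o heat_gpu
nvprof ./heat_gpu
nvvp                            # then create a new session for ./heat_gpu
</syntaxhighlight>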
 
==== Profile results ====
Each program was tested with five different input sizes (500, 1000, 2500, 5000, and 10000).
 
=====Profile #1 (Original code) =====
Sample Data = 500
 
[[File:500-Original.png]]
 
 
Sample Data = 2500
 
[[File:2500-Original.png]]
 
 
Sample Data = 10000
 
[[File:10000-Original.png]]
 
 
=====Profile #2 (GPU implementation version) =====
Sample Data = 500
 
[[File:500.png]]
 
 
 
Sample Data = 2500
 
[[File:2500.png]]
 
 
Sample Data = 10000
 
[[File:10000.png]]
 
 
===== Summary of profiling =====
We tested the CPU and GPU implementations with the same data sizes; the GPU (kernel) version showed a significant speedup.
 
 
==== Summary of Profiling (Chart) ====
 
 
[[File:Assignment2.png]]
 
 
==== Charts of the Main Function Calls for the CPU Version and the GPU (Kernel) Version ====
 
===== Original =====
 
[[File:Originalcode.png]]
 
===== GPU (kernel) =====
 
[[File:gpu(kernal).png]]
 
==== Screenshots of the solve Function in the CPU and GPU (Kernel) Versions ====
 
===== CPU =====
 
[[File:Original-code.png]]
 
===== GPU (solve) and Kernel =====
 
[[File:GPU(kernel).png]]
 
[[File:gpu-code.png]]
 
 
 
=== Assignment 3 ===
For Assignment 3, we looked at everything we could do to improve performance and found two optimizations.
First, the while loop made six memory-copy calls per iteration; by switching device address pointers instead, we reduced those six copies to one.
Second, when the sample number n is less than 1024, the kernel can use shared memory.
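The pointer switch works because the kernel's output buffer for one time step is exactly its input for the next: instead of copying the data back and forth, the host simply exchanges the two device pointers, which moves no data at all, as the improved code below shows.

<syntaxhighlight lang="cpp">
// Double buffering: after each kernel launch, the output buffer (A)
// becomes the next iteration's input (B) by swapping the pointers.
double* tmp = d_solutionA;
d_solutionA = d_solutionB;
d_solutionB = tmp;   // equivalently: std::swap(d_solutionA, d_solutionB)
</syntaxhighlight>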
 
==== Reducing Memory Copy Calls====
 
===== Assignment 2 Code =====
<syntaxhighlight lang="cpp">
void solve(){
    int j = 0;
    int d;

    // Query the device limits to cap the problem size.
    cudaDeviceProp prop;
    cudaGetDevice(&d);
    cudaGetDeviceProperties(&prop, d);
    unsigned ntpb = prop.maxThreadsDim[0];
    unsigned ntpg = ntpb * prop.maxGridSize[0];
    if (xGridNum > ntpg) {
        xGridNum = ntpg;
        std::cout << "n reduced to " << xGridNum << std::endl;
    }

    double* d_x;
    double* d_solutionLast;
    double* d_solutionNew;

    // Each array holds xGridNum + 1 doubles (indices 0 .. xGridNum).
    cudaMalloc((void**)&d_x, (xGridNum + 1) * sizeof(double));
    cudaMalloc((void**)&d_solutionLast, (xGridNum + 1) * sizeof(double));
    cudaMalloc((void**)&d_solutionNew, (xGridNum + 1) * sizeof(double));

    while (j < tTotal/dt) {
        solutionNew[0] = 10;            // fixed boundary temperatures
        solutionNew[xGridNum] = 120;

        // Six memory copies per iteration: three in, three out.
        cudaMemcpy(d_x, x, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_solutionLast, solutionLast, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_solutionNew, solutionNew, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);

        kernel<<<(xGridNum + ntpb - 1) / ntpb, ntpb>>>(d_solutionNew, d_solutionLast, d_x, xGridNum, dt, dx, K, j);
        j++;

        cudaMemcpy(x, d_x, (xGridNum + 1) * sizeof(double), cudaMemcpyDeviceToHost);
        cudaMemcpy(solutionLast, d_solutionLast, (xGridNum + 1) * sizeof(double), cudaMemcpyDeviceToHost);
        cudaMemcpy(solutionNew, d_solutionNew, (xGridNum + 1) * sizeof(double), cudaMemcpyDeviceToHost);

        solutionLast = solutionNew;
    }

    for (j = 0; j <= xGridNum; j++) {
        std::cout << solutionLast[j] << std::endl;
    }

    cudaFree(d_x);
    cudaFree(d_solutionLast);
    cudaFree(d_solutionNew);
}
</syntaxhighlight>
 
 
===== Improved Code =====
<syntaxhighlight lang="cpp">
void solve(){
    int j = 0;
    int d;
    cudaDeviceProp prop;
    cudaGetDevice(&d);
    cudaGetDeviceProperties(&prop, d);
    unsigned ntpb = prop.maxThreadsDim[0];
    unsigned ntpg = ntpb * prop.maxGridSize[0];
    if (xGridNum > ntpg) {
        xGridNum = ntpg;
        std::cout << "n reduced to " << xGridNum << std::endl;
    }

    double* d_x;
    double* d_solutionA;
    double* d_solutionB;
    double* d_solutionTemp;

    cudaMalloc((void**)&d_x, (xGridNum + 1) * sizeof(double));
    cudaMalloc((void**)&d_solutionA, (xGridNum + 1) * sizeof(double));
    cudaMalloc((void**)&d_solutionB, (xGridNum + 1) * sizeof(double));

    // Copy the input data to the device once, before the loop.
    cudaMemcpy(d_x, x, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_solutionA, solutionNew, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_solutionB, solutionLast, (xGridNum + 1) * sizeof(double), cudaMemcpyHostToDevice);

    std::ofstream myfile;   // requires <fstream>
    myfile.open("output.txt");
    // A = new, B = last
    while (j < tTotal/dt) {
        // Launch enough threads to cover all xGridNum + 1 points.
        kernel<<<(xGridNum + ntpb) / ntpb, ntpb>>>(d_solutionA, d_solutionB, d_x, xGridNum, dt, dx, K, j);
        j++;
        cudaMemcpy(solutionNew, d_solutionA, (xGridNum + 1) * sizeof(double), cudaMemcpyDeviceToHost);
        /*
        myfile << "Time" << tTotal/dt << std::endl;
        for(int i = 0; i <= xGridNum; i++){
            myfile << solutionNew[i] << ":";
        }
        */
        // Swap the device pointers: this step's output becomes
        // the next step's input, with no further copying.
        d_solutionTemp = d_solutionA;
        d_solutionA = d_solutionB;
        d_solutionB = d_solutionTemp;
    }

    myfile.close();
    cudaFree(d_x);
    cudaFree(d_solutionA);
    cudaFree(d_solutionB);
    cudaDeviceReset(); // Important: cleans up the device and flushes profiler data
}
</syntaxhighlight>
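Note that the one remaining device-to-host copy inside the loop only feeds the (currently commented-out) file output; if that logging is not needed, the copy could be hoisted out of the loop entirely, leaving a loop body that only launches kernels and swaps pointers, with no host-device memory traffic at all.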
 
 
==== Using Shared Memory ====
 
===== Assignment 2 Code =====
<syntaxhighlight lang="cpp">
__global__ void kernel(double* solutionNew, double* solutionLast, double* x, int n, float dt, float dx, float K, int j){
    int k = blockIdx.x * blockDim.x + threadIdx.x;

    // Update the interior points only (k = 1 .. n-1), matching the CPU
    // loop; the boundary points 0 and n are set on the host.
    if (k > 0 && k < n) {
        solutionNew[k] = solutionLast[k] + dt*(K/dx/dx*
            (solutionLast[k-1] + solutionLast[k+1] - 2.*solutionLast[k]
            + (x[k]*j*dt*1000)));
    }
}
</syntaxhighlight>
===== Improved Code =====
<syntaxhighlight lang="cpp">
__global__ void kernel(double* solutionNew, double* solutionLast, double* x, int n, float dt, float dx, float K, int j){
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    double *t_c;
    __shared__ double s_c[1024];

    // When the whole grid fits in one block (n < 1024), stage the last
    // solution in shared memory; otherwise fall back to global memory.
    if (n < 1024 && k < (n + 1))
        s_c[k] = solutionLast[k];
    __syncthreads();
    if (n < 1024) t_c = s_c;
    else          t_c = solutionLast;

    // Boundary temperatures are now set on the device.
    if (k == 0) {
        solutionNew[k] = 10;
    } else if (k == n) {
        solutionNew[k] = 120;
    } else if (k < n) {     // interior points k = 1 .. n-1
        solutionNew[k] = t_c[k] + dt*(K/dx/dx*
            (t_c[k-1] + t_c[k+1] - 2.*t_c[k]
            + (x[k]*j*dt*1000)));
    }
}
</syntaxhighlight>
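A __shared__ array is visible only within a single block, so this optimization applies only when the whole grid fits in one block: with 1024 threads per block that means n < 1024, which is exactly the guard in the kernel. Note also that __syncthreads() sits outside the conditional, so every thread in the block reaches the barrier.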
 
 
==== Profile Results ====
 
===== Sample Data = 500 =====
[[File:as3v1-500.png]]
 
===== Sample Data = 2500 =====
[[File:as3v1-2500.png]]
 
 
===== Sample Data = 5000 =====
[[File:as3v1-5000.png]]
 
 
===== Sample Data = 10000 =====
[[File:as3v1-10000.png]]
 
==== Chart ====
[[File:a3 duration chart.png]]
 
 
=== Assignment 3 v2 ===
==== Description of v2 ====
Changed double-precision values to single-precision floats.
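On most GPUs, single-precision arithmetic has a much higher peak throughput than double precision, and using float also halves the size of every buffer and memory copy. The trade-off is reduced numerical accuracy, which is worth watching in a finite-difference scheme that accumulates rounding error over many time steps.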
 
==== Sample Data = 500 ====
[[File:as3v2-500.png]]
 
==== Sample Data = 1000 ====
[[File:as3v2-1000.png]]
 
==== Sample Data = 5000 ====
[[File:as3v2-5000.png]]
 
==== Sample Data = 10000 ====
[[File:as3v2-10000.png]]
 
==== Chart ====
[[File:Table.png]]
 
 
 
==== Conclusions / Problems Encountered ====
Using CUDA, our team achieved roughly an 8000% speedup in total run time compared to the original project, while producing the same final results. We were confident that implementing a kernel would produce a large speedup, because the heat value at each point for each time step was computed serially in a loop.

To accomplish the project, we first focused on understanding the original program: finding the hot spot in the code and learning what each variable was for. Working with the heat-equation problem was challenging, because we had to understand how the heat points are calculated and determine which parts and variables were needed for the kernel and which were not. The original program took no user input, so we also had to decide which parts of the equation to treat as dependent data.

As we developed the kernel, we had to decide how to use the different memory spaces in the program (global memory, shared memory, etc.). A core dependency in the algorithm is that each new heat value is computed from the previous time step's values, so we had to work out how many memory copies were actually needed and which parts could stay on the GPU (in shared memory and registers). In the end, we minimized the number of memory copies between the CPU and GPU and made greater use of shared memory and registers.