Difference between revisions of "Hu3Team"

From CDOT Wiki
Jump to: navigation, search
Line 38: Line 38:
 
Based on assignment 1, we added the source code to make possible the program to be executed on a CUDA device, as follows.
 
Based on assignment 1, we added the source code to make possible the program to be executed on a CUDA device, as follows.
  
 +
<pre>
  
 
__global__ void copyMat(const double *w, double *u){
 
__global__ void copyMat(const double *w, double *u){
Line 106: Line 107:
  
 
}
 
}
 +
 +
</pre>
 +
 +
Moreover, we made the input of the error tolerance (Epsilon) to be set on the code. After lots of difficulties found while we were coding, we finally got good results in comparison with the code of assignment 1. The runtime was decreased, and it made us to see the power that CUDA may provide to optimize the processing.
 +
 +
=== Assignment 3 ===
 +
 +
====CUDA Coding====
 +
Based on assignment 2, we made optimizations to speed up the execution, as follows.
 +
 +
<pre>
 +
__global__ void copyMat(const float *w, float *u){
 +
int i = blockIdx.x * blockDim.x + threadIdx.x;
 +
int j = blockIdx.y * blockDim.y + threadIdx.y;
 +
if (i < M && j < N) {
 +
u[i * M + j] = w[i * M + j];
 +
}
 +
__syncthreads();
 +
}
 +
__global__ void calcHeat(float *w, float *u, float *d, int m, int n, float* d_array){
 +
int i = blockIdx.x * blockDim.x + threadIdx.x;
 +
int j = blockIdx.y * blockDim.y + threadIdx.y;
 +
int tx = threadIdx.x;
 +
int ty = threadIdx.y;
 +
 +
__shared__ float s_u[ntpb][ntpb];
 +
__shared__ float s_w[ntpb][ntpb];
 +
__shared__ float s_dif[ntpb][ntpb];
 +
if (tx < ntpb && ty < ntpb) {
 +
s_w[ty][tx] = w[j * M + i];
 +
s_u[ty][tx] = w[j * M + i];
 +
}
 +
__syncthreads();
 +
 +
if ( ( tx < (ntpb-1) && ty < (ntpb-1) ) && ( tx >0 && ty > 0 ) && ( i < M && j < N ) ) {
 +
s_w[ty][tx] = ( s_u[ty - 1][tx] + s_u[ty + 1][tx] + s_u[ty][tx - 1] + s_u[ty][tx + 1] ) / 4.0;
 +
 +
s_dif[ty][tx] = fabsf(s_w[ty][tx] - s_u[ty][tx]);
 +
 +
//if (s_dif[ty][tx] < 0){ s_dif[ty][tx] *= -1; }
 +
}
 +
__syncthreads();
 +
if (tx < ntpb && ty < ntpb) {
 +
w[j * M + i] = s_w[ty][tx];
 +
//u[j * M + i] = s_w[ty][tx];
 +
d_array[j * M + i] = s_dif[ty][tx];
 +
}
 +
__syncthreads();
 +
}
 +
__global__ void bigDiff(float* d_array, float* d, int m, int n){
 +
int i = blockIdx.x * blockDim.x + threadIdx.x;
 +
 +
for (int x = 1; i + x < m*n; x *= 2) {
 +
if (d_array[i] > *d || d_array[i + x] > *d){
 +
if (d_array[i] > d_array[i + x])
 +
*d = d_array[i];
 +
else
 +
*d = d_array[i + x];
 +
}
 +
__syncthreads();
 +
}
 +
}
 +
</pre>
 +
 
====Comparing the results====
 
====Comparing the results====
 
As a result of the source code included, it was possible to reduce the processing time.  
 
As a result of the source code included, it was possible to reduce the processing time.  
Line 112: Line 177:
  
 
[[File:runtime-CUDA.JPG]]
 
[[File:runtime-CUDA.JPG]]
 
 
Moreover, we made the input of the error tolerance (Epsilon) to be set on the code. After lots of difficulties found while we were coding, we finally got good results in comparison with the code of assignment 1. The runtime was decreased, and it made us to see the power that CUDA may provide to optimize the processing.
 
 
=== Assignment 3 ===
 

Revision as of 17:34, 5 December 2014

Project Name Goes here

Team Members

  1. Bruno Di Giuseppe
  2. Carlos Silva

Email All

Progress

Assignment 1

Bruno's Findings

I started off with this simple heat equation

It's interesting because the calculation involves taking the average between each matrix element's neighbour(north, south, east and west) and keeps on doing it over and over again until you get a precise calculation(when the biggest difference between the new calculated element and the older is smaller than the defined Epsilon error margin). Like this you are able to get a nice heat dispersion calculation.

I was worried about data dependency, since, as I said, each element depends on each other to be calculated. But this solution uses 2 matrix, one old and one new, where the new matrix will receive the average value of the old matrix and, if the difference is still bigger than epsilon, then the old matrix receives the values of the new matrix and the whole iteration happens again, where the new matrix is going to receive the average values of the old matrix, that now holds the most recent values.

So this is a good candidate for parallelization because we can send each iteration of the average calculation to a different GPU thread and since this is a simple average calculation, the GPU will be able to do it. Running with a 1000x1000 matrix, with a epsilon error factor of 0.001, the program took almost 5 minutes to run completely and 99% of the time was on the getHeat() method (the heat calculation core).

Execution-NO CUDA.JPG

Number of iterations x Value reached that represents the first value smaller than Epsilon.

Carlos's Findings

The array processing is a good choice and it is used in many applications, for example game development and image processing. It can calculate the transformation of a game world scene and its objects; besides, add filters on images as we can see in mobile apps such as Instagram.

It is a good candidate because it operates simple and repetitive calculations that can be divide among CUDA processors.

Conclusion

After analyzing both solutions and their applications, we have chosen the Heat Equation because of the number of different algorithms that we can find to solve array processing problems, such as CBLAS and cuBLAS. Due to the possibility of doing something that is not often done, we will work with the Heat Equation during this semester.


Assignment 2

CUDA Coding

Based on assignment 1, we added the source code to make possible the program to be executed on a CUDA device, as follows.


__global__ void copyMat(const double *w, double *u){

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	int j = blockIdx.y * blockDim.y + threadIdx.y;

	if (i < M && j < N) {

		u[i * M + j] = w[i * M + j];

	}

	__syncthreads();

}

__global__ void calcHeat(double *w, const double *u, double *d, int m, int n, double* d_array){

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	int j = blockIdx.y * blockDim.y + threadIdx.y;

	if ( i == 0 )i++; 

	if ( j == 0 )j++;

	if (i < m && j < n) {

		w[i * m + j] = (u[(i - 1) * m + j] + u[(i + 1) * m + j] + u[i * m + (j - 1)] + u[i * m + (j + 1)]) / 4.0;

		d_array[i * m + j] = w[i * m + j] - u[i * m + j];

		if( d_array[i * m + j] < 0 ){d_array[i * m + j] *= -1;}

	}

	*d = -1;

	__syncthreads();

}

__global__ void bigDiff(double* d_array, double* d, int m, int n){

	int i = blockIdx.x * blockDim.x + threadIdx.x;



	for (int x = 1; i+x < m*n; x*=2) {

		if (d_array[i] > *d || d_array[i + x] > *d){

			if (d_array[i] > d_array[i + x])

				*d = d_array[i];

			else

				*d = d_array[i + x];

		}

		__syncthreads();

	}

}

Moreover, we made the input of the error tolerance (Epsilon) to be set on the code. After lots of difficulties found while we were coding, we finally got good results in comparison with the code of assignment 1. The runtime was decreased, and it made us to see the power that CUDA may provide to optimize the processing.

Assignment 3

CUDA Coding

Based on assignment 2, we made optimizations to speed up the execution, as follows.

__global__ void copyMat(const float *w, float *u){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (i < M && j < N) {
		u[i * M + j] = w[i * M + j];
	}
	__syncthreads();
}
__global__ void calcHeat(float *w, float *u, float *d, int m, int n, float* d_array){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	__shared__ float s_u[ntpb][ntpb];
	__shared__ float s_w[ntpb][ntpb];
	__shared__ float s_dif[ntpb][ntpb];
	if (tx < ntpb && ty < ntpb) {
		s_w[ty][tx] = w[j * M + i];
		s_u[ty][tx] = w[j * M + i];
	}
	__syncthreads();

	if ( ( tx < (ntpb-1) && ty < (ntpb-1) ) && ( tx >0 && ty > 0 ) && ( i < M && j < N ) ) {
		s_w[ty][tx] = ( s_u[ty - 1][tx] + s_u[ty + 1][tx] + s_u[ty][tx - 1] + s_u[ty][tx + 1] ) / 4.0;

		s_dif[ty][tx] = fabsf(s_w[ty][tx] - s_u[ty][tx]);

		//if (s_dif[ty][tx] < 0){ s_dif[ty][tx] *= -1; }
	}
	__syncthreads();
	if (tx < ntpb && ty < ntpb) {
		w[j * M + i] = s_w[ty][tx];
		//u[j * M + i] = s_w[ty][tx];
		d_array[j * M + i] = s_dif[ty][tx];
	}
	__syncthreads();
}
__global__ void bigDiff(float* d_array, float* d, int m, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	for (int x = 1; i + x < m*n; x *= 2) {
		if (d_array[i] > *d || d_array[i + x] > *d){
			if (d_array[i] > d_array[i + x])
				*d = d_array[i];
			else
				*d = d_array[i + x];
		}
		__syncthreads();
	}
}

Comparing the results

As a result of the source code included, it was possible to reduce the processing time.


Runtime-CUDA.JPG