BetaT

From CDOT Wiki
Revision as of 20:52, 1 April 2017 by Jadach1 (talk | contribs) (New Kernel)
Jump to: navigation, search

BetaT

Assignment 1 


Profile Assessment

Application 1 naiver strokes flow velocity.

This application calculates the naiver strokes flow velocity.

Naiver Strokes is an equation for Flow Velocity.

Navier–Stokes equations are useful because they describe the physics of many phenomena of scientific and engineering interest. They may be used to model the weather, ocean currents, water flow in a pipe and air flow around a wing. The Navier–Stokes equations in their full and simplified forms help with the design of aircraft and cars, the study of blood flow, the design of power stations, the analysis of pollution, and many other things. Coupled with Maxwell's equations they can be used to model and study magnetohydrodynamics. courtesy of wikipedia ("https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations")

problem

The problem with this application comes in the main function trying to calculate the finite-difference


The user inputs 2 values which will be used as a reference for the loop.


 // Finite-difference loop:
 for (int it=1; it<=nt-1; it++)
   {
     for (int k=0; k<=nx-1; k++)
   {
     un[k][it-1] = u[k][it-1];
   }
     for (int i=1; i<=nx-1; i++)
   {
     u[0][it] = un[1][it-1];
     u[i][it] = un[i][it-1] - c*dt/dx*(un[i][it-1]-un[i-1][it-1]);
   }
   }


Tests ran with no optimization on linux

By using the command line argument cat /proc/cpuinfo We can find the CPU specs for the VM we are operating linux through. for this test we have: Dual-Core AMD Opteron cpu MHz at 2792

Naiver Equation
n Time in Milliseconds
100 x 100 24
500 x 500 352
1000 x 1000 1090
2000 x 2000 3936
5000 x 5000 37799
5000 x 10000 65955
10000 x 10000 118682
12500 x 12500 220198


gprof

it gets a bit messy down there, but basically 89.19% of the program is spent in the main() calculating those for loops shown above. The additional time is spent allocating the memory which might cause some slowdown when transferring it to the GPU across the bus int he future.

But the main thing to take away here is that main() is 89.19% and takes 97 seconds.

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
89.19     97.08    97.08                             main
 4.73    102.22     5.14 1406087506     0.00     0.00  std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::operator[](unsigned int)
 4.49    107.11     4.88 1406087506     0.00     0.00  std::vector<double, std::allocator<double> >::operator[](unsigned int)

Potential Speed Increase with Amdahls Law

Using Amdahls Law ---- > Sn = 1 / ( 1 - P + P/n )

We can examine how fast out program is capable of increasing its speed.

P = is the part of the program we want to optimize which from above is 89.17% n = the amount of processors we will use. One GPU card has 384 processors or CUDA cores and another GPU we will use has 1020 processor or CUDA cores.

Applying the algorithm gives us.

Amdahls Law for GPU with 384 Cores---- > Sn = 1 / ( 1 - 0.8919 + 0.8919/384 )

                                        Sn = 9.0561125222

Amdahls Law for GPU with 1024 Cores---- > Sn = 1 / ( 1 - 0.8919 + 0.8919/1024 )

                                         Sn = 9.176753777

Therefor According to Amdahls law we can expect a 9x increase in speed.

97 seconds to execute main / 9 amdahls law = 10.7777 seconds to execute after using GPU

Interestingly according to the law the difference in GPU cores does not significantly increase speed. Future tests will confirm or deny these results.


Potential Speed Increase with Gustafsons Law

Gustafsons Law S(n) = n - ( 1 - P ) ∙ ( n - 1 )

(Quadro K2000 GPU) S = 380 - ( 1 - .8918 ) * ( 380 - 1 ) = 339.031

(GeForce GTX960 GPU) S = 1024 - ( 1 - .8918 ) * ( 1024 - 1 ) = 913.3114


Using Gustafsons law we see drastic changes in the amount speed increase, this time the additional Cores made a big difference and applying these speed ups we get:

(Quadro K2000 GPU) 97 seconds to execute / 339.031 = 0.29

(GeForce GTX960 GPU) 97 seconds to execute / 913.3114 = 0.11


Tests ran with no optimization on Windows nisghts

System Specifications


Application 2 Calculating Pi

This application is pretty straightforward, it calculates Pi to the decimal point which is given by the user. So an input of 10 vs 100,000 will calculate Pi to either the 10th or 100 thousandth decimal.

problem

Inside the function calculate we have:

void calculate(std::vector<int>& r, int n) { int i, k; int b, d; int c = 0;

for (i = 0; i < n; i++) {

r[i] = 2000;

}

for (k = n; k > 0; k -= 14) {

d = 0; i = k;

for (;;) {

d += r[i] * 10000;

b = 2 * i - 1;

r[i] = d % b;

d /= b;

i--;

if (i == 0) break;

d *= i; }

//printf("%.4d", c + d / 10000); c = d % 10000; } }

I Believe the 2 for loops will cause a delay in the program execution time.

Tests ran with no optimization on linux

for this test the linux VM has: Dual-Core AMD Opteron cpu MHz at 2792


Naiver Equation
n Time in Milliseconds
1000 2
10000 266
100000 26616
200000 106607
500000 671163


gprof

As with the other application that was profiled it can be a bit hard to read the gprof results. Basically the program spends 87% of the time in the calculate() method and with a problem size of 500,000 it spend a cumulative of 354 seconds. Hopefully we can get this number down.

But the main thing to take away here is that main() is 89.19% and takes 97 seconds. Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
87.39    354.08   354.08        1   354.08   395.84  calculate(std::vector<int, std::allocator<int> >&, int)
10.31    395.84    41.76 678273676     0.00     0.00  std::vector<int, std::allocator<int> >::operator[](unsigned int)

Potential Speed Increase with Amdahls Law

Using Amdahls Law ---- > Sn = 1 / ( 1 - P + P/n )

We can examine how fast out program is capable of increasing its speed.

P = is the part of the program we want to optimize which from above is 87.39% n = the amount of processors we will use. One GPU card has 384 processors or CUDA cores and another GPU we will use has 1020 processor or CUDA cores.

Applying the algorithm gives us.

Amdahls Law for GPU with 384 Cores---- > Sn = 1 / ( 1 - 0.8739 + 0.8739/384 )

                                        Sn = 7.789631

Amdahls Law for GPU with 1024 Cores---- > Sn = 1 / ( 1 - 0.8739 + 0.8739/1024 )

                                         Sn = 7.876904

Therefor According to Amdahls law we can expect a 7.7x to 7.9x increase in speed.

97 seconds to execute main / 7.8 amdahls law = 45.3948 seconds to execute after using GPU

Interestingly the last application had p = 89% (9x speed up) and this application p = 87% (7.8x speed up), 2% made quite a difference.


Potential Speed Increase with Gustafsons Law

Gustafsons Law S(n) = n - ( 1 - P ) ∙ ( n - 1 )

(Quadro K2000 GPU) S = 380 - ( 1 - .8739 ) * ( 380 - 1 ) = 332.2081

(GeForce GTX960 GPU) S = 1024 - ( 1 - .8739 ) * ( 1024 - 1 ) = 894.9837


Using Gustafsons law we see drastic changes in the amount speed increase, this time the additional Cores made a big difference and applying these speed ups we get:

(Quadro K2000 GPU) 354 seconds to execute / 332.2081 = 1.065597

(GeForce GTX960 GPU) 354 seconds to execute / 894.9837 = 0.395537

Conclusions with Profile Assessment

Based on the problem we have for both applications which is quadratic(A nested for loop). The time spent processing the main problem which was 89.19% and 87.39%. Plus the amount of time in seconds the program spent on the particular problem which was 97 & 354 seconds. I believe it is feasible to optimize one of these application with CUDA to improve performance.

I will attempt to optimize the naiver strokes flow velocity program as that application is more interesting to me.

Parallelize

Original Problems when converting a program to use the GPU

In order to use the GPU with this program some changes need to be implemented.

The original program was using a two dimensional Vector. unfortunately a vector cannot be copied using *cudaMemCpy* or allocated for using *cudaMalloc*, so there needs to be a change, the vector will be converted into a 2D Array of float** u[][]

Now that the program is iterating trough an Array rather than a Vector the program executes at a much faster speed, but it can still go faster so I will continue optimization with the GPU.

The next problem encountered is like a vector we cannot transfer a 2D array to a *cudaMalloc* or *cudaMemCPy* function call. So the 2D array needs to be converted into a single array. And that brings another problem regarding the original algorithm which is shown below.

for (int i=0; i <= nx-1; i++)

   {
     if (i*dx >= 0.5 && i*dx <= 1)
   {
     u[i][0] = 2;
   }
     else
   {
     u[i][0] = 1;
   }
   }

 // Finite-difference loop:
 for (int it=1; it<=nt-1; it++)
   {
     for (int k=0; k<=nx-1; k++)
   {
     un[k][it-1] = u[k][it-1];
   }
     for (int i=1; i<=nx-1; i++)
   {
     u[0][it] = un[1][it-1];
     u[i][it] = un[i][it-1] - c*dt/dx*(un[i][it-1]-un[i-1][it-1]);
   }

}


In order to get this algorithm to work using a 1D array some changes are made, see below


for (int k = 0; k <= nx - 1; k++)
 if (k*dx >= 0.5 && k*dx <= 1)
 {	
 u[k * nt + 0] = 2;	
 }	
 else	
 {		
 u[k * nt + 0] = 1;	
 }
 for (int it = 1; it <= nx - 1; it++)
 {	
  for (int k = 0; k <= nx - 1; k++)	
  {		
   un[k * nx + it - 1] = u[k * nx + it - 1];	
  }	
  for (int m = 1; m <= nx - 1; m++)	
  {		
   u[0 * nx + it] = un[1 * nx + it - 1];		
   u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);	
  }	

}


It is possible to now iterate through a single dimensional array using the original algorithm.

One more small problem came up, not every element in the array is initialized so there are parts which are pointing to a nullptr garbage in memory. Beause originally the program used a vector, by default all elememnts are initialized to 0, now to solve this problem a separate function is implemented to fill each array index with he value of 0 upon initialization.

After these implementations, testing the code produced the same results as the original program, so it is a positive confirmation that we can proceed to optimizing the cod using the GPU

Optimizing Problems

At the beginning of the program, a handle was created for grabbing the computers Cuda Device, through this handle we can extract the Cuda Device properties to examine our parameters

//list the properties

std::cout << "Name:" << prop.name << std::endl;

std::cout << "Compute Capability: " << prop.major << '.' << prop.minor << std::endl;

std::cout << "Total Global Memory: " << prop.totalGlobalMem << std::endl;

std::cout << "Max Threads per block: " << prop.maxThreadsPerBlock << std::endl;

std::cout << "Clock Rate in khz: " << prop.clockRate << "\n\n";


We grab the devices properties so that we do not exceed resources.

The original algorithm was split into 2 kernels. The first kernel causing no problems is as follows


 __global__ void Initalize(double* u, double* un, int nx, int nt, double dx)
 {  
 int i = blockIdx.x * blockDim.x + threadIdx.x;  
 int j = blockIdx.y * blockDim.y + threadIdx.y;
 
 for (int k = 0; k <= nx - 1; k++)	 
 if (k*dx >= 0.5 && k*dx <= 1)	

 {		 
 u[k * nt] = 2;		
 __syncthreads();	
 }	 
 else	 
 {		 
 u[k * nt] = 1;		
 __syncthreads();	 
 }
 }



The second kernel works perfectly find for arguments less than 1024 1024 (user inputs 2 values), anything higher for example an argument of 2000 2000 will crash the NVidia driver and results will be set to pre kernel launch. The kernel code is below:

 __global__ void Calculate (double* u, double* un,int nx, int c, double dx, double dt)
 {
   for (int it = 1; it <= nx - 1; it++)	
     {	 
     for (int k = 0; k <= nx - 1; k++)		
       {			
        un[k * nx + it - 1] = u[k * nx + it - 1];		
       }		
      for (int m = 1; m <= nx - 1; m++)		
      {
       u[0 * nx + it] = un[1 * nx + it - 1];
       u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
      }		
      }

}

Solution to first Kernel problem

The problem was resulting because of this calculation, u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]); Perhaps using the c, dt & dx values incorporated to many trips to global memory which caused a hang in the operation and CUDA automatically crashed the driver. To solve this problem a scalar variable (local variable) was created to store this value in registered memory for each thread to access. double total = c*dt / dx;... Now the program executes with an argument of 2000 2000 and yeilds similar results to the original program.

 __global__ void Calculate (double* u, double* un,int nx, int c, double dx, double dt)
 {
    double total = c*dt / dx;
       for (int it = 1; it <= nx - 1; it++)
          {
          for (int k = 0; k <= nx - 1; k++)	
             {		
              un[k * nx + it - 1] = u[k * nx + it - 1];		
             }		
          for (int m = 1; m <= nx - 1; m++)
             {		
              u[0 * nx + it] = un[1 * nx + it - 1];		
              u[m * nx + it] = un[m * nx + it - 1] - total * ( un[m * nx + it - 1] - un[(m - 1) * nx + it - 1] );	
             }
          }

}

Problem with second Kernel

 Unfortunately a new problem has risen, when the argument is raised above 2000 & 2000 the NVidia driver once again crashes and I am stuck with no solution currently.


Re - Parallelize

The original parallelized solution I had was greatly flawed... So I have parallelized the code once again, this time using Thread Identifiers from the Grid.

New Kernel

 __global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
 {
 	
   int i = blockIdx.x * blockDim.x + threadIdx.x;	
   int j = blockIdx.y * blockDim.y + threadIdx.y;
 		
 if (i < nx && j < nx)
  {				
 // replace k with i.  because i represents the x dimension which will ascend in a range from 0 to nx.  
 // So i * dx will essentially by equivalent to...for (int k = 0; k < nx; k++)		
   if (i*dx >= 0.5 && i*dx <= 1)
 // replace k here with i for the X dimension for the same reason as above			
    u[i * nx] = 2;			
   else			
    u[i * nx] = 1;		
   }	
  }	

The old code was:


 for (int k = 0; k <= nx - 1; k++)
   if (k*dx >= 0.5 && k*dx <= 1)		
    {			
     u[k * nx] = 2;					 
    }		
   else		
    {			 
     u[k * nx] = 1;		
    }


So I removed the for loop and simply changed the "k" in the "if" statement to represnt the threadIdx.x identifier because X is the fastest moving dimension it will range from 0 to NX like a for look. This matches the orignal Naiver output.

After this a separate kernel was created with the following code...


 int i = blockIdx.x * blockDim.x + threadIdx.x;
 int j = blockIdx.y * blockDim.y + threadIdx.y;
 // The original code had the following statement:: u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
 // Rather than having each thread perform this calculation which will be an additional 2 instructions per thread, i have just stored it in a variable
 float total = c*dt / dx;
  if (i < nx && j < nx)
   {
    // The original code as can be seen below is basically copying array un to array u.  So i arranged the threads to do the same
    un[j * nx + i] = u[j * nx + i];	
    __syncthreads();	
     if (i != 0)		
      {
 // This part was a bit trickier.  As seen in the original code below array u would access all threads in the [0,0] [0,1] [0,2] etc... 
 // And copy a value from array un's [1,1] [1,2] [1,3]..etc range.  The trick here was the -1 difference at the end
 // Because in the original for look, (it) starts at the value 1, I added and if condition to make sure the threads don't perform the operation on the          thread of value 0.  But it can still be access through the -1 operator.		
       u[i] = un[1 * nx + i-1];		
       __syncthreads();	
      }
    }

Compared to the original code...


 for (int it = 1; it <= nx - 1; it++)			
 {				
   for (int k = 0; k <= nx - 1; k++)				
     {				
      un[k * nx + it - 1] = u[k * nx + it - 1];			
     }				
   for (int m = 1; m <= nx - 1; m++)				
     {
      u[0 * nx + it] = un[1 * nx + it - 1];				
      u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);			
     }
  }			}