A-Team

From CDOT Wiki
Revision as of 11:56, 10 March 2019 by Spdjurovic (talk | contribs) (Assignment 1)
Jump to: navigation, search

Back Propagation Acceleration

Team Members

  1. Sebastian Djurovic, Team Lead and Developer
  2. Henry Leung, Developer and Quality Control
  3. ...

Email All

Progress

Assignment 1

Our group decided to profile a couple of different solutions, the first being a simple neural network and ray tracing solution, in order to determine the best project to generate a solution for.

Neural Network
Sebastian's findings

I found a simple neural network that takes a MNIST data set and preforms training on batches of the data. For a quick illustration MNIST is a numerical data set that contains many written numbers --in a gray scale format at 28 x 28 pixels in size. As well as the corresponding numerical values; between 0 and 9. The reason for this data set is to train networks such that they will be able to recognize written numbers when they confront them.

MnistExamples.png

Initial Profile
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
97.94    982.46   982.46                             dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int)
 1.45    997.05    14.58                             transpose(float*, int, int)
 0.15    998.56     1.51                             operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 0.15   1000.06     1.50                             relu(std::vector<float, std::allocator<float> > const&)
 0.15   1001.55     1.49                             operator*(float, std::vector<float, std::allocator<float> > const&)
 0.07   1002.27     0.72 519195026     1.39     1.39  void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&)
 0.06   1002.91     0.63                             operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 0.05   1003.37     0.46                             reluPrime(std::vector<float, std::allocator<float> > const&)
 0.02   1003.62     0.25                             softmax(std::vector<float, std::allocator<float> > const&, int)
 0.01   1003.75     0.13                             operator/(std::vector<float, std::allocator<float> > const&, float)
 0.01   1003.87     0.12   442679   271.35   271.35  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&)
 0.01   1003.96     0.09 13107321     6.87     6.87  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
 0.01   1004.02     0.06                             split(std::string const&, char)
 0.01   1004.08     0.06   462000   130.00   130.00  void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&)
 0.00   1004.11     0.03                             std::vector<std::string, std::allocator<std::string> >::~vector()
 0.00   1004.12     0.01                             random_vector(int)
 0.00   1004.12     0.00        3     0.00     0.00  std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
 0.00   1004.12     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii

Neuralnet chart.jpg

After the initial profile it is obvious that the dot product function consumes 97.94% of our run time. Additionally, the transpose function also consumes 1.45% which seems messily, however during back propagation transpose is also called, as well as two rectifiers(activation functions), reluPrime and relu. Where reluPrime is a binary activation function.

Relu = f(x) = {0 for x > 0, x otherwise}
ReluPrime = f(x) = {0 for x > 0, 1 otherwise}
Code Snippets
       // Back propagation
       vector<float> dyhat = (yhat - b_y);
       // dW3 = a2.T * dyhat
       vector<float> dW3 = dot(transpose( &a2[0], BATCH_SIZE, 64 ), dyhat, 64, BATCH_SIZE, 10);
       // dz2 = dyhat * W3.T * relu'(a2)
       vector<float> dz2 = dot(dyhat, transpose( &W3[0], 64, 10 ), BATCH_SIZE, 10, 64) * reluPrime(a2);
       // dW2 = a1.T * dz2
       vector<float> dW2 = dot(transpose( &a1[0], BATCH_SIZE, 128 ), dz2, 128, BATCH_SIZE, 64);
       // dz1 = dz2 * W2.T * relu'(a1)
       vector<float> dz1 = dot(dz2, transpose( &W2[0], 128, 64 ), BATCH_SIZE, 64, 128) * reluPrime(a1);
       // dW1 = X.T * dz1
       vector<float> dW1 = dot(transpose( &b_X[0], BATCH_SIZE, 784 ), dz1, 784, BATCH_SIZE, 128);


vector <float> dot (const vector <float>& m1, const vector <float>& m2, const int m1_rows, const int m1_columns, const int m2_columns) { 
   vector <float> output (m1_rows*m2_columns);
   
   for( int row = 0; row != m1_rows; ++row ) {
       for( int col = 0; col != m2_columns; ++col ) {
           output[ row * m2_columns + col ] = 0.f;
           for( int k = 0; k != m1_columns; ++k ) {
               output[ row * m2_columns + col ] += m1[ row * m1_columns + k ] * m2[ k * m2_columns + col ];
           }
       }
   }
   
   return output;
}
Amdahl's law

When Amdahl's law is applied the theoretical speed up is 48.54x, however due to the exception the actual prediction is no more then 10x faster.

Theoretical:

s = 1/(1 - 97.94)
= 48.54

Prediction:

P = 102s
Hypothesis

Our Hypothesis for this solution is a acceleration of roughly 10x; when dot() is parallelized.

Ray Tracing

Assignment 2

Assignment 3