Algo holics


Project Name Goes here

Team Members

  1. Sukhbeer Dhillon, Simple Backpropagation Neural Network
  2. Gurpreet Singh, Sudoku Puzzle Solver
  3. Edgar Giang, Some other responsibility

Progress

Assignment 1

Sudoku Puzzle Solver by Gurpreet Singh

It is a program that solves 9x9 Sudoku puzzles using a brute-force algorithm. The user can either pass a Sudoku file as input or enter the values manually. The file or the manual entry must strictly have 9 rows and 9 columns, all cells must be separated by a space, and the cells that need to be solved must contain 0 as their value.

The original source code can be found at https://github.com/shafeeq/Sudoku

Logic

In this program the brute-force algorithm first places 1 in the first empty cell. It then moves to the next empty cell, places 1 there, and checks whether it satisfies all the rules and conditions. If it does not, the algorithm increments the value to 2 and checks again. The value can range from 1 to 9 while searching for a correct value for a cell. If no value in that range satisfies the cell, the program backtracks to the previous cell, increments its value, and tries the whole process again. In this way it eventually solves the puzzle.
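
The following is a minimal sketch of this backtracking idea. The check-function names mirror the ones that appear in the call graphs below, but these bodies and the solve() driver are illustrative assumptions rather than the original solver.cpp:

  // Minimal backtracking sketch of the algorithm described above (illustrative only).
  #include <iostream>

  const int N = 9;

  // true if 'val' does not already appear in row r
  bool checkRow(int g[N][N], int r, int val) {
      for (int c = 0; c < N; ++c)
          if (g[r][c] == val) return false;
      return true;
  }

  // true if 'val' does not already appear in column c
  bool checkColumn(int g[N][N], int c, int val) {
      for (int r = 0; r < N; ++r)
          if (g[r][c] == val) return false;
      return true;
  }

  // true if 'val' does not already appear in the 3x3 box containing (r, c)
  bool checkSquare(int g[N][N], int r, int c, int val) {
      int r0 = r - r % 3, c0 = c - c % 3;
      for (int i = 0; i < 3; ++i)
          for (int j = 0; j < 3; ++j)
              if (g[r0 + i][c0 + j] == val) return false;
      return true;
  }

  // Fill empty cells (0) with values 1..9; backtrack when no value fits.
  bool solve(int g[N][N], int pos) {
      if (pos == N * N) return true;                // every cell is filled
      int r = pos / N, c = pos % N;
      if (g[r][c] != 0) return solve(g, pos + 1);   // skip the given clues
      for (int val = 1; val <= 9; ++val) {
          if (checkRow(g, r, val) && checkColumn(g, c, val) &&
              checkSquare(g, r, c, val)) {
              g[r][c] = val;
              if (solve(g, pos + 1)) return true;
          }
      }
      g[r][c] = 0;                                  // undo and backtrack
      return false;
  }

  int main() {
      int g[N][N];
      for (int i = 0; i < N * N; ++i)               // read 81 space-separated values
          std::cin >> g[i / N][i % N];
      if (solve(g, 0)) {
          for (int r = 0; r < N; ++r) {
              for (int c = 0; c < N; ++c) std::cout << g[r][c] << ' ';
              std::cout << '\n';
          }
      } else {
          std::cout << "No solution found\n";
      }
  }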

Compiling the program

Enter the following commands:

  g++ -std=c++0x -pg solver.cpp checks.cpp checksolution.cpp -o a
  ./a fileName

-pg directs the compiler to include the executable code required for profiling.

-o directs the compiler to name the executable a.


If we run the sample-puzzle-1 (level: easy) file, which has the following text inside it:

       0 6 0 0 0 0 9 7 2
       0 5 0 0 0 2 0 0 3
       0 7 0 3 9 0 5 0 0
       2 0 0 0 0 5 4 0 8
       0 0 0 0 0 0 0 0 0
       3 0 1 8 0 0 0 0 6
       0 0 4 0 2 3 0 8 0
       7 0 0 9 0 0 0 2 0
       9 2 5 0 0 0 0 4 0

The output will be:

       1 6 3 4 5 8 9 7 2
       4 5 9 7 1 2 8 6 3
       8 7 2 3 9 6 5 1 4
       2 9 7 1 6 5 4 3 8
       5 8 6 2 3 4 1 9 7
       3 4 1 8 7 9 2 5 6
       6 1 4 5 2 3 7 8 9
       7 3 8 9 4 1 6 2 5
       9 2 5 6 8 7 3 4 1


Analysis

To analyze the call graph, enter the following command:

    gprof -q -b a > a.clg

-q directs the profiler (gprof) to output a call graph.

-b directs the profiler to omit detailed explanations of the column headings from the output.


The call graph for the above execution looks like:

Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time    self  children    called     name
               0.00    0.00    4539/4539        placeNum(int, int) [10]
[8]      0.0    0.00    0.00    4539         checkRow(int, int) [8]
-----------------------------------------------
               0.00    0.00    1620/1620        placeNum(int, int) [10]
[9]      0.0    0.00    0.00    1620         checkColumn(int, int) [9]
-----------------------------------------------
               0.00    0.00    1120/1120        solveSudoku() [16]
[10]     0.0    0.00    0.00    1120         placeNum(int, int) [10]
               0.00    0.00    4539/4539        checkRow(int, int) [8]
               0.00    0.00    1620/1620        checkColumn(int, int) [9]
               0.00    0.00     698/698         checkSquare(int, int, int) [11]
-----------------------------------------------
               0.00    0.00     698/698         placeNum(int, int) [10]
[11]     0.0    0.00    0.00     698         checkSquare(int, int, int) [11]
-----------------------------------------------
               0.00    0.00     476/476         solveSudoku() [16]
[12]     0.0    0.00    0.00     476         goBack(int&, int&) [12]
-----------------------------------------------
               0.00    0.00       2/2           main [6]
[13]     0.0    0.00    0.00       2         print(int (*) [9]) [13]
-----------------------------------------------
               0.00    0.00       1/1           __libc_csu_init [30]
[14]     0.0    0.00    0.00       1         _GLOBAL__sub_I_sudoku [14]
               0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
               0.00    0.00       1/1           __libc_csu_init [30]
[15]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [15]
               0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
               0.00    0.00       1/1           main [6]
[16]     0.0    0.00    0.00       1         solveSudoku() [16]
               0.00    0.00    1120/1120        placeNum(int, int) [10]
               0.00    0.00     476/476         goBack(int&, int&) [12]
-----------------------------------------------
               0.00    0.00       1/1           main [6]
[17]     0.0    0.00    0.00       1         storePositions() [17]
-----------------------------------------------
               0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [14]
[18]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
               0.00    0.00       1/1           _GLOBAL__sub_I_temp [15]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
Index by function name
 [14] _GLOBAL__sub_I_sudoku  [16] solveSudoku()          [13] print(int (*) [9])
 [15] _GLOBAL__sub_I_temp    [17] storePositions()       [12] goBack(int&, int&)
  [9] checkColumn(int, int)  [18] __static_initialization_and_destruction_0(int, int) [8] checkRow(int, int)
 [11] checkSquare(int, int, int) [19] __static_initialization_and_destruction_0(int, int) [10] placeNum(int, int)

From the above call graph we can see that the program took virtually no measurable time to find the solution, and that the most calls were made to the checkRow, checkColumn, and checkSquare functions. However, to get a better understanding of the program, let's try a harder Sudoku puzzle.


If we run the sample-puzzle-2-hard (level: hard) file, which has the following text inside it:

      0 0 0 0 0 0 0 0 0
      0 0 0 0 0 3 0 8 5
      0 0 1 0 2 0 0 0 0
      0 0 0 5 0 7 0 0 0
      0 0 4 0 0 0 1 0 0
      0 9 0 0 0 0 0 0 0
      5 0 0 0 0 0 0 7 3
      0 0 2 0 1 0 0 0 0
      0 0 0 0 4 0 0 0 9

The output will be:

      9 8 7 6 5 4 3 2 1
      2 4 6 1 7 3 9 8 5
      3 5 1 9 2 8 7 4 6
      1 2 8 5 3 7 6 9 4
      6 3 4 8 9 2 1 5 7
      7 9 5 4 6 1 8 3 2
      5 1 9 2 8 6 4 7 3
      4 7 2 3 1 9 5 6 8
      8 6 3 7 4 5 2 1 9

The call graph for this run looks like:

Call graph
granularity: each sample hit covers 2 byte(s) for 0.04% of 26.79 seconds
index % time    self  children    called     name
                                                <spontaneous>
[1]    100.0    0.00   26.78                 main [1]
               0.68   26.09       1/1           solveSudoku() [2]
               0.01    0.00       1/1           storePositions() [9]
               0.00    0.00       2/2           print(int (*) [9]) [17]
-----------------------------------------------
               0.68   26.09       1/1           main [1]
[2]     99.9    0.68   26.09       1         solveSudoku() [2]
               3.64   21.56 157353814/157353814     placeNum(int, int) [3]
               0.89    0.00 69175252/69175252     goBack(int&, int&) [7]
-----------------------------------------------
               3.64   21.56 157353814/157353814     solveSudoku() [2]
[3]     94.1    3.64   21.56 157353814         placeNum(int, int) [3]
              13.31    0.00 622577597/622577597     checkRow(int, int) [4]
               5.04    0.00 223365661/223365661     checkColumn(int, int) [5]
               3.21    0.00 100608583/100608583     checkSquare(int, int, int) [6]
-----------------------------------------------
              13.31    0.00 622577597/622577597     placeNum(int, int) [3]
[4]     49.7   13.31    0.00 622577597         checkRow(int, int) [4]
-----------------------------------------------
               5.04    0.00 223365661/223365661     placeNum(int, int) [3]
[5]     18.8    5.04    0.00 223365661         checkColumn(int, int) [5]
-----------------------------------------------
               3.21    0.00 100608583/100608583     placeNum(int, int) [3]
[6]     12.0    3.21    0.00 100608583         checkSquare(int, int, int) [6]
-----------------------------------------------
               0.89    0.00 69175252/69175252     solveSudoku() [2]
[7]      3.3    0.89    0.00 69175252         goBack(int&, int&) [7]
-----------------------------------------------
               0.01    0.00       1/1           __libc_csu_init [10]
[8]      0.0    0.01    0.00       1         _GLOBAL__sub_I_sudoku [8]
               0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
               0.01    0.00       1/1           main [1]
[9]      0.0    0.01    0.00       1         storePositions() [9]
-----------------------------------------------
                                                <spontaneous>
[10]     0.0    0.00    0.01                 __libc_csu_init [10]
               0.01    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
               0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
-----------------------------------------------
               0.00    0.00       2/2           main [1]
[17]     0.0    0.00    0.00       2         print(int (*) [9]) [17]
-----------------------------------------------
               0.00    0.00       1/1           __libc_csu_init [10]
[18]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [18]
               0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------
               0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
               0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
[20]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------
Index by function name
  [8] _GLOBAL__sub_I_sudoku   [2] solveSudoku()          [17] print(int (*) [9])
 [18] _GLOBAL__sub_I_temp     [9] storePositions()        [7] goBack(int&, int&)
  [5] checkColumn(int, int)  [19] __static_initialization_and_destruction_0(int, int) [4] checkRow(int, int)
  [6] checkSquare(int, int, int) [20] __static_initialization_and_destruction_0(int, int) [3] placeNum(int, int)


From the above call graph we can see that the execution time increased significantly for the harder Sudoku puzzle. Moreover, almost 50% of the time is consumed by the checkRow function, 18.8% by checkColumn, and 12% by checkSquare. Hundreds of millions of calls were made to these three functions; if we parallelize them, the efficiency of the program can be increased significantly.
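
One way to expose that parallelism, sketched here purely as an illustration and not as the team's actual plan, is to fuse the three checks into a single placement test whose nine iterations are independent of one another: iteration i inspects the i-th cell of the row, the i-th cell of the column, and the i-th cell of the 3x3 box, so the iterations could be evaluated concurrently (for example, one per GPU thread) in a later assignment.

  // Illustrative fused validity test; the nine iterations are mutually
  // independent, which is what makes the check step a candidate for
  // data-parallel execution. Signatures and names are assumptions.
  bool isSafe(const int g[9][9], int r, int c, int val) {
      bool ok = true;
      int r0 = r - r % 3, c0 = c - c % 3;
      for (int i = 0; i < 9; ++i) {                         // each i could be one thread
          if (g[r][i] == val) ok = false;                   // row check
          if (g[i][c] == val) ok = false;                   // column check
          if (g[r0 + i / 3][c0 + i % 3] == val) ok = false; // 3x3 box check
      }
      return ok;
  }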


Simple Artificial Neural Network by Sukhbeer

Introduction

I am very interested in neural networks and I started learning about them recently. I think this is a good opportunity to build on my knowledge of neural networks while also parallelizing one. For that purpose, I have selected a very basic neural network which feeds forward with ReLU and softmax and back-propagates on a sample batch from the MNIST handwritten digits dataset. In each iteration, the weights are adjusted to train the network for better predictions. The code performs a matrix multiplication (dot product) each time the activation vector and the delta vector are calculated for the next node and the previous node, respectively.
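
The flat profile below shows that this multiplication is implemented as a dot() function taking two flattened std::vector<float> matrices and three dimensions. The snippet here is only an assumed sketch of what such a hotspot looks like; the signature is taken from the profile, but the parameter meanings and the body are guesses, not the original nn.cpp:

  #include <vector>

  // Assumed sketch: multiply an m x n matrix by an n x k matrix, both stored
  // row-major in flat std::vector<float>s, producing an m x k result.
  std::vector<float> dot(const std::vector<float>& a,
                         const std::vector<float>& b,
                         int m, int n, int k) {
      std::vector<float> c(m * k, 0.0f);
      for (int i = 0; i < m; ++i)           // the O(m*n*k) triple loop is why
          for (int j = 0; j < k; ++j) {     // dot() dominates the flat profile
              float sum = 0.0f;
              for (int p = 0; p < n; ++p)
                  sum += a[i * n + p] * b[p * k + j];
              c[i * k + j] = sum;
          }
      return c;
  }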

Source Code

Here is the source code used: https://cognitivedemons.wordpress.com/2018/06/08/neural-network-in-c-part-2-mnist-handwritten-digits-dataset/. The result given below compares the predictions made by the trained network after 10000 iterations with the ground truth. The ground truth is the one-hot encoding of the actual label between 0 and 9 (a 1 in the position of the corresponding digit in the dataset).

-----------------------------------------------Epoch 10000--------------------------------------------------
  Predictions:
  0.000848207 9.07445e-06 0.000145165 0.797735 4.94866e-06 0.19374 1.55013e-06 0.000244941 0.00657041 0.000700498
  1.36476e-05 1.07548e-07 8.3835e-05 0.000744837 0.299883 9.37717e-05 3.53349e-05 0.00822595 0.00210021 0.688819
  5.11556e-06 0.000616957 0.000233088 0.87458 2.20579e-05 0.0140489 5.03569e-08 0.000518445 0.0826038 0.0273714
  0.0178851 3.64621e-08 0.0174107 0.000322792 0.716312 0.00120967 0.189534 0.00303238 0.00613965 0.0481543
  7.40077e-07 0.96872 0.014224 0.00555447 2.56397e-05 0.000115577 0.000157107 0.00366156 0.00669771 0.000842866
  7.37584e-05 0.00306397 0.0184482 0.056542 0.000217984 0.0807415 0.000430994 1.09367e-05 0.838792 0.00167921
  1.23026e-05 1.10682e-09 6.47478e-07 0.000129503 1.28475e-05 1.20242e-05 1.18166e-09 0.953265 2.63176e-05 0.046541
  0.974183 3.50241e-18 1.99895e-07 3.4534e-07 2.3755e-11 0.0257772 1.96811e-09 6.99407e-09 3.92052e-05 2.28711e-08
  2.21581e-05 9.26954e-09 0.000182046 0.00336899 3.40876e-05 0.0800376 8.35955e-07 1.2496e-07 0.914781 0.00157335
  8.59312e-07 4.1739e-05 0.000106891 0.000122639 0.00018295 4.02451e-05 7.21105e-07 0.898311 0.00405182 0.0971408
Ground truth:
  0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 1
  0 0 0 1 0 0 0 0 0 0
  0 0 0 0 1 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 1 0 0
  1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 1 0 0
                                           Loss 0.184251
--------------------------------------------End of Epoch :(------------------------------------------------
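
To read the output above, take the index of the largest entry in each prediction row as the network's guess and compare it with the position of the 1 in the corresponding ground-truth row; by that measure, all ten sample rows in this epoch are classified correctly. A small sketch of that comparison, assuming each row is available as a std::vector<float> (these helpers are not part of the original nn.cpp):

  #include <algorithm>
  #include <vector>

  // Index of the largest value in one row of predictions (the predicted digit).
  int argmax(const std::vector<float>& row) {
      return static_cast<int>(std::max_element(row.begin(), row.end()) - row.begin());
  }

  // A prediction row counts as correct when its argmax matches the argmax of
  // the one-hot ground-truth row.
  bool correct(const std::vector<float>& pred, const std::vector<float>& truth) {
      return argmax(pred) == argmax(truth);
  }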


Profiling

Here are the results of profiling the program:

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
97.98   1061.73  1061.73                             dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int)
 1.41   1076.95    15.23                             transpose(float*, int, int)
 0.16   1078.65     1.70                             operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 0.14   1080.13     1.48                             operator*(float, std::vector<float, std::allocator<float> > const&)
 0.12   1081.47     1.33                             relu(std::vector<float, std::allocator<float> > const&)
 0.08   1082.34     0.87 519195026     1.68     1.68  void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&)
 0.07   1083.07     0.73                             operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 0.05   1083.63     0.56                             reluPrime(std::vector<float, std::allocator<float> > const&)
 0.03   1083.93     0.30                             softmax(std::vector<float, std::allocator<float> > const&, int)
 0.02   1084.14     0.21   442679   474.87   474.87  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&)
 0.02   1084.31     0.17 13107321    12.98    12.98  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
 0.01   1084.45     0.14                             operator/(std::vector<float, std::allocator<float> > const&, float)
 0.01   1084.58     0.13   462000   281.67   281.67  void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&)
 0.01   1084.68     0.10                             split(std::string const&, char)
 0.00   1084.68     0.00        3     0.00     0.00  std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
 0.00   1084.68     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii


Call graph


granularity: each sample hit covers 2 byte(s) for 0.00% of 1084.68 seconds

index % time self children called name

                                                <spontaneous>

[1] 97.9 1061.73 0.00 dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int) [1]


                                                <spontaneous>

[2] 1.4 15.23 0.00 transpose(float*, int, int) [2]


                                                <spontaneous>

[3] 0.2 1.70 0.00 operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [3]


                                                <spontaneous>

[4] 0.1 0.56 0.97 reluPrime(std::vector<float, std::allocator<float> > const&) [4]

               0.82    0.00 491520000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
               0.15    0.00  310000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]

                                                <spontaneous>

[5] 0.1 1.48 0.00 operator*(float, std::vector<float, std::allocator<float> > const&) [5]


                                                <spontaneous>

[6] 0.1 1.33 0.01 relu(std::vector<float, std::allocator<float> > const&) [6]

               0.00    0.00  307321/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
               0.00    0.00 2075026/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
               0.00    0.00    2679/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]

               0.00    0.00 2075026/519195026     relu(std::vector<float, std::allocator<float> > const&) [6]
               0.04    0.00 25600000/519195026     softmax(std::vector<float, std::allocator<float> > const&, int) [9]
               0.82    0.00 491520000/519195026     reluPrime(std::vector<float, std::allocator<float> > const&) [4]

[7] 0.1 0.87 0.00 519195026 void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]


                                                <spontaneous>

[8] 0.1 0.73 0.00 operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [8]


                                                <spontaneous>

[9] 0.1 0.30 0.27 softmax(std::vector<float, std::allocator<float> > const&, int) [9]

               0.17    0.00 12800000/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
               0.06    0.00  130000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
               0.04    0.00 25600000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]

                                                <spontaneous>

[10] 0.0 0.10 0.13 split(std::string const&, char) [10]

               0.13    0.00  462000/462000      void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]

               0.00    0.00    2679/442679      relu(std::vector<float, std::allocator<float> > const&) [6]
               0.06    0.00  130000/442679      softmax(std::vector<float, std::allocator<float> > const&, int) [9]
               0.15    0.00  310000/442679      reluPrime(std::vector<float, std::allocator<float> > const&) [4]

[11] 0.0 0.21 0.00 442679 void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]


               0.00    0.00  307321/13107321     relu(std::vector<float, std::allocator<float> > const&) [6]
               0.17    0.00 12800000/13107321     softmax(std::vector<float, std::allocator<float> > const&, int) [9]

[12] 0.0 0.17 0.00 13107321 void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]


                                                <spontaneous>

[13] 0.0 0.14 0.00 operator/(std::vector<float, std::allocator<float> > const&, float) [13]


               0.13    0.00  462000/462000      split(std::string const&, char) [10]

[14] 0.0 0.13 0.00 462000 void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]


               0.00    0.00       3/3           random_vector(int) [28]

[22] 0.0 0.00 0.00 3 std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&) [22]


               0.00    0.00       1/1           __libc_csu_init [38]

[23] 0.0 0.00 0.00 1 _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii [23]


Index by function name

 [23] _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii (nn.cpp) [2] transpose(float*, int, int) [13] operator/(std::vector<float, std::allocator<float> > const&, float)
  [1] dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int) [14] void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [3] operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  [6] relu(std::vector<float, std::allocator<float> > const&) [7] void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [8] operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 [10] split(std::string const&, char) [11] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [5] operator*(float, std::vector<float, std::allocator<float> > const&)
  [9] softmax(std::vector<float, std::allocator<float> > const&, int) [12] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
  [4] reluPrime(std::vector<float, std::allocator<float> > const&) [22] std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
Analysis

The total execution time of the program is around 10 minutes. As is evident from the profiling results, most of the execution time is taken up by the dot() function, since it performs matrix-matrix multiplication. This is the hotspot of the program, and it can be made more efficient by doing this computation and the other vector operations on the GPU.
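
The property that makes dot() a good fit for the GPU is that every element of the output matrix is an independent dot product of one row of the first operand with one column of the second, so all m x k output elements can be computed concurrently. A per-element sketch, using the same assumed row-major layout as the earlier dot() sketch:

  #include <vector>

  // Each output element C(i, j) depends only on row i of A and column j of B,
  // so the m * k elements could each be assigned to their own GPU thread in a
  // later parallel version. Layout and names are assumptions.
  float dotElement(const std::vector<float>& a, const std::vector<float>& b,
                   int n, int k, int i, int j) {
      float sum = 0.0f;
      for (int p = 0; p < n; ++p)
          sum += a[i * n + p] * b[p * k + j];
      return sum;
  }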


Assignment 2

Assignment 3