# Project Name Goes Here

## Team Members

1. Sukhbeer Dhillon, Simple Backpropagation Neural Network
2. Gurpreet Singh, Sudoku Puzzle Solver
3. Edgar Giang, Some other responsibility
4. Email All

## Progress

### Assignment 1

#### Sudoku Puzzle Solver by Gurpreet Singh

This is a program that solves 9x9 Sudoku puzzles using a brute-force algorithm. The user can either pass a Sudoku file as input or enter the values manually. In either case, the input must have exactly 9 rows and 9 columns, the cells must be separated by a space, and the cells that need to be solved must contain 0 as their value.
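Parsing this format takes only a few lines of C++. The sketch below is illustrative only (the function name `readPuzzle` is made up here, and the original solver stores the grid in its own way); it assumes whitespace-separated integers:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read a 9x9 grid of space-separated values; 0 marks a cell to be solved.
std::vector<std::vector<int>> readPuzzle(std::istream& in) {
    std::vector<std::vector<int>> grid(9, std::vector<int>(9));
    for (int r = 0; r < 9; ++r)
        for (int c = 0; c < 9; ++c)
            if (!(in >> grid[r][c]))
                throw std::runtime_error("input must have 9 rows and 9 columns");
    return grid;
}
```

Reading via `operator>>` accepts any whitespace between cells, so both file input and manual entry can go through the same routine.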

The original source code can be found at Link

##### Logic

In this program the brute-force algorithm first puts 1 in the first empty cell. It then moves to the next empty cell, puts 1 there, and checks whether that value satisfies all the rules. If it does not, the algorithm increments the value to 2 and checks again, trying each value from 1 to 9 to find one that fits the cell. If no value in that range works, the program moves back, increments the value of the previous cell, and tries the whole process again. In this way it eventually solves the puzzle.
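The logic described above amounts to backtracking search and can be sketched as a small recursive routine. This is a minimal standalone sketch rather than the original solver.cpp; the helper `isValid` stands in for the project's checkRow, checkColumn, and checkSquare checks:

```cpp
#include <array>
#include <cassert>

using Grid = std::array<std::array<int, 9>, 9>;

// Check whether placing v at (row, col) violates any Sudoku rule.
bool isValid(const Grid& g, int row, int col, int v) {
    for (int i = 0; i < 9; ++i)
        if (g[row][i] == v || g[i][col] == v) return false;
    int br = row / 3 * 3, bc = col / 3 * 3;   // top-left of the 3x3 box
    for (int r = br; r < br + 3; ++r)
        for (int c = bc; c < bc + 3; ++c)
            if (g[r][c] == v) return false;
    return true;
}

// Try values 1..9 in the first empty cell; on a dead end, return
// to the caller, which resumes with the next value for its own cell.
bool solve(Grid& g) {
    for (int row = 0; row < 9; ++row)
        for (int col = 0; col < 9; ++col)
            if (g[row][col] == 0) {
                for (int v = 1; v <= 9; ++v)
                    if (isValid(g, row, col, v)) {
                        g[row][col] = v;
                        if (solve(g)) return true;
                        g[row][col] = 0;   // undo and try the next value
                    }
                return false;              // no value fits: backtrack
            }
    return true;                           // no empty cell left: solved
}
```

The recursion plays the role of "iterating back": when no value from 1 to 9 fits a cell, the function returns false and the previous cell resumes with its next candidate.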

##### Compiling the program

Enter the following commands:

```
g++ -std=c++0x -pg solver.cpp checks.cpp checksolution.cpp -o a
./a fileName
```

`-pg` directs the compiler to include the executable code required for profiling.

`-o` directs the compiler to name the executable `a`.

If we run the sample-puzzle-1 (level: easy) file, which has the following text inside it:

```
0 6 0 0 0 0 9 7 2
0 5 0 0 0 2 0 0 3
0 7 0 3 9 0 5 0 0
2 0 0 0 0 5 4 0 8
0 0 0 0 0 0 0 0 0
3 0 1 8 0 0 0 0 6
0 0 4 0 2 3 0 8 0
7 0 0 9 0 0 0 2 0
9 2 5 0 0 0 0 4 0
```

The output will be:

```
1 6 3 4 5 8 9 7 2
4 5 9 7 1 2 8 6 3
8 7 2 3 9 6 5 1 4
2 9 7 1 6 5 4 3 8
5 8 6 2 3 4 1 9 7
3 4 1 8 7 9 2 5 6
6 1 4 5 2 3 7 8 9
7 3 8 9 4 1 6 2 5
9 2 5 6 8 7 3 4 1
```

##### Analysis

To analyze the call graph, enter the following command:

```
gprof -q -b a > a.clg
```

`-q` directs the profiler (gprof) to output a call graph.

`-b` directs the profiler to omit detailed explanations of the column headings from the output.

The call graph for the above execution looks like:

```
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time    self  children    called     name
0.00    0.00    4539/4539        placeNum(int, int) [10]
[8]      0.0    0.00    0.00    4539         checkRow(int, int) [8]
-----------------------------------------------
0.00    0.00    1620/1620        placeNum(int, int) [10]
[9]      0.0    0.00    0.00    1620         checkColumn(int, int) [9]
-----------------------------------------------
0.00    0.00    1120/1120        solveSudoku() [16]
[10]     0.0    0.00    0.00    1120         placeNum(int, int) [10]
0.00    0.00    4539/4539        checkRow(int, int) [8]
0.00    0.00    1620/1620        checkColumn(int, int) [9]
0.00    0.00     698/698         checkSquare(int, int, int) [11]
-----------------------------------------------
0.00    0.00     698/698         placeNum(int, int) [10]
[11]     0.0    0.00    0.00     698         checkSquare(int, int, int) [11]
-----------------------------------------------
0.00    0.00     476/476         solveSudoku() [16]
[12]     0.0    0.00    0.00     476         goBack(int&, int&) [12]
-----------------------------------------------
0.00    0.00       2/2           main [6]
[13]     0.0    0.00    0.00       2         print(int (*) [9]) [13]
-----------------------------------------------
0.00    0.00       1/1           __libc_csu_init [30]
[14]     0.0    0.00    0.00       1         _GLOBAL__sub_I_sudoku [14]
0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
0.00    0.00       1/1           __libc_csu_init [30]
[15]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [15]
0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
0.00    0.00       1/1           main [6]
[16]     0.0    0.00    0.00       1         solveSudoku() [16]
0.00    0.00    1120/1120        placeNum(int, int) [10]
0.00    0.00     476/476         goBack(int&, int&) [12]
-----------------------------------------------
0.00    0.00       1/1           main [6]
[17]     0.0    0.00    0.00       1         storePositions() [17]
-----------------------------------------------
0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [14]
[18]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
0.00    0.00       1/1           _GLOBAL__sub_I_temp [15]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
Index by function name
[14] _GLOBAL__sub_I_sudoku  [16] solveSudoku()          [13] print(int (*) [9])
[15] _GLOBAL__sub_I_temp    [17] storePositions()       [12] goBack(int&, int&)
[9] checkColumn(int, int)  [18] __static_initialization_and_destruction_0(int, int) [8] checkRow(int, int)
[11] checkSquare(int, int, int) [19] __static_initialization_and_destruction_0(int, int) [10] placeNum(int, int)

```

From the above call graph we can see that the program took no measurable time to find the solution, with the checkRow, checkColumn, and checkSquare functions accounting for most of the calls. However, to get a better understanding of the program, let's try a harder Sudoku puzzle.

If we run the sample-puzzle-2-hard (level: hard) file, which has the following text inside it:

```
0 0 0 0 0 0 0 0 0
0 0 0 0 0 3 0 8 5
0 0 1 0 2 0 0 0 0
0 0 0 5 0 7 0 0 0
0 0 4 0 0 0 1 0 0
0 9 0 0 0 0 0 0 0
5 0 0 0 0 0 0 7 3
0 0 2 0 1 0 0 0 0
0 0 0 0 4 0 0 0 9
```

The output will be:

```
9 8 7 6 5 4 3 2 1
2 4 6 1 7 3 9 8 5
3 5 1 9 2 8 7 4 6
1 2 8 5 3 7 6 9 4
6 3 4 8 9 2 1 5 7
7 9 5 4 6 1 8 3 2
5 1 9 2 8 6 4 7 3
4 7 2 3 1 9 5 6 8
8 6 3 7 4 5 2 1 9
```

The call graph for this run looks like:

```
Call graph
granularity: each sample hit covers 2 byte(s) for 0.04% of 26.79 seconds
index % time    self  children    called     name
<spontaneous>
[1]    100.0    0.00   26.78                 main [1]
0.68   26.09       1/1           solveSudoku() [2]
0.01    0.00       1/1           storePositions() [9]
0.00    0.00       2/2           print(int (*) [9]) [17]
-----------------------------------------------
0.68   26.09       1/1           main [1]
[2]     99.9    0.68   26.09       1         solveSudoku() [2]
3.64   21.56 157353814/157353814     placeNum(int, int) [3]
0.89    0.00 69175252/69175252     goBack(int&, int&) [7]
-----------------------------------------------
3.64   21.56 157353814/157353814     solveSudoku() [2]
[3]     94.1    3.64   21.56 157353814         placeNum(int, int) [3]
13.31    0.00 622577597/622577597     checkRow(int, int) [4]
5.04    0.00 223365661/223365661     checkColumn(int, int) [5]
3.21    0.00 100608583/100608583     checkSquare(int, int, int) [6]
-----------------------------------------------
13.31    0.00 622577597/622577597     placeNum(int, int) [3]
[4]     49.7   13.31    0.00 622577597         checkRow(int, int) [4]
-----------------------------------------------
5.04    0.00 223365661/223365661     placeNum(int, int) [3]
[5]     18.8    5.04    0.00 223365661         checkColumn(int, int) [5]
-----------------------------------------------
3.21    0.00 100608583/100608583     placeNum(int, int) [3]
[6]     12.0    3.21    0.00 100608583         checkSquare(int, int, int) [6]
-----------------------------------------------
0.89    0.00 69175252/69175252     solveSudoku() [2]
[7]      3.3    0.89    0.00 69175252         goBack(int&, int&) [7]
-----------------------------------------------
0.01    0.00       1/1           __libc_csu_init [10]
[8]      0.0    0.01    0.00       1         _GLOBAL__sub_I_sudoku [8]
0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
0.01    0.00       1/1           main [1]
[9]      0.0    0.01    0.00       1         storePositions() [9]
-----------------------------------------------
<spontaneous>
[10]     0.0    0.00    0.01                 __libc_csu_init [10]
0.01    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
-----------------------------------------------
0.00    0.00       2/2           main [1]
[17]     0.0    0.00    0.00       2         print(int (*) [9]) [17]
-----------------------------------------------
0.00    0.00       1/1           __libc_csu_init [10]
[18]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [18]
0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------
0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
[20]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------
Index by function name
[8] _GLOBAL__sub_I_sudoku   [2] solveSudoku()          [17] print(int (*) [9])
[18] _GLOBAL__sub_I_temp     [9] storePositions()        [7] goBack(int&, int&)
[5] checkColumn(int, int)  [19] __static_initialization_and_destruction_0(int, int) [4] checkRow(int, int)
[6] checkSquare(int, int, int) [20] __static_initialization_and_destruction_0(int, int) [3] placeNum(int, int)
```

From the above call graph we can see that the time increased significantly for the harder puzzle. Moreover, almost 50% of the time is consumed by the checkRow function, 18.8% by checkColumn, and 12% by checkSquare. Hundreds of millions of calls were made to these three functions; if we parallelize them, the efficiency of the program could be increased significantly.
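One possible shape for that parallelization, sketched with `std::async`, is to run the three independent checks concurrently. The function names below are illustrative, not the project's exact signatures, and whether this pays off is an open question: each check scans at most nine cells, so thread-launch overhead may dominate, and a coarser-grained split could be needed in practice:

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <future>

using Grid = std::array<std::array<int, 9>, 9>;

bool checkRowFree(const Grid& g, int row, int v) {
    for (int c = 0; c < 9; ++c) if (g[row][c] == v) return false;
    return true;
}
bool checkColFree(const Grid& g, int col, int v) {
    for (int r = 0; r < 9; ++r) if (g[r][col] == v) return false;
    return true;
}
bool checkBoxFree(const Grid& g, int row, int col, int v) {
    int br = row / 3 * 3, bc = col / 3 * 3;   // top-left of the 3x3 box
    for (int r = br; r < br + 3; ++r)
        for (int c = bc; c < bc + 3; ++c)
            if (g[r][c] == v) return false;
    return true;
}

// Launch two of the three independent checks on worker threads,
// run the third on this thread, and combine the results.
bool canPlace(const Grid& g, int row, int col, int v) {
    auto rowOk = std::async(std::launch::async, checkRowFree, std::cref(g), row, v);
    auto colOk = std::async(std::launch::async, checkColFree, std::cref(g), col, v);
    bool boxOk = checkBoxFree(g, row, col, v);
    return rowOk.get() && colOk.get() && boxOk;
}
```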

#### Simple Artificial Neural Network by Sukhbeer

##### Introduction

I am very interested in neural networks and started learning about them recently. This is a good opportunity to build on my knowledge of neural networks while also parallelising one. For that purpose, I have selected a very basic network that feeds forward with ReLU and softmax and back-propagates on a sample batch from the MNIST handwritten-digits dataset. In each iteration, the weights are adjusted to train the network for better predictions. The code performs a matrix multiplication (dot product) each time the activation vector and the delta vector are calculated, for the next layer and the previous layer respectively.
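For reference, the two activations can be sketched as follows. This is an illustrative sketch rather than the project's code (whose softmax, per the profiling results, also takes an int parameter):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// ReLU: max(0, x) applied element-wise.
std::vector<float> relu(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::max(0.0f, x[i]);
    return out;
}

// Softmax: exponentiate (shifted by the max for numerical stability)
// and normalise so the outputs form a probability distribution --
// the per-digit probabilities shown in the results.
std::vector<float> softmax(const std::vector<float>& x) {
    float m = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}
```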

##### Source Code

Here is the source code used. The result given below compares the predictions made by the trained network after 10000 iterations against the ground truth: the actual labels for the digits 0-9, one-hot encoded (a 1 marks the true digit for each sample).

```
-----------------------------------------------Epoch 10000--------------------------------------------------
Predictions:
0.000848207 9.07445e-06 0.000145165 0.797735 4.94866e-06 0.19374 1.55013e-06 0.000244941 0.00657041 0.000700498
1.36476e-05 1.07548e-07 8.3835e-05 0.000744837 0.299883 9.37717e-05 3.53349e-05 0.00822595 0.00210021 0.688819
5.11556e-06 0.000616957 0.000233088 0.87458 2.20579e-05 0.0140489 5.03569e-08 0.000518445 0.0826038 0.0273714
0.0178851 3.64621e-08 0.0174107 0.000322792 0.716312 0.00120967 0.189534 0.00303238 0.00613965 0.0481543
7.40077e-07 0.96872 0.014224 0.00555447 2.56397e-05 0.000115577 0.000157107 0.00366156 0.00669771 0.000842866
7.37584e-05 0.00306397 0.0184482 0.056542 0.000217984 0.0807415 0.000430994 1.09367e-05 0.838792 0.00167921
1.23026e-05 1.10682e-09 6.47478e-07 0.000129503 1.28475e-05 1.20242e-05 1.18166e-09 0.953265 2.63176e-05 0.046541
0.974183 3.50241e-18 1.99895e-07 3.4534e-07 2.3755e-11 0.0257772 1.96811e-09 6.99407e-09 3.92052e-05 2.28711e-08
2.21581e-05 9.26954e-09 0.000182046 0.00336899 3.40876e-05 0.0800376 8.35955e-07 1.2496e-07 0.914781 0.00157335
8.59312e-07 4.1739e-05 0.000106891 0.000122639 0.00018295 4.02451e-05 7.21105e-07 0.898311 0.00405182 0.0971408
```
```
Ground truth:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0
Loss 0.184251
--------------------------------------------End of Epoch :(------------------------------------------------
```

##### Profiling

Here are the results of profiling the program.

Flat profile:

```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
97.98   1061.73  1061.73                             dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int)
1.41   1076.95    15.23                             transpose(float*, int, int)
0.16   1078.65     1.70                             operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
0.14   1080.13     1.48                             operator*(float, std::vector<float, std::allocator<float> > const&)
0.12   1081.47     1.33                             relu(std::vector<float, std::allocator<float> > const&)
0.08   1082.34     0.87 519195026     1.68     1.68  void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&)
0.07   1083.07     0.73                             operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
0.05   1083.63     0.56                             reluPrime(std::vector<float, std::allocator<float> > const&)
0.03   1083.93     0.30                             softmax(std::vector<float, std::allocator<float> > const&, int)
0.02   1084.14     0.21   442679   474.87   474.87  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&)
0.02   1084.31     0.17 13107321    12.98    12.98  void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
0.01   1084.45     0.14                             operator/(std::vector<float, std::allocator<float> > const&, float)
0.01   1084.58     0.13   462000   281.67   281.67  void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&)
0.01   1084.68     0.10                             split(std::string const&, char)
0.00   1084.68     0.00        3     0.00     0.00  std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
0.00   1084.68     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii
```

Call graph:

```
granularity: each sample hit covers 2 byte(s) for 0.00% of 1084.68 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     97.9 1061.73    0.00                 dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int) [1]
-----------------------------------------------
                                                 <spontaneous>
[2]      1.4   15.23    0.00                 transpose(float*, int, int) [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      0.2    1.70    0.00                 operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [3]
-----------------------------------------------
                                                 <spontaneous>
[4]      0.1    0.56    0.97                 reluPrime(std::vector<float, std::allocator<float> > const&) [4]
                0.82    0.00 491520000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
                0.15    0.00  310000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                                                 <spontaneous>
[5]      0.1    1.48    0.00                 operator*(float, std::vector<float, std::allocator<float> > const&) [5]
-----------------------------------------------
                                                 <spontaneous>
[6]      0.1    1.33    0.01                 relu(std::vector<float, std::allocator<float> > const&) [6]
                0.00    0.00  307321/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
                0.00    0.00 2075026/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
                0.00    0.00    2679/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                0.00    0.00 2075026/519195026     relu(std::vector<float, std::allocator<float> > const&) [6]
                0.04    0.00 25600000/519195026     softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.82    0.00 491520000/519195026     reluPrime(std::vector<float, std::allocator<float> > const&) [4]
[7]      0.1    0.87    0.00 519195026         void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.1    0.73    0.00                 operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.1    0.30    0.27                 softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.17    0.00 12800000/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
                0.06    0.00  130000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
                0.04    0.00 25600000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
-----------------------------------------------
                                                 <spontaneous>
[10]     0.0    0.10    0.13                 split(std::string const&, char) [10]
                0.13    0.00  462000/462000      void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]
-----------------------------------------------
                0.00    0.00    2679/442679      relu(std::vector<float, std::allocator<float> > const&) [6]
                0.06    0.00  130000/442679      softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.15    0.00  310000/442679      reluPrime(std::vector<float, std::allocator<float> > const&) [4]
[11]     0.0    0.21    0.00  442679         void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                0.00    0.00  307321/13107321     relu(std::vector<float, std::allocator<float> > const&) [6]
                0.17    0.00 12800000/13107321     softmax(std::vector<float, std::allocator<float> > const&, int) [9]
[12]     0.0    0.17    0.00 13107321         void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
-----------------------------------------------
                                                 <spontaneous>
[13]     0.0    0.14    0.00                 operator/(std::vector<float, std::allocator<float> > const&, float) [13]
-----------------------------------------------
                0.13    0.00  462000/462000      split(std::string const&, char) [10]
[14]     0.0    0.13    0.00  462000         void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]
-----------------------------------------------
                0.00    0.00       3/3           random_vector(int) [28]
[22]     0.0    0.00    0.00       3         std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&) [22]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [38]
[23]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii [23]
-----------------------------------------------
Index by function name

 [23] _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii (nn.cpp) [2] transpose(float*, int, int) [13] operator/(std::vector<float, std::allocator<float> > const&, float)
  [1] dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int) [14] void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [3] operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  [6] relu(std::vector<float, std::allocator<float> > const&) [7] void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [8] operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
 [10] split(std::string const&, char) [11] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [5] operator*(float, std::vector<float, std::allocator<float> > const&)
  [9] softmax(std::vector<float, std::allocator<float> > const&, int) [12] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
  [4] reluPrime(std::vector<float, std::allocator<float> > const&) [22] std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
```
##### Analysis

The total execution time of the program is around 18 minutes (1084.68 seconds of sampled time). As is evident from the profiling results, most of the execution time is spent in the dot() function, which performs the matrix-matrix multiplication. This is the hotspot of the program, and it can be made more efficient by performing this computation and the other vector operations on the GPU.
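As a rough sketch of why dot() parallelises well (illustrative code, not the project's implementation): each output element of the matrix product is independent, which is exactly the data-parallelism a GPU kernel would exploit with one thread per element. The OpenMP pragma below stands in for that idea on the CPU and is simply ignored if the compiler is not invoked with `-fopenmp`:

```cpp
#include <cassert>
#include <vector>

// Row-major product of an (n x k) and a (k x m) matrix.
// Every out[i * m + j] is computed independently, so the outer loop
// can be parallelised; a GPU kernel would map one thread per element.
std::vector<float> dot(const std::vector<float>& a, const std::vector<float>& b,
                       int n, int k, int m) {
    std::vector<float> out(static_cast<std::size_t>(n) * m, 0.0f);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int p = 0; p < k; ++p)          // loop order keeps b accesses sequential
            for (int j = 0; j < m; ++j)
                out[i * m + j] += a[i * k + p] * b[p * m + j];
    return out;
}
```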