# Changes

## TriForce

, 13:17, 8 April 2019
Kernel Optimization Attempts
Sudoku Solver Profiling

Rather than continually increasing the difficulty of a 9x9 sudoku, I modified the program I found to handle larger and larger sudokus by increasing the size of the matrices that make up the sudoku: starting with a 9x9 sudoku, which is nine 3x3 matrices, then 16x16, which is sixteen 4x4 matrices, and finally 25x25, which is twenty-five 5x5 matrices. The logic of the program is unchanged (only constants differ), so larger sudokus are solved the same way as a normal one.
Source code from: https://www.geeksforgeeks.org/sudoku-backtracking-7/
{| class="wikitable mw-collapsible mw-collapsed"
! Original Code:
|-
|
// A Backtracking program in C++ to solve Sudoku problem
/* Check if 'num' is not already placed in current row,
current column and current 3x3 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%3 , col - col%3, num)&& grid[row][col]==UNASSIGNED;
}
/* Check if 'num' is not already placed in current row,
current column and current 4x4 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%4 , col - col%4, num)&& grid[row][col]==UNASSIGNED;
}
/* Check if 'num' is not already placed in current row,
current column and current 5x5 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%5 , col - col%5, num)&& grid[row][col]==UNASSIGNED;
}
return 0;
}
|}
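The three fragments above differ only in the box width (3, 4, 5). As a minimal sketch of the "only constants change" idea, the check can be written once against a single constant. The names <code>BOX_W</code>, <code>N</code>, <code>Grid</code>, and the lower-case helper names are assumptions made for this illustration; they are not taken from the original source.

```cpp
#include <array>

// Assumed constants for illustration: change BOX_W to 4 or 5 for 16x16 or 25x25.
constexpr int BOX_W = 3;
constexpr int N = BOX_W * BOX_W;
constexpr int UNASSIGNED = 0;
using Grid = std::array<std::array<int, N>, N>;

// True if num already appears somewhere in the given row.
bool usedInRow(const Grid& grid, int row, int num) {
    for (int col = 0; col < N; ++col)
        if (grid[row][col] == num) return true;
    return false;
}

// True if num already appears somewhere in the given column.
bool usedInCol(const Grid& grid, int col, int num) {
    for (int row = 0; row < N; ++row)
        if (grid[row][col] == num) return true;
    return false;
}

// True if num already appears in the BOX_W x BOX_W box whose top-left
// corner is (boxStartRow, boxStartCol).
bool usedInBox(const Grid& grid, int boxStartRow, int boxStartCol, int num) {
    for (int row = 0; row < BOX_W; ++row)
        for (int col = 0; col < BOX_W; ++col)
            if (grid[boxStartRow + row][boxStartCol + col] == num) return true;
    return false;
}

// Same test as the fragments above, with the box width taken from BOX_W.
bool isSafe(const Grid& grid, int row, int col, int num) {
    return !usedInRow(grid, row, num) && !usedInCol(grid, col, num) &&
           !usedInBox(grid, row - row % BOX_W, col - col % BOX_W, num) &&
           grid[row][col] == UNASSIGNED;
}
```

With this layout, moving from 9x9 to 16x16 or 25x25 only requires changing <code>BOX_W</code> and supplying a grid of the matching size.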
Obtaining flat profiles and call graphs on matrix environment:
Attempted to run the program with a number of files (8K resolution):
{| class="wikitable mw-collapsible mw-collapsed"
! Sample Images
|-
|
[[File:Cabin small.jpg]]
[[File:Cabin2 small.jpg]]
|}
{| class="wikitable mw-collapsible mw-collapsed"
[[File:Julia.jpg]]
|}

This problem would be fairly simple to parallelize. In an image created from a Julia set, each pixel is independent of the others. The problem involves complex numbers, but these can be represented simply by using two arrays, or pairs of floats.

==== Assignment 1: Selection for parallelizing ====

After reviewing the three programs above, we decided to attempt to parallelize the Sudoku Solver Program for a few reasons.

1. By increasing the dimensions of the smaller matrices that make up a sudoku by one, we see a major increase in the time it takes to solve the sudoku, from almost instantly to around 38 seconds, and then to '''36 minutes'''. With a 25x25 sudoku (of 5x5 matrices), several functions were called over '''100 million times'''.

2. Based on the massive time increases and the similarity to the Hamiltonian Path Problem [https://www.hackerearth.com/practice/algorithms/graphs/hamiltonian-path/tutorial/], which also uses backtracking to find a solution, we believe the run time of the sudoku solver approaches O(n!), where 'n' is the number of blank spaces in the sudoku. The solver uses recursion to try every possible candidate, returning to previous steps if the attempted value does not work. O(n!) grows far faster than a polynomial runtime such as O(n^2).

3. The Julia sets still took less than 6 minutes after increasing the image size, and the EasyBMP program only took a few seconds to convert a large, high-resolution image. Therefore, the Sudoku Solver had the greatest amount of time to be shaved off through optimization and thus offered the most challenge.
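
To make the growth in recursive calls concrete, the following sketch counts every call made by a plain backtracking solver. It is a simplified stand-in written for this page, not the profiled program: the 4x4 grid size and the names <code>canPlace</code> and <code>solveCounted</code> are assumptions. Since each blank can try up to N values, the number of calls can grow roughly like N to the power of the number of blanks in the worst case.

```cpp
#include <array>

constexpr int BOX_W = 2;          // assumed: 2x2 boxes keep the demo small
constexpr int N = BOX_W * BOX_W;  // 4x4 grid
using Grid = std::array<std::array<int, N>, N>;

// True if num does not yet appear in the row, column, or box of (row, col).
bool canPlace(const Grid& g, int row, int col, int num) {
    for (int i = 0; i < N; ++i)
        if (g[row][i] == num || g[i][col] == num) return false;
    int r0 = row - row % BOX_W, c0 = col - col % BOX_W;
    for (int r = 0; r < BOX_W; ++r)
        for (int c = 0; c < BOX_W; ++c)
            if (g[r0 + r][c0 + c] == num) return false;
    return true;
}

// Backtracking solver that adds one to `calls` on every invocation.
bool solveCounted(Grid& g, long long& calls) {
    ++calls;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col) {
            if (g[row][col] != 0) continue;
            for (int num = 1; num <= N; ++num) {
                if (!canPlace(g, row, col, num)) continue;
                g[row][col] = num;
                if (solveCounted(g, calls)) return true;
                g[row][col] = 0;  // undo and try the next value
            }
            return false;  // nothing fits the first empty cell: backtrack
        }
    return true;  // no empty cells left
}
```

Running the same counter with <code>BOX_W</code> set to 3, 4, and 5 on suitable puzzles is one way to compare call counts against the profiler output.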
=== Assignment 2 ===
This code is unable to solve the 16x16 sudoku in any reasonable amount of time (I stopped it at 10+ minutes).
With the 130+ empty spaces in the grid, I estimate over 130^2 (about 16,900) calls to cudaMemcpy either way...

So we need an algorithm which will check each open spot, calculate all possible values which can fit there, and assign the value whenever only one fits.
We can also check each section (box, row, column) for values which can only go in one place.
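
The two rules above can be sketched on the host before moving them into a kernel. This is a sequential illustration under assumptions (a 4x4 grid, and the hypothetical names <code>canPlace</code>, <code>placesFor</code>, <code>fillSingles</code>, <code>propagate</code>); it is not the kernel used later on this page.

```cpp
#include <array>

constexpr int BOX_W = 2;          // assumed small size for the demo
constexpr int N = BOX_W * BOX_W;
using Grid = std::array<std::array<int, N>, N>;

// True if num does not yet appear in the row, column, or box of (row, col).
bool canPlace(const Grid& g, int row, int col, int num) {
    for (int i = 0; i < N; ++i)
        if (g[row][i] == num || g[i][col] == num) return false;
    int r0 = row - row % BOX_W, c0 = col - col % BOX_W;
    for (int r = 0; r < BOX_W; ++r)
        for (int c = 0; c < BOX_W; ++c)
            if (g[r0 + r][c0 + c] == num) return false;
    return true;
}

// Counts the empty cells inside the h x w rectangle at (r0, c0) that could take num.
int placesFor(const Grid& g, int num, int r0, int c0, int h, int w) {
    int places = 0;
    for (int r = r0; r < r0 + h; ++r)
        for (int c = c0; c < c0 + w; ++c)
            if (g[r][c] == 0 && canPlace(g, r, c, num)) ++places;
    return places;
}

// One sweep over the grid. Fills a cell when only one value fits there, or
// when a value that fits has no other place in its row, column, or box.
// Returns the number of cells filled.
int fillSingles(Grid& g) {
    int filled = 0;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col) {
            if (g[row][col] != 0) continue;
            int count = 0, guess = 0;
            for (int num = 1; num <= N; ++num) {
                if (!canPlace(g, row, col, num)) continue;
                ++count;
                guess = num;
                int br = row - row % BOX_W, bc = col - col % BOX_W;
                if (placesFor(g, num, row, 0, 1, N) == 1 ||
                    placesFor(g, num, 0, col, N, 1) == 1 ||
                    placesFor(g, num, br, bc, BOX_W, BOX_W) == 1) {
                    count = 1;  // this value can only go here in some section
                    break;
                }
            }
            if (count == 1) { g[row][col] = guess; ++filled; }
        }
    return filled;
}

// Repeats sweeps until nothing changes; returns true if the grid is full.
bool propagate(Grid& g) {
    while (fillSingles(g) > 0) {}
    for (const auto& row : g)
        for (int v : row)
            if (v == 0) return false;
    return true;
}
```

Unlike backtracking, this approach never guesses, so it stops without a solution when no cell can be decided by either rule.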

{| class="wikitable mw-collapsible mw-collapsed"
! Attempt One...
|-
|
'''Single Pass Sudoku Solver'''

This kernel was designed to run on a single block with dimensions N * N, the size of the Sudoku,
limiting us to a Sudoku of size 25 * 25 (625 threads fit within the 1,024-thread limit of a block on current CUDA devices; 36 * 36 = 1,296 would not).
For each empty space, it counts the number of possible values which can fit and how many times each value can fit in that section.
If only one value can fit, or that value has only one place, it assigns the value.

__global__ void superSolve(int * d_a) {
}
}
[[File:Unoptimized_vs_OptimizedBacktrack_vs_Kernel.png]]
=== Assignment 3 ===
Reduced superSolve runtime from 5.2 ms to 3.8 ms.
Changes:
#include <cstdio>
#include <cuda_runtime.h>

#define UNASSIGNED 0  // assumed: 0 marks an empty square
#define BOX_W 5       // assumed: 5x5 boxes for the 25x25 Sudoku
#define N (BOX_W * BOX_W)

__global__ void solve(int* d_a) {
    // Used to remember which row | col | box ( section ) have which values
    __shared__ bool rowHas[N][N];
    __shared__ bool colHas[N][N];
    __shared__ bool boxHas[N][N];
    // Used to ensure that the table has changed
    __shared__ bool changed;
    // Number of spaces which can place the number in each section
    __shared__ int rowCount[N][N];
    __shared__ int colCount[N][N];
    __shared__ int boxCount[N][N];

    // Where the square is located in the Sudoku
    int row = threadIdx.x;
    int col = threadIdx.y;
    int box = row / BOX_W + (col / BOX_W) * BOX_W;

    // Unique identifier for each square in row, col, box
    // Corresponds to the generic Sudoku Solve
    // Using a Sudoku to solve a Sudoku !!!
    int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);

    // Square's location in the Sudoku
    int gridIdx = col * N + row;
    int at = d_a[gridIdx];

    bool notSeen[N];
    for (int i = 0; i < N; ++i)
        notSeen[i] = true;

    rowHas[col][row] = false;
    colHas[col][row] = false;
    boxHas[col][row] = false;
    __syncthreads();

    if (at != UNASSIGNED) {
        rowHas[row][at - 1] = true;
        colHas[col][at - 1] = true;
        boxHas[box][at - 1] = true;
    }

    // Previous loop has not changed any values
    do {
        // RESET counters
        rowCount[col][row] = 0;
        colCount[col][row] = 0;
        boxCount[col][row] = 0;
        __syncthreads();

        if (gridIdx == 0) // forget previous change
            changed = false;
        int count = 0; // number of values which can fit in this square
        int guess = 0; // last value found which can fit in this square
        for (int idx = 0; idx < N; ++idx) {
            // Ensures that every square in each section is working on a different number in the section
            int num = (idx + offset) % N;
            if (at == UNASSIGNED && notSeen[num]) {
                if (rowHas[row][num] || boxHas[box][num] || colHas[col][num])
                    notSeen[num] = false;
                else {
                    ++count;
                    guess = num;
                    rowCount[row][num]++;
                    colCount[col][num]++;
                    boxCount[box][num]++;
                }
            }
            __syncthreads();
        }

        // Find values which can go in only one spot in the section
        for (int idx = 0; idx < N && count > 1; ++idx) {
            if (notSeen[idx] &&
                (rowCount[row][idx] == 1 || boxCount[box][idx] == 1 || colCount[col][idx] == 1)) {
                // In this section this value can only appear in this square
                guess = idx;
                count = 1;
            }
        }

        if (count == 1) {
            at = guess + 1;
            rowHas[row][guess] = true;
            colHas[col][guess] = true;
            boxHas[box][guess] = true;
            changed = true;
        }
        __syncthreads();
    } while (changed);

    //SOLVED CHECK
    if (!(rowHas[row][col] || colHas[row][col] || boxHas[row][col]))
        changed = true;
    __syncthreads();

    // An unsolved table is reported by zeroing the first square
    if (changed && gridIdx == 0)
        at = 0;

    d_a[gridIdx] = at;
}

void print(int result[N][N]) {
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++)
            printf("%3d", result[row][col]);
        printf("\n");
    }
}

// Driver program to test main program functions
int main() {
    int h_a[N][N] = {
        // ... 25 rows of 25 values each, 0 for an empty square ...
    };

    int* d_a; //Table
    cudaMalloc((void**)&d_a, N * N * sizeof(int));

    // Copy Sudoku to device
    cudaMemcpy(d_a, h_a, N * N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dBlock(N, N);
    solve<<<1, dBlock>>>(d_a);

    // Copy Sudoku back to host
    cudaMemcpy(h_a, d_a, N * N * sizeof(int), cudaMemcpyDeviceToHost);

    // Check if solved
    if (h_a[0][0])
        print(h_a);
    else
        printf("No solution could be found.\n");

    cudaFree(d_a);
    return 0;
}

|}