CDOT Wiki — GPU610/gpuchill, user contributions of Dserpa (MediaWiki 1.30.0); revision of 2019-04-05, edit summary: /* Rotate Image */<br />
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found at https://github.com/regehr/sudoku. The solver uses a simple backtracking algorithm: it inserts candidate values into cells and iterates over the puzzle thousands of times until it eventually produces an answer that violates none of the rules of Sudoku. Because of this, the solver runs at the same speed regardless of the human difficulty rating, solving 'easy' and 'insane' puzzles equally quickly. It also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run therefore uses a puzzle specifically crafted to work against the backtracking algorithm, giving the solver its worst-case running time.<br />
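For reference, the kind of validity test that check_row, check_col and check_region perform (per the profile below) can be sketched as follows; this is my own simplified illustration, not the solver's actual code (0 marks an empty cell):<br />

```cpp
#include <array>

using Grid = std::array<std::array<int, 9>, 9>;

// Return true if placing 'val' at (row, col) violates no Sudoku rule.
// Mirrors what the profiled solver's check_row / check_col / check_region
// do together (a simplified sketch, names taken from the gprof output).
bool canPlace(const Grid& g, int row, int col, int val) {
    for (int k = 0; k < 9; ++k) {
        if (g[row][k] == val) return false;      // row conflict
        if (g[k][col] == val) return false;      // column conflict
    }
    int r0 = (row / 3) * 3, c0 = (col / 3) * 3;  // top-left of the 3x3 region
    for (int r = r0; r < r0 + 3; ++r)
        for (int c = c0; c < c0 + 3; ++c)
            if (g[r][c] == val) return false;    // region conflict
    return true;
}
```

The backtracking loop calls a test like this for every candidate value in every empty cell, which is why the check functions dominate the profile.<br />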
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to enhance this program, one would see a large increase in speed. All of the check functions do essentially the same thing: iterate over candidate values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile an image processing library written by a single developer, Christopher Ginac; his post about the library is available [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. The library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. To do this I had to test the limits of both the program and the disk quota on my Seneca machine, which meant using a fairly large image. However, because the program writes the enlarged result as a second image, my Seneca account initially ran out of space and the program could not write out the new image. I therefore settled on a source image of at most 16.3MB, so that the input and output together fit in 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
This isn't really bad, but when we look deeper, we see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and also constructing the image object in the first place. If we were to use a GPU for this process, we would see a decrease in run time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; by offloading that work onto the GPU, we should be able to reduce its 0.76s to something even lower.<br />
<br />
Using the same image as above (the 16.3MB file), I went ahead and profiled the Negate option as well. This, as the name implies, turns the image into its negative.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, this takes about half the time of the Enlarge option, which is expected considering it does less work.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice that in both the Enlarge and Negate profiles, the function "Image::Image(int, int, int)" is always within the top three functions by time spent. The functions "Image::setPixelVal(int, int, int)" and <br />
"Image::getPixelVal(int, int)" are also called very often. If we focus our efforts on offloading "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" onto the GPU, since they are highly repetitive tasks, and also try to optimize the "Image::Image(int, int, int)" constructor, we are sure to see an increase in performance for this program.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. Merge sort first divides the array into equal halves and then combines them in sorted order. Because the sort is broken into equal, independent parts, I thought it would be a perfect candidate for GPU acceleration: with the work split into multiple chunks and sent to the GPU, the sort should complete more efficiently. I found the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
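The algorithm just described can be sketched in a few lines; this is my own minimal illustration of the divide-and-merge structure, not the code from the linked post:<br />

```cpp
#include <vector>

// Recursively split 'v' in half, sort each half, then merge the two
// sorted halves back together (a minimal sketch of the algorithm
// described above, not the implementation that was profiled).
std::vector<int> mergeSorted(const std::vector<int>& v) {
    if (v.size() <= 1) return v;                      // base case: already sorted
    std::size_t mid = v.size() / 2;
    std::vector<int> left  = mergeSorted({v.begin(), v.begin() + mid});
    std::vector<int> right = mergeSorted({v.begin() + mid, v.end()});
    std::vector<int> out;
    out.reserve(v.size());
    std::size_t i = 0, j = 0;
    while (i < left.size() && j < right.size())       // merge step
        out.push_back(left[i] <= right[j] ? left[i++] : right[j++]);
    while (i < left.size())  out.push_back(left[i++]);
    while (j < right.size()) out.push_back(right[j++]);
    return out;
}
```

The two recursive calls are independent of each other, which is exactly the property that makes the divide phase attractive for parallel hardware.<br />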
<br />
Profile for 10 million elements with values between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
Applying Amdahl's law with 8 processors, Sn = 1 / ( (1 - 0.80) + 0.80/8 ) = 3.3, so we can expect a maximum speedup of 3.3x.<br />
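The arithmetic can be checked directly with a small helper (the function name is my own, not from the profiled code):<br />

```cpp
// Amdahl's law: speedup = 1 / ((1 - P) + P / n),
// with P the parallelizable fraction and n the processor count.
double amdahl(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}
```

For this profile, amdahl(0.80, 8) evaluates to roughly 3.33, matching the figure above; even with unlimited processors the serial 20% caps the speedup at 5x.<br />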
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the Monte Carlo approximation method. This works by inscribing a circle of area πr² inside a square of area 4r² (with r = 1.0) and generating random points inside the square, with both x and y between -1 and 1, while keeping count of how many points land inside the circle. Since the ratio of the two areas is π/4, the fraction of points inside the circle approaches π/4, and multiplying that fraction by 4 approximates PI. The more points generated, the more accurate the final result. The number of points needed for, say, billionth-digit precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate for parallelization.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points:<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds to run. The most intensive part of the program is the loop, which executed 2 billion times in the profiled run, and every iteration can be parallelized. The profile suggests that 100% of the execution time is spent in the loop; since that is not strictly possible, we will use 99.9%. Taking a GTX 1080 as an example GPU, with 20 streaming multiprocessors of 2048 resident threads each (40960 threads in total), Amdahl's law gives Sn = 1 / ( (1 - 0.999) + 0.999/40960 ) ≈ 976, so we can expect a speedup of up to about 976 times.<br />
<br />
=== Assignment 2 ===<br />
==== Beginning Information ====<br />
<br />
==== Enlarge Image====<br />
<pre><br />
__global__ void enlargeImg(int* a, int* b, int matrixSize, int growthVal, int imgCols, int enlargedCols) {<br />
int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
int x = idx / enlargedCols;<br />
int y = idx % enlargedCols;<br />
if (idx < matrixSize) {<br />
a[idx] = b[(x / growthVal) * imgCols + (y / growthVal)];<br />
}<br />
}<br />
</pre><br />
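The index arithmetic in the enlargeImg kernel can be reproduced on the host as a sanity check. This sketch (function name and row-major layout are my own assumptions) applies the same nearest-neighbour mapping serially:<br />

```cpp
#include <vector>

// Host-side version of the enlargeImg index mapping: every output pixel
// (x, y) in the enlarged image copies source pixel (x/growthVal, y/growthVal),
// using flat row-major storage just like the kernel.
std::vector<int> enlargeHost(const std::vector<int>& src,
                             int imgRows, int imgCols, int growthVal) {
    int outRows = imgRows * growthVal, outCols = imgCols * growthVal;
    std::vector<int> dst(outRows * outCols);
    for (int idx = 0; idx < outRows * outCols; ++idx) {
        int x = idx / outCols;   // output row
        int y = idx % outCols;   // output column
        dst[idx] = src[(x / growthVal) * imgCols + (y / growthVal)];
    }
    return dst;
}
```

Each output pixel is computed independently of all the others, which is why the kernel can assign exactly one thread per output pixel.<br />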
<br />
==== Shrink Image ====<br />
On the CPU, "shrink" took 20,000 microseconds while the GPU took 118 microseconds, a speedup of about 169.5 times.<br />
<br />
<pre><br />
__global__ void shrinkImg(int* a, int* b, int matrixSize, int shrinkVal, int imgCols, int shrinkCols) {<br />
int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
int x = idx / shrinkCols;<br />
int y = idx % shrinkCols;<br />
if (idx < matrixSize) {<br />
a[idx] = b[(x / shrinkVal) * imgCols + (y / shrinkVal)];<br />
}<br />
}<br />
</pre><br />
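Kernels like these are launched with one thread per output pixel, so the grid size must be rounded up to cover the whole image; a sketch of that calculation (ntpb, the threads per block, is a tuning choice I am assuming, e.g. 256 or 1024):<br />

```cpp
// One thread per output pixel: round the block count up so that
// nblocks * ntpb >= matrixSize; the kernel's "idx < matrixSize" guard
// then discards the few surplus threads in the last block.
int gridBlocks(int matrixSize, int ntpb) {
    return (matrixSize + ntpb - 1) / ntpb;
}
```

A launch would then look something like shrinkImg<<<gridBlocks(matrixSize, 1024), 1024>>>(...), where the guard inside the kernel makes the rounding safe.<br />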
<br />
==== Reflect Image====<br />
<br />
<pre><br />
// Reflect Image Horizontally<br />
// (tempImage.pixelVal[rows - (i + 1)][j] = oldImage.pixelVal[i][j], flattened row-major)<br />
__global__ void reflectImgH(int* a, int* b, int rows, int cols) {<br />
int i = blockIdx.x * blockDim.x + threadIdx.x;  // row<br />
int j = blockIdx.y * blockDim.y + threadIdx.y;  // column<br />
if (i < rows && j < cols)<br />
a[(rows - (i + 1)) * cols + j] = b[i * cols + j];<br />
}<br />
<br />
// Reflect Image Vertically<br />
// (tempImage.pixelVal[i][cols - (j + 1)] = oldImage.pixelVal[i][j], flattened row-major)<br />
__global__ void reflectImgV(int* a, int* b, int rows, int cols) {<br />
int i = blockIdx.x * blockDim.x + threadIdx.x;  // row<br />
int j = blockIdx.y * blockDim.y + threadIdx.y;  // column<br />
if (i < rows && j < cols)<br />
a[i * cols + (cols - (j + 1))] = b[i * cols + j];<br />
}<br />
</pre><br />
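The reflection comments come from the CPU library, which indexes pixelVal[row][col]; flattened to row-major storage, the horizontal reflection becomes dst[(rows - (i + 1)) * cols + j] = src[i * cols + j]. A host-side sketch of that mapping (my own illustration, not project code):<br />

```cpp
#include <vector>

// Horizontal reflection in flat row-major storage: row i of the source
// becomes row rows-(i+1) of the destination, columns unchanged.
std::vector<int> reflectH(const std::vector<int>& src, int rows, int cols) {
    std::vector<int> dst(rows * cols);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            dst[(rows - (i + 1)) * cols + j] = src[i * cols + j];
    return dst;
}
```

Running this against the GPU output on a small test image is a quick way to confirm the kernel's index arithmetic.<br />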
<br />
==== Translate Image====<br />
<br />
<pre><br />
__global__ void translateImg(int* a, int* b, int cols, int value) {<br />
int i = blockIdx.x * blockDim.x + threadIdx.x;  // row<br />
int j = blockIdx.y * blockDim.y + threadIdx.y;  // column<br />
<br />
// tempImage.pixelVal[i + value][j + value] = oldImage.pixelVal[i][j], flattened row-major<br />
// (threads whose shifted position falls outside the image should be<br />
// discarded by a bounds check before this write)<br />
a[(i + value) * cols + (j + value)] = b[i * cols + j];<br />
}<br />
</pre><br />
<br />
==== Rotate Image====<br />
On the CPU, "rotate" took 40,000 microseconds while the GPU took 1,482 microseconds, a speedup of about 27 times.<br />
<br />
<pre><br />
__global__ void rotateImg(int* a, int* b, int matrixSize, int imgCols, int imgRows, int r0, int c0, float rads) {<br />
int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
int r = idx / imgCols;  // row<br />
int c = idx % imgCols;  // column<br />
if (idx < matrixSize) {<br />
int r1 = (int)(r0 + ((r - r0) * cos(rads)) - ((c - c0) * sin(rads)));<br />
int c1 = (int)(c0 + ((r - r0) * sin(rads)) + ((c - c0) * cos(rads)));<br />
// copy the pixel only if its rotated position lands inside the image<br />
if (r1 >= 0 && r1 < imgRows && c1 >= 0 && c1 < imgCols) {<br />
a[r1 * imgCols + c1] = b[r * imgCols + c];<br />
}<br />
}<br />
}<br />
<br />
// Fill black (0) pixels left behind by the rounding in rotateImg with the<br />
// value of the neighbouring pixel in the next row (assumes idx stays in<br />
// bounds and the last row is handled separately).<br />
__global__ void rotateImgBlackFix(int* a, int imgCols) {<br />
int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
int r = idx / imgCols;<br />
int c = idx % imgCols;<br />
if (a[r * imgCols + c] == 0)<br />
a[r * imgCols + c] = a[(r + 1) * imgCols + c];<br />
}<br />
</pre><br />
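The rotation formula above is the standard 2-D rotation about a centre point (r0, c0). A host-side sketch (my own illustration, using row-major indexing) makes it easy to verify properties such as a 0-radian rotation leaving the image unchanged:<br />

```cpp
#include <cmath>
#include <vector>

// Rotate each source pixel (r, c) about centre (r0, c0) by 'rads',
// writing it to its new position when that position is in bounds;
// uninitialized destination pixels stay 0 (black), which is what
// the black-fix pass cleans up on the GPU.
std::vector<int> rotateHost(const std::vector<int>& src,
                            int rows, int cols, int r0, int c0, float rads) {
    std::vector<int> dst(rows * cols, 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            int r1 = (int)(r0 + (r - r0) * std::cos(rads) - (c - c0) * std::sin(rads));
            int c1 = (int)(c0 + (r - r0) * std::sin(rads) + (c - c0) * std::cos(rads));
            if (r1 >= 0 && r1 < rows && c1 >= 0 && c1 < cols)
                dst[r1 * cols + c1] = src[r * cols + c];
        }
    }
    return dst;
}
```

Because destination coordinates are rounded to integers, some output pixels receive no source pixel at all; those are the black holes the second kernel patches over.<br />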
<br />
==== Negate Image====<br />
<br />
<pre><br />
__global__ void negateImg(int* a, int* b, int matrixSize) {<br />
int matrixCol = blockIdx.x * blockDim.x + threadIdx.x;<br />
// invert the gray value (255 assumed as the maximum .PGM gray level)<br />
if (matrixCol < matrixSize)<br />
a[matrixCol] = 255 - b[matrixCol];<br />
}<br />
</pre><br />
<br />
====Results====<br />
[[File:CHART2GOOD.png]]<br />
<br />
=== Assignment 3 ===</div>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU was used to enhance this program one would see a great increase of speed. All of the check functions essentially do the same thing, iterating through possible inserted values for any that violate the rules. If one is able to unload all of these iterations onto the GPU then there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. In order for me to do this, I had to test both the limits of the program and my own seneca machine allowed space, in order to do this, I had to use a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, so the program could not write out the newly enlarged image. So I had to settle on an image that was 16.3MB max, so that it could write a new one, totally in 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
Which isn't really bad, but when we look deeper, we see where most of our time is being spent<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning our enlarged image to the now one, and also creating our image object in the first place. I think if we were to somehow use a GPU for this process, we would see an decrease in run-time for this part of the library. Also, there also seems to be room for improvement on the very 'Image::enlargeImage' function itself. I feel like by loading said functionality onto thje GPU, we can reduce it's 0.76s to something even lower.<br />
<br />
Using the same image as above (16MB file), I went ahead and profile the Negate option as well. This as the name implies turns the image into a negative form.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, about half the time of the Enlarge option, which is expect considering you're not doing as much.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice in both cases of the Enlarge and Negate options the function "Image::Image(int, int, int)" is always within the top 3 of functions that seem to take the most time. Also, the functions "Image::setPixelVal(int, int, int)" and <br />
"Image::getPixelVal(int, int)" are called very often. I think if we focus our efforts on unloading the "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" functions onto the GPU as I imagine they are VERY repetitive tasks, as well as try and optimize the "Image::Image(int, int, int)" function; we are sure to see an increase in performance for this program.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decide to profile a vector merge sort algorithm. A merge sort is based on a based on divide and conquer technique which recursively breaks down a problem into two or more sub-problems of the same or related types. When these become simple enough to be solved directly the sub-problems are then combined to give a solution to the original problem. It first divides the array into equal halves and then combines them in a sorted manner. Due to this type of sort being broken into equal parts, I thought that it would be perfect for a GPU to be able to accelerate the process. With the sort being broken down into multiple chunks and then sent to the GPU it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000. Using -02 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see 80% of the total time was spent in mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess, and profile calculates the value of PI by using the approximation method called Monte Carlo. This works by having a circle that is 𝜋r2 and a square that is 4r2 with r being 1.0 and generating randomized points inside the area, both x and y being between -1 and 1 we keep track of how many points have been located inside the circle. The more points generated the more accurate the final calculation of PI will be. The amount of points needed for say billionth precision can easily reach in the hundreds of billions which would take just as many calculations of the same mathematical computation, which makes it a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>gpp -O2 -g -pg -oapp pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
You need many billions of points and maybe even trillions to reach a high precision for the final result but using just 2 billion dots causes the program to take over 30 seconds to run. The most intensive part of the program is the loop which is what executes 2 billion times in my run of the program while profiling, which can all be parallelized. We can determine from the profiling that 100% of the time executing the program is spent in the loop but of course that is not possible so we will go with 99.9%, using a GTX 1080 as an example GPU which has 20 processors and each having 2048 threads, and using Amdahl's Law we can expect a speedup of 976.191 times<br />
<br />
=== Assignment 2 ===<br />
==== Beginning Information ====<br />
<br />
==== Enlarge Image====<br />
<pre><br />
__global__ void enlargeImg(int* a, int* b, int matrixSize, int growthVal, int imgCols, int enlargedCols) {<br />
	int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
	int x = idx / enlargedCols;	// output row<br />
	int y = idx % enlargedCols;	// output column<br />
	if (idx < matrixSize) {<br />
		// each output pixel copies the source pixel (x / growthVal, y / growthVal)<br />
		a[idx] = b[(x / growthVal) * imgCols + (y / growthVal)];<br />
	}<br />
}<br />
</pre><br />
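The index arithmetic of the enlarge kernel can be sanity-checked on the CPU with a serial equivalent (a sketch; `enlargeSerial` and the 2x2 test image are made up for illustration):

```cpp
#include <vector>

// Serial equivalent of enlargeImg: each output pixel (x, y) copies the
// source pixel (x / growthVal, y / growthVal), i.e. nearest-neighbour upscaling.
std::vector<int> enlargeSerial(const std::vector<int>& b, int imgRows, int imgCols, int growthVal) {
    int enlargedCols = imgCols * growthVal;
    std::vector<int> a(imgRows * growthVal * enlargedCols);
    for (int idx = 0; idx < (int)a.size(); ++idx) {
        int x = idx / enlargedCols;   // output row
        int y = idx % enlargedCols;   // output column
        a[idx] = b[(x / growthVal) * imgCols + (y / growthVal)];
    }
    return a;
}
```

Enlarging the 2x2 image {1,2; 3,4} by a factor of 2 duplicates each pixel into a 2x2 block of the 4x4 output.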
<br />
==== Shrink Image ====<br />
On the CPU, "shrink" took 20,000 microseconds while the GPU took 118 microseconds, a speedup of about 169.5 times.<br />
<br />
<pre><br />
__global__ void shrinkImg(int* a, int* b, int matrixSize, int shrinkVal, int imgCols, int shrinkCols) {<br />
	int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
	int x = idx / shrinkCols;	// output row<br />
	int y = idx % shrinkCols;	// output column<br />
	if (idx < matrixSize) {<br />
		// sample the top-left pixel of each shrinkVal x shrinkVal block of the source<br />
		// (note: multiply, not divide as in enlarge, to step through the larger source image)<br />
		a[idx] = b[(x * shrinkVal) * imgCols + (y * shrinkVal)];<br />
	}<br />
}<br />
</pre><br />
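The shrink mapping can likewise be checked serially: each output pixel samples the top-left pixel of its shrinkVal x shrinkVal block in the source (a sketch; `shrinkSerial` and the 4x4 test image are made up for illustration):

```cpp
#include <vector>

// Serial check of the shrink mapping: output pixel (x, y) samples
// source pixel (x * shrinkVal, y * shrinkVal).
std::vector<int> shrinkSerial(const std::vector<int>& b, int imgCols, int shrinkVal) {
    int shrinkCols = imgCols / shrinkVal;
    int shrinkRows = (int)b.size() / imgCols / shrinkVal;
    std::vector<int> a(shrinkRows * shrinkCols);
    for (int idx = 0; idx < (int)a.size(); ++idx) {
        int x = idx / shrinkCols;   // output row
        int y = idx % shrinkCols;   // output column
        a[idx] = b[(x * shrinkVal) * imgCols + (y * shrinkVal)];
    }
    return a;
}
```

Shrinking a 4x4 image by a factor of 2 keeps the top-left pixel of each 2x2 block.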
<br />
==== Reflect Image====<br />
<br />
<pre><br />
// Reflect Image Horizontally (flip across the horizontal axis: reverse row order)<br />
__global__ void reflectImgH(int* a, int* b, int rows, int cols) {<br />
	int i = blockIdx.x * blockDim.x + threadIdx.x;	// column<br />
	int j = blockIdx.y * blockDim.y + threadIdx.y;	// row<br />
	//tempImage.pixelVal[rows - (j + 1)][i] = oldImage.pixelVal[j][i];<br />
	if (i < cols && j < rows)<br />
		a[(rows - (j + 1)) * cols + i] = b[j * cols + i];<br />
}<br />
<br />
// Reflect Image Vertically (flip across the vertical axis: reverse column order)<br />
__global__ void reflectImgV(int* a, int* b, int rows, int cols) {<br />
	int i = blockIdx.x * blockDim.x + threadIdx.x;	// column<br />
	int j = blockIdx.y * blockDim.y + threadIdx.y;	// row<br />
	//tempImage.pixelVal[j][cols - (i + 1)] = oldImage.pixelVal[j][i];<br />
	if (i < cols && j < rows)<br />
		a[j * cols + (cols - (i + 1))] = b[j * cols + i];<br />
}<br />
</pre><br />
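The reflection index arithmetic can be verified serially with a row-major layout, row j and column i (a sketch; `reflectH`, `reflectV`, and the 2x3 test image are made up for illustration):

```cpp
#include <vector>

// Horizontal reflection: reverse the row order.
std::vector<int> reflectH(const std::vector<int>& b, int rows, int cols) {
    std::vector<int> a(b.size());
    for (int j = 0; j < rows; ++j)
        for (int i = 0; i < cols; ++i)
            a[(rows - (j + 1)) * cols + i] = b[j * cols + i];
    return a;
}

// Vertical reflection: reverse the column order within each row.
std::vector<int> reflectV(const std::vector<int>& b, int rows, int cols) {
    std::vector<int> a(b.size());
    for (int j = 0; j < rows; ++j)
        for (int i = 0; i < cols; ++i)
            a[j * cols + (cols - (i + 1))] = b[j * cols + i];
    return a;
}
```

For the 2x3 image {1,2,3; 4,5,6}, the horizontal flip swaps the two rows and the vertical flip reverses each row.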
<br />
==== Translate Image====<br />
<br />
<pre><br />
__global__ void translateImg(int* a, int* b, int rows, int cols, int value) {<br />
	int i = blockIdx.x * blockDim.x + threadIdx.x;	// column<br />
	int j = blockIdx.y * blockDim.y + threadIdx.y;	// row<br />
<br />
	//tempImage.pixelVal[j + value][i + value] = oldImage.pixelVal[j][i];<br />
	// guard against writes outside the image when the shift pushes pixels past an edge<br />
	if (i + value >= 0 && i + value < cols && j + value >= 0 && j + value < rows)<br />
		a[(j + value) * cols + (i + value)] = b[j * cols + i];<br />
}<br />
</pre><br />
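Translation with a bounds check can be sketched serially as follows (the `translateSerial` helper and 2x2 test image are made up for illustration; pixels shifted past an edge are simply dropped and vacated pixels stay zero):

```cpp
#include <vector>

// Serial sketch of translation: pixel (j, i) moves to (j + value, i + value);
// destinations outside the image are skipped, so no out-of-range writes occur.
std::vector<int> translateSerial(const std::vector<int>& b, int rows, int cols, int value) {
    std::vector<int> a(b.size(), 0);   // vacated pixels stay 0 (black)
    for (int j = 0; j < rows; ++j)
        for (int i = 0; i < cols; ++i) {
            int j2 = j + value, i2 = i + value;
            if (j2 >= 0 && j2 < rows && i2 >= 0 && i2 < cols)
                a[j2 * cols + i2] = b[j * cols + i];
        }
    return a;
}
```

Shifting the 2x2 image {1,2; 3,4} by 1 keeps only pixel 1, which lands in the bottom-right corner.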
<br />
==== Rotate Image====<br />
On the CPU, "rotate" took 40,000 microseconds while the GPU took 1,482 microseconds, a speedup of about 27 times. Two kernels are used: the first performs the rotation, and the second patches the black pixels left behind by the forward mapping.<br />
<br />
<pre><br />
__global__ void rotateImg(int* a, int* b, int matrixSize, int imgCols, int imgRows, int r0, int c0, float rads) {<br />
	int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
	int r = idx / imgCols;	// source row<br />
	int c = idx % imgCols;	// source column<br />
	if (idx < matrixSize) {<br />
		// rotate (r, c) about the centre (r0, c0) by rads<br />
		int r1 = (int)(r0 + ((r - r0) * cos(rads)) - ((c - c0) * sin(rads)));<br />
		int c1 = (int)(c0 + ((r - r0) * sin(rads)) + ((c - c0) * cos(rads)));<br />
		// only write destinations that fall inside the image<br />
		if (r1 >= 0 && r1 < imgRows && c1 >= 0 && c1 < imgCols) {<br />
			a[r1 * imgCols + c1] = b[r * imgCols + c];	// row-major: row * imgCols + column<br />
		}<br />
	}<br />
}<br />
<br />
// forward mapping leaves unmapped (black) pixels; fill each one from the pixel below it<br />
__global__ void rotateImgBlackFix(int* a, int matrixSize, int imgCols, int imgRows) {<br />
	int idx = blockIdx.x * blockDim.x + threadIdx.x;<br />
	int r = idx / imgCols;<br />
	int c = idx % imgCols;<br />
	if (idx < matrixSize && r + 1 < imgRows && a[r * imgCols + c] == 0)<br />
		a[r * imgCols + c] = a[(r + 1) * imgCols + c];<br />
}<br />
</pre><br />
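Forward rotation maps each source pixel to a destination, which leaves unmapped holes in the output (the reason rotateImgBlackFix exists). A common alternative is inverse mapping: each destination pixel looks up the source pixel it came from, so every destination gets exactly one value and no fix-up pass is needed. A serial sketch under assumed row-major layout (`rotateInverse` is a made-up helper; pixels mapping outside the source are filled with zero):

```cpp
#include <cmath>
#include <vector>

// Inverse-mapping rotation: for each destination pixel (r, c), rotate backwards
// by rads to find the source pixel; this produces no holes in the output.
std::vector<int> rotateInverse(const std::vector<int>& b, int rows, int cols,
                               int r0, int c0, float rads) {
    std::vector<int> a(b.size(), 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int rs = (int)(r0 + (r - r0) * std::cos(rads) + (c - c0) * std::sin(rads));
            int cs = (int)(c0 - (r - r0) * std::sin(rads) + (c - c0) * std::cos(rads));
            if (rs >= 0 && rs < rows && cs >= 0 && cs < cols)
                a[r * cols + c] = b[rs * cols + cs];
        }
    return a;
}
```

Each destination thread does one read and one write, so this parallelizes exactly like the forward kernel but skips the second pass.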
<br />
==== Negate Image====<br />
<br />
<pre><br />
__global__ void negateImg(int* a, int* b, int matrixSize) {<br />
	int matrixCol = blockIdx.x * blockDim.x + threadIdx.x;<br />
	if (matrixCol < matrixSize)<br />
		a[matrixCol] = 255 - b[matrixCol];	// assumes an 8-bit greyscale image (max pixel value 255)<br />
}<br />
</pre><br />
<br />
====Results====<br />
[[File:CHART2GOOD.png]]<br />
<br />
=== Assignment 3 ===</div>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU was used to enhance this program one would see a great increase of speed. All of the check functions essentially do the same thing, iterating through possible inserted values for any that violate the rules. If one is able to unload all of these iterations onto the GPU then there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. In order for me to do this, I had to test both the limits of the program and my own seneca machine allowed space, in order to do this, I had to use a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, so the program could not write out the newly enlarged image. So I had to settle on an image that was 16.3MB max, so that it could write a new one, totally in 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
Which isn't really bad, but when we look deeper, we see where most of our time is being spent<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning our enlarged image to the now one, and also creating our image object in the first place. I think if we were to somehow use a GPU for this process, we would see an decrease in run-time for this part of the library. Also, there also seems to be room for improvement on the very 'Image::enlargeImage' function itself. I feel like by loading said functionality onto thje GPU, we can reduce it's 0.76s to something even lower.<br />
<br />
Using the same image as above (16MB file), I went ahead and profile the Negate option as well. This as the name implies turns the image into a negative form.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, about half the time of the Enlarge option, which is expect considering you're not doing as much.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice in both cases of the Enlarge and Negate options the function "Image::Image(int, int, int)" is always within the top 3 of functions that seem to take the most time. Also, the functions "Image::setPixelVal(int, int, int)" and <br />
"Image::getPixelVal(int, int)" are called very often. I think if we focus our efforts on unloading the "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" functions onto the GPU as I imagine they are VERY repetitive tasks, as well as try and optimize the "Image::Image(int, int, int)" function; we are sure to see an increase in performance for this program.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decide to profile a vector merge sort algorithm. A merge sort is based on a based on divide and conquer technique which recursively breaks down a problem into two or more sub-problems of the same or related types. When these become simple enough to be solved directly the sub-problems are then combined to give a solution to the original problem. It first divides the array into equal halves and then combines them in a sorted manner. Due to this type of sort being broken into equal parts, I thought that it would be perfect for a GPU to be able to accelerate the process. With the sort being broken down into multiple chunks and then sent to the GPU it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000. Using -02 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see 80% of the total time was spent in mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess, and profile calculates the value of PI by using the approximation method called Monte Carlo. This works by having a circle that is 𝜋r2 and a square that is 4r2 with r being 1.0 and generating randomized points inside the area, both x and y being between -1 and 1 we keep track of how many points have been located inside the circle. The more points generated the more accurate the final calculation of PI will be. The amount of points needed for say billionth precision can easily reach in the hundreds of billions which would take just as many calculations of the same mathematical computation, which makes it a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (long long j = 0; j < npoints; j++) { // long long avoids overflow once npoints exceeds INT_MAX<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
The program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points.<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, and perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds to run. The most intensive part of the program is the loop, which executed 2 billion times in my profiling run and can be parallelized in its entirety. The profile reports that 100% of execution time is spent in the loop; since that is not strictly possible, we will assume 99.9%. Taking a GTX 1080 as an example GPU, with 20 streaming multiprocessors each supporting 2048 resident threads (40,960 threads in total), Amdahl's law gives Sn = 1 / ( (1 - 0.999) + 0.999/40960 ) ≈ 976.19, so we can expect a maximum speedup of roughly 976 times.<br />
<br />
=== Assignment 2 ===<br />
==== Beginning Information ====<br />
==== Shrink ====<br />
On the CPU, "shrink" took 20,000 microseconds while the GPU took 118 microseconds, a speedup of roughly 169.5 times.<br />
<br />
The following chart graphically shows how this speedup looks:<br />
<br />
==== Rotate ====<br />
On the CPU, "rotate" took 40,000 microseconds while the GPU took 123,123 microseconds, which shows a speedup of X<br />
<br />
The following chart graphically shows how this speedup looks:<br />
<br />
=== Assignment 3 ===</div>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU was used to enhance this program one would see a great increase of speed. All of the check functions essentially do the same thing, iterating through possible inserted values for any that violate the rules. If one is able to unload all of these iterations onto the GPU then there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. In order for me to do this, I had to test both the limits of the program and my own seneca machine allowed space, in order to do this, I had to use a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, so the program could not write out the newly enlarged image. So I had to settle on an image that was 16.3MB max, so that it could write a new one, totally in 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
Which isn't really bad, but when we look deeper, we see where most of our time is being spent<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning our enlarged image to the now one, and also creating our image object in the first place. I think if we were to somehow use a GPU for this process, we would see an decrease in run-time for this part of the library. Also, there also seems to be room for improvement on the very 'Image::enlargeImage' function itself. I feel like by loading said functionality onto thje GPU, we can reduce it's 0.76s to something even lower.<br />
<br />
Using the same image as above (16MB file), I went ahead and profile the Negate option as well. This as the name implies turns the image into a negative form.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, about half the time of the Enlarge option, which is expect considering you're not doing as much.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice in both cases of the Enlarge and Negate options the function "Image::Image(int, int, int)" is always within the top 3 of functions that seem to take the most time. Also, the functions "Image::setPixelVal(int, int, int)" and <br />
"Image::getPixelVal(int, int)" are called very often. I think if we focus our efforts on unloading the "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" functions onto the GPU as I imagine they are VERY repetitive tasks, as well as try and optimize the "Image::Image(int, int, int)" function; we are sure to see an increase in performance for this program.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decide to profile a vector merge sort algorithm. A merge sort is based on a based on divide and conquer technique which recursively breaks down a problem into two or more sub-problems of the same or related types. When these become simple enough to be solved directly the sub-problems are then combined to give a solution to the original problem. It first divides the array into equal halves and then combines them in a sorted manner. Due to this type of sort being broken into equal parts, I thought that it would be perfect for a GPU to be able to accelerate the process. With the sort being broken down into multiple chunks and then sent to the GPU it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000. Using -02 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see 80% of the total time was spent in mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess, and profile calculates the value of PI by using the approximation method called Monte Carlo. This works by having a circle that is 𝜋r2 and a square that is 4r2 with r being 1.0 and generating randomized points inside the area, both x and y being between -1 and 1 we keep track of how many points have been located inside the circle. The more points generated the more accurate the final calculation of PI will be. The amount of points needed for say billionth precision can easily reach in the hundreds of billions which would take just as many calculations of the same mathematical computation, which makes it a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>gpp -O2 -g -pg -oapp pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
You need many billions of points and maybe even trillions to reach a high precision for the final result but using just 2 billion dots causes the program to take over 30 seconds to run. The most intensive part of the program is the loop which is what executes 2 billion times in my run of the program while profiling, which can all be parallelized. We can determine from the profiling that 100% of the time executing the program is spent in the loop but of course that is not possible so we will go with 99.9%, using a GTX 1080 as an example GPU which has 20 processors and each having 2048 threads, and using Amdahl's Law we can expect a speedup of 976.191 times<br />
<br />
=== Assignment 2 ===<br />
==== Beginning Information ====<br />
==== Shrink ====<br />
On the CPU "shrink" took 20,000 microseconds and the GPU took 123,123 microseconds which shows a speedup of X<br />
<br />
The following chart graphically shows how this speedup looks:<br />
<br />
==== Rotate ====<br />
On the CPU "rotate" took 40,000 microseconds and the GPU took 123,123 microseconds which shows a speedup of X<br />
<br />
The following chart graphically shows how this speedup looks:<br />
<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138782GPU610/gpuchill2019-04-05T01:48:41Z<p>Dserpa: /* Shrink */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU was used to enhance this program one would see a great increase of speed. All of the check functions essentially do the same thing, iterating through possible inserted values for any that violate the rules. If one is able to unload all of these iterations onto the GPU then there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. In order for me to do this, I had to test both the limits of the program and my own seneca machine allowed space, in order to do this, I had to use a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, so the program could not write out the newly enlarged image. So I had to settle on an image that was 16.3MB max, so that it could write a new one, totally in 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
Which isn't really bad, but when we look deeper, we see where most of our time is being spent<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning our enlarged image to the now one, and also creating our image object in the first place. I think if we were to somehow use a GPU for this process, we would see an decrease in run-time for this part of the library. Also, there also seems to be room for improvement on the very 'Image::enlargeImage' function itself. I feel like by loading said functionality onto thje GPU, we can reduce it's 0.76s to something even lower.<br />
<br />
Using the same image as above (16MB file), I went ahead and profile the Negate option as well. This as the name implies turns the image into a negative form.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, about half the time of the Enlarge option, which is expect considering you're not doing as much.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice in both cases of the Enlarge and Negate options the function "Image::Image(int, int, int)" is always within the top 3 of functions that seem to take the most time. Also, the functions "Image::setPixelVal(int, int, int)" and <br />
"Image::getPixelVal(int, int)" are called very often. I think if we focus our efforts on unloading the "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" functions onto the GPU as I imagine they are VERY repetitive tasks, as well as try and optimize the "Image::Image(int, int, int)" function; we are sure to see an increase in performance for this program.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decide to profile a vector merge sort algorithm. A merge sort is based on a based on divide and conquer technique which recursively breaks down a problem into two or more sub-problems of the same or related types. When these become simple enough to be solved directly the sub-problems are then combined to give a solution to the original problem. It first divides the array into equal halves and then combines them in a sorted manner. Due to this type of sort being broken into equal parts, I thought that it would be perfect for a GPU to be able to accelerate the process. With the sort being broken down into multiple chunks and then sent to the GPU it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000. Using -02 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see 80% of the total time was spent in mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess, and profile calculates the value of PI by using the approximation method called Monte Carlo. This works by having a circle that is 𝜋r2 and a square that is 4r2 with r being 1.0 and generating randomized points inside the area, both x and y being between -1 and 1 we keep track of how many points have been located inside the circle. The more points generated the more accurate the final calculation of PI will be. The amount of points needed for say billionth precision can easily reach in the hundreds of billions which would take just as many calculations of the same mathematical computation, which makes it a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>gpp -O2 -g -pg -oapp pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds. The most intensive part of the program is the loop body, which executed 2 billion times in this profiling run and can be fully parallelized. The profile attributes 100% of the execution time to the loop; since that cannot be exactly true, we will assume 99.9%. Taking a GTX 1080 as an example GPU, with 20 processors of 2048 threads each (40,960 threads in total), Amdahl's Law predicts a speedup of about 976.191 times.<br />
<br />
=== Assignment 2 ===<br />
==== Beginning Information ====<br />
==== Shrink ====<br />
On the CPU, "shrink" took 20,000 microseconds, while on the GPU it took 123,123 microseconds, which shows a speedup of X.<br />
<br />
The following chart graphically shows how this speedup looks:<br />
<br />
==== Rotate ====<br />
<br />
=== Assignment 3 ===</div>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi, Shrink & Rotate<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to accelerate this program, one would see a great increase in speed. All of the check functions do essentially the same thing: they iterate over candidate values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, a corresponding increase in speed should follow.<br />
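As an illustration of how these checks decompose (a minimal sketch under my own assumptions, not code from the solver repository), all 27 units of a 9x9 board (9 rows, 9 columns, 9 boxes) can be validated independently, so each unit could map to its own GPU thread:<br />

```cpp
#include <cstddef>

// Sketch only: a 9x9 board stored as 81 ints, 0 = empty. Each of the 27
// units (9 rows, 9 columns, 9 boxes) can be validated independently,
// which is the property that would let each unit map to one GPU thread.
static bool unit_ok(const int *board, const int *cells) {
    bool seen[10] = {false};          // digits 1..9
    for (int i = 0; i < 9; ++i) {
        int v = board[cells[i]];
        if (v == 0) continue;         // empty cells never conflict
        if (seen[v]) return false;    // duplicate digit in this unit
        seen[v] = true;
    }
    return true;
}

bool board_ok(const int *board) {
    // Each iteration of this loop is independent: on a GPU, unit u would
    // simply be handled by thread u instead of a serial loop.
    for (int u = 0; u < 27; ++u) {
        int cells[9];
        for (int i = 0; i < 9; ++i) {
            if (u < 9)       cells[i] = u * 9 + i;          // row u
            else if (u < 18) cells[i] = i * 9 + (u - 9);    // column u-9
            else {                                          // box u-18
                int b = u - 18;
                cells[i] = (b / 3) * 27 + (b % 3) * 3 + (i / 3) * 9 + i % 3;
            }
        }
        if (!unit_ok(board, cells)) return false;
    }
    return true;
}
```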
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written, totalling 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper, we see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself: by offloading that work onto the GPU, we should be able to reduce its 0.76s to something even lower.<br />
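To see why enlargeImage is a good offloading candidate, here is a sketch of what integer-factor enlargement does conceptually (my own nearest-neighbour version under assumed semantics, not Ginac's implementation):<br />

```cpp
#include <vector>
#include <cstddef>

// Nearest-neighbour scaling by an integer factor: every output pixel is
// computed from exactly one input pixel, independently of all others, so
// on a GPU each output pixel could be one thread with no synchronization.
std::vector<int> enlarge(const std::vector<int>& src,
                         int rows, int cols, int factor) {
    std::vector<int> dst(static_cast<std::size_t>(rows) * factor * cols * factor);
    for (int r = 0; r < rows * factor; ++r)
        for (int c = 0; c < cols * factor; ++c)
            // Each (r, c) reads one source pixel -> fully data-parallel.
            dst[static_cast<std::size_t>(r) * cols * factor + c] =
                src[static_cast<std::size_t>(r / factor) * cols + c / factor];
    return dst;
}
```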
<br />
Using the same image as above (16MB file), I went ahead and profiled the Negate option as well. This, as the name implies, turns the image into its negative.<br />
<pre><br />
real 0m5.707s<br />
user 0m0.000s<br />
sys 0m0.000s<br />
</pre><br />
<br />
As you can see, it takes about half the time of the Enlarge option, which is expected considering it does less work.<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls ms/call ms/call name <br />
23.53 0.16 0.16 2 80.00 80.00 Image::Image(Image const&)<br />
16.18 0.27 0.11 2 55.00 55.00 Image::Image(int, int, int)<br />
14.71 0.37 0.10 _fu62___ZSt4cout<br />
13.24 0.46 0.09 17117346 0.00 0.00 Image::getPixelVal(int, int)<br />
13.24 0.55 0.09 1 90.00 90.00 Image::operator=(Image const&)<br />
7.35 0.60 0.05 1 50.00 140.00 writeImage(char*, Image&)<br />
7.35 0.65 0.05 1 50.00 195.00 Image::negateImage(Image&)<br />
4.41 0.68 0.03 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.00 0.68 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 0.68 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 0.68 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 0.68 0.00 1 0.00 0.00 readImage(char*, Image&)<br />
0.00 0.68 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)<br />
</pre><br />
<br />
Notice that in both the Enlarge and Negate runs the constructor "Image::Image(int, int, int)" is always among the top three functions by time, and that "Image::setPixelVal(int, int, int)" and "Image::getPixelVal(int, int)" are called very often. If we focus our efforts on offloading "Image::getPixelVal(int, int)" and "Image::setPixelVal(int, int, int)" onto the GPU, since they are highly repetitive per-pixel tasks, and also optimize "Image::Image(int, int, int)", we are sure to see an increase in performance for this program.<br />
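The per-pixel pattern behind these getPixelVal/setPixelVal calls is easiest to see in the Negate case; the following sketch (assumed from the option's name, not copied from the library) shows the one-independent-operation-per-pixel shape that maps directly onto GPU threads:<br />

```cpp
#include <vector>
#include <cstddef>

// Sketch of the per-pixel work negation performs: each pixel becomes
// maxVal - pixel. No pixel depends on any other, so a GPU kernel could
// assign one thread per pixel with no synchronization at all.
void negate(std::vector<int>& pixels, int maxVal) {
    for (std::size_t i = 0; i < pixels.size(); ++i)
        pixels[i] = maxVal - pixels[i];
}
```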
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. The algorithm first divides the array into equal halves and then merges them back in sorted order. Because the sort splits the data into independent parts, I thought it would be perfect for GPU acceleration: with the sort broken into multiple chunks sent to the GPU, the work can be accomplished more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
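The divide-and-conquer structure can be sketched as follows (a minimal top-down merge sort in the spirit of the linked code, not a copy of it); the two recursive calls touch disjoint halves, which is exactly what a parallel version would exploit:<br />

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Sorts a[lo, hi). The recursive calls operate on disjoint halves of the
// array, so they could run concurrently; a GPU variant would instead
// merge ever-larger sorted runs level by level.
void merge_sort(std::vector<int>& a, std::vector<int>& tmp,
                std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;                 // 0 or 1 element: sorted
    std::size_t mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);             // independent of...
    merge_sort(a, tmp, mid, hi);             // ...this call
    std::merge(a.begin() + lo, a.begin() + mid,
               a.begin() + mid, a.begin() + hi, tmp.begin() + lo);
    std::copy(tmp.begin() + lo, tmp.begin() + hi, a.begin() + lo);
}
```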
<br />
Profile for 10 million elements with values between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the approximation method called Monte Carlo. A circle of area πr² is inscribed in a square of area 4r², with r being 1.0; we generate random points inside the square, both x and y between -1 and 1, and keep track of how many points land inside the circle. The fraction of points inside the circle, multiplied by 4, approximates PI, and the more points generated, the more accurate the final calculation will be. The number of points needed for, say, billionth precision can easily reach the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Reaching high precision in the final result requires many billions, perhaps even trillions, of points, yet even 2 billion points make the program take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run and can be parallelized in its entirety. The profile attributes 100% of the execution time to the loop; since that is not strictly possible, we will assume 99.9%. Using a GTX 1080 as an example GPU, which has 20 streaming multiprocessors each supporting 2048 resident threads (40,960 threads in total), Amdahl's Law gives an expected speedup of 976.191 times.<br />
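The quoted speedup can be reproduced directly from Amdahl's Law; the helper below is my own illustration, not part of the profiled program, using P = 0.999 of the work as parallelizable and n = 20 × 2048 = 40960 threads:<br />

```cpp
#include <cmath>

// Amdahl's Law: S = 1 / ((1 - P) + P / n), where P is the parallelizable
// fraction of the work and n the number of processing elements.
double amdahl(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}
```

With these inputs the formula evaluates to roughly 976.19, matching the figure above.<br />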
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138229GPU610/gpuchill2019-03-08T13:44:53Z<p>Dserpa: /* Results */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm: it inserts candidate values into cells, iterating through the puzzle thousands of times, until it produces an answer that does not violate any of the rules of Sudoku. As a result the solver runs at the same speed regardless of the human difficulty rating, solving 'easy' and 'insane' puzzles equally fast. The solver also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run therefore uses a puzzle specifically constructed to work against the backtracking algorithm and maximize the solver's running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to accelerate this program, one would see a great increase in speed. All of the check functions do essentially the same thing: they iterate over candidate values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, a corresponding increase in speed should follow.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written, totalling 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper, we see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself: by offloading that work onto the GPU, we should be able to reduce its 0.76s to something even lower.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. The algorithm first divides the array into equal halves and then merges them back in sorted order. Because the sort splits the data into independent parts, I thought it would be perfect for GPU acceleration: with the sort broken into multiple chunks sent to the GPU, the work can be accomplished more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements with values between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we look at Amdahl's law Sn = 1 / ( 1 - 0.80 + 0.80/8 ) we can expect a maximum speedup of 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the approximation method called Monte Carlo. A circle of area πr² is inscribed in a square of area 4r², with r being 1.0; we generate random points inside the square, both x and y between -1 and 1, and keep track of how many points land inside the circle. The fraction of points inside the circle, multiplied by 4, approximates PI, and the more points generated, the more accurate the final calculation will be. The number of points needed for, say, billionth precision can easily reach the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Reaching high precision in the final result requires many billions, perhaps even trillions, of points, yet even 2 billion points make the program take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run and can be parallelized in its entirety. The profile attributes 100% of the execution time to the loop; since that is not strictly possible, we will assume 99.9%. Using a GTX 1080 as an example GPU, which has 20 streaming multiprocessors each supporting 2048 resident threads (40,960 threads in total), Amdahl's Law gives an expected speedup of 976.191 times.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138070GPU610/gpuchill2019-03-04T16:29:53Z<p>Dserpa: /* Results */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm: it inserts candidate values into cells, iterating through the puzzle thousands of times, until it produces an answer that does not violate any of the rules of Sudoku. As a result the solver runs at the same speed regardless of the human difficulty rating, solving 'easy' and 'insane' puzzles equally fast. The solver also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run therefore uses a puzzle specifically constructed to work against the backtracking algorithm and maximize the solver's running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to accelerate this program, one would see a great increase in speed. All of the check functions do essentially the same thing: they iterate over candidate values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, a corresponding increase in speed should follow.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile a single user created image processing library written by Christopher Ginac, you can follow his post of the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to play around with .PGM image formats. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written, totalling 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper, we see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself: by offloading that work onto the GPU, we should be able to reduce its 0.76s to something even lower.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. The algorithm first divides the array into equal halves and then merges them back in sorted order. Because the sort splits the data into independent parts, I thought it would be perfect for GPU acceleration: with the sort broken into multiple chunks sent to the GPU, the work can be accomplished more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As the profile shows, roughly 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
Applying Amdahl's law with n = 8 processors, S<sub>n</sub> = 1 / ( (1 - 0.80) + 0.80/8 ) ≈ 3.3, so we can expect a maximum speedup of about 3.3x.<br />
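The estimate above can be reproduced with a tiny helper, where P is the parallel fraction and n the processor count; the function name `amdahl` is our own.<br />

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: maximum speedup for parallel fraction P on n processors.
double amdahl(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}
```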
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the Monte Carlo approximation method. This works by inscribing a circle of area πr<sup>2</sup> inside a square of area 4r<sup>2</sup>, with r being 1.0, and generating randomized points inside the square, with both x and y between -1 and 1; we keep track of how many points land inside the circle. The more points generated, the more accurate the final calculation of PI will be. The number of points needed for, say, billionth precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate to parallelize.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
====== Figure 2 ======<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
The program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points:<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds to run. The most intensive part of the program is the loop, which executes 2 billion times in the profiled run, and every iteration can be parallelized. The profile attributes 100% of the execution time to the loop; since that is not literally possible, we will use 99.9%. Taking a GTX 1080 as an example GPU, which has 20 SMX units each supporting 2048 threads, Amdahl's Law gives an expected maximum speedup of about 976 times.<br />
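To make the parallel structure concrete before moving to a GPU, here is a hedged CPU sketch using std::thread: each worker gets a private RNG and a private counter (no shared state inside the loop), and the partial counts are reduced at the end — the same split-and-reduce shape a CUDA kernel would use. The function name `parallelPi` and the per-worker seeding are our own choices, not part of the original program.<br />

```cpp
#include <random>
#include <thread>
#include <vector>

// Monte Carlo Pi with the hot loop split across workers. Each worker owns a
// private RNG and a private counter, so there is no shared state inside the
// loop; the partial counts are reduced once at the end. Any remainder of
// npoints / workers is ignored for brevity.
double parallelPi(long long npoints, unsigned workers) {
    std::vector<long long> hits(workers, 0);
    std::vector<std::thread> pool;
    const long long per = npoints / workers;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&hits, w, per] {
            std::mt19937_64 rng(852 + w);   // per-worker seed, like srand(852)
            std::uniform_real_distribution<double> coord(-1.0, 1.0);
            long long count = 0;
            for (long long i = 0; i < per; ++i) {
                double x = coord(rng), y = coord(rng);
                if (x * x + y * y < 1.0) ++count;   // inside the unit circle
            }
            hits[w] = count;                // the only write to shared memory
        });
    for (auto& t : pool) t.join();
    long long total = 0;
    for (long long h : hits) total += h;
    return 4.0 * static_cast<double>(total) / static_cast<double>(per * workers);
}
```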
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple back tracking algorithm, inserting possible values into cells, iterating through the puzzles thousands of times, until it eventually produces an answer which does not violate any of the rules of Sudoku. As such the solver runs at the same speed regardless of the human difficulty rating, able to solve easy and 'insane' level puzzles at the same speed. The solver also works independent of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles.As such the following run of the program uses a puzzle which is specifically made to play against the back tracking algorithm and provides maximum time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to enhance this program, one would see a great increase in speed. All of the check functions essentially do the same thing: they iterate through possible inserted values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile an image processing library created by a single user, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written alongside it, for a total of 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper we can see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning the enlarged image to the new one, and also creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; I feel that by offloading that functionality onto the GPU, we could reduce its 0.76s to something even lower.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to be solved directly, the sub-problems are combined to give a solution to the original problem. The sort first divides the array into equal halves and then combines them in sorted order. Because this type of sort is broken into equal parts, I thought it would be perfect for a GPU to accelerate the process: with the sort broken down into multiple chunks and then sent to the GPU, it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, using -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we apply Amdahl's law with 8 processors, Sn = 1 / ( 1 - 0.80 + 0.80/8 ), we can expect a maximum speedup of about 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the Monte Carlo approximation method. This works by taking a circle of area πr² inscribed in a square of area 4r², with r being 1.0, and generating randomized points inside the square, both x and y falling between -1 and 1, while keeping track of how many points land inside the circle. The more points generated, the more accurate the final calculation of PI will be. The number of points needed for, say, billionth-digit precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate for parallelization.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
----<br />
<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
	for (long long j = 0; j < npoints; j++) { // long long: npoints can exceed INT_MAX<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
====== Figure 2 ======<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run, and all of those iterations can be parallelized. The profile reports that 100% of execution time is spent in that loop; since that is not strictly possible, we will assume 99.9%. Using a GTX 1080 as an example GPU, which has 20 SMX units with 2048 resident threads each (40,960 threads in total), Amdahl's Law gives an expected speedup of 1 / ( 1 - 0.999 + 0.999/40960 ) ≈ 976.191x.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138066GPU610/gpuchill2019-03-04T16:21:03Z<p>Dserpa: /* Initial Thoughts */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm, inserting possible values into cells and iterating through the puzzle thousands of times until it eventually produces an answer that does not violate any of the rules of Sudoku. As such, the solver runs at the same speed regardless of the human difficulty rating, solving easy and 'insane' level puzzles equally quickly. The solver also works independently of the ratio between clues and blank cells, producing quick results for even the most sparsely populated puzzles. The following run of the program therefore uses a puzzle specifically crafted to work against the backtracking algorithm and give the solver its worst-case running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to enhance this program, one would see a great increase in speed. All of the check functions essentially do the same thing: they iterate through possible inserted values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile an image processing library created by a single user, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written alongside it, for a total of 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper we can see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning the enlarged image to the new one, and also creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; I feel that by offloading that functionality onto the GPU, we could reduce its 0.76s to something even lower.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to be solved directly, the sub-problems are combined to give a solution to the original problem. The sort first divides the array into equal halves and then combines them in sorted order. Because this type of sort is broken into equal parts, I thought it would be perfect for a GPU to accelerate the process: with the sort broken down into multiple chunks and then sent to the GPU, it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, using -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we apply Amdahl's law with 8 processors, Sn = 1 / ( 1 - 0.80 + 0.80/8 ), we can expect a maximum speedup of about 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the Monte Carlo approximation method. This works by taking a circle of area πr² inscribed in a square of area 4r², with r being 1.0, and generating randomized points inside the square, both x and y falling between -1 and 1, while keeping track of how many points land inside the circle. The more points generated, the more accurate the final calculation of PI will be. The number of points needed for, say, billionth-digit precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate for parallelization.<br />
<br />
====== Figure 1 ======<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
----<br />
<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
	for (long long j = 0; j < npoints; j++) { // long long: npoints can exceed INT_MAX<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
====== Figure 2 ======<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach high precision in the final result, yet even 2 billion points cause the program to take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run, and all of those iterations can be parallelized. The profile reports that 100% of execution time is spent in that loop; since that is not strictly possible, we will assume 99.9%. Using a GTX 1080 as an example GPU, which has 20 SMX units with 2048 resident threads each (40,960 threads in total), Amdahl's Law gives an expected speedup of 1 / ( 1 - 0.999 + 0.999/40960 ) ≈ 976.191x.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138064GPU610/gpuchill2019-03-04T16:12:51Z<p>Dserpa: /* Results */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm, inserting possible values into cells and iterating through the puzzle thousands of times until it eventually produces an answer that does not violate any of the rules of Sudoku. As such, the solver runs at the same speed regardless of the human difficulty rating, solving easy and 'insane' level puzzles equally quickly. The solver also works independently of the ratio between clues and blank cells, producing quick results for even the most sparsely populated puzzles. The following run of the program therefore uses a puzzle specifically crafted to work against the backtracking algorithm and give the solver its worst-case running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to enhance this program, one would see a great increase in speed. All of the check functions essentially do the same thing: they iterate through possible inserted values looking for any that violate the rules. If all of these iterations could be offloaded onto the GPU, there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile an image processing library created by a single user, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long that would take. To do this I had to test both the limits of the program and the disk space allowed on my Seneca machine, which meant using a fairly large image. However, since the program creates a second image, my Seneca account ran out of space for the new image, and the program could not write out the newly enlarged result. I therefore had to settle on an image of at most 16.3MB, so that a new one could be written alongside it, for a total of 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
That isn't really bad, but when we look deeper we can see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of our time in this part of the code is spent assigning the enlarged image to the new one, and also creating the image object in the first place. I think if we were to use a GPU for this process, we would see a decrease in run-time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; I feel that by offloading that functionality onto the GPU, we could reduce its 0.76s to something even lower.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to be solved directly, the sub-problems are combined to give a solution to the original problem. The sort first divides the array into equal halves and then combines them in sorted order. Because this type of sort is broken into equal parts, I thought it would be perfect for a GPU to accelerate the process: with the sort broken down into multiple chunks and then sent to the GPU, it will be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, using -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we apply Amdahl's law with 8 processors, Sn = 1 / ( 1 - 0.80 + 0.80/8 ), we can expect a maximum speedup of about 3.3x.<br />
<br />
==== Calculation of Pi ====<br />
===== Initial Thoughts =====<br />
The program I decided to assess and profile calculates the value of PI using the Monte Carlo approximation method. This works by taking a circle of area πr² inscribed in a square of area 4r², with r being 1.0, and generating randomized points inside the square, both x and y falling between -1 and 1, while keeping track of how many points land inside the circle. The more points generated, the more accurate the final calculation of PI will be. The number of points needed for, say, billionth-digit precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate for parallelization.<br />
<br />
<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
----<br />
<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (int j = 0; j < npoints; j++) {<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
The program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points.<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, perhaps even trillions, of points are needed to reach a high-precision final result, yet even 2 billion points make the program take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run and can be parallelized in its entirety. The profile reports that 100% of execution time is spent in the loop; since a true 100% is not possible, we will go with 99.9%. Using a GTX 1080 as an example GPU, which has 20 SMs with 2048 threads each (40,960 threads in total), Amdahl's Law gives an expected maximum speedup of 976.191x.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138061GPU610/gpuchill2019-03-04T16:08:51Z<p>Dserpa: /* Team Members */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Calculation of Pi<br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
==== Sudoku Brute Force Solver ====<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm, inserting possible values into cells and iterating through the puzzle thousands of times until it eventually produces an answer that does not violate any of the rules of Sudoku. As such, the solver runs at the same speed regardless of the human difficulty rating, solving easy and 'insane' level puzzles equally fast. The solver also works independently of the ratio between clues and white space, producing quick results with even the most sparsely populated puzzles. The following run of the program therefore uses a puzzle specifically constructed to play against the backtracking algorithm and maximize the solver's running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that using a GPU to enhance this program would yield a great increase in speed. All of the check functions do essentially the same thing: they iterate over candidate values, looking for any that violate the rules. If one is able to offload all of these iterations onto the GPU, there should be a corresponding increase in speed.<br />
<br />
==== Christopher Ginac Image Processing Library ====<br />
<br />
I decided to profile an image processing library created by a single user, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. To test the limits of the program, I needed a fairly large image, but I was also constrained by the disk space allotted to my Seneca account: since the program writes the enlarged result as a second image, the account ran out of space for the new file on my first attempts. I settled on an image of at most 16.3 MB so that the program could write a second one, for a total of 32.6 MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
This isn't really bad, but when we look deeper, we see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and also constructing the image object in the first place. If we were to use a GPU for this process, we should see a decrease in run-time for this part of the library. There also seems to be room for improvement in the Image::enlargeImage function itself; by offloading that work onto the GPU, its 0.76s could be reduced even further.<br />
<br />
==== Merge Sort Algorithm ====<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, their solutions are combined into a solution to the original problem. Merge sort first divides the array into equal halves and then combines them in sorted order. Because the sort is broken into independent parts, I thought it would be well suited to GPU acceleration: with the work broken into multiple chunks and sent to the GPU, the task could be accomplished more efficiently. I found the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we look at Amdahl's law, Sn = 1 / ( 1 - 0.80 + 0.80/8 ), we can expect a maximum speedup of about 3.3x on 8 processors.<br />
<br />
==== Calculation of Pi ====<br />
The program I decided to assess and profile calculates the value of π using the Monte Carlo approximation method. A circle of area πr² is inscribed in a square of area 4r², with r being 1.0. We generate random points with both x and y between -1 and 1 and keep track of how many land inside the circle; since the ratio of the two areas is π/4, four times the fraction of points inside the circle approaches π. The more points generated, the more accurate the final calculation of π will be. The number of points needed for, say, billionth-digit precision can easily reach into the hundreds of billions, each requiring the same mathematical computation, which makes this a fantastic candidate to parallelize.<br />
<br />
<br />
[[File:Pi_calc.png]]<br />
<br/><br />
Figure 1: Graphical representation of the Monte Carlo method of approximating PI<br />
<br />
----<br />
<br />
{| class="wikitable mw-collapsible mw-collapsed"<br />
! pi.cpp<br />
|-<br />
|<br />
<source><br />
/*<br />
Author: Daniel Serpa<br />
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)<br />
*/<br />
<br />
#include <iostream><br />
#include <cstdlib><br />
#include <math.h><br />
#include <iomanip><br />
<br />
double calcPI(double);<br />
<br />
int main(int argc, char ** argv) {<br />
if (argc != 2) {<br />
std::cout << "Invalid number of arguments" << std::endl;<br />
return 1;<br />
}<br />
std::srand(852);<br />
double npoints = atof(argv[1]);<br />
std::cout << "Number of points: " << npoints << std::endl;<br />
double PI = calcPI(npoints);<br />
std::cout << std::setprecision(10) << PI << std::endl;<br />
return 0;<br />
}<br />
<br />
double calcPI(double npoints) {<br />
double circle_count = 0.0;<br />
for (long long j = 0; j < npoints; j++) { // long long: an int counter would overflow past ~2.1 billion points<br />
double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;<br />
if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;<br />
}<br />
return 4.0*circle_count / npoints;<br />
}<br />
</source><br />
|}<br />
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI<br />
<br />
===== Compilation =====<br />
Program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source><br />
<br />
===== Running =====<br />
We will profile the program using 2 billion points:<br />
<source><br />
> time app 2000000000<br />
Number of points: 2e+09<br />
3.14157<br />
<br />
real 1m0.072s<br />
user 0m59.268s<br />
sys 0m0.018s<br />
</source><br />
<br />
===== Profiling =====<br />
Flat:<br />
<source><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls Ts/call Ts/call name<br />
100.80 34.61 34.61 calcPI(double)<br />
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</source><br />
Call:<br />
<source><br />
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds<br />
<br />
index % time self children called name<br />
<spontaneous><br />
[1] 100.0 34.61 0.00 calcPI(double) [1]<br />
-----------------------------------------------<br />
0.00 0.00 1/1 __libc_csu_init [16]<br />
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]<br />
-----------------------------------------------<br />
<br />
Index by function name<br />
<br />
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)<br />
</source><br />
<br />
===== Results =====<br />
Many billions, and perhaps even trillions, of points are needed for high precision, yet a run of just 2 billion points already takes over 30 seconds. The most intensive part of the program is the loop, which iterated 2 billion times in my profiling run, and every one of those iterations can be parallelized. The profile reports that 100% of the execution time is spent in that loop; since that cannot literally be true, we will use 99.999%. Taking a GTX 1080 as an example GPU, with 20 SMs of 2048 resident threads each (40,960 threads total), Amdahl's law gives an expected maximum speedup of 29058.094x.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=File:Pi_calc.png&diff=138059File:Pi calc.png2019-03-04T15:30:05Z<p>Dserpa: </p>
<hr />
<div></div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138057GPU610/gpuchill2019-03-04T15:16:39Z<p>Dserpa: /* Assignment 1 */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Some responsibility <br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
===Sudoku Brute Force Solver===<br />
<br />
I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm, inserting possible values into cells and iterating through the puzzle thousands of times until it eventually produces an answer that does not violate any of the rules of Sudoku. As such, the solver runs at the same speed regardless of the human difficulty rating, solving easy and 'insane' level puzzles equally fast. The solver also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run of the program therefore uses a puzzle specifically constructed to play against the backtracking algorithm, giving a worst-case time for the solver.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to accelerate this program, one would see a large increase in speed. All of the check functions do essentially the same thing: iterating over candidate values and rejecting any that violate the rules. If all of these independent iterations could be offloaded onto the GPU, there should be a corresponding increase in speed.<br />
<br />
===Christopher Ginac Image Processing Library===<br />
<br />
I decided to profile an image processing library created by a single developer, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. If given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. Doing this meant testing both the limits of the program and the disk quota on my Seneca machine, which required a fairly large image. However, since the program writes the result out as a second image, my Seneca account initially ran out of space and the program could not write the newly enlarged image. I therefore settled on an image of at most 16.3MB, so that the program could write the new one, totalling 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
This isn't really bad, but when we look deeper we can see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and on creating the image objects in the first place. I think that if we were to use a GPU for this process, we would see a decrease in run time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; I feel that by offloading that work onto the GPU, we could reduce its 0.76s to something even lower.<br />
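To see why 'Image::enlargeImage' is a good GPU candidate, here is a minimal nearest-neighbour enlargement sketch (my own illustration, not Ginac's actual implementation): every output pixel is computed from exactly one input pixel, with no dependencies between outputs.<br />

```cpp
#include <vector>

// Sketch of nearest-neighbour enlargement of a rows x cols image by an
// integer factor k. Every output pixel reads exactly one input pixel, so
// each output pixel could be computed by its own GPU thread.
std::vector<int> enlarge(const std::vector<int>& src, int rows, int cols, int k) {
    std::vector<int> dst(static_cast<std::size_t>(rows) * k * cols * k);
    for (int r = 0; r < rows * k; ++r)
        for (int c = 0; c < cols * k; ++c)
            dst[static_cast<std::size_t>(r) * cols * k + c] = src[(r / k) * cols + (c / k)];
    return dst;
}
```

On a GPU the two loops would collapse into a 2D grid of threads, one per output pixel.<br />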
<br />
===Merge Sort Algorithm===<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. The sort first divides the array into equal halves and then merges them back together in sorted order. Because the sort is broken into independent parts, I thought it would be a good fit for GPU acceleration: with the work split into multiple chunks and sent to the GPU, the sort should be able to finish more efficiently. I found the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
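For reference, the shape of the algorithm being profiled is roughly the following (a minimal sketch, not the exact code from the linked post):<br />

```cpp
#include <vector>

// Minimal top-down merge sort on a vector (illustrative sketch). The two
// halves are sorted independently before merging, which is the property
// that makes the divide phase attractive for parallel hardware.
std::vector<int> mergeSort(const std::vector<int>& v) {
    if (v.size() <= 1) return v;
    std::size_t mid = v.size() / 2;
    std::vector<int> left(v.begin(), v.begin() + mid);
    std::vector<int> right(v.begin() + mid, v.end());
    left = mergeSort(left);
    right = mergeSort(right);
    std::vector<int> out;
    out.reserve(v.size());
    std::size_t i = 0, j = 0;
    while (i < left.size() && j < right.size())        // merge step
        out.push_back(left[i] <= right[j] ? left[i++] : right[j++]);
    while (i < left.size()) out.push_back(left[i++]);  // drain leftovers
    while (j < right.size()) out.push_back(right[j++]);
    return out;
}
```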
<br />
Profile for 10 million elements between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we apply Amdahl's law with 8 processors, Sn = 1 / ( (1 - 0.80) + 0.80/8 ), we can expect a maximum speedup of about 3.3x.<br />
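The estimate can be checked with a couple of lines (P = 0.80 is the parallel fraction measured above, and N = 8 is the assumed processor count):<br />

```cpp
// Amdahl's law: maximum speedup for a program whose parallelizable
// fraction P runs on N processors; the serial fraction (1 - P) limits it.
double amdahl(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}
```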
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138056GPU610/gpuchill2019-03-04T15:16:38Z<p>Dserpa: /* Progress */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Some responsibility <br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138055GPU610/gpuchill2019-03-04T15:15:44Z<p>Dserpa: /* Assignment 1 */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Some responsibility <br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
===Sudoku Brute Force Solver===<br />
<br />
I decided to profile a simple brute-force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm, inserting possible values into cells and iterating through the puzzle thousands of times until it eventually produces an answer that does not violate any of the rules of Sudoku. As such, the solver runs at the same speed regardless of the human difficulty rating, solving easy and 'insane' level puzzles equally quickly. The solver also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run of the program therefore uses a puzzle which is specifically crafted to work against the backtracking algorithm and so maximizes the solver's running time.<br />
<br />
Test run with puzzle:<br />
<pre><br />
Original configuration:<br />
-------------<br />
| | | |<br />
| | 3| 85|<br />
| 1| 2 | |<br />
-------------<br />
| |5 7| |<br />
| 4| |1 |<br />
| 9 | | |<br />
-------------<br />
|5 | | 73|<br />
| 2| 1 | |<br />
| | 4 | 9|<br />
-------------<br />
17 entries filled<br />
solution:<br />
-------------<br />
|987|654|321|<br />
|246|173|985|<br />
|351|928|746|<br />
-------------<br />
|128|537|694|<br />
|634|892|157|<br />
|795|461|832|<br />
-------------<br />
|519|286|473|<br />
|472|319|568|<br />
|863|745|219|<br />
-------------<br />
found 1 solutions<br />
<br />
real 0m33.652s<br />
user 0m33.098s<br />
sys 0m0.015s<br />
</pre><br />
<br />
<br />
Flat profile:<br />
<pre><br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
46.42 10.04 10.04 622865043 0.00 0.00 check_row<br />
23.52 15.13 5.09 1 5.09 21.32 solve<br />
18.26 19.08 3.95 223473489 0.00 0.00 check_col<br />
10.02 21.25 2.17 100654218 0.00 0.00 check_region<br />
0.72 21.40 0.16 2 0.08 0.08 print<br />
0.39 21.49 0.09 frame_dummy<br />
</pre><br />
<br />
I believe that if a GPU were used to accelerate this program we would see a large increase in speed. All of the check functions do essentially the same thing: they iterate over candidate values, rejecting any that violate the rules. If these independent iterations could be offloaded to the GPU, there should be a corresponding increase in speed.<br />
<br />
===Christopher Ginac Image Processing Library===<br />
<br />
I decided to profile an image processing library written by Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library lets the user manipulate images in the .PGM format. Given the right parameters, users have the following options:<br />
<br />
<pre><br />
What would you like to do:<br />
[1] Get a Sub Image<br />
[2] Enlarge Image<br />
[3] Shrink Image<br />
[4] Reflect Image<br />
[5] Translate Image<br />
[6] Rotate Image<br />
[7] Negate Image<br />
</pre><br />
<br />
I went with the Enlarge option to see how long it would take. To push both the limits of the program and of my Seneca account's disk quota, I wanted to use a fairly large image. However, since the program writes the enlarged result out as a second image, my Seneca account ran out of space and the program could not write the new file. I had to settle on an input image of at most 16.3MB so that the output could also be written, for a total of 32.6MB of space. <br />
<br />
<pre><br />
real 0m10.595s<br />
user 0m5.325s<br />
sys 0m1.446s<br />
</pre><br />
This isn't really bad, but when we look deeper we can see where most of our time is being spent:<br />
<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total <br />
time seconds seconds calls s/call s/call name <br />
21.74 1.06 1.06 1 1.06 1.06 Image::operator=(Image const&)<br />
21.33 2.10 1.04 2 0.52 0.52 Image::Image(int, int, int)<br />
18.66 3.01 0.91 154056114 0.00 0.00 Image::getPixelVal(int, int)<br />
15.59 3.77 0.76 1 0.76 2.34 Image::enlargeImage(int, Image&)<br />
14.97 4.50 0.73 1 0.73 1.67 writeImage(char*, Image&)<br />
3.69 4.68 0.18 2 0.09 0.09 Image::Image(Image const&)<br />
2.67 4.81 0.13 17117346 0.00 0.00 Image::setPixelVal(int, int, int)<br />
0.82 4.85 0.04 1 0.04 0.17 readImage(char*, Image&)<br />
0.62 4.88 0.03 1 0.03 0.03 Image::getImageInfo(int&, int&, int&)<br />
0.00 4.88 0.00 4 0.00 0.00 Image::~Image()<br />
0.00 4.88 0.00 3 0.00 0.00 std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)<br />
0.00 4.88 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev<br />
0.00 4.88 0.00 1 0.00 0.00 readImageHeader(char*, int&, int&, int&, bool&)<br />
0.00 4.88 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)<br />
</pre><br />
<br />
It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and on creating the image objects in the first place. I think that if we were to use a GPU for this process, we would see a decrease in run time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself; I feel that by offloading that work onto the GPU, we could reduce its 0.76s to something even lower.<br />
<br />
===Merge Sort Algorithm===<br />
<br />
I decided to profile a vector merge sort algorithm. Merge sort is based on the divide-and-conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to solve directly, the sub-solutions are combined to give a solution to the original problem. The sort first divides the array into equal halves and then merges them back together in sorted order. Because the sort is broken into independent parts, I thought it would be a good fit for GPU acceleration: with the work split into multiple chunks and sent to the GPU, the sort should be able to finish more efficiently. I found the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].<br />
<br />
Profile for 10 million elements between 1 and 10000, compiled with -O2 optimization.<br />
<pre><br />
Flat profile:<br />
<br />
Each sample counts as 0.01 seconds.<br />
% cumulative self self total<br />
time seconds seconds calls ns/call ns/call name<br />
48.35 1.16 1.16 9999999 115.56 115.56 mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, <br />
std::vector<int, std::allocator<int> >&)<br />
32.80 1.94 0.78 sort(std::vector<int, std::allocator<int> >&)<br />
19.34 2.40 0.46 43708492 10.58 10.58 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, <br />
std::allocator<int> > >, int const&)<br />
0.00 2.40 0.00 1 0.00 0.00 _GLOBAL__sub_I_main<br />
</pre><br />
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br /><br />
If we apply Amdahl's law with 8 processors, Sn = 1 / ( (1 - 0.80) + 0.80/8 ), we can expect a maximum speedup of about 3.3x.<br />
<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=138054GPU610/gpuchill2019-03-04T15:15:06Z<p>Dserpa: /* Progress */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Some responsibility <br />
# [mailto:akkabia@myseneca.ca?subject=gpu610 Abdul Kabia], Some responsibility <br />
# [mailto:jtardif1@myseneca.ca?subject=gpu610 Josh Tardif], Some responsibility<br />
# [mailto:afaux@myseneca.ca?subject=gpu610 Andrew Faux], Some responsibility<br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca,akkabia@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_G_P_Index_20191&diff=138001GPU610/DPS915 G P Index 201912019-03-01T15:54:02Z<p>Dserpa: /* Group and Project Index */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add an overview of your group here and create a separate project page for your group!<br />
<br />
= Project Rules =<br />
<br />
# Use the Group page for a Journal of your activities throughout the course of the project<br />
# Project should cover material that differs from the material on the course web site<br />
# Presentation can be in Powerpoint or as a walkthrough of the group project page<br />
# Link to the project page should be included in the Student List table<br />
# Presentation slots (see below) are on a first-come first-served basis<br />
# Attendance at all presentations is mandatory - marks will be deducted for absenteeism<br />
# Marks will be awarded for both Group Wiki page and for the Presentation proper<br />
<br />
<br /><br />
<br />
= Potential Projects =<br />
<br />
* [[GPU610/DPS915_G_P_Index_20157 | Fall 2015 semester (Former Students)]]<br />
* [[GPU610/DPS915_G_P_Index_20171 | Winter 2017 semester (Former Students)]]<br />
* [[GPU610/DPS915_G_P_Index_20181 | Winter 2018 semester (Former Students)]]<br />
<br />
=== Suggested Projects ===<br />
<br />
* image processing - [http://cimg.eu/ CImg Library], [http://dlib.net/imaging.html dlib C++ library]<br />
* data compression - [http://codereview.stackexchange.com/questions/86543/simple-lzw-compression-algorithm LZW algorithm], [http://www.mattmahoney.net/dc/dce.html Explained by Matt Mahoney]<br />
* grep - [http://www.boost.org/doc/libs/1_36_0/libs/regex/example/grep/grep.cpp Boost], [http://stackoverflow.com/questions/5731035/how-to-implement-grep-in-c-so-it-works-with-pipes-stdin-etc Stack Overflow ]<br />
* exclusive scan - [http://15418.courses.cs.cmu.edu/spring2016/article/4 CMU Assignment 2 Part 2]<br />
* simple circle renderer - [http://15418.courses.cs.cmu.edu/spring2016/article/4 CMU Assignment 2 Part 3]<br />
* object detection/tracking - [http://dlib.net/imaging.html#scan_fhog_pyramid dlib C++ library]<br />
* ray tracing - [http://khrylx.github.io/DSGPURayTracing/ by Yuan Ling (CMU) ] [https://github.com/jazztext/VRRayTracing/ by Kaffine Shearer (CMU)] [https://github.com/szellmann/visionaray Visionaray]<br />
* sorting algorithms - [http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html Alex Allain cprogramming.com], [https://www.toptal.com/developers/sorting-algorithms Animations]<br />
* Jacobi's method for Poisson's equation - [https://math.berkeley.edu/~wilken/228A.F07/chr_lecture.pdf Rycroft's Lecture Note]<br />
* Gaussian Regression - [http://abhishekjoshi2.github.io/cuGP/ cuGP]<br />
* Halide - [http://haboric-hu.github.io/ Convolutional Networks]<br />
* Sudoku - [http://www.andrew.cmu.edu/user/astian/ by Tian Debebe (CMU)]<br />
<br />
=== C++ Open Source Libraries ===<br />
* List of open source libraries - [http://en.cppreference.com/w/cpp/links/libs cppreference.com]<br />
<br />
=== Carnegie-Mellon University Links ===<br />
* [http://15418.courses.cs.cmu.edu/spring2016/article/17 Spring 2016]<br />
* [http://15418.courses.cs.cmu.edu/spring2015/competition Spring 2015]<br />
* [http://15418.courses.cs.cmu.edu/spring2014/article/12 Spring 2014]<br />
<br />
=== Other Links ===<br />
* [https://sites.google.com/a/nirmauni.ac.in/cudacodes/cuda-projects Nirma University - restricted use of code to students of Nirma but may be a source of ideas]<br />
<br />
=== Reference Papers ===<br />
* [http://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/GPUSurvey.pdf 2008 Survey Paper - you can search this paper for traditional topic ideas]<br />
* [http://www.nvidia.com/object/cuda_showcase_html.html Nvidia Showcase - probably too challenging - but could lead to simpler ideas]<br />
<br />
=== Interesting aspects to consider in your project ===<br />
* Try a different language - Javascript (Node.js bindings), Python (pyCUDA bindings)<br />
* Try APIs - [http://halide-lang.org/ Halide], OpenCV, Caffe, Latte<br />
* Compare CPU and GPU performance<br />
* Compare different blocksizes<br />
* Compare different algorithms on different machines<br />
* Implement your project on a Jetson TK1 board<br />
<br />
<br /><br />
<br />
= Presentation Schedule =<br />
<br />
<br />
{| border="1"<br />
|-<br />
|Team Name<br />
|Date and Time<br />
|-<br />
|<br />
|March 25 8:00<br />
|-<br />
<br />
|<br />
|March 25 8:20<br />
|-<br />
<br />
|<br />
|March 25 8:40<br />
|-<br />
<br />
|<br />
|March 25 9:00<br />
|-<br />
<br />
|<br />
|March 25 9:20<br />
|-<br />
<br />
|<br />
|March 29 8:00<br />
|-<br />
<br />
|<br />
|March 29 8:20<br />
|-<br />
<br />
|<br />
|March 29 8:40<br />
|-<br />
<br />
|[[Algo_holics | Algo-holics]]<br />
|March 29 9:00<br />
|-<br />
<br />
|<br />
|March 29 9:20<br />
|-<br />
<br />
|<br />
|April 1 8:00<br />
|-<br />
<br />
|<br />
|April 1 8:20<br />
|-<br />
<br />
|<br />
|April 1 8:40<br />
|-<br />
<br />
|[[Avengers | Avengers]]<br />
|April 1 9:00<br />
|-<br />
<br />
|<br />
|April 1 9:20<br />
|-<br />
<br />
|[[triForce |triForce]] <br />
|April 5 8:00<br />
|-<br />
<br />
|[[Ghost Cells | Ghost Cells]]<br />
|April 5 8:20<br />
|-<br />
<br />
|<br />
|April 5 8:40<br />
|-<br />
<br />
|[[GPU610/gpuchill|GPU n' Chill]]<br />
|April 5 9:00<br />
|-<br />
<br />
|group 6<br />
|April 5 9:20<br />
|-<br />
<br />
|}<br />
<br />
<br /><br />
<br />
= Group and Project Index =<br />
<br />
You can find a sample project page template [[GPU610/DPS915_Sample_Project_Page | here]]<br />
<br />
== [[GPU610/DPS915_Sample_Project_Page | Sample Group Title and email addresses]] ==<br />
<br />
# [mailto:chris.szalwinski@senecacollege.ca?subject=DPS915 Chris Szalwinski]<br />
# [mailto:fardad.soleimanloo@senecacollege.ca?subject=DPS915 Fardad Soleimanloo]<br />
# [mailto:chris.szalwinski@senecacollege.ca;fardad.soleimanloo@senecacollege.ca?subject=DPS915 eMail All]<br />
<br />
== [[Ghost_Cells | Ghost_Cells]] ==<br />
<br />
# [mailto:ysim2@myseneca.ca?subject=dps915 Tony Sim]<br />
# [mailto:rdittrich@myseneca.ca?subject=dps915 Robert Dittrich]<br />
# [mailto:izhogova@myseneca.ca?subject=dps915 Inna Zhogova]<br />
# [mailto:rdittrich@myseneca.ca,ysim2@myseneca.ca,izhogova@myseneca.ca?subject=dps915 Email All]<br />
<br />
== [[Algo_holics | Algo_holics]] ==<br />
<br />
# [mailto:gsingh520@myseneca.ca?subject=gpu610 Gurpreet Singh]<br />
# [mailto:egiang1@myseneca.ca?subject=gpu610 Edgar Giang]<br />
# [mailto:ssdhillon20@myseneca.ca?subject=gpu610 Sukhbeer Dhillon]<br />
# [mailto:gsingh520@myseneca.ca,egiang1@myseneca.ca,ssdhillon20@myseneca.ca?subject=gpu610 Email All]<br />
<br />
== [[GPU610/gpuchill | GPU n' Chill]] ==<br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa]<br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Email All]</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_G_P_Index_20191&diff=137895GPU610/DPS915 G P Index 201912019-02-20T22:16:04Z<p>Dserpa: /* Presentation Schedule */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add an overview of your group here and create a separate project page for your group!<br />
<br />
= Project Rules =<br />
<br />
# Use the Group page for a Journal of your activities throughout the course of the project<br />
# Project should cover material that differs from the material on the course web site<br />
# Presentation can be in Powerpoint or as a walkthrough of the group project page<br />
# Link to the project page should be included in the Student List table<br />
# Presentation slots (see below) are on a first-come first-served basis<br />
# Attendance at all presentations is mandatory - marks will be deducted for absenteeism<br />
# Marks will be awarded for both Group Wiki page and for the Presentation proper<br />
<br />
<br /><br />
<br />
= Potential Projects =<br />
<br />
* [[GPU610/DPS915_G_P_Index_20157 | Fall 2015 semester (Former Students)]]<br />
* [[GPU610/DPS915_G_P_Index_20171 | Winter 2017 semester (Former Students)]]<br />
* [[GPU610/DPS915_G_P_Index_20181 | Winter 2018 semester (Former Students)]]<br />
<br />
=== Suggested Projects ===<br />
<br />
* image processing - [http://cimg.eu/ CImg Library], [http://dlib.net/imaging.html dlib C++ library]<br />
* data compression - [http://codereview.stackexchange.com/questions/86543/simple-lzw-compression-algorithm LZW algorithm], [http://www.mattmahoney.net/dc/dce.html Explained by Matt Mahoney]<br />
* grep - [http://www.boost.org/doc/libs/1_36_0/libs/regex/example/grep/grep.cpp Boost], [http://stackoverflow.com/questions/5731035/how-to-implement-grep-in-c-so-it-works-with-pipes-stdin-etc Stack Overflow ]<br />
* exclusive scan - [http://15418.courses.cs.cmu.edu/spring2016/article/4 CMU Assignment 2 Part 2]<br />
* simple circle renderer - [http://15418.courses.cs.cmu.edu/spring2016/article/4 CMU Assignment 2 Part 3]<br />
* object detection/tracking - [http://dlib.net/imaging.html#scan_fhog_pyramid dlib C++ library]<br />
* ray tracing - [http://khrylx.github.io/DSGPURayTracing/ by Yuan Ling (CMU) ] [https://github.com/jazztext/VRRayTracing/ by Kaffine Shearer (CMU)] [https://github.com/szellmann/visionaray Visionaray]<br />
* sorting algorithms - [http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html Alex Allain cprogramming.com], [https://www.toptal.com/developers/sorting-algorithms Animations]<br />
* Jacobi's method for Poisson's equation - [https://math.berkeley.edu/~wilken/228A.F07/chr_lecture.pdf Rycroft's Lecture Note]<br />
* Gaussian Regression - [http://abhishekjoshi2.github.io/cuGP/ cuGP]<br />
* Halide - [http://haboric-hu.github.io/ Convolutional Networks]<br />
* Sudoku - [http://www.andrew.cmu.edu/user/astian/ by Tian Debebe (CMU)]<br />
<br />
=== C++ Open Source Libraries ===<br />
* List of open source libraries - [http://en.cppreference.com/w/cpp/links/libs cppreference.com]<br />
<br />
=== Carnegie-Mellon University Links ===<br />
* [http://15418.courses.cs.cmu.edu/spring2016/article/17 Spring 2016]<br />
* [http://15418.courses.cs.cmu.edu/spring2015/competition Spring 2015]<br />
* [http://15418.courses.cs.cmu.edu/spring2014/article/12 Spring 2014]<br />
<br />
=== Other Links ===<br />
* [https://sites.google.com/a/nirmauni.ac.in/cudacodes/cuda-projects Nirma University - restricted use of code to students of Nirma but may be a source of ideas]<br />
<br />
=== Reference Papers ===<br />
* [http://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/GPUSurvey.pdf 2008 Survey Paper - you can search this paper for traditional topic ideas]<br />
* [http://www.nvidia.com/object/cuda_showcase_html.html Nvidia Showcase - probably too challenging - but could lead to simpler ideas]<br />
<br />
=== Interesting aspects to consider in your project ===<br />
* Try a different language - Javascript (Node.js bindings), Python (pyCUDA bindings)<br />
* Try APIs - [http://halide-lang.org/ Halide], OpenCV, Caffe, Latte<br />
* Compare CPU and GPU performance<br />
* Compare different blocksizes<br />
* Compare different algorithms on different machines<br />
* Implement your project on a Jetson TK1 board<br />
<br />
<br /><br />
<br />
= Presentation Schedule =<br />
<br />
<br />
{| border="1"<br />
|-<br />
|Team Name<br />
|Date and Time<br />
|-<br />
|<br />
|March 25 8:00<br />
|-<br />
<br />
|<br />
|March 25 8:20<br />
|-<br />
<br />
|<br />
|March 25 8:40<br />
|-<br />
<br />
|<br />
|March 25 9:00<br />
|-<br />
<br />
|<br />
|March 25 9:20<br />
|-<br />
<br />
|<br />
|March 29 8:00<br />
|-<br />
<br />
|<br />
|March 29 8:20<br />
|-<br />
<br />
|<br />
|March 29 8:40<br />
|-<br />
<br />
|[[Algo_holics | Algo-holics]]<br />
|March 29 9:00<br />
|-<br />
<br />
|<br />
|March 29 9:20<br />
|-<br />
<br />
|<br />
|April 1 8:00<br />
|-<br />
<br />
|<br />
|April 1 8:20<br />
|-<br />
<br />
|<br />
|April 1 8:40<br />
|-<br />
<br />
|[[Avengers | Avengers]]<br />
|April 1 9:00<br />
|-<br />
<br />
|<br />
|April 1 9:20<br />
|-<br />
<br />
|[[triForce |triForce]] <br />
|April 5 8:00<br />
|-<br />
<br />
|[[Ghost Cells | Ghost Cells]]<br />
|April 5 8:20<br />
|-<br />
<br />
|<br />
|April 5 8:40<br />
|-<br />
<br />
|[[GPU610/gpuchill|GPU n' Chill]]<br />
|April 5 9:00<br />
|-<br />
<br />
|group 6<br />
|April 5 9:20<br />
|-<br />
<br />
|}<br />
<br />
<br /><br />
<br />
= Group and Project Index =<br />
<br />
You can find a sample project page template [[GPU610/DPS915_Sample_Project_Page | here]]<br />
<br />
== [[GPU610/DPS915_Sample_Project_Page | Sample Group Title and email addresses]] ==<br />
<br />
# [mailto:chris.szalwinski@senecacollege.ca?subject=DPS915 Chris Szalwinski]<br />
# [mailto:fardad.soleimanloo@senecacollege.ca?subject=DPS915 Fardad Soleimanloo]<br />
# [mailto:chris.szalwinski@senecacollege.ca;fardad.soleimanloo@senecacollege.ca?subject=DPS915 eMail All]<br />
<br />
== [[Ghost_Cells | Ghost_Cells]] ==<br />
<br />
# [mailto:ysim2@myseneca.ca?subject=dps915 Tony Sim]<br />
# [mailto:rdittrich@myseneca.ca?subject=dps915 Robert Dittrich]<br />
# [mailto:izhogova@myseneca.ca?subject=dps915 Inna Zhogova]<br />
# [mailto:rdittrich@myseneca.ca,ysim2@myseneca.ca,izhogova@myseneca.ca?subject=dps915 Email All]<br />
<br />
== [[Algo_holics | Algo_holics]] ==<br />
<br />
# [mailto:gsingh520@myseneca.ca?subject=gpu610 Gurpreet Singh]<br />
# [mailto:egiang1@myseneca.ca?subject=gpu610 Edgar Giang]<br />
# [mailto:ssdhillon20@myseneca.ca?subject=gpu610 Sukhbeer Dhillon]<br />
# [mailto:gsingh520@myseneca.ca,egiang1@myseneca.ca,ssdhillon20@myseneca.ca?subject=gpu610 Email All]</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/gpuchill&diff=137602GPU610/gpuchill2019-02-08T14:35:05Z<p>Dserpa: First page edit</p>
<hr />
<div>{{GPU610/DPS915 Index | 20131}}<br />
= GPU n' Chill =<br />
== Team Members == <br />
# [mailto:dserpa@myseneca.ca?subject=gpu610 Daniel Serpa], Some responsibility <br />
# [mailto:fardad.soleimanloo@senecacollege.ca?subject=gpu610 Fardad Soleimanloo], Some other responsibility <br />
# ...<br />
[mailto:dserpa@myseneca.ca,chris.szalwinski@senecacollege.ca?subject=gpu610 Email All]<br />
<br />
== Progress ==<br />
=== Assignment 1 ===<br />
=== Assignment 2 ===<br />
=== Assignment 3 ===</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_Student_List_20191&diff=137599GPU610/DPS915 Student List 201912019-02-08T14:31:19Z<p>Dserpa: /* Student List */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add your information to the student list below!<br />
<br />
= Participation =<br />
== Student List Syntax ==<br />
Insert the following at the end of the table (if you are a student in GPU610/DPS915).<br /><br />
<big><pre>|[[User:WN | FN]] ||LN|| [[PN |GN]] ||SB|| [mailto:ID@myseneca.ca?subject=SB ID]<br />
|-</pre></big><br />
Replace the following with your own information: <br /><br />
* WN: Your Wiki User name<br />
* FN: Your First Name<br />
* LN: Your Last Name<br />
* PN: Your Group's Project Page Name on the wiki<br />
* GN: Your Group name<br />
* SB: Your Subject(example: GPU610)<br />
* ID: Your email ID (myseneca id)<br />
<br />
== Student List ==<br />
<br />
<br />
{| class="wikitable sortable" border="1" cellpadding="5"<br />
|+ GPU610/DPS915 - Student List<br />
! First Name !! Last Name !! Team Name !! Subject !! Seneca Id<br />
|-<br />
|[[User:Chris Szalwinski | Chris]]||Szalwinski||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:chris.szalwinski@senecacollege.ca?subject=gpu610 chris.szalwinski]<br />
|-<br />
|[[User:ysim2 | Yoosuk]]||Sim||[[Ghost Cells|Ghost Cells]]||DPS915||[mailto:ysim2@myseneca.ca?subject=dps915 ysim2]<br />
|-<br />
|[[User:rdittrich | Robert]] ||Dittrich|| [[Ghost Cells |Ghost Cells]] ||DPS915|| [mailto:rdittrich@myseneca.ca?subject=SB rdittrich]<br />
|-<br />
|[[User:Wwpark | Woosle]]||Park||[[DPS915_Student_List_20191|Place Holder]]||DPS915||[mailto:wwpark@myseneca.ca?subject=dps915 wwpark]<br />
|-<br />
|[[User:PriyankaDhiman | Priyanka]] ||Dhiman|| [[GPU610/DPS915 Sample Team Page|Team Name]] ||GPU610|| [mailto:pdhiman2@myseneca.ca?subject=SB pdhiman2]<br />
|-<br />
|[[User:Edgar Giang | Edgar]]||Giang||[[GPU610/DPS915 Sample Team Page|Algo-holics]]||GPU610||[mailto:egiang1@myseneca.ca?subject=GPU610 egiang1]<br />
|-<br />
|Dillon||Coull||-||GPU610||[mailto:dcoull@myseneca.ca?subject=gpu610 dcoull]<br />
|-<br />
|[[User:Akkabia | Abdul]]||Kabia||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:akkabia@myseneca.ca?subject=gpu610 akkabia]<br />
|-<br />
|[[User:Achisholm1 | Alex]]||achisholm1||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:achisholm1@myseneca.ca?subject=gpu610 achisholm1]<br />
|-<br />
|[[User:afaux | Andrew]] ||Faux|| [[GPU610 | Team Name]] ||GPU610|| [mailto:afaux@myseneca.ca?subject=SB ID]<br />
|-<br />
|[[User:ssdhillon20 | Sukhbeer]]||Dhillon||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:ssdhillon20@myseneca.ca?subject=GPU610 ssdhillon20]<br />
|-<br />
|[[User:gsingh520 | Gurpreet]]||Singh||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:gsingh520@myseneca.ca?subject=GPU610 gsingh520]<br />
|-<br />
|[[User:DavidFerri | David]] ||Ferri|| [[triForce |triForce]] ||GPU610|| [mailto:dpferri@myseneca.ca?subject=GPU610 dpferri]<br />
|-<br />
|[[User:Vincent Terpstra | Vincent]] ||Terpstra|| [[triForce |triForce]] ||GPU610|| [mailto:vterpstra@myseneca.ca?subject=GPU610 vterpstra]<br />
|-<br />
|[[User:Raymond Kiguru | Raymond]] ||Kiguru|| [[triForce |triForce]] ||GPU610|| [mailto:rkiguru@myseneca.ca?subject=GPU610 rkiguru]<br />
|-<br />
|[[User: Xiaowei Huang | Xiaowei]]||Huang||[[GPU610/DPS915 Sample Team Page|group 6]]||GPU610||[mailto:xhuang110@myseneca.ca?subject=gpu610 xhuang110]<br />
|-<br />
|[[User:Akshat | Akshatkumar]] ||Patel|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||DPS915|| [mailto:apatel271@myseneca.ca?subject=dps915 apatel271]<br />
|-<br />
|[[User:Yyuan34 | Yihang]] ||Yuan|| [[GPU610/DPS915 Sample Team Page |group 6]] ||GPU610|| [mailto:yyuan34@myseneca.ca?subject=gpu610 yyuan34]<br />
|-<br />
|[[User:Dserpa | Daniel]] ||Serpa|| [[GPU610/gpuchill|GPU n' Chill]] ||GPU610|| [mailto:dserpa@myseneca.ca?subject=GPU610 dserpa]<br />
|-<br />
|[[User:jtardif1 | Josh]] ||Tardif|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:jtardif1@myseneca.ca?subject=GPU610 jtardif1]<br />
|-<br />
|[[User:Henryleung | Henry]] ||Leung|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:hleung16@myseneca.ca?subject=GPU610 hleung16]<br />
|-<br />
|[[User:zzhou33 | Zhijian]] ||Zhou|| [[GPU610/DPS915 | group 6]] ||DPS915|| [mailto:zzhou33@myseneca.ca?subject=DPS915 zzhou33]<br />
|-<br />
|}</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_Student_List_20191&diff=137592GPU610/DPS915 Student List 201912019-02-08T14:26:11Z<p>Dserpa: /* Student List */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add your information to the student list below!<br />
<br />
= Participation =<br />
== Student List Syntax ==<br />
Insert the following at the end of the table (if you are a student in GPU610/DPS915).<br /><br />
<big><pre>|[[User:WN | FN]] ||LN|| [[PN |GN]] ||SB|| [mailto:ID@myseneca.ca?subject=SB ID]<br />
|-</pre></big><br />
Replace the following with your own information: <br /><br />
* WN: Your Wiki User name<br />
* FN: Your First Name<br />
* LN: Your Last Name<br />
* PN: Your Group's Project Page Name on the wiki<br />
* GN: Your Group name<br />
* SB: Your Subject(example: GPU610)<br />
* ID: Your email ID (myseneca id)<br />
<br />
== Student List ==<br />
<br />
<br />
{| class="wikitable sortable" border="1" cellpadding="5"<br />
|+ GPU610/DPS915 - Student List<br />
! First Name !! Last Name !! Team Name !! Subject !! Seneca Id<br />
|-<br />
|[[User:Chris Szalwinski | Chris]]||Szalwinski||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:chris.szalwinski@senecacollege.ca?subject=gpu610 chris.szalwinski]<br />
|-<br />
|[[User:ysim2 | Yoosuk]]||Sim||[[Ghost Cells|Ghost Cells]]||DPS915||[mailto:ysim2@myseneca.ca?subject=dps915 ysim2]<br />
|-<br />
|[[User:rdittrich | Robert]] ||Dittrich|| [[Ghost Cells |Ghost Cells]] ||DPS915|| [mailto:rdittrich@myseneca.ca?subject=DPS915 rdittrich]<br />
|-<br />
|[[User:Wwpark | Woosle]]||Park||[[DPS915_Student_List_20191|Place Holder]]||DPS915||[mailto:wwpark@myseneca.ca?subject=dps915 wwpark]<br />
|-<br />
|[[User:PriyankaDhiman | Priyanka]] ||Dhiman|| [[GPU610/DPS915 Sample Team Page|Team Name]] ||GPU610|| [mailto:pdhiman2@myseneca.ca?subject=GPU610 pdhiman2]<br />
|-<br />
|[[User:Edgar Giang | Edgar]]||Giang||[[GPU610/DPS915 Sample Team Page|Algo-holics]]||GPU610||[mailto:egiang1@myseneca.ca?subject=GPU610 egiang1]<br />
|-<br />
|Dillon||Coull||-||GPU610||[mailto:dcoull@myseneca.ca?subject=gpu610 dcoull]<br />
|-<br />
|[[User:Akkabia | Abdul]]||Kabia||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:akkabia@myseneca.ca?subject=gpu610 akkabia]<br />
|-<br />
|[[User:Achisholm1 | Alex]]||achisholm1||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:achisholm1@myseneca.ca?subject=gpu610 achisholm1]<br />
|-<br />
|[[User:afaux | Andrew]] ||Faux|| [[GPU610 | Team Name]] ||GPU610|| [mailto:afaux@myseneca.ca?subject=GPU610 afaux]<br />
|-<br />
|[[User:ssdhillon20 | Sukhbeer]]||Dhillon||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:ssdhillon20@myseneca.ca?subject=GPU610 ssdhillon20]<br />
|-<br />
|[[User:gsingh520 | Gurpreet]]||Singh||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:gsingh520@myseneca.ca?subject=GPU610 gsingh520]<br />
|-<br />
|[[User:DavidFerri | David]] ||Ferri|| [[triForce |triForce]] ||GPU610|| [mailto:dpferri@myseneca.ca?subject=GPU610 dpferri]<br />
|-<br />
|[[User:Vincent Terpstra | Vincent]] ||Terpstra|| [[triForce |triForce]] ||GPU610|| [mailto:vterpstra@myseneca.ca?subject=GPU610 vterpstra]<br />
|-<br />
|[[User:Raymond Kiguru | Raymond]] ||Kiguru|| [[triForce |triForce]] ||GPU610|| [mailto:rkiguru@myseneca.ca?subject=GPU610 rkiguru]<br />
|-<br />
|[[User: Xiaowei Huang | Xiaowei]]||Huang||[[GPU610/DPS915 Sample Team Page|group 6]]||GPU610||[mailto:xhuang110@myseneca.ca?subject=gpu610 xhuang110]<br />
|-<br />
|[[User:Akshat | Akshatkumar]] ||Patel|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||DPS915|| [mailto:apatel271@myseneca.ca?subject=dps915 apatel271]<br />
|-<br />
|[[User:Yyuan34 | Yihang]] ||Yuan|| [[GPU610/DPS915 Sample Team Page |group 6]] ||GPU610|| [mailto:yyuan34@myseneca.ca?subject=gpu610 yyuan34]<br />
|-<br />
|[[User:Dserpa | Daniel]] ||Serpa|| [[GPU610/gpuchill |GPU n' Chill]] ||GPU610|| [mailto:dserpa@myseneca.ca?subject=GPU610 dserpa]<br />
|-<br />
|[[User:jtardif1 | Josh]] ||Tardif|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:jtardif1@myseneca.ca?subject=GPU610 jtardif1]<br />
|-<br />
|[[User:Henryleung | Henry]] ||Leung|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:hleung16@myseneca.ca?subject=GPU610 hleung16]<br />
|-<br />
|[[User:zzhou33 | Zhijian]] ||Zhou|| [[GPU610/DPS915 | group 6]] ||DPS915|| [mailto:zzhou33@myseneca.ca?subject=DPS915 zzhou33]<br />
|-<br />
|}</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_Student_List_20191&diff=137587GPU610/DPS915 Student List 201912019-02-08T14:24:56Z<p>Dserpa: /* Student List */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add your information to the student list below!<br />
<br />
= Participation =<br />
== Student List Syntax ==<br />
Insert the following at the end of the table (if you are a student in GPU610/DPS915).<br /><br />
<big><pre>|[[User:WN | FN]] ||LN|| [[PN |GN]] ||SB|| [mailto:ID@myseneca.ca?subject=SB ID]<br />
|-</pre></big><br />
Replace the following with your own information: <br /><br />
* WN: Your Wiki User name<br />
* FN: Your First Name<br />
* LN: Your Last Name<br />
* PN: Your Group's Project Page Name on the wiki<br />
* GN: Your Group name<br />
* SB: Your Subject(example: GPU610)<br />
* ID: Your email ID (myseneca id)<br />
<br />
== Student List ==<br />
<br />
<br />
{| class="wikitable sortable" border="1" cellpadding="5"<br />
|+ GPU610/DPS915 - Student List<br />
! First Name !! Last Name !! Team Name !! Subject !! Seneca Id<br />
|-<br />
|[[User:Chris Szalwinski | Chris]]||Szalwinski||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:chris.szalwinski@senecacollege.ca?subject=gpu610 chris.szalwinski]<br />
|-<br />
|[[User:ysim2 | Yoosuk]]||Sim||[[Ghost Cells|Ghost Cells]]||DPS915||[mailto:ysim2@myseneca.ca?subject=dps915 ysim2]<br />
|-<br />
|[[User:rdittrich | Robert]] ||Dittrich|| [[Ghost Cells |Ghost Cells]] ||DPS915|| [mailto:rdittrich@myseneca.ca?subject=DPS915 rdittrich]<br />
|-<br />
|[[User:Wwpark | Woosle]]||Park||[[DPS915_Student_List_20191|Place Holder]]||DPS915||[mailto:wwpark@myseneca.ca?subject=dps915 wwpark]<br />
|-<br />
|[[User:PriyankaDhiman | Priyanka]] ||Dhiman|| [[GPU610/DPS915 Sample Team Page|Team Name]] ||GPU610|| [mailto:pdhiman2@myseneca.ca?subject=GPU610 pdhiman2]<br />
|-<br />
|[[User:Edgar Giang | Edgar]]||Giang||[[GPU610/DPS915 Sample Team Page|Algo-holics]]||GPU610||[mailto:egiang1@myseneca.ca?subject=GPU610 egiang1]<br />
|-<br />
|Dillon||Coull||-||GPU610||[mailto:dcoull@myseneca.ca?subject=gpu610 dcoull]<br />
|-<br />
|[[User:Akkabia | Abdul]]||Kabia||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:akkabia@myseneca.ca?subject=gpu610 akkabia]<br />
|-<br />
|[[User:Achisholm1 | Alex]]||achisholm1||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:achisholm1@myseneca.ca?subject=gpu610 achisholm1]<br />
|-<br />
|[[User:afaux | Andrew]] ||Faux|| [[GPU610 | Team Name]] ||GPU610|| [mailto:afaux@myseneca.ca?subject=GPU610 afaux]<br />
|-<br />
|[[User:ssdhillon20 | Sukhbeer]]||Dhillon||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:ssdhillon20@myseneca.ca?subject=GPU610 ssdhillon20]<br />
|-<br />
|[[User:gsingh520 | Gurpreet]]||Singh||[[GPU610/DPS915 Sample Team Page |Algo-holics]]||GPU610||[mailto:gsingh520@myseneca.ca?subject=GPU610 gsingh520]<br />
|-<br />
|[[User:DavidFerri | David]] ||Ferri|| [[triForce |triForce]] ||GPU610|| [mailto:dpferri@myseneca.ca?subject=GPU610 dpferri]<br />
|-<br />
|[[User:Vincent Terpstra | Vincent]] ||Terpstra|| [[triForce |triForce]] ||GPU610|| [mailto:vterpstra@myseneca.ca?subject=GPU610 vterpstra]<br />
|-<br />
|[[User:Raymond Kiguru | Raymond]] ||Kiguru|| [[triForce |triForce]] ||GPU610|| [mailto:rkiguru@myseneca.ca?subject=GPU610 rkiguru]<br />
|-<br />
|[[User: Xiaowei Huang | Xiaowei]]||Huang||[[GPU610/DPS915 Sample Team Page|group 6]]||GPU610||[mailto:xhuang110@myseneca.ca?subject=gpu610 xhuang110]<br />
|-<br />
|[[User:Akshat | Akshatkumar]] ||Patel|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||DPS915|| [mailto:apatel271@myseneca.ca?subject=dps915 apatel271]<br />
|-<br />
|[[User:Yyuan34 | Yihang]] ||Yuan|| [[GPU610/DPS915 Sample Team Page |group 6]] ||GPU610|| [mailto:yyuan34@myseneca.ca?subject=gpu610 yyuan34]<br />
|-<br />
|[[User:Dserpa | Daniel]] ||Serpa|| [[GPU610/DPS915 Sample Team Page |GPU n' chill]] ||GPU610|| [mailto:dserpa@myseneca.ca?subject=GPU610 dserpa]<br />
|-<br />
|[[User:jtardif1 | Josh]] ||Tardif|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:jtardif1@myseneca.ca?subject=GPU610 jtardif1]<br />
|-<br />
|[[User:Henryleung | Henry]] ||Leung|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:hleung16@myseneca.ca?subject=GPU610 hleung16]<br />
|-<br />
|[[User:zzhou33 | Zhijian]] ||Zhou|| [[GPU610/DPS915 | group 6]] ||DPS915|| [mailto:zzhou33@myseneca.ca?subject=DPS915 zzhou33]<br />
|-<br />
|}</div>Dserpahttps://wiki.cdot.senecacollege.ca/w/index.php?title=GPU610/DPS915_Student_List_20191&diff=137218GPU610/DPS915 Student List 201912019-01-19T14:19:17Z<p>Dserpa: /* Student List */</p>
<hr />
<div>{{GPU610/DPS915 Index | 20191}}<br />
<br />
Please add your information to the student list below!<br />
<br />
= Participation =<br />
== Student List Syntax ==<br />
Insert the following at the end of the table (if you are a student in GPU610/DPS915).<br /><br />
<big><pre>|[[User:WN | FN]] ||LN|| [[PN |GN]] ||SB|| [mailto:ID@myseneca.ca?subject=SB ID]<br />
|-</pre></big><br />
Replace the following with your own information: <br /><br />
* WN: Your Wiki User name<br />
* FN: Your First Name<br />
* LN: Your Last Name<br />
* PN: Your Group's Project Page Name on the wiki<br />
* GN: Your Group name<br />
* SB: Your Subject(example: GPU610)<br />
* ID: Your email ID (myseneca id)<br />
<br />
== Student List ==<br />
<br />
<br />
{| class="wikitable sortable" border="1" cellpadding="5"<br />
|+ GPU610/DPS915 - Student List<br />
! First Name !! Last Name !! Team Name !! Subject !! Seneca Id<br />
|-<br />
|[[User:Chris Szalwinski | Chris]]||Szalwinski||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:chris.szalwinski@senecacollege.ca?subject=gpu610 chris.szalwinski]<br />
|-<br />
|[[User:ysim2 | Yoosuk]]||Sim||[[GPU610/DPS915 Sample Team Page|Team Name]]||DPS915||[mailto:ysim2@myseneca.ca?subject=dps915 ysim2]<br />
|-<br />
|[[User:rdittrich | Robert]] ||Dittrich|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||DPS915|| [mailto:rdittrich@myseneca.ca?subject=DPS915 rdittrich]<br />
|-<br />
|[[User:Wwpark | Woosle]]||Park||[[GPU610/DPS915 Sample Team Page|Team Name]]||DPS915||[mailto:wwpark@myseneca.ca?subject=dps915 wwpark]<br />
|-<br />
|[[User:PriyankaDhiman | Priyanka]] ||Dhiman|| [[GPU610/DPS915 Sample Team Page|Team Name]] ||GPU610|| [mailto:pdhiman2@myseneca.ca?subject=GPU610 pdhiman2]<br />
|-<br />
|[[User:Edgar Giang | Edgar]]||Giang||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:egiang1@myseneca.ca?subject=GPU610 egiang1]<br />
|-<br />
|Dillon||Coull||-||GPU610||[mailto:dcoull@myseneca.ca?subject=gpu610 dcoull]<br />
|-<br />
|[[User:Akkabia | Abdul]]||Kabia||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:akkabia@myseneca.ca?subject=gpu610 akkabia]<br />
|-<br />
|[[User:Achisholm1 | Alex]]||achisholm1||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:achisholm1@myseneca.ca?subject=gpu610 achisholm1]<br />
|-<br />
|[[User:ssdhillon20 | Sukhbeer]]||Dhillon||[[GPU610/DPS915 Sample Team Page |Team Name]]||GPU610||[mailto:ssdhillon20@myseneca.ca?subject=GPU610 ssdhillon20]<br />
|-<br />
|[[User:gsingh520 | Gurpreet]]||Singh||[[GPU610/DPS915 Sample Team Page |Team Name]]||GPU610||[mailto:gsingh520@myseneca.ca?subject=GPU610 gsingh520]<br />
|-<br />
|[[User:DavidFerri | David]] ||Ferri|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:dpferri@myseneca.ca?subject=GPU610 dpferri]<br />
|-<br />
|[[User:Vincent Terpstra | Vincent]] ||Terpstra|| [[TODO |Team Name]] ||GPU610|| [mailto:vterpstra@myseneca.ca?subject=GPU610 vterpstra]<br />
|-<br />
|[[User: Xiaowei Huang | Xiaowei]]||Huang||[[GPU610/DPS915 Sample Team Page|Team Name]]||GPU610||[mailto:xhuang110@myseneca.ca?subject=gpu610 xhuang110]<br />
|-<br />
|[[User:Akshat | Akshatkumar]] ||Patel|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||DPS915|| [mailto:apatel271@myseneca.ca?subject=dps915 apatel271]<br />
|-<br />
|[[User:Yyuan34 | Yihang]] ||Yuan|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:yyuan34@myseneca.ca?subject=gpu610 yyuan34]<br />
|-<br />
|[[User:Dserpa | Daniel]] ||Serpa|| [[GPU610/DPS915 Sample Team Page |Team Name]] ||GPU610|| [mailto:dserpa@myseneca.ca?subject=GPU610 dserpa]<br />
|-<br />
|}</div>Dserpa