# Changes

## Group 6

, 22:00, 7 April 2019
m
Assignment 3 - Optimize
# [mailto:xhuang110@myseneca.ca?subject=gpu610 Xiaowei Huang]
# [mailto:yyuan34@myseneca.ca?subject=gpu610 Yihang Yuan]
# [mailto:zzhou33@myseneca.ca?subject=dps915 Zhijian Zhou]
[mailto:xhuang110@myseneca.ca,yyuan34@myseneca.ca,zzhou33@myseneca.ca?subject=dps901-gpu610 Email All]
== Progress ==

=== Assignment 1 - Select and Assess ===
==== Xiaowei Array Processing ====
Subject: Array Processing
Blaise Barney introduced Parallel Computing https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesArrayArray processing could become one of the parallel example, which "demonstrates calculations on 2-dimensional array elements; a function is evaluated on each array element."
Here is my source code{| class="wikitable mw-collapsible mw-collapsed"! arrayProcessing.cpp |-| <pre> // arrayProcessing.cpp// Array processing implement parallel solution #include <iostream>#include <iomanip>#include <cstdlib>#include <ctime> void init(float** randomValue, int n) { //std::srand(std::time(nullptr)); float f =1.0f / RAND_MAX; for (int i =0; i < n; i++) for (int j = 0; j < n; j++) randomValue[i][j] = std::rand() * f;} void multiply(float** a, float** b, float** c, int n) { for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) { float sum = 0.0f; for (int k = Yihang 0; k < n; k++) sum +=a[i][k] * b[k][j]; c[i][j] =sum; if(n <=10){ std::cout << "array c[" << i << "," << j << "]: " << c[i][j] << std::endl; } }} int main(int argc, char* argv[]) { // interpret command-line argument if (argc !=2) { std::cerr << argv[0] << ": invalid number of arguments\n"; Subject std: :cerr << "Usage: " << argv[0] << " size_of_matrices\n"; return 1; } int n = std::atoi(argv[1]); // size of matrices  float** a = new float*[n]; for (int i = 0; i < n; i++) a[i] = new float[n]; float** b = new float*[n]; for (int i = 0; i < n; i++) b[i] = new float[n]; float** c = new float*[n]; for (int i = 0; i < n; i++) c[i] = new float[n]; std::srand(std::time(nullptr)); init(a, n); init(b, n);  multiply(a, b, c, n);  for (int i = 0; i < n; i++) delete [] a[i]; delete [] a; for (int i = 0; i < n; i++) delete [] b[i]; delete [] b; for (int i = 0; i < n; i++) delete [] c[i]; delete [] c;}</pre> |}  Standard random method is used to initialize a 2-dimentional array. The purpose of this program is to perform a 2-dimension array calculation, which is a matrix-matrix multiplication in this example.  In this following profile example, n = 1000<pre>Flat profile:Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 100.11 1.48 1.48 multiply(float**, float**, float**, int) 0.68 1.49 0.01 init(float**, int) 0.00 1.49 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z4initPPfi</pre> <pre>Call graph granularity: each sample hit covers 2 byte(s) for 0.67% of 1.49 seconds index % time self children called name <spontaneous>[1] 99.3 1.48 0.00 multiply(float**, float**, float**, int) [1]----------------------------------------------- <spontaneous>[2] 0.7 0.01 0.00 init(float**, int) [2]----------------------------------------------- 0.00 0.00 1/1 __libc_csu_init [16][10] 0.0 0.00 0.00 1 _GLOBAL__sub_I__Z4initPPfi [10]-----------------------------------------------Index by function name [10] _GLOBAL__sub_I__Z4initPPfi (arrayProcessing.cpp) [2] init(float**, int) [1] multiply(float**, float**, float**, int)</pre>From the call graph, multiply() took major runtime to more than 99%, as it contains 3 for-loop, which T(n) is O(n^3). Besides, init() also became the second busy one, which has a O(n^2).  As the calculation of elements is independent of one another - leads to an embarrassingly parallel solution. Arrays elements are evenly distributed so that each process owns a portion of the array (subarray). It can be solved in less time with multiple compute resources than with a single compute resource. ==== The Monte Carlo Simulation (PI Calculation) ====The Monte Carlo Simulation (PI Calculation)
Got the code from here:
https://rosettacode.org/wiki/Monte_Carlo_methods#C.2B.2B
A Monte Carlo Simulation is a way of approximating the value of a function where calculating the actual value is difficult or impossible.

It uses random sampling to define constraints on the value and then makes a sort of "best guess."

{| class="wikitable mw-collapsible mw-collapsed"
! Source Code
|-
|
<pre>
#include<iostream>
#include<fstream>
#include<math.h>
#include<stdlib.h>
#include<time.h>

using namespace std;

void calculatePI(int n, float* h_a) {
float x, y;
int hit;
srand(time(NULL));
for (int j = 0; j < n; j++) {
hit = 0;
x = 0;
y = 0;
for (int i = 0; i < n; i++) {
x = float(rand()) / float(RAND_MAX);
y = float(rand()) / float(RAND_MAX);
if (y <= sqrt(1 - (x * x))) {
hit += 1;
}
}
h_a[j] = 4 * float(hit) / float(n);
}
}

int main(int argc, char* argv[]) {

if (argc != 2) {
std::cerr << argv[0] << ": invalid number of arguments\n";
std::cerr << "Usage: " << argv[0] << " size_of_matrices\n";
return 1;
}
int n = std::atoi(argv[1]); // scale
float* cpu_a;
cpu_a = new float[n];

calculatePI(n, cpu_a);

ofstream h_file;
h_file.open("h_result.txt");
float cpuSum = 0.0f;
for (int i = 0; i < n; i++) {
cpuSum += cpu_a[i];
h_file << "Host: " << cpu_a[i] << endl;
}
cpuSum = cpuSum / (float)n;
cout << "CPU Result: " << cpuSum << endl;
h_file.close();
}

</pre>
|}

As this algorithm is based on random sampling, so there is only one function that does all the work.
Flat profile:
<pre>
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
101.05 0.02 0.02 calculatePI(int, float*)
0.00 0.02 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z11calculatePIiPf
</pre>

Call graph
<pre>
granularity: each sample hit covers 2 byte(s) for 0.47% of 2.11 seconds

index % time self children called name
<spontaneous>
[1] 100.0 2.11 0.00 calculatePI(int, float*) [1]
-----------------------------------------------
0.00 0.00 1/1 __libc_csu_init [17]
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I__Z11calculatePIiPf [9]
-----------------------------------------------
Index by function name

[9] _GLOBAL__sub_I__Z11calculatePIiPf (a1.cpp) [1] calculatePI(int, float*)
</pre>

'''Results for different scale of calculation'''

[[File:Yihang.JPG]]