The program can then be executed by running the compiled binary; it will display the time it took to generate the Mandelbrot set and save the pictures.
{| class="wikitable mw-collapsible mw-collapsed"
! Mandelbrot CPU( ... )
|-
|<syntaxhighlight lang="cpp">
#include <iostream>
#include <complex>
#include <vector>
#include <chrono>
#include <functional>
// Use an alias to simplify the use of complex type
using Complex = std::complex<float>;
// Check whether a point escapes to infinity within iter_max iterations
int escape(Complex c, int iter_max, const std::function<Complex(Complex, Complex)> &func) {
	Complex z(0);
	int iter = 0;
	while (abs(z) < 2.0 && iter < iter_max) {
		z = func(z, c);
		iter++;
	}
	return iter;
}
// Loop over each pixel from our image and check if the points associated with this pixel escape to infinity
void get_number_iterations(window<int> &scr, window<float> &fract, int iter_max, std::vector<int> &colors,
	const std::function<Complex(Complex, Complex)> &func) {
	int k = 0, progress = -1;
	for (int i = scr.y_min(); i < scr.y_max(); ++i) {
		for (int j = scr.x_min(); j < scr.x_max(); ++j) {
			Complex c((float)j, (float)i);
			c = scale(scr, fract, c);
			colors[k] = escape(c, iter_max, func);
			k++;
		}
		if (progress < (int)(i*100.0/scr.y_max())) {
			progress = (int)(i*100.0/scr.y_max());
			std::cout << progress << "%\n";
		}
	}
}
void mandelbrot() {
	// Define the size of the image
	window<int> scr(0, 1000, 0, 1000);
	// The domain in which we test for points
	window<float> fract(-2.2, 1.2, -1.7, 1.7);
	// ... (iteration count, escape function, and plotting follow)
}
</syntaxhighlight>
|}
=== Observations ===
The program takes a significant amount of time to run as the calculations are done on the CPU in a single thread. There are nested loops present within the program that can be parallelized to make the program faster. The size of the image and the iteration count are also hard-coded; they can be increased to make the program take significantly longer to process, which makes it tough on GPUs for benchmarking and stability testing by running the process in a loop. The code is relatively straightforward, so the parallelization should be easy to implement and test.
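Each pixel's escape count is independent of every other pixel, so the nested loops above are embarrassingly parallel. As a rough host-side sketch (hypothetical helper names and a hard-coded domain, using plain std::thread rather than CUDA), the rows of the image can be split across workers:

```cpp
#include <complex>
#include <thread>
#include <vector>

// Hypothetical sketch: rows are distributed round-robin across CPU threads.
// The same row/column decomposition later maps onto a CUDA grid.
static int escape_count(std::complex<float> c, int iter_max) {
    std::complex<float> z(0.0f, 0.0f);
    int iter = 0;
    while (std::abs(z) < 2.0f && iter < iter_max) {
        z = z * z + c;   // the Mandelbrot map, z = func(z, c) specialized
        ++iter;
    }
    return iter;
}

std::vector<int> mandelbrot_rows(int w, int h, int iter_max, int nthreads) {
    std::vector<int> colors(w * h);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = t; i < h; i += nthreads)   // round-robin row split
                for (int j = 0; j < w; ++j) {
                    // scale pixel (j, i) into the domain [-2.2, 1.2] x [-1.7, 1.7]
                    std::complex<float> c(-2.2f + 3.4f * j / w, -1.7f + 3.4f * i / h);
                    colors[i * w + j] = escape_count(c, iter_max);
                }
        });
    }
    for (auto &th : pool) th.join();
    return colors;
}
```

Because no pixel depends on another, the threaded result is bit-identical to the serial one regardless of the number of workers.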
=== Hotspot ===
Essentially all of the running time is spent doing calculations on vectors. The dowork function iteratively calls the CRO_step function found in the integrators.h file. The CRO_step function is where most of the vector calculations take place. A large amount of work is also done in the calculate_a function, which is used to calculate the acceleration of all the planets.
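The profiled function names suggest the shape of the vect class. The real integrators.h is not part of this excerpt, so the following is a hypothetical reconstruction; it illustrates why operator* and operator- dominate: each forwards to its compound-assignment form, which matches the one-to-one call counts (e.g. 118268959 calls each to operator* and operator*=) in the profile.

```cpp
#include <cmath>

// Hypothetical reconstruction of the vect class implied by the profiled
// signatures (operator*, operator*=, operator+=, operator-, operator-=, mag).
// This is a sketch only, not the actual integrators.h code.
struct vect {
    double x, y, z;
    vect() : x(0), y(0), z(0) {}
    vect(double x0, double y0, double z0) : x(x0), y(y0), z(z0) {}
    vect& operator*=(const double &s) { x *= s; y *= s; z *= s; return *this; }
    // operator* forwarding to operator*= would explain their identical call
    // counts in the gprof output
    vect operator*(const double &s) const { vect r(*this); r *= s; return r; }
    vect& operator+=(const vect &o) { x += o.x; y += o.y; z += o.z; return *this; }
    vect& operator-=(const vect &o) { x -= o.x; y -= o.y; z -= o.z; return *this; }
    vect operator-(const vect &o) const { vect r(*this); r -= o; return r; }
    double mag() const { return std::sqrt(x * x + y * y + z * z); }
};
```

Every temporary returned by operator* and operator- contributes to the hotspot time, which is why inlining these operators or fusing the arithmetic inside calculate_a is a natural first optimization.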
 -----------------------------------------------
                 0.14    6.01   89870/89870       dowork(double) [3]
 [2]     99.6    0.14    6.01   89870         CRO_step(double, void (*)()) [2]
                 1.18    4.22  359480/359480      calculate_a() [4]
                 0.20    0.29 20130880/118268959  vect::operator*(double const&) [5]
                 0.12    0.00 10065440/75490814   vect::operator+=(vect const&) [7]
 -----------------------------------------------
                 1.18    4.22  359480/359480      CRO_step(double, void (*)()) [2]
 [4]     87.5    1.18    4.22  359480         calculate_a() [4]
                 1.00    1.39 98138040/118268959  vect::operator*(double const&) [5]
                 0.78    0.00 65425360/75490814   vect::operator+=(vect const&) [7]
                 0.26    0.37 32712680/32712799   vect::operator-(vect const&) [8]
                 0.32    0.00 32712680/32712785   vect::mag() [10]
                 0.08    0.00 5032720/5032775     vect::operator=(vect const&) [11]
                 0.01    0.00 5032720/5032775     vect::vect(double, double, double) [13]
 -----------------------------------------------
                 0.00    0.00      14/118268959   totalL() [14]
                 0.20    0.29 20130880/118268959  CRO_step(double, void (*)()) [2]
                 1.00    1.39 98138040/118268959  calculate_a() [4]
 [5]     46.5    1.20    1.67 118268959         vect::operator*(double const&) [5]
                 1.67    0.00 118268959/118268959 vect::operator*=(double const&) [6]
 -----------------------------------------------
                 1.67    0.00 118268959/118268959 vect::operator*(double const&) [5]
 [6]     27.1    1.67    0.00 118268959         vect::operator*=(double const&) [6]
 -----------------------------------------------
                 0.00    0.00      14/75490814    totalL() [14]
                 0.12    0.00 10065440/75490814   CRO_step(double, void (*)()) [2]
                 0.78    0.00 65425360/75490814   calculate_a() [4]
 [7]     14.6    0.91    0.00 75490814          vect::operator+=(vect const&) [7]
 -----------------------------------------------
                 0.00    0.00      28/32712799    main [1]
                 0.00    0.00      91/32712799    totalE() [16]
                 0.26    0.37 32712680/32712799   calculate_a() [4]
 [8]     10.4    0.27    0.38 32712799          vect::operator-(vect const&) [8]
                 0.38    0.00 32712799/32712799   vect::operator-=(vect const&) [9]
 -----------------------------------------------
                 0.38    0.00 32712799/32712799   vect::operator-(vect const&) [8]
 [9]      6.1    0.38    0.00 32712799          vect::operator-=(vect const&) [9]
 -----------------------------------------------
                 0.00    0.00     105/32712785    totalE() [16]
                 0.32    0.00 32712680/32712785   calculate_a() [4]
 [10]     5.2    0.32    0.00 32712785          vect::mag() [10]
 -----------------------------------------------
                 0.00    0.00      14/5032775     main [1]
                 0.00    0.00      41/5032775     initialize() [17]
                 0.08    0.00 5032720/5032775     calculate_a() [4]
 [11]     1.4    0.08    0.00 5032775           vect::operator=(vect const&) [11]
 -----------------------------------------------
                 0.00    0.00      14/5032775     cross(vect const&, vect const&) [15]
                 0.00    0.00      41/5032775     initialize() [17]
                 0.01    0.00 5032720/5032775     calculate_a() [4]
 [13]     0.2    0.01    0.00 5032775           vect::vect(double, double, double) [13]
 -----------------------------------------------
                 0.00    0.01       1/1           main [1]
 [14]     0.1    0.00    0.01       1          totalL() [14]
                 0.01    0.00      14/14          cross(vect const&, vect const&) [15]
                 0.00    0.00      14/118268959   vect::operator*(double const&) [5]
                 0.00    0.00      14/75490814    vect::operator+=(vect const&) [7]
                 0.00    0.00       1/85          vect::vect() [21]
 -----------------------------------------------
                 0.01    0.00      14/14          totalL() [14]
 [15]     0.1    0.01    0.00      14          cross(vect const&, vect const&) [15]
                 0.00    0.00      14/5032775     vect::vect(double, double, double) [13]
 -----------------------------------------------

{| class="wikitable mw-collapsible mw-collapsed"
! NBody gprof Complete Data (Warning: long)
|-
| Call graph (explanation follows)

 granularity: each sample hit covers 4 byte(s) for 0.16% of 6.18 seconds

 index % time    self  children    called     name
                                                  <spontaneous>
 [1]     99.7    0.00    6.16                 main [1]
                 0.00    6.15       1/1           dowork(double) [3]
                 0.00    0.01       1/1           totalL() [14]
                 0.00    0.00       1/1           totalE() [16]
                 0.00    0.00       1/1           initialize() [17]
                 0.00    0.00      28/32712799    vect::operator-(vect const&) [8]
                 0.00    0.00      14/118268959   vect::operator*(double const&) [5]
                 0.00    0.00      14/5032775     vect::operator=(vect const&) [11]
                 0.00    0.00      42/42          std::vector<int, std::allocator<int> >::operator[](unsigned int) [22]
                 0.00    0.00      16/16          bool std::operator==<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) [33]
                 0.00    0.00      15/35          std::vector<int, std::allocator<int> >::size() const [23]
                 0.00    0.00      14/14          std::vector<int, std::allocator<int> >::push_back(int const&) [39]
                 0.00    0.00      14/14          getobj(int) [36]
                 0.00    0.00       3/3           std::vector<double, std::allocator<double> >::operator[](unsigned int) [90]
                 0.00    0.00       2/2           print_hline() [94]
                 0.00    0.00       2/10          std::vector<double, std::allocator<double> >::size() const [45]
|}
An even better way would be to integrate the Gaussian function instead of just taking point samples. Refer to the two graphs on the right.<br/>
The graphs plot the continuous distribution function and the discrete kernel approximation. One thing to look out for is the tails of the distribution vs. the kernel support:<br/>
For the current configuration, we have 13.36% of the curve's area outside the discrete kernel. Note that the weights are renormalized such that the sum of all weights is one; in other words,<br/>
the probability mass outside the discrete kernel is redistributed evenly to all pixels within the kernel. The weights are calculated by numerical integration of the continuous Gaussian distribution.<br/>
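The integration scheme described above mirrors the GaussianKernelIntegrals routine used in the program. A standalone sketch (illustrative names, double precision for clarity):

```cpp
#include <cmath>
#include <vector>

// Each tap's weight is the integral of the (unnormalized) Gaussian over that
// pixel's footprint [x - 0.5, x + 0.5], approximated with Simpson's rule; the
// weights are then renormalized so they sum to one.
double gaussian(double sigma, double x) {
    return std::exp(-(x * x) / (2.0 * sigma * sigma));
}

std::vector<double> kernelWeights(double sigma, int taps) {
    std::vector<double> w;
    double total = 0.0;
    for (int i = 0; i < taps; ++i) {
        double x = double(i) - double(taps / 2);   // tap position, centred on 0
        double a = x - 0.5, b = x + 0.5;           // pixel footprint
        // Simpson's rule on [a, b]
        double v = ((b - a) / 6.0) *
                   (gaussian(sigma, a) + 4.0 * gaussian(sigma, (a + b) / 2.0) + gaussian(sigma, b));
        w.push_back(v);
        total += v;
    }
    for (auto &e : w) e /= total;                  // renormalize: weights sum to 1
    return w;
}
```

After renormalization the weights are symmetric about the centre tap and sum to exactly one, which is what redistributes the tail mass back into the kernel.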
To compile and run the program:
# Navigate to the directory you want to run the program in.
# Save [http://matrix.senecac.on.ca/~cpaul12/cinque_terre.bmp this] image and place it into the directory you will be running the program from.
# Copy the Linux version of the main source code above and paste it into a [your chosen file name].cpp file.
# Copy the Linux version of the header source code above and paste it into a file named windows.h.
This program is a good candidate for parallelization using CUDA. The sigma (σ) and the kernel size can be increased in order to make the computation stressful on the GPU and obtain a significant benchmark.
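How the kernel width grows with sigma can be sketched as follows. The threshold-based rule is an assumption modeled on the program's PixelsNeededForSigma helper (the exact formula is not shown in this excerpt); the `| 1` matches how main forces an odd kernel size:

```cpp
#include <cmath>

// Assumed sketch of PixelsNeededForSigma: pick the smallest kernel that covers
// the Gaussian down to a small threshold value. Larger sigma -> wider kernel
// -> more samples per pixel -> more GPU work.
int pixelsNeededForSigma(float sigma, float threshold = 0.005f) {
    return int(std::floor(1.0f + 2.0f * std::sqrt(-2.0f * sigma * sigma * std::log(threshold)))) + 1;
}
```

For example, `pixelsNeededForSigma(3.0f) | 1` yields a modest odd kernel, while tripling sigma roughly triples the kernel width, so per-pixel cost scales linearly with sigma in each blur pass.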
= Assignment 2/3 - Parallelize & Optimize =
* For Gaussian blur we say it's unoptimized because we feel that there is more that can be done to reduce the execution times.<br/> The code displayed in the code snippets does use CUDA parallel constructs and fine-tuning techniques such as streaming/async.

== Gaussian Blur ==
{| class="wikitable mw-collapsible mw-collapsed"
! Unoptimized* - BlurImage( ... )
|-
|<syntaxhighlight lang="cpp">
#include <windows.h> // for bitmap headers.
#include <algorithm>
#include <chrono>
#include <cuda_runtime.h>
#include <device_functions.h>
//#ifdef __CUDACC__
//#if __CUDACC_VER_MAJOR__ == 1
//const int ntpb = 512;
//#else
//const int ntpb = 1024;
//#endif
//#endif
const int ntpb = 1024;
const int STREAMS = 32;
const float c_pi = 3.14159265359f;
void check(cudaError_t error) {
	if (error != cudaSuccess) {
		throw std::exception(cudaGetErrorString(error));
	}
}
__global__ void blur_kernel(BGRPixel* imageIn, BGRPixel* imageOut, float* blur, int n_blur, int x, int start, int jump) {
	int idx = blockDim.x*blockIdx.x + threadIdx.x; // Location on the row
	if (idx < x) {
		int id = start + idx;
		int bstart = id - (n_blur / 2)*jump;
		BGRPixel pixel{ 0.0f, 0.0f, 0.0f };
		for (int i = 0; i < n_blur; ++i) {
			int bid = bstart + i*jump;
			float iblur = blur[i];
			pixel.b += imageIn[bid].b * iblur;
			pixel.g += imageIn[bid].g * iblur;
			pixel.r += imageIn[bid].r * iblur;
		}
		imageOut[id].b = pixel.b;
		imageOut[id].g = pixel.g;
		imageOut[id].r = pixel.r;
	}
}
void BlurImage(const SImageData& srcImage, SImageData &destImage, float xblursigma, float yblursigma, unsigned int xblursize, unsigned int yblursize)
{
	int xImage = srcImage.m_width;           // Width of image
	int yImage = srcImage.m_height;          // Height of image
	int imageSize = xImage*yImage;
	int xPadded = xImage + (xblursize - 1);  // Width including padding
	int yPadded = yImage + (yblursize - 1);  // Height including padding
	int paddedSize = xPadded*yPadded;
	int xPad = xblursize / 2;                // Number of padding columns on each side
	int yPad = yblursize / 2;
	int padOffset = xPadded*yPad + xPad;     // Offset to first pixel in padded image

	float* pinnedImage = nullptr;
	BGRPixel* d_padded1 = nullptr;
	BGRPixel* d_padded2 = nullptr;
	float* d_xblur = nullptr;                // XBlur integrals
	int n_xblur;
	float* d_yblur = nullptr;                // YBlur integrals
	int n_yblur;

	// Allocate memory for host and device
	check(cudaHostAlloc((void**)&pinnedImage, 3 * imageSize * sizeof(float), 0));
	check(cudaMalloc((void**)&d_padded1, paddedSize * sizeof(BGRPixel)));
	check(cudaMalloc((void**)&d_padded2, paddedSize * sizeof(BGRPixel)));

	// Copy image to pinned memory
	for (int i = 0; i < 3 * imageSize; ++i) {
		pinnedImage[i] = (float)srcImage.m_pixels[i];
	}

	// Allocate and assign integrals
	{
		auto row_blur = GaussianKernelIntegrals(xblursigma, xblursize);
		auto col_blur = GaussianKernelIntegrals(yblursigma, yblursize);
		// ROW
		n_xblur = row_blur.size();
		check(cudaMalloc((void**)&d_xblur, n_xblur * sizeof(float)));
		check(cudaMemcpy(d_xblur, row_blur.data(), n_xblur * sizeof(float), cudaMemcpyHostToDevice));
		// COLUMN
		n_yblur = col_blur.size();
		check(cudaMalloc((void**)&d_yblur, n_yblur * sizeof(float)));
		check(cudaMemcpy(d_yblur, col_blur.data(), n_yblur * sizeof(float), cudaMemcpyHostToDevice));
	}

	cudaStream_t stream[STREAMS];
	int nblks = (xImage + (ntpb - 1)) / ntpb;
	for (int i = 0; i < STREAMS; ++i) {
		check(cudaStreamCreate(&stream[i]));
	}

	// Copy rows of the image to the device asynchronously
	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			cudaMemcpyAsync(d_padded1 + padOffset + i*xPadded, pinnedImage + (3 * i*xImage), 3 * xImage * sizeof(float), cudaMemcpyHostToDevice, stream[j]);
		}
	}
	// Horizontal blur: d_padded1 -> d_padded2, stepping along the row
	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			blur_kernel << <nblks, ntpb, 0, stream[j] >> > (d_padded1, d_padded2, d_xblur, n_xblur, xImage, padOffset + i*xPadded, 1);
		}
	}

	// Vertical blur: d_padded2 -> d_padded1, stepping down the column
	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			blur_kernel << <nblks, ntpb, 0, stream[j] >> > (d_padded2, d_padded1, d_yblur, n_yblur, xImage, padOffset + i*xPadded, xPadded);
		}
	}
	// Copy the blurred image back to pinned host memory
	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			check(cudaMemcpyAsync(pinnedImage + (3 * i*xImage), d_padded1 + padOffset + i*xPadded, xImage * sizeof(BGRPixel), cudaMemcpyDeviceToHost, stream[j]));
		}
	}

	for (int i = 0; i < STREAMS; ++i) {
		check(cudaStreamSynchronize(stream[i]));
		check(cudaStreamDestroy(stream[i]));
	}
destImage.m_width = srcImage.m_width;
destImage.m_height = srcImage.m_height;
	destImage.m_pitch = srcImage.m_pitch;
	destImage.m_pixels.resize(srcImage.m_pixels.size());
	for (int i = 0; i < 3 * imageSize; i++) {
		destImage.m_pixels[i] = (uint8_t)pinnedImage[i];
	}

	check(cudaFree(d_xblur));
	check(cudaFree(d_yblur));
	check(cudaFreeHost(pinnedImage));
	check(cudaFree(d_padded1));
	check(cudaFree(d_padded2));
	check(cudaDeviceReset());
}

int main(int argc, char **argv)
{
	float xblursigma, yblursigma;

	bool showUsage = argc < 5 ||
		(sscanf(argv[3], "%f", &xblursigma) != 1) ||
		(sscanf(argv[4], "%f", &yblursigma) != 1);

	char *srcFileName = argv[1];
	char *destFileName = argv[2];

	if (showUsage)
	{
		printf("Usage: <source> <dest> <xblur> <yblur>\nBlur values are sigma\n\n");
		WaitForEnter();
		return 1;
	}

	// calculate pixel sizes, and make sure they are odd
	int xblursize = PixelsNeededForSigma(xblursigma) | 1;
	int yblursize = PixelsNeededForSigma(yblursigma) | 1;

	printf("Attempting to blur a 24 bit image.\n");
	printf("  Source=%s\n  Dest=%s\n  blur=[%0.1f, %0.1f] px=[%d,%d]\n\n", srcFileName, destFileName, xblursigma, yblursigma, xblursize, yblursize);

	SImageData srcImage;
	if (LoadImage(srcFileName, srcImage))
	{
		printf("%s loaded\n", srcFileName);
		SImageData destImage;
		auto t1 = std::chrono::high_resolution_clock::now();
		BlurImage(srcImage, destImage, xblursigma, yblursigma, xblursize, yblursize);
		auto t2 = std::chrono::high_resolution_clock::now();
		std::cout << "BlurImage time: " << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << "us" << std::endl;
		if (SaveImage(destFileName, destImage))
			printf("Blurred image saved as %s\n", destFileName);
		else
		{
			printf("Could not save blurred image as %s\n", destFileName);
			WaitForEnter();
			return 1;
		}
	}
	else
	{
		printf("could not read 24 bit bmp file %s\n\n", srcFileName);
		WaitForEnter();
		return 1;
	}
	return 0;
}
</syntaxhighlight>
|}

== Objectives ==
The main objective was to not change the main function. This objective was met, although code had to be added for profiling.

== Steps ==
=== Host Memory Management ===
In the original program a bmp is loaded into a vector of uint8_t. This is not ideal for CUDA, so an array of pinned memory was allocated instead. This array contains the same number of elements but stores them as a structure, "BGRPixel", which is three contiguous floats.
The vector is then transferred over to pinned memory.
{| class="wikitable mw-collapsible mw-collapsed"
! Host Memory Management - Code( ... )
|-
|<syntaxhighlight lang="cpp">
struct SImageData
{
	SImageData() : m_width(0), m_height(0) { }
	long m_width;
	long m_height;
	long m_pitch;
	std::vector<uint8_t> m_pixels;
};

struct BGRPixel {
	float b;
	float g;
	float r;
};

void BlurImage(const SImageData& srcImage, SImageData &destImage, float xblursigma, float yblursigma, unsigned int xblursize, unsigned int yblursize)
{
	int xImage = srcImage.m_width;           // Width of image
	int yImage = srcImage.m_height;          // Height of image
	int imageSize = xImage*yImage;
	int xPadded = xImage + (xblursize - 1);  // Width including padding
	int yPadded = yImage + (yblursize - 1);  // Height including padding
	int paddedSize = xPadded*yPadded;
	int xPad = xblursize / 2;                // Number of padding columns on each side
	int yPad = yblursize / 2;
	int padOffset = xPadded*yPad + xPad;     // Offset to first pixel in padded image

	float* pinnedImage = nullptr;
	BGRPixel* d_padded1 = nullptr;
	BGRPixel* d_padded2 = nullptr;
	// ...

	// Allocate memory for host and device
	check(cudaHostAlloc((void**)&pinnedImage, 3 * imageSize * sizeof(float), 0));
	check(cudaMalloc((void**)&d_padded1, paddedSize * sizeof(BGRPixel)));
	check(cudaMalloc((void**)&d_padded2, paddedSize * sizeof(BGRPixel)));

	// Copy image to pinned memory
	for (int i = 0; i < 3 * imageSize; ++i) {
		pinnedImage[i] = (float)srcImage.m_pixels[i];
	}
	// ...
}
</syntaxhighlight>
|}

=== Device Memory Management ===
To get a blurred pixel the surrounding pixels must be sampled; in some cases this means sampling pixels outside the bounds of the image. In the original program a simple if check was used to determine whether a pixel was outside the bounds of the image, and if it was, a black pixel was returned instead. This if statement would most likely have caused massive thread divergence in a kernel, so the images created in device memory feature additional padding of black pixels to compensate. Two such images were created: one to perform the horizontal blur and one to perform the vertical blur. Other small device arrays were also needed to store the Gaussian integrals that are used to produce the blurring effect.<br>
{| class="wikitable mw-collapsible mw-collapsed"
! Padding example
|-
| <div style="display:inline;">[[File:shrunk.png]]</div><div style="display:inline;">[[File:pad.png]]</div><br>
This is how the image would be padded for a 3x3 sigma blur.<br>
The original image is 2560x1600 -> 11.7MB.<br>
With blur sigmas [x = 3, y = 3] and conversion to float, the padded images will be 2600x1640 -> 48.8MB:<br>
an increase of 4.1% in pixels and, with the conversion from uint8_t to float, a total increase of 317% in memory requirements on the GPU.<br>
Since two padded images are needed, at least 97.6MB will be on the GPU.
|}

=== Host to Device ===
To copy the pinned image to the device, an array of streams was used to asynchronously copy each row of the image over.
Doing so allowed the rows to be easily copied over without infringing on the extra padding pixels.

=== Kernels ===
First, one image is blurred horizontally: one image is used as a reference while the other is written to. Kernels are also executed using the streams, so that each stream blurs a single row at a time. After the horizontal blur is finished, the vertical blur is launched in the same manner, except that the previously written-to image is used as the reference while the previous reference is now written to. The two blurs are able to use the same kernel because the pixel sampling technique works by stepping through pixels; the step size is simply changed to sample across the row or down the column.

=== Device to Host ===
After that is done, the image is copied back using the streams in the same way it was copied over.

=== Code ===
{| class="wikitable mw-collapsible mw-collapsed"
! Unoptimized* - BlurImage -- Excerpt( ... )
|-
|<syntaxhighlight lang="cpp">
const int ntpb = 1024;
const int STREAMS = 32;

void check(cudaError_t error) {
	if (error != cudaSuccess) {
		throw std::exception(cudaGetErrorString(error));
	}
}

struct SImageData
{
	SImageData() : m_width(0), m_height(0) { }
	long m_width;
	long m_height;
	long m_pitch;
	std::vector<uint8_t> m_pixels;
};

float Gaussian(float sigma, float x)
{
	return expf(-(x*x) / (2.0f * sigma*sigma));
}

float GaussianSimpsonIntegration(float sigma, float a, float b)
{
	return ((b - a) / 6.0f) * (Gaussian(sigma, a) + 4.0f * Gaussian(sigma, (a + b) / 2.0f) + Gaussian(sigma, b));
}

std::vector<float> GaussianKernelIntegrals(float sigma, int taps)
{
	std::vector<float> ret;
	float total = 0.0f;
	for (int i = 0; i < taps; ++i)
	{
		float x = float(i) - float(taps / 2);
		float value = GaussianSimpsonIntegration(sigma, x - 0.5f, x + 0.5f);
		ret.push_back(value);
		total += value;
	}
	// normalize it
	for (unsigned int i = 0; i < ret.size(); ++i)
	{
		ret[i] /= total;
	}
	return ret;
}

struct BGRPixel {
	float b;
	float g;
	float r;
};

__global__
void blur_kernel(BGRPixel* imageIn, BGRPixel* imageOut, float* blur, int n_blur, int x, int start, int jump) {
	int idx = blockDim.x*blockIdx.x + threadIdx.x; // Location on the row
	if (idx < x) {
		int id = start + idx;
		int bstart = id - (n_blur / 2)*jump;
		BGRPixel pixel{ 0.0f, 0.0f, 0.0f };
		for (int i = 0; i < n_blur; ++i) {
			int bid = bstart + i*jump;
			float iblur = blur[i];
			pixel.b += imageIn[bid].b * iblur;
			pixel.g += imageIn[bid].g * iblur;
			pixel.r += imageIn[bid].r * iblur;
		}
		imageOut[id].b = pixel.b;
		imageOut[id].g = pixel.g;
		imageOut[id].r = pixel.r;
	}
}

void BlurImage(const SImageData& srcImage, SImageData &destImage, float xblursigma, float yblursigma, unsigned int xblursize, unsigned int yblursize)
{
	int xImage = srcImage.m_width;           // Width of image
	int yImage = srcImage.m_height;          // Height of image
	int imageSize = xImage*yImage;
	int xPadded = xImage + (xblursize - 1);  // Width including padding
	int yPadded = yImage + (yblursize - 1);  // Height including padding
	int paddedSize = xPadded*yPadded;
	int xPad = xblursize / 2;                // Number of padding columns on each side
	int yPad = yblursize / 2;
	int padOffset = xPadded*yPad + xPad;     // Offset to first pixel in padded image

	float* pinnedImage = nullptr;
	BGRPixel* d_padded1 = nullptr;
	BGRPixel* d_padded2 = nullptr;
	float* d_xblur = nullptr;                // XBlur integrals
	int n_xblur;                             // N
	float* d_yblur = nullptr;                // YBlur integrals
	int n_yblur;                             // N

	// Allocate memory for host and device
	check(cudaHostAlloc((void**)&pinnedImage, 3 * imageSize * sizeof(float), 0));
	check(cudaMalloc((void**)&d_padded1, paddedSize * sizeof(BGRPixel)));
	check(cudaMalloc((void**)&d_padded2, paddedSize * sizeof(BGRPixel)));

	// Copy image to pinned memory
	for (int i = 0; i < 3 * imageSize; ++i) {
		pinnedImage[i] = (float)srcImage.m_pixels[i];
	}

	// Allocate and assign integrals
	{
		auto row_blur = GaussianKernelIntegrals(xblursigma, xblursize);
		auto col_blur = GaussianKernelIntegrals(yblursigma, yblursize);
		// ROW
		n_xblur = row_blur.size();
		check(cudaMalloc((void**)&d_xblur, n_xblur * sizeof(float)));
		check(cudaMemcpy(d_xblur, row_blur.data(), n_xblur * sizeof(float), cudaMemcpyHostToDevice));
		// COLUMN
		n_yblur = col_blur.size();
		check(cudaMalloc((void**)&d_yblur, n_yblur * sizeof(float)));
		check(cudaMemcpy(d_yblur, col_blur.data(), n_yblur * sizeof(float), cudaMemcpyHostToDevice));
	}

	cudaStream_t stream[STREAMS];
	int nblks = (xImage + (ntpb - 1)) / ntpb;
	for (int i = 0; i < STREAMS; ++i) {
		check(cudaStreamCreate(&stream[i]));
	}

	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			cudaMemcpyAsync(d_padded1 + padOffset + i*xPadded, pinnedImage + (3 * i*xImage), 3 * xImage * sizeof(float), cudaMemcpyHostToDevice, stream[j]);
		}
	}

	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			blur_kernel << <nblks, ntpb, 0, stream[j] >> > (d_padded1, d_padded2, d_xblur, n_xblur, xImage, padOffset + i*xPadded, 1);
		}
	}

	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			blur_kernel << <nblks, ntpb, 0, stream[j] >> > (d_padded2, d_padded1, d_yblur, n_yblur, xImage, padOffset + i*xPadded, xPadded);
		}
	}

	for (int i = 0; i < yImage;) {
		for (int j = 0; j < STREAMS && i < yImage; ++j, ++i) {
			check(cudaMemcpyAsync(pinnedImage + (3 * i*xImage), d_padded1 + padOffset + i*xPadded, xImage * sizeof(BGRPixel), cudaMemcpyDeviceToHost, stream[j]));
		}
	}

	for (int i = 0; i < STREAMS; ++i) {
		check(cudaStreamSynchronize(stream[i]));
		check(cudaStreamDestroy(stream[i]));
	}

	destImage.m_width = srcImage.m_width;
	destImage.m_height = srcImage.m_height;
	destImage.m_pitch = srcImage.m_pitch;
	destImage.m_pixels.resize(srcImage.m_pixels.size());
	for (int i = 0; i < 3 * imageSize; i++) {
		destImage.m_pixels[i] = (uint8_t)pinnedImage[i];
	}

	check(cudaFree(d_xblur));
	check(cudaFree(d_yblur));
	check(cudaFreeHost(pinnedImage));
	check(cudaFree(d_padded1));
	check(cudaFree(d_padded2));
	check(cudaDeviceReset());
}
</syntaxhighlight>
|}

== Results ==
Obtained
using a Quadro K620<br>
[[File:uvso2.png]][[File:usession.png]][[File:ktimes.png]]<br>
Using a Quadro K2000<br>
[[File:streams.png]]

== Output Images ==
[http://imgur.com/a/CtMOc Image Gallery]

[https://seneca-my.sharepoint.com/personal/jkraitberg_myseneca_ca/_layouts/15/guestaccess.aspx?docid=099a13c42168943b587de4b59e4634e06&authkey=Afl_iMqjNyFhoYu3bopOw5E 135MB Image]

[https://seneca-my.sharepoint.com/personal/jkraitberg_myseneca_ca/_layouts/15/guestaccess.aspx?docid=007880dac1dd74d09b74fc448dc3fac38&authkey=AdqHCKEjZCXzlyftjZWxFCA 135MB 3x3 Result]

== Mandelbrot ==
{| class="wikitable mw-collapsible mw-collapsed"
! Unoptimized - Mandelbrot( ... )
|-
|<syntaxhighlight lang="cpp">
//C++ Includes
#include <iostream>
#include <complex>
#include <vector>
#include <chrono>
#include <functional>
#include <cuda_runtime.h>

//CUDA Complex Numbers
#include <cuComplex.h>

//Helper Includes
#include "window.h"
#include "save_image.h"
#include "utils.h"

const int ntpb = 32;

//Compute Color for each pixel
//Note: the fractal dimensions and origin must be passed as float; declaring
//them as int would truncate values such as -2.2 and distort the image
__global__ void computeMandelbrot(int iter_max, int* d_colors,
	float fract_width, float fract_height,
	int scr_width, int scr_height,
	float fract_xmin, float fract_ymin)
{
	int row = blockIdx.y * blockDim.y + threadIdx.y; //Row
	int col = blockIdx.x * blockDim.x + threadIdx.x; //Col
	int idx = row * scr_width + col;                 //Pixel Index
	if (col < scr_width && row < scr_height) {
		//Use floating-point complex numbers to calculate the color for each pixel
		int result = 0;
		cuFloatComplex c = make_cuFloatComplex((float)col, (float)row);
		cuFloatComplex d = make_cuFloatComplex(cuCrealf(c) / (float)scr_width * fract_width + fract_xmin,
			cuCimagf(c) / (float)scr_height * fract_height + fract_ymin);
		cuFloatComplex z = make_cuFloatComplex(0.0f, 0.0f);
		while ((cuCabsf(z) < 2.0f) && (result < iter_max)) {
			z = cuCaddf(cuCmulf(z, z), d);
			result++;
		}
		d_colors[idx] = result; //Output
	}
}

void mandelbrot() {
	window<int> scr(0, 1000, 0, 1000);          //Image Size
	window<float> fract(-2.2, 1.2, -1.7, 1.7);  //Fractal Size
	int iter_max = 500;                         //Iterations
	const
char* fname = "mandlebrot_gpu.png";         //Output File Name
	bool smooth_color = true;                   //Color Smoothing
	int nblks = (scr.width() + ntpb - 1) / ntpb; //Blocks
	std::vector<int> colors(scr.size());        //Output Vector

	//Allocate Device Memory
	int* d_colors;
	cudaMalloc((void**)&d_colors, scr.size() * sizeof(int));

	//Grid Layout
	dim3 dGrid(nblks, nblks);
	dim3 dBlock(ntpb, ntpb);

	//Execute Kernel
	auto start = std::chrono::steady_clock::now();
	computeMandelbrot<<<dGrid, dBlock>>>(iter_max, d_colors, fract.width(), fract.height(), scr.width(), scr.height(), fract.x_min(), fract.y_min());
	cudaDeviceSynchronize();
	auto end = std::chrono::steady_clock::now();

	//Output Time
	std::cout << "Time to generate " << fname << " = " << std::chrono::duration<float, std::milli>(end - start).count() << " [ms]" << std::endl;

	//Copy Data back to Host
	cudaMemcpy(colors.data(), d_colors, scr.size() * sizeof(int), cudaMemcpyDeviceToHost);

	//Plot Data and Free Memory
	plot(scr, colors, iter_max, fname, smooth_color);
	cudaFree(d_colors);
}

int main() {
	mandelbrot();
	return 0;
}
</syntaxhighlight>
|}
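The pixel-to-domain mapping that computeMandelbrot performs per thread can be checked on the host with an equivalent plain C++ helper (hypothetical name, float fractal parameters assumed):

```cpp
#include <cmath>

// Host-side mirror of the scaling inside computeMandelbrot: pixel (col, row)
// of a scr_width x scr_height image maps into the fractal window starting at
// (fract_xmin, fract_ymin). For illustration only.
struct Point { float re, im; };

Point scalePixel(int col, int row, int scr_width, int scr_height,
                 float fract_width, float fract_height,
                 float fract_xmin, float fract_ymin) {
    return { (float)col / scr_width * fract_width + fract_xmin,
             (float)row / scr_height * fract_height + fract_ymin };
}
```

With the 1000x1000 image and the window (-2.2, 1.2, -1.7, 1.7) used above, pixel (0, 0) maps to the top-left corner (-2.2, -1.7) and pixel (1000, 1000) to (1.2, 1.7), confirming the kernel covers the whole domain.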
= Assignment 3 - Optimize =
== Future Optimizations ==
As there aren't any data-intensive tasks in this program, further optimizations would include creating streams of kernels and having them execute concurrently in order to improve the runtime of the current solution.