Changes

Jump to: navigation, search

Algo holics

5,092 bytes added, 03:00, 8 April 2019
Kernel Version 2
=== Assignment 2 ===
Our initial idea was to use the neural network code for our assignment 2. But since the algorithm itself was not very accurate (2/10 correct predictions even after 10,000 training iterations), we decided to paralellize merge sort. Soon we realized that since its Big O classification was n log n, offloading computations to GPU would not be that effective. So, we settled with the cosine transform library, as described below.
== COSINE TRANSFORMATION (A Discrete ==Cosine Transform for Real Data) Tranformation====
The Cosine_Transform is a simple C++ library which demonstrates properties of the Discrete cosine Transform for real data. The Discrete Cosine Transform or DCT is used to create jpeg (compressed images).
The formula used here is:
 C(u,v) =
| (√1/n) , if u=0; 0≤v≤n-1
C(u,v) = |(√2/n) * cos[((2*v+1)π*u)/2n], if 1≤u≤n-1; 0≤v≤n-1
Where, u is the row index, v is the column index and n is the total number of elements in a row/column in the computational matrix respectively.
This [https://www.youtube.com/watch?v=tW3Hc0Wrgl0 Link] can be used for better understanding of the above formula. Here is the [https://people.sc.fsu.edu/~jburkardt/cpp_src/cosine_transform/cosine_transform.html source code] used.
----=====Profiling=====The flat profile for the above serial code looks like:
Here is the [https://people.sc.fsu.edu/~jburkardt/cpp_src/cosine_transform/cosine_transform.html source code] used.{| class="wikitable mw-collapsible mw-collapsed"! Flat Profile|-|
[[File:host.jpg]]
 
The graph for the above code looks like:
 
 
 
To increase the efficiency of the program we transformed the '''cosine_transform_data''' function into a kernel name '''cosTransformKernel''' which offload the compute intense calculation of the program to the GPU.
1
2
3
4 granularity: each sample hit covers 2 byte(s) for 0.68% of 1.47 seconds
5
6 index % time self children called name
7 <spontaneous>
8 [1] 100.0 0.00 1.47 main [1]
9 0.00 1.47 1/1 cosine_transform_test01(int) [3]
10 -----------------------------------------------
11 1.47 0.00 1/1 cosine_transform_test01(int) [3]
12 [2] 100.0 1.47 0.00 1 cosine_transform_data(int, double*) [2]
13 -----------------------------------------------
14 0.00 1.47 1/1 main [1]
15 [3] 100.0 0.00 1.47 1 cosine_transform_test01(int) [3]
16 1.47 0.00 1/1 cosine_transform_data(int, double*) [2]
17 0.00 0.00 1/1 r8vec_uniform_01_new(int, int&) [14]
18 0.00 0.00 1/1 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) [13]
19 0.00 0.00 1/1 std::common_type<std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<std::chrono::_V2::s teady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 10 00000000l> > > const&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) [21]
20 -----------------------------------------------
21 0.00 0.00 1/3 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>: :__cast<long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [18]
22 0.00 0.00 2/3 std::common_type<std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<long, std::ratio<1l , 1000000000l>, long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [22]
23 [10] 0.0 0.00 0.00 3 std::chrono::duration<long, std::ratio<1l, 1000000000l> >::count() const [10]
24 -----------------------------------------------
25 0.00 0.00 2/2 std::common_type<std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<std::chrono::_V2::s teady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 10 00000000l> > > const&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) [21]
26 [11] 0.0 0.00 0.00 2 std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch() const [11]
27 -----------------------------------------------
28 0.00 0.00 1/1 __libc_csu_init [28]
29 [12] 0.0 0.00 0.00 1 _GLOBAL__sub_I__Z20r8vec_uniform_01_newiRi [12]
30 0.00 0.00 1/1 __static_initialization_and_destruction_0(int, int) [15]
31 -----------------------------------------------
32 0.00 0.00 1/1 cosine_transform_test01(int) [3]
33 [13] 0.0 0.00 0.00 1 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) [13]
34 0.00 0.00 1/1 std::enable_if<std::chrono::__is_duration<std::chrono::duration<long, std::ratio<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1l, 1000l> > >::type std::chrono::duratio n_cast<std::chrono::duration<long, std::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [17]
35 0.00 0.00 1/1 std::chrono::duration<long, std::ratio<1l, 1000l> >::count() const [16]
36 -----------------------------------------------
37 0.00 0.00 1/1 cosine_transform_test01(int) [3]
38 [14] 0.0 0.00 0.00 1 r8vec_uniform_01_new(int, int&) [14]
39 -----------------------------------------------
40 0.00 0.00 1/1 _GLOBAL__sub_I__Z20r8vec_uniform_01_newiRi [12]
41 [15] 0.0 0.00 0.00 1 __static_initialization_and_destruction_0(int, int) [15]
42 -----------------------------------------------
43 0.00 0.00 1/1 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) [13]
44 [16] 0.0 0.00 0.00 1 std::chrono::duration<long, std::ratio<1l, 1000l> >::count() const [16]
45 -----------------------------------------------
46 0.00 0.00 1/1 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) [13]
47 [17] 0.0 0.00 0.00 1 std::enable_if<std::chrono::__is_duration<std::chrono::duration<long, std::ratio<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1l, 1000l> > >::type std::chrono::duration_ca st<std::chrono::duration<long, std::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [17]
48 0.00 0.00 1/1 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>: :__cast<long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [18]
49 -----------------------------------------------
50 0.00 0.00 1/1 std::enable_if<std::chrono::__is_duration<std::chrono::duration<long, std::ratio<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1l, 1000l> > >::type std::chrono::duratio n_cast<std::chrono::duration<long, std::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [17]
51 [18] 0.0 0.00 0.00 1 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>::__c ast<long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [18]
52 0.00 0.00 1/3 std::chrono::duration<long, std::ratio<1l, 1000000000l> >::count() const [10]
|}
As is evident, the algorithm is O(n2) currently. Using thread indices on the GPU to replace the for loops could potentially improve performance.
To increase the efficiency of the program we transformed the '''cosine_transform_data''' function into a kernel named '''cosTransformKernel''' which offloads the compute intense calculation of the program to the GPU.
=====Kernel Version 1=====
{| class="wikitable mw-collapsible mw-collapsed"
! Modified Code
|
%%cu # include <iostream> # include <iomanip> # include <ctime> # include <chrono> # include <cstdlib> # include <cmath> #include <cuda_runtime.h> using namespace std; using namespace std::chrono; const double pi = 3.141592653589793; const int ntpb = 1024; void cosine_transform_test01 ( int size );
double *r8vec_uniform_01_new ( int n, int &seed ){
int i;
const int i4_huge = 2147483647;
int k;
double *r;
  if ( seed == 0 ) {
cerr << "\n";
cerr << "R8VEC_UNIFORM_01_NEW - Fatal error!\n";
exit ( 1 );
}
 
r = new double[n];
  for ( i = 0; i < n; i++ ) {
k = seed / 127773;
 
seed = 16807 * ( seed - k * 127773 ) - k * 2836;
  if ( seed < 0 ) {
seed = seed + i4_huge;
}
}
return r;
}
double *cosine_transform_data ( int n, double d[] ){
double angle;
double *c;
int i;
int j;
 
c = new double[n];
  for ( i = 0; i < n; i++ ) {
c[i] = 0.0;
for ( j = 0; j < n; j++ ) {
angle = pi * ( double ) ( i * ( 2 * j + 1 ) ) / ( double ) ( 2 * n );
c[i] = c[i] + cos ( angle ) * d[j];
}
return c;
}
void reportTime(const char* msg, steady_clock::duration span) { auto ms = duration_cast<milliseconds>(span); std::cout << msg << " - took - " << ms.count() << " millisecs" << std::endl; __global__ void cosTransformKernel(double *a, double *b, int n){
double angle;
const double pi = 3.141592653589793;
}
b[i] *= sqrt( 2.0 / (double) n );
}
int main (int argc, char* argv[] ){
if (argc != 2) {
std::cerr << argv[0] << ": invalid number of arguments\n";
cosine_transform_test01 (n);
return 0;
}
void cosine_transform_test01 ( int size){
int n = size;
int seed;
double *r;
double *hs; //host side pointer to store the array returned from host side cosine_transform_data, for comparison purposes
double *s = new double[n];
double *d_a;
delete [] hs;
}
 
|}
'''Analysis'''
The graph for the execution time difference between the device and the host looks like:
To analyze the flat profile, enter the following command[[File:kernel1.png]]
> gprof Even though the kernel includes a for-p -b myapp > myapploop the execution time has decreased drastically. Thats because each thread is now responsible for one calculating one element of the final Cos transformed matrix(unit vector).flt
-p directs the profiler (gprof) to output a flat profile. === Assignment 3 ===
-b directs For optimizing the profiler code better, we thought of removing the iterative loop from the kernel by using threadIdx.y to omit detailed explanations control calculation of each element's cosine for that position in the column headings from supposed matrix. The problem in this was that each thread was in a racing condition to write to the same memory location, to sum up the cosine transformations for all elements of that row. We solved this by using the outputatomic function. Its prototype is as follows. double atomicAdd(double* address, double value)
 The flat profile for the above execution looks like:=====Kernel Version 2=====
{| class="wikitable mw-collapsible mw-collapsed"
! Flat ProfileKernel 2
|-
|
# include <cmath>
# include <cstdlib>
# include <iostream>
# include <iomanip>
# include <ctime>
# include <chrono>
# include <cstdlib>
# include <cmath>
#include <limits>
#include <cuda_runtime.h>
#include <cuda.h>
using namespace std;
using namespace std::chrono;
const double pi = 3.141592653589793;
const unsigned ntpb = 32;
void cosine_transform_test01 ( int size );
 
double *r8vec_uniform_01_new ( int n, int &seed ){
int i;
const int i4_huge = 2147483647;
int k;
double *r;
if ( seed == 0 ){
cerr << "\n";
cerr << "R8VEC_UNIFORM_01_NEW - Fatal error!\n";
cerr << " Input value of SEED = 0.\n";
exit ( 1 );
}
r = new double[n];
for ( i = 0; i < n; i++ ){
k = seed / 127773;
seed = 16807 * ( seed - k * 127773 ) - k * 2836;
if ( seed < 0 ){
seed = seed + i4_huge;
}
r[i] = ( double ) ( seed ) * 4.656612875E-10;
}
return r;
}
 
double *cosine_transform_data ( int n, double d[] ){
double angle;
double *c;
int i;
int j;
c = new double[n];
for ( i = 0; i < n; i++ ){
c[i] = 0.0;
for ( j = 0; j < n; j++ ){
angle = pi * ( double ) ( i * ( 2 * j + 1 ) ) / ( double ) ( 2 * n );
c[i] = c[i] + cos ( angle ) * d[j];
}
c[i] = c[i] * sqrt ( 2.0 / ( double ) ( n ) );
}
return c;
}
 
 
void reportTime(const char* msg, steady_clock::duration span) {
auto ms = duration_cast<milliseconds>(span);
std::cout << msg << " - took - " <<
ms.count() << " millisecs" << std::endl;
}
__global__ void cosTransformKernel(double *a, double *b, const int n){
double angle;
const double pi = 3.141592653589793;
int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;
if(i<n && j<n){
angle = pi * ( double ) ( i * ( 2 * j + 1 ) ) / ( double ) ( 2 * n );
double value = cos ( angle ) * a[j];
b[i] = atomicAdd(&b[i], value);
}
//square root of the whole cos transformed row term
if(j==n-1 && i<n){
b[i] *= sqrt ( 2.0 / ( double ) ( n ) );
}
}
1 int main (int argc, char* argv[] ){ if (argc != 2) { 3 4 granularity std:: each sample hit covers 2 byte(s) for cerr << argv[0.68% ] << ": invalid number of 1.47 secondsarguments\n"; 5 std::cerr << "Usage: " << argv[0] << " size_of_vector\n"; 6 index % time self children called name return 1; 7 <spontaneous>} 8 [1] 100.0 0.00 1.47 main int n = std::atoi(argv[1]); 9 0.00 1.47 1/1 cosine_transform_test01(intn) [3]; return 0; 10 -----------------------------------------------}  11 1.47 0.00 1/1 void cosine_transform_test01(intsize) [3]{ int n = size; 12 [2] 100.0 1.47 0.00 1 cosine_transform_data( int, seed; double*) [2]r; 13 ----------------------------------------------- 14 0.00 1.47 1 double *hs; /1 main [1] 15 [3] 100.0 0.00 1.47 1 cosine_transform_test01(int) [3] 16 1.47 0.00 1/1 host side pointer to store the array returned from host side cosine_transform_data(int, for comparison purposes double*) s = new double[2n]; 17 0.00 0.00 1 //1 r8vec_uniform_01_new(int, int&) [14]double *t; double *d_a; double *d_b; 18 0.00 0.00 1 //1 reportTimeallocate memory on the device for the randomly generated array and for the array in which transform values will be stored cudaMalloc((char constvoid**)&d_a, std::chrono::duration<long, std::ratio<1l, 1000000000l> >sizeof(double) * n) [13]; 19 0.00 0.00 1/1 std::common_type<std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<std::chrono::_V2::s teady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > cudaMalloc((std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 10 00000000l> > > constvoid**)&d_b, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&sizeof(double) * n) [21]; seed = 123456789; 20 ----------------------------------------------- 21 0.00 0.00 1/3 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>: :__cast<long, std::ratio<1l, 1000000000l> > r = r8vec_uniform_01_new (std::chrono::duration<long, std::ratio<1ln, 1000000000l> > const&seed ) [18]; 22 0.00 0.00 2 //3 std::common_type<std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<long, std::ratio<1l , 1000000000l>, long, std::ratio<1l, 1000000000l> >copy randomly generated values from host to device for(std::chrono::durationint i=0; i<long, std::ratio<1l, 1000000000l> > const&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&n; i++) [22] 23 s[10i] =0.0 0.00 0.00 3 std::chrono::duration<long; cudaMemcpy(d_a, std::ratio<1lr, 1000000000l> >::countsizeof(double)*n,cudaMemcpyHostToDevice) const [10]; 24 ----------------------------------------------- 25 0.00 0.00 2/2 std::common_type<std::chrono::duration<long cudaMemcpy(d_b, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::type std::chrono::operator-<std::chrono::_V2::s teady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >sizeof(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 10 00000000l> > > const&, std::chrono::time_point<std::chrono::_V2::steady_clockdouble)*n, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&cudaMemcpyHostToDevice) [21]; 26 [11] 0.0 0.00 0.00 2 std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >::time_since_epoch int nblks = () const [11] 27 -------------n + ntpb ---------------------------------- 28 0.00 0.00 1) /1 __libc_csu_init [28]ntpb; 29 [12] 0.0 0.00 0.00 1 _GLOBAL__sub_I__Z20r8vec_uniform_01_newiRi [12] 30 0.00 0.00 1/1 __static_initialization_and_destruction_0 dim3 grid(intnblks,nblks, int) [15] 31 ----------------------------------------------- 32 0.00 0.00 1/1 cosine_transform_test01(int) [3]; 33 [13] 0.0 0.00 0.00 1 reportTime dim3 block(char const*ntpb, std::chrono::duration<longntpb, std::ratio<1l, 1000000000l> >1) [13]; 34 0.00 0.00 1/1 std steady_clock::enable_if<std::chrono::__is_duration<std::chrono::duration<longtime_point ts, stdte; ts = steady_clock::rationow(); cosTransformKernel<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1lgrid, 1000lblock> > >::type std::chrono::duratio n_cast<std::chrono::duration<long, std::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> >(std::chrono::duration<longd_a, std::ratio<1ld_b, 1000000000l> > const&size) [17]; 35 0.00 0.00 1/1 std::chrono::duration<long, std::ratio<1l, 1000l> >::count cudaDeviceSynchronize() const [16]; 36 ----------------------------------------------- 37 0.00 0.00 1/1 cosine_transform_test01 te = steady_clock::now(int) [3]; 38 [14] 0.0 0.00 0.00 1 r8vec_uniform_01_new reportTime(int"Cosine Transform on device", int&te-ts) [14]; 39 ----------------------------------------------- 40 0.00 0.00 1/1 _GLOBAL__sub_I__Z20r8vec_uniform_01_newiRi [12] 41 [15] 0.0 0.00 0.00 1 __static_initialization_and_destruction_0 cudaMemcpy(ints,d_b, intsizeof(double) [15] 42 ----------------------------------------------- 43 0.00 0.00 1/1 reportTime(char const*n, std::chrono::duration<long, std::ratio<1l, 1000000000l> >cudaMemcpyDeviceToHost) [13]; 44 [16] 0.0 0.00 0.00 1 std::chrono::duration<long, std::ratio<1l, 1000l> > ts = steady_clock::countnow() const [16]; 45 ----------------------------------------------- 46 0.00 0.00 1/1 reportTime hs = cosine_transform_data (char const*n, std::chrono::duration<long, std::ratio<1l, 1000000000l> >r ) [13]; 47 [17] 0.0 0.00 0.00 1 std::enable_if<std::chrono::__is_duration<std::chrono::duration<long, std::ratio<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1l, 1000l> > >::type std::chrono::duration_ca st<std::chrono::duration<long, std te = steady_clock::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> >now(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) [17]; 48 0.00 0.00 1/1 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>: :__cast<long, std::ratio<1l, 1000000000l> > reportTime(std::chrono::duration<long"Cosine Transform on host", std::ratio<1l, 1000000000l> > const&te-ts) [18];  49 ----------------------------------------------- cudaFree(d_a); 50 0.00 0.00 1/1 std::enable_if<std::chrono::__is_duration<std::chrono::duration<long, std::ratio<1l, 1000l> > >::value, std::chrono::duration<long, std::ratio<1l, 1000l> > >::type std::chrono::duratio n_cast<std::chrono::duration<long, std::ratio<1l, 1000l> >, long, std::ratio<1l, 1000000000l> > cudaFree(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&d_b) ; delete [17]r; 51 delete [18] 0.0 0.00 0.00 1 std::chrono::duration<long, std::ratio<1l, 1000l> > std::chrono::__duration_cast_impl<std::chrono::duration<long, std::ratio<1l, 1000l> >, std::ratio<1l, 1000000l>, long, true, false>::__c ast<long, std::ratio<1l, 1000000000l> >(std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&) s; delete [18]hs; 52 0.00 0.00 1 //3 std::chrono::duration<long, std::ratio<1l, 1000000000l> >::count() const delete [10] t;
return;
}
|}
Here is a comparison between the naive and optimized kernel
the device vs host is the one with the second code[[File:kernel2.jpg]]
 === Assignment 3 ===Evidently, there is some performance boost for the new version. However, each call to atomicAdd by a thread locks the global memory until the old value is read and added to the passed value. This deters faster execution as might be expected.
57
edits

Navigation menu