CDOT Wiki

Sirius

5,031 bytes added, 11:01, 9 April 2018
==== Conclusion ====
0.00 17.52 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z10setRandArrRSt6vectorIiSaIiEEi
0.00 17.52 0.00 1 0.00 0.00 void std::__insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
</source>
 
==== Source Code ====
<source>
//std::sort Algorithm
void stdSort(vector<int>& array, int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << "--==Execution Time of std::sort Algorithm==--" << endl;
    //Time the fill of 1 vector
    ts = steady_clock::now();
    //Fill the vector with random numbers
    setRandArr(array, arrSize);
    te = steady_clock::now();
    printTiming("std::sort Vector (1) Initialize", te - ts);
    //Start timing of std::sort
    ts = steady_clock::now();
    //Use std::sort to sort the vector
    sort(array.begin(), array.end());
    //End timing of std::sort
    te = steady_clock::now();
    //Print results
    printTiming("std::sort algorithm", te - ts);
}
 
//saxpy Algorithm
void saxpyAlg(int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << endl << "--==Execution Time of saxpy Algorithm==--" << endl;
    vector<int> saxpyX, saxpyY;
    int saxpyA = 15;
    //Time the fill of 2 vectors
    ts = steady_clock::now();
    setRandArr(saxpyX, arrSize);
    setRandArr(saxpyY, arrSize);
    te = steady_clock::now();
    printTiming("saxpy Vectors (2) Initialize", te - ts);
    //Start timing of saxpy
    ts = steady_clock::now();
    for (int i = 0; i < arrSize; ++i)
        saxpyY[i] = saxpyA * saxpyX[i] + saxpyY[i];
    //End timing of saxpy
    te = steady_clock::now();
    printTiming("saxpy Algorithm", te - ts);
}
 
//Prefix-Sum Algorithm
void prefixSum(vector<int>& array, int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << endl << "--==Execution Time of Prefix-Sum Algorithm==--" << endl;
    vector<int> psSum;
    array.clear();
    //Time the fill of 1 vector
    ts = steady_clock::now();
    //Fill the vector with random numbers
    setRandArr(array, arrSize);
    te = steady_clock::now();
    printTiming("Prefix-Sum Vector (1) Initialize", te - ts);
    //Start timing of Prefix-Sum
    ts = steady_clock::now();
    psSum.push_back(array[0]);
    for (int i = 1; i < arrSize; ++i)
        psSum.push_back(psSum[i - 1] + array[i]);
    //End timing of Prefix-Sum
    te = steady_clock::now();
    printTiming("Prefix-Sum Algorithm", te - ts);
}
</source>
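The helper functions <code>setRandArr</code> and <code>printTiming</code> are called above but their definitions are not shown in this excerpt. A minimal sketch with signatures inferred from the call sites (the random-value range and the output format are assumptions, not the original code):

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;

// Fill a vector with arrSize random ints (value range is an assumption).
void setRandArr(vector<int>& array, int arrSize) {
    array.resize(arrSize);
    for (int i = 0; i < arrSize; ++i)
        array[i] = rand() % 100;
}

// Print a labelled interval; the calls above pass te - ts,
// a steady_clock::duration.
void printTiming(const char* label, steady_clock::duration d) {
    auto ms = duration_cast<milliseconds>(d).count();
    cout << label << " took " << ms << " ms" << endl;
}
```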
4.00 0.75 0.03 2176681 0.00 0.00 std::__detail::_Hashtable_iterator<std::pair<unsigned int const, std::string>, false, false> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::_Select1st<std::pair<unsigned int const, std::string> >, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true>::_M_insert_bucket<std::pair<unsigned int, std::string> >(std::pair<unsigned int, std::string>&&, unsigned int, unsigned int)
0.00 0.75 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z18convert_int_to_bini
 
====Source Code====
<source>
void compress(string input, int size, string filename) {
    unordered_map<string, int> compress_dictionary(MAX_DEF);
    //Initialize the dictionary with the single-byte (ASCII) strings
    for (unsigned int i = 0; i < 256; i++) {
        compress_dictionary[string(1, i)] = i;
    }
    string current_string;
    unsigned int next_code = 256;
    //Output file for compressed data
    ofstream outputFile;
    outputFile.open(filename + ".lzw");
 
    for (char& c : input) {
        current_string = current_string + c;
        if (compress_dictionary.find(current_string) == compress_dictionary.end()) {
            if (next_code <= MAX_DEF)
                compress_dictionary.insert(make_pair(current_string, next_code++));
            current_string.erase(current_string.size() - 1);
            outputFile << convert_int_to_bin(compress_dictionary[current_string]);
            current_string = c;
        }
    }
    if (current_string.size())
        outputFile << convert_int_to_bin(compress_dictionary[current_string]);
    outputFile.close();
}
 
 
void decompress(string input, int size, string filename) {
    unordered_map<unsigned int, string> dictionary(MAX_DEF);
    //Initialize the dictionary with the single-byte (ASCII) strings
    for (unsigned int i = 0; i < 256; i++) {
        dictionary[i] = string(1, i);
    }
    string previous_string;
    unsigned int code;
    unsigned int next_code = 256;
    //Output file for decompressed data
    ofstream outputFile;
    outputFile.open(filename + "_uncompressed.txt");
 
    int i = 0;
    while (i < size) {
        //Extract 12 bits and convert from binary to decimal
        string subinput = input.substr(i, 12);
        bitset<12> binary(subinput);
        code = binary.to_ullong();
        i += 12;
 
        //Code not yet in the dictionary: the cScSc special case
        if (dictionary.find(code) == dictionary.end())
            dictionary.insert(make_pair(code, previous_string + previous_string.substr(0, 1)));
        outputFile << dictionary[code];
        if (previous_string.size())
            dictionary.insert(make_pair(next_code++, previous_string + dictionary[code][0]));
        previous_string = dictionary[code];
    }
    outputFile.close();
}
</source>
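The <code>convert_int_to_bin</code> helper and the <code>MAX_DEF</code> constant are used above but not shown in this excerpt. A minimal sketch, assuming fixed-width 12-bit codes to match the <code>bitset<12></code> decoding in <code>decompress</code> (the <code>MAX_DEF</code> value is an assumption):

```cpp
#include <bitset>
#include <string>
using namespace std;

const unsigned int MAX_DEF = 4096;  // 2^12 possible 12-bit codes (assumption)

// Encode a dictionary code as a fixed-width 12-bit binary string, so that
// decompress can read the stream back in 12-character slices.
string convert_int_to_bin(unsigned int code) {
    return bitset<12>(code).to_string();
}
```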
=== Assignment 2 ===
Solution:
----
One way to address low compute utilization is to increase the occupancy of each SM. According to NVIDIA's CUDA occupancy calculator, the machine we used for testing has compute capability 6.1, which allows a maximum of 32 resident blocks and 2048 resident threads per SM. To achieve maximum occupancy with all 32 blocks resident, each block should therefore have 2048 / 32 = 64 threads. To determine an appropriate grid size, we divide the total number of pixels by the 64 threads per block, which lets us size the grid dynamically based on the dimensions of the image passed in.
<br><br>
The number of blocks in the grid was then recalculated from the size of the image and the new threads-per-block value.
<br><br>
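The grid-size calculation described above can be sketched as follows; the round-up division ensures every pixel gets a thread even when the pixel count is not a multiple of 64 (the image dimensions in any usage are illustrative, not from our test data):

```cpp
// One thread per pixel, 64 threads per block
// (2048 resident threads / 32 resident blocks on compute capability 6.1).
// Round up so the last partial block still covers the remaining pixels.
int gridBlocks(int pixels, int threadsPerBlock) {
    return (pixels + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, a hypothetical 1920x1080 image (2,073,600 pixels) would need 32,400 blocks of 64 threads.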
Problem:
----
We considered shared memory when optimizing our kernel. When we attempted to implement it, we realized it would be difficult because every pixel in a block needs access to a different range of neighbouring pixels for averaging, and one major problem is that those neighbourhood pixels may fall outside the block's range. We also tried storing the entire image in shared memory, but that approach does not scale to larger images, since shared memory is a limited resource.
<br><br>
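The out-of-range-neighbour problem above can be quantified: a tile covering T x T output pixels with filter radius r must stage a (T + 2r) x (T + 2r) input region in shared memory, including the halo pixels that belong to neighbouring blocks. A small sketch of that footprint (T and r are illustrative values, and this is not our kernel code):

```cpp
// Shared-memory footprint of a tiled box filter: each T x T output tile
// needs the surrounding halo of radius r as well, i.e. (T + 2r)^2 pixels.
int tileFootprint(int T, int r) {
    int side = T + 2 * r;
    return side * side;  // pixels staged in shared memory per block
}
```

Even a radius-1 filter on a 16x16 tile needs 324 staged pixels for 256 outputs, and the halo overhead grows quickly with the radius.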
Below you will see that, although our optimizations sometimes show slight improvements, they were not consistently effective. We are still looking for a workable shared-memory implementation, which should improve efficiency and execution time.
==== Graph ====
[[File:boxFilterOptimize.png | 750px]]
With further optimization, we managed to slightly improve the execution time of the blur effect.
<br><br>
Below are the final results of all the test runs as well as the corresponding graph.
==== Results ====
[[File:boxFilterFinalTable.png | 500px]]