Difference between revisions of "N/A"

From CDOT Wiki

Revision as of 16:21, 23 March 2019


N/A

Team Members

  1. Woosle Park, Data Compression
  2. Jordan Pitters, Image Processing
  3. Akshat Patel, Sorting Algorithms

Progress

Assignment 1

Application 1 - Data Compression


Description: https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-compression-technique/

The algorithm used for data compression here is the Lempel–Ziv–Welch (LZW) algorithm. It is a lossless algorithm, meaning no data is lost when a file is compressed. LZW is generally used for GIF or PDF files, but for this example I used a .txt file because it was easier to manipulate and scale in size. The file used for compression is a .txt version of the Holy Bible (https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt) because its contents are large enough to show the compression time and percentage. The algorithm reads a file's sequence of symbols, groups them into strings, and converts each string into a 12-bit code that is stored in a table. That table is then referred to when decompressing a file, which reverses the steps of compression.
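
As a small illustration of the table-building idea described above, here is a minimal sketch that keeps the codes as integers in a vector instead of writing 12-bit strings to a file. The function names lzw_encode and lzw_decode are hypothetical and are not part of the profiled program; the 4096-entry cap is assumed from the 12-bit code width.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the profiled program): encode text as a list
// of LZW codes, with the dictionary capped at 4096 entries (12-bit codes).
std::vector<int> lzw_encode(const std::string& text) {
    std::unordered_map<std::string, int> dict;
    for (int i = 0; i < 256; ++i) dict[std::string(1, static_cast<char>(i))] = i;
    int next_code = 256;
    std::vector<int> codes;
    std::string current;
    for (char c : text) {
        std::string extended = current + c;
        if (dict.count(extended)) {
            current = extended;              // keep growing the current match
        } else {
            codes.push_back(dict[current]);  // emit the longest known string
            if (next_code < 4096) dict[extended] = next_code++;
            current = std::string(1, c);
        }
    }
    if (!current.empty()) codes.push_back(dict[current]);
    return codes;
}

// Hypothetical helper: rebuild the text by replaying the same dictionary growth.
std::string lzw_decode(const std::vector<int>& codes) {
    std::unordered_map<int, std::string> dict;
    for (int i = 0; i < 256; ++i) dict[i] = std::string(1, static_cast<char>(i));
    int next_code = 256;
    std::string previous, out;
    for (int code : codes) {
        std::string entry;
        auto it = dict.find(code);
        if (it != dict.end()) entry = it->second;
        else entry = previous + previous[0];  // code is being defined by this very step
        out += entry;
        if (!previous.empty() && next_code < 4096) dict[next_code++] = previous + entry[0];
        previous = entry;
    }
    return out;
}
```

For the input "ABABABA", lzw_encode yields the codes 65, 66, 256, 258: "A" and "B" come from the initial ASCII table, 256 stands for "AB", and 258 stands for "ABA". Decoding those four codes gives back the original string, including the case where the decoder meets a code (258) that it has not finished defining yet.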

Source Code:

Code by RobTwentyFour

//  Compile with gcc 4.7.2 or later, using the following command line:
//
//    g++ -std=c++0x lzw.c -o lzw
//
//LZW algorithm implemented using fixed 12 bit codes.

#include <iostream>
#include <sstream>
#include <fstream>

#include <bitset>
#include <string>
#include <unordered_map>

#define MAX_DEF 4096

using namespace std;

string convert_int_to_bin(int number)
{
   string result = bitset<12>(number).to_string();
   return result;
}

void compress(string input, int size, string filename) {
   unordered_map<string, int> compress_dictionary(MAX_DEF);
   //Dictionary initializing with ASCII
   for ( int unsigned i = 0 ; i < 256 ; i++ ){
   compress_dictionary[string(1,i)] = i;
   }
   string current_string;
   unsigned int code;
   unsigned int next_code = 256;
   //Output file for compressed data
   ofstream outputFile;
   outputFile.open(filename + ".lzw");

   for(char& c: input){
   current_string = current_string + c;
   if ( compress_dictionary.find(current_string) ==compress_dictionary.end() ){
           if (next_code < MAX_DEF)  // codes must fit in 12 bits (0..4095)
               compress_dictionary.insert(make_pair(current_string, next_code++));
           current_string.erase(current_string.size()-1);
           outputFile << convert_int_to_bin(compress_dictionary[current_string]);
           current_string = c;
       }   
   }   
   if (current_string.size())
           outputFile << convert_int_to_bin(compress_dictionary[current_string]);
   outputFile.close();
}



void decompress(string input, int size, string filename) {
   unordered_map<unsigned int, string> dictionary(MAX_DEF);
   //Dictionary initializing with ASCII
   for ( int unsigned i = 0 ; i < 256 ; i++ ){
   dictionary[i] = string(1,i);
   }
   string previous_string;
   unsigned int code;
   unsigned int next_code = 256;
   //Output file for decompressed data
   ofstream outputFile;
   outputFile.open(filename + "_uncompressed.txt");

   int i =0;
   while (i<size){
       //Extracting 12 bits and converting binary to decimal
       string subinput = input.substr(i,12);
       bitset<12> binary(subinput);
       code = binary.to_ullong();
       i+=12;

       if ( dictionary.find(code) ==dictionary.end() ) 
           dictionary.insert(make_pair(code,(previous_string + previous_string.substr(0,1))));
       outputFile<<dictionary[code];
       if ( previous_string.size())
           dictionary.insert(make_pair(next_code++,previous_string + dictionary[code][0])); 
       previous_string = dictionary[code];
       }
   outputFile.close();
}

string convert_char_to_string(const char *pCh, int arraySize){
   string str;
   if (pCh[arraySize-1] == '\0') str.append(pCh);
   else for(int i=0; i<arraySize; i++) str.append(1,pCh[i]);
   return str;
}

static void show_usage()
{
       cerr << "Usage: \n"
             << "Specify the file that needs to be compressed or decompressed\n"
             <<"lzw -c input    #compress file input\n"
             <<"lzw -d input    #decompress file input\n"
             <<"Compressed data will be found in a file with the same name but with a .lzw extension\n"
             <<"Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
             << endl;
}


int main (int argc, char* argv[]) {
   streampos size;
   char * memblock;

   if (argc < 3)  // both a mode flag and a file name are required
   {
       show_usage();   
       return(1);
   }
   ifstream file (argv[2], ios::in|ios::binary|ios::ate);
   if (file.is_open())
   {
       size = file.tellg();
       memblock = new char[size];
       file.seekg (0, ios::beg);
       file.read (memblock, size);
       file.close();
       string input = convert_char_to_string(memblock,size);
       if (string( "-c" ) == argv[1] )
           compress(input,size, argv[2]);
       else if (string( "-d" ) == argv[1] )
           decompress(input,size, argv[2]);
       else
           show_usage();
   }
   else {
   cout << "Unable to open file."<<endl;
   show_usage();
   }
   return 0;
}

Flat Profiles:

bible.txt - 4,351,186 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
50.04      0.18     0.18  5758089    31.29    31.29  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
50.04      0.36     0.18                             compress(std::string, int, std::string)
 0.00      0.36     0.00  1402806     0.00    31.29  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.00      0.36     0.00     4098     0.00     0.00  show_usage()
 0.00      0.36     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      0.36     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      0.36     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible2.txt - 8,702,373 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
48.39      0.44     0.44 11511109    38.26    38.26  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
46.19      0.86     0.42                             compress(std::string, int, std::string)
 5.50      0.91     0.05  2804639    17.84    56.10  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.00      0.91     0.00     4098     0.00     0.00  show_usage()
 0.00      0.91     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      0.91     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      0.91     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible3.txt - 13,053,560 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
47.58      0.58     0.58 17264129    33.63    33.63  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
42.66      1.10     0.52                             compress(std::string, int, std::string)
 7.38      1.19     0.09  4206472    21.41    55.04  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 1.64      1.21     0.02                             convert_char_to_string(char const*, int)
 0.82      1.22     0.01                             std::pair<std::__detail::_Node_iterator<std::pair<unsigned int const, std::string>, false, false>, bool> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<std::pair<unsigned int, std::string> >(std::integral_constant<bool, true>, std::pair<unsigned int, std::string>&&)
 0.00      1.22     0.00     4098     0.00     0.00  show_usage()
 0.00      1.22     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      1.22     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      1.22     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible4.txt - 17,039,360 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
60.43      0.96     0.96 22530032    42.65    42.65  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
32.73      1.48     0.52                             compress(std::string, int, std::string)
 6.29      1.58     0.10  5486575    18.24    60.89  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.63      1.59     0.01                             convert_char_to_string(char const*, int)
 0.00      1.59     0.00     4098     0.00     0.00  show_usage()
 0.00      1.59     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      1.59     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      1.59     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


Conclusion:

Most of the time is spent in the compress function and in the hashtable lookups, because the dictionary is constantly being modified and read from. In these profiles the hashtable lookup (_M_find_before_node) and compress itself together account for roughly 90% or more of the sampled run time, so parallelizing them would affect almost all of the run time. The big-O for the application should be O(n), so run time should increase linearly with file size; however, for the largest file the hashtable lookup took a noticeably larger share of the time than the compress function (about 60% for bible4.txt versus about 50% for bible.txt). This application is not a good candidate for parallelization because of the dictionary hashtable. The compress dictionary needs to be globally accessible and is constantly modified and read, which could cause problems if multiple threads were running, especially since the table must be modified and read sequentially for the compression to work. Threading the compress function could lead to errors in the compressed output, which makes this application difficult to parallelize.




Application 2 - Image Processing


Description: The code in focus was borrowed from user "cwginac" at DreamInCode: cwginac - Image Processing Tutorial

I stumbled across this code while searching for a program written in C++ that processes images and can be deployed on Linux without too many library issues. I began my research by looking for open source image processing programs, which led me to a number of libraries and sources (including CLMG). However, those sources were mainly in Java. Concerned that I didn't truly understand the process, I redefined my focus to understanding image processing and searched for "image processing tutorials in C++". The DreamInCode website was listed on the front page.

The code uses standard libraries to handle images in the PGM format. It is a fairly straightforward program that takes a number of images as command-line arguments, with the first being the image to edit, and gives the user some options to process the image, writing the result to one of the provided image paths. The one thing I noticed the program lacks is a method for converting images to PGM, even though the program requires that format. Therefore, to test the program, I took a JPEG image and converted it to PGM using an online tool found here: Dan's Tools - Convert Files.

Providing just one image as a command-line argument yielded only 4 options, which include getting/setting pixel values and getting other information. Looking through the code I knew there was more to it, but processing an image with the program requires 2 arguments: the first is the original image and the second is the output image. With both provided, the program allows the user to rotate, invert/reflect, enlarge, shrink, crop, translate, and negate the image.


The code is found on the site, near the end of the article. To run it, I made a Makefile. The code downloaded from the site is stored as text files, so I renamed the files to .cpp and .h within the Linux environment. Here are the files for ease of access: File:Main.cpp.txt | File:Image.h.txt | File:Image.cpp.txt Also, here is the Makefile:

#Makefile for A1 - Image Processing
#
GCC_VERSION = 8.2.0
PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/
CC = ${PREFIX}gcc
CPP = ${PREFIX}g++

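# Link both object files; image.o holds the Image class definitions
main: main.o image.o
	$(CPP) -pg -o main main.o image.o
main.o: main.cpp
	$(CPP) -c -O2 -g -pg -std=c++17 main.cpp
image.o: image.cpp
	$(CPP) -c -O2 -g -pg -std=c++17 image.cpp
clean:
	rm -f *.o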

From my test, I used a 768 KB image borrowed from the web, enlarged it a couple times, shrank it, rotated it, and negated it. The result was an 18.7 MB image. The time it took to run the program was:

real    1m33.427s
user    0m0.431s
sys     0m0.493s

The generated FLAT profile of the program revealed:

 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
34.46      0.21     0.21        7    30.03    30.03  Image::operator=(Image const&)
26.26      0.37     0.16        6    26.69    26.69  Image::Image(int, int, int)
13.13      0.45     0.08                             Image::rotateImage(int, Image&)
11.49      0.52     0.07                             writeImage(char*, Image&)
 9.85      0.58     0.06                             Image::enlargeImage(int, Image&)
 1.64      0.59     0.01                             readImage(char*, Image&)
 1.64      0.60     0.01                             Image::negateImage(Image&)
 1.64      0.61     0.01                             Image::shrinkImage(int, Image&)
 0.00      0.61     0.00        7     0.00     0.00  Image::~Image()
 0.00      0.61     0.00        1     0.00     0.00  _GLOBAL__sub_I__ZN5ImageC2Ev
 0.00      0.61     0.00        1     0.00     0.00  Image::Image(Image const&)


A second run of the program, with me enlarging the image, rotating, translating, and negating the image resulted in a time of:

real    1m0.968s
user    0m0.295s
sys     0m0.297s

And a Flat profile of:

 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
33.37      0.14     0.14        6    23.36    23.36  Image::operator=(Image const&)
26.22      0.25     0.11        5    22.02    22.02  Image::Image(int, int, int)
14.30      0.31     0.06                             writeImage(char*, Image&)
14.30      0.37     0.06                             Image::rotateImage(int, Image&)
 7.15      0.40     0.03                             Image::enlargeImage(int, Image&)
 2.38      0.41     0.01                             Image::reflectImage(bool, Image&)
 2.38      0.42     0.01                             Image::translateImage(int, Image&)
 0.00      0.42     0.00        6     0.00     0.00  Image::~Image()
 0.00      0.42     0.00        1     0.00     0.00  _GLOBAL__sub_I__ZN5ImageC2Ev
 0.00      0.42     0.00        1     0.00     0.00  Image::Image(Image const&)


Conclusion:

Ignoring the real time spent (which is mostly time waiting on user input), the majority of time is spent in the assignment operator and the class constructor, most likely because the image is constantly being manipulated, read from, and copied to and from temporary storage for ease of use and object safety. Other than the basic functions (like read/write), the rotate and enlarge functions take the largest amount of time, which could mean that parallelizing them would improve the run time. My reading of the big-O notation for the rotate function is O(n^2), which is a quadratic growth rate, whereas the enlarge function is O(n^3) or greater. The rotate function may have had the longer run time because I enlarged the image before rotating it, but the notations still hold.

Personally, I'd say that this application is not the best for parallelization because of the simple way it handles images, but I can definitely see how one or more of the functions in the program could be parallelized. The main issues in making the program parallel center on the image needing to be accessible to every function while it is constantly being modified and read from during processing. In simple terms, I think that if multiple threads were running to speed up the program, the computation could lead to processing errors resulting in a corrupted image, distortions, and the like. I may be wrong in this thought, but, to my knowledge, not being able to avoid such issues makes this program somewhat difficult to parallelize safely.



Application 3 - Sorting Algorithms

Description:

I decided to select an option from the suggested projects: Sorting Algorithms. The sorting algorithms I included were Bubble, Insertion, Selection, Heap, Merge, and Quicksort. I decided to create one application that calls all of the sorting functions instead of creating individual modules and profiling them separately, simply because 99% of the total running time of each module would be taken up by the sorting function itself. A single module with one function per algorithm gives a better understanding and comparison of the time taken by each. I allocated memory for all of the arrays and populated a single array of size n with randomly generated data. I then copied the memory from the populated array to all of the other arrays to ensure that every sorting algorithm receives the same data. I then passed each array to its respective sorting function, which sorted the array in place. One thing to keep in mind is that as n (the size of the array being sorted) increases, the time increases. I have included 3 different profiles where n equals 50,000, 100,000, and lastly 500,000.
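
The setup described above (one randomly filled master array, an identical copy per algorithm, then one call per sorting function) can be sketched roughly as follows. This is not the code from SortingAlgorithms.cpp: the names bubbleSortSketch, makeRandomData, and timeSortSeconds are hypothetical, and timing with std::chrono is an assumption made for illustration, since the profiles here were gathered with gprof.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for one of the sorting functions being compared.
void bubbleSortSketch(int* a, int n) {
    for (int i = 0; i < n - 1; ++i)
        for (int j = 0; j < n - 1 - i; ++j)
            if (a[j] > a[j + 1]) std::swap(a[j], a[j + 1]);
}

// Fill one master array with pseudo-random values (fixed seed for repeatability).
std::vector<int> makeRandomData(int n, unsigned seed) {
    std::srand(seed);
    std::vector<int> data(n);
    for (int& x : data) x = std::rand();
    return data;
}

// Time a sort on a private copy so every algorithm sees identical input.
// The sorted copy is returned through sortedOut so it can be checked.
template <typename SortFn>
double timeSortSeconds(const std::vector<int>& master, SortFn sortFn,
                       std::vector<int>& sortedOut) {
    sortedOut = master;  // copy the master data, leaving the original untouched
    auto start = std::chrono::steady_clock::now();
    sortFn(sortedOut.data(), static_cast<int>(sortedOut.size()));
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling timeSortSeconds once per algorithm with the same master vector keeps the comparison fair, because no algorithm ever receives data that a previous one has already sorted.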


Source Code:

 File:SortingAlgorithms.cpp.txt
 The link to the source code AND the make file can also be found on GitHub at: https://github.com/Akshat55/Sorting-Algorithms-Comparison


N = 50,000

 Total Time (seconds) = 7.98 seconds

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
63.21      5.04     5.04                             bubbleSort(int*, int)
29.91      7.42     2.38                             selectionSort(int*, int)
 6.79      7.96     0.54                             insertionSort(int*, int)
 0.13      7.97     0.01    49999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.13      7.98     0.01                             quickSort(int*, int, int)
 0.00      7.98     0.00        8     0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00      7.98     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii


N = 100,000

 Total Time (seconds) = 31.42 seconds 

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
63.91     20.05    20.05                             bubbleSort(int*, int)
29.15     29.19     9.14                             selectionSort(int*, int)
 6.96     31.38     2.18                             insertionSort(int*, int)
 0.06     31.40     0.02                             heapSort(int*, int)
 0.03     31.41     0.01    99999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.03     31.42     0.01                             quickSort(int*, int, int)
 0.00     31.42     0.00        8     0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00     31.42     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii


N = 500,000

 Total Time (minutes) = 13.47 minutes 

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
62.36    503.32   503.32                             bubbleSort(int*, int)
30.67    750.86   247.54                             selectionSort(int*, int)
 7.08    807.99    57.14                             insertionSort(int*, int)
 0.02    808.12     0.13                             heapSort(int*, int)
 0.01    808.20     0.08   499999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.01    808.26     0.06                             quickSort(int*, int, int)
 0.00    808.26     0.00        8     0.00    10.01  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00    808.26     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii

Conclusion: Based on the results from profiling 3 different array sizes, we can see that the majority of the sorting time is taken by the O(n^2) algorithms (Bubble, Insertion, and Selection). The other sorting algorithms (Heap, Merge, Quick), which are O(n log(n)), are extremely fast even when the arrays are large. As observed from the profiles, the elapsed time increased as the size of the array increased: the entire program went from approximately 8 seconds at N = 50,000 to approximately 13 minutes at N = 500,000.
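
As a rough sanity check on the growth rates mentioned above, the measured times can be compared against idealized cost models. The helpers below are hypothetical and ignore constant factors and lower-order terms, so they are only a sketch of the arithmetic.

```cpp
#include <cmath>

// Predicted slowdown when the input grows from n1 to n2 elements,
// under an idealized O(n^2) cost model.
double quadraticRatio(double n1, double n2) {
    return (n2 / n1) * (n2 / n1);
}

// Predicted slowdown under an idealized O(n log n) cost model.
double nLogNRatio(double n1, double n2) {
    return (n2 * std::log2(n2)) / (n1 * std::log2(n1));
}

// Observed slowdown between two measured times (same units).
double observedRatio(double t1, double t2) {
    return t2 / t1;
}
```

Using the bubbleSort self times from the profiles (5.04 s at N = 50,000 and 503.32 s at N = 500,000), observedRatio gives about 99.9, which is very close to the quadraticRatio of 100 for a tenfold increase in n. The same tenfold increase under the O(n log n) model predicts a slowdown of only about 12.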


Final Selection: Sorting Algorithms

Based on the profiled applications above, we think that the sorting algorithms would benefit a lot from offloading to the GPU. Sorting algorithms are commonly used in programming and can have a strong impact on a program's speed and efficiency. Since they are so commonly used, we think it would be quite interesting to see if we can speed up the O(n^2) sorting algorithms to potentially match the sorting speed of the O(n log n) algorithms. We expect less change from parallelizing the O(n log n) algorithms, as they are already fast.

Assignment 2

Sorting Algorithm parallelization completion:


Akshatkumar Patel – Bubble Sort & Merge Sort


Both sorting algorithms (Bubble and Merge) were created and profiled on a Tesla K80 GPU (cloud, Google Colab). To compile them, the following was typed:

 !nvcc bubble.cu -o bubble

The same applies to merge sort, but if the compiler is not already targeting a compute_35 architecture, the following compilation command must be used:

 !nvcc -arch=sm_35 -rdc=true merge.cu -o merge -lcudadevrt

Observation: We thought that offloading the entire sorting algorithms to the GPU (since they are compute-intensive in a way) would result in an increase in speed. However, we observed that this was not the case for quite a few of the sorting algorithms, since they are not optimized. One thing that stood out to me was that I had to change my approach to bubble sort in order to make it work: I had to use the odd-even bubble sort. For bubble sort, as N gets bigger, the GPU time increases more than the CPU time does. I tried using threads to make it faster, but that resulted in a slower speed.
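
For readers unfamiliar with the odd-even variant mentioned above, here is a minimal CPU-side sketch in plain C++. It is not the code from A2 bubble.cu, and the function name oddEvenSort is hypothetical. The point it illustrates is that within one phase every compare-and-swap touches a disjoint pair of elements, which is what makes it possible to give each pair to its own GPU thread; here the pairs are simply visited in a loop.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU sketch of odd-even transposition sort.
// Even phases compare the pairs (0,1), (2,3), ...; odd phases compare
// (1,2), (3,4), ... . After n phases an array of n elements is sorted.
void oddEvenSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Pairs within a phase never overlap, so their order does not matter.
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
        }
    }
}
```

A classic bubble sort carries each swap forward into the next comparison, so its inner loop cannot be split across threads in the same way; the odd-even ordering removes that dependency at the cost of needing a synchronization point between phases.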

As for merge sort, I had difficulty doing recursion in the kernel, but this was solved using the “-arch=sm_35 -rdc=true” command-line switches. Unlike the odd-even bubble sort, merge sort can be optimized and improved greatly, since I found there to be multiple ways of implementing it. I also found a merge sort implementation in the CUDA 6.0 Samples, which was quite complex to understand but overall much faster than my implementation.

File:A2 bubble.cu.txt

File:A2 merge.cu.txt

Woosle Park – Insertion Sort & Heap Sort


Both the insertion and heap sorting algorithms were created and profiled on a GTX 1080. The algorithms were compiled using Visual Studio 2017.

Observation: For heap sort, the recursive algorithm produces a CUDA warning about a potential stack overflow. For the heapify kernel, locating the left and right elements of the heap worked better using bit manipulation. The same issue occurred here as in the labs: my GPU is too fast, so the results of each kernel flatlined even as the number of elements increased. That being said, you do notice a slight increase in speed in the GPU results as the number of elements gets higher. Insertion sort currently runs in a 1-dimensional grid for testing; this will be changed in A3.
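
To show what locating heap children with bit manipulation can look like, here is a small CPU-side sketch for a 0-based array heap. It is not the code from A2 heap.cu; the names leftChild, rightChild, siftDown, and heapSortSketch are hypothetical, and the sift-down is written iteratively (an assumption for this sketch, chosen because it avoids the recursion that triggers the stack warning).

```cpp
#include <utility>

// Child indices of node i in a 0-based array heap, using a shift for the doubling.
inline int leftChild(int i)  { return (i << 1) + 1; }
inline int rightChild(int i) { return (i << 1) + 2; }

// Iterative sift-down for a max-heap: push a[i] down until neither child is larger.
void siftDown(int* a, int n, int i) {
    while (true) {
        int largest = i;
        int l = leftChild(i);
        int r = rightChild(i);
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) return;  // heap property holds at this node
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

// Heap sort built on the helpers above (ascending order).
void heapSortSketch(int* a, int n) {
    for (int i = n / 2 - 1; i >= 0; --i) siftDown(a, n, i);  // build the max-heap
    for (int end = n - 1; end > 0; --end) {
        std::swap(a[0], a[end]);  // move the current maximum to its final slot
        siftDown(a, end, 0);      // restore the heap on the remaining prefix
    }
}
```

Because the loop in siftDown replaces the recursive call, the depth of the call stack stays constant regardless of the heap size.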


File:A2 insertion.cu.txt

File:A2 heap.cu.txt

Jordan Pitters – Selection & Quick Sort


Both the Selection and Quick sorting algorithms were created and profiled on a Tesla K80 GPU (cloud), specifically using the compiler provided by Google Colaboratory at https://colab.research.google.com. To compile them, the following steps were followed:

  1. Open a cell for code and name the code file in the first line with: “%%file name.cu”. For example, I named the file using: “%%file TSort.cu” (meaning “Test Sort”).
  2. Put the code into the cell, and when done click the play button to save the code to the cloud.
  3. Next, open another code cell and enter: “!nvcc name.cu -o name” (do not run it yet).
  4. Specifically, I used “!nvcc -arch=sm_35 -rdc=true TSort.cu -o TSort -lcudadevrt” to compile my code, as it raised the target compute capability to allow the required recursive kernel calls. It is recommended to use this command if your code requires a compute_35 architecture or greater.
  5. Finally, on a new line after the command from the previous step, or in a new cell, enter: “!./name 50” and click the play button to compile and run the code (the 50 is a command-line argument).
  6. To time the code I used: “!nvprof ./TSort 50000” (as an example).


Observation: We considered the methods we would need to parallelize the sorting algorithms and had assumed that offloading an entire algorithm from the CPU onto the GPU would yield gains in speed. However, after testing and observation, we found that the sorting algorithms would only yield positive results if we managed a fine balance between CPU and GPU code, and if the GPU calls were optimized. This was especially so for Selection Sort: we tried completely offloading the algorithm and it didn't seem to yield any positive results. We were not sure which methods were necessary to optimize the algorithm for the GPU, and had only meager ideas about designing a grid structure, so this remained something we would need to investigate.

The Quick Sort function, on the other hand, was difficult to design because the algorithm uses several recursive calls to complete its sorting, which would mean recursive kernel calls if we were to offload it as planned. It eventually came to the point where we were not sure what to do about the algorithm and had to research the capabilities of CUDA kernels, as well as other theorized ways of parallelizing Quick Sort. We found a great many comments on the algorithm and pseudocode suggesting the same recursive conclusion we had reached. We also found that kernels can indeed be called recursively, and that they need to be tied to streams to designate correlated kernels. As such, streams and several thread synchronizations, among other things, were used to complete the task of parallelizing Quick Sort. It was not simple, and the cloud compiler did not allow recursive kernel calls by default, so some research was done that yielded the command-line switches “-arch=sm_35 -rdc=true”, which allowed recursion so that testing could be done. In the end, the code was not optimized, but we were able to prove that the Selection and Quick sorting algorithms can be parallelized.


File:A2 quick.cu.txt

File:A2 selection.cu.txt


Final thoughts

https://docs.google.com/spreadsheets/d/1nDCtm8ar2AmhZujV4QeHUHoLfq80L1buAJqD0BALR0A/edit?usp=sharing GPU_CPU_Comparisons

Our parallel sorting algorithms were slower on the GPU. However, that changed after reprofiling the CPU code on an average computer with the following specs:

OS: Windows 10

Processor: Intel (R) Core (TM) i5-5200U CPU @2.2GHz

Ram: 8GB


The parallelized sorting algorithms were much faster than on the CPU listed above. Matrix performs much better than an average computer, hence the lower completion times in its profiles. Another thing to note is that choosing the right number of threads and blocks is important, or the sort can potentially be incorrect. Many of our attempts at sorting worked when n was smaller than or equal to 50,000, but when we tested 500,000 the sort was incorrect.
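
One common way a thread and block choice can break at larger n is a launch configuration that covers fewer threads than there are elements. Whether that is what happened in our runs is not established here; the sketch below only shows the arithmetic, using hypothetical helper names and an assumed 1024 threads per block.

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True when a launch of `blocks` blocks of `threadsPerBlock` threads
// provides at least one thread per element.
bool coversAll(int n, int blocks, int threadsPerBlock) {
    return static_cast<long long>(blocks) * threadsPerBlock >= n;
}
```

With 1024 threads per block, blocksFor(50000, 1024) is 49 and blocksFor(500000, 1024) is 489, so a grid sized for 50,000 elements would leave most of a 500,000-element array untouched. Kernels that rely on a single block, or on synchronization that only works within one block, can fail in a similar way once n exceeds the block size.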

Assignment 3