N/A

Team Members

1. Woosle Park, Data Compression
2. Akshat Patel,
3. Jordan Pitters,

Progress

Assignment 1

Application 1 - Data Compression

The algorithm used for data compression here is the Lempel–Ziv–Welch (LZW) algorithm. It is a lossless algorithm, meaning no data is lost when a file is compressed. LZW is commonly used in GIF and PDF files, but for this example I used a .txt file because it was easier to manipulate and scale in size. The file used for compression is a .txt version of the Holy Bible (https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt) because the contents are large enough to see the compression time and percentage. The algorithm reads a file's sequence of symbols, groups them into strings, and assigns each new string a 12-bit code that is stored in a table (the dictionary). Decompression rebuilds the same table from the code stream and performs the steps of compression in reverse.
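To make the table-building steps concrete, here is a minimal in-memory sketch of the same idea. The names `lzw_encode` and `lzw_decode` are hypothetical and not part of the project source. The sketch returns codes as integers instead of writing a file, and it assumes a 4096-entry (12-bit) table.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: turn `input` into a sequence of LZW codes.
std::vector<int> lzw_encode(const std::string& input) {
    const int kMaxCodes = 4096;  // assumption: 12-bit codes, 0..4095
    std::unordered_map<std::string, int> dict;
    for (int i = 0; i < 256; ++i) dict[std::string(1, static_cast<char>(i))] = i;
    int next_code = 256;

    std::vector<int> codes;
    std::string current;
    for (char c : input) {
        std::string candidate = current + c;
        if (dict.count(candidate)) {
            current = candidate;                 // keep extending the match
        } else {
            codes.push_back(dict[current]);      // emit the longest known string
            if (next_code < kMaxCodes) dict[candidate] = next_code++;
            current = std::string(1, c);
        }
    }
    if (!current.empty()) codes.push_back(dict[current]);
    return codes;
}

// Hypothetical helper: rebuild the text, growing the same table on the fly.
std::string lzw_decode(const std::vector<int>& codes) {
    const int kMaxCodes = 4096;
    std::vector<std::string> dict(256);
    for (int i = 0; i < 256; ++i) dict[i] = std::string(1, static_cast<char>(i));

    std::string output, previous;
    for (int code : codes) {
        assert(code >= 0 && code <= static_cast<int>(dict.size()));
        std::string entry;
        if (code < static_cast<int>(dict.size())) {
            entry = dict[code];
        } else {
            // The code refers to the entry that is about to be defined.
            assert(!previous.empty());
            entry = previous + previous[0];
        }
        output += entry;
        if (!previous.empty() && static_cast<int>(dict.size()) < kMaxCodes)
            dict.push_back(previous + entry[0]);
        previous = entry;
    }
    return output;
}
```

Under these assumptions, `lzw_encode("ABABABA")` yields the codes 65, 66, 256, 258, and `lzw_decode` turns them back into `ABABABA`. The last code, 258, reaches the decoder before the decoder has defined it, which is the special case handled by `previous + previous[0]`.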

Source Code:

```
//  Compile with gcc 4.7.2 or later, using the following command line:
//
//    g++ -std=c++0x lzw.c -o lzw
//
// LZW algorithm implemented using fixed 12-bit codes.

#include <iostream>
#include <sstream>
#include <fstream>

#include <bitset>
#include <string>
#include <unordered_map>

#define MAX_DEF 4096

using namespace std;

// Render a code as a 12-character string of '0' and '1'
string convert_int_to_bin(int number)
{
    string result = bitset<12>(number).to_string();
    return result;
}

void compress(string input, int size, string filename) {
    unordered_map<string, int> compress_dictionary(MAX_DEF);
    // Initialize the dictionary with all 256 single-byte strings
    for (unsigned int i = 0; i < 256; i++) {
        compress_dictionary[string(1, static_cast<char>(i))] = i;
    }
    string current_string;
    unsigned int next_code = 256;
    // Output file for compressed data
    ofstream outputFile;
    outputFile.open(filename + ".lzw");

    for (char& c : input) {
        current_string = current_string + c;
        if (compress_dictionary.find(current_string) == compress_dictionary.end()) {
            // Codes must fit in 12 bits, so the largest usable code is MAX_DEF - 1
            if (next_code < MAX_DEF)
                compress_dictionary.insert(make_pair(current_string, next_code++));
            // Emit the code for the longest string already in the dictionary
            current_string.erase(current_string.size() - 1);
            outputFile << convert_int_to_bin(compress_dictionary[current_string]);
            current_string = c;
        }
    }
    if (current_string.size())
        outputFile << convert_int_to_bin(compress_dictionary[current_string]);
    outputFile.close();
}

void decompress(string input, int size, string filename) {
    unordered_map<unsigned int, string> dictionary(MAX_DEF);
    // Initialize the dictionary with all 256 single-byte strings
    for (unsigned int i = 0; i < 256; i++) {
        dictionary[i] = string(1, static_cast<char>(i));
    }
    string previous_string;
    unsigned int code;
    unsigned int next_code = 256;
    // Output file for decompressed data
    ofstream outputFile;
    outputFile.open(filename + "_uncompressed.txt");

    int i = 0;
    while (i < size) {
        // Extract 12 bits and convert binary to decimal
        string subinput = input.substr(i, 12);
        bitset<12> binary(subinput);
        code = binary.to_ullong();
        i += 12;

        // A code that is not defined yet can only be previous_string + its first character
        if (dictionary.find(code) == dictionary.end())
            dictionary.insert(make_pair(code, (previous_string + previous_string.substr(0, 1))));
        outputFile << dictionary[code];
        // Mirror the compressor: stop adding entries once the 12-bit table is full
        if (previous_string.size() && next_code < MAX_DEF)
            dictionary.insert(make_pair(next_code++, previous_string + dictionary[code][0]));
        previous_string = dictionary[code];
    }
    outputFile.close();
}

string convert_char_to_string(const char *pCh, int arraySize) {
    string str;
    if (arraySize > 0 && pCh[arraySize - 1] == '\0') str.append(pCh);
    else for (int i = 0; i < arraySize; i++) str.append(1, pCh[i]);
    return str;
}

static void show_usage()
{
    cerr << "Usage: \n"
         << "Specify the file that needs to be compressed or decompressed\n"
         << "lzw -c input    #compress file input\n"
         << "lzw -d input    #decompress file input\n"
         << "Compressed data will be found in a file with the same name but with a .lzw extension\n"
         << "Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
         << endl;
}

int main (int argc, char* argv[]) {
    streampos size;
    char * memblock;

    // Both a mode flag and a file name are required
    if (argc < 3)
    {
        show_usage();
        return(1);
    }
    ifstream file (argv[2], ios::in|ios::binary|ios::ate);
    if (file.is_open())
    {
        size = file.tellg();
        memblock = new char[size];
        file.seekg (0, ios::beg);
        file.read (memblock, size);  // load the whole file into memory
        file.close();
        string input = convert_char_to_string(memblock, size);
        delete[] memblock;
        if (string( "-c" ) == argv[1] )
            compress(input, size, argv[2]);
        else if (string( "-d" ) == argv[1] )
            decompress(input, size, argv[2]);
        else
            show_usage();
    }
    else {
        cout << "Unable to open file." << endl;
        show_usage();
    }
    return 0;
}
```
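The listing above writes each code as twelve '0'/'1' characters, so every 12-bit code occupies 12 bytes in the .lzw file. As a hedged sketch only, and not something the project source does, two 12-bit codes could instead be packed into three bytes. The helper names `pack12` and `unpack12` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack 12-bit codes, two codes per three bytes.
// An odd trailing code takes two bytes, with the low nibble zero-padded.
std::vector<std::uint8_t> pack12(const std::vector<std::uint16_t>& codes) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve((codes.size() * 3 + 1) / 2);
    for (std::size_t i = 0; i < codes.size(); i += 2) {
        std::uint16_t a = codes[i];
        assert(a < 4096);
        bytes.push_back(static_cast<std::uint8_t>(a >> 4));  // high 8 bits of a
        if (i + 1 < codes.size()) {
            std::uint16_t b = codes[i + 1];
            assert(b < 4096);
            // low 4 bits of a, then high 4 bits of b
            bytes.push_back(static_cast<std::uint8_t>(((a & 0xF) << 4) | (b >> 8)));
            bytes.push_back(static_cast<std::uint8_t>(b & 0xFF));  // low 8 bits of b
        } else {
            bytes.push_back(static_cast<std::uint8_t>((a & 0xF) << 4));
        }
    }
    return bytes;
}

// Hypothetical helper: inverse of pack12.
std::vector<std::uint16_t> unpack12(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 3 != 1);  // a valid stream is 3k or 3k + 2 bytes long
    std::vector<std::uint16_t> codes;
    std::size_t i = 0;
    for (; i + 3 <= bytes.size(); i += 3) {
        codes.push_back(static_cast<std::uint16_t>((bytes[i] << 4) | (bytes[i + 1] >> 4)));
        codes.push_back(static_cast<std::uint16_t>(((bytes[i + 1] & 0xF) << 8) | bytes[i + 2]));
    }
    if (i + 2 == bytes.size())  // odd trailing code
        codes.push_back(static_cast<std::uint16_t>((bytes[i] << 4) | (bytes[i + 1] >> 4)));
    return codes;
}
```

With this layout a stream of N codes takes about 1.5 N bytes rather than 12 N, at the cost of the output no longer being readable as text.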

Flat Profiles:

bible.txt - 4,351,186 bytes

```
Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
50.04      0.18     0.18  5758089    31.29    31.29  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
50.04      0.36     0.18                             compress(std::string, int, std::string)
0.00      0.36     0.00  1402806     0.00    31.29  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
0.00      0.36     0.00     4098     0.00     0.00  show_usage()
0.00      0.36     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
0.00      0.36     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
0.00      0.36     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
```

bible2.txt - 8,702,373 bytes

```
Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
48.39      0.44     0.44 11511109    38.26    38.26  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
46.19      0.86     0.42                             compress(std::string, int, std::string)
5.50      0.91     0.05  2804639    17.84    56.10  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
0.00      0.91     0.00     4098     0.00     0.00  show_usage()
0.00      0.91     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
0.00      0.91     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
0.00      0.91     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
```

bible3.txt - 13,053,560 bytes

```
Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
47.58      0.58     0.58 17264129    33.63    33.63  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
42.66      1.10     0.52                             compress(std::string, int, std::string)
7.38      1.19     0.09  4206472    21.41    55.04  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
1.64      1.21     0.02                             convert_char_to_string(char const*, int)
0.82      1.22     0.01                             std::pair<std::__detail::_Node_iterator<std::pair<unsigned int const, std::string>, false, false>, bool> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<std::pair<unsigned int, std::string> >(std::integral_constant<bool, true>, std::pair<unsigned int, std::string>&&)
0.00      1.22     0.00     4098     0.00     0.00  show_usage()
0.00      1.22     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
0.00      1.22     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
0.00      1.22     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
```

bible4.txt - 17,039,360 bytes

```
Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
60.43      0.96     0.96 22530032    42.65    42.65  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
32.73      1.48     0.52                             compress(std::string, int, std::string)
6.29      1.58     0.10  5486575    18.24    60.89  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
0.63      1.59     0.01                             convert_char_to_string(char const*, int)
0.00      1.59     0.00     4098     0.00     0.00  show_usage()
0.00      1.59     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
0.00      1.59     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
0.00      1.59     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()
```

Conclusion:

Most of the time is spent in the compress function, and the hashtable accounts for the largest share because it is constantly being read from and modified. In the flat profiles, compress and the hashtable lookup together account for about 90% or more of the sampled time, so that is the portion of the run time that parallelization would affect. The application should run in O(n) time, so run time increases linearly with file size. However, this application is not a good candidate for parallelization because of the dictionary hashtable. The hashtable needs to be globally accessible and is constantly read and modified, and each lookup depends on the entries added before it, so the reads and updates must happen sequentially for the compression to be effective. Running multiple threads against it could therefore pose issues.
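The two numeric claims here can be checked with a little arithmetic. The sketch below uses two hypothetical helpers: `seconds_per_mb` divides a profile's cumulative time by the input size, and `amdahl_speedup` applies Amdahl's law to give an upper bound on speedup for a given parallel fraction. Treating the roughly 90% figure as fully parallelizable is an assumption made only for illustration; the dictionary dependency described above is exactly what makes that bound hard to reach.

```cpp
#include <cassert>

// Hypothetical helper: profiled seconds per megabyte (10^6 bytes) of input.
double seconds_per_mb(double seconds, double bytes) {
    assert(bytes > 0.0);
    return seconds / (bytes / 1e6);
}

// Hypothetical helper: Amdahl's law upper bound on speedup when
// `parallel_fraction` of the run time is spread over `workers` units.
double amdahl_speedup(double parallel_fraction, int workers) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0 && workers > 0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

Applied to the four profiles (0.36 s, 0.91 s, 1.22 s, and 1.59 s for 4,351,186, 8,702,373, 13,053,560, and 17,039,360 bytes), `seconds_per_mb` gives roughly 0.08 to 0.10 seconds per megabyte in each case, which is consistent with O(n) growth. `amdahl_speedup(0.9, 8)` is about 4.7, and the bound approaches 10 as the number of workers grows.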