Project: LZW Compress
LZW stand for "Lempel–Ziv–Welch". It is a universal lossless data compression algorithm which is widely use in compressing software. It is widely used in image compress and Unix file compress. Here is a project example of LZW compressing written by c++. Here's the link for more details: https://codereview.stackexchange.com/questions/86543/simple-lzw-compression-algorithm
Basic Compile Command:(Linux)
g++ -std=c++0x lzw.c -o lzw
To Execute Projet
- Compress file:
./lzw -c file.txt
- Decopress file:
./lzw -d file.txt
Project: Grep - Count Keyword in File
On the Linux system, we are very familiar with the Grep command. This is a similar project to that. But this one is more simple. It just counts how many times the keyword appears in the file. After compile the program, we run it as a command like this:
Here, it counts how many 'is' are there in file 'sample.txt'.
We are going to work on 'Grep' project.
Through the profile, we can see that the most cost of the runtime is 'grep()' and 'convert_char_to_string()' functions.
Serial grep method
Then, we start working in grep function. As file size growing up, which means more characters in file, the runtime of 'grep()' grows much more faster than the other functions. So we are going to work on 'grep()' function.
We can see, there're two for loop in grep() function. In loop is inside another. Which mean it's kind of O(n^m) runtime function. That's why it cost a lot of time to run.
The idea of grep is very simple. The inside loop make a temporary string from input string with the length of keyword length. Then it compare the temp string to keyword. If they equal, means this temp string is the same as keyword string. It increases the counter by 1.
Simple? But to parallel it is not that simple...
Parallel grep idea
It's impossible to compare the string keyword to the file input string cause all characters are individual in threads. That make us spend a huge time to figure out the solution.
We create an integer array. The size of this array is the same as input's size. This array is to record the characters exist in that position. If characcter exist, we mark it as specific number. Wecall this array as band.
This is our main idea to parallel the grep function. Yes, it still have some loops. But those loops are not as large as that in serial grep function. So we start to design the code foe it.
Parallel grep code
Here are part of parallel grep code:
Yes. The grep of parallel's result is the same as serial's result! It works!
We got it!
Next, we optimize the code so that it runs faster.
Use shared memory
According to the data, the number of ntpb is not an important factor affecting time.
That's all for our assignment.
You can see my code here: https://gist.github.com/KignorChan/764c71d9d96561e4101d9830e0bc5e69
Idea from LZW compress program: https://codereview.stackexchange.com/questions/86543/simple-lzw-compression-algorithm