DPS915 C U D A B O Y S

From CDOT Wiki
Jump to: navigation, search

C U D A B O Y S

Team Members

  1. Manjot Sandhu, Some responsibility
  2. Johnathan Ragimov, Some other responsibility
  3. Oleg Eustace, Some other responsibility

Email All

Progress

Assignment 1

✓ Profile 0: File Encryption

Description

This piece of software takes a file and ecrypts it one of 4 ways:

  1. Byte Inversion
  2. Byte Cycle
  3. Xor Cipher
  4. RC4 Cipher


Inside the byteCipher method, exists a for loop that could use optimization. Within this loop specifically, the lines that call the cycle and rc4_output functions are the ones that are taking the longest time to execute:

      for (int i = 0; i < bufferSize; i++){
               // going over every byte in the file
               switch (mode) {
                       case 0: // inversion
                               buffer[i] = ~buffer[i];
                               break;
                       case 1: // cycle
                               buffer [i] = cycle (buffer [i]);
                               break;
                       case 2: // RC4
                               buffer [i] = buffer [i] ^ rc4_output();
                               break;
               }
       }

Here is what these functions cycle and rc4_output functions look like:

char cycle (char value) {
       int leftMask = 170;
       int rightMask = 85;
       int iLeft = value & leftMask;
       int iRight = value & rightMask;
       iLeft = iLeft >> 1;
       iRight = iRight << 1;
       return iLeft | iRight;
}
unsigned char rc4_output() {
   unsigned char temp;
   i = (i + 1) & 0xFF;
   j = (j + S[i]) & 0xFF;
   temp = S[i];
   S[i] = S[j];
   S[j] = temp;
   return S[(S[i] + S[j]) & 0xFF];
}


We need to change these two functions so they are added to the CUDA device as "device functions".

Profiling on Linux

The following test runs were performed on the following Virtual Machine:

  • CentOS 7
  • i7-3820 @ 3.6 GHz
  • 2GB DDR3
  • gcc version 4.8.3


Using compiler settings:

g++ -c -O2 -g -pg -std=c++11 encFile.cpp 


RC4 Cipher - 283 MB mp3 File

[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/music.mp3
/home/johny/aes/music.mp3
* * * File Protector * * *
Mode 4: RC4 cipher
Please enter the RC4 key (8 chars min)
testing123
The password is: testing123
Beginning encryption
Completed: 100%
Cipher completed.
Program terminated.
real    0m6.758s
user    0m3.551s
sys     0m0.068s
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
84.05      1.70     1.70 296271519     0.00     0.00  rc4_output()
13.39      1.97     0.27                             byteCipher(int, std::string)
 2.73      2.02     0.06        1    55.09    55.09  rc4_init(unsigned char*, unsigned int)
 0.00      2.02     0.00        1     0.00     0.00  _GLOBAL__sub_I_S

As we can see the rc4_output and byteCipher functions take up most of the processing time.


RC4 Cipher - 636 MB iso File

[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/cent.iso
/home/johny/aes/cent.iso
 * * * File Protector * * *
Mode 4: RC4 cipher
Please enter the RC4 key (8 chars min)
testing123
The password is: testing123
Beginning encryption
Completed: 100%
Cipher completed.
Program terminated.
real    0m10.293s
user    0m8.235s
sys     0m0.312s
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
74.86      3.59     3.59 666894336     0.00     0.00  rc4_output()
23.21      4.70     1.11                             byteCipher(int, std::string)
 2.09      4.80     0.10        1   100.16   100.16  rc4_init(unsigned char*, unsigned int)
 0.00      4.80     0.00        1     0.00     0.00  _GLOBAL__sub_I_S

RC4 Cipher - 789 MB iso File

[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/xu.iso
/home/johny/aes/xu.iso
 * * * File Protector * * *
Mode 4: RC4 cipher
Please enter the RC4 key (8 chars min)
testing123
The password is: testing123
Beginning encryption
Completed: 100%
Cipher completed.
Program terminated.
real    0m12.566s
user    0m10.170s
sys     0m0.228s
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
75.51      4.40     4.40 827326464     0.00     0.00  rc4_output()
23.02      5.74     1.34                             byteCipher(int, std::string)
 1.63      5.84     0.10        1    95.15    95.15  rc4_init(unsigned char*, unsigned int)
 0.00      5.84     0.00        1     0.00     0.00  _GLOBAL__sub_I_S


Profiling on Windows

The following test runs were performed on the following Machine:

  • Windows 10
  • i7-4790k @ 4GHz
  • 16GB DDR3
  • Visual Studio 2013


RC4 Cipher - 283 MB mp3 File

Winmp3.png


RC4 Cipher - 636 MB iso File

Wincent.png


RC4 Cipher - 789 MB iso File

Winxu.png


Byte Cycle - 283 MB mp3 File

Winmp32.png


Byte Cycle - 636 MB iso File

Wincent2.png


Byte Cycle - 789 MB iso File

Winxu2.png

✗ Profile 1: PI Approximation

  • Sample run:
[root@jr-net-cent a1]# time ./pi
3.024
operation - took - 0.0001040000 secs
3.1676000000
operation - took - 0.0002280000 secs
3.1422800000
operation - took - 0.0022700000 secs
3.1418720000
operation - took - 0.0222910000 secs
3.1412748000
operation - took - 0.2185140000 secs
3.1417290800
operation - took - 2.2039310000 secs
3.1415420600
operation - took - 21.9592080000 secs
3.1415625844
operation - took - 47.1807910000 secs
3.1415537704
real    3m33.129s 
user    3m32.925s 
sys     0m0.016s
  • gprof:
Each sample counts as 0.01 seconds.
%   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
100.52    106.93   106.93       11     9.72     9.72  calcpi(int, int*)
0.00    106.93     0.00       11     0.00     0.00  reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000l> >)
0.00    106.93     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10reportTimePKcNSt6chrono8durationIlSt5ratioILl1ELl1000000EEEE

✗ Profile 2: Wave Form Generator

This is the program we selected to optimize. It's a great candidate because it has 2 primary functions that have a few for loops in them. One of the functions reads an Mp3 file and writes wave data to a file -- this function takes quite a bit of time to execute. The other function actually takes this data and converts it to a view-able sound wave image. Both functions would benefit greatly from the extra processing power that a GPU provides: mp3 read/decode time would be greatly reduced.

'This piece of code is too complex and requires a linux environment to run. Please see Profile 0 for the one we are currently using.'

  • Sample Run
[root@jr-net-cent7 ~]# time audiowaveform -i Steph\ DJ\ -\ Noise\ Control\ Episode\ 025\ Feat\ Jack\ Diamond\ 13th\ January\ 2014.mp3 -o test.dat -z 256 -b 8
Input file: Steph DJ - Noise Control Episode 025 Feat Jack Diamond 13th January 2014.mp3
Format: Audio MPEG layer III stream
Bit rate: 320000 kbit/s
CRC: no
Mode: normal LR stereo
Emphasis: no
Sample rate: 44100 Hz
Generating waveform data...
Samples per pixel: 256
Input channels: 2
Done: 100%
Recoverable frame level error: lost synchronization
Frames decoded: 283540 (123:26.759)
Generated 1275930 points
Writing output file: test.dat
Resolution: 8 bits
real    0m32.486s
user    0m32.409s
sys     0m0.056s
[root@jr-net-cent7 ~]# which audiowaveform
/usr/local/bin/audiowaveform
[root@jr-net-cent7 ~]# gprof -p -b /usr/local/bin/audiowaveform > final.dat


  • gprof
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
58.71      4.28     4.28    79746     0.00     0.00  WaveformGenerator::process(short const*, int)
31.00      6.54     2.26        1     2.26     7.27  Mp3AudioFileReader::run(AudioProcessor&)
 9.33      7.22     0.68 653276160     0.00     0.00  MadFixedToSshort(int)
 0.41      7.25     0.03        1     0.03     0.03  dumpInfo(std::ostream&, mad_header const&)
 0.14      7.26     0.01  7655603     0.00     0.00  short const& std::forward<short const&>(std::remove_reference<short const&>::type&)
 0.14      7.27     0.01  2551860     0.00     0.00  writeInt8(std::ostream&, signed char)
 0.14      7.28     0.01  2551860     0.00     0.00  std::vector<short, std::allocator<short> >::push_back(short const&)
 0.14      7.29     0.01  1275930     0.00     0.00  WaveformBuffer::getMinSample(int) const
 0.00      7.29     0.00  2551938     0.00     0.00  operator new(unsigned long, void*)
 0.00      7.29     0.00  2551860     0.00     0.00  void __gnu_cxx::new_allocator<short>::construct<short, short const&>(short*, short const&)
 0.00      7.29     0.00  2551860     0.00     0.00  std::vector<short, std::allocator<short> >::operator[](unsigned long) const
 0.00      7.29     0.00  2551860     0.00     0.00  std::enable_if<std::allocator_traits<std::allocator<short> >::__construct_helper<short, short const&>::value, void>::type std::allocator_traits<std::allocator<short> >::_S_construct<short, short const&>(std::allocator<short>&, short*, short const&)
 0.00      7.29     0.00  2551860     0.00     0.00  decltype (_S_construct({parm#1}, {parm#2}, (forward<short const&>)({parm#3}))) std::allocator_traits<std::allocator<short> >::construct<short, short const&>(std::allocator<short>&, short*, short const&)
 0.00      7.29     0.00  1275931     0.00     0.00  WaveformGenerator::reset()
 0.00      7.29     0.00  1275930     0.00     0.00  WaveformBuffer::appendSamples(short, short)
 0.00      7.29     0.00  1275930     0.00     0.00  WaveformBuffer::getMaxSample(int) const
 0.00      7.29     0.00    79747     0.00     0.00  AudioFileReader::showProgress(long long, long long)
 0.00      7.29     0.00     7272     0.00     0.00  BstdRead
 0.00      7.29     0.00     7271     0.00     0.00  BstdFileEofP
.....

✗Profile 3: String Processor

  • Sample run:
ext_string_example
es + 123 = ext_string123
456 + es = 456ext_string
es * 3   = ext_stringext_stringext_string
3  * es  = ext_stringext_stringext_string
original:  abc1234?abc1234?abc1234
es - abc = 1234?1234?1234
es - 123 = abc?abc?abc
es -   ? = abc1234abc1234abc1234
ext_string == eXt_StRiNg
original:  eXt_StRiNg
lowercase: ext_string
uppercase: EXT_STRING
original:              [   ext_string   ]
remove leading space:  [ext_string   ]
remove trailing space: [ext_string]
es: abc, ijk, pqr, xyz ---> split: (abc) (ijk) (pqr) (xyz)
es: abc, ijk, pqr, xyz ---> split_n(3): (abc) (ijk) (pqr)
es: 1, -23, 456, -7890 ---> parse: (1) (-23) (456) (-7890)
es: 1.1, -23.32, 456.654, -7890.0987 ---> parsed: (1.1000000000000001) (-23.32) (456.654) (-7890.0986999999996)
non_repeated_char_example
No non-repeated chars in string.
First non repeated char: a
First non repeated char: b
No non-repeated chars in string.
First non repeated char: c
First non repeated char: 1
translation_table_example
Before: Such is this simple string sample....Wowzers!
After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs!
Before: Such is this simple string sample....Wowzers!
After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs!
Before: Such is this simple string sample....Wowzers!
After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs!
Before: Such is this simple string sample....Wowzers!
After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs!
Before: Such is this simple string sample....Wowzers!
After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs!
find_n_consecutive_example
Result-01: [1] Location: [0]] Length: [1]
Result-02: [22] Location: [2]] Length: [2]
Result-03: [333] Location: [5]] Length: [3]
Result-04: [4444] Location: [9]] Length: [4]
Result-05: [55555] Location: [14]] Length: [5]
Result-06: [666666] Location: [20]] Length: [6]
Result-07: [7777777] Location: [27]] Length: [7]
Result-08: [88888888] Location: [35]] Length: [8]
Result-09: [999999999] Location: [44]] Length: [9]
Result-01: [a] Location: [0]] Length: [1]
Result-02: [bB] Location: [2]] Length: [2]
Result-03: [cCc] Location: [5]] Length: [3]
Result-04: [dDdD] Location: [9]] Length: [4]
Result-05: [EeEeE] Location: [14]] Length: [5]
Result-06: [fFfFfF] Location: [20]] Length: [6]
Result-07: [gGgGgGg] Location: [27]] Length: [7]
Result-08: [HhHhHhHh] Location: [35]] Length: [8]
Result-09: [IiIiIiIiI] Location: [44]] Length: [9]
split_on_consecutive_example
1 Consecutive digits: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 9 9 0 0 0 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4
2 Consecutive digits: 22 33 44 44 55 55 66 66 66 77 77 77 99 00 11 11 22 22 33 33 33 44 44 44
3 Consecutive digits: 333 444 555 666 666 777 777 000 111 222 333 333 444 444
4 Consecutive digits: 4444 5555 6666 7777 1111 2222 3333 4444
5 Consecutive digits: 55555 66666 77777 22222 33333 44444
6 Consecutive digits: 666666 777777 333333 444444
1 Consecutive letters: A B B C C C D D D D E E E E E F F F F F F G G G G G G G H I I J J J K K K K L L L L L M M M M M M N N N N N N N
2 Consecutive letters: BB CC DD DD EE EE FF FF FF GG GG GG II JJ KK KK LL LL MM MM MM NN NN NN
3 Consecutive letters: CCC DDD EEE FFF FFF GGG GGG JJJ KKK LLL MMM MMM NNN NNN
4 Consecutive letters: DDDD EEEE FFFF GGGG KKKK LLLL MMMM NNNN
5 Consecutive letters: EEEEE FFFFF GGGGG LLLLL MMMMM NNNNN
6 Consecutive letters: FFFFFF GGGGGG MMMMMM NNNNNN
index_of_example
Index of pattern[0123456789ABC]: 0
Index of pattern[123456789ABC]: 1
Index of pattern[23456789ABC]: 2
Index of pattern[3456789ABC]: 3
Index of pattern[456789ABC]: 4
Index of pattern[56789ABC]: 5
Index of pattern[6789ABC]: 6
Index of pattern[789ABC]: 7
Index of pattern[89ABC]: 8
Index of pattern[9ABC]: 9
Index of pattern[xyz]: 4294967295
truncatedint_example
i = -1234
i = -1234
u = 1234
i = -1234
u = 1234
real    0m0.248s
user    0m0.080s
sys     0m0.024s
  • Profile:


Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total          
time   seconds   seconds    calls  ms/call  ms/call  name   
83.33      0.05     0.05  1000008     0.00     0.00  unsigned int boost::uniform_int<unsigned int>::generate<boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> > >(boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> >&, unsigned int, unsigned int, unsigned int)
16.67      0.06     0.01        1    10.00    60.00  strtk::generate_random_data(unsigned char*, unsigned int, unsigned int, unsigned int)
 0.00      0.06     0.00     1642     0.00     0.00  boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>::twist(int)
 0.00      0.06     0.00      979     0.00     0.00  strtk::text::is_digit(char)
 0.00      0.06     0.00      978     0.00     0.00  strtk::text::is_letter(char)

Assignment 2

Description

Removing CPU Bottleneck

Removing the old CPU bottleneck in the byteCipher function:

      for (int i = 0; i < bufferSize; i++){
               // going over every byte in the file
               switch (mode) {
                       case 0: // inversion
                               buffer[i] = ~buffer[i];
                               break;
                       case 1: // cycle
                               buffer [i] = cycle (buffer [i]);
                               break;
                       case 2: // RC4
                               buffer [i] = buffer [i] ^ rc4_output();
                               break;
               }
       }

And replacing it with

...
if (mode == 1)
   getCycleBuffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output);
if (mode == 2)
   getRC4Buffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output);
...


Device Functions

Converting cycle and rc4_output functions to device functions:

/**
* Description:  Device function cycle
**/
__device__  char cycle(char value) {
  int leftMask = 170;
  int rightMask = 85;
  int iLeft = value & leftMask;
  int iRight = value & rightMask;
  iLeft = iLeft >> 1;
  iRight = iRight << 1;
  return iLeft | iRight;
}
/**
* Description:  Device function RC4
**/
__device__ unsigned char rc4_output() {
  unsigned char temp;
  unsigned char S[0x100]; // dec 256
  unsigned int i, j;
  i = (i + 1) & 0xFF;
  j = (j + S[i]) & 0xFF;
  temp = S[i];
  S[i] = S[j];
  S[j] = temp;
  return S[(S[i] + S[j]) & 0xFF];
}


Creating Kernels

We created kernels for each of the 2 different methods of Cipher that the program handles (RC4 and Cycle, but not the others -- read on):

/**
* Description:  RC4 Cuda Kernel
**/
__global__ void getRC4Buffer(char * buffer, int bufferSize) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < bufferSize)
    buffer[idx] = buffer[idx] ^ rc4_output();
}
/**
* Description:  Cycle Cuda Kernel
**/
__global__ void getCycleBuffer(char * buffer, int bufferSize) {
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 if (idx < bufferSize)
   buffer[idx] = cycle(buffer[idx]);
}


You may be asking what about the two other methods of cipher: byte inversion and xor cipher? Well, as it turns out these methods run perfectly fine on the CPU and usually are faster on the CPU than the GPU. We initially had converted these functions over to CUDA, but we soon discovered that these functions did not need to be converted as they ran faster on the CPU than they did on the GPU.

Here's an example of run time of Xor Cipher on both CPU and GPU with the 789MB file:

GPU: http://i.imgur.com/0PsLxzQ.png -- 6.263 seconds

CPU: http://i.imgur.com/ktn14q3.png -- 3.722 seconds


As we can see, the CPU runs way faster than the GPU: no parallelization needed here!

Profiling

The following test runs were performed on the following Machine:

  • Windows 10
  • i7-4790k @ 4.0GHz
  • 16GB DDR3
  • Nvidia GTX 430


RC4 Profiling

RC4 Cipher - 283 MB mp3 File

Total runtime: 1.358 seconds

Music.png


RC4 Cipher - 636 MB iso File

Total runtime: 3.87 seconds

Cent.png


RC4 Cipher - 789 MB iso File

Total runtime: 5.072 seconds

Xu.png


RC4 Run time comparisons: CPU vs. CUDA

Comparing Windows vs. Windows for most accurate results.

Cpuvscuda.png


Byte Cycle Profiling

Byte Cycle - 283 MB mp3 File

Total runtime: 3.467 seconds

Music2.png


Byte Cycle - 636 MB iso File

Total runtime: 8.088 seconds

Cent2.png


Byte Cycle - 789 MB iso File

Total runtime: 9.472 seconds

Xu2.png


Byte Cycle time comparisons: CPU vs. CUDA

Comparing Windows vs. Windows for most accurate results.

Cpuvscuda2.png

Conclusion

RC4 Findings

We are seeing about 540% (~5.4x) performance increase while using CUDA instead of the CPU in all 3 of the test cases.


Byte Cycle Findings

We are seeing about 320% (~3.2x) performance increase while using CUDA instead of the CPU in all 3 of the test cases.


Overall, we think that these are amazing results and a significant improvement in performance over the CPU version of the code. Both of these functions have greatly improved in run time and efficiency

Assignment 3

Due to the nature of the way this program was structured by the original developer, optimization was not really needed. The benefits were very small, but here are the optimized kernels for safe measure:

RC4 OPTIMIZED Cuda Kernel

 /**
 * Description:  RC4 Cuda Kernel
 **/
__global__ void getRC4Buffer(char * buffer, int bufferSize, int ntpb) {	
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int tid = threadIdx.x;

	__shared__ float sharedMem[_NTPB];

	sharedMem[tid] = buffer[idx];
	__syncthreads();

	if (idx < bufferSize)
		sharedMem[tid] = cycle(sharedMem[tid]);
	__syncthreads();

	buffer[idx] = sharedMem[tid];
}

Cycle OPTIMIZED Cuda Kernel

/**
* Description:  Cycle Cuda Kernel
**/
__global__ void getCycleBuffer(char * buffer, int bufferSize) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int tid = threadIdx.x;

	__shared__ float sharedMem[_NTPB];

	sharedMem[tid] = buffer[idx];
	__syncthreads();

	if (idx < bufferSize)
		sharedMem[tid] = cycle(sharedMem[tid]);
	__syncthreads();

	buffer[idx] = sharedMem[tid];
}

The device functions were not modified.

A3graph.png