= DPS915 C U D A B O Y S =
== Assignment 2 ==
Inside the <code>byteCipher</code> method there is a for loop that could use optimization. Within this loop, the lines that call the <code>cycle</code> and <code>rc4_output</code> functions are the ones taking the longest time to execute:
<pre>
for (int i = 0; i < bufferSize; i++) {
    // going over every byte in the file
    // (each iteration calls cycle() or rc4_output(), depending on the cipher mode)
}
</pre>
Here is what these functions look like:
<pre>
char cycle(char value) {
    int leftMask = 170;   // 10101010 in binary
    ...
</pre>
We need to change these two functions so they are added to the CUDA device as "device functions". We also need to convert this for loop into a kernel.

==== Profiling on Linux ====

The following test runs were performed on this Virtual Machine:
* CentOS 7
* i7-3820 @ 3.6 GHz
* 2GB DDR3
* gcc version 4.8.3

Using compiler settings: <code>g++ -c -O2 -g -pg -std=c++11 encFile.cpp</code>
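With <code>-pg</code> enabled, the flat profile below was collected with '''gprof'''. A sketch of that workflow (the link step, input file, and output handling here are illustrative assumptions, not taken from the original write-up):
<pre>
g++ -O2 -g -pg -std=c++11 -o encFile encFile.o   # link with profiling enabled
./encFile song.mp3                               # hypothetical input; the run writes gmon.out
gprof ./encFile gmon.out                         # print the flat profile
</pre>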
'''RC4 Cipher - 283 MB mp3 File'''
<pre>
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  ...
 1.63      5.84      0.10        1    95.15    95.15  rc4_init(unsigned char*, unsigned int)
 0.00      5.84      0.00        1     0.00     0.00  _GLOBAL__sub_I_S
</pre>
 
 
==== Profiling on Windows ====
 
The following test runs were performed on this machine:
* Windows 10
* i7-4790k @ 4GHz
* 16GB DDR3
* Visual Studio 2013
 
 
'''RC4 Cipher - 283 MB mp3 File'''
 
[[File:winmp3.png]]
 
 
'''RC4 Cipher - 636 MB iso File'''
 
[[File:wincent.png]]
 
 
'''RC4 Cipher - 789 MB iso File'''
 
[[File:winxu.png]]
 
 
 
'''Byte Cycle - 283 MB mp3 File'''
 
[[File:winmp32.png]]
 
 
'''Byte Cycle - 636 MB iso File'''
 
[[File:wincent2.png]]
 
 
'''Byte Cycle - 789 MB iso File'''
 
[[File:winxu2.png]]
==== Description ====
''' Removing the CPU Bottleneck '''

Removing the old CPU bottleneck in the <code>byteCipher</code> function:
<pre>
for (int i = 0; i < bufferSize; i++) {
    // going over every byte in the file
}
</pre>
and replacing it with:
<pre>
...
if (mode == 0)
    getInversionBuffer<<<dGrid, dBlock>>>(d_a, bufferSize, d_output);
if (mode == 1)
    getCycleBuffer<<<dGrid, dBlock>>>(d_a, bufferSize, d_output);
...
</pre>
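For reference, <code>dGrid</code> and <code>dBlock</code> can be sized so that every byte of the buffer gets one thread. A minimal sketch, assuming a fixed threads-per-block constant <code>_NTPB</code> like the one used in the optimized kernels below:
<pre>
int nblocks = (bufferSize + _NTPB - 1) / _NTPB;  // round up so the grid covers the whole buffer
dim3 dGrid(nblocks);
dim3 dBlock(_NTPB);
</pre>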
 
''' Device Functions '''
Converting the <code>cycle</code> and <code>rc4_output</code> functions to device functions is a matter of adding the <code>__device__</code> qualifier to their definitions:
<pre>
__device__ char rc4_output() {
    ...
    return S[(S[i] + S[j]) & 0xFF];
}
</pre>
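For reference, the body elided above is presumably the standard RC4 output step (PRGA). A sketch of the whole device function under that assumption, with the RC4 state held in device-side globals:
<pre>
__device__ unsigned char S[256];   // RC4 state, filled in by rc4_init
__device__ int i = 0, j = 0;       // PRGA indices

__device__ char rc4_output() {
    i = (i + 1) & 0xFF;
    j = (j + S[i]) & 0xFF;
    unsigned char tmp = S[i];      // swap S[i] and S[j]
    S[i] = S[j];
    S[j] = tmp;
    // note: i, j and S are shared by every thread, so concurrent calls race on this state
    return S[(S[i] + S[j]) & 0xFF];
}
</pre>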
 
 
'''Creating Kernels'''
 
We created kernels for two of the four cipher methods the program handles, '''RC4''' and '''Cycle''' (the other two stayed on the CPU; read on to see why):
<pre>
/**
* Description: RC4 Cuda Kernel
**/
__global__ void getRC4Buffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < bufferSize)
        buffer[idx] = buffer[idx] ^ rc4_output();
}
</pre>
 
<pre>
/**
* Description: Cycle Cuda Kernel
**/
__global__ void getCycleBuffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < bufferSize)
        buffer[idx] = cycle(buffer[idx]);
}
</pre>
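The host side follows the usual allocate/copy/launch/copy-back pattern. A minimal sketch (variable names mirror the launch snippet above; error checking omitted):
<pre>
char* d_a = nullptr;
cudaMalloc((void**)&d_a, bufferSize);                            // allocate device buffer
cudaMemcpy(d_a, buffer, bufferSize, cudaMemcpyHostToDevice);     // copy file bytes to device

getCycleBuffer<<<dGrid, dBlock>>>(d_a, bufferSize);              // run the cipher kernel
cudaDeviceSynchronize();

cudaMemcpy(buffer, d_a, bufferSize, cudaMemcpyDeviceToHost);     // copy ciphered bytes back
cudaFree(d_a);
</pre>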
 
 
You may be asking: what about the two other cipher methods, '''byte inversion''' and '''xor cipher'''? As it turns out, these run faster on the CPU than on the GPU. We initially converted them to CUDA as well, but soon discovered the conversion was unnecessary, most likely because the work per byte is so trivial that copying the buffer to and from the device costs more than the parallel execution saves.
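For context, the CPU version of the xor cipher is essentially one XOR per byte. A sketch, assuming a single key byte <code>key</code> (the actual key handling in the program may differ):
<pre>
for (int i = 0; i < bufferSize; i++)
    buffer[i] ^= key;   // one XOR per byte; bound by memory bandwidth, not compute
</pre>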
 
Here's an example of the run times of the xor cipher on both CPU and GPU with the 789 MB file:
 
GPU: http://i.imgur.com/0PsLxzQ.png -- 6.263 seconds
 
CPU: http://i.imgur.com/ktn14q3.png -- 3.722 seconds
 
 
As we can see, the CPU runs noticeably faster than the GPU here (3.722 s vs. 6.263 s): no parallelization needed!
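These run times are wall-clock measurements around the cipher call. A minimal sketch of such a measurement (<code>std::chrono</code> and the call site shown are assumptions, not necessarily what the original program used):
<pre>
#include <chrono>

auto start = std::chrono::steady_clock::now();
byteCipher(/* mode, buffer, bufferSize, ... */);   // hypothetical call site
auto end = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(end - start).count();
</pre>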
==== Profiling the CUDA Version ====
 
The following test runs were performed on this machine:
* Windows 10
* i7-4790k @ 4.0GHz
* 16GB DDR3
* Nvidia GT 430
 
 
===== RC4 Profiling =====
'''RC4 Cipher - 283 MB mp3 File'''
 
Total runtime: 1.358 seconds
 
[[File:music.png]]
 
'''RC4 Cipher - 636 MB iso File'''

Total runtime: 3.87 seconds

[[File:cent.png]]
'''RC4 Cipher - 789 MB iso File'''
 
Total runtime: 5.072 seconds
 
[[File:xu.png]]
 
 
''' RC4 Run time comparisons: CPU vs. CUDA '''
 
Comparing the Windows CPU runs against the Windows CUDA runs for the most accurate comparison.
 
[[File:cpuvscuda.png]]
 
 
===== Byte Cycle Profiling =====
 
'''Byte Cycle - 283 MB mp3 File'''
 
Total runtime: 3.467 seconds
 
[[File:music2.png]]
 
 
'''Byte Cycle - 636 MB iso File'''
 
Total runtime: 8.088 seconds
 
[[File:cent2.png]]
 
 
'''Byte Cycle - 789 MB iso File'''
 
Total runtime: 9.472 seconds
 
[[File:xu2.png]]
 
 
''' Byte Cycle time comparisons: CPU vs. CUDA '''
 
Comparing the Windows CPU runs against the Windows CUDA runs for the most accurate comparison.
 
[[File:cpuvscuda2.png]]
 
==== Conclusion ====
 
''' RC4 Findings'''
 
We are seeing about a <span style="color: green; font-size:14px">5.4x</span> speedup when using CUDA instead of the CPU, consistently across all 3 test cases.
 
 
''' Byte Cycle Findings'''
 
We are seeing about a <span style="color: green; font-size:14px">3.2x</span> speedup when using CUDA instead of the CPU, consistently across all 3 test cases.
 
 
Overall, we think these are amazing results and a significant improvement in performance over the CPU version of the code. Both of these functions have greatly improved in run time and efficiency.
== Assignment 3 ==
 
Due to the way the original developer structured this program, there was not much optimization left to do: each byte is read once and written once, so staging data in shared memory provides essentially no reuse. The benefits were very small, but here are the optimized kernels for good measure:
 
'''RC4 OPTIMIZED Cuda Kernel'''
<pre>
/**
* Description: RC4 Cuda Kernel (shared-memory version)
**/
__global__ void getRC4Buffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // stage each block's slice of the buffer in shared memory
    __shared__ char sharedMem[_NTPB];

    if (idx < bufferSize)
        sharedMem[tid] = buffer[idx];
    __syncthreads();

    if (idx < bufferSize)
        sharedMem[tid] = sharedMem[tid] ^ rc4_output();
    __syncthreads();

    if (idx < bufferSize)
        buffer[idx] = sharedMem[tid];
}
</pre>
 
'''Cycle OPTIMIZED Cuda Kernel'''
<pre>
/**
* Description: Cycle Cuda Kernel (shared-memory version)
**/
__global__ void getCycleBuffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // stage each block's slice of the buffer in shared memory
    __shared__ char sharedMem[_NTPB];

    if (idx < bufferSize)
        sharedMem[tid] = buffer[idx];
    __syncthreads();

    if (idx < bufferSize)
        sharedMem[tid] = cycle(sharedMem[tid]);
    __syncthreads();

    if (idx < bufferSize)
        buffer[idx] = sharedMem[tid];
}
</pre>
 
The device functions were not modified.
 
[[File:a3graph.png]]