= Team Sonic =
[[Image:sonicteam.png|right|frameless|727px]]
== Members ==
# [mailto:pvaaheeswaran@senecacollege.ca?subject=gpu610 Prasanth Vaaheeswaran]
== About ==
== Required Files ==
#[http://www5.cs.fau.de/research/projects/rabbitct/download/ RabbitCT]
#[http://www5.cs.fau.de/fileadmin/Forschung/Software/RabbitCT/download/rabbitct_512-v2.rctd Dataset (Included problem sizes: 128, 256, 512)]
== Progress ==
=== Assignment 1 ===
I'm sure running this on our GTX 480 can yield better results (hopefully).
=== Assignment 2 ===
 
=====Background=====
In Assignment 1 we identified that the 'RCTAlgorithmBackprojection' function, located in the DLL/shared library, was taking up virtually 100% of the execution time. Our goal for this assignment was to offload that logic to the GPU using the CUDA API, leaving everything else intact.
=====Process=====
The work we did can be separated into three basic steps:
#Compile the base software from source (RabbitCTRunner.exe and the DLL)
#Write a CUDA kernel to replace 'RCTAlgorithmBackprojection' in the DLL
#Integrate the CUDA kernel into RabbitCTRunner
Using preprocessor directives, RabbitCTRunner was fitted with a token named GPU. When GPU is defined, our kernel and the necessary supporting code are compiled and used by RabbitCTRunner; when it is not, the original code path is built. This kept the footprint of our changes small and lets us rebuild the original RabbitCTRunner simply by leaving the flag undefined.
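For illustration, here is a minimal sketch of that guard pattern (the launch_gpu_backprojection helper and the name used for the original DLL call are placeholders, not the actual RabbitCTRunner code):
<pre>
#ifdef GPU
    // GPU build: set up device memory and launch our CUDA kernel
    launch_gpu_backprojection(&rctgdata);       // hypothetical helper
#else
    // Original build: invoke the backprojection routine loaded from the DLL
    pfnRCTAlgorithmBackprojection(&rctgdata);   // placeholder for the original call
#endif
</pre>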
 
======Kernel======
 
<pre>
__global__
void Backprojection(float* gpu_matrix_result, RabbitCtGlobalData* r) // problem_size is r->L
{
    unsigned int L   = r->L;
    float        O_L = r->O_L;
    float        R_L = r->R_L;
    double*      A_n = r->A_n;
    //float* I_n = r->I_n;
    //float* f_L = r->f_L;
    // for optimization, put these ^ in shared memory.

    //s_rcgd = r;

    unsigned int col   = blockDim.x * blockIdx.x + threadIdx.x; // this is like the original "i" iterator
    unsigned int row   = blockDim.y * blockIdx.y + threadIdx.y; // this is like the original "j" iterator
    unsigned int depth = blockDim.z * blockIdx.z + threadIdx.z; // this is like the original "k" iterator

    if (row < L && col < L && depth < L)
    {
        unsigned int final_index = L * L * depth + L * row + col;
        double z = O_L + (double)depth * R_L;
        double y = O_L + (double)row   * R_L;
        double x = O_L + (double)col   * R_L;

        double w_n = A_n[2] * x + A_n[5] * y + A_n[8] * z + A_n[11];
        double u_n = (A_n[0] * x + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n; // inline this
        double v_n = (A_n[1] * x + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n; // inline this

        // p_hat_n inlined:
        int i = (int)u_n;
        int j = (int)v_n;
        double alpha = u_n - (int)u_n;
        double beta  = v_n - (int)v_n;
        double p_hat_n_result =
              (1.0 - alpha) * (1.0 - beta) * p_n(i  , j  )
            +        alpha  * (1.0 - beta) * p_n(i+1, j  )
            + (1.0 - alpha) *        beta  * p_n(i  , j+1)
            +        alpha  *        beta  * p_n(i+1, j+1);

        // this call calculates the index value twice because of +=
        gpu_matrix_result[final_index] += (float)(1.0 / (w_n * w_n) * p_hat_n_result);
    }
}
</pre>
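The p_n(i, j) lookup above comes from the original LolaBunny backprojection code and is not defined in this excerpt; on the device it would need to read the current projection image directly. A hypothetical device-side equivalent, assuming the projection image I_n is S_x pixels wide and ignoring the boundary handling of the reference implementation:
<pre>
// Hypothetical stand-in for the p_n lookup; I_n and S_x would have to be
// passed to (or otherwise made visible to) the kernel.
__device__ float p_n_device(const float* I_n, unsigned int S_x, int i, int j)
{
    return I_n[(unsigned int)j * S_x + (unsigned int)i];   // projection pixel (i, j)
}
</pre>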
======Memory Allocation======
<pre>
#ifdef GPU
float* gpu_matrix;
cudaMalloc((void**)&gpu_matrix, numel_vol * sizeof(float));
cudaMemcpy(gpu_matrix, rctgdata.f_L, numel_vol * sizeof(float), cudaMemcpyHostToDevice);
// 8 x 8 x 8 = 512 threads per block, which stays under CUDA's 1024-threads-per-block limit
const int threads_per_block = 8;
const int num_of_blocks = (problem_size + threads_per_block - 1) / threads_per_block;
dim3 block(threads_per_block, threads_per_block, threads_per_block);
dim3 grid(num_of_blocks, num_of_blocks, num_of_blocks);
#endif
</pre>
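Not shown in this excerpt: once every projection has been back-projected, the reconstructed volume has to be copied back into the host buffer and the device memory released. A rough sketch of that step, assuming it sits at the end of the reconstruction in RabbitCTRunner:
<pre>
#ifdef GPU
// Copy the reconstructed volume back to the host result buffer and clean up.
cudaMemcpy(rctgdata.f_L, gpu_matrix, numel_vol * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(gpu_matrix);
#endif
</pre>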
======Execution Config======
<pre>
#ifdef GPU
Backprojection<<<grid, block>>>(gpu_matrix, &rctgdata);
#else
</pre>
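Kernel launches are asynchronous, so it is worth checking for launch errors and synchronizing before the host reads results or stops a timer. A minimal sketch of such a check (not part of our current code):
<pre>
// Immediately after the kernel launch:
cudaError_t err = cudaGetLastError();   // reports invalid launch configurations, etc.
if (err != cudaSuccess)
    printf("Backprojection launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();                // wait for the kernel to finish before reading results
</pre>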
 
=====Metrics=====
Original RabbitCTRunner with volume size 128:
<pre>
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>Rabbit.exe LolaBunny.dll rabbitct_512-v2.rctd original 128
RabbitCT runner http://www.rabbitct.com/
Info: using 4 buffer subsets with 240 projections each.
Running ... this may take some time.
(\_/)
(='.'=)
(")_(")
 
--------------------------------------------------------------
Quality of reconstructed volume:
Root Mean Squared Error: 38914.3 HU
Mean Squared Error: 1.51433e+009 HU^2
Max. Absolute Error: 65535 HU
PSNR: -19.5571 dB
 
--------------------------------------------------------------
Runtime statistics:
Total: 339.759 s
Average: 353.915 ms
 
FULL RUNTIME: 340.507 secs
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>
 
</pre>
 
CudaRabbit with volume size 128:
<pre>
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>CudaRabbit.exe LolaBunny.dll rabbitct_512-v2.rctd result 128/
RabbitCT runner http://www.rabbitct.com/
Info: using 4 buffer subsets with 240 projections each.
Running ... this may take some time.
(\_/)
(='.'=)
(")_(")
 
--------------------------------------------------------------
Quality of reconstructed volume:
Root Mean Squared Error: 38914.3 HU
Mean Squared Error: 1.51433e+009 HU^2
Max. Absolute Error: 65535 HU
PSNR: -19.5571 dB
 
--------------------------------------------------------------
Runtime statistics:
Total: 0.004881 s
Average: 0.00508437 ms
 
FULL RUNTIME: 0.512 secs
 
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>
</pre>
 
=====Graph=====
This graph is for the 128 volume size only.
[[Image:sonicgraph1.png|left|frame]]
=====Summary=====
As you can see, using CUDA improved performance by roughly 99.85%: the full runtime went from over 5 minutes to under a second. However, this code still needs to be optimized and run on larger volume sizes; that will be our next goal.
=== Assignment 3 ===
