=== Assignment 2 ===
 
=====Background=====
In assignment 1 we identified that the 'RCTAlgorithmBackprojection' function, located in the DLL/shared library, was taking up virtually 100% of the execution time. Our goal for this assignment was to off-load that logic to the GPU using the CUDA API, leaving everything else intact.
=====Summary=====
The work we did can be separated into three basic steps:
#Compile the base software from source (RabbitCTRunner.exe and the DLL)
#Write a CUDA kernel to replace 'RCTAlgorithmBackprojection' in the DLL (see the serial sketch after this list)
#Integrate the CUDA kernel into RabbitCTRunner
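Step 2 replaces a routine that, on the CPU, walks the L × L × L reconstruction volume with three nested loops and accumulates one distance-weighted projection sample per voxel. The sketch below is a hypothetical serial version reconstructed from the kernel that follows, not the actual RCTAlgorithmBackprojection source; p_hat_n() is a placeholder for the bilinear interpolation that the kernel inlines.

<pre>
// Hypothetical serial equivalent of the GPU kernel: the three nested loops
// become the x/y/z thread-index dimensions on the device. p_hat_n() stands
// for the bilinear sampling shown inlined in the kernel.
void BackprojectionCPU(RabbitCtGlobalData* r)
{
    unsigned int L = r->L;
    for (unsigned int depth = 0; depth < L; ++depth) {        // "k" -> z thread dimension
        double z = r->O_L + (double)depth * r->R_L;
        for (unsigned int row = 0; row < L; ++row) {           // "j" -> y thread dimension
            double y = r->O_L + (double)row * r->R_L;
            for (unsigned int col = 0; col < L; ++col) {       // "i" -> x thread dimension
                double x = r->O_L + (double)col * r->R_L;
                double w_n = r->A_n[2] * x + r->A_n[5] * y + r->A_n[8] * z + r->A_n[11];
                double u_n = (r->A_n[0] * x + r->A_n[3] * y + r->A_n[6] * z + r->A_n[9] ) / w_n;
                double v_n = (r->A_n[1] * x + r->A_n[4] * y + r->A_n[7] * z + r->A_n[10]) / w_n;
                r->f_L[L * L * depth + L * row + col] +=
                    (float)(1.0 / (w_n * w_n) * p_hat_n(r, u_n, v_n));
            }
        }
    }
}
</pre>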
 
======Kernel======
 
<pre>
__global__
void Backprojection(float* gpu_matrix_result, RabbitCtGlobalData* r) // problem_size is r->L
{
    // NOTE: r, and the A_n array it points to, must be device-accessible
    // when the kernel runs (see the notes after the Execution Config section).
    unsigned int L   = r->L;
    float        O_L = r->O_L;
    float        R_L = r->R_L;
    double*      A_n = r->A_n;
    //float* I_n = r->I_n;
    //float* f_L = r->f_L;
    // for optimization, put these ^ in shared memory.

    //s_rcgd = r;

    unsigned int col   = blockDim.x * blockIdx.x + threadIdx.x; // this is like the original "i" iterator
    unsigned int row   = blockDim.y * blockIdx.y + threadIdx.y; // this is like the original "j" iterator
    unsigned int depth = blockDim.z * blockIdx.z + threadIdx.z; // this is like the original "k" iterator

    if (row < L && col < L && depth < L)
    {
        // world coordinates of the voxel handled by this thread
        unsigned int final_index = L * L * depth + L * row + col;
        double z = O_L + (double)depth * R_L;
        double y = O_L + (double)row   * R_L;
        double x = O_L + (double)col   * R_L;

        // project the voxel into the current projection image
        double w_n = A_n[2] * x + A_n[5] * y + A_n[8] * z + A_n[11];
        double u_n = (A_n[0] * x + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n; // inline this
        double v_n = (A_n[1] * x + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n; // inline this

        // p_hat_n inlined: bilinear interpolation of the projection at (u_n, v_n)
        int i = (int)u_n;
        int j = (int)v_n;
        double alpha = u_n - i;
        double beta  = v_n - j;
        double p_hat_n_result =
              (1.0 - alpha) * (1.0 - beta) * p_n(i  , j  )
            +        alpha  * (1.0 - beta) * p_n(i+1, j  )
            + (1.0 - alpha) *        beta  * p_n(i  , j+1)
            +        alpha  *        beta  * p_n(i+1, j+1);

        // accumulate the distance-weighted sample into this thread's voxel
        // (+= reads and writes the same element, so final_index is used twice)
        gpu_matrix_result[final_index] += (float)(1.0 / (w_n * w_n) * p_hat_n_result);
    }
}
</pre>
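The kernel samples the current projection image through p_n(i, j), a macro/device helper that is not part of the excerpt. A minimal sketch of what such a lookup could look like, assuming the projection has been copied to device memory as gpu_I_n (an assumed name) with width S_x and height S_y, and returning 0 for out-of-range pixels (one common boundary choice):

<pre>
// Sketch only: one possible device-side form of the p_n(i, j) lookup used above.
// gpu_I_n, S_x and S_y are assumed names for the device copy of the current
// projection image and its dimensions.
__device__ float p_n_lookup(const float* gpu_I_n, int S_x, int S_y, int i, int j)
{
    if (i < 0 || j < 0 || i >= S_x || j >= S_y)
        return 0.0f;                 // treat reads outside the projection as 0
    return gpu_I_n[j * S_x + i];     // row-major pixel access
}
</pre>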
======Memory Allocation======
<pre>
#ifdef GPU
// Allocate the reconstruction volume on the device and seed it with the
// current contents of the host volume f_L (numel_vol = L * L * L floats).
float* gpu_matrix;
cudaMalloc((void**)&gpu_matrix, numel_vol * sizeof(float));
cudaMemcpy(gpu_matrix, rctgdata.f_L, numel_vol * sizeof(float), cudaMemcpyHostToDevice);

// 8 x 8 x 8 = 512 threads per block; 16 x 16 x 16 (4096 threads) would exceed
// the 1024-threads-per-block limit and the launch would fail.
const int threads_per_block = 8;
const int num_of_blocks = (problem_size + threads_per_block - 1) / threads_per_block;
dim3 block(threads_per_block, threads_per_block, threads_per_block);
dim3 grid(num_of_blocks, num_of_blocks, num_of_blocks);
#endif
</pre>
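Not shown above is the transfer back to the host: once every projection has been back-projected, the accumulated volume has to be copied from gpu_matrix into rctgdata.f_L (which RabbitCTRunner checks against the reference volume) and the device buffer released. A minimal sketch:

<pre>
#ifdef GPU
// After the final kernel launch: wait for completion, copy the accumulated
// volume back into the host buffer, and free the device copy.
cudaDeviceSynchronize();
cudaMemcpy(rctgdata.f_L, gpu_matrix, numel_vol * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(gpu_matrix);
#endif
</pre>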
======Execution Config======
<pre>
#ifdef GPU
Backprojection<<<grid, block>>>(gpu_matrix, &rctgdata);
#else
</pre>
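One caveat with this launch: &rctgdata is a host pointer, and the A_n and I_n arrays it refers to also live in host memory, so the kernel cannot safely dereference them as written. Below is a sketch of one way to make the data device-accessible with explicit copies (unified/managed memory would be an alternative); S_x and S_y are assumed to be the projection width and height fields of RabbitCtGlobalData.

<pre>
#ifdef GPU
// Sketch: build a device-resident copy of the per-projection data so the
// kernel can dereference it. A_n is the 3x4 projection matrix (12 doubles),
// I_n is the S_x * S_y projection image.
RabbitCtGlobalData h_copy = rctgdata;               // start from the host struct

double* d_A_n;  float* d_I_n;
cudaMalloc((void**)&d_A_n, 12 * sizeof(double));
cudaMalloc((void**)&d_I_n, rctgdata.S_x * rctgdata.S_y * sizeof(float));
cudaMemcpy(d_A_n, rctgdata.A_n, 12 * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_I_n, rctgdata.I_n, rctgdata.S_x * rctgdata.S_y * sizeof(float),
           cudaMemcpyHostToDevice);
h_copy.A_n = d_A_n;                                 // struct now points at device memory
h_copy.I_n = d_I_n;

RabbitCtGlobalData* d_rctgdata;
cudaMalloc((void**)&d_rctgdata, sizeof(RabbitCtGlobalData));
cudaMemcpy(d_rctgdata, &h_copy, sizeof(RabbitCtGlobalData), cudaMemcpyHostToDevice);

Backprojection<<<grid, block>>>(gpu_matrix, d_rctgdata);   // pass the device copy
#endif
</pre>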
 
=== Assignment 3 ===
