CDOT Wiki: BetaT (19:10, 12 April 2017)
==== HOW THE ALGORITHM WORKS ====
This section focuses on the algorithm inside the CALCULATE kernel only.

1. We begin with two arrays.
[[File:2Arrazs.png]]
2. The first column of the First array is initialized by the INITIALIZE kernel.
[[File:Initialize.png]]
3. The Second array copies the values from the first column of the First array.
[[File:Copy1stColumn.png]]
4. The First array copies a single value from the Second array.
[[File:2ndCall.png]]
5. The remaining values of the First array's column are calculated through the Second array.
[[File:3rdCall.png]]
6. The 2nd column of the First array is now copied into the Second array, and the cycle is repeated until finished.
[[File:LAstReset.png]]
== CPU VS GPU Loop Comparisons ==
Executing the program again with a problem size of 2000 × 2000, or 4,000,000 elements, yields the following results. Keep in mind these times are only for the kernel launches and the for loops, not the program as a whole.
 Device with compute capability 3.0 found (index 0)
 Name: Quadro K2000
 Compute Capability: 3.0
 Total Global Memory: 2147483648
 Max Threads per block: 1024
 maxGridSize: 002EF650
 maxThreadsDim: 002EF644
 Clock Rate in khz: 954000

PARALLELIZED GPU CODE
 First for loop - took - 0 milliseconds
 Press any key to continue . . .
As compared to the ORIGINAL CPU CODE:
 Initialize arrays loop - took - 17 milliseconds
===== OPTIMIZED CALCULATE WAVE KERNEL CHANGES =====
The code below has been altered to remove the (j) variable and to combine the two (if) statements into one, reducing thread divergence. The recurring ( c*dt/dx ) instruction has also been moved into a variable called total, so that each thread is not repeatedly performing the same computation, which would decrease performance.
===== OPTIMIZED INITIALIZATION KERNEL CHANGES =====
I removed the variable (j) and the syncthreads() calls, which were not needed. I also removed the CPU function that initializes all indexes in the arrays to 0, and moved that work into the GPU kernel below.
(5000 ^ 2) 28787 | 127373 | n/a | 1417
(10000 ^ 2) 124179 | 522576 | n/a | 3054
 
 
[[File:ParallelizedVSOptimized.png]]
== SECOND OPTIMIZATION ==
== THIRD OPTIMIZATION ==
=== SAVING TRAVEL COSTS BY REMOVING THE UNNECESSARY ARRAY ===
As we discovered above, the second array is not necessary while we are performing all the calculations on Shared Memory which can be seen in section 3.3.2. This provides us with the ability to further optimize our Kernel by reducing the amount of time we spend transferring data across the PCI bus. Below is an image of the data transfer times for the CALCULATE kernel.
 
 
Since both of the original Arrays are not needed in the final Kernel solution, we can save 50% of our transfer time across the PCI bus by removing one of the arrays.
[[File:MEmCpy10000.png]]
=== GETTING 100% OCCUPANCY PER MULTIPROCESSOR===
 
'''Occupancy Calculator
 
The CUDA Toolkit includes a spreadsheet that accepts as parameters the compute capability, the number of threads per block, the number of registers per thread and the shared memory per block. This spreadsheet evaluates these parameters against the resource limitations of the specified compute capability. This spreadsheet is named CUDA_Occupancy_Calculator.xls and stored under the ../tools/ sub-directory of the installed Toolkit.'''
 
Source: https://scs.senecac.on.ca/~gpu610/pages/content/resou.html
 
With the existing CALCULATE Kernel, the CUDA Occupancy Calculator provides the following statistics:
 
 
 
[[File:OriginalCalculator.png]]
 
 
 
The current CALCULATE Kernel is only utilizing 50% of the multiprocessor, as shown above. If the threads per block are switched from 32 to 512, we will achieve 100% occupancy as shown below.
 
 
 
[[File:100Calculator.png]]
 
 
=== CALCULATE KERNEL ===
 
Here is the final CALCULATE Kernel for the application.
The changes include removal of the second array.
 
 // kernel
 __global__ void Calculate(float* u, int nx, int c, float dx, float dt)
 {
     __shared__ float s[ntpb];                       // per-block tile of u
     int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
     int t = threadIdx.x;                            // index within block
     float total = c * dt / dx;                      // hoisted loop-invariant
     if (i < nx && i != 0 && t != 0)                 // single merged condition
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             s[t - 1] = u[(i - 1) * nx + it - 1];
             u[it] = s[1];
             __syncthreads();
             u[i * nx + it] = s[t] - total * (s[t] - s[t - 1]);
             __syncthreads();
         }
     }
 }
 
=== OPTIMIZATION TIME COMPARISONS ===
 
Below is a graph comparing times between Optimizations illustrating the amount of execution time saved in each iteration.
 
The times are listed in milliseconds.
 
[[File:OPTIMIZATIONCOMPARISON.png]]
 
 
= CONCLUSIONS =
 
== OVERALL TIME COMPARISONS ==
 
Below are the final comparisons of all execution times between the CPU and GPU.
 
All times are in milliseconds.
 
[[File:finalCompare.png]]
 
== APPLICATION OUTPUT ==
 
Upon completion, the application creates a file based on the output of the algorithm. The following image displays that output, comparing the original program to the parallelized program.
 
[[File:outputs.png]]
 
== FINAL THOUGHTS ==
 
Upon completion of this Project I have learned a few things:
 
First, I learned that not all programs can be parallelized, even if they seem to be good candidates to begin with.
 
Secondly, understanding the algorithm of the application is a key factor in being able to optimize the solution. Sometimes you need to rearrange the code to obtain better performance from the GPU, and understanding the algorithm helps ensure that the output of the program remains the same.

Thirdly, the management of resources and constraints, including registers, shared memory, constant memory, latency, threads, and multiprocessors, are all factors which need to be considered when using the GPU. Understanding how these resources impact and influence your program helps in deciding which ones to use in specific situations.