CDOT Wiki: BetaT (19:10, 12 April 2017)
==== HOW THE ALGORITHM WORKS ====
This section focuses on the algorithm inside the CALCULATE kernel only.

1. We begin with two arrays.
[[File:2Arrazs.png]]
2. The first column of the First array is initialized by the INITIALIZE kernel.
[[File:Initialize.png]]
3. The Second array copies the values from the first column of the First array.
[[File:Copy1stColumn.png]]
4. The First array copies a single value from the Second array.
[[File:2ndCall.png]]
5. The remaining values of the First array's column are calculated through the Second array.
[[File:3rdCall.png]]
6. The 2nd column of the First array is now copied into the Second array, and the cycle is repeated until finished.
[[File:LAstReset.png]]
== CPU VS GPU Loop Comparisons ==
Executing the program again with a problem size of 2000 × 2000, or 4,000,000 elements, yields the following results. Keep in mind these times are only for the kernel launches and the for loops, not the program as a whole.
 Device with compute capability 3.0 found (index 0)
 Name: Quadro K2000
 Compute Capability: 3.0
 Total Global Memory: 2147483648
 Max Threads per block: 1024
 maxGridSize: 002EF650
 maxThreadsDim: 002EF644
 Clock Rate in khz: 954000

PARALLELIZED GPU CODE
 First for loop - took - 0 milliseconds
 Press any key to continue . . .
As compared to the ORIGINAL CPU CODE:
 Initialize arrays loop - took - 17 milliseconds
===== OPTIMIZED CALCULATE WAVE KERNEL CHANGES =====
The code below has been altered to remove the (j) variable and to combine the two (if) statements into one, reducing thread divergence. The recurring ( c*dt/dx ) instruction has also been moved into a variable called total, so that each thread is not repeatedly performing the same computation, which would decrease performance.
===== OPTIMIZED INITIALIZATION KERNEL CHANGES =====
I removed the variable (j) and the syncthreads() calls, which were not needed. I also removed the CPU function that initializes all indexes in the arrays to 0, and moved that work into the GPU kernel below.
(5000 ^ 2) 28787 | 127373 | n/a | 1417
(10000 ^ 2) 124179 | 522576 | n/a | 3054
 
 
[[File:ParallelizedVSOptimized.png]]
== SECOND OPTIMIZATION ==
== THIRD OPTIMIZATION ==
=== SAVING TRAVEL COSTS BY REMOVING THE UNNECESSARY ARRAY ===
As we discovered above, the second array is not necessary while we are performing all the calculations on Shared Memory which can be seen in section 3.3.2. This provides us with the ability to further optimize our Kernel by reducing the amount of time we spend transferring data across the PCI bus. Below is an image of the data transfer times for the CALCULATE kernel.
 
 
Since both of the original Arrays are not needed in the final Kernel solution, we can save 50% of our transfer time across the PCI bus by removing one of the arrays.
[[File:MEmCpy10000.png]]
=== GETTING 100% OCCUPANCY PER MULTIPROCESSOR===
 
'''Occupancy Calculator
 
The CUDA Toolkit includes a spreadsheet that accepts as parameters the compute capability, the number of threads per block, the number of registers per thread and the shared memory per block. This spreadsheet evaluates these parameters against the resource limitations of the specified compute capability. This spreadsheet is named CUDA_Occupancy_Calculator.xls and stored under the ../tools/ sub-directory of the installed Toolkit.'''
 
Source: https://scs.senecac.on.ca/~gpu610/pages/content/resou.html
 
With the existing CALCULATE Kernel, the CUDA Occupancy Calculator provides the following statistics:
 
 
 
[[File:OriginalCalculator.png]]
 
 
 
The current CALCULATE Kernel is only utilizing 50% of the multiprocessor, as shown above. If the threads per block are switched from 32 to 512, we will achieve 100% occupancy as shown below.
 
 
 
[[File:100Calculator.png]]
 
 
=== CALCULATE KERNEL ===
 
Here is the final CALCULATE Kernel for the application.
The changes include removal of the second array.
 
 // kernel
 __global__ void Calculate(float* u, int nx, int c, float dx, float dt)
 {
     __shared__ float s[ntpb];                       // per-block tile of u
     int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
     int t = threadIdx.x;                            // index within block
     float total = c * dt / dx;                      // hoisted loop-invariant
     if (i < nx && i != 0 && t != 0)                 // single merged condition
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             s[t - 1] = u[(i - 1) * nx + it - 1];
             u[it] = s[1];
             __syncthreads();
             u[i * nx + it] = s[t] - total * (s[t] - s[t - 1]);
             __syncthreads();
         }
     }
 }
 
=== OPTIMIZATION TIME COMPARISONS ===
 
Below is a graph comparing times between Optimizations illustrating the amount of execution time saved in each iteration.
 
The times are listed in milliseconds.
 
[[File:OPTIMIZATIONCOMPARISON.png]]
 
 
= CONCLUSIONS =
 
== OVERALL TIME COMPARISONS ==
 
Below are the final comparisons of all execution times between the CPU and GPU.
 
All times are in milliseconds.
 
[[File:finalCompare.png]]
 
== APPLICATION OUTPUT ==
 
Upon completion, the application creates a file based on the output of the algorithm. The following image displays that output, comparing the original program to the parallelized program.
 
[[File:outputs.png]]
 
== FINAL THOUGHTS ==
 
Upon completion of this Project I have learned a few things:
 
First, I learned that not all programs can be parallelized, even if they seem to be good candidates to begin with.
 
Secondly, understanding the algorithm of the application is a key factor in being able to optimize the solution. Sometimes you need to rearrange the code to obtain better performance from the GPU, and understanding the algorithm helps ensure that the output of the program remains the same.

Thirdly, the management of resources and constraints, including registers, shared memory, constant memory, latency, threads, and multiprocessors, are all factors which need to be considered when using the GPU. Understanding how these resources impact and influence your program helps in deciding which ones to use in specific situations.