Changes

GPU610/SSD

2,609 bytes added, 23:10, 12 February 2013

→‎Drive_God_Lin

==[[User:Dylan Segna| Dylan]]:==

I profiled CERN's Drive_God_Lin application.

The profile below was generated using gprof.

   %    cumulative     self               self     total
  time     seconds  seconds    calls   ms/call   ms/call  name
 49.47       10.34    10.34   314400      0.03      0.03  zfunr_
 28.76       16.35     6.01      524     11.47     11.47  ordres_
 10.81       18.61     2.26   314400      0.01      0.01  cfft_
  3.30       19.30     0.69   314400      0.00      0.04  tunelasr_
  3.06       19.94     0.64     1048      0.61     13.63  spectrum_
  1.67       20.29     0.35   314400      0.00      0.00  calcr_

All other functions each took less than 1% of the application's run time, so I am not considering them when assessing which parts of the application need to be optimized; any speed increase from them would be negligible.

In total, the application ran for 20.90 seconds using 1 core.
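As a quick sanity check, the percentages in the profile can be reproduced from the self seconds and the 20.90 second total. The sketch below hard-codes the rows from the gprof output above; the helper names (<code>share_percent</code>, <code>hotspots</code>) are made up for illustration and are not part of Drive_God_Lin or gprof.

```python
# Flat-profile rows copied from the gprof output above:
# (function name, self seconds, number of calls).
ROWS = [
    ("zfunr_", 10.34, 314400),
    ("ordres_", 6.01, 524),
    ("cfft_", 2.26, 314400),
    ("tunelasr_", 0.69, 314400),
    ("spectrum_", 0.64, 1048),
    ("calcr_", 0.35, 314400),
]
TOTAL_SECONDS = 20.90  # total single-core run time reported above


def share_percent(self_seconds, total_seconds=TOTAL_SECONDS):
    """Return a function's self time as a percentage of the total run time."""
    return 100.0 * self_seconds / total_seconds


def hotspots(rows, threshold_percent=1.0):
    """Keep functions whose share of run time exceeds the threshold.

    Hypothetical helper: the 1% cut-off mirrors the one used in the text.
    """
    return [
        (name, round(share_percent(secs), 2))
        for name, secs, _calls in rows
        if share_percent(secs) > threshold_percent
    ]


if __name__ == "__main__":
    for name, pct in hotspots(ROWS):
        print(f"{name:10s} {pct:6.2f}%")
```

Raising <code>threshold_percent</code> to 10 leaves only zfunr_, ordres_ and cfft_, which are the three functions discussed below.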

The system used to generate this profile had the following specifications:

Ubuntu 12.04 LTS 32-bit

Intel® Core™2 Quad CPU Q9550 @ 2.83GHz × 4

3.9 GB Memory

NVIDIA GTX 480

The three most problematic functions are zfunr_, ordres_, and cfft_.

1. zfunr_ accounts for about 49.5% of the application's processing time and is called 314,400 times.

It is possible that this time can be significantly reduced by using a many-core GPU to run hundreds of threads simultaneously.

2. cfft_ is called as often as zfunr_, but accounts for significantly less of the run time (10.81%). It is likely

that this function can be ported to the GPU; however, the speed increase may be minimal compared to that of zfunr_.

3. The ordres_ function takes 28.76% of the application time, but is called only 524 times, at about 11.47 ms per call.

This subroutine orders harmonics and is very long. Depending on the ordering algorithm it uses,

it may be possible to optimize its serial performance before further optimizing it to use the GPU.

4. Many other functions are called thousands of times throughout the application. It may be possible to optimize some of these

for a very small speed increase, provided the cost of transferring data between the CPU and GPU does not increase run time instead.

However, it is clear that these functions are not a priority.
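To put rough bounds on these points, Amdahl's law gives the best overall speedup available when only part of the run time is accelerated. The sketch below uses the percentages from the profile above; the function names are made up for illustration, and the 10x kernel speedup is a hypothetical value, not a measurement.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def amdahl_limit(fraction):
    """Upper bound on overall speedup as the accelerated part approaches zero time."""
    return 1.0 / (1.0 - fraction)


# Fractions of run time taken from the gprof profile above.
ZFUNR = 0.4947
TOP3 = 0.4947 + 0.2876 + 0.1081  # zfunr_ + ordres_ + cfft_

# Average time per zfunr_ call, in microseconds (self seconds / calls).
ZFUNR_PER_CALL_US = 10.34 / 314400 * 1e6

if __name__ == "__main__":
    print(f"limit, zfunr_ only:        {amdahl_limit(ZFUNR):.2f}x")
    print(f"limit, top three:          {amdahl_limit(TOP3):.2f}x")
    # Hypothetical: assume the GPU version of zfunr_ runs 10x faster.
    print(f"zfunr_ at an assumed 10x:  {amdahl_speedup(ZFUNR, 10.0):.2f}x")
    print(f"zfunr_ time per call:      {ZFUNR_PER_CALL_US:.1f} us")
```

Accelerating zfunr_ alone cannot make the whole application more than about 2x faster, while the three functions together allow roughly 9x at best. Each zfunr_ call also averages only about 33 microseconds, so offloading one call at a time would likely be dominated by launch and transfer overhead; batching many calls per GPU launch is one option to consider.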

It is important to note that the application already uses OpenMP, so it can run on multiple cores. When set to use 12 threads, the program finishes in around 5 seconds.

I have forked a copy of the code into my own GitHub repository [https://github.com/dsegna/Drive_GPU here].

----

Dylan Segna
