=OpenMP=
OpenMP is a parallel programming API for C, C++, and Fortran, supported by most major compilers (including Intel's). It enables flexible implementation of parallel algorithms, letting a program make use of however many cores the host machine has.
In this project, the <nowiki>#pragma omp parallel for</nowiki> statement was used in several locations where for loops had no external dependencies. Where there were dependencies, the math was reworked so that the loops no longer carried them. Their usage will be discussed further down.
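As a minimal, hypothetical illustration (the project's actual loops appear in the figures below, and <code>scale_all</code> is an invented stand-in, not part of the simulation), parallelizing an independent for loop takes exactly one line:
<source lang="cpp">
#include <vector>

// Hypothetical example: each iteration writes only data[i], so there
// are no external dependencies and iterations may run in any order.
void scale_all(std::vector<double>& data, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(data.size()); i++)
        data[i] *= factor;
}
</source>
OpenMP splits the iteration range across the available threads; the programmer's only obligation is that the iterations really are independent.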
=Program=
==Without OpenMP==
[[File:FFT.png|thumb|fig 3.1 Fourier transformation code block.]]
[[File:NonParallel.png|1000px|center|Non-Parallel Process]]
==With OpenMP and Optimization==
[[File:PFFT.png|thumb|fig 4.1 Fourier transformation code block, with OpenMP parallelization.]]
[[File:PFT.png|thumb|fig 4.2 Fourier transformations called in OpenMP.]]
[[File:Parallel.png|1000px|center|Parallel Process]]
 
=Comparison and Analysis=
Below is the analysis provided by the VTune Amplifier, run through Visual Studio. It shows the total elapsed time as well as the overhead time of the program:
[[File:initial_analysis.png|750px|none|Initial performance]]
[[File:initial_histo_analysis.png|750px|none|First histogram analysis]]
[[File:initial_analysis_parallel_vs_serial.png|750px|none|Initial comparison between parallel and serial implementations]]
What these graphs and figures are telling us is that we've efficiently parallelized the lowest of the low hanging fruit. The parallel regions themselves are reasonably efficient at what they do, but as it turns out, they contributed very, very little to the computational load in the first place. A lot of CPU time is spent at synchronization barriers, and there is a major hotspot which can still be parallelized.
We parallelized the rendering and state-evolution code above. We did not, however, parallelize the FFT code, because we initially thought there was a loop-carried dependency that could not be resolved:
[[File:FFT_crop.png|500px|none|Hotspot]]
The issue is the <code>T *= phiT;</code> line. Every time the loop counter <code>l</code> increases, <code>T</code> gets multiplied by <code>phiT</code>. This seems painfully obvious, but it is a loop-carried dependency: it prevents us from parallelizing the code, since the iterations can't be done in an arbitrary order. What it does mean, however, is that we can remove that line and replace any usage of <code>T</code> with <code>pow(phiT, l)</code>. We can then parallelize the loop, since the iterations are now order-invariant. When we do that, the FPS does not change at all. In fact, if we remove the parallel for construct, the FPS drops to a meager 1 frame per second. This is awful, and likely because the <code>pow()</code> operation is very computationally expensive. Are we stuck? Of course not. We can apply math. De Moivre's formula lets us turn exponentiation of a complex number into multiplication of its argument: since <code>phiT</code> has norm 1, we can write it as <code>phiT = cos(arg) + i * sin(arg)</code>, and then <code>phiT ** l = cos(l * arg) + i * sin(l * arg)</code>. This gave a tremendous speedup, since the trigonometric functions are far less costly than exponentiation. The code and new VTune analyses are below.
[[File:PFFT_crop_no_dynamic.png|500px|none|No dynamic]]
[[File:after_fft_analysis.png|750px|none|After FFT analysis]]
[[File:after_fft_hist.png|750px|none|After FFT histogram]]
So we've better parallelized the program by attacking a computationally expensive loop which runs many times, and achieved a massive speedup. That said, a lot of CPU time is still spent idle, especially in the region we have just parallelized, and the FPS is not very stable. This is because each loop iteration does a different amount of work: when the loop counter <code>l</code> is small, the inner loop runs many times; when it is large, it does not. This is most easily fixed by adding dynamic scheduling.
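The shape of the fix, as a sketch rather than the project's exact code (which lives in the screenshots): <code>fft_stage</code>, <code>x</code>, <code>k</code>, and <code>arg</code> are assumed stand-in names, and the butterfly body is only representative. The point is that the twiddle factor <code>T</code> is now computed from <code>l</code> alone, so the iterations are independent and OpenMP can hand them out dynamically:
<source lang="cpp">
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<double>;

// One FFT-like pass over the buffer, assuming x.size() is a multiple
// of 2 * k. Serial original, for reference:
//
//   Complex T = 1.0, phiT(cos(arg), sin(arg));
//   for (int l = 0; l < k; l++) {
//       // ...butterfly using T...
//       T *= phiT;   // loop-carried dependency: iteration l needs l-1
//   }
void fft_stage(std::vector<Complex>& x, int k, double arg)
{
    const int N = static_cast<int>(x.size());
    const int n = 2 * k;

    // De Moivre's formula: phiT^l = cos(l*arg) + i*sin(l*arg).
    // T now depends only on l, so iterations are order-invariant,
    // and schedule(dynamic) load-balances their uneven work.
    #pragma omp parallel for schedule(dynamic)
    for (int l = 0; l < k; l++) {
        const Complex T(std::cos(l * arg), std::sin(l * arg));
        for (int a = l; a < N; a += n) {
            const Complex t = x[a] - x[a + k];
            x[a] += x[a + k];
            x[a + k] = t * T;  // representative butterfly update
        }
    }
}
</source>
With <code>schedule(dynamic)</code>, threads pull new iterations as they finish old ones instead of each being statically assigned a fixed block up front, which soaks up the uneven per-iteration work described above.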
[[File:PFFT_crop.png|500px|none|Dynamic]]
[[File:after_dynamic_analysis.png|750px|none|Dynamic Performance]]
[[File:after_dynamic_histo.png|750px|none|Dynamic Histogram]]
As the analyses show, that loop now makes more efficient use of the CPU. It also gave a slightly higher, stabilized FPS.
=Conclusion=
After countless hours spent programming this simulation, incorporating OpenMP into the finished program did not prove very difficult. The most challenging part was identifying the performance bottleneck, where the CPU idle time was occurring. Once the Fourier transform function was identified as the cause, the loop-carried dependency was factored out and an OpenMP parallel for construct was added. After observing that a considerable amount of CPU time was still idle, the loop was changed to use dynamic scheduling, which tells OpenMP to load-balance iterations across threads at run time rather than statically assigning each thread a fixed share of iterations that may carry little actual work. With dynamic scheduling in place, the program settled into a consistently efficient run time.