Changes


WhySoSerial?

3,265 bytes added, 21:08, 11 April 2017


[[File:lookupfunc_hotspot.jpeg]]

'''Flat profile'''

Each sample counts as 0.01 seconds.

Across the overall test data there is a very prominent trend: ''Lookup()'' is always the most resource-intensive method. This confirms the Big-O assumptions made for this function.

-Profile Overview

[[File:Profilesataglance.jpg]]

S(n) = the final speedup, expressed as a multiplicative factor

For this analysis, the Nvidia GeForce GTX 1070 GPU is available for testing and has a total of 1920 CUDA cores (the absolute maximum for the available hardware).

- The figure of ~2/3 of time spent was deduced by taking a weighted average of the results. Files that had small byte sizes or fewer lines of text were not taken into consideration.

- It is interesting to note that when a block of text was combined into one line, 52% of the time was spent splitting the string, and ~47% of the time was still spent in lookup().

S(n) = 1 / ((1 - 0.6667) + 0.6667 / 1920)
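The expression above is Amdahl's law with a parallel fraction of 0.6667 and 1920 processing units. A minimal sketch of the calculation follows; the function name `amdahl_speedup` is an illustration and not part of the project code.

```cpp
#include <cstddef>

// Amdahl's law: speedup of a program whose parallelizable fraction is p
// when that fraction is spread across n processing units.
double amdahl_speedup(double p, std::size_t n) {
    return 1.0 / ((1.0 - p) + p / static_cast<double>(n));
}
```

Calling `amdahl_speedup(0.6667, 1920)` gives a value just under 3, which is close to the 1 / (1 - 0.6667) ceiling that the serial third of the program imposes regardless of core count.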

- [https://github.com/Pooch11/DPS915 GitHub Link]

=== Assignment 2 ===

== WordTranslator - Parallel Approach ==

[https://github.com/Pooch11/DPS915/tree/Parallel Parallel solution branch on GitHub]

Issues arose when attempting to change data in device memory from within a kernel.

Learning outcomes:

1. Kernels do not accept complex objects from the host (maps, vectors, strings).

2. Kernels load from and operate on sequential (contiguous) memory, accessed through device pointers.

3. Replacing the data using a character pointer (char*) proved exceedingly difficult.
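One common way to work around the first two points is to flatten host containers into contiguous buffers before copying them to the device. The sketch below is host-side only and is an assumption about how this could be done, not the project's code; `FlatWords` and `flatten` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type: one contiguous character buffer plus word offsets.
struct FlatWords {
    std::vector<char> chars;           // all words back to back, no terminators
    std::vector<std::size_t> offsets;  // offsets[i] = start of word i; last entry = chars.size()
};

// Flattens a list of strings into buffers that can be copied to the device as raw arrays.
FlatWords flatten(const std::vector<std::string>& words) {
    FlatWords flat;
    flat.offsets.reserve(words.size() + 1);
    for (const std::string& w : words) {
        flat.offsets.push_back(flat.chars.size());
        flat.chars.insert(flat.chars.end(), w.begin(), w.end());
    }
    flat.offsets.push_back(flat.chars.size());
    return flat;
}
```

Word i then occupies `chars[offsets[i]]` up to, but not including, `chars[offsets[i + 1]]`, so a kernel only needs two plain pointers and a count.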

Thus the solution had to be modified in order to accommodate these issues. A couple of options were available to overcome this.

1. Have the kernel match a pattern in the text.

2. Record which pattern was found and the position at which the match occurred. This would allow another, more sophisticated device function to make the translations. On a CPU this function would be at most O(n^2).

3. Instead, introduce a structure to manage our complex data (see below).

[[File:Struct.PNG]]

Structures can be passed into the kernel, but initialization and access are exceedingly difficult and slow.
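As an illustration of the kind of structure that can be passed by value to a kernel, the sketch below uses fixed-size character arrays instead of pointers or strings. It is not the structure shown in the image; `WordPair`, `kMaxWordLen` and `make_word_pair` are hypothetical names, and the 32-character limit is an assumption.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

constexpr std::size_t kMaxWordLen = 32;  // assumed upper bound on word length

// Hypothetical fixed-size record: no pointers and no std::string members,
// so the bytes can be copied to device memory unchanged.
struct WordPair {
    char source[kMaxWordLen];
    char target[kMaxWordLen];
    int source_len;
    int target_len;
};

static_assert(std::is_trivially_copyable<WordPair>::value,
              "WordPair must be copyable byte for byte");

// Fills out from two C strings; returns false if either word does not fit.
bool make_word_pair(const char* source, const char* target, WordPair& out) {
    const std::size_t s = std::strlen(source);
    const std::size_t t = std::strlen(target);
    if (s >= kMaxWordLen || t >= kMaxWordLen) return false;
    std::memset(&out, 0, sizeof(out));
    std::memcpy(out.source, source, s);
    std::memcpy(out.target, target, t);
    out.source_len = static_cast<int>(s);
    out.target_len = static_cast<int>(t);
    return true;
}
```

The trade-off is wasted space for short words, which is one reason initialization and access of such structures can feel slow compared with flat arrays.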

'''New CPU Code to Parallelize'''

[[File:Matching_CPU.PNG]]

The new CPU code has an approximate runtime growth rate of O(n^2).
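The actual CPU code is in the image above. For readers without access to it, here is a minimal sketch of a naive matcher with the same kind of quadratic worst case, assuming the result is a boolean array with one entry per text position; `match_positions` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// found[i] is true when pattern occurs in text starting at index i.
// Worst case cost is O(n * m) for text length n and pattern length m.
std::vector<bool> match_positions(const std::string& text, const std::string& pattern) {
    std::vector<bool> found(text.size(), false);
    if (pattern.empty() || pattern.size() > text.size()) return found;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j]) ++j;
        found[i] = (j == pattern.size());
    }
    return found;
}
```

Each iteration of the outer loop is independent of the others, which is what makes this shape of code a natural candidate for one GPU thread per starting position.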

'''GPU Kernel'''

[[File:Matching_GPU.PNG]]

The kernel is internally optimized to use shared memory for the result array. It sets these values to true or false depending on where matches are found.

Notes:

Instead of using a result array, __ballot(PREDICATE) could be considered (more research on this is needed).
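For context, CUDA's `__ballot(predicate)` returns a 32-bit mask in which bit N is set when lane N of the warp passed a non-zero predicate (CUDA 9 and later provide `__ballot_sync(mask, predicate)` for the same purpose). The host-side sketch below only emulates that packing for one warp's worth of results; `ballot_mask` and `kWarpSize` are hypothetical names, and this is not device code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // lanes per warp on current Nvidia hardware

// Packs up to 32 per-position results into one mask: bit i is set when
// found[warp_start + i] is true. Positions past the end count as false.
std::uint32_t ballot_mask(const std::vector<bool>& found, std::size_t warp_start) {
    std::uint32_t mask = 0;
    for (std::size_t lane = 0; lane < kWarpSize && warp_start + lane < found.size(); ++lane) {
        if (found[warp_start + lane]) mask |= (std::uint32_t{1} << lane);
    }
    return mask;
}
```

Storing one 32-bit word per warp instead of 32 separate flags would shrink the result array considerably, which is the potential benefit worth researching.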

MPI: more knowledge of the Message Passing Interface might be needed for a full solution to this problem. See [https://en.wikipedia.org/wiki/Message_Passing_Interface#Dynamic_process_management MPI Wikipedia]

=== Assignment 3 ===

==Optimizations==

The optimizations here were different in nature from the typical optimizations applied to kernel launches.

'''Launch Configurations'''

The launch configuration was optimized first. Initially, threads were launched based on the length of the target text. Later, it proved more useful to launch as many threads as the device could hold in a single block.

[[File:Configurations.PNG]]
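The arithmetic behind such a configuration can be sketched as follows. This is an assumption about the approach rather than the code in the image: `LaunchConfig` and `make_launch_config` are hypothetical names, and on a real device the block limit would be read from `cudaDeviceProp::maxThreadsPerBlock`.

```cpp
#include <cstddef>

// Hypothetical holder for a one-dimensional launch configuration.
struct LaunchConfig {
    unsigned blocks;
    unsigned threads_per_block;
};

// Uses full-size blocks and rounds the block count up so that every one of
// the n text positions is covered by some thread.
LaunchConfig make_launch_config(std::size_t n, unsigned max_threads_per_block) {
    LaunchConfig cfg;
    cfg.threads_per_block = max_threads_per_block;
    cfg.blocks = static_cast<unsigned>((n + max_threads_per_block - 1) / max_threads_per_block);
    return cfg;
}
```

Because the last block is usually only partly needed, a kernel launched this way has to check its global index against n before touching the text.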

'''Pattern Matching'''

The algorithm used in this approach could be revised considerably, since there is a lot of overlap between the characters checked by different threads.

'''''Knuth-Morris-Pratt'''''

[https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm Link to Wiki explanation]
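For reference, a minimal sequential sketch of Knuth-Morris-Pratt is given here. It is a textbook CPU formulation, not something the project implemented; `kmp_search` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the start index of every occurrence of pattern in text.
std::vector<std::size_t> kmp_search(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t m = pattern.size();
    if (m == 0 || m > text.size()) return hits;

    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    std::vector<std::size_t> fail(m, 0);
    for (std::size_t i = 1, k = 0; i < m; ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }

    // Scan the text once; k is the number of pattern characters matched so far.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == m) {
            hits.push_back(i + 1 - m);
            k = fail[k - 1];  // keep going to find overlapping matches
        }
    }
    return hits;
}
```

The text index never moves backwards, so no character is compared from scratch twice, which addresses the overlap problem described above at the cost of a scan that is inherently sequential within one partition.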

'''''Horspool'''''

[https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm Link to Wiki explanation]
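A minimal sequential sketch of the Horspool variant follows, again as a textbook CPU formulation rather than project code; `horspool_search` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Returns the start index of every occurrence of pattern in text.
std::vector<std::size_t> horspool_search(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    if (m == 0 || m > n) return hits;

    // shift[c] = how far the window may slide when its last character is c.
    // Characters absent from pattern[0..m-2] allow a full slide of m.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;
    }

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare the window against the pattern from right to left.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1]) --j;
        if (j == 0) hits.push_back(pos);
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return hits;
}
```

The 256-entry shift table is small and read-only once built, so in a GPU setting it would be a plausible candidate for constant or shared memory.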

'''''Shift-Or'''''

[https://en.wikipedia.org/wiki/Bitap_algorithm Link to Wiki explanation]
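A minimal sketch of exact-match Shift-Or (bitap) follows. It assumes patterns of at most 64 characters so the state fits in one machine word; `shift_or_search` is a hypothetical name and this is not project code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the start index of every occurrence of pattern in text.
// Patterns longer than 64 characters are not supported and yield no hits.
std::vector<std::size_t> shift_or_search(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t m = pattern.size();
    if (m == 0 || m > 64 || m > text.size()) return hits;

    // mask[c] has bit i cleared exactly when pattern[i] == c.
    std::array<std::uint64_t, 256> mask;
    mask.fill(~std::uint64_t{0});
    for (std::size_t i = 0; i < m; ++i) {
        mask[static_cast<unsigned char>(pattern[i])] &= ~(std::uint64_t{1} << i);
    }

    // Bit j of state is 0 when pattern[0..j] matches the text ending at the current character.
    std::uint64_t state = ~std::uint64_t{0};
    const std::uint64_t accept = std::uint64_t{1} << (m - 1);
    for (std::size_t i = 0; i < text.size(); ++i) {
        state = (state << 1) | mask[static_cast<unsigned char>(text[i])];
        if ((state & accept) == 0) hits.push_back(i + 1 - m);
    }
    return hits;
}
```

Each text character costs one shift, one OR and one table lookup with no data-dependent branching in the update, which tends to suit SIMD-style hardware.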

'''Runtime'''

[[File:Runtime_CPUvsGPU.png]]

[[File:CPUvsGPU_Timing_Matching.PNG]]

'''Streaming Kernel Launch'''

Instead of making changes to increase the efficiency of the kernel, changes were made to incorporate the spirit of the original solution: taking in multiple words from a lexicon and making changes to a large text.

Thus, streaming the launch of this kernel can be introduced to split the target data into manageable partitions. In addition, if the patterns are the same size, we can use streaming to look for more than one word concurrently.

[[File:Streaming_Kernel.PNG]]
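One detail that partitioning raises is that a match can straddle the boundary between two partitions. A minimal host-side sketch of a split that accounts for this is shown below, assuming each partition is extended by pattern length minus one characters; `Chunk` and `partition_text` are hypothetical names, and the image above may handle boundaries differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of text handed to one stream.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Splits a text of text_len characters into chunks that each own chunk_len
// starting positions. Every chunk is extended by pattern_len - 1 characters
// (clamped to the text) so a match starting near its end is still visible.
std::vector<Chunk> partition_text(std::size_t text_len, std::size_t chunk_len,
                                  std::size_t pattern_len) {
    std::vector<Chunk> chunks;
    if (chunk_len == 0 || pattern_len == 0) return chunks;
    const std::size_t overlap = pattern_len - 1;
    for (std::size_t begin = 0; begin < text_len; begin += chunk_len) {
        const std::size_t end = std::min(text_len, begin + chunk_len + overlap);
        chunks.push_back({begin, end});
    }
    return chunks;
}
```

Each starting position belongs to exactly one chunk, so no match is reported twice. In a streamed version, each chunk could be copied and searched on its own CUDA stream.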

AdamP
