Difference between revisions of "DPS921/ASCII"

From CDOT Wiki
Jump to: navigation, search
(Solution)
(vTune Amplifier with OpenMP (Alex))
Line 37: Line 37:
 
==vTune Amplifier with OpenMP (Alex)==
 
==vTune Amplifier with OpenMP (Alex)==
  
 +
'''vTune Amplifier Overview:'''
  
 +
We used vTune Amplifier to profile our code to check it's initial performance.
 +
 +
vTune Amplifier is a code profiler that can identify:
 +
* Where time is being spent during the code’s execution
 +
* Amount of concurrency in the code
 +
* Bottlenecks created by synchronous primitives
 +
 +
vTune Amplifier has multiple analysis it can perform but we analyzed the ASCII art program with the '''Hotspot Analysis''' and '''HPC Performance Characterization'''.
 +
 +
To preface the following analysis, this was done using a 30 second video at 1080p.
 +
 +
[[File: Vt_types.png]]
 +
 +
The initial hotspot analysis of our serial code took '''44.904 seconds''' to complete with the '''imageToTextScaledNative''' function as the top hotspot.
 +
[[File:Ascii_art_serial_hotspot.png]]
 +
 +
Looking at the thread visualization pane we can see that only one thread is used by the serial implementation of the function.
 +
[[File:Ascii_art_serial_threads.png]]
 +
 +
To fix this we added the statement:
 +
 +
  #pragma omp parallel for
 +
 +
to our code. So referring to our sudocode, the new code looked something like this:
 +
 +
#pragma omp parallel for
 +
for( j ) {
 +
  for( k ) {
 +
      int sum=0
 +
        for(y) {
 +
            for(x) {
 +
              int index          // using j, k x, y
 +
              sum  +=  input[index];
 +
            }
 +
        }
 +
        int ave            // get average using sum
 +
        for(y) {
 +
            for(x) {
 +
              int index          // using j, k x, y
 +
              int charIndex  // using  x, y
 +
              output[index] = ascii[charIndex]
 +
          }
 +
        }
 +
    }
 +
 +
This improved our overall runtime and reduced it to '''36.095 seconds'''.
 +
 +
[[File:Ascii_art_parallel_hotspot.png]]
 +
 +
Note that our overall CPU time has increased. This is a good sign as that means that we are utilizing our CPU more.
 +
 +
Looking at our thread usage we see that we are using 8 threads instead of 1 and the presence of the '''omp$parallel''' attached to our function.
 +
 +
[[File:Ascii_art_parallel_threads.png]]
 +
 +
Then we ran our code through the HPC Performance Characterization analysis to verify the efficiency of our OpenMP implementation.
 +
 +
[[File:Ascii_art_hpc_og.png]]
 +
 +
This analysis showed that we could potentially gain '''1.311''' seconds of runtime if we optimized OpenMP. This occurred because there is an imbalance in the work distribution of our threads which means one or more threads complete their task before the other threads. This means they are idle and wait at the barrier for the other threads to complete their work ultimately resulting in unused processing power.
 +
 +
To fix this we added dynamic scheduling to our code so the workload is dynamically allocated so when one thread finishes their work and there is more work to be done, the thread will pick up more work.
 +
 +
So our pragma statement was changed to:
 +
 +
  #pragma omp parallel for schedule(dynamic)
 +
 +
Running our code again through a HPC Performance Characterization analysis yielded these results:
 +
 +
[[File:Ascii_art_hpc_cur.png]]
  
 
==Intel Adviser (Dmytro)==
 
==Intel Adviser (Dmytro)==

Revision as of 16:21, 30 November 2018

NoName or Pixels2

Team Members

  1. Alex
  2. Dmytro
  3. Yuriy Kartuzov


Presentation

Google Slides presentation can be found here


ASCII Art (Yuriy)

Introduction

The idea is take an image and turn it into a pictorial representation using ascii character. We use PNG as input and TXT file as output. The idea with TXT format is that they can be pasted into editors, their font can be modified, text colour and background changed for a customized look.

R2d2.jpg

We decided to take it a step further and output a PNG file as well. This loses some of the functionality mentioned above, however now we are able to process videos since we can take a frame and run our algorithm through it. Having a video will also highlight how efficient our algorithm can run and whether we can keep up with live processing.


OUTPUT Samples

  1. Live video stream (MP4)
  2. Video file processing (MKV)
  3. Reverse sampling with white colour font and black colour background (PNG)
  4. R2D2 above (TXT)

Note: for best video quality download the files for view since Google compresses them further for playback in browsers. For a text file you'll need to decrease font such that no lines wrap unto to the next line and don't forget to use monospace font.

Working with image and video files

Opencv.jpg is an open source computer vision library which can do a lot of cool stuff. As beginners we use it for:

  • Read and write image files
  • Read and write video files
  • Read from video stream like camera
  • Display video in native OS window


vTune Amplifier with OpenMP (Alex)

vTune Amplifier Overview:

We used vTune Amplifier to profile our code to check it's initial performance.

vTune Amplifier is a code profiler that can identify:

  • Where time is being spent during the code’s execution
  • Amount of concurrency in the code
  • Bottlenecks created by synchronous primitives

vTune Amplifier has multiple analysis it can perform but we analyzed the ASCII art program with the Hotspot Analysis and HPC Performance Characterization.

To preface the following analysis, this was done using a 30 second video at 1080p.

Vt types.png

The initial hotspot analysis of our serial code took 44.904 seconds to complete with the imageToTextScaledNative function as the top hotspot. Ascii art serial hotspot.png

Looking at the thread visualization pane we can see that only one thread is used by the serial implementation of the function. Ascii art serial threads.png

To fix this we added the statement:

  #pragma omp parallel for

to our code. So referring to our sudocode, the new code looked something like this:

  1. pragma omp parallel for

for( j ) {

  for( k ) {
     int sum=0
        for(y) {
           for(x) {
              int index           // using j, k x, y
              sum  +=  input[index];
           }
        }
        int ave            // get average using sum
        for(y) {
           for(x) {
              int index          // using j, k x, y
              int charIndex   // using  x, y
              output[index] = ascii[charIndex]
          }
       }
   }

This improved our overall runtime and reduced it to 36.095 seconds.

Ascii art parallel hotspot.png

Note that our overall CPU time has increased. This is a good sign as that means that we are utilizing our CPU more.

Looking at our thread usage we see that we are using 8 threads instead of 1 and the presence of the omp$parallel attached to our function.

Ascii art parallel threads.png

Then we ran our code through the HPC Performance Characterization analysis to verify the efficiency of our OpenMP implementation.

Ascii art hpc og.png

This analysis showed that we could potentially gain 1.311 seconds of runtime if we optimized OpenMP. This occurred because there is an imbalance in the work distribution of our threads which means one or more threads complete their task before the other threads. This means they are idle and wait at the barrier for the other threads to complete their work ultimately resulting in unused processing power.

To fix this we added dynamic scheduling to our code so the workload is dynamically allocated so when one thread finishes their work and there is more work to be done, the thread will pick up more work.

So our pragma statement was changed to:

  #pragma omp parallel for schedule(dynamic)

Running our code again through a HPC Performance Characterization analysis yielded these results:

Ascii art hpc cur.png

Intel Adviser (Dmytro)

What can the Advisor Do:

1. Vectorization Optimization

  • Use the cache-aware Roofline Analysis to identify high-impact, under-optimized loops and get tips for faster code.
  • Quickly find what's blocking vectorization where it matters most to make the best use of your machine's Single Instruction Multiple Data (SIMD) capabilities.
  • Identify where it is safe to force compiler vectorization.
  • Use memory analysis to find inefficient memory usage.

2. Thread Prototyping

  • Use Threading Advisor to fast-track threading design.
  • Its simple workflow lets you quickly model threading designs while delivering the data and tips you need to make faster design and optimization decisions.

(More could be found https://software.intel.com/en-us/advisor)


Our Hotspot analysis from vTune(see previous section for details on vTune) identified that there are two bottlenecks in our code:

  • imageToScaleeNaive - the main conversion function that creates image drawn with characters from the original buffer.
  • calcLum - averages out the RGB values of a single pixel to convert it to grey scale.

See the details below:

Ascii vtune runtime.png

As you can see, calcLum takes the second highest time. However, it does not have a single loop, and really only has a single line of code.

int calcLum(const unsigned char *pixels)
{
   return (0.2126 * (int)pixels[2]) + (0.7152 * (int)pixels[1]) + (0.0722 * (int)pixels[0]);
}

My initial thoughts were to take the code form the functions and just put it inside the loop, after all it is only called in a single place in the code and by putting it inside the main function, I would save on memory stack allocation.

Ascii variable sizes.JPG

In the snippet above, I moved hard-coded variables into constants that are declared before the loop starts as well as assigned each RGB value to an int variable. The last bit is needed because we need to get ASCII codes of each character to get its intensity value.


In doing so, I got the following results:

Ascii data type missmatch refactoring.JPG

The above indicates that now we have 3 datatype miss-matches that take additional time on each loop to convert from char to int and then from int to float for the final calculation of the sum.

Solution

Unfortunately, sometimes there is no way around the datatype conversion, as it is the only way to retrieve the information. So all we can do now is minimize the impact.

Ascii solution.JPG

The final solution above involves making sure that the data types stay consistent across the calculation by converting chars straight to floats and using float constants, thus only making a single type cast for every bracket pair.

The final solution achieved a speedup of around 200ms on a 10s video clip.

Summary

...todo