Difference between revisions of "SPO600 Algorithm Selection Lab"

From CDOT Wiki
(Analyzing Results)
(Three Approaches)
(36 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[Category:SPO600 Labs]]{{Admon/lab|Purpose of this Lab|In this lab, you will select one of two algorithms for adjusting the volume of PCM audio samples based on benchmarking of two possible approaches.}}
+
[[Category:SPO600 Labs]]{{Admon/lab|Purpose of this Lab|In this lab, you will investigate the impact of different algorithms which produce the same effect. You will test and select one of three algorithms for adjusting the volume of PCM audio samples based on benchmarking.}}
  
== Lab 5 ==
+
== Lab 6 ==
  
1. Write two different approaches to adjusting the volume of a sequence of sound samples, using different algorithms. In each case, you should take a series of signed 16-bit integers representing sound waveform samples and multiply each by a floating point "volume scaling factor" in the range 0.000-1.000. It is recommended that one approach be the naive multiplication of the sample by the volume scaling factor, and the second approach be dramatically different (e.g., table lookup, multiplication by bit-shifting, memoization, or another approach).  
+
=== Background ===
 +
* Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
 +
* To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
 +
* On a mobile device, the amount of processing required to scale sound will affect battery life.
  
2. Test which approach is faster. Control the variables and use a large run of data (at least hundreds of millions of samples). Use both [[SPO600 Servers|x86 and AArch64]] systems for testing - DO NOT compare results between the architectures (because they are different classes of systems) but DO compare the relative performance of the algorithms on each architecture. For example, you might note that "Algorithm I is NN% faster than Algorithm II on Architecture A, but NN% slower on Architecture B".
+
=== Three Approaches ===
  
3. Blog about your results. Important! -- explain what you're doing so that a reader coming across your blog post understands the context (in other words, don't just jump into a discussion of optimization results -- give your post some context).
+
Three approaches to this problem are provided:
 +
 
 +
# The basic or Naive algorithm (<code>vol1.c</code>). This approach multiplies each sound sample by 0.75, casting from signed 16-bit integer to floating point and back again. Casts between integer and floating point can be [[Expensive|expensive]] operations.
 +
# A lookup-based algorithm (<code>vol2.c</code>). This approach uses a pre-calculated table of all 65536 possible results, and looks up each sample in that table instead of multiplying.
 +
# A fixed-point algorithm (<code>vol3.c</code>). This approach uses fixed-point math and bit shifting to perform the multiplication without using floating-point math.
 +
 
 +
=== Don't Compare Across Machines ===
 +
 
 +
In this lab, ''do not'' compare the relative performance across different machines, because the systems provided have a wide range of processor implementations, from server-class to mobile-class. However, ''do'' compare the relative performance of the various algorithms on the ''same'' machine.
 +
 
 +
=== Benchmarking ===
 +
 
 +
Get the files for this lab from one of the [[SPO600 Servers]] -- but you can perform the lab wherever you want (feel free to use your laptop or home system). Test on both an x86_64 and an AArch64 system.
 +
 
 +
Review the contents of this archive:
 +
* <code>vol.h</code> controls the number of samples to be processed
 +
* <code>vol1.c</code>, <code>vol2.c</code>, and <code>vol3.c</code> implement the various algorithms
 +
* The <code>Makefile</code> can be used to build the programs
 +
 
 +
Perform these steps:
 +
# Unpack the archive <code>/public/spo600-algorithm-selection-lab.tgz</code>
 +
# Study each of the source code files and make sure that you understand what the code is doing.
 +
# '''Make a prediction''' of the relative performance of each scaling algorithm.
 +
# Build and test each of the programs.
 +
#* Do all of the algorithms produce the same output?
 +
#** How can you verify this?
 +
#** If there is a difference, is it significant enough to matter?
 +
#* Change the number of samples so that each program takes a reasonable amount of time to execute (suggested minimum 20 seconds, 1 minute or more is better).
 +
# Test the performance of each program.
 +
#* Find a way to measure performance ''without'' the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure ''only'' the time taken to scale the samples. '''This is the hard part!'''
 +
#* How much time is spent scaling the sound samples?
 +
#* Do multiple runs take the same time? How much variation do you observe? What is the likely cause of this variation?
 +
#* Is there any difference in the results produced by the various algorithms?
 +
#* Does the difference between the algorithms vary depending on the architecture and implementation on which you test?
 +
#* What is the relative memory usage of each program?
 +
# Was your prediction accurate?
 +
 
 +
=== Deliverables ===
 +
 
 +
Blog about your experiments with a detailed analysis of your results, including memory usage, performance, accuracy, and trade-offs.
 +
 
 +
Make sure you convincingly prove your results to your reader! Also be sure to explain what you're doing so that a reader coming across your blog post understands the context (in other words, don't just jump into a discussion of optimization results -- give your post some context).
 +
 
 +
'''Optional - Recommended:''' Compare results across several '''implementations''' of AArch64 and x86_64 systems. Note that on different CPU implementations, the relative performance of different algorithms will vary; for example, table lookup may outperform other algorithms on a system with a fast memory system (cache), but not on a system with a slower memory system.
 +
* For AArch64, you could compare the performance on AArchie against the various class servers, or between the class servers and a Raspberry Pi 3 (in 64-bit mode) or an ARM Chromebook.
 +
* For x86_64, you could compare the performance of different processors, such as xerxes, your own laptop or desktop, and Seneca systems such as Matrix or lab desktops.
  
 
=== Things to consider ===
 
  
==== Design of Your Test ====
+
==== Design of Your Tests ====
 
+
* Most solutions for a problem of this type involve generating a large amount of data in an array, processing that array using the function being evaluated, and then storing that data back into an array. The test setup can take more time than the actual test! Make sure that you measure the time taken in the code under test only -- you need to be able to remove the rest of the processing time from your evaluation.
* Most solutions for a problem of this type involve generating a large amount of data in an array, processing that array using the function being evaluated, and then storing that data back into an array. Make sure that you measure the time taken in the test function only -- you need to be able to remove the rest of the processing time from your evaluation.
 
 
* You may need to run a very large amount of sample data through the function to be able to detect its performance.
 
 
* If you do not use the output from your calculation (e.g., by doing something with the output array), the compiler may recognize that and remove the code you're trying to test. Be sure to process the results in some way so that the optimizer preserves the code you want to test. It is a good idea to calculate some sort of verification value to ensure that all of the approaches generate the same results.
* You can test using actual sound data (see the tips section, below) or using generated data. If you're generating data, it is best to use a pseudo-random number generator which is seeded with the same value every time, so that each run processes the same data.
+
* Be aware of what other tasks the system is handling during your test run, including software running on behalf of other users.
* Be aware of what other tasks the system is handling during your test run.
 
  
==== Analyzing Results ====
+
=== Tips ===
* What is the impact of various optimization levels on the software performance?
+
{{Admon/tip|Analysis|Do a thorough analysis of the results. Be certain (and prove!) that your performance measurement ''does not'' include the generation or summarization of the test data. Do multiple runs and discard the outliers. Decide whether to use mean, minimum, or maximum time values from the multiple runs, and explain why you made that decision. Control your variables well. Show relative performance as percentage change, e.g., "this approach was NN% faster than that approach".}}
* Does the distribution of data matter?
 
* If samples are fed at CD rate (44100 samples per second x 2 channels), can both algorithms keep up?
 
* What is the memory footprint of each approach?
 
* What is the performance of each approach?
 
* What is the energy consumption of each approach? (What information do you need to calculate this?)
 
* Xerxes and Betty have different performance profiles, so it's not reasonable to compare performance between the machines, but it is reasonable to compare the relative performance of the two algorithms in each context. Do you get similar results?
 
* What other optimizations can be applied to this problem?
 
  
=== Competition ===
+
{{Admon/tip|Non-Decimal Notation|In this lab, the number prefix 0x indicates a hexadecimal number, and 0b indicates a binary number, in harmony with the C language.}}
* How fast can you scale 500 million int16 PCM sound samples?
+
 
 +
{{Admon/tip|Time and Memory Usage of a Program|You can get basic timing information for a program by running <code>time ''programName''</code> -- the output will show the total time taken (real), the amount of CPU time used to run the application (user), and the amount of CPU time used by the operating system on behalf of the application (system).
 +
 
 +
The version of the <code>time</code> command located in <code>/bin/time</code> gives slightly different information than the version built in to bash -- including maximum resident memory usage: <code>/bin/time ''./programName''</code>}}
  
=== Tips ===
 
 
{{Admon/tip|SOX|If you want to try this with actual sound samples, you can convert a sound file of your choice to raw 16-bit signed integer PCM data using the [http://sox.sourceforge.net/ sox] utility present on most Linux systems and available for a wide range of platforms.}}
 
  
{{Admon/tip|Stack Limit|Fixed-size, non-static arrays will be placed in the stack space. The size of the stack space is controlled by per-process limits, inherited from the shell, and adjustable with the <code>ulimit</code> command. Allocating an array larger than the stack size limit will cause a segmentation fault, usually on the first write. To see the current stack limit, use <code>ulimit -s</code> (displayed value is in KB; default is usually 8192 KB or 8 MB). To set the current stack limit, place a new size in KB or the keyword <code>unlimited</code> after the <code>-s</code> argument.<br /><br />Alternate (and preferred) approach: allocate the array space with <code>malloc()</code> or <code>calloc()</code>.}}
+
{{Admon/tip|stdint.h|The <code>stdint.h</code> header provides definitions for many specialized integer size types. Use <code>int16_t</code> for 16-bit signed integers.}}
  
{{Admon/tip|stdint.h|The <code>stdint.h</code> header provides definitions for many specialized integer size types. Use <code>int16_t</code> for 16-bit signed integers.}}
+
{{Admon/tip|Scripting|Use bash scripting capabilities to reduce tedious manual steps!}}

Revision as of 11:26, 9 March 2020

Purpose of this Lab
In this lab, you will investigate the impact of different algorithms which produce the same effect. You will test and select one of three algorithms for adjusting the volume of PCM audio samples based on benchmarking.

Lab 6

Background

  • Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
  • To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
  • On a mobile device, the amount of processing required to scale sound will affect battery life.
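
The data-rate arithmetic above can be checked with a few lines of C. This is only an illustrative sketch; the function name pcm_bytes_per_second is hypothetical and is not part of the lab archive.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second of uncompressed PCM audio.
 * Hypothetical helper for illustration; not part of the lab files. */
uint32_t pcm_bytes_per_second(uint32_t rate_hz, uint32_t channels,
                              uint32_t bytes_per_sample)
{
    assert(channels > 0 && bytes_per_sample > 0);
    return rate_hz * channels * bytes_per_sample;
}
```

Calling pcm_bytes_per_second(44100, 2, 2) gives 176400 and pcm_bytes_per_second(48000, 2, 2) gives 192000, matching the figures in the first bullet.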

Three Approaches

Three approaches to this problem are provided:

  1. The basic or Naive algorithm (vol1.c). This approach multiplies each sound sample by 0.75, casting from signed 16-bit integer to floating point and back again. Casts between integer and floating point can be expensive operations.
  2. A lookup-based algorithm (vol2.c). This approach uses a pre-calculated table of all 65536 possible results, and looks up each sample in that table instead of multiplying.
  3. A fixed-point algorithm (vol3.c). This approach uses fixed-point math and bit shifting to perform the multiplication without using floating-point math.
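
To make the three ideas concrete, here is a minimal per-sample sketch of each. It is not the code in vol1.c, vol2.c, or vol3.c; the function names, the table layout, and the choice of 8 fractional bits for the fixed-point factor are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define VOLUME 0.75f     /* scaling factor mentioned in the lab text */
#define FIXED_SHIFT 8    /* assumption: 8 fractional bits for the fixed-point factor */

/* Approach 1 (naive): int16 -> float, multiply, float -> int16.
 * The conversion back to integer truncates toward zero. */
int16_t scale_naive(int16_t sample)
{
    return (int16_t)(VOLUME * (float)sample);
}

/* Approach 2 (lookup): one precomputed result per possible 16-bit value. */
static int16_t lookup_table[65536];
static int table_ready = 0;

void build_lookup_table(void)
{
    for (int32_t s = INT16_MIN; s <= INT16_MAX; s++) {
        /* Index by the sample's unsigned bit pattern, 0..65535. */
        lookup_table[(uint16_t)s] = scale_naive((int16_t)s);
    }
    table_ready = 1;
}

int16_t scale_lookup(int16_t sample)
{
    assert(table_ready);  /* build_lookup_table() must have run first */
    return lookup_table[(uint16_t)sample];
}

/* Approach 3 (fixed-point): multiply by an integer factor, then shift.
 * Right-shifting a negative value is implementation-defined in C;
 * GCC and Clang perform an arithmetic shift. */
int16_t scale_fixed(int16_t sample)
{
    const int32_t factor = (int32_t)(VOLUME * (1 << FIXED_SHIFT)); /* 0.75 * 256 = 192 */
    return (int16_t)(((int32_t)sample * factor) >> FIXED_SHIFT);
}
```

In this sketch the naive version truncates toward zero, while an arithmetic right shift rounds toward negative infinity, so the two can differ by one for negative samples (for a sample of -1, the naive version gives 0 and the shift typically gives -1). That is the kind of difference the "same output?" question in the steps below asks you to look for.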

Don't Compare Across Machines

In this lab, do not compare the relative performance across different machines, because the systems provided have a wide range of processor implementations, from server-class to mobile-class. However, do compare the relative performance of the various algorithms on the same machine.

Benchmarking

Get the files for this lab from one of the SPO600 Servers -- but you can perform the lab wherever you want (feel free to use your laptop or home system). Test on both an x86_64 and an AArch64 system.

Review the contents of this archive:

  • vol.h controls the number of samples to be processed
  • vol1.c, vol2.c, and vol3.c implement the various algorithms
  • The Makefile can be used to build the programs

Perform these steps:

  1. Unpack the archive /public/spo600-algorithm-selection-lab.tgz
  2. Study each of the source code files and make sure that you understand what the code is doing.
  3. Make a prediction of the relative performance of each scaling algorithm.
  4. Build and test each of the programs.
    • Do all of the algorithms produce the same output?
      • How can you verify this?
      • If there is a difference, is it significant enough to matter?
    • Change the number of samples so that each program takes a reasonable amount of time to execute (suggested minimum 20 seconds, 1 minute or more is better).
  5. Test the performance of each program.
    • Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples. This is the hard part!
    • How much time is spent scaling the sound samples?
    • Do multiple runs take the same time? How much variation do you observe? What is the likely cause of this variation?
    • Is there any difference in the results produced by the various algorithms?
    • Does the difference between the algorithms vary depending on the architecture and implementation on which you test?
    • What is the relative memory usage of each program?
  6. Was your prediction accurate?
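
One possible way to time only the scaling step is to read a monotonic clock immediately before and after the loop. The sketch below assumes a POSIX system with clock_gettime(); scale_sample is a hypothetical stand-in for whichever algorithm is under test, and the function names are not taken from the lab files.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the algorithm being measured (scales by 3/4). */
int16_t scale_sample(int16_t sample)
{
    return (int16_t)(sample * 3 / 4);
}

/* Seconds elapsed between two timestamps. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Scale count samples from in to out and return the time spent in the
 * loop only. Sample generation and result summing happen outside. */
double time_scaling(int16_t (*scale_fn)(int16_t),
                    const int16_t *in, int16_t *out, size_t count)
{
    struct timespec start, end;

    assert(scale_fn != NULL && in != NULL && out != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < count; i++) {
        out[i] = scale_fn(in[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_seconds(start, end);
}
```

Calling through a function pointer adds its own overhead to every sample, so for a real measurement you may prefer to place the two clock_gettime() calls directly around the loop in each program. Whichever form you use, the output array must still be consumed afterwards so the optimizer cannot discard the loop.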

Deliverables

Blog about your experiments with a detailed analysis of your results, including memory usage, performance, accuracy, and trade-offs.

Make sure you convincingly prove your results to your reader! Also be sure to explain what you're doing so that a reader coming across your blog post understands the context (in other words, don't just jump into a discussion of optimization results -- give your post some context).

Optional - Recommended: Compare results across several implementations of AArch64 and x86_64 systems. Note that on different CPU implementations, the relative performance of different algorithms will vary; for example, table lookup may outperform other algorithms on a system with a fast memory system (cache), but not on a system with a slower memory system.

  • For AArch64, you could compare the performance on AArchie against the various class servers, or between the class servers and a Raspberry Pi 3 (in 64-bit mode) or an ARM Chromebook.
  • For x86_64, you could compare the performance of different processors, such as xerxes, your own laptop or desktop, and Seneca systems such as Matrix or lab desktops.

Things to consider

Design of Your Tests

  • Most solutions for a problem of this type involve generating a large amount of data in an array, processing that array using the function being evaluated, and then storing that data back into an array. The test setup can take more time than the actual test! Make sure that you measure the time taken in the code under test only -- you need to be able to remove the rest of the processing time from your evaluation.
  • You may need to run a very large amount of sample data through the function to be able to detect its performance.
  • If you do not use the output from your calculation (e.g., by doing something with the output array), the compiler may recognize that and remove the code you're trying to test. Be sure to process the results in some way so that the optimizer preserves the code you want to test. It is a good idea to calculate some sort of verification value to ensure that all of the approaches generate the same results.
  • Be aware of what other tasks the system is handling during your test run, including software running on behalf of other users.
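
The points above can be sketched as two helpers: one that fills a heap-allocated array with reproducible pseudo-random samples, and one that sums an array into a verification value. Both names are hypothetical, and the small linear congruential generator is an assumption chosen so that the same seed yields the same data on every machine; the lab files may generate their data differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate count samples on the heap and fill them with reproducible
 * pseudo-random values. Returns NULL if allocation fails.
 * The generator is a simple LCG chosen for illustration only. */
int16_t *make_samples(size_t count, uint32_t seed)
{
    int16_t *data = malloc(count * sizeof *data);
    uint32_t state = seed;

    if (data == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        state = state * 1664525u + 1013904223u;
        /* Top 16 bits, shifted into the range -32768..32767. */
        data[i] = (int16_t)((int32_t)(state >> 16) - 32768);
    }
    return data;
}

/* Sum every sample. Printing this value uses the output (so the
 * optimizer must keep the scaling code) and doubles as a check that
 * different algorithms produced the same results. */
int64_t checksum(const int16_t *data, size_t count)
{
    int64_t sum = 0;

    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
```

A sum is a weak check, since offsetting differences can cancel out; comparing the output arrays element by element is stricter when you want to know exactly where two algorithms disagree.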

Tips

Analysis
Do a thorough analysis of the results. Be certain (and prove!) that your performance measurement does not include the generation or summarization of the test data. Do multiple runs and discard the outliers. Decide whether to use mean, minimum, or maximum time values from the multiple runs, and explain why you made that decision. Control your variables well. Show relative performance as percentage change, e.g., "this approach was NN% faster than that approach".
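
As a sketch of the arithmetic involved, the helpers below summarize a set of run times and express one approach relative to another. The names are hypothetical, and "NN% faster" is interpreted here as the percentage reduction in elapsed time relative to the baseline, which is only one of several reasonable definitions; state the one you use in your write-up.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest elapsed time among n runs. */
double min_time(const double *runs, size_t n)
{
    assert(runs != NULL && n > 0);
    double best = runs[0];
    for (size_t i = 1; i < n; i++) {
        if (runs[i] < best) {
            best = runs[i];
        }
    }
    return best;
}

/* Arithmetic mean of n elapsed times. */
double mean_time(const double *runs, size_t n)
{
    assert(runs != NULL && n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += runs[i];
    }
    return total / (double)n;
}

/* Percentage of elapsed time saved by the candidate relative to the
 * baseline. Positive means the candidate was faster. */
double percent_time_saved(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0);
    return (baseline_s - candidate_s) * 100.0 / baseline_s;
}
```

For example, a baseline of 10.0 seconds and a candidate of 7.5 seconds gives 25.0, i.e. the candidate took 25% less time.
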
Non-Decimal Notation
In this lab, the number prefix 0x indicates a hexadecimal number, and 0b indicates a binary number, in harmony with the C language.
Time and Memory Usage of a Program
You can get basic timing information for a program by running time programName -- the output will show the total time taken (real), the amount of CPU time used to run the application (user), and the amount of CPU time used by the operating system on behalf of the application (system). The version of the time command located in /bin/time gives slightly different information than the version built in to bash -- including maximum resident memory usage: /bin/time ./programName
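
Similar figures can also be read from inside a program. The sketch below assumes a Linux system, where getrusage() reports ru_maxrss in kilobytes; the wrapper names are hypothetical.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Peak resident set size of this process so far.
 * On Linux, ru_maxrss is reported in kilobytes. */
long max_rss_kib(void)
{
    struct rusage usage;
    int rc = getrusage(RUSAGE_SELF, &usage);

    assert(rc == 0);
    (void)rc;  /* avoid an unused-variable warning when NDEBUG is set */
    return usage.ru_maxrss;
}

/* CPU time spent in user mode by this process so far, in seconds. */
double user_cpu_seconds(void)
{
    struct rusage usage;
    int rc = getrusage(RUSAGE_SELF, &usage);

    assert(rc == 0);
    (void)rc;
    return (double)usage.ru_utime.tv_sec
         + (double)usage.ru_utime.tv_usec / 1e6;
}
```

Calling max_rss_kib() at the end of each program gives a rough basis for comparing the memory cost of, for example, the lookup table against the other approaches.
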
SOX
If you want to try this with actual sound samples, you can convert a sound file of your choice to raw 16-bit signed integer PCM data using the sox utility present on most Linux systems and available for a wide range of platforms.
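
Once you have a raw file, loading it is a single fread() call. The sketch below is a hypothetical helper, and it assumes the file's byte order matches the host's (little-endian on x86_64 and on typical AArch64 Linux systems).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read up to max_samples signed 16-bit samples from an open raw PCM
 * file into buf. Returns the number of complete samples read.
 * Assumes the file uses the host's byte order. */
size_t read_samples(FILE *fp, int16_t *buf, size_t max_samples)
{
    assert(fp != NULL && buf != NULL);
    return fread(buf, sizeof buf[0], max_samples, fp);
}
```

A return value smaller than max_samples means the end of the file (or an error) was reached; a stereo file interleaves left and right samples, which does not matter for volume scaling because every sample is treated the same way.
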
stdint.h
The stdint.h header provides definitions for many specialized integer size types. Use int16_t for 16-bit signed integers.
Scripting
Use bash scripting capabilities to reduce tedious manual steps!