# Changes

## SPO600 Algorithm Selection Lab

, 9 March
=== Background ===
* Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
* To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
* On a mobile device, the amount of processing required to scale sound will affect battery life.
=== Three Approaches ===
Three approaches to this problem are provided:
# The basic or Naive algorithm (<code>vol1.c</code>). This approach multiplies each sound sample by 0.75, casting from signed 16-bit integer to floating point and back again, which can be [[Expensive|expensive]] operations.
# A lookup-based algorithm (<code>vol2.c</code>). This approach uses a pre-calculated table of all 65536 possible results, and looks up each sample in that table instead of multiplying. Note that the input values range from -32768 to +32767, while C arrays accept only a non-negative index.
# A fixed-point algorithm (<code>vol3.c</code>). This approach uses fixed-point math and bit shifting to perform the multiplication without using floating-point math. For example, 1.00 could be represented as 0b100000000 (= 256 in decimal), giving a volume factor of 0.75 * 256 = 192; each sample is multiplied by that integer and the result is shifted right by 8 bits (<code>>>8</code>).
Each program creates an array of random "sound samples" (the number of samples is set in the <code>vol.h</code> file), scales those samples by the volume factor 0.75 and stores them back to the data array, then sums the output array and prints the sum.
=== Don't Compare Across Machines ===
In this lab, ''do not'' compare the relative performance across different machines, because the systems provided have a wide range of processor implementations, from server-class to mobile-class. However, ''do'' compare the relative performance of the various algorithms on the ''same'' machine.
=== Benchmarking ===
Get the files for this lab from one of the [[SPO600 Servers]] -- but you can perform the lab wherever you want (feel free to use your laptop or home system). Test on both an x86_64 and an AArch64 system.
Review the contents of this archive:
* <code>vol.h</code> controls the number of samples to be processed
* <code>vol1.c</code>, <code>vol2.c</code>, and <code>vol3.c</code> implement the various algorithms
* The <code>Makefile</code> can be used to build the programs
Perform these steps:
# Unpack the archive <code>/public/spo600-algorithm-selection-lab.tgz</code>
# Study each of the source code files and make sure that you understand what the code is doing.
# '''Make a prediction''' of the relative performance of each scaling algorithm.
# Build and test each of the programs.
#* Do all of the algorithms produce the same output?
#** How can you verify this?
#** If there is a difference, is it significant enough to matter?
#* Change the number of samples so that each program takes a reasonable amount of time to execute (suggested minimum 20 seconds, 1 minute or more is better).
# Test the performance of each program.
#* Find a way to measure performance ''without'' the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure ''only'' the time taken to scale the samples. '''This is the hard part!'''
#* How much time is spent scaling the sound samples?
#* Do multiple runs take the same time? How much variation do you observe? What is the likely cause of this variation?
#* Is there any difference in the results produced by the various algorithms?
#* Does the difference between the algorithms vary depending on the architecture and implementation on which you test?
#* What is the relative memory usage of each program?
# Was your prediction accurate?
=== Deliverables ===
Make sure you convincingly prove your results to your reader! Also be sure to explain what you're doing so that a reader coming across your blog post understands the context (in other words, don't just jump into a discussion of optimization results -- give your post some context).
'''Optional - Recommended:''' Compare results across several '''implementations''' of AArch64 and x86_64 systems. Note that on different CPU implementations, the relative performance of different algorithms will vary; for example, table lookup may outperform other algorithms on a system with a fast memory system (cache), but not on a system with a slower memory system.
* For AArch64, you could compare the performance on AArchie against another 64-bit ARM system such as the various class servers, or between the class servers and a Raspberry Pi 3 (in 64-bit mode) or an ARM Chromebook.
* For x86_64, you could compare the performance of different processors, such as xerxes, your own laptop or desktop, and Seneca systems such as Matrix, Zenit, or lab desktops.
=== Things to consider ===
==== Design of Your Tests ====
* Most solutions for a problem of this type involve generating a large amount of data in an array, processing that array using the function being evaluated, and then storing that data back into an array. The test setup can take more time than the actual test! Make sure that you measure the time taken in the code under test only -- you need to be able to remove the rest of the processing time from your evaluation.
* You may need to run a very large amount of sample data through the function to be able to detect its performance. Feel free to edit the sample count in <code>vol.h</code> as necessary.
* If you do not use the output from your calculation (e.g., by doing something with the output array), the compiler may recognize that and remove the code you're trying to test. Be sure to process the results in some way so that the optimizer preserves the code you want to test. It is a good idea to calculate some sort of verification value to ensure that all approaches generate the same results.
* Be aware of what other tasks the system is handling during your test run, including software running on behalf of other users.
==== Analyzing Results ====
* What is the impact of various optimization levels on the software performance? (For example, compiling with -O0 / -O1 / -O2 / -O3)
* Does the distribution of data matter? (e.g., is there any difference if there are no large absolute values, or no negative numbers?)
* If samples are fed at CD rate (44100 samples per second x 2 channels x 2 bytes per sample), can each of the algorithms keep up?
* What is the memory footprint of each approach?
* What is the performance of each approach?
* What is the energy consumption of each approach? (What information do you need to calculate this?)
* Various machines within an architecture have very different performance profiles, energy consumption, and hardware costs -- so it's not reasonable to compare performance between machines, but it is reasonable to compare the relative performance of the algorithms in each context. Does the ratio of performance of the various approaches remain constant across the machines? Why or why not?
* What other optimizations can be applied to this problem?
=== Tips ===
{{Admon/tip|Analysis|Do a thorough analysis of the results. Be certain (and prove!) that your performance measurement ''does not'' include the generation or summarization of the test data. Do multiple runs and discard the outliers. Decide whether to use mean, minimum, or maximum time values from the multiple runs, and explain why you made that decision. Control your variables well. Show relative performance as percentage change, e.g., "this approach was NN% faster than that approach".}}
{{Admon/tip|Non-Decimal Notation|In this lab, the number prefix 0x indicates a hexadecimal number, and 0b indicates a binary number, in harmony with the C language.}}

{{Admon/tip|Time and Memory Usage of a Program|You can get basic timing information for a program by running <code>time ''programName''</code> -- the output will show the total time taken (real), the amount of CPU time used to run the application (user), and the amount of CPU time used by the operating system on behalf of the application (system).
The version of the <code>time</code> command located in <code>/bin/time</code> gives slightly different information than the version built in to bash -- including maximum resident memory usage: <code>/bin/time ''./programName''</code>}}
{{Admon/tip|SOX|If you want to try this with actual sound samples, you can convert a sound file of your choice to raw 16-bit signed integer PCM data using the [http://sox.sourceforge.net/ sox] utility present on most Linux systems and available for a wide range of platforms.}}

{{Admon/tip|Stack Limit|Fixed-size, non-static arrays will be placed in the stack space. The size of the stack space is controlled by per-process limits, inherited from the shell, and adjustable with the <code>ulimit</code> command. Allocating an array larger than the stack size limit will cause a segmentation fault, usually on the first write. To see the current stack limit, use <code>ulimit -s</code> (displayed value is in KB; default is usually 8192 KB or 8 MB). To set the current stack limit, place a new size in KB or the keyword <code>unlimited</code> after the <code>-s</code> argument.<br /><br />Alternate (and preferred) approach, as used in the provided sample code: allocate the array space with <code>malloc()</code> or <code>calloc()</code>.}}
{{Admon/tip|stdint.h|The <code>stdint.h</code> header provides definitions for many specialized integer size types. Use <code>int16_t</code> for 16-bit signed integers.}}