Fall 2019 SPO600 Weekly Schedule

=== Week 5 - Class I ===
* SIMD and Auto-vectorization
** SIMD is an acronym for "Single Instruction, Multiple Data", and refers to a class of instructions which perform the same operation on several separate pieces of data in parallel. SIMD instructions also include related instructions to set up data for SIMD processing and to summarize results.
** SIMD is based on very wide registers (128 to 2048 bits on implementations current as of 2019), and these wide registers can be treated as multiple "lanes" of similar data. These SIMD registers, also called vector registers, can therefore be thought of as small arrays of values.
** A 128-bit SIMD register can be used as:
*** two 64-bit lanes
*** four 32-bit lanes
*** eight 16-bit lanes
*** sixteen 8-bit lanes
** Each architecture has a different notation for SIMD data. In AArch64 (which will be our focus):
*** Vector usage uses the notation v''n''.''s'', where ''n'' is the register number and ''s'' is the shape of the lanes, expressed as the number of lanes and a letter indicating the width of each lane: q for quad-word (128 bits), d for double-word (64 bits), s for single-word (32 bits), h for half-word (16 bits), and b for byte (8 bits). Therefore, <code>v0.16b</code> is vector register 0 used as 16 lanes of 8 bits (1 byte) each, while <code>v8.4s</code> is vector register 8 used as 4 lanes of 32 bits each. Most instructions permit either 64 or 128 bits of the register to be used.
*** Scalar usage uses the lane width followed by the vector register number. Therefore, <code>q3</code> refers to vector register 3 used as a single 128-bit value, and <code>s3</code> refers to the same register used as a single 32-bit value. When using less than 128 bits, the remaining bits are either zero-filled (unsigned usage) or sign-extended (signed usage: the upper bits are filled with the sign bit, i.e., the same value as the high bit of the active part of the register).
** Most SIMD operations work on corresponding lanes of the operand registers. For example, the AArch64 instruction <code>add v0.8h, v1.8h, v2.8h</code> will take the value in the first lane of register 1, add the value in the first lane of register 2, and place the result in the first lane of register 0. At the same time, the other lanes are processed in the same way, resulting in 8 simultaneous addition operations being performed.
** A small number of SIMD operations work across lanes, e.g., to find the lowest or highest value in all of the lanes, to add the lanes together, or to duplicate a single value into all of the lanes of a register. These are usually used to set up or summarize the results of SIMD operations -- for example, a value of 0 might be duplicated into all of the lanes of a result register, then a loop applied to sum array data into that register, and finally a lane-summing operation performed to merge the results from all of the lanes.
* SIMD capabilities can be used in a program in one of three different ways:
*# The compiler's ''auto-vectorizer'' can be used to identify sections of code to which SIMD is applicable, and SIMD code will automatically be generated.
*#* This works for the basic SIMD operations, but may not be applicable to advanced SIMD instructions, which don't clearly map to C statements.
*#* The compiler will be very cautious about vectorizing code. See the Resources section below for insight into these challenges.
*#* Vectorization is applied by default only at the -O3 level in most compilers.
*# [[Inline Assembly Language|Inline Assembler]]
*# C Intrinsics
* [[SPO600 Vectorization Lab|Vectorization Lab]] (Optional lab - recommended)
