CDOT Wikiβ

Introduction

Intel Advisor is bundled with Intel Parallel Studio.

Vectorization

Vectorization is the process of utilizing vector registers to perform a single instruction on multiple values all at the same time.

A CPU register is a very, very tiny block of memory that sits right on top of the CPU. A 64-bit CPU can store 8 bytes of data in a single register.

A vector register is an expanded version of a CPU register. A 128-bit vector register can store 16 bytes of data. A 256-bit vector register can store 32 bytes of data.

The vector register can then be divided into lanes, where each lane stores a single value of a certain data type.

A 128-bit vector register can be divided into the following ways:

• 16 lanes: 16x 8-bit characters
• 8 lanes: 8x 16-bit integers
• 4 lanes: 4x 32-bit integers / floats
• 2 lanes: 2x 64-bit integers
• 2 lanes: 2x 64-bit doubles
a | b | c | d | e | f | g | h

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

10  | 20  | 30  | 40

1.5 | 2.5 | 3.5 | 4.5

1000    | 2000

3.14159 | 3.14159


SIMD Extensions

SSE stands for Streaming SIMD Extensions which refers to the addition of a set of SIMD instructions as well as new XMM registers. SIMD stands for Single Instruction Multiple Data and refers to instructions that can perform a single operation on multiple elements in parallel.

List of SIMD extensions:

• SSE
• SSE2
• SSE3
• SSSE3
• SSE4.1
• SSE4.2
• AVX
• AVX2

(For Unix/Linux) To display what instruction set and extensions set your processor supports, you can use the following commands:

$uname -a$ lscpu
\$ cat /proc/cpuinfo


SSE Examples

For SSE >= SSE4.1, to multiply two 128-bit vector of signed 32-bit integers, you would use the following Intel intrisic function:

__m128i _mm_mullo_epi32(__m128i a, __m128i b)

Prior to SSE4.1, the same thing can be done with the following sequence of function calls.

// Vec4i operator * (Vec4i const & a, Vec4i const & b) {
// #ifdef
__m128i a13 = _mm_shuffle_epi32(a, 0xF5);              // (-,a3,-,a1)
__m128i b13 = _mm_shuffle_epi32(b, 0xF5);              // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                  // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);              // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);   // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);   // (-,-,a3*b3,a2*b2)
__m128i prod = _mm_unpacklo_epi64(prod01, prod23);     // (ab3,ab2,ab1,ab0)

Code sample was taken from this StackOverflow thread: Fastest way to multiply two vectors of 32bit integers in C++, with SSE

Here is a link to an interactive guide to Intel Intrinsics: Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

Vectorization Examples

Serial Version

int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

for (int i = 0; i < 8; i++) {
a[i] *= 10;
}

SIMD Version

int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
int ten[4] = { 10, 10, 10, 10 }
__m128i va, v10;

for (int i = 0; i < 8; i+=4) {
va = _mm_loadu_si128((__m128i*)&a[i]);         // 1  2  3  4
va = _mm_mullo_epi32(va, v10);                 // 10 20 30 40
_mm_storeu_si128((__m128i*)&a[i], va);         // [10, 20, 30, 40]
}

Here is a great tutorial on how to use Intel Advisor to vectorize your code. Intel® Advisor Tutorial: Add Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor

You can find the sample code in the directory of your Intel Parallel Studio installation. Just unzip the file and you can open the solution in Visual Studio or build on the command line.

For default installations, the sample code would be located here: C:\Program Files (x86)\IntelSWTools\Advisor 2019\samples\en\C++\vec_samples.zip

Loop Unrolling

The compiler can "unroll" a loop so that the body of the loop is duplicated a number of times, and as a result, reduce the number of conditional checks and counter increments per loop.

Warning: Do not write your code like this. The compiler will do it for you, unless you tell it not to.

#pragma nounroll
for (int i = 0; i < 50; i++) {
foo(i);
}

for (int i = 0; i < 50; i+=5) {
foo(i);
foo(i+1);
foo(i+2);
foo(i+3);
foo(i+4);
}

// foo(0)
// foo(1)
// foo(2)
// foo(3)
// foo(4)
// ...
// foo(45)
// foo(46)
// foo(47)
// foo(48)
// foo(49)

Data Dependencies

Pointer Alias

A pointer alias means that two pointers point to the same location in memory or the two pointers overlap in memory.

If you compile the vec_samples project with the macro, the matvec function declaration will include the restrict keyword. The restrict keyword will tell the compiler that pointers a and b do not overlap and that the compiler is free optimize the code blocks that uses the pointers.

multiply.c

#ifdef NOALIAS
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[restrict], FTYPE x[], FTYPE wr[])
#else
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[], FTYPE wr[])
#endif

To learn more about the restrict keyword and how the compiler can optimize code if it knows that two pointers do not overlap, you can visit this StackOverflow thread: What does the restrict keyword mean in C++?

Loop-Carried Dependency

Pointers that overlap one another may introduce a loop-carried dependency when those pointers point to an array of data. The vectorizer will make this assumption and, as a result, will not auto-vectorize the code.

In the code example below, a is a function of b. If pointers a and b overlap, then there exists the possibility that if a is modified then b will also be modified, and therefore may create the possibility of a loop-carried dependency. This means the loop cannot be vectorized.

void func(int* a, int* b) {
...
for (i = 0; i < size1; i++) {
for (j = 0; j < size2; j++) {
a[i] = foo(b[j]);
}
}
}

The following image illustrates the loop-carried dependency when two pointers overlap.

Magnitude of a Vector

To demonstrate a more familiar example of a loop-carried dependency that would block the auto-vectorization of a loop, I'm going to include a code snippet that calculates the magnitude of a vector.

To calculate the magnitude of a vector: length = sqrt(x^2 + y^2 + z^)

for (int i = 0; i < n; i++)
sum += x[i] * x[i];

length = sqrt(sum);

As you can see, there is a loop-carried dependency with the variable sum. The diagram below illustrates why the loop cannot be vectorized (nor can it be threaded). The dashed rectangle represents a single iteration in the loop, and the arrows represents dependencies between nodes. If an arrow crosses the iteration rectangle, then those iterations cannot be executed in parallel.

To resolve the loop-carried dependency, use simd and the reduction clause to tell the compiler to auto-vectorize the loop and to reduce the array of elements to a single value. Each SIMD lane will compute its own sum and then combine the results into a single sum at the end.

#pragma omp simd reduction(+:sum)
for (int i = 0; i < n; i++)
sum += x[i] * x[i];

length = sqrt(sum);

Memory Alignment

Intel Advisor can detect if there are any memory alignment issues that may produce inefficient vectorization code.

A loop can be vectorized if there are no data dependencies across loop iterations.

Peeled and Remainder Loops

However, if the data is not aligned, the vectorizer may have to use a peeled loop to address the misalignment. So instead of vectorizing the entire loop, an extra loop needs to be inserted to perform operations on the front-end of the array that not aligned with memory.

A remainder loop is the result of having a number of elements in the array that is not evenly divisible by the vector length (the total number of elements of a certain data type that can be loaded into a vector register).

Even if the array elements are aligned with memory, say at 16 byte boundaries, you might still encounter a "remainder" loop that deals with back-end of the array that cannot be included in the vectorized code. The vectorizer will have to insert an extra loop at the end of the vectorized loop to perform operations on the back-end of the array.

For example, if you have a 4 x 19 array of floats, and your system access to a 128-bit vector registers, then you should add 1 column to make the array 4 x 20 so that the number of columns is evenly divisible by the number of floats that can be loaded onto a 128-bit vector register, which is 4 floats.

Aligned vs Unaligned Instructions

There are two versions of SIMD instructions for loading into and storing from vector registers: aligned and unaligned.

The following table contains a list of Intel Intrinsics functions for both aligned and unaligned load and store instructions.

Aligned Unaligned Description
void _mm_store_pd (double* mem_addr, __m128d a) void _mm_storeu_pd (double* mem_addr, __m128 a) Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory.
void _mm_store_ps (float* mem_addr, __m128 a) void _mm_storeu_ps (float* mem_addr, __m128 a) Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory.
void _mm_store_si128 (__m128i* mem_addr, __m128i a) void _mm_storeu_si128 (__m128i* mem_addr, __m128i a) Store 128-bits of integer data from a into memory.

The functions are taken from Intel's interactive guide to Intel Intrinsics: Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

Aligning Data

To align data elements to an x amount of bytes in memory, use the align macro.

Code snippet that is used to align the data elements in the vec_samples project.

// Tell the compiler to align the a, b, and x arrays
// boundaries.  This allows the vectorizer to use aligned instructions
// and produce faster code.
#ifdef _WIN32
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE a[ROW][COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE b[ROW];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE x[COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE wr[COLWIDTH];
#else
FTYPE a[ROW][COLWIDTH]	__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE b[ROW]			__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE x[COLWIDTH]		__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE wr[COLWIDTH]		__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
#endif // _WIN32