
GPU621/Intel Advisor

312 bytes added, 15:48, 23 November 2018
add images
== Vectorization Examples ==
[[File:Vectorization-example-serial.png|center|frame]]
=== Serial Version ===
=== SIMD Version ===
 
[[File:Vectorization-example-simd.png|frame]]
<source lang="cpp">
If you compile the vec_samples project with the macro, the <code>matvec</code> function declaration will include the <code>restrict</code> keyword. The <code>restrict</code> keyword tells the compiler that pointers <code>a</code> and <code>b</code> do not overlap, so the compiler is free to optimize the code blocks that use those pointers.
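As a rough illustration of the idea, the sketch below declares a matrix-vector product with non-overlapping pointers. It is not the vec_samples source: the function name <code>matvec_sketch</code>, its parameter list, and the row-major layout are assumptions made for this example, and <code>__restrict</code> is a compiler extension (accepted by GCC, Clang, MSVC, and the Intel compiler) rather than standard C++.

```cpp
#include <cstddef>

// Hypothetical stand-in for matvec; not the code from the vec_samples project.
// __restrict is a compiler extension: it promises that a, b, and x never
// refer to overlapping memory for the duration of the call.
void matvec_sketch(std::size_t rows, std::size_t cols,
                   const float* __restrict a,  // rows x cols matrix, row-major
                   float* __restrict b,        // output vector, length rows
                   const float* __restrict x)  // input vector, length cols
{
    for (std::size_t i = 0; i < rows; ++i) {
        b[i] = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) {
            // Without the promise above, the compiler must assume that the
            // store to b[i] could modify a or x, which limits vectorization.
            b[i] += a[i * cols + j] * x[j];
        }
    }
}
```

Calling the function with overlapping arrays breaks the promise and results in undefined behaviour, so the qualifier should only be added when the caller can guarantee that the arrays are distinct.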
 
[[File:CPUCacheline.png|center|frame]]
==== multiply.c ====
The following image illustrates the loop-carried dependency when two pointers overlap.
[[File:Pointer-alias.png|center|frame]]
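A small sketch can make the dependency concrete. The function <code>add_in_place</code> below is a hypothetical example, not code from multiply.c. When it is called with two views of the same array offset by one element, each iteration reads a value that the previous iteration wrote.

```cpp
#include <cstddef>

// Hypothetical example: adds b[i] to a[i] for every i.
// If a and b overlap, a later iteration can read a value written by an
// earlier one, which is a loop-carried dependency.
void add_in_place(float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
```

For instance, with <code>float data[5] = {1, 1, 1, 1, 1};</code> the call <code>add_in_place(data + 1, data, 4)</code> passes overlapping pointers, so the array ends up as <code>{1, 2, 3, 4, 5}</code>. If the iterations were independent, the result would be <code>{1, 2, 2, 2, 2}</code>. Because the compiler cannot rule out the first case, it has to be conservative unless it is told that the pointers do not alias.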
== Memory Alignment ==
Intel Advisor can detect if there are any memory alignment issues that may produce inefficient vectorization code.
A loop can be vectorized if there are no data dependencies across loop iterations.

=== Peeled and Remainder Loops ===

If the data is not aligned, however, the vectorizer may have to use a '''peeled''' loop to address the misalignment. Instead of vectorizing the entire loop, it inserts an extra loop that performs the operations on the front end of the array, up to the first element that is aligned in memory.

[[File:Memory-alignment-peeled.png|frame]]

A remainder loop results when the number of elements in the array is not evenly divisible by the vector length (the total number of elements of a certain data type that can be loaded into a vector register).

[[File:Memory-alignment-remainder.png|frame]]

=== Padding ===

Even if the array elements are aligned in memory, say at 16-byte boundaries, you might still encounter a "remainder" loop that deals with the back end of the array that cannot be included in the vectorized code. The vectorizer will have to insert an extra loop after the vectorized loop to perform operations on the back end of the array.

To address this issue, add some padding.

For example, if you have a <code>4 x 19</code> array of floats, and your system has access to 128-bit vector registers, then you should add 1 column to make the array <code>4 x 20</code> so that the number of columns is evenly divisible by the number of floats that can be loaded into a 128-bit vector register, which is 4 floats.
[[File:Memory-alignment-padding.png|frame]]
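The arithmetic behind peeled loops, remainder loops, and padding can be sketched as follows. The helpers <code>split_loop</code> and <code>padded_columns</code> are hypothetical names introduced for this illustration. The sketch assumes 4-byte floats and a start address that is at least float-aligned. A real vectorizer may split a loop differently.

```cpp
#include <cstddef>
#include <cstdint>

// How a loop over n floats might be divided (illustrative only).
struct LoopSplit {
    std::size_t peel;       // scalar iterations before the first aligned element
    std::size_t body;       // iterations handled by full vector operations
    std::size_t remainder;  // scalar iterations left over at the end
};

// Hypothetical helper: addr is the address of element 0, vec_bytes is the
// vector register width in bytes (16 for a 128-bit register).
LoopSplit split_loop(std::uintptr_t addr, std::size_t n, std::size_t vec_bytes)
{
    const std::size_t lanes = vec_bytes / sizeof(float);  // floats per register
    const std::size_t misalign = addr % vec_bytes;        // bytes past a boundary
    std::size_t peel = (misalign == 0) ? 0 : (vec_bytes - misalign) / sizeof(float);
    if (peel > n) {
        peel = n;
    }
    const std::size_t rest = n - peel;
    return { peel, rest - rest % lanes, rest % lanes };
}

// Hypothetical helper: rounds a column count up to a multiple of lanes.
std::size_t padded_columns(std::size_t cols, std::size_t lanes)
{
    return (cols + lanes - 1) / lanes * lanes;
}
```

With 128-bit registers, a row of 19 floats that starts on a 16-byte boundary gives <code>split_loop(0, 19, 16)</code> a peel of 0, a vector body of 16, and a remainder of 3. <code>padded_columns(19, 4)</code> returns 20, and a row of 20 floats leaves no remainder.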
=== Aligned vs Unaligned Instructions ===
The functions are taken from Intel's interactive guide to Intel Intrinsics: [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2 Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2]
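The table of functions is not reproduced in this revision view, so the following is only a sketch of the general distinction. It uses the SSE intrinsics <code>_mm_load_ps</code> / <code>_mm_store_ps</code> (which require 16-byte aligned addresses) and <code>_mm_loadu_ps</code> / <code>_mm_storeu_ps</code> (which accept any address). The wrapper names <code>add4_aligned</code> and <code>add4_unaligned</code> are made up for the example, and the code assumes an x86 target with SSE.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Adds four floats using aligned loads and an aligned store.
// a, b, and out must each be 16-byte aligned.
void add4_aligned(const float* a, const float* b, float* out)
{
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}

// Same operation using unaligned loads and an unaligned store.
// The pointers may have any alignment.
void add4_unaligned(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Passing a misaligned pointer to the aligned variants is an error that typically faults at run time, which is why the vectorizer falls back to unaligned instructions or a peeled loop when it cannot prove alignment.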
=== Aligning Data ===
To align data elements to an <code>x</code>-byte boundary in memory, use the <code>align</code> macro.
#endif // _WIN32
</source>
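As an alternative to a compiler-specific macro, standard C++11 provides the <code>alignas</code> specifier. The sketch below is not part of the original sample: the type <code>Vec4</code> and the helper <code>is_aligned</code> are hypothetical names used only to show the specifier and a way to check the result.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative type: four floats stored on a 16-byte boundary,
// matching the width of a 128-bit vector register.
struct Vec4 {
    alignas(16) float data[4];
};

// Hypothetical helper: true if p sits on a boundary of the given size in bytes.
bool is_aligned(const void* p, std::size_t bytes)
{
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}
```

A local array can be declared the same way, for example <code>alignas(32) float buf[8];</code>, and <code>is_aligned(buf, 32)</code> can then be used in an assertion to confirm the alignment.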
 
= Summary =