
Auto-Vectorization with a little help from GCC!

This tutorial helps programmers benefit from the auto-vectorization algorithms implemented in modern compilers such as gcc. Before you start playing with the vectorization of your code, I assume that you don't have any other bottleneck (like dynamic memory allocation) in the critical path. In this tutorial we will use gcc 4.4.1, but the same steps can be applied to newer or older versions.

First of all, there are two issues with auto-vectorization: 1) gcc must know the target architecture (e.g. which SIMD instructions are available), and 2) the data structures must be properly aligned in memory.

The first step is to find the architecture of your processor and pass it to gcc: with the -mtune=... / -march=... flags you specify the target architecture. For example, my laptop is a Core 2 Duo, so I use -march=core2 (you can find more information in the gcc documentation). The next problem we must solve is memory alignment. You can help the compiler by allocating everything with an alignment of at least 16 bytes. For example:

1) Initialized static arrays:

static const uint8_t key_perm[0x40] __attribute__(( aligned(16) )) = { 0x12,... }; 

2) Arrays allocated on the stack should also be aligned (warning: there is a bug with some gcc versions when running gcc on cygwin for windows).

3) For dynamic allocation you should use memalign (or posix_memalign) instead of malloc.
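For example, here is a minimal sketch of items 2 and 3 (the sizes and names are only placeholders):

#include <stdlib.h>

/* Item 2: a stack array can be aligned explicitly. */
void use_stack_buffer(void)
{
        int tmp[64] __attribute__(( aligned(16) ));
        /* ... fill and use tmp ... */
        (void)tmp;
}

/* Item 3: posix_memalign instead of malloc for dynamic allocation;
 * 16 bytes is enough for the 128-bit SSE registers. */
int *alloc_aligned_ints(size_t n)
{
        void *p = NULL;
        if (posix_memalign(&p, 16, n * sizeof(int)) != 0)
                return NULL;
        return (int *)p;
}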


For items 1 and 2, note that recent versions of gcc automatically align such arrays in memory.
For example: 

$ cat test.c 
#define N 200000
int a[N] __attribute__(( aligned(16) ));
int b[N] __attribute__(( aligned(16) ));
int c[N] __attribute__(( aligned(16) ));

void f2(){
        int i;
        for (i = 0; i < N; i++)
                a[i] = b[i] + c[i];
}

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=1 -c ./test.c 
./test.c:8: note: LOOP VECTORIZED.
./test.c:6: note: vectorized 1 loops in function.

The verbose flag gives some information about what was vectorized and what was not. Even when the arrays are declared on the stack, gcc is able to vectorize the loop. Checking the size of the object file we have:


$ size test.o
   text    data     bss     dec     hex filename
     94       0       0      94      5e test.o

94 bytes of text, not so bad. Now let's make the example more complicated. Note that I use gcc-4.4; the behavior of older versions may be different.


void f2(int *a, int *b, int *c, int N){
        int i;
        for (i = 0; i < N; i++)
                a[i] = b[i] + c[i];
}



The compiler must solve two issues in the above example: first, the number of iterations is unknown, and second, the pointers a, b, and c may point to the same memory location. Dependency analysis on pointers is also called alias analysis. Compiling the example we get:


$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -c ./test.c 
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
./test.c:6: note: cost model: Adding cost of checks for loop versioning aliasing.
./test.c:6: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .

./test.c:6: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 20 
Scalar iteration cost: 4  
Scalar outside cost: 1  
prologue iterations: 0  
epilogue iterations: 2  
Calculated minimum iters for profitability: 7
./test.c:6: note:   Profitability threshold = 6
./test.c:6: note: Vectorization may not be profitable.
./test.c:6: note: created 2 versioning for alias checks.
./test.c:6: note: LOOP VECTORIZED.
./test.c:3: note: vectorized 1 loops in function.

$ size test.o
   text    data     bss     dec     hex filename
    262       0       0     262     106 test.o
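The object file almost tripled because gcc now keeps both a scalar and a vector version of the loop, plus the runtime checks that choose between them. To see why the alias checks are needed, consider a hypothetical caller like the following (not part of test.c, purely illustrative):

/* Hypothetical caller illustrating the aliasing problem. */
int x[16];
void caller(void)
{
        /* The destination overlaps the sources shifted by one element,
         * so a[i] = b[i] + c[i] reads values written in previous
         * iterations; a blindly vectorized loop would give different
         * results than the scalar one. */
        f2(&x[1], &x[0], &x[0], 15);
}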


As a result, the vectorized version is only used when there are at least 7 iterations and the runtime checks that gcc inserted pass. We can improve the situation with the restrict keyword, which is part of the C99 standard and promises the compiler that the pointers do not alias (see the gcc documentation for details). For example, the signature of the same loop must change to:

void f2(int * restrict a, int * restrict b, int * restrict c, int N)


And then compile:
$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -std=c99 -c ./test.c 
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: cost model: prologue peel iters set to vf/2.
./test.c:11: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
./test.c:11: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 24  
Scalar iteration cost: 4  
Scalar outside cost: 7 
prologue iterations: 2  
epilogue iterations: 2  
Calculated minimum iters for profitability: 5

./test.c:11: note:   Profitability threshold = 4
./test.c:11: note: Vectorization may not be profitable.
./test.c:11: note: LOOP VECTORIZED.
./test.c:6: note: vectorized 1 loops in function.


By removing the runtime alias checks with restrict, we decreased the minimum number of iterations for profitability from 7 to 5.


Hard stuff (next level):


1) Eliminate branches in your code: this is good for vectorization, but the impact on x86 is low because the branch predictors work well. You can use the __builtin_expect built-in (the usual likely/unlikely macros) to give the compiler branch hints, as in the first sketch after this list. For manual branch elimination you can try: http://cellperformance.beyond3d.com/articles/2006/04/benefits-to-branch-elimination.html

2) Write some SIMD code directly, either in assembly or with the gcc x86 built-in functions (see http://gcc.gnu.org/onlinedocs/gcc/X86-Built_002din-Functions.html); a small sketch follows the list. Of course, the Intel manuals are also a good start (many pages... but they are good :)).
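A sketch of the branch-hint idea from item 1 (the function and the threshold are made up for illustration):

/* Illustrative only: saturate() and the 255 threshold are hypothetical. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int saturate(int v)
{
        if (unlikely(v > 255))   /* hint: the overflow case is rare */
                return 255;
        return v;
}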
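And a minimal sketch of item 2 using the SSE2 intrinsics from emmintrin.h, a hand-written equivalent of the loop we vectorized above (it assumes the arrays are 16-byte aligned and n is a multiple of 4):

#include <emmintrin.h>

/* a[i] = b[i] + c[i] with explicit SSE2 intrinsics; a, b, c must be
 * 16-byte aligned and n a multiple of 4 in this simplified sketch. */
void f2_sse2(int *a, const int *b, const int *c, int n)
{
        int i;
        for (i = 0; i < n; i += 4) {
                __m128i vb = _mm_load_si128((const __m128i *)&b[i]);
                __m128i vc = _mm_load_si128((const __m128i *)&c[i]);
                _mm_store_si128((__m128i *)&a[i], _mm_add_epi32(vb, vc));
        }
}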
