Auto - Vectorization with little help from GCC!

This tutorial helps the programmers to benefit the progress of the auto-vectorization algorithms that are implemented in modern compilers, such as gcc. Before you start playing with the vectorization of your code i assume that you don't have any bottleneck in you code (like dynamic memory allocation etc) in the critical path. In this tutorial we will use the gcc 4.4.1, but the same steps can be applied to newer or older versions.

First of all there are two issues with auto vectorization: 1) gcc must know the architecture (eg what SIMD instructions are available) 2) The data structures must by properly aligned in memory

The first step is to find the architecture of your processor and point it to gcc using the flags -mtune=... / -march=... you specify the architecture. For example, my laptop is core2Duo so i put -march=core2. You can find more more information here. The next problem we must solve is knowledge of memory alignment. You can help the compiler by allocating everything with alignment at least 16 bytes For example:

1) Initialized static arrays :

static const uint8_t key_perm[0x40] __attribute__(( aligned(16) )) = { 0x12,... };

2) Try to use even space allocate in stack to be aliment (warning: there is a bug with some gcc versions and when running the gcc on cygwin for windows).

3) For the dynamic allocation you should use: memalign (or posix_memalign ) instead of malloc.

For steps 1 and 2 note that recent versions of gcc automatically align the arrays in the memory.
For example:

$ cat test.c
#define N 200000
int a[N] __attribute__(( aligned(16) ));
int b[N] __attribute__(( aligned(16) ));
int c[N] __attribute__(( aligned(16) ));

double f2(){
int i;
for (i = 0; i < N; i++)
a[i] = b[i] + c[i];
}

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=1 -c ./test.c
./test.c:8: note: LOOP VECTORIZED../test.c:6: note: vectorized 1 loops in function.$

The verbose flag should give some info of what vectorized and what not. Even if the arrays are declared in stack, gcc is able to vectorize the loop. Checking the size of the object file we have:

$ size test.o   text    data    bss    dec    hex filename 94    0    0    94    5e test.o

94 Bytes of text, not so bad. Now lets make the example more complicated. Note that i use gcc-4.4, the behavior of older versions may be different.

double f2(int *a, int *b, int *c, int N){
int i;
for (i = 0; i < N; i++)
a[i] = b[i] + c[i];
}

Compiler must solve two issues in the above example: first the number of iterations is unknown and second the pointers a,b, and may be to point on the same memory location. Dependency analysis on pointer is also called alias analysis. Compiling the example we are getting this:

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -c ./test.c
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
./test.c:6: note: cost model: Adding cost of checks for loop versioning aliasing.
./test.c:6: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .

./test.c:6: note: Cost model analysis:
Vector inside of loop cost: 6
Vector outside of loop cost: 20
Scalar iteration cost: 4
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 7
./test.c:6: note: Profitability threshold = 6
./test.c:6: note: Vectorization may not be profitable.
./test.c:6: note: created 2 versioning for alias checks.
./test.c:6: note: LOOP VECTORIZED.
./test.c:3: note: vectorized 1 loops in function.

$ size test.o   text    data    bss    dec    hex filename 262    0    0    262    106 test.o

As a result, we must have at least 7 iterations and gcc inserts runtime calls. We can help the situation by using the keyword restrict that is part of C99 standard. More information about gcc and restrict keyword you can find here and here. For example for the same loop you must change to:

double f2(int * restrict a, int * restrict b, int *restrict c, int N)

And the compile:
$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -std=c99 -c ./test.c
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: cost model: prologue peel iters set to vf/2.
./test.c:11: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
./test.c:11: note: Cost model analysis:
Vector inside of loop cost: 6
Vector outside of loop cost: 24
Scalar iteration cost: 4
Scalar outside cost: 7
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 5

./test.c:11: note: Profitability threshold = 4
./test.c:11: note: Vectorization may not be profitable.
./test.c:11: note: LOOP VECTORIZED.
./test.c:6: note: vectorized 1 loops in function.

We are able to decrease the minimum profitability to 5 iterations by disabling the aliasing analysis.

Hard stuff (next level):

1) Eliminate branches in your code: good for vectorization but the impact in x86 is low - branch predictors are working good. You can use build in unlikely to set hints about branches. For manual branch elimination you can try:http://cellperformance.beyond3d.com/articles/2006/04/benefits-to-branch-elimination.html.

2) Write some SIMD assembly directly ( look here and here <http://gcc.gnu.org/onlinedocs/gcc/X86-Built_002din-Functions.html>). Of course the manuals of intel is also a good start (many pages... but there are good :)).

Static linking with gcc and g++

In this tutorial, we will explain what the static linking is, how this affect the size of final binary, and why statically linking with g++ sometimes is pain. By definition, a statically compiled binary is a group of programmer ‘s routines, external functions, and variables which are packed into the final binary executable. The compiler or the linker produces the final object and embeds all the functions and variables and the linking phase. There are two reasons of using dynamic linking and shared libraries: 1) Avoid creating a huge binary, if all the programs use a standard set of libraries why not having the operating system providing to them 2) Compatibility on operating system and machine dependant characteristics: sometimes the libraries must be implemented based on the architecture or the operating system and using dynamic linking is an easy way to avoid this catch. On the other hand, static linking is the ideal way of distributing one software product, pay...

Personal Blog

Search This Blog

Auto - Vectorization with little help from GCC!

Popular posts from this blog

Processing Milky Way in RawTherapee

Static linking with gcc and g++

Processing a High Dynamic Range raw image with RawTherapee