+ void foo(float *f) {
+ for (int i = 0; i != 1024; ++i)
+ f[i] = floorf(f[i]);
+ }
+
+Partial unrolling during vectorization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Modern processors feature multiple execution units, and only programs that contain a
+high degree of parallelism can fully utilize the entire width of the machine.
+The Loop Vectorizer increases the instruction level parallelism (ILP) by
+performing partial-unrolling of loops.
+
+In the example below the entire array is accumulated into the variable 'sum'.
+This is inefficient because only a single execution port can be used by the processor.
+By unrolling the code the Loop Vectorizer allows two or more execution ports
+to be used simultaneously.
+
+.. code-block:: c++
+
+ int foo(int *A, int *B, int n) {
+ unsigned sum = 0;
+ for (int i = 0; i < n; ++i)
+ sum += A[i];
+ return sum;
+ }
+
+The Loop Vectorizer uses a cost model to decide when it is profitable to unroll loops.
+The decision to unroll the loop depends on the register pressure and the generated code size.
+
+Performance
+-----------
+
+This section shows the the execution time of Clang on a simple benchmark:
+`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
+This benchmarks is a collection of loops from the GCC autovectorization
+`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.
+
+The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for "corei7-avx", running on a Sandybridge iMac.
+The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.