Don't leave easy performance wins on the table.

Or vectorize all the things!

Apr 25, 2021

One of the most exciting features of Java 16 is the Vector API (JEP 338), which makes it possible to take advantage of available SIMD instructions and, by doing so, significantly improve performance.

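When reading an example from the JEP documentation I was somewhat shocked to see that a simple scalar computation, a loop over float arrays that evaluates `c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f`, has to be rewritten as a hardly readable chain of `FloatVector` calls (`fromArray`, `mul`, `add`, `neg`, `intoArray`) to get the desired vectorized assembly.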

"Phew, great that I don't use Java", thought I and went on to see what Go would do in such a case. To my big disappointment, Go does not seem to support SIMD intrinsics and generates non-vectorized assembly :(

Convinced that Clang would not disappoint me, I checked the assembly using Compiler Explorer with the highest optimization level and noticed that even though it does a lot of useful optimizations, including loop unrolling, it still uses only 128-bit XMM registers by default, but easily switches to 512-bit ZMM registers when AVX-512 Foundation support is requested via the -mavx512f flag.
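The exact listing that was compiled is not shown here. As a minimal sketch, assuming it was a direct C port of the JEP's scalar loop, it could look like the following. The function name `scalar_computation`, the explicit length parameter `n`, and the `restrict` qualifiers are illustrative choices, not taken from the original.

```c
#include <stddef.h>

/* Hypothetical C port of the scalar loop from the JEP 338 example.
 * `restrict` promises the compiler that the arrays do not overlap,
 * which helps the auto-vectorizer avoid runtime aliasing checks. */
void scalar_computation(const float *restrict a, const float *restrict b,
                        float *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Negated sum of squares, element by element. */
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}
```

Compiling this once with `clang -O3 -S` and once with `clang -O3 -mavx512f -S` should make it possible to compare the register widths in the generated assembly.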

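AVX-512 first shipped in Intel's Xeon Phi (Knights Landing) in 2016 and reached mainstream Xeon servers with Skylake in 2017 (Haswell, which shipped in 2013, added AVX2 and its 256-bit YMM registers), so in 2021 there is a good chance your Intel servers have it.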
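Rather than guessing, you can ask the CPU at run time. The sketch below assumes GCC or Clang on an x86 target, where the `__builtin_cpu_supports` builtin is available. The helper name `has_avx512f` is made up for illustration, and on other compilers or architectures it simply reports no support.

```c
/* Hypothetical helper: returns 1 if the running CPU reports AVX-512
 * Foundation support, 0 otherwise. Relies on a GCC/Clang x86 builtin. */
int has_avx512f(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* make sure CPU feature data is initialized */
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0;  /* unknown compiler or non-x86 target: assume not available */
#endif
}
```

A check like this is typically used to choose between a wide and a narrow code path at startup, so that a binary built with AVX-512 code does not crash on a machine that lacks it.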

Moral of the story? Don't leave performance on the table - know your hardware and how to take full advantage of it.

Software Bits Newsletter
