One of the most exciting features of Java 16 is the incubating Vector API (JEP 338), which makes it possible to take advantage of available SIMD instructions and, by doing so, significantly improve performance.
While reading an example from the JEP documentation, I was somewhat shocked to see that a simple scalar computation has to be rewritten as a hardly readable explicit vector loop to get the desired vectorized assembly.
"Phew, great that I don't use Java", I thought, and went on to see what Go would do in such a case. To my great disappointment, Go does not seem to support SIMD intrinsics and generates non-vectorized assembly :(
Convinced that Clang would not disappoint me, I checked the assembly in Compiler Explorer at the highest optimization level. Even though it performs a lot of useful optimizations, including loop unrolling, it still uses only 128-bit XMM registers by default. It readily switches to 512-bit ZMM registers, however, when AVX-512 Foundation support is requested via the -mavx512f flag.
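For reference, here is roughly what the loop in question looks like in C. This is only a sketch: it assumes the same computation as the JEP example, c[i] = -(a[i]*a[i] + b[i]*b[i]), and the function name scalar_computation and its signature are my own choice rather than anything taken from the JEP.

```c
#include <stddef.h>

/* Scalar version of the computation: c[i] = -(a[i]^2 + b[i]^2).
 * The name and signature are illustrative assumptions.
 * The loop is written in the plainest possible way so that any
 * vectorization is left entirely to the compiler. */
void scalar_computation(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}
```

Compiling this once with -O3 and once with -O3 -mavx512f should be enough to compare the register widths the compiler picks. Keep in mind that a binary built with -mavx512f may crash with an illegal instruction on a CPU that lacks AVX-512.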
AVX-512 reached Intel's mainstream server processors with Skylake-SP in 2017 (Haswell, which shipped in 2013, brought AVX2), so in 2021 there is a good chance that your servers have it.
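If you are not sure what a given machine supports, you can ask the CPU at run time. The sketch below assumes an x86 target and a GCC-compatible compiler (GCC or Clang), since it relies on the __builtin_cpu_supports builtin; the helper name has_avx512f is hypothetical, and on any other platform it simply reports false.

```c
#include <stdbool.h>

/* Returns true when the CPU we are running on reports AVX-512 Foundation
 * support. has_avx512f is an illustrative helper name. */
bool has_avx512f(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* Initialize the compiler's CPU feature data before querying it. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") != 0;
#else
    /* Not x86 or not a GCC-compatible compiler: assume no AVX-512. */
    return false;
#endif
}
```

A check like this is handy when you want to ship one binary and only take the AVX-512 code path on machines that actually have it.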
Moral of the story? Don't leave performance on the table - know your hardware and how to take full advantage of it.