Vectorized programming is a technique that uses vector instructions to operate on multiple data points at once, which can significantly improve the performance of programs that work with large amounts of data. Vector instructions are available on most modern CPUs; they let the CPU apply a single operation to multiple data elements in parallel, which can be much faster than processing them one at a time.
When it comes to SIMD, we usually think of the SSE family, AVX, and Neon, but vectorization was in use for years before any of them appeared. Let's take a look at a common parsing subroutine: checking whether a string has the hex prefix `0x`. The first thing that comes to mind is something like this
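(a minimal sketch of that obvious version; the `has_hex_prefix` name and the `std::string_view` signature are assumptions made for illustration)

```cpp
#include <string_view>

bool has_hex_prefix(std::string_view input) {
    // Check the two characters separately, accepting both "0x" and "0X".
    return input.size() >= 2 && input[0] == '0' &&
           (input[1] == 'x' || input[1] == 'X');
}
```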
which, by using a trick for fast lowercase transformation, can be reduced to
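(same assumptions as above; `c | 0x20` sets the bit that distinguishes ASCII lowercase from uppercase, so `'X'` becomes `'x'` and the two comparisons collapse into one)

```cpp
#include <string_view>

bool has_hex_prefix(std::string_view input) {
    // ASCII upper- and lowercase letters differ only in the 0x20 bit,
    // so OR-ing it in lowercases the second character for free.
    return input.size() >= 2 && input[0] == '0' && (input[1] | 0x20) == 'x';
}
```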
Clang generates the same assembly for both versions
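It looks roughly like this (a hand-written approximation of clang's optimized x86-64 output, not verbatim compiler output; registers and labels are illustrative):

```asm
has_hex_prefix:                      # approximation of clang -O2 output
        cmp     rsi, 2               # input.size() >= 2 ?
        jb      .not_hex
        cmp     byte ptr [rdi], 48   # input[0] == '0' ?
        jne     .not_hex
        movzx   eax, byte ptr [rdi + 1]
        or      al, 32               # input[1] | 0x20
        cmp     al, 120              # == 'x' ?
        sete    al
        ret
.not_hex:
        xor     eax, eax
        ret
```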
It may seem like there is nothing else to be done here. But we are dealing with just two single-byte characters, so why do we have to process them separately? What if we packed the first two chars into a 16-bit integer so that they could be processed as a single unit? And that's exactly what the ada URL parsing library does.
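A sketch of that approach, reconstructed from the description below rather than copied from ada's sources (the names are mine):

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

bool has_hex_prefix(std::string_view input) {
    if (input.size() < 2) return false;
    // Detect endianness at runtime (constexpr detection needs C++20).
    uint32_t probe = 1;
    bool is_little_endian = (reinterpret_cast<char*>(&probe)[0] == 1);
    uint16_t word_0x;
    std::memcpy(&word_0x, "0x", 2);            // the pattern, packed as uint16_t
    uint16_t first_two;
    std::memcpy(&first_two, input.data(), 2);  // the first two input bytes
    // Lowercase the *second* character: it sits in the high byte on
    // little-endian machines and in the low byte on big-endian ones.
    first_two |= is_little_endian ? 0x2000 : 0x0020;
    return first_two == word_0x;
}
```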
Since it's a generic library, it has to take endianness into account when converting the second character to lowercase with the `| 0x20` trick we used above: once the two bytes are packed into a `uint16_t`, that character lands in the high byte on little-endian machines and in the low byte on big-endian ones, so the mask becomes `0x2000` or `0x0020` respectively. Also, it's UB to `reinterpret_cast` string bytes into `uint16_t`, so `std::memcpy` has to be used until C++20's `std::bit_cast` is available.
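With C++20 the load could look like this (a hypothetical helper; `load_u16` is my name, not ada's):

```cpp
#include <array>
#include <bit>
#include <cstdint>

// std::bit_cast is well-defined and constexpr-friendly, unlike
// reinterpret_cast'ing the string bytes directly.
constexpr uint16_t load_u16(char a, char b) {
    return std::bit_cast<uint16_t>(std::array<char, 2>{a, b});
}
```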
Surely this implementation would result in a lot of assembly, right?
Turns out that optimizing compilers like clang easily get rid of the `memcpy`s, and in the end everything boils down to `and`, `cmp`, and `sete`.
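Roughly like this (again a hand-written approximation of optimized x86-64 output; clang folds the `| 0x2000` and the comparison into a single masked compare):

```asm
has_hex_prefix:                      # approximation of clang -O2 output
        cmp     rsi, 2               # input.size() >= 2 ?
        jb      .not_hex
        movzx   eax, word ptr [rdi]  # the input memcpy becomes one 16-bit load;
                                     # the pattern folds into a constant
        and     eax, 0xDFFF          # clear the case bit of the second byte
        cmp     eax, 0x5830          # "0X" packed little-endian, case bit cleared
        sete    al
        ret
.not_hex:
        xor     eax, eax
        ret
```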
In this article we only had to process 2 chars, but even without going all the way to something like AVX-512, we can easily process 8 chars at a time using a `uint64_t`, which is available on most platforms.
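For instance, here's a hypothetical case-insensitive check for an 8-character `https://` prefix built on the same trick (my example, not from the article or ada):

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

bool has_https_prefix(std::string_view input) {
    if (input.size() < 8) return false;
    uint64_t pattern;
    std::memcpy(&pattern, "https://", 8);       // 8 chars fit a uint64_t exactly
    uint64_t first_eight;
    std::memcpy(&first_eight, input.data(), 8);
    // '/' (0x2F) and ':' (0x3A) already have the 0x20 bit set, so OR-ing
    // every byte with 0x20 lowercases the letters without touching them.
    // Both values are packed the same way, so endianness cancels out.
    return (first_eight | 0x2020202020202020ULL) == pattern;
}
```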