Last time, we’ve taken a look at different ways to compute population count and ended up with:
Kernighan’s algorithm
parallel summing
Go’s
math.bits.OnesCount64
function that can leveragePOPCNT
instruction if it’s supported by the hardware
As expected, hardware supported instruction was a winner, but it was only marginally faster than parallel summing approach. The most likely suspect for such a small performance advantage is the runtime check for whether POPCNT
is supported, which clearly does not have to happen at each function invocation, but, unfortunately, does happen.
To verify this hypothesis, we’ll have to use Go’s inline assembly in order to access POPCNT
instruction directly:
Also, while doing some extra research on this topic, I have discovered a splendid paper - “Faster Population Counts Using AVX2 Instructions” that among other things includes Wilkes-Wheeler-Gill algorithm that I have added for comparison:
Unfortunately Go compiler does not inline assembly, so to make comparisons more fair, I’ve added //go:noinline
directive to all other functions to prevent them from being inlined.
And last but least, another frequently used approach for computing population count is to use tables with precomputed values. Just for the fun of it, I have added an implementation that uses a table for the first 256
integers:
Now it’s time to see how they perform. As a baseline let’s compute population count for 0
:
As expected, our inline assembly using POPCNT
instruction won, followed by a naive Kernighan’s implementation, for which 0
is the best case scenario. I also expected table based approach to perform poorly since it performs memory accesses.
Now, let’s see what happens for the case when all bits are set:
Since it’s the worst case for the Kernighan’s algorithm, its performance is the worst and is an order of magnitude slower than all other implementations. Since the rest of the algorithms have O(1)
time complexity, their numbers are not affected much.
So the conclusions are:
runtime checks in
math.bits.OnesCount64
add ~15% of overhead on top of plainPOPCNT
versionWilkes-Wheeler-Gill algorithm outperforms the parallel summing that is used in Go’s
OnesCount64
function in casePOPCNT
is not supported by hardware. I should probably submit a PR to improve itfunction invocation overhead is significant and among other things is caused by the fact that Go passes arguments using stacks instead of registers and will most likely be a topic of a future article
I highly recommend checking out “Faster Population Counts Using AVX2 Instructions” in case you need to compute population count for arrays of values
Full source code is available in the gist.