Parallel arrays performance.

Benchmarking in Go.

Apr 28, 2021

In the previous newsletter we ended up with Go code for both layouts: an array of structs and a struct of arrays.
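The listing from the previous issue is not reproduced here, so what follows is only a minimal sketch of the two layouts. The field names, the `Users` type, the value of `userMetadataSize`, and the `TotalAge` signatures are assumptions made for illustration, not the original code.

```go
package main

import "fmt"

// userMetadataSize is the number of extra bytes carried by each user.
// The value is arbitrary here; the benchmarks in this post vary it.
const userMetadataSize = 16

// User is one element of the array-of-structs layout (assumed fields).
type User struct {
	metadata [userMetadataSize]byte
	age      int
}

// Users is the array-of-structs layout: every age sits next to its metadata.
type Users []User

// TotalAge sums the ages, stepping over each whole User in memory.
func (us Users) TotalAge() int {
	total := 0
	for i := range us {
		total += us[i].age
	}
	return total
}

// ParallelUsers is the struct-of-arrays layout: ages are stored contiguously.
type ParallelUsers struct {
	ages     []int
	metadata [][userMetadataSize]byte
}

// TotalAge sums the ages by scanning a single dense slice.
func (pu ParallelUsers) TotalAge() int {
	total := 0
	for _, age := range pu.ages {
		total += age
	}
	return total
}

func main() {
	aos := Users{{age: 30}, {age: 40}}
	soa := ParallelUsers{
		ages:     []int{30, 40},
		metadata: make([][userMetadataSize]byte, 2),
	}
	fmt.Println(aos.TotalAge(), soa.TotalAge()) // prints "70 70"
}
```

Both versions compute the same sum; the only difference is how far apart consecutive ages are in memory.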

In theory, we expect TotalAge to be faster for ParallelUsers, since it should use the cache more efficiently and benefit more from prefetching. Conveniently, Go comes with built-in benchmarking support, so let's write some benchmarks.
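The benchmark code itself is not shown here, so the following is a sketch of what such a comparison could look like. The type layouts, `userCount`, the helper constructors, and the benchmark names are all hypothetical. In a `_test.go` file the two `Benchmark...` functions would be picked up by `go test -bench=.`; here `testing.Benchmark` is used so that the sketch also runs as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

const (
	// userMetadataSize is changed between runs (0, 4, 8, 16, 64, 256).
	userMetadataSize = 0
	// userCount is a hypothetical data-set size, not taken from the post.
	userCount = 1 << 20
)

// User is the array-of-structs element (assumed layout). metadata comes
// first because the gc compiler pads a struct whose last field has zero size.
type User struct {
	metadata [userMetadataSize]byte
	age      int
}

type Users []User

func (us Users) TotalAge() int {
	total := 0
	for i := range us {
		total += us[i].age
	}
	return total
}

// ParallelUsers is the struct-of-arrays counterpart (assumed layout).
type ParallelUsers struct {
	ages     []int
	metadata [][userMetadataSize]byte
}

func (pu ParallelUsers) TotalAge() int {
	total := 0
	for _, age := range pu.ages {
		total += age
	}
	return total
}

// newUsers and newParallelUsers build n users with identical ages.
func newUsers(n int) Users {
	us := make(Users, n)
	for i := range us {
		us[i].age = i % 100
	}
	return us
}

func newParallelUsers(n int) ParallelUsers {
	pu := ParallelUsers{
		ages:     make([]int, n),
		metadata: make([][userMetadataSize]byte, n),
	}
	for i := range pu.ages {
		pu.ages[i] = i % 100
	}
	return pu
}

// sink keeps the compiler from discarding the benchmarked calls.
var sink int

func BenchmarkUsersTotalAge(b *testing.B) {
	users := newUsers(userCount)
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = users.TotalAge()
	}
}

func BenchmarkParallelUsersTotalAge(b *testing.B) {
	users := newParallelUsers(userCount)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = users.TotalAge()
	}
}

func main() {
	testing.Init() // registers testing flags when running outside "go test"
	fmt.Println("Users:        ", testing.Benchmark(BenchmarkUsersTotalAge))
	fmt.Println("ParallelUsers:", testing.Benchmark(BenchmarkParallelUsersTotalAge))
}
```

Because `userMetadataSize` is an array length, it has to be a compile-time constant, which is why each measurement needs a rebuild with a different value.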

To establish a baseline and check the impact on cache efficiency, let's first set userMetadataSize to 0. Unsurprisingly, the two layouts produce very close results.

Increasing it to just 4 bytes results in a ~25% drop in performance.

Interestingly, increasing it to 8 does not change the numbers, which demonstrates the effect of User struct alignment. Once we exceed the alignment padding by increasing the metadata size to 16, we start losing more performance.
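The alignment effect can be checked directly with `unsafe.Sizeof`. The sketch below uses hypothetical variants of the User struct (an 8-byte age preceded by a metadata array) and assumes a 64-bit platform; the real struct may have other fields, so the absolute sizes are illustrative only.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical User variants that differ only in metadata length.
type user4 struct {
	metadata [4]byte
	age      int64
}

type user8 struct {
	metadata [8]byte
	age      int64
}

type user16 struct {
	metadata [16]byte
	age      int64
}

// userSizes returns the in-memory size of each variant, in bytes.
func userSizes() [3]uintptr {
	return [3]uintptr{
		unsafe.Sizeof(user4{}),
		unsafe.Sizeof(user8{}),
		unsafe.Sizeof(user16{}),
	}
}

func main() {
	s := userSizes()
	// On a 64-bit platform age must start at a multiple of 8 bytes, so
	// 4 bytes of metadata are followed by 4 bytes of padding.
	fmt.Println("4 bytes of metadata: ", s[0]) // 16 on a 64-bit platform
	fmt.Println("8 bytes of metadata: ", s[1]) // 16 on a 64-bit platform
	fmt.Println("16 bytes of metadata:", s[2]) // 24 on a 64-bit platform
}
```

Under these assumptions the 4-byte and 8-byte variants occupy the same 16 bytes, so the distance between consecutive ages is identical and the benchmark cannot tell them apart.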

Setting the metadata size to 64 results in extreme cache inefficiency and ~5x worse performance.

Since there are multiple layers of caches, the performance drop does not stop here: increasing the metadata size further to 256 results in a whopping ~16x degradation.
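A back-of-the-envelope way to see why the array-of-structs version keeps getting slower is to estimate how many age values a single cache line delivers. The sketch below assumes 64-byte cache lines and the hypothetical layout of an 8-byte age plus padded metadata; it ignores prefetching and the deeper cache levels, so it does not predict the measured ratios.

```go
package main

import "fmt"

// cacheLineSize is assumed to be 64 bytes, which is common but not universal.
const cacheLineSize = 64

// agesPerCacheLine estimates how many age values one cache line delivers
// when consecutive ages are stride bytes apart in memory.
func agesPerCacheLine(stride int) float64 {
	return float64(cacheLineSize) / float64(stride)
}

func main() {
	// Strides for an 8-byte age plus 0, 4 or 8, 16, 64, and 256 bytes of
	// metadata under the assumed layout.
	for _, stride := range []int{8, 16, 24, 72, 264} {
		fmt.Printf("stride %3d bytes: %.2f ages per cache line\n",
			stride, agesPerCacheLine(stride))
	}
}
```

Once the stride exceeds the cache line size, every age costs at least one full cache line fetch, and larger strides spread the same number of ages over even more memory.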

So we have a clear winner here: parallel arrays definitely deliver on their promise. Unfortunately, Go's compiler does not leverage SIMD instructions, so we cannot verify the effect of vectorization. In one of the next posts we'll use Rust to see what kinds of wins we can expect from parallel arrays backed by SIMD instructions.

To be continued...

Software Bits Newsletter
