In the previous newsletter we ended up with Go code for both the array-of-structs and the struct-of-arrays layouts:
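As a reminder, the two layouts can be sketched roughly as follows. Only the names User, ParallelUsers, TotalAge and userMetadataSize come from this series; the field names, the Users slice type and the method bodies are assumptions made for illustration.

```go
package main

import "fmt"

// userMetadataSize is the number of extra bytes stored next to each age.
const userMetadataSize = 4

// User is one element of the array-of-structs layout.
// The field names are assumptions for this sketch.
type User struct {
	Metadata [userMetadataSize]byte
	Age      int
}

// Users is the array-of-structs layout (hypothetical name).
type Users []User

// TotalAge sums the ages, stepping over the metadata of every element.
func (u Users) TotalAge() int {
	total := 0
	for i := range u {
		total += u[i].Age
	}
	return total
}

// ParallelUsers is the struct-of-arrays layout: each field gets its own slice.
type ParallelUsers struct {
	Metadata [][userMetadataSize]byte
	Ages     []int
}

// TotalAge sums the ages from one dense slice.
func (p ParallelUsers) TotalAge() int {
	total := 0
	for _, age := range p.Ages {
		total += age
	}
	return total
}

func main() {
	aos := Users{{Age: 20}, {Age: 30}, {Age: 40}}
	soa := ParallelUsers{
		Metadata: make([][userMetadataSize]byte, 3),
		Ages:     []int{20, 30, 40},
	}
	fmt.Println(aos.TotalAge(), soa.TotalAge())
}
```

Both methods compute the same sum; the only difference is how far apart consecutive ages sit in memory.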
In theory, we expect TotalAge to be faster for ParallelUsers, since it should use the cache more efficiently and benefit from better prefetching. Conveniently, Go comes with built-in benchmarking support, so let's write some tests:
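A minimal sketch of what such benchmarks might look like. The benchmark names, userCount, the newUsers and newParallelUsers helpers and the field names are all assumptions; normally the Benchmark functions would live in a _test.go file and be run with go test -bench=., but testing.Benchmark lets the sketch run as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

const (
	userMetadataSize = 4       // bytes of metadata stored next to each age
	userCount        = 1 << 16 // hypothetical data set size
)

type User struct {
	Metadata [userMetadataSize]byte
	Age      int
}

type Users []User

func (u Users) TotalAge() int {
	total := 0
	for i := range u {
		total += u[i].Age
	}
	return total
}

type ParallelUsers struct {
	Metadata [][userMetadataSize]byte
	Ages     []int
}

func (p ParallelUsers) TotalAge() int {
	total := 0
	for _, age := range p.Ages {
		total += age
	}
	return total
}

// newUsers and newParallelUsers build identical data in the two layouts.
func newUsers(n int) Users {
	users := make(Users, n)
	for i := range users {
		users[i].Age = i % 100
	}
	return users
}

func newParallelUsers(n int) ParallelUsers {
	p := ParallelUsers{
		Metadata: make([][userMetadataSize]byte, n),
		Ages:     make([]int, n),
	}
	for i := range p.Ages {
		p.Ages[i] = i % 100
	}
	return p
}

// sink keeps the compiler from discarding the summation loops.
var sink int

func BenchmarkUsersTotalAge(b *testing.B) {
	users := newUsers(userCount)
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = users.TotalAge()
	}
}

func BenchmarkParallelUsersTotalAge(b *testing.B) {
	users := newParallelUsers(userCount)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = users.TotalAge()
	}
}

func main() {
	fmt.Println("Users:        ", testing.Benchmark(BenchmarkUsersTotalAge))
	fmt.Println("ParallelUsers:", testing.Benchmark(BenchmarkParallelUsersTotalAge))
}
```

Writing the result to a package-level variable is a common way to make sure the loop under test is not optimized away.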
To establish a baseline and check the impact on cache efficiency, let's first set userMetadataSize to 0. Unsurprisingly, we get very close results:
Increasing it to just 4 bytes results in a ~25% drop in performance:
Interestingly, increasing it to 8 does not change the numbers:
This demonstrates the effect of User struct alignment. Once we exceed the alignment padding by increasing the metadata size to 16, we start losing more performance:
Setting the metadata size to 64 results in extreme cache inefficiency and ~5x worse performance:
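For intuition about why 64 bytes hurts so much, here is a rough back-of-the-envelope model. It assumes 64-byte cache lines and an 8-byte age, and it deliberately ignores prefetching and the deeper cache levels, so treat it as a sketch rather than a prediction of the measured numbers. The usefulFraction helper is hypothetical.

```go
package main

import "fmt"

// cacheLineSize is an assumption: 64 bytes is typical for current x86-64 CPUs.
const cacheLineSize = 64

// usefulFraction estimates what share of the bytes pulled into the cache is
// actual age data when every element is read once, in order.
// stride is the distance in bytes between consecutive ages.
func usefulFraction(fieldSize, stride int) float64 {
	// Small strides fetch every byte of every element; once the stride
	// reaches a cache line, each age costs roughly one full line.
	fetched := stride
	if fetched > cacheLineSize {
		fetched = cacheLineSize
	}
	return float64(fieldSize) / float64(fetched)
}

func main() {
	const ageSize = 8
	for _, meta := range []int{0, 8, 16, 64, 256} {
		stride := ageSize + meta
		fmt.Printf("metadata=%3d stride=%3d useful=%.3f\n",
			meta, stride, usefulFraction(ageSize, stride))
	}
}
```

In this model the parallel array always scores 1.0, because its stride equals the size of the age itself.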
Since there are multiple layers of caches, the performance drop does not stop here, and a further increase of the metadata size to 256 results in a whopping ~16x degradation:
So we have a clear winner here: parallel arrays definitely deliver on their promise. Unfortunately, Go's compiler does not auto-vectorize loops with SIMD instructions, so we cannot verify the effect of vectorization. In one of the next posts we'll use Rust to see what kinds of wins we can expect from parallel arrays backed by SIMD instructions.
To be continued...