How often do we think about which if
branch should go first? And even when we do, do we use some scientific approach, gut feel or a stylistic approach? And finally, does it even matter? Let’s find out.
Let’s use a simple counting function that returns the number of nums
that are equal and not equal to needle
:
We’ll use 2 benchmarks to compare the time it takes to execute counting function when all elements are equal to needle
and when all elements are not equal to needle
The results below show that it’s almost 2X slower to count no
s:
To understand why let’s take a look at the generated assembly. In particular at the lines handling if
branches:
These lines show that after evaluating condition CMPQ R8, AX
, for the true
case we have to execute JNE 18
, which, in theory, should be easy for a branch predictor to predict, and INCQ SI
before we start another loop iteration with JMP 10
. And in case of false
, JNE 18
will jump to INCQ DI
and start a new loop iteration with JMP 10
.
So it looks like the same number of instructions is executed no matter which branch we take. So what gives?
Well, the difference is in JNE 18
effect - when condition is true, we don’t jump and continue processing following instructions taking advantage of the CPU pipeline
So while JNE 18
is being decoded, CPU can work on fetching INCQ SI
and while it’s executed, INCQ SI
will be decoded and JMP 10
fetched, but in case condition evaluates to false
we have to jump to INCQ DI
that hasn’t even been fetched yet and the work that was spent on fetching and decoding instructions for the true
case will be discarded.
To avoid this wasted effort and improve efficiency, CPUs have a branch predictor used for speculative execution and the best way to see its impact is to employ the power of randomness
The benchmark above looks almost identical to ones we’ve used previously but with one small but important difference - the 1s and 0s are populated using random number generator. This small difference has a huge impact though
resulting in ~80% slowdown compared to a version with all 1s and ~70% compared to a version of all 0s.
Since branch predictor behavior is not always obvious, it’s best to use perf
tool:
$ perf record ./binary
$ perf stat ./binary
Performance counter stats for './binary:
...
1,605,231,778 branch-misses # 16.04% of all branches
that will helpfully print the percentage of branches branch predictor made a wrong call and high numbers indicate room for improvement.
So in the perf run case, the binary is the Rand one? and it had 16% misses. What was the Yes' and the No's ?