How often do we think about which
if branch should go first? And even when we do, do we use some scientific approach, gut feel or a stylistic approach? And finally, does it even matter? Let’s find out.
Let’s use a simple counting function that returns the number of
nums that are equal and not equal to
We’ll use 2 benchmarks to compare the time it takes to execute counting function when all elements are equal to
and when all elements are not equal to
The results below show that it’s almost 2X slower to count
To understand why let’s take a look at the generated assembly. In particular at the lines handling
These lines show that after evaluating condition
CMPQ R8, AX, for the
true case we have to execute
JNE 18, which, in theory, should be easy for a branch predictor to predict, and
INCQ SI before we start another loop iteration with
JMP 10. And in case of
JNE 18 will jump to
INCQ DI and start a new loop iteration with
So it looks like the same number of instructions is executed no matter which branch we take. So what gives?
Well, the difference is in
JNE 18 effect - when condition is true, we don’t jump and continue processing following instructions taking advantage of the CPU pipeline
JNE 18 is being decoded, CPU can work on fetching
INCQ SI and while it’s executed,
INCQ SI will be decoded and
JMP 10 fetched, but in case condition evaluates to
false we have to jump to
INCQ DI that hasn’t even been fetched yet and the work that was spent on fetching and decoding instructions for the
true case will be discarded.
To avoid this wasted effort and improve efficiency, CPUs have a branch predictor used for speculative execution and the best way to see its impact is to employ the power of randomness
The benchmark above looks almost identical to ones we’ve used previously but with one small but important difference - the 1s and 0s are populated using random number generator. This small difference has a huge impact though
resulting in ~80% slowdown compared to a version with all 1s and ~70% compared to a version of all 0s.
Since branch predictor behavior is not always obvious, it’s best to use
$ perf record ./binary $ perf stat ./binary Performance counter stats for './binary: ... 1,605,231,778 branch-misses # 16.04% of all branches
that will helpfully print the percentage of branches branch predictor made a wrong call and high numbers indicate room for improvement.