One of the most common pieces of performance micro-optimization advice is to avoid branches. But there are no rules without exceptions, and to see why, let’s take a look at the partition function used in quicksort and quickselect, written with a template parameter that performs the conditional swap.
We’ll now compare two implementations of that swap: one that is branchless and one that uses a branch.
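To make the comparison concrete, here is a minimal sketch of what such a setup could look like. It is not the original code: the Lomuto partition scheme, the XOR-mask trick for the branchless swap, and the names `BranchSwap`, `BranchlessSwap`, and `lomuto_partition` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical policy: swaps a and b only when cond is true, using a branch.
struct BranchSwap {
    static void apply(std::uint32_t& a, std::uint32_t& b, bool cond) {
        if (cond) {
            const std::uint32_t tmp = a;
            a = b;
            b = tmp;
        }
    }
};

// Hypothetical policy: always writes both slots and uses a mask instead of
// a branch to decide whether the values actually change.
struct BranchlessSwap {
    static void apply(std::uint32_t& a, std::uint32_t& b, bool cond) {
        // mask is all ones when cond is true and all zeros otherwise.
        const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
        // diff is a ^ b when swapping and 0 when not.
        const std::uint32_t diff = (a ^ b) & mask;
        a ^= diff;
        b ^= diff;
    }
};

// Lomuto-style partition of v[lo, hi) around the last element (assumed
// scheme). SwapPolicy is the template parameter that performs the
// conditional swap. Returns the final index of the pivot.
template <class SwapPolicy>
std::size_t lomuto_partition(std::vector<std::uint32_t>& v,
                             std::size_t lo, std::size_t hi) {
    assert(lo < hi && hi <= v.size());
    const std::uint32_t pivot = v[hi - 1];
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        const bool less = v[i] < pivot;
        SwapPolicy::apply(v[i], v[store], less);
        store += less;  // advance only when an element moved to the left part
    }
    std::swap(v[store], v[hi - 1]);  // put the pivot into its final position
    return store;
}
```

Calling `lomuto_partition<BranchSwap>(v, 0, v.size())` and `lomuto_partition<BranchlessSwap>(v, 0, v.size())` on copies of the same data should leave both copies in the same state, so the two variants differ only in how the swap is carried out.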
Even though the branchless version contains more instructions, I’d still expect it to perform better. Let’s take both versions for a spin.
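One way to time the two versions is a small hand-rolled harness like the sketch below. The original measurement setup is not shown here, so everything in it is an assumption: the helper names `make_input` and `best_time_ms` are hypothetical, and a dedicated benchmarking library would likely give more reliable numbers.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Reproducible pseudo-random input: the same seed always yields the same data.
std::vector<std::uint32_t> make_input(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<std::uint32_t> v(n);
    for (auto& x : v) x = static_cast<std::uint32_t>(rng());
    return v;
}

// Calls run(work) on a fresh copy of input `repeats` times and returns the
// fastest wall-clock time in milliseconds. The copy is made outside the
// timed region, so only the call itself is measured.
template <class F>
double best_time_ms(F run, const std::vector<std::uint32_t>& input, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        std::vector<std::uint32_t> work = input;
        const auto t0 = clock::now();
        run(work);
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) best = ms;
    }
    return best;
}
```

Each partition variant would be passed as the callable, with the same input and repeat count for both. Taking the minimum over several runs reduces noise from the rest of the system, and it helps to use the result of the call (for example, by printing the returned pivot index) so the compiler cannot discard the work.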
But the results reveal that the “optimized” version is 1.5x slower.
I’m not sure why the assembly for the “optimized” version doesn’t use cmov, which might speed things up, but the moral of the story is that any rule may have exceptions, and it’s always best to measure before applying any performance advice.
You can explore the generated assembly using Compiler Explorer.