Bit-testing on ARM64.
Or embracing modern CPU architecture.
In the previous article we looked at how different compilers handle bit-testing
on the x86 CPU architecture.
But as ARM64's dominance of the mobile space starts to expand to the server market, with solutions like AWS Graviton, and to the desktop market, with Apple's M1, this time we'll take a look at the AArch64 assembly generated by GCC:
I have annotated all assembly instructions with pseudo-code comments, but even the original code turned out to be fairly concise and readable:
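- `cmp w0, 33` sets the condition flags so that the unsigned lower (`cc`) condition holds if `ch` is smaller than `33`, since the largest `ch` that can possibly match is `32` (`0x20`);
- `mov x1, 13824` and `movk x1, 0x1, lsl 32` create a bitset mask in `x1` (`0x100003600`) where `1` is set at bit indices 9, 10, 12, 13 and 32, that is, at the `ch + 1`th position (counting from one) for each matching `ch`;
- `lsr x0, x1, x0` and `and w0, w0, 1` select the bit at the `ch + 1`th position;
- finally, `csel w0, w0, wzr, cc` sets `w0` to the value of the bitmask select above in case `ch < 33` and to `0` otherwise.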
In addition to being very concise, the generated assembly is also branchless, which is pipeline-friendly: there are no branches to mispredict and therefore no reason to flush the pipeline.
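The GCC sequence described above can be mirrored in C. This is only a sketch: the function names, the `unsigned` parameter type and the reference check are assumptions made for illustration, and the five matching values are taken from the text (`0x09`, `0x0A`, `0x0C`, `0x0D`, `0x20`). Whether a compiler emits `csel` or a branch for the final ternary is up to the compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference check: true for 0x09, 0x0A, 0x0C, 0x0D and 0x20. */
bool is_whitespace_ref(unsigned ch) {
    return ch == 0x09 || ch == 0x0A || ch == 0x0C || ch == 0x0D || ch == 0x20;
}

/* Hypothetical C rendering of the GCC instruction sequence. */
bool is_whitespace_bitset(unsigned ch) {
    /* mov x1, 13824 ; movk x1, 0x1, lsl 32  ->  bits 9, 10, 12, 13 and 32 */
    const uint64_t mask = UINT64_C(13824) | (UINT64_C(1) << 32);

    /* lsr x0, x1, x0 ; and w0, w0, 1
       The hardware uses the shift amount modulo 64; C leaves shifts of 64
       or more undefined, so the amount is masked explicitly here. */
    uint64_t bit = (mask >> (ch & 63)) & 1;

    /* cmp w0, 33 ; csel w0, w0, wzr, cc  ->  keep the bit only if ch < 33 */
    return ch < 33 ? bit != 0 : false;
}

/* Counts inputs on which the bitset version disagrees with the reference. */
int count_mismatches_bitset(void) {
    int mismatches = 0;
    for (unsigned ch = 0; ch < 1024; ++ch) {
        if (is_whitespace_bitset(ch) != is_whitespace_ref(ch)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The range check matters: without it, a value such as `73` (`64 + 9`) would select bit 9 of the mask after the shift amount wraps and would be reported as a match.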
The assembly generated by Clang does contain branches, just like its x86 version from the previous article:
- `sub w8, w0, #9` does `w8 = w0 - 9` to shift the range of possible `ch` values from `[9, 32]` to `[0, 23]`;
- `cmp w8, #5` and `b.hs .LBB2_2` check whether `w8` falls in `[0, 4]`, the part of the new `[0, 23]` range that can hold `0x0009`, `0x000A`, `0x000C` and `0x000D`; if it does not, execution branches to `.LBB2_2`, where the last potential match, `32`, is checked using `cmp w0, #32` and `cset w0, eq`;
- `mov w9, #27` creates a bitset mask `0b11011` in `w9` that has `1` at each position corresponding to the `w8` values we are interested in: `0`, `1`, `3` and `4`;
- `lsr w8, w9, w8` does `w8 = w9 >> w8`, moving the mask bit that corresponds to the transformed value of `ch` to the rightmost position of `w8`;
- finally, `tbnz w8, #0, .LBB2_3` branches to `.LBB2_3`, which sets the return value `w0` to `true` (`#1`), if the rightmost bit of `w8` is set.
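The control flow of the Clang version can be sketched in C as follows. The function name and the `uint32_t` parameter type are assumptions for illustration. The sketch also assumes that when `tbnz` is not taken (which only happens for `ch == 11`), execution falls through to the `ch == 32` check; the text above does not spell that out, though the result is `false` either way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical C rendering of the Clang instruction sequence. */
bool is_whitespace_branchy(uint32_t ch) {
    /* sub w8, w0, #9: wraps around to a large value when ch < 9 */
    uint32_t shifted = ch - 9;

    /* cmp w8, #5 ; b.hs .LBB2_2: skip the mask test unless shifted is in [0, 4] */
    if (shifted < 5) {
        /* mov w9, #27: 0b11011, bits 0, 1, 3 and 4 */
        uint32_t mask = 27;

        /* lsr w8, w9, w8 ; tbnz w8, #0, .LBB2_3 */
        if ((mask >> shifted) & 1) {
            return true; /* .LBB2_3: w0 = 1 */
        }
    }

    /* .LBB2_2: cmp w0, #32 ; cset w0, eq */
    return ch == 32;
}
```

Because the subtraction is unsigned, a single `shifted < 5` comparison covers both bounds of the `[9, 13]` range, which is why the assembly needs only one `cmp` before the mask test.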
Without benchmarks on the target hardware it's hard to say which version is faster, but since Clang's version contains branches and more instructions, GCC's version is likely to perform better.
Since iOS and M1 applications are usually compiled with Xcode, which uses Clang, it would be interesting to benchmark performance-sensitive applications built with GCC to see if it makes a noticeable difference.



