Bit-testing on ARM64.

Or embracing modern CPU architecture.

In the previous article we've looked at how different compilers handle

on x86 CPU architecture.

But as ARM64 mobile space domination starts to expand to server market with solutions like AWS Graviton and desktop market with Apple's M1, this time we'll take a look at Aarch64 assembly generated by GCC:

I have annotated all assembly instructions with pseudo-code comments but even original code turned out to be fairly concise and readable:

  • cmp w0, 33 sets the unsigned lower flag if ch is smaller than 33, since the largest ch that can possibly match is 32 (0x20);

  • mov x1, 13824 and movk x1, 0x1, lsl 32 create a bitset mask in x1 where 1 is set at positions that correspond to values of ch + 1;

  • lsr x0, x1, x0 and and w0, w0, 1 selects a bit at the ch + 1th position;

  • finally csel w0, w0, wzr, cc sets w0 to the value of the bitmask select above in case ch < 33 and 0 otherwise.

In addition to being very concise, the generated assembly is also branchless, which is pipeline friendly, since there are no branches to mispredict and as such no reason to flush the pipeline.

The assembly generated by Clang does contain branches, just like its x86 version from the previous article:

  • sub w8, w0, #9 does w8 = w0 - 9 to shift the range of possible ch values from [9, 32] to [0, 23];

  • cmp w8, #5 and b.hs .LBB2_2 checks if w8 can potentially include 0x0009, 0x000A, 0x000C, 0x000D in the new [0,23] range and if that's the case, we have to check the last potential match, 32 using cmp w0, #32 and cset w0, eq;

  • mov w9, #27 creates a bitset mask 0b11011 in w9 that has 1 at each position corresponding to w8 values we are interested in - 0, 1, 3 and 4;

  • lsr w8, w9, w8 does w8 = w9 << w8 ensuring that the transformed value of ch stored in w8 is positioned at the rightmost position of w8;

  • and finally tbnz w8, #0, .LBB2_3 branches to .LBB2_3 in case of w8's rightmost bit is set that sets the return value w0 to true (#1).

Without benchmarks on target hardware it's hard to say which version is faster, but since Clang's version contains branches and more instructions, it's likely that GCC version would perform better.

Since iOS and M1 applications are usually compiled with Xcode that uses Clang, it would be interesting to benchmark performance sensitive applications with GCC to see if it makes a noticeable difference.