These days it’s very common to encounter all sorts of partition keys or unique identifiers for bucketing, partitioning and other means to distribute data across machines. Oftentimes these keys are created as a concatenation of some sort of data “features”. It’s not a secret that string operations are relatively expensive, so if possible it would be great to avoid at least some of them.
Let’s take a look at an example I encountered a few days ago: the bucketing key was constructed from 3 boolean values by a function, fmt1, that concatenated them into a string.
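The exact key format is not important here, so the following is only a minimal sketch of what such a function might look like. The parameter names, the "a0"/"a1" style tokens and the '_' separator are assumptions made up for illustration, not the real key layout.

```cpp
#include <string>

// Hypothetical reconstruction of fmt1: the parameter names, the feature
// tokens ("a0"/"a1", ...) and the '_' separator are illustrative assumptions.
std::string fmt1(bool a, bool b, bool c) {
    std::string key;
    key += a ? "a1" : "a0";
    key += '_';
    key += b ? "b1" : "b0";
    key += '_';
    key += c ? "c1" : "c0";
    return key;  // a fresh string is built on every call
}
```

With this format, fmt1(true, false, true) produces "a1_b0_c1".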
Even though fmt1 has 3 parameters, they all have low cardinality - just 2. So the total function input cardinality is just 8. As such we can technically write an if with 8 branches to handle each case, but it’s not going to look very pretty. Well, functions are just fancy tables, and since there are only 8 elements in this table, why don’t we materialize it?
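A materialized table could look roughly like the sketch below. The name fmt2, the key_index helper and the key format are assumptions carried over for illustration; the only real idea is packing the three flags into an index from 0 to 7.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Pack the three flags into an index in [0, 8): a is bit 2, b is bit 1, c is bit 0.
inline std::size_t key_index(bool a, bool b, bool c) {
    return (static_cast<std::size_t>(a) << 2) |
           (static_cast<std::size_t>(b) << 1) |
           static_cast<std::size_t>(c);
}

// Hypothetical table-based variant: all 8 possible outputs are built once,
// on first call, and then only looked up.
std::string fmt2(bool a, bool b, bool c) {
    static const std::array<std::string, 8> table = {
        "a0_b0_c0", "a0_b0_c1", "a0_b1_c0", "a0_b1_c1",
        "a1_b0_c0", "a1_b0_c1", "a1_b1_c0", "a1_b1_c1",
    };
    return table[key_index(a, b, c)];  // still copies the string on return
}
```

The order of the table entries has to match the bit layout used by key_index, which is the main thing to get right when writing such a table by hand.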
This looks fairly concise and easy to understand. But wait, there is more: these kinds of strings are usually used to build larger strings or are passed somewhere else by reference, so ideally we would avoid the copies as well. We cannot return string references from fmt1, since those references would be to locals, but tables are a different story: a table of string views can hand out its entries without copying, and we can even use our good old friend, constexpr.
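A copy-free variant might be sketched as follows, assuming C++17 for std::string_view and inline variables. The names kKeys and fmt3 and the key format are again illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// The views point at string literals, which have static storage duration,
// so handing them out never dangles and never allocates.
inline constexpr std::array<std::string_view, 8> kKeys = {
    "a0_b0_c0", "a0_b0_c1", "a0_b1_c0", "a0_b1_c1",
    "a1_b0_c0", "a1_b0_c1", "a1_b1_c0", "a1_b1_c1",
};

// Hypothetical constexpr lookup: a is bit 2, b is bit 1, c is bit 0.
constexpr std::string_view fmt3(bool a, bool b, bool c) {
    return kKeys[(static_cast<std::size_t>(a) << 2) |
                 (static_cast<std::size_t>(b) << 1) |
                 static_cast<std::size_t>(c)];
}
```

Because everything is constexpr, a call with constant arguments can be checked at compile time, for example with static_assert(fmt3(true, false, true) == "a1_b0_c1").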
Ok, it’s finally time to get some numbers. We’ll use brute-force benchmarks.
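A brute-force timing loop could be as simple as the sketch below. It is not the harness behind the numbers in this post; the helper name bench_ns_per_call and the volatile sink trick are assumptions, and a real benchmarking library would guard against compiler optimizations more carefully.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Calls f(a, b, c) `iterations` times, cycling through all 8 input
// combinations, and returns the average wall-clock time per call in nanoseconds.
template <typename F>
double bench_ns_per_call(F&& f, std::uint64_t iterations) {
    volatile std::size_t sink = 0;  // discourages the optimizer from dropping the calls
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto key = f((i & 4) != 0, (i & 2) != 0, (i & 1) != 0);
        sink = sink + key.size();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Any callable that takes three bools and returns something with a size() member can be passed in, so the same loop can time the concatenating function and both table variants with a large iteration count.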
And it looks like our effort was not wasted. The lookup table of strings is 3X faster than concatenation, and the lookup table of string views is 18X faster.
Even though our function took only boolean parameters, this approach works for other types as well. Obviously, higher cardinalities require larger tables that may get somewhat unwieldy. Fortunately, constexpr functions can be used to construct such tables and remove the need for manual construction, at the cost of some additional compile time.
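As a rough sketch of that idea, the 8 keys from the boolean example can be generated at compile time instead of being typed out. This assumes C++17, and the Key struct, make_key, make_table and fmt_generated names, as well as the key format, are all hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Fixed-size storage for one key; 8 chars fit the assumed "a1_b0_c1" format.
struct Key {
    std::array<char, 8> chars{};
    constexpr std::string_view view() const { return {chars.data(), chars.size()}; }
};

// Builds the key for one table index: a is bit 2, b is bit 1, c is bit 0.
constexpr Key make_key(std::size_t index) {
    Key k{};
    const char names[3] = {'a', 'b', 'c'};
    std::size_t pos = 0;
    for (std::size_t i = 0; i < 3; ++i) {
        if (i != 0) k.chars[pos++] = '_';
        k.chars[pos++] = names[i];
        k.chars[pos++] = ((index >> (2 - i)) & 1) ? '1' : '0';
    }
    return k;
}

// Fills the whole table during constant evaluation.
constexpr std::array<Key, 8> make_table() {
    std::array<Key, 8> table{};
    for (std::size_t i = 0; i < table.size(); ++i) table[i] = make_key(i);
    return table;
}

inline constexpr auto kGeneratedKeys = make_table();

constexpr std::string_view fmt_generated(bool a, bool b, bool c) {
    return kGeneratedKeys[(static_cast<std::size_t>(a) << 2) |
                          (static_cast<std::size_t>(b) << 1) |
                          static_cast<std::size_t>(c)].view();
}
```

The returned views point into kGeneratedKeys, which lives for the whole program, so they stay valid just like the views over string literals.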