Efficient packing.

Or making the most out of your precious cache.

Jun 26, 2021

Due to combination of historical and hardware design reasons storage for the basic datatypes on an x86 or ARM processor doesn’t normally start at arbitrary byte addresses in memory and instead each type except char has an alignment requirement: chars can start on any byte address, 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. This requirement improves memory access speed because data can be fetched and stored in using a single instruction instead of multiple ones required in case data spans multiple machine word boundaries.

While the reasoning for having alignment requirements is interesting on its own, in this post we’ll focus on its practical implications. As always, to make things concrete, let’s take a look at a simple example struct:

Intuitively we’d expect this struct to take 4 bytes (a) + 1 byte (c) + 8 bytes (d) + 1 byte (e) = 14 bytes in total, but because of the alignment requirements outlined above, d has to have 8 byte alignment, which means that compiler has to insert 3 byte padding after c. With this correction, we’d expect sizeof(Record1) to be 4 + 1 + 3 + 8 + 1 = 17 bytes but there is one more type of padding we haven’t taken into account - struct trailing padding.

Without going into the details, the general idea is to ensure that the first address following the structure data has the same alignment. Since generally struct instances have the alignment of their widest scalar member, Record1’s alignment is 8, based on its widest 8 byte double d. The closest 8 byte aligned address after 17 is 24, so we should expect compiler to add 7 byte trailing padding to Record1 making its total size 24 bytes.

We can easily verify this by using C++’s static_assert:

Now the question is - is it the best we can do? Let’s move our double field d to the start of the struct:

Following the above rules we expect its size to be 8 bytes (d) + 4 bytes (a) + 1 byte c + 1 byte e + 2 byte trailing padding to ensure that the first address after Record2 is 8 byte aligned. This is confirmed by the static_assert:

As such this simple layout change allowed us to save 8 bytes of precious cache space. What does this it mean in practice? Let’s check:

        
#include <vector>

using namespace std;

struct Record1 {
    int a;
    char c;
    double d;
    bool e;
};

struct Record2 {
    double d;
    int a;
    char c;
    bool e;
};

const int N = 10000;

vector<Record1> recs1(N);
vector<Record2> recs2(N);

template <class T>
int count_as(const vector<T> recs) {
  int total = 0;
  for (const auto& rec : recs) {
    total += rec.a;
  }
  return total;
}

static void bench_1(benchmark::State& state) {
  int count = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(count += count_as(recs1));
  }
}
BENCHMARK(bench_1);

static void bench_2(benchmark::State& state) {
  int count = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(count += count_as(recs2));
  }
}
BENCHMARK(bench_2);

view raw padding_bench.cpp hosted with ❤ by GitHub

And we get an somewhat expected 1.5 speedup:

If it’s so easy to do, why don’t compilers do this for us? They certainly can, but since it’s not always safe, for example in case we ship raw bytes over the network and use type casts instead of expensive serialization, they don’t do this automatically. There is also an option to force compilers to pack structs tightly without any padding by using the packed attribute:

which depending on the hardware and access pattern may be even faster since it saves us 2 more bytes:

So what’s the best way to ensure we don’t leave extra space on the table?

with GCC we can use -Wpadded flag that does not have the most helpful output, but still points us in the right direction
pahole tool, which has a much more user-friendly output
and is also supported by the wonderful godbolt online compiler.

Software Bits Newsletter

Discussion about this post