Dominik Grabiec Blog

Presenting at ACCU 2025

2025-02-24T11:00:00+00:00

Just a quick post about me attending and presenting two talks at the ACCU conference in April this year (2025).

The main talk will be an expanded version of my CppCon talk from last year titled “Optimising Data Building in Game Development”, going into more detail and hopefully with a more consistent presentation. The second talk will be a shorter presentation on mistakes that people have made in handling data during game development.

Originally I had only intended to do the one main presentation, but after submitting the proposal I felt that it would get rejected as I had already presented it. So I came up with some other ideas for shorter talks and submitted the best one, and now I am presenting both.

If you happen to be at the conference then I would be delighted if you would attend my talks, or even meet me in the hallway to say hi.

How to Layout Data in C++ Classes

2025-02-01T11:00:00+00:00

The layout of data members within a class is an important consideration in writing C++: it affects readability and understanding of the class, and can impact performance as well. There are a lot of things to consider when organising and ordering data members, and in this article I will go through my thoughts and explain the guidelines I use when writing C++ code.

Take note that these are just guidelines and one size will not fit all situations, so feel free to mix and match them as required. I’ve ordered them roughly from most readable and least packed to least readable and most packed.

Initializer List Order

The most important thing to remember is that the order of initialisation of the member variables happens in the order they are declared in the class definition, not in the order they are listed in the initializer list in the constructor. There is a C++ core guideline which says to keep the order of members in the class definition and in the initializer list the same.

The main effect of this in regards to the layout of data members is that you will want to initialise some data members before others, especially when they are not just simple assignments. For example computing the size from width and height parameters before allocating an array to store the data.
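
As a minimal sketch of that width and height case, assuming a hypothetical Image class (the class and its member names are made up for illustration), declaring the size before the storage guarantees it is computed before the allocation uses it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class, for illustration only.
class Image
{
public:
	Image(std::size_t width, std::size_t height)
		: m_width(width)
		, m_height(height)
		, m_size(m_width * m_height)	// m_width and m_height are already initialised
		, m_pixels(m_size)		// m_size is already initialised
	{
	}

	std::size_t Size() const { return m_size; }
	std::size_t PixelCount() const { return m_pixels.size(); }

private:
	// Initialisation follows this declaration order, whatever the
	// order of the initializer list above.
	std::size_t m_width;
	std::size_t m_height;
	std::size_t m_size;		// Must be declared before m_pixels
	std::vector<unsigned char> m_pixels;
};

inline void ImageExample()
{
	const Image image(4, 3);
	assert(image.Size() == 12);
	assert(image.PixelCount() == 12);
}
```

If m_pixels were declared above m_size, the vector would be constructed from an uninitialised value even though the initializer list looks correct.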

Group Related Members

The first (and probably main) method of organisation is to group related members together so that they are logically close in the source code. An example of this is putting members related to storage of data in one group, and members related to efficiently finding the data in another group. I’d like to think that people do this naturally, grouping related members together surrounded with whitespace, but it may just be me.

One could argue that you don’t need to create these groupings because in principle a class should only do one thing, so all its members should be in one group. However in reality classes can do one thing but still contain many members and sub-systems used to accomplish that, or they are responsible for several things so it makes sense to group the members for those logically to aid readability.

One downside of this method is that you’re likely to get plenty of padding inside and in between groups of members, as smaller data members such as bool, int, etc. are placed next to larger data structures which have bigger alignment requirements. Though depending on the class it might not matter, as we will discuss below.

Group Members By Usage

A related method is to group members by usage, so that they are physically close to each other in memory and more likely to be on the same cache line in the processor. The main difference between the previous method and this is that you’re taking into consideration how the data is used and not just what part of the system it logically belongs to.

This can reduce readability of the class definition but in the same vein it can also increase performance in some specific circumstances.
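
A rough sketch of the idea, using a hypothetical Entity struct (the members and the 64 byte cache line size are assumptions, not taken from any real code base): the members read on every update come first so that they sit together, and the rarely touched ones follow.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example, for illustration only.
struct Entity
{
	// Hot: read every frame by the update loop
	float position[3];
	float velocity[3];
	std::uint32_t flags;

	// Cold: only touched when loading or debugging
	char debug_name[64];
	std::uint64_t creation_time;
};

// All hot members fall within the first 64 bytes of the struct,
// which is a common (but not universal) cache line size.
static_assert(offsetof(Entity, flags) + sizeof(std::uint32_t) <= 64);
```

Whether the hot members really share a cache line also depends on how the instances themselves are aligned in memory, so this is something to measure rather than assume.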

Minimise Padding

The next major method of organisation is to arrange the members of a class in a way that minimises padding and wasted space within the class. This is especially useful when you’re creating many thousands or millions of instances of these classes, as each byte of wasted space becomes significant in aggregate. This also helps to efficiently pack these classes contiguously in memory, but it can come at a cost to readability.

An important detail to remember here is that a class’s size is a multiple of its alignment, and its alignment is the highest alignment of its members. This means that padding will be inserted after the last data member to make the class’s size a multiple of its alignment.

The simplest way to minimise padding is to put the elements with the highest alignment requirements at the beginning of the class, followed by members with successively smaller alignment requirements, with byte sized elements like bool or uint8_t at the end. Of course in doing this there may be gaps created in between the larger members, and in this case fill the gaps by moving any appropriately sized elements in between the larger ones. If all goes well then there should not be any padding, or only a few bytes of padding at the end of the class.

Think of this as filling a jar with various sized rocks, first put in the biggest rocks (big members), then fill the gaps in with smaller pebbles (smaller members, integers, etc), and finally pour in the sand to fill in the remaining space (bytes and booleans).
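
To make the rocks and sand idea concrete, here is a small sketch with two hypothetical structs holding the same members in different orders. The per-member padding comments assume a typical 64-bit target where double is 8 byte aligned.

```cpp
#include <cstdint>

// Hypothetical structs, for illustration only.
struct Loose	// 32 bytes on a typical 64-bit target
{
	bool enabled;		// 1 byte, then 7 bytes of padding
	double weight;		// 8 bytes
	std::uint8_t kind;	// 1 byte, then 3 bytes of padding
	std::uint32_t id;	// 4 bytes
	std::uint16_t flags;	// 2 bytes, then 6 bytes of padding at the end
};

struct Tight	// 16 bytes
{
	double weight;		// 8 bytes (biggest rock first)
	std::uint32_t id;	// 4 bytes
	std::uint16_t flags;	// 2 bytes
	std::uint8_t kind;	// 1 byte
	bool enabled;		// 1 byte (the sand), no padding needed
};

static_assert(sizeof(Tight) == 16);
static_assert(sizeof(Tight) < sizeof(Loose));
```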

This works best with more plain-old-data style classes and structures that contain many smaller sized members that themselves have smaller alignment requirements and do not have any internal padding. If larger classes are used be aware that they might have their own internal padding which is created automatically due to their alignment requirements. A simple example of this is:

struct BufferView	// Size: 16 bytes, Alignment: 8 bytes
{
	void* data;	// Size: 8 bytes, Alignment: 8 bytes
	int size;	// Size: 4 bytes, Alignment: 4 bytes
			// Padding: 4 bytes
};

Making this a member of another class will introduce 4 bytes of padding each time it is used, even if it is followed by a 4-byte value which would otherwise fit within the padding.
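
The effect can be checked with a short sketch. Nested and Flat are hypothetical structs, and the sizes assume a 64-bit target with 8 byte pointers and 4 byte int:

```cpp
struct BufferView	// 16 bytes, including 4 bytes of padding at the end
{
	void* data;
	int size;
};

// Hypothetical struct which contains the view as a member.
struct Nested	// 24 bytes
{
	BufferView view;	// 16 bytes, padding included
	int count;		// Cannot reuse the padding inside view
};

// The same three values written out as direct members.
struct Flat	// 16 bytes
{
	void* data;
	int size;
	int count;	// Fits where the padding would have been
};

static_assert(sizeof(BufferView) == 16);
static_assert(sizeof(Nested) == 24);
static_assert(sizeof(Flat) == 16);
```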

Packing

The remedy for this situation is to use compiler-specific attributes and pragmas to specify the packing of elements within the classes. For MSVC this involves surrounding the class declaration(s) with a pragma pack declaration like so:

#pragma pack(push, 1)	// Could also be 4 instead of 1
struct BufferView	// Size: 12 bytes, Alignment: 1 byte
{
	void* data;	// Size: 8 bytes, Alignment: 8 bytes
	int size;	// Size: 4 bytes, Alignment: 4 bytes
};
#pragma pack(pop)

Likewise on GCC and Clang you can use the packed attribute to tell the compiler to pack the members tightly. Note that the packed attribute has to be applied to each class declaration separately in order to pack the elements as tightly as possible.
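
A minimal sketch of the GCC and Clang form, using the same BufferView as above (this will not compile with MSVC):

```cpp
// GCC and Clang only: the attribute is applied to this one declaration.
struct __attribute__((packed)) BufferView	// Size: 12 bytes on a 64-bit target, Alignment: 1 byte
{
	void* data;
	int size;
};

static_assert(sizeof(BufferView) == sizeof(void*) + sizeof(int));
static_assert(alignof(BufferView) == 1);
```

Be aware that members of a packed class may no longer be naturally aligned, so taking pointers or references to them needs care.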

Visual Studio 2022 version 17.8.0 introduced a neat little feature to show the size of a type or value in a tooltip. It helps as you can quickly and easily see what effect moving a member has on the size of the class. There are also other plugins which help visualise class members and padding, though I do not use them.

Compressing Members

In some cases just packing the elements is not enough, as the size of the members just exceeds a multiple of the alignment, thereby creating a relatively large amount of padding at the end of the class.

The following techniques can be used to combat this, helping reduce the size of the data and making it fit more nicely within a multiple of the alignment. This becomes important when these classes are stored contiguously and processed by performance critical code, as more instances can be packed within the same amount of memory.

Combine Booleans

The first technique is to combine bool values into a bitfield, as traditionally each bool value takes a byte. This can be done by having an integer and manually using masks, or by declaring a C++ bitfield. One trick I learned is to create a bitfield using bool, like bool a : 1;, bool b : 1; etc, which has the advantage of being descriptive but also combining adjacent values together.

So instead of something like this:

struct Example
{
	bool a;
	bool b;
	bool c;
};
static_assert(sizeof(Example) == 3);

It would instead be something like this:

struct Example
{
	bool a : 1;
	bool b : 1;
	bool c : 1;
};
static_assert(sizeof(Example) == 1);
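
For comparison, the manual version with an integer and masks could look something like the sketch below. The flag constants and helper functions are hypothetical names made up for this example.

```cpp
#include <cstdint>

// Hypothetical flag names, one bit each.
constexpr std::uint8_t FlagA = 1u << 0;
constexpr std::uint8_t FlagB = 1u << 1;
constexpr std::uint8_t FlagC = 1u << 2;

constexpr bool HasFlag(std::uint8_t bits, std::uint8_t mask)
{
	return (bits & mask) != 0;
}

constexpr std::uint8_t SetFlag(std::uint8_t bits, std::uint8_t mask, bool value)
{
	return static_cast<std::uint8_t>(value ? (bits | mask) : (bits & ~mask));
}

struct Example
{
	std::uint8_t flags = 0;	// Holds a, b, and c in bits 0 to 2
};
static_assert(sizeof(Example) == 1);
```

This takes the same space as the bool bitfield but needs more code at each point of use, which is why I prefer the bitfield when the flags have descriptive names.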

Bitpack Values

The next technique is to bitpack smaller integer values (and enumerations) together in a larger field. For example when you have 3 integer values that range only from 0 to 1000 you can store them as three 10 bit values inside a single uint32_t instead of one for each value. The left-over bits can be used for flags.

struct Example
{
	uint32_t r : 10;
	uint32_t g : 10;
	uint32_t b : 10;
	uint32_t a : 2;
};
static_assert(sizeof(Example) == sizeof(uint32_t));

Encode Values

A more advanced version of this would be to use some other encoding scheme to store multiple values in the same integer variable. This can be something simple like encoding in the same way that multidimensional array indexes are calculated. For example storing values a, b, c, as: v = (a + A * (b + B * c)), and then decoding it using / and %. There are other encoding schemes that could be used, but the more complex the encoding is the slower it will be to interact with the values.

An example of the simple multidimensional array index calculation:

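// MAX_X and MAX_Y are the exclusive upper bounds of x and y.
uint64_t encode(uint32_t x, uint32_t y, uint32_t z)
{
	// Widen to 64 bits first so the multiplications cannot overflow
	return x + static_cast<uint64_t>(MAX_X) * (y + static_cast<uint64_t>(MAX_Y) * z);
}

void decode(uint64_t value, uint32_t& x, uint32_t& y, uint32_t& z)
{
	// x is the innermost term of the encoding so it is extracted first
	x = static_cast<uint32_t>(value % MAX_X);
	value /= MAX_X;
	y = static_cast<uint32_t>(value % MAX_Y);
	value /= MAX_Y;
	z = static_cast<uint32_t>(value);
}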

Quantise Floating Point Values

If dealing with floating point values then a good technique is to quantise them and store them in smaller integer types. The simplest version is to store normalised values (ranging from 0.0 to 1.0) in unsigned integer types like uint8_t or uint16_t, where 0.0 maps to 0 and 1.0 maps to 255 or 65535 respectively. Values ranging from -1.0 to 1.0 can be similarly stored in signed integer types, and as long as the range is limited the values can be quantised into a smaller type while retaining reasonable accuracy. The tricky part is that the range has to be defined in code and not in data for there to be a reduction in size.

For example to quantise a value between 0 and 1 into a smaller integer:

uint8_t encode(float value)
{
	return static_cast<uint8_t>(value * 255.0f);
}

float decode(uint8_t value)
{
	return value / 255.0f;
}
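
The signed case mentioned above can be sketched in the same way. EncodeSigned and DecodeSigned are hypothetical names, and unlike the unsigned example this version clamps its input and rounds to the nearest value instead of truncating:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helpers. Maps [-1.0, 1.0] onto [-127, 127].
constexpr std::int8_t EncodeSigned(float value)
{
	const float scaled = std::clamp(value, -1.0f, 1.0f) * 127.0f;
	// Round to nearest rather than truncating towards zero
	return static_cast<std::int8_t>(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

constexpr float DecodeSigned(std::int8_t value)
{
	return static_cast<float>(value) / 127.0f;
}
```

The value -128 is deliberately left unused so that the range is symmetric and 0.0 maps exactly to 0.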

Drop Computable Values

Another technique which is common in graphics programming is to drop elements of compound types when they can easily be recomputed. When dealing with normalised vectors and unit quaternions, in addition to quantising their values, one of them can be dropped entirely. One caveat to that is the sign of the dropped element must be stored somewhere or explicitly known externally, as squaring a number will make it positive.
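
As a sketch of the case where the sign is known externally, assume a unit length normal whose z component is never negative. PackedNormal and ReconstructZ are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical type: a unit length normal whose z is known to be
// non-negative, so neither z nor its sign needs to be stored.
struct PackedNormal
{
	float x;
	float y;
};
static_assert(sizeof(PackedNormal) == 2 * sizeof(float));

inline float ReconstructZ(const PackedNormal& normal)
{
	// x*x + y*y + z*z == 1 for a unit vector. The max() guards against
	// a slightly negative value caused by rounding.
	const float zSquared = 1.0f - normal.x * normal.x - normal.y * normal.y;
	return std::sqrt(std::max(zSquared, 0.0f));
}

inline void ReconstructExample()
{
	const PackedNormal normal{ 0.6f, 0.0f };
	assert(std::fabs(ReconstructZ(normal) - 0.8f) < 1e-5f);
}
```

When z can be negative, one extra bit for the sign has to be stored alongside x and y, for example in a spare bit of a quantised value.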

Why and When

A lot of this only matters if the class you’re writing will have many (millions of) instances, in which case you need to organise your data for optimum efficiency. In all other cases you should make it easy to read and easy to understand.

To summarise:

If your class will only be instantiated a handful of times, then readability is far more important than data layout, so no need to optimise.
If your class has a container inside of it, then you will probably be better off optimising the class stored inside the container.
If your class contains another large class inside of it, then optimise that class first.
If you instantiate millions of instances then pay special attention to the class and optimise it.
If you are optimising to get a performance improvement then measure, measure, and measure!

CppCon 2024 Presentation & Review

2025-01-19T11:00:00+00:00

In September last year (2024) I attended CppCon to present my talk about optimising multi-threaded data building for game development. It was quite a hectic and busy experience, talking to people, attending sessions, many of which were not recorded, and most importantly learning about what other people are doing in the C++ community.

I’m posting this now quite a few months after the event, as I’ve been busy with work and other projects. Also the video of my presentation has been officially released, with the other videos for the event also slowly trickling out on the CppCon YouTube Channel¹.

This was my first ever public presentation and I was a bit nervous, so I’m thankful to my friends and colleagues who helped me do practice runs of the presentation at home and at work before going. This was incredibly helpful since at the event I had some technical difficulties with my laptop refusing to cooperate with the AV equipment², so I had to give the talk from memory with no speaker notes, just the slides on screen.

The event itself was kind of intense. I tried to attend every talk, especially the ones that weren’t being recorded (the open sessions in the morning, during the lunch break, and in the evening), meaning most days started at 8am and finished at 10pm, though with breaks in between. Luckily for me the hotel was pretty cool, with an impressive view of the Rocky Mountains in Colorado, and warm pools to soak in.

If people are interested in C++ I would definitely recommend going to CppCon, I had a blast there and would love to go again.

¹ If you want to you can buy access to all the CppCon videos here.
² I suspect this might be due to the laptop being old and not having enough power to transmit the HDMI signal over a distance greater than 1.5 metres, or something like that.

Presenting at CppCon 2024

2024-08-25T11:00:00+00:00

Earlier in May of this year I came across a call for submissions for the Game Development track at CppCon. Having had the loose idea for a talk in my head for the longest time I submitted a last minute proposal, and to my delighted surprise they accepted it!

The talk is titled “Techniques to Optimise Multi-threaded Data Building During Game Development”, and I know it’s quite a mouthful but they said to be detailed.

This will be my first big public presentation and I am honoured to have been selected. So now I am deep into preparing and practicing the presentation, writing, coding, editing, and everything else that comes with it. I’ll see what I can release publicly after the conference but at least the slides will be available on the CppCon github, and if I can I’ll post something on my own Github.

If you happen to be at the conference then I would be delighted if you would attend my talk, and if not then please say hi!

Upgrading Assert Macro in C++

2024-06-21T11:00:00+00:00

An article detailing investigations and upgrades to the Flexible Assert Macro to fix some oversights and add some C++20 features which improve the generated code. These updates are now available on Github.

Making Conditions Unlikely

The first update was to add the [[unlikely]] attribute to the assert condition. This will tell the compiler to generate the assembly code under the assumption that the condition will not be true at runtime (but not with the assumption it will never be true).

This actually changes the generated assembly quite a bit in places, moving the handling of the assert to the end of the function and out of the immediate code to execute. While I haven’t measured any performance impact of this change, the assembly code looks tidier, and because the regular function code can now be reached without taking a jump it should be more efficient.
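
To show where the attribute goes, here is a simplified sketch of such a macro. It is not the actual Flexible Assert Macro: SKETCH_ASSERT, HandleAssert, and CheckedLength are hypothetical names, the handler takes plain strings instead of a std::source_location, and the debug break is left out.

```cpp
#include <cstdio>
#include <exception>

// Hypothetical stand-in for the real assert handler.
inline void HandleAssert(const char* file, int line, const char* condition)
{
	std::fprintf(stderr, "%s(%d): Assert failed: %s\n", file, line, condition);
}

// Sketch only, not the actual Flexible Assert Macro.
#define SKETCH_ASSERT(condition) \
	do \
	{ \
		if (!(condition)) [[unlikely]] \
		{ \
			HandleAssert(__FILE__, __LINE__, #condition); \
			std::terminate(); \
		} \
	} while (false)

// constexpr only so that the passing path can be checked at compile time.
constexpr int CheckedLength(int length)
{
	SKETCH_ASSERT(length <= 255);
	return length;
}
```

The attribute is attached to the statement that runs when the assert fails, which is the branch the compiler is being told is unlikely to be taken.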

Before Unlikely

; 9    : 	ASSERT(name.length() <= 255);

	cmp	QWORD PTR [rcx+16], 255			; 000000ffH
	mov	rbx, rcx
	jbe	SHORT $LN2@Example
	lea	rax, OFFSET FLAT:??_C@_08MPNMAILL@Test?4cpp@
	mov	DWORD PTR $T1[rsp], 9
	mov	QWORD PTR $T1[rsp+8], rax
	lea	rdx, OFFSET FLAT:??_C@_0BF@CFNDKGCM@name?4length?$CI?$CJ?5?$DM?$DN?5255@
	lea	rax, OFFSET FLAT:??_C@_0HF@HDFDIDJI@int?5__cdecl?5Example?$CIconst?5class@
	mov	DWORD PTR $T1[rsp+4], 2
	lea	rcx, QWORD PTR $T1[rsp]
	mov	QWORD PTR $T1[rsp+16], rax
	call	?handle_assert@error@@YAXUsource_location@std@@PEBD@Z ; error::handle_assert
	int	3
	call	?terminate@std@@YAXXZ			; std::terminate
	int	3
$LN2@Example:

; ... Normal function code here ...

After Unlikely

; 9    : 	ASSERT(name.length() <= 255);

	cmp	QWORD PTR [rcx+16], 255			; 000000ffH
	mov	rbx, rcx
	ja	SHORT $LN20@Example

; ... Normal function code here ...

$LN20@Example:

; 9    : 	ASSERT(name.length() <= 255);

	lea	rax, OFFSET FLAT:??_C@_08MPNMAILL@Test?4cpp@
	mov	DWORD PTR $T1[rsp], 9
	mov	QWORD PTR $T1[rsp+8], rax
	lea	rdx, OFFSET FLAT:??_C@_0BF@CFNDKGCM@name?4length?$CI?$CJ?5?$DM?$DN?5255@
	lea	rax, OFFSET FLAT:??_C@_0HF@HDFDIDJI@int?5__cdecl?5Example?$CIconst?5class@
	mov	DWORD PTR $T1[rsp+4], 2
	lea	rcx, QWORD PTR $T1[rsp]
	mov	QWORD PTR $T1[rsp+16], rax
	call	?handle_assert@error@@YAXUsource_location@std@@PEBD@Z ; error::handle_assert
	int	3
	call	?terminate@std@@YAXXZ			; std::terminate
	int	3

Checking if Debugger is Attached

The next investigation was trying various ways to integrate a check to see if a debugger was attached before triggering the debug break. The main goal behind this was that when running the program with a debugger attached it would trigger the breakpoint and allow the programmer to see the assert that was triggered, and when running outside of a debugger it would just terminate without triggering a breakpoint.

The simplest way of doing this was to wrap the __debugbreak() (or DebugBreak()) with a check like if (IsDebuggerPresent()) { ... }. Doing this added a function call, a test, and a jump to the assert code, which in most cases made the code significantly larger. It also required forward declaring the function or including debugapi.h in what is otherwise a fairly low level header.

Another way of doing this was to move the IsDebuggerPresent() call to be inside the handle_assert function, and have that return a boolean indicating if the breakpoint should be triggered or not. This eliminated a function call instruction from the assert macro but it didn’t clean up the assembly all that much.

Overall I wasn’t happy with either of these solutions so I ended up looking for alternatives, but not ones which would require me to implement magical assembly or weird intrinsics. (For reference most alternatives involved manually implementing the IsDebuggerPresent() function by looking up the debugger present flag in the thread information block in Windows. As such I didn’t want the support burden to keep this up to date with newer versions of Windows.)

It was when I was investigating how to handle other program faults that I realised that asserts (and error handling in general) need to be handled differently in developer and retail versions of the program. During development you want to use breakpoints to catch problems early, either by running in a debugger or by being able to attach one as easily as possible. However in retail mode you cannot do that so you want to create a detailed error report with plenty of supporting information, and send that to yourself as a package in order to try and figure out what went wrong.

This means that a separate retail version of the assert macro and assert handler function will need to be created, though that can be done at a later time together with a more thorough error reporting system.

Actually Making it Fatal

The last thing to add was a call to std::terminate() inside the macros to actually make the asserts fatal and exit the program.

One interesting thing discovered by doing this was that in some cases adding the terminate function to the macro caused the compiler to move the implementation of the assert contents to the end of the function, in a similar way as when adding the unlikely attribute. But it did not do this in every situation, therefore using the unlikely attribute is still a good idea.

Classifying Characters with Simple Functions

2023-12-08T11:00:00+00:00

This is the second in a series of articles I’m writing on character classification as used in lexers and compilers. In this I describe the simplest method of character classification which is using plain functions with the logic directly inside.

This is the simplest method to understand and implement: it’s the logic that you would write within an if statement, just wrapped up in a convenient and descriptive function. In general you would write a function for each character classification that you need to distinguish.

The Code

There are two main ways of performing the tests, checking a range of characters such as '0' - '9', and testing individual characters such as '$'. The way I’ve written the examples below is designed to make it easy to read the ranges being tested in each function.

constexpr bool IsNumber(char c)
{
	return '0' <= c && c <= '9';
}

constexpr bool IsAlpha(char c)
{
	return ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z');
}

constexpr bool IsWhitespace(char c)
{
	return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

These can also be broken down into more specific functions like IsLower and IsUpper, and combined to create character classifiers of a more broad type. By using the C++ constexpr keyword it pretty much guarantees (when compiled with any optimisation level enabled) that the function code will be inlined rather than cause a function call in assembly. So much so in fact that I had to remove the constexpr or make secondary non-constexpr functions when testing in order to see the assembly output for GCC and Clang. In some ways MSVC is nice in that it emits the assembly code for inlined and constexpr functions anyway.

constexpr bool IsNumber(char c)
{
	return '0' <= c && c <= '9';
}

constexpr bool IsLower(char c)
{
	return 'a' <= c && c <= 'z';
}

constexpr bool IsUpper(char c)
{
	return 'A' <= c && c <= 'Z';
}

constexpr bool IsAlpha(char c)
{
	return IsLower(c) || IsUpper(c);
}

constexpr bool IsAlphaNum(char c)
{
	return IsAlpha(c) || IsNumber(c);
}

Note that this is just a short selection of functions and not an exhaustive set that one might need.

The Generated Assembly

In examining the assembly generated by each compiler for the code in this article I made some interesting observations.

In general each compiler will attempt to optimise the code to the best of their ability, and some common optimisation techniques are:

Combining tests for adjacent values into a simpler range test.
Combining tests for disjoint but close values (within 64) into tests against a computed bit mask.

There are also differences between compilers in what assembly instructions they generate, with which you can make some generalisations:

Clang produces assembly code which tries to avoid small branches whenever possible, either by evaluating all conditions and then combining the result, or by using a jump table. This seems more suited towards newer processor architectures and reflects Clang’s relatively recent creation.
MSVC produces similar but smaller assembly code although it uses branches to short circuit evaluating all the conditions. This sort of code reminds me of programming for older processor architectures with more limited memory.
GCC produces code that can be seen as a mix of the other two compilers. Sometimes closer to MSVC and sometimes to Clang, and sometimes unfortunately it also produces the most confusing code.

Single Ranges

For functions which only test a single range, such as IsNumber, all compilers effectively generate code similar to:

IsNumber(char):
	sub     cl, 48
	cmp     cl, 9
	setbe   al
	ret     0

Which in C++ is equivalent to:

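bool IsNumber(char c)
{
	// The comparison must be unsigned (setbe) so that values below '0' wrap around and fail
	return static_cast<unsigned char>(c - 48) <= 9; // '0' == 48
}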

This also ends up being the general pattern used for testing ranges, where the minimum of the range is subtracted from the value and then it’s tested against the length of the range. In this way only a single comparison is needed rather than separately testing and branching for each of the minimum and maximum.
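
Written out by hand, the general pattern looks something like the sketch below. InRange is a hypothetical helper and not something the compilers generate by that name.

```cpp
// Hypothetical helper, assuming min <= max.
constexpr bool InRange(char c, char min, char max)
{
	// Values below min wrap around to large unsigned values, so a single
	// unsigned comparison covers both ends of the range.
	return static_cast<unsigned char>(c - min) <= static_cast<unsigned char>(max - min);
}

static_assert(InRange('0', '0', '9'));
static_assert(InRange('9', '0', '9'));
static_assert(!InRange('/', '0', '9'));	// One below '0'
static_assert(!InRange(':', '0', '9'));	// One above '9'
```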

Multiple Ranges

For functions which test multiple ranges the compilers generate different code at all optimisation levels. Using the IsAlphaNum function as our subject for comparison and compiling at the O2 optimisation level we can clearly see the differences.

MSVC generates assembly code which most accurately reflects the C++ language semantics in the original code. It tests each condition in an optimised form but then jumps to the end if true, mirroring the short-circuit evaluation of the original C++ source code.

bool IsAlphaNum(char) PROC
        lea     eax, DWORD PTR [rcx-65]
        cmp     al, 25
        jbe     SHORT $LN3@IsAlphaNum
        lea     eax, DWORD PTR [rcx-97]
        cmp     al, 25
        jbe     SHORT $LN3@IsAlphaNum
        sub     cl, 48
        cmp     cl, 9
        jbe     SHORT $LN3@IsAlphaNum
        xor     al, al
        ret     0
$LN3@IsAlphaNum:
        mov     al, 1
        ret     0
bool IsAlphaNum(char) ENDP

The MSVC implementation in the code above uses the lea instruction to compute the initial subtraction of the minimum value before testing. For example the first lea computes eax = ecx - 65.

GCC actually does a similar thing, where it jumps to the end if the first IsAlpha condition is true, but it only has a single branch as the assembly it generates for IsAlpha has no branches.

IsAlphaNum(char):
        mov     eax, edi
        mov     edx, 1
        and     eax, -33
        sub     eax, 65
        cmp     al, 25
        jbe     .L6
        sub     edi, 48
        cmp     dil, 9
        setbe   dl
.L6:
        mov     eax, edx
        ret

In the GCC implementation it uses and to make the character upper case and then performs the test on that, as -33 is 1101 1111 in binary, which clears the single bit that distinguishes lower case from upper case letters in ASCII.
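
A hand written C++ equivalent of that trick might look like this sketch, where IsAlphaFolded is a hypothetical name:

```cpp
constexpr bool IsAlphaFolded(char c)
{
	// 0x20 is the only bit that differs between 'A'..'Z' and 'a'..'z',
	// so masking with 0xDF (-33 as a signed byte) folds lower case onto upper case.
	const unsigned char upper = static_cast<unsigned char>(static_cast<unsigned char>(c) & 0xDF);
	// Single unsigned range test against 'A'..'Z'
	return static_cast<unsigned char>(upper - 'A') <= 25;
}

static_assert(IsAlphaFolded('a') && IsAlphaFolded('Z'));
static_assert(!IsAlphaFolded('@') && !IsAlphaFolded('['));
static_assert(!IsAlphaFolded('`') && !IsAlphaFolded('{'));
```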

Clang on the other hand follows the intention of the function and generates assembly which produces the right result but does not strictly represent the language semantics of the C++ code as written. Specifically it does not perform any short-circuit evaluation of the logical code and just tests all conditions, combining the result at the end.

IsAlphaNum(char):
        mov     eax, edi
        and     al, -33
        add     al, -65
        cmp     al, 26
        setb    cl
        add     dil, -48
        cmp     dil, 10
        setb    al
        or      al, cl
        ret

My intuition says that the branch-less code that Clang produces should run marginally faster than the other code with branches, as getting a branch misprediction costs many cycles, whereas executing a handful more simple instructions would almost be free.

Reordering Range Conditions

When writing functions that test a value against several conditions, it is generally advisable to put the test that partitions the search space the most first. For example the test that can eliminate 90% of values should go before the test that eliminates only 50%.

To apply it to this case would mean putting the test c <= '9' (which eliminates 198 of the 256 possible values when the character is treated as an unsigned byte) before c >= '0' (which eliminates 48 values), and likewise swapping the order of tests for the alphabet ranges.

However when investigating this with Compiler Explorer these changes generally had no effect on the generated assembly code, but in some cases made the assembly code worse.

For the simplest functions such as IsNumber there was no difference in the generated assembly.
For slightly more complicated functions such as IsAlpha the generated assembly was slightly larger and contained branching on all compilers.
Interestingly enough the reordered versions which called functions rather than do the comparisons directly were just as optimised as the simplest functions.

So in this case the idea to take away from this is to write simple straightforward code that is easy for both people and the compiler to understand.

Multiple Characters

The other type of test compares directly against individual characters rather than ranges of characters. An example of this is the IsWhitespace function from the beginning of the article, though here is a more complete version which tests all of the white-space characters including the lesser known form feed ('\f', 12 dec, 0x0C hex) and vertical tab ('\v', 11 dec, 0x0B hex).

I was not actually aware of these characters myself until I started looking into the Clang and Carbon compiler source code and then cross-referencing with the ASCII table.

bool IsWhitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == '\n';
}

MSVC compiles this into a couple of tests, one for the space character (32 dec, 0x20 hex), and one for the range of white-space characters (From 0x09 to 0x0D).

bool IsWhitespace(char) PROC
        cmp     cl, 32
        je      SHORT $LN3@IsWhitespa
        sub     cl, 9
        cmp     cl, 4
        jbe     SHORT $LN3@IsWhitespa
        xor     al, al
        ret     0
$LN3@IsWhitespa:
        mov     al, 1
        ret     0

Clang compiles this code into a range check and test against a computed bit mask, combining the result together using logical operations.

IsWhitespace(char):
        cmp     dil, 33
        setb    cl
        movabs  rax, 4294983168
        bt      rax, rdi
        setb    al
        and     al, cl
        ret
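
The constant 4294983168 is 0x100003E00 in hex, which has bits 9 to 13 and bit 32 set. A hand written C++ equivalent of the Clang output could look like this sketch, where IsWhitespaceMask is a hypothetical name:

```cpp
#include <cstdint>

constexpr bool IsWhitespaceMask(char c)
{
	// Bits 9..13 are '\t', '\n', '\v', '\f', '\r' and bit 32 is ' '.
	constexpr std::uint64_t mask = 0x100003E00ULL;	// 4294983168
	const auto index = static_cast<unsigned char>(c);
	// Range check first (cmp dil, 33), then test the bit (bt rax, rdi)
	return index < 33 && ((mask >> index) & 1) != 0;
}

static_assert(IsWhitespaceMask(' ') && IsWhitespaceMask('\t') && IsWhitespaceMask('\r'));
static_assert(!IsWhitespaceMask('a') && !IsWhitespaceMask('\0') && !IsWhitespaceMask('!'));
```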

GCC however compiles to something more interesting: it creates assembly code which uses a bit mask test to check most of the values and then explicitly checks for carriage-return (13 dec, 0x0D hex) and line-feed (10 dec, 0x0A hex).

IsWhitespace(char):
        cmp     dil, 32
        ja      .L12
        movabs  rax, 4294973952
        bt      rax, rdi
        setc    al
        test    al, al
        je      .L12
        ret
.L12:
        cmp     dil, 13
        sete    al
        cmp     dil, 10
        sete    dl
        or      eax, edx
        ret

My suspicion was that GCC tried to honour the ordering of the comparisons in the written C++ code, and it managed to collapse the first set of 4 comparisons into a bit field lookup, but it did not do so for the last two.

This was confirmed when I sorted the comparisons in the C++ function to match the ASCII values.

bool IsWhitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' ';
}

With this ordering both GCC and Clang generated nearly identical and heavily optimised assembly. (This is the GCC version)

IsWhitespace(char):
        lea     eax, [rdi-9]
        cmp     al, 4
        setbe   al
        cmp     dil, 32
        sete    dl
        or      eax, edx
        ret

MSVC generated the same tests but branches to return the result, therefore adhering to the short-circuit evaluation of the C++ code.

IsWhitespace(char) PROC
        lea     eax, DWORD PTR [rcx-9]
        cmp     al, 4
        jbe     SHORT $LN5@IsWhitespa
        cmp     cl, 32
        je      SHORT $LN5@IsWhitespa
        xor     al, al
        ret     0
$LN5@IsWhitespa:
        mov     al, 1
        ret     0

More Complex Comparisons

A more complete example is the IsSymbol function below, which I’ve written to classify all symbols used in ASCII just as they appear on my keyboard (US layout).

bool IsSymbol(char c)
{
    return c == '~' || c == '`' || c == '!' || c == '@'
        || c == '#' || c == '$' || c == '%' || c == '^'
        || c == '&' || c == '*' || c == '(' || c == ')'
        || c == '_' || c == '-' || c == '+' || c == '='
        || c == '[' || c == ']' || c == '{' || c == '}'
        || c == '|' || c == '\\' || c == ';' || c == ':'
        || c == '\'' || c == '"' || c == ',' || c == '.'
        || c == '<' || c == '>' || c == '/' || c == '?'
    ;
}

In this initial version MSVC ends up with the smallest and possibly cleanest assembly code. It performs a quick test against a short range at the end of the ASCII sequence (for '{', '|', '}', and '~'), then checks the value is within the desired range before doing a bit mask test for the remaining characters.

bool IsSymbol(char) PROC
        lea     eax, DWORD PTR [rcx-123]
        cmp     al, 3
        jbe     SHORT $LN3@IsSymbol
        sub     cl, 33
        cmp     cl, 63
        ja      SHORT $LN5@IsSymbol
        mov     rax, -288230371890266113
        bt      rax, rcx
        jb      SHORT $LN3@IsSymbol
$LN5@IsSymbol:
        xor     al, al
        ret     0
$LN3@IsSymbol:
        mov     al, 1
        ret     0
bool IsSymbol(char) ENDP
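To make the MSVC output easier to follow, here is a sketch of the same idea in C++: a short range test for the last four characters, then a 64-bit mask indexed by the character value minus 33. MakeMask, kSymbolMask and IsSymbolMasked are hypothetical names for this illustration, and ASCII is assumed. The static_assert checks that the mask built here is the same bit pattern as the constant -288230371890266113 in the assembly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: sets one bit per character in `chars`, relative to `base`.
constexpr std::uint64_t MakeMask(const char* chars, unsigned char base)
{
    std::uint64_t mask = 0;
    for (; *chars != '\0'; ++chars)
    {
        const unsigned offset = static_cast<unsigned>(static_cast<unsigned char>(*chars)) - base;
        assert(offset < 64); // every character must fit in the 64-bit window
        mask |= std::uint64_t{1} << offset;
    }
    return mask;
}

// All symbols in the range '!' (33) to '`' (96).
constexpr std::uint64_t kSymbolMask = MakeMask("!\"#$%&'()*+,-./:;<=>?@[\\]^_`", 33);
static_assert(kSymbolMask == 0xFC000000FE007FFFull, "same bits as the MSVC constant");

constexpr bool IsSymbolMasked(char c)
{
    const auto u = static_cast<unsigned char>(c);
    if (static_cast<unsigned char>(u - 123) <= 3)
        return true; // '{', '|', '}' and '~'
    const auto index = static_cast<unsigned char>(u - 33);
    return index <= 63 && ((kSymbolMask >> index) & 1u) != 0;
}
```

The range test comes first because the characters 123 to 126 fall outside the 64-value window that a single mask can cover when it starts at 33.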

Clang also generates simple code, although it uses a 94 entry jump table (offsets 0 to 93 from '!') instead of testing against a bit mask. While the code itself is compact, the table takes up 376 bytes, which occupies cache space and could affect performance.

IsSymbol(char):
        add     edi, -33
        cmp     edi, 93
        ja      .LBB1_2
        mov     al, 1
        lea     rcx, [rip + .LJTI1_0]
        movsxd  rdx, dword ptr [rcx + 4*rdi]
        add     rdx, rcx
        jmp     rdx
.LBB1_3:
        ret
.LBB1_2:
        xor     eax, eax
        ret
.LJTI1_0:
        .long   .LBB1_3-.LJTI1_0	# When true
        .long   .LBB1_2-.LJTI1_0	# When false
        # ...

Much like in the previous unordered IsWhitespace function it turns out that GCC doesn’t like the IsSymbol C++ code as written and produces quite a long and branchy sequence of assembly.

The actual assembly code is a bit of a mess, doing a lot of individual tests, some range tests, and a bit mask test. Though the bit mask only has 3 bits set, meaning that it’s only testing for 3 characters even though it could test nearly the entire range of symbols using it (as the MSVC assembly code does).


Investigations into Classifying Characters

2023-12-04T11:00:00+00:00

For an application that I’m building I am developing a compiler, and the first part of that is writing a Lexer to convert characters into tokens. A fundamental part of this is to classify characters (ASCII, UTF-8, etc) into character classes in an efficient manner so that they can be lexed appropriately. Doing some research and thinking about this provided me with three methods of classifying characters, but this made me curious and I had questions I wanted the answers to.

This was originally meant to be a single article describing my investigation but as I was writing it I started to have more questions and found more things to investigate. At this point I realised that there was too much to say about each method and so I split it up into a series of articles. This article introduces the project, the next few articles describe the methods, and a final article discusses the results and hopefully answers the questions.

Introduction (this article)
Classifying Characters with Simple Functions
Classifying Characters with Bit Masks
Classifying Characters with Table Lookup
Analysis and Results

I also want to give a mention to the article by Daniel Lemire which made me want to revisit this topic and write my findings in article form, as I was thinking that this would be too simple and basic a topic.

Methods

During my research I identified three basic methods of classification:

Simple functions using logical operators.
Bitmask lookup table for each classification.
Combined lookup table for all classifications.

All of these have been implemented in places in various forms, from writing the simple functions yourself, to using table based lookup in Clang, and various forms of bit mask lookups.
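To give a feel for how the three methods differ, here is a minimal sketch of each one applied to a digit test. This is only one reading of the descriptions above, not the code from the later articles: every name in it (IsDigitSimple, BitMask256, ClassTable and so on) is invented for the example, and ASCII is assumed.

```cpp
#include <cassert>
#include <cstdint>

// 1. Simple function using logical operators.
constexpr bool IsDigitSimple(char c) { return c >= '0' && c <= '9'; }

// 2. Bit mask lookup: one bit per character value, one mask per classification.
struct BitMask256 { std::uint64_t words[4]; };

constexpr BitMask256 MakeDigitMask()
{
    BitMask256 mask{};
    for (unsigned c = '0'; c <= '9'; ++c)
        mask.words[c / 64] |= std::uint64_t{1} << (c % 64);
    return mask;
}
constexpr BitMask256 kDigitMask = MakeDigitMask();

constexpr bool IsDigitMask(char c)
{
    const auto u = static_cast<unsigned char>(c);
    return ((kDigitMask.words[u / 64] >> (u % 64)) & 1u) != 0;
}

// 3. Combined lookup table: one byte of class flags per character value.
constexpr std::uint8_t kDigit = 1;
constexpr std::uint8_t kLower = 2;
constexpr std::uint8_t kUpper = 4;

struct ClassTable { std::uint8_t flags[256]; };

constexpr ClassTable MakeClassTable()
{
    ClassTable table{};
    for (unsigned c = '0'; c <= '9'; ++c) table.flags[c] |= kDigit;
    for (unsigned c = 'a'; c <= 'z'; ++c) table.flags[c] |= kLower;
    for (unsigned c = 'A'; c <= 'Z'; ++c) table.flags[c] |= kUpper;
    return table;
}
constexpr ClassTable kClassTable = MakeClassTable();

constexpr bool IsDigitTable(char c)
{
    return (kClassTable.flags[static_cast<unsigned char>(c)] & kDigit) != 0;
}
```

The bit mask costs 32 bytes per classification, while the combined table costs 256 bytes in total but answers several classifications with a single load.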

Questions

Given these options I was curious about how well they performed and had several questions I wanted answers to.

First and most importantly, which method was the fastest and most efficient in classifying characters?
What were the costs and trade-offs in each method?
What kind of optimisations would different compilers apply to each method?

While researching the answers to these questions I also thought of more questions:

What is the optimal trade-off between memory used for lookup tables versus code complexity?
Would it be possible to write simple compile time code to generate lookup tables?
Is it worth the extra code complexity to implement specific micro-optimisations in code?

Testing

The first and simplest way to evaluate these was to use Compiler Explorer to see what assembly instructions were generated for each method by each compiler. While this wouldn’t tell us which method is fastest it would be able to show which methods end up equivalent and which optimisations in C++ actually matter and which don’t. We would also be able to see how each compiler generates code and what the differences (if any) between them are.

The next way to evaluate each method would be with a micro-benchmark. This should tell us something about the relative performance of each method, in isolation from the rest of the system. The trick here is to write the test well enough that you’re testing the actual code and not just training the branch predictor of the processor.
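As a sketch of what such a micro-benchmark could look like, the code below feeds a classifier pseudo-random printable characters from a fixed seed, so the branch outcomes do not follow a simple pattern. This is an assumption about a reasonable setup rather than the harness used for this series. IsDigit, MakeRandomInput, BenchResult and RunBenchmark are all hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical classifier under test.
inline bool IsDigit(char c) { return c >= '0' && c <= '9'; }

// Builds printable ASCII input in an unpredictable order from a fixed seed,
// so repeated runs see the same data.
inline std::string MakeRandomInput(std::size_t length, unsigned seed = 12345)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(32, 126);
    std::string input(length, '\0');
    for (char& c : input)
        c = static_cast<char>(dist(rng));
    return input;
}

struct BenchResult
{
    std::size_t matches;              // returned so the loop has an observable result
    std::chrono::nanoseconds elapsed; // wall-clock time for one pass over the input
};

template <typename Classifier>
BenchResult RunBenchmark(const std::string& input, Classifier classify)
{
    assert(!input.empty());
    const auto start = std::chrono::steady_clock::now();
    std::size_t matches = 0;
    for (char c : input)
        matches += classify(c) ? 1 : 0;
    const auto stop = std::chrono::steady_clock::now();
    return { matches, std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start) };
}
```

The input should be long enough that the processor cannot simply learn the sequence across repeated passes, and the match count should be used (printed or compared) so the compiler cannot discard the loop.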

The last way to evaluate each method would be to integrate it with a simple lexer so we can see how well each method performs within a larger system and how it gets integrated at the assembly level. With this we can also feed it a larger data set than a micro-benchmark so any differences can be more easily observed.

Environments

Windows 10 Pro 22H2
Ubuntu 22.04 LTS (WSL2)
Compiler Explorer

Compilers

Microsoft C/C++ Compiler 19.37
Clang 17
GCC 13.2

C++ Anti-Patterns: Calling reserve in a loop

2023-10-31T11:00:00+00:00

This C++ anti-pattern is found in code where the programmer wanted to do the right thing for performance reasons, but for one reason or another it didn’t end up that way. That is when there’s a call to reserve inside of a loop. This may be the result of a refactor where a loop was added surrounding existing code, or just written with the best intentions.

This often ends up looking like:

Container container;
for (const auto& entry : entries)
{
	container.reserve(container.size() + entry.ItemCount());
	entry.AppendItems(container);
}

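The main problem with the code is that it becomes a pessimisation. A call to reserve typically allocates exactly the requested capacity (as the common std::vector implementations do) rather than growing geometrically, so in general use it will cause an allocation each time through the loop, defeating the whole purpose of reserving memory ahead of time.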
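The effect can be observed by counting how often the capacity changes. The sketch below assumes std::vector as the container and appends fixed-size batches. CountReallocations is a hypothetical helper, and the exact counts depend on the standard library implementation, so none are quoted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times the vector's capacity changes while appending
// `batches` groups of `batch_size` items, with or without reserve in the loop.
inline std::size_t CountReallocations(std::size_t batches, std::size_t batch_size, bool reserve_in_loop)
{
    assert(batch_size > 0);
    std::vector<int> container;
    std::size_t reallocations = 0;
    std::size_t last_capacity = container.capacity();
    for (std::size_t b = 0; b < batches; ++b)
    {
        if (reserve_in_loop)
            container.reserve(container.size() + batch_size); // the anti-pattern
        container.insert(container.end(), batch_size, 0);
        if (container.capacity() != last_capacity)
        {
            ++reallocations;
            last_capacity = container.capacity();
        }
    }
    return reallocations;
}
```

Comparing CountReallocations(100, 10, true) against CountReallocations(100, 10, false) on the common implementations shows the reserving version reallocating far more often, because each exact reserve leaves no spare capacity for the next batch.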

The simplest solution would be to just allocate a huge amount of memory for the container so that it should never reallocate. While in certain circumstances - such as in a scratch or very simple program - this is a valid solution, in general use it is not, as it makes it easy to run out of memory especially if the code gets run on multiple threads.

Container container;
container.reserve(1000000000); // Reserve a billion units
for (const auto& entry : entries)
{
	entry.AppendItems(container);
}

The best solution is to pre-allocate the container to the exact size required before the loop so it never has to grow but without over allocating. With this there are two general strategies to figure out the number of total elements:

Loop through the data twice, the first time accumulating the total number of elements, then reserving, and then looping through again to perform the required actions. This is most commonly used with strings or small data elements that are quick to iterate through.

size_t total_items = 0;
for (const auto& entry : entries)
{
	total_items += entry.ItemCount();
}
Container container;
container.reserve(total_items);
for (const auto& entry : entries)
{
	entry.AppendItems(container);
}

Instrument the code to find the maximum number of elements processed, or figure out how many elements will handle most cases and let the rare outliers reallocate.

For example, if in most cases the loop has 100 or fewer items to process but on rare occasions it has over 10,000, then it might make sense to always reserve 100 elements and, on those rare occasions where you have more than 10,000 elements, just let the container reallocate as needed.

The next best solution would be to just let the container grow organically, performing reallocations within the loop as required but not forcing it to reallocate. In most cases this should result in fewer allocations than in the pessimistic original case.

Container container;
for (const auto& entry : entries)
{
	entry.AppendItems(container);
}

Of course there will be a degenerate case which will cause a reallocation on each iteration of the loop, but this should be exceedingly rare, and in that case one of the other strategies should be employed.

Enhancements to the Blob Classes

2023-05-22T11:00:00+00:00

This is a short post to describe an update that I made to the Blob classes from the previous article.

Slice!

The first addition is a simple Slice function to the BlobView and BlobSpan classes. This is an alternative way to get a sub-view or sub-span of the data using two indexes (an inclusive begin index and an exclusive end index) rather than an offset and a count. It is most useful when you’re already accessing the data with iterator like indexes.

Typed Array

The next change is adding an ArrayView function to BlobView, a corresponding ArraySpan function to BlobSpan, and both functions to the Blob class. These return a std::span class which represents a typed array over the data. This is a more convenient way to access a contiguous array of items than using pointers and pointer arithmetic explicitly. It also provides iterators to access the data and pass it into other algorithms.

Originally when I implemented these functions there was a dedicated ArrayView class in the codebase, so these functions were intended to return that. However since C++20 the std::span class is available and provides nearly identical functionality, even if it overlaps somewhat with the Blob classes themselves.

Other Changes

The final changes include adding simple unit tests for the new functions and also adding a very simple natvis file.

I know that the unit tests for these classes aren’t exactly the best but they’ve helped me verify the code works as intended.

Source Code

As always these changes are live and available on Github.

Ever Useful Blob Classes

2023-04-28T11:00:00+00:00

In this article I describe (and make available) a set of simple but useful classes for use in file and other IO. I don’t claim that these are completely original or unique, but I have found them useful in various circumstances and therefore want to share them with others.

The core feature of these classes is to be able to take some memory and easily interact with it using types but in a clean C++ kind of way, without needing to do complicated pointer arithmetic or having to write explicit casts. They were also intended to be helpful in nature rather than explicitly preventing you from doing anything unsafe, though they still contain simple safeguards to catch misuse and memory bugs.

The main class of the set is the Blob class, which can be thought of as a unique buffer with some helper functions included. To go along with that there are two companion classes, a BlobView which is a read-only view over a piece of memory, and a BlobSpan which is a writeable span over a piece of memory.

These end up being very useful when writing binary file parsers since you can load the whole file (or part thereof) into a buffer and then use the classes to parse the file format. This also allows for portions of the loaded file to be passed to other functions for processing.

These classes were originally designed and implemented many years ago, after C++17 introduced the string_view class but before C++20 introduced its own generic span class. They have been iterated on over the years as new useful features have been added to the C++ standards, but they largely still adhere to the original simple design with minimal templates used only where necessary.

One more design decision of note is that these classes do not use exceptions, but rather assert if there’s been some sort of error. This was because originally they were developed as part of a home game engine where exceptions are not generally used. They can be adapted to use exceptions but it will require changing ASSERT code to throw and removing noexcept from the affected functions.

BlobView Class

This is the simplest class to describe as it is a read-only view over a piece of memory. It stores a const void* pointer and a size_t size, and because it doesn’t own the data it is pointing to it can be freely and cheaply copied, moved, created, and destroyed.

In addition to a set of simple constructors it has a number of convenience constructors that take existing container classes and construct a view over the top of them.

template <typename T, typename A>
[[nodiscard]] constexpr BlobView(const std::vector<T, A>& vector) noexcept;

template <typename T, size_t N>
[[nodiscard]] constexpr BlobView(const std::array<T, N>& array) noexcept;

template <typename T, size_t N>
[[nodiscard]] constexpr BlobView(const T (&array)[N]) noexcept;

These constructors allow creating a view over containers of arbitrary types, though in general use it is intended that the type T be a trivial or primitive type, most likely some sort of byte like uint8_t or unsigned char.

In addition to these there is an implicit constructor taking a BlobSpan object, which will be used to convert from the writeable view of the memory to a read-only one in this class.

Like any other view class it has simple ways to access the underlying data.

[[nodiscard]] constexpr bool Empty() const noexcept;
[[nodiscard]] constexpr size_t Size() const noexcept;
[[nodiscard]] constexpr const void* Data() const noexcept;

[[nodiscard]] inline const void* Data(const size_t offset) const;

In this case the first Data() function just returns the pointer stored in the object, much like the Size() function returns the size member, whereas the second Data(offset) function returns the pointer offset by the specified number of bytes.

The real magic though is in the other data access functions, where they return the underlying memory at a specified byte offset, cast to a specific type.

template <typename T>
[[nodiscard]] inline const T* Pointer(const size_t offset = 0) const;

template <typename T>
[[nodiscard]] inline const T& As(const size_t offset = 0) const;

The As<>() function returns a reference to the data as a specific type, making it useful for accessing the contents of the memory, either as single values like a uint32_t to quickly get the size or count of something, or as a plain old data struct to interpret things like a file header.

uint32_t count = view.As<uint32_t>();
// or
const auto& file_header = view.As<FileHeader>();

Likewise the Pointer<>() function returns a typed pointer, making it useful for accessing arrays of things within the memory. It can also be used to expand the interface and add support for functions returning other view-like classes such as a typed array view.
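For readers who want to see the mechanics, the following is a minimal sketch of how accessors like these could be written. It is not the code from the repository: MiniTypedView and FileHeader are hypothetical, and the bounds and alignment asserts are assumptions about what simple safeguards might look like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for BlobView, reduced to the two typed accessors.
class MiniTypedView
{
public:
    MiniTypedView(const void* data, std::size_t size) noexcept : m_data(data), m_size(size) {}

    // Returns the memory at `offset` as a pointer to T.
    template <typename T>
    const T* Pointer(std::size_t offset = 0) const
    {
        assert(offset <= m_size && sizeof(T) <= m_size - offset); // stay inside the view
        const auto* bytes = static_cast<const unsigned char*>(m_data) + offset;
        assert(reinterpret_cast<std::uintptr_t>(bytes) % alignof(T) == 0); // T must be aligned here
        return reinterpret_cast<const T*>(bytes);
    }

    // Returns the memory at `offset` as a reference to T.
    template <typename T>
    const T& As(std::size_t offset = 0) const
    {
        return *Pointer<T>(offset);
    }

private:
    const void* m_data;
    std::size_t m_size;
};

// Hypothetical plain old data header used only for this example.
struct FileHeader
{
    std::uint32_t magic;
    std::uint32_t count;
};
```

With a view over a buffer that starts with such a header, view.As<FileHeader>().count and view.As<std::uint32_t>(4) read the same four bytes, which is the kind of access the functions are meant to make convenient.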

The final lot of functions included are ones which return a sub-view of the current view. These are most useful when you want to work with a smaller piece of memory, such as when handing it off to a more specific parsing function.

[[nodiscard]] inline BlobView SubView(size_t offset = 0) const;
[[nodiscard]] inline BlobView SubView(size_t offset, size_t bytes) const;

At various points in time and in different implementations of this class I have also included a BlobView Slice(size_t begin, size_t end) function. This extracts out a sub-view by explicitly specifying the beginning (inclusive) and ending (exclusive) of the memory. While this function is not strictly necessary it can be a nice convenience if the things you are parsing specify memory ranges in terms of begin & end rather than start & size.
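The relationship between the two forms is simple enough to show in a few lines. This sketch uses a hypothetical MiniByteView rather than the real class and expresses Slice in terms of SubView. The asserts are an assumption about how misuse might be caught.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for BlobView with only the sub-view functions.
class MiniByteView
{
public:
    MiniByteView(const void* data, std::size_t size) noexcept : m_data(data), m_size(size) {}

    std::size_t Size() const noexcept { return m_size; }
    const void* Data() const noexcept { return m_data; }

    // Offset and byte count form.
    MiniByteView SubView(std::size_t offset, std::size_t bytes) const
    {
        assert(offset <= m_size && bytes <= m_size - offset);
        return MiniByteView(static_cast<const unsigned char*>(m_data) + offset, bytes);
    }

    // Everything from `offset` to the end of the view.
    MiniByteView SubView(std::size_t offset = 0) const
    {
        assert(offset <= m_size);
        return SubView(offset, m_size - offset);
    }

    // Begin (inclusive) and end (exclusive) form, expressed through SubView.
    MiniByteView Slice(std::size_t begin, std::size_t end) const
    {
        assert(begin <= end);
        return SubView(begin, end - begin);
    }

private:
    const void* m_data;
    std::size_t m_size;
};
```

So view.Slice(2, 6) and view.SubView(2, 4) describe the same four bytes, and which one reads better depends on how the format being parsed states its ranges.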

BlobSpan Class

This class is like the BlobView class above except it returns non-const values of the data so that the memory can be changed.

One difference between the BlobSpan and the BlobView class is that construction of a span from a view is prohibited by marking the constructor as deleted. This prevents accidental creation of writeable spans from read-only views of the memory.

In previous versions I had implemented both the const and non-const member functions to access the data but I found it to be unnecessary as changing the underlying data does not need to change the BlobSpan object at all. Therefore it was sufficient to have the const member functions return non-const references to the data.
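Both points can be sketched with two tiny classes. ReadView and WriteSpan are hypothetical stand-ins for BlobView and BlobSpan, stripped down to the conversion rules and a single accessor, and the static_asserts spell out which conversions compile.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

class WriteSpan;

// Hypothetical read-only view.
class ReadView
{
public:
    ReadView(const void* data, std::size_t size) noexcept : m_data(data), m_size(size)
    {
        assert(data != nullptr || size == 0);
    }
    ReadView(const WriteSpan& span) noexcept; // implicit: writeable to read-only is safe

    const void* Data() const noexcept { return m_data; }
    std::size_t Size() const noexcept { return m_size; }

private:
    const void* m_data;
    std::size_t m_size;
};

// Hypothetical writeable span.
class WriteSpan
{
public:
    WriteSpan(void* data, std::size_t size) noexcept : m_data(data), m_size(size)
    {
        assert(data != nullptr || size == 0);
    }
    WriteSpan(const ReadView&) = delete; // read-only to writeable is forbidden

    // A const member function returning a non-const pointer: writing through
    // the span changes the memory it refers to, not the span object itself.
    void* Data() const noexcept { return m_data; }
    std::size_t Size() const noexcept { return m_size; }

private:
    void* m_data;
    std::size_t m_size;
};

inline ReadView::ReadView(const WriteSpan& span) noexcept
    : m_data(span.Data()), m_size(span.Size()) {}

static_assert(std::is_convertible<WriteSpan, ReadView>::value, "span converts to view");
static_assert(!std::is_constructible<WriteSpan, ReadView>::value, "view cannot become span");
```

This mirrors how a pointer behaves: a const pointer to non-const data still allows the data to be modified.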

Blob Class

This class can be best thought of as a unique buffer with functions from both the BlobView and BlobSpan classes combined. It is designed to own the buffer and provide access functions to interact with the memory, which also include functions to get spans and views of the data.

One big difference between this and the BlobSpan class is that the non-const member functions return non-const pointers and references to the data, whereas the const member functions only return const pointers and references. This was done to avoid having to use different names for the const and non-const ways to access the data, and it allows for a const Blob to act like a view of the memory.

Another difference is that the Blob class owns the memory rather than being just a span or a view onto it. It can allocate the memory itself, be given memory to manage from an external source, or move ownership of memory from another Blob. It is also a move-only class because copying potentially huge buffers should not be done in a copy constructor that could be called accidentally.

[[nodiscard]] explicit Blob(size_t size_in_bytes); // Allocates memory
[[nodiscard]] inline Blob(void* data, size_t size); // Takes ownership of memory
template <typename T>
[[nodiscard]] Blob(std::unique_ptr<T[]>&& buffer, size_t elements);

To help with this there are several functions to explicitly deal with the buffer. These are Reset, Release, Clear, and Copy.

Reset is simple as it just frees the memory and clears the internal state.
Release is a function which releases ownership of the memory from the blob, returning a pair of pointer and size to the caller. This is potentially dangerous but as in std::unique_ptr it can serve a purpose.
Clear is just a helper function for zeroing out the memory before writing data to it.
Copy is an explicit function to create a copy of the memory and return it in a new Blob object. It is implemented this way so it will be clear in code when an explicit copy of the memory is desired.
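As a rough illustration of how these four functions fit together on a move-only owner, here is a much reduced sketch. MiniBlob is hypothetical and stores its memory in a std::unique_ptr<unsigned char[]>, which is an assumption made for brevity and not a description of how the real Blob class manages its buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical owning buffer, far smaller than the Blob class described above.
class MiniBlob
{
public:
    MiniBlob() noexcept = default;
    explicit MiniBlob(std::size_t size)
        : m_data(size != 0 ? new unsigned char[size] : nullptr), m_size(size) {}

    // Move-only: copies have to go through the explicit Copy() function.
    MiniBlob(MiniBlob&& other) noexcept
        : m_data(std::move(other.m_data)), m_size(std::exchange(other.m_size, 0)) {}
    MiniBlob& operator=(MiniBlob&& other) noexcept
    {
        assert(this != &other);
        m_data = std::move(other.m_data);
        m_size = std::exchange(other.m_size, 0);
        return *this;
    }
    MiniBlob(const MiniBlob&) = delete;
    MiniBlob& operator=(const MiniBlob&) = delete;

    std::size_t Size() const noexcept { return m_size; }
    unsigned char* Data() noexcept { return m_data.get(); }
    const unsigned char* Data() const noexcept { return m_data.get(); }

    // Frees the memory and clears the internal state.
    void Reset() noexcept { m_data.reset(); m_size = 0; }

    // Hands ownership to the caller, who must free it with delete[] as unsigned char*.
    std::pair<void*, std::size_t> Release() noexcept
    {
        return { m_data.release(), std::exchange(m_size, 0) };
    }

    // Zeroes the memory without changing its size.
    void Clear() noexcept { if (m_size != 0) std::memset(m_data.get(), 0, m_size); }

    // Explicit deep copy into a new object.
    MiniBlob Copy() const
    {
        MiniBlob copy(m_size);
        if (m_size != 0) std::memcpy(copy.m_data.get(), m_data.get(), m_size);
        return copy;
    }

private:
    std::unique_ptr<unsigned char[]> m_data;
    std::size_t m_size = 0;
};
```

Because the copy constructor is deleted, an accidental MiniBlob b = a; fails to compile, and the only way to duplicate the buffer is the visible a.Copy() call.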

In cases where there’s an actual unique buffer class in the codebase then the blob class can be implemented as a wrapper around it. This can actually be safer in practice as ownership of the memory will remain within the existing developed system.

Limitations

Now because these classes do not prevent you from doing certain things there is one important limitation that you should be aware of, which is not to cast to complex classes or specifically aligned types. This applies to the As(), Pointer(), and any other functions where there’s a template type parameter involved.

The reason not to cast to aligned types is that there’s no guarantee that a particular BlobView, BlobSpan, or an offset thereof, will be aligned to the required amount. Instead the cast should be to an underlying type which can then be loaded into the aligned class.

The simple example of this would be a Matrix class like class alignas(16) Matrix, which is aligned to 16 bytes as part of performance optimisations for SIMD execution. Instead of casting directly into the Matrix class using As, get a pointer to the data as floats using Pointer and use that to fill the Matrix object.

This situation can be detected and fixed by forbidding casting to aligned types, which can be checked at compile time in the code, but I have not implemented that yet.
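A sketch of the workaround for the Matrix example might look like the following. Matrix and LoadMatrix are hypothetical, and std::memcpy is used here in place of a typed float pointer so that the sketch also works when the source bytes are not even aligned for float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical SIMD-friendly type, aligned to 16 bytes as in the example above.
struct alignas(16) Matrix
{
    float m[16];
};

// Copies 16 floats out of raw memory into a properly aligned Matrix object,
// instead of casting the (possibly misaligned) memory to Matrix directly.
inline Matrix LoadMatrix(const void* data, std::size_t size, std::size_t offset = 0)
{
    assert(offset <= size && sizeof(float) * 16 <= size - offset);
    Matrix result;
    std::memcpy(result.m, static_cast<const unsigned char*>(data) + offset, sizeof(result.m));
    return result;
}
```

The returned Matrix lives in storage the compiler has aligned correctly, so it can be passed on to SIMD code regardless of where the bytes sat in the loaded file.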

The reason not to cast to complex classes (ones with a non-trivial constructor) is that they would maintain internal state and have various pre and post conditions which the memory would probably not adhere to. Though if you need to create such a class the solution is similar to the aligned case, cast the memory to compatible types/structs and then use them to create the complex class.

Source Code

The source code for these libraries is available on Github for use in most sorts of projects.
