Classifying Characters with Simple Functions

This is the second in a series of articles I’m writing on character classification as used in lexers and compilers. In this I describe the simplest method of character classification which is using plain functions with the logic directly inside.

This is the simplest method to understand and implement, it’s the logic that you would write within an if statement just wrapped up in a convenient and descriptive function. In general you would write a function for each character classification that you need to distinguish.

The Code

There are two main ways of performing the tests, checking a range of characters such as '0' - '9', and testing individual characters such as '$'. The way I’ve written the examples below is designed to make it easy to read the ranges being tested in each function.

constexpr bool IsNumber(char c)
{
	return '0' <= c && c <= '9';
}

constexpr bool IsAlpha(char c)
{
	return ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z');
}

constexpr bool IsWhitespace(char c)
{
	return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

These can also be broken down into more specific functions like IsLower and IsUpper, and combined to create character classifiers of a more broad type. By using the C++ constexpr keyword it pretty much guarantees (when compiled with any optimisation level enabled) that the function code will be inlined rather than cause a function call in assembly. So much so in fact that I had to remove the constexpr or make secondary non-constexpr functions when testing in order to see the assembly output for GCC and Clang. In some ways MSVC is nice in that it emits the assembly code for inlined and constexpr functions anyway.

constexpr bool IsNumber(char c)
{
	return '0' <= c && c <= '9';
}

constexpr bool IsLower(char c)
{
	return 'a' <= c && c <= 'z';
}

constexpr bool IsUpper(char c)
{
	return 'A' <= c && c <= 'Z';
}

constexpr bool IsAlpha(char c)
{
	return IsLower(c) || IsUpper(c);
}

constexpr bool IsAlphaNum(char c)
{
	return IsAlpha(c) || IsNumber(c);
}

Note that this is just a short selection of functions and not an exhaustive set that one might need.

The Generated Assembly

In examining the assembly generated by each compiler for the code in this article I made some interesting observations.

In general each compiler will attempt to optimise the code to the best of their ability, and some common optimisation techniques are:

Combining tests for adjacent values into a simpler range test.
Combining tests for disjoint but close values (within 64) into tests against a computed bit mask.

There are also differences between compilers in what assembly instructions they generate, with which you can make some generalisations:

Clang produces assembly code which tries to avoid small branches whenever possible, either by evaluating all conditions and then combining the result, or by using a jump table. This seems more suited towards newer processor architectures and reflects on Clang relatively recent creation.
MSVC produces similar but smaller assembly code although it uses branches to short circuit evaluating all the conditions. This sort of code reminds me of programming for older processor architectures with more limited memory.
GCC produces code that can be seen as a mix of the other two compilers. Sometimes closer to MSVC and sometimes to Clang, and sometimes unfortunately it also produces the most confusing code.

Single Ranges

For functions which only test a single range, such as IsNumber, all compilers effectively generate code similar to:

IsNumber(char):
	sub     cl, 48
	cmp     cl, 9
	setbe   al
	ret     0

Which in C++ is equivalent to:

bool IsNumber(char c)
{
	return (c - 48) <= 9; // '0' == 48
}

This also ends up being the general pattern used for testing ranges, where the minimum of the range is subtracted from the value and then its tested against the length of the range. In this way only a single comparison is needed rather separately testing and branching for each of the minimum and maximum.

Multiple Ranges

For functions which test multiple ranges the compilers generate different code at all optimisation levels. Using the IsAlphaNum function as our subject for comparison and compiling at the O2 optimisation level we can clearly see the differences.

MSVC generates assembly code which most accurately reflects the C++ language semantics in the original code. It tests each condition in an optimised form but then jumps to the end if true, mirroring the short-circuit evaluation of the original C++ source code.

bool IsAlphaNum(char) PROC
        lea     eax, DWORD PTR [rcx-65]
        cmp     al, 25
        jbe     SHORT $LN3@IsAlphaNum
        lea     eax, DWORD PTR [rcx-97]
        cmp     al, 25
        jbe     SHORT $LN3@IsAlphaNum
        sub     cl, 48
        cmp     cl, 9
        jbe     SHORT $LN3@IsAlphaNum
        xor     al, al
        ret     0
$LN3@IsAlphaNum:
        mov     al, 1
        ret     0
bool IsAlphaNum(char) ENDP

The MSVC implementation in the code above uses the lea instruction to compute the initial subtraction of the minimum value before testing. For example the first lea computes eax = ecx - 65.

GCC actually does a similar thing, where it jumps to the end if the first IsAlpha condition is true, but it only has a single branch as the assembly it generates for IsAlpha has no branches.

IsAlphaNum(char):
        mov     eax, edi
        mov     edx, 1
        and     eax, -33
        sub     eax, 65
        cmp     al, 25
        jbe     .L6
        sub     edi, 48
        cmp     dil, 9
        setbe   dl
.L6:
        mov     eax, edx
        ret

In the GCC implementation it uses and to make make the character upper case and then performs the test on that. As -33 is 1101 1111 in binary.

Clang on the other hand follows the intention of the function and generates assembly which produces the right result but does not strictly represent the language semantics of the C++ code as written. Specifically it does not perform any short-circuit evaluation of the logical code and just tests all conditions, combining the result at the end.

IsAlphaNum(char):
        mov     eax, edi
        and     al, -33
        add     al, -65
        cmp     al, 26
        setb    cl
        add     dil, -48
        cmp     dil, 10
        setb    al
        or      al, cl
        ret

My intuition says that the branch-less code that Clang produces should run marginally faster than the other code with branches, as getting a branch misprediction costs many cycles, where as executing a handful more simple instructions would almost be free.

Reordering Range Conditions

It is generally advisable when writing functions to test equality to put the test that partitions the search space the most, first. For example the test that can eliminate 90% of values should go before the test that eliminates only 50%.

To apply it to this case would mean putting the test c <= '9' (which eliminates 184 values) before c >= '0' (which eliminates 60 values), and likewise swapping the order of tests for the alphabet ranges.

However when investigating this with Compiler Explorer these changes generally had no effect on the generated assembly code, but in some cases made the assembly code worse.

For the simplest functions such as IsNumber there was no difference in the generated assembly.
For slightly more complicated functions such as IsAlpha the generated assembly was slightly larger and contained branching on all compilers.
Interestingly enough the reordered versions which called functions rather than do the comparisons directly were just as optimised as the simplest functions.

So in this case the idea to take away from this is to write simple straightforward code that is easy for both people and the compiler to understand.

Multiple Characters

The other type of tests involved directly compare against individual characters rather than ranges of characters. An example of this is the IsWhitespace function from the beginning of the article, though here is a more complete version which tests all of the white-space characters including the lesser known form feed ('\f', 12 dec, 0x0C hex) and vertical tab ('\v', 11 dec, 0x0B hex).

I was not actually aware of these characters myself until I started looking into the Clang and Carbon compiler source code and then cross-referencing with the ASCII table.

bool IsWhitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == '\n';
}

MSVC compiles this into a couple of tests, one for the space character (32 dec, 0x20 hex), and one for the range of white-space characters (From 0x09 to 0x0D).

bool IsWhitespace(char) PROC
        cmp     cl, 32
        je      SHORT $LN3@IsWhitespa
        sub     cl, 9
        cmp     cl, 4
        jbe     SHORT $LN3@IsWhitespa
        xor     al, al
        ret     0
$LN3@IsWhitespa:
        mov     al, 1
        ret     0

Clang compiles this code into a range check and test against a computed bit mask, combining the result together using logical operations.

IsWhitespace(char):
        cmp     dil, 33
        setb    cl
        movabs  rax, 4294983168
        bt      rax, rdi
        setb    al
        and     al, cl
        ret

GCC however compiles to something more interesting where it creates assembly code which both uses a lookup table to check most of the values and then explicitly checks for carriage-return (13 dec, 0x0D hex) and line-feed (10 dec, 0x0A hex).

IsWhitespace(char):
        cmp     dil, 32
        ja      .L12
        movabs  rax, 4294973952
        bt      rax, rdi
        setc    al
        test    al, al
        je      .L12
        ret
.L12:
        cmp     dil, 13
        sete    al
        cmp     dil, 10
        sete    dl
        or      eax, edx
        ret

My suspicion was that GCC tried to honour the ordering of the comparisons in the written C++ code, and it managed to collapse the first set of 4 comparisons into a bit field lookup, but it did not do so for the last two.

This was confirmed when I sorted the comparisons in the C++ function to match the ASCII values.

bool IsWhitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' ';
}

As both GCC and Clang generated nearly identical and heavily optimised assembly. (This is the GCC version)

IsWhitespace(char):
        lea     eax, [rdi-9]
        cmp     al, 4
        setbe   al
        cmp     dil, 32
        sete    dl
        or      eax, edx
        ret

With MSVC generating the same tests but branching to return the result and therefore adhering to the short-circuit evaluation of the C++ code.

IsWhitespace(char) PROC
        lea     eax, DWORD PTR [rcx-9]
        cmp     al, 4
        jbe     SHORT $LN5@IsWhitespa
        cmp     cl, 32
        je      SHORT $LN5@IsWhitespa
        xor     al, al
        ret     0
$LN5@IsWhitespa:
        mov     al, 1
        ret     0

More Complex Comparisons

A more complete example is the IsSymbol function below, which I’ve written to classify all symbols used in ASCII just as they appear on my keyboard (US layout).

bool IsSymbol(char c)
{
    return c == '~' || c == '`' || c == '!' || c == '@'
        || c == '#' || c == '$' || c == '%' || c == '^'
        || c == '&' || c == '*' || c == '(' || c == ')'
        || c == '_' || c == '-' || c == '+' || c == '='
        || c == '[' || c == ']' || c == '{' || c == '}'
        || c == '|' || c == '\\' || c == ';' || c == ':'
        || c == '\'' || c == '"' || c == ',' || c == '.'
        || c == '<' || c == '>' || c == '/' || c == '?'
    ;
}

In this initial version MSVC ends up with the smallest and possibly cleanest assembly code. It performs a quick test against a short range at the end of the ASCII sequence (for '{', '|', '}', and '~'), then checks the value is within the desired range before doing a bit mask test for the remaining characters.

bool IsSymbol(char) PROC
        lea     eax, DWORD PTR [rcx-123]
        cmp     al, 3
        jbe     SHORT $LN3@IsSymbol
        sub     cl, 33
        cmp     cl, 63
        ja      SHORT $LN5@IsSymbol
        mov     rax, -288230371890266113
        bt      rax, rcx
        jb      SHORT $LN3@IsSymbol
$LN5@IsSymbol:
        xor     al, al
        ret     0
$LN3@IsSymbol:
        mov     al, 1
        ret     0
bool IsSymbol(char) ENDP

Clang also generates simple code, although it uses a 93 element jump table instead of testing against a bit mask. While this may be compact code this table takes up 372 bytes of space, which will take up instruction cache space and could affect performance.

IsSymbol(char):
        add     edi, -33
        cmp     edi, 93
        ja      .LBB1_2
        mov     al, 1
        lea     rcx, [rip + .LJTI1_0]
        movsxd  rdx, dword ptr [rcx + 4*rdi]
        add     rdx, rcx
        jmp     rdx
.LBB1_3:
        ret
.LBB1_2:
        xor     eax, eax
        ret
.LJTI1_0:
        .long   .LBB1_3-.LJTI1_0	# When false
        .long   .LBB1_2-.LJTI1_0	# When true
        # ...

Much like in the previous unordered IsWhitespace function it turns out that GCC doesn’t like the IsSymbol C++ code as written and produces quite a long and branchy sequence of assembly.

The actual assembly code is a bit of a mess, doing a lot of individual tests, some range tests, and a bit mask test. Though the bit mask only has 3 bits set, meaning that it’s only testing for 3 characters even though it could test nearly the entire range of symbols using it (as the MSVC assembly code does).

Expand GCC assembly code

IsSymbol(char):
        cmp     dil, 33
        je      .L12
        lea     eax, [rdi-64]
        cmp     al, 62
        ja      .L18
        movabs  rdx, 4611686022722355201
        bt      rdx, rax
        setc    al
        test    al, al
        je      .L19
.L5:
        ret
.L19:
        cmp     dil, 95
        jg      .L10
        mov     eax, 1
        cmp     dil, 90
        jg      .L5
.L8:
        and     edi, -17
        cmp     dil, 47
        sete    al
        ret
.L18:
        cmp     dil, 95
        jg      .L8
        cmp     dil, 46
        jg      .L11
        mov     eax, 1
        cmp     dil, 33
        jg      .L5
        jmp     .L8
.L10:
        lea     edx, [rdi-123]
        mov     eax, 1
        cmp     dl, 2
        ja      .L8
        ret
.L11:
        lea     eax, [rdi-58]
        cmp     al, 4
        ja      .L8
.L12:
        mov     eax, 1
        ret

Just for fun I reconstructed the C/C++ code from the assembly and it looks something like this:

bool IsSymbol(char c)
{
	if (c == 33) // '!'
	{
		return true;
	}
	if (c - 64 <= 62)
	{
		if (((1 << (c - 64)) & 4611686022722355201) != 0) // '@', '`', '~'
		{
			return true;
		}
		if (c > 95)
		{
			if (c - 123 <= 2) // '{', '|', '}'
			{
				return true;
			}
		}
		else
		{
			if (c > 90) // '[', '\\', ']', '^', '_'
			{
				return true;
			}
		}
	}
	else
	{
		if (c <= 95)
		{
			if (c > 46)
			{
				if (c - 58 <= 4) // ':', ';', '<', '=', '>', '?'
				{
					return true;
				}
			}
			else
			{
				if (c > 33) // '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.'
				{
					return true;
				}
			}
		}
	}
	return (c & 0xEF) == 47; // '/'
}

After doing this the best that I can tell is that GCC is grouping contiguous sequences together where it can find them but otherwise attempts to perform the tests in order. So if the characters tested were more randomly organised then the code would end up looking quite different (in GCC).

Using a Switch

Given that we’re testing so many different characters it may seem more natural to use a switch statement for this instead, so converting the previous implementation gives us this:

bool IsSymbol_switch(char c)
{
    switch (c)
    {
    case '~': case '`': case '!': case '@':
    case '#': case '$': case '%': case '^':
    case '&': case '*': case '(': case ')':
    case '_': case '-': case '+': case '=':
    case '[': case ']': case '{': case '}':
    case '|': case '\\': case ';': case ':':
    case '\'': case '"': case ',': case '.':
    case '<': case '>': case '/': case '?':
        return true;
    default:
        return false;
    }
}

This greatly simplifies the assembly code that GCC emits, turning it into a far simpler set of comparisons and branches. My intuition is that in a switch statement the compiler is more free to re-order case statements giving it more opportunity to optimise.

IsSymbol_switch(char):
        cmp     dil, 96
        jg      .L28
        cmp     dil, 90
        jg      .L31
        cmp     dil, 47
        jg      .L30
        cmp     dil, 32
        setg    al
        ret
.L28:
        sub     edi, 123
        cmp     dil, 3
        setbe   al
        ret
.L30:
        sub     edi, 58
        cmp     dil, 6
        setbe   al
        ret
.L31:
        mov     eax, 1
        ret

While MSVC now generates code that uses an indirect jump table. It stores the jump addresses in a smaller table, and then has the main 93 element table store an index to the smaller table. This means that there are two table lookups per character, but the tables are about a quarter the size of the Clang version.

bool IsSymbol_switch(char) PROC
        movsx   eax, cl
        add     eax, -33
        cmp     eax, 93
        ja      SHORT $LN36@IsSymbol_s
        lea     rdx, OFFSET FLAT:__ImageBase
        cdqe
        movzx   eax, BYTE PTR $LN38@IsSymbol_s[rdx+rax]
        mov     ecx, DWORD PTR $LN39@IsSymbol_s[rdx+rax*4]
        add     rcx, rdx
        jmp     rcx
$LN4@IsSymbol_s:
        mov     al, 1
        ret     0
$LN36@IsSymbol_s:
        xor     al, al
        ret     0
        npad    2
$LN39@IsSymbol_s:
        DD      $LN4@IsSymbol_s
        DD      $LN36@IsSymbol_s
$LN38@IsSymbol_s:
        DB      0	# When true
        DB      1 	# When false
        # ...

Interestingly enough the Clang version remains the same between the two C++ implementations, suggesting that somewhere along the way it converts both versions of the code to a common sequence (I suspect this is likely in the intermediate representation that LLVM uses).

Reordering Compared Values

Now just as reordering comparisons in the shorter IsWhitespace case helped generate better assembly code, I wanted to see what effect ordering the characters would have in the more complete IsSymbol case. So sorting all of the symbols and updating the C++ code gave me this:

bool IsSymbol_ordered(char c)
{
    return c == '!' || c == '"' || c == '#' || c == '$'
        || c == '%' || c == '&' || c == '\'' || c == '('
        || c == ')' || c == '*' || c == '+' || c == ','
        || c == '-' || c == '.' || c == '/'
        || c == ';' || c == ':' || c == '<' || c == '=' 
        || c == '>'  || c == '?' || c == '@'
        || c == '[' || c == '\\' || c == ']' || c == '^'
        || c == '_' || c == '`'
        || c == '{' || c == '|' || c == '}' || c == '~'
    ;
}

For MSVC it ended up with almost identical code as the original, with some of the tests swapped around but otherwise performing the same tests.

bool IsSymbol_ordered(char) PROC
        lea     eax, DWORD PTR [rcx-33]
        cmp     al, 63
        ja      SHORT $LN7@TestSymbol
        mov     rdx, -288230371890266113
        bt      rdx, rax
        jb      SHORT $LN5@TestSymbol
$LN7@TestSymbol:
        sub     cl, 123
        cmp     cl, 3
        jbe     SHORT $LN5@TestSymbol
        xor     al, al
        ret     0
$LN5@TestSymbol:
        mov     al, 1
        ret     0
bool IsSymbol_ordered(char) ENDP

GCC ended up with the same comparison functions but just organised differently, and it even generated an identical bit mask to test against.

IsSymbol_ordered(char):
        lea     eax, [rdi-33]
        cmp     al, 63
        jbe     .L26
.L21:
        sub     edi, 123
        cmp     dil, 3
        setbe   dl
        mov     eax, edx
        ret
.L26:
        movabs  rcx, -288230371890266113
        mov     edx, 1
        bt      rcx, rax
        jnc     .L21
        mov     eax, edx
        ret

With Clang being the odd one out generating two more range comparisons before performing a similar but different bit mask test.

IsSymbol_ordered(char):
        lea     ecx, [rdi - 33]
        mov     al, 1
        cmp     cl, 15
        jb      .LBB2_5
        movsx   ecx, dil
        lea     edx, [rcx - 91]
        cmp     edx, 35
        ja      .LBB2_2
        movabs  rsi, 64424509503
        bt      rsi, rdx
        jb      .LBB2_5
.LBB2_2:
        add     ecx, -58
        cmp     ecx, 7
        jae     .LBB2_3
.LBB2_5:
        ret
.LBB2_3:
        xor     eax, eax
        ret

So we see that there is a vast difference between the assembly code generated when the characters being compared are sorted versus when they are unsorted. The best reason I can think of that this is the case is because of the short circuit behaviour of C++’s logical operators (|| and &&). These effectively imply an ordering to the tests where in this case we don’t really need that.

Reordering the Switch Statement

Now the final thing to check is to see what happens when we reorder the case statements within the switch. The code should look something like:

bool IsSymbol_switch_ordered(char c)
{
    switch (c)
    {
    case '!': case '"': case '#': case '$':
    case '%': case '&': case '\'': case '(':
    case ')': case '*': case '+': case ',':
    case '-': case '.': case '/': case ':':
    case ';': case '<': case '=': case '>':
    case '?': case '@': case '[': case '\\':
    case ']': case '^': case '_': case '`':
    case '{': case '|': case '}': case '~':
        return true;
    default:
        return false;
    }
}

And the answer is absolutely nothing!

The generated assembly code for this function with the sorted case statements is exactly the same as the unordered switch function shown before for every compiler tested. This confirms my intuition from before that the compiler is free to reorder the case statements in this situation.

Analysis

There’s a few things that we can learn from this investigation, even before the other articles are written. Though the performance evaluation will have to wait until the benchmarks are done.

All compilers can create small and efficient bits of code from these simple character classification functions. They can combine multiple tests written in C++ into a single range or bit test down at the assembly level.

So in general terms do not worry about the performance of such functions but instead write simple and clear code, as that also makes it easier for the compiler to optimise.

If you’re in doubt or curious then check with Compiler Explorer.
If the order of the comparisons is unimportant then sorting the values being compared will give better code generation than having them in a random order. This will more easily allow the compiler to implement the comparisons as range or bit mask tests rather than single character tests.

This also applies to testing ranges as keeping the begin and end ranges ordered helps both humans read the code and the compiler optimise the code.

If code size is not an issue, or you want to avoid sorting the comparisons each time, then using a switch statement might work better for you.

Of course if you know the distribution of the values in your data then place the most common tests first, because the compiler should retain that order and the code should run faster in practice. Of course you should test and measure to get a proper understanding.