Classifying Characters with Simple Functions
This is the second in a series of articles I’m writing on character classification as used in lexers and compilers. In this I describe the simplest method of character classification which is using plain functions with the logic directly inside.
This is the simplest method to understand and implement, it’s the logic that you would write within an if statement just wrapped up in a convenient and descriptive function. In general you would write a function for each character classification that you need to distinguish.
The Code
There are two main ways of performing the tests, checking a range of characters such as '0'
- '9'
, and testing individual characters such as '$'
. The way I’ve written the examples below is designed to make it easy to read the ranges being tested in each function.
constexpr bool IsNumber(char c)
{
return '0' <= c && c <= '9';
}
constexpr bool IsAlpha(char c)
{
return ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z');
}
constexpr bool IsWhitespace(char c)
{
return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}
These can also be broken down into more specific functions like IsLower
and IsUpper
, and combined to create character classifiers of a more broad type. By using the C++ constexpr
keyword it pretty much guarantees (when compiled with any optimisation level enabled) that the function code will be inlined rather than cause a function call in assembly. So much so in fact that I had to remove the constexpr
or make secondary non-constexpr functions when testing in order to see the assembly output for GCC and Clang. In some ways MSVC is nice in that it emits the assembly code for inlined and constexpr functions anyway.
constexpr bool IsNumber(char c)
{
return '0' <= c && c <= '9';
}
constexpr bool IsLower(char c)
{
return 'a' <= c && c <= 'z';
}
constexpr bool IsUpper(char c)
{
return 'A' <= c && c <= 'Z';
}
constexpr bool IsAlpha(char c)
{
return IsLower(c) || IsUpper(c);
}
constexpr bool IsAlphaNum(char c)
{
return IsAlpha(c) || IsNumber(c);
}
Note that this is just a short selection of functions and not an exhaustive set that one might need.
The Generated Assembly
In examining the assembly generated by each compiler for the code in this article I made some interesting observations.
In general each compiler will attempt to optimise the code to the best of their ability, and some common optimisation techniques are:
- Combining tests for adjacent values into a simpler range test.
- Combining tests for disjoint but close values (within 64) into tests against a computed bit mask.
There are also differences between compilers in what assembly instructions they generate, with which you can make some generalisations:
- Clang produces assembly code which tries to avoid small branches whenever possible, either by evaluating all conditions and then combining the result, or by using a jump table. This seems more suited towards newer processor architectures and reflects on Clang relatively recent creation.
- MSVC produces similar but smaller assembly code although it uses branches to short circuit evaluating all the conditions. This sort of code reminds me of programming for older processor architectures with more limited memory.
- GCC produces code that can be seen as a mix of the other two compilers. Sometimes closer to MSVC and sometimes to Clang, and sometimes unfortunately it also produces the most confusing code.
Single Ranges
For functions which only test a single range, such as IsNumber
, all compilers effectively generate code similar to:
IsNumber(char):
sub cl, 48
cmp cl, 9
setbe al
ret 0
Which in C++ is equivalent to:
bool IsNumber(char c)
{
return (c - 48) <= 9; // '0' == 48
}
This also ends up being the general pattern used for testing ranges, where the minimum of the range is subtracted from the value and then its tested against the length of the range. In this way only a single comparison is needed rather separately testing and branching for each of the minimum and maximum.
Multiple Ranges
For functions which test multiple ranges the compilers generate different code at all optimisation levels. Using the IsAlphaNum
function as our subject for comparison and compiling at the O2
optimisation level we can clearly see the differences.
MSVC generates assembly code which most accurately reflects the C++ language semantics in the original code. It tests each condition in an optimised form but then jumps to the end if true, mirroring the short-circuit evaluation of the original C++ source code.
bool IsAlphaNum(char) PROC
lea eax, DWORD PTR [rcx-65]
cmp al, 25
jbe SHORT $LN3@IsAlphaNum
lea eax, DWORD PTR [rcx-97]
cmp al, 25
jbe SHORT $LN3@IsAlphaNum
sub cl, 48
cmp cl, 9
jbe SHORT $LN3@IsAlphaNum
xor al, al
ret 0
$LN3@IsAlphaNum:
mov al, 1
ret 0
bool IsAlphaNum(char) ENDP
The MSVC implementation in the code above uses the
lea
instruction to compute the initial subtraction of the minimum value before testing. For example the firstlea
computeseax = ecx - 65
.
GCC actually does a similar thing, where it jumps to the end if the first IsAlpha
condition is true, but it only has a single branch as the assembly it generates for IsAlpha
has no branches.
IsAlphaNum(char):
mov eax, edi
mov edx, 1
and eax, -33
sub eax, 65
cmp al, 25
jbe .L6
sub edi, 48
cmp dil, 9
setbe dl
.L6:
mov eax, edx
ret
In the GCC implementation it uses
and
to make make the character upper case and then performs the test on that. As-33
is1101 1111
in binary.
Clang on the other hand follows the intention of the function and generates assembly which produces the right result but does not strictly represent the language semantics of the C++ code as written. Specifically it does not perform any short-circuit evaluation of the logical code and just tests all conditions, combining the result at the end.
IsAlphaNum(char):
mov eax, edi
and al, -33
add al, -65
cmp al, 26
setb cl
add dil, -48
cmp dil, 10
setb al
or al, cl
ret
My intuition says that the branch-less code that Clang produces should run marginally faster than the other code with branches, as getting a branch misprediction costs many cycles, where as executing a handful more simple instructions would almost be free.
Reordering Range Conditions
It is generally advisable when writing functions to test equality to put the test that partitions the search space the most, first. For example the test that can eliminate 90% of values should go before the test that eliminates only 50%.
To apply it to this case would mean putting the test c <= '9'
(which eliminates 184 values) before c >= '0'
(which eliminates 60 values), and likewise swapping the order of tests for the alphabet ranges.
However when investigating this with Compiler Explorer these changes generally had no effect on the generated assembly code, but in some cases made the assembly code worse.
- For the simplest functions such as
IsNumber
there was no difference in the generated assembly. - For slightly more complicated functions such as
IsAlpha
the generated assembly was slightly larger and contained branching on all compilers. - Interestingly enough the reordered versions which called functions rather than do the comparisons directly were just as optimised as the simplest functions.
So in this case the idea to take away from this is to write simple straightforward code that is easy for both people and the compiler to understand.
Multiple Characters
The other type of tests involved directly compare against individual characters rather than ranges of characters. An example of this is the IsWhitespace
function from the beginning of the article, though here is a more complete version which tests all of the white-space characters including the lesser known form feed ('\f'
, 12
dec, 0x0C
hex) and vertical tab ('\v'
, 11
dec, 0x0B
hex).
I was not actually aware of these characters myself until I started looking into the Clang and Carbon compiler source code and then cross-referencing with the ASCII table.
bool IsWhitespace(char c)
{
return c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == '\n';
}
MSVC compiles this into a couple of tests, one for the space character (32
dec, 0x20
hex), and one for the range of white-space characters (From 0x09
to 0x0D
).
bool IsWhitespace(char) PROC
cmp cl, 32
je SHORT $LN3@IsWhitespa
sub cl, 9
cmp cl, 4
jbe SHORT $LN3@IsWhitespa
xor al, al
ret 0
$LN3@IsWhitespa:
mov al, 1
ret 0
Clang compiles this code into a range check and test against a computed bit mask, combining the result together using logical operations.
IsWhitespace(char):
cmp dil, 33
setb cl
movabs rax, 4294983168
bt rax, rdi
setb al
and al, cl
ret
GCC however compiles to something more interesting where it creates assembly code which both uses a lookup table to check most of the values and then explicitly checks for carriage-return (13
dec, 0x0D
hex) and line-feed (10
dec, 0x0A
hex).
IsWhitespace(char):
cmp dil, 32
ja .L12
movabs rax, 4294973952
bt rax, rdi
setc al
test al, al
je .L12
ret
.L12:
cmp dil, 13
sete al
cmp dil, 10
sete dl
or eax, edx
ret
My suspicion was that GCC tried to honour the ordering of the comparisons in the written C++ code, and it managed to collapse the first set of 4 comparisons into a bit field lookup, but it did not do so for the last two.
This was confirmed when I sorted the comparisons in the C++ function to match the ASCII values.
bool IsWhitespace(char c)
{
return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' ';
}
As both GCC and Clang generated nearly identical and heavily optimised assembly. (This is the GCC version)
IsWhitespace(char):
lea eax, [rdi-9]
cmp al, 4
setbe al
cmp dil, 32
sete dl
or eax, edx
ret
With MSVC generating the same tests but branching to return the result and therefore adhering to the short-circuit evaluation of the C++ code.
IsWhitespace(char) PROC
lea eax, DWORD PTR [rcx-9]
cmp al, 4
jbe SHORT $LN5@IsWhitespa
cmp cl, 32
je SHORT $LN5@IsWhitespa
xor al, al
ret 0
$LN5@IsWhitespa:
mov al, 1
ret 0
More Complex Comparisons
A more complete example is the IsSymbol
function below, which I’ve written to classify all symbols used in ASCII just as they appear on my keyboard (US layout).
bool IsSymbol(char c)
{
return c == '~' || c == '`' || c == '!' || c == '@'
|| c == '#' || c == '$' || c == '%' || c == '^'
|| c == '&' || c == '*' || c == '(' || c == ')'
|| c == '_' || c == '-' || c == '+' || c == '='
|| c == '[' || c == ']' || c == '{' || c == '}'
|| c == '|' || c == '\\' || c == ';' || c == ':'
|| c == '\'' || c == '"' || c == ',' || c == '.'
|| c == '<' || c == '>' || c == '/' || c == '?'
;
}
In this initial version MSVC ends up with the smallest and possibly cleanest assembly code. It performs a quick test against a short range at the end of the ASCII sequence (for '{'
, '|'
, '}'
, and '~'
), then checks the value is within the desired range before doing a bit mask test for the remaining characters.
bool IsSymbol(char) PROC
lea eax, DWORD PTR [rcx-123]
cmp al, 3
jbe SHORT $LN3@IsSymbol
sub cl, 33
cmp cl, 63
ja SHORT $LN5@IsSymbol
mov rax, -288230371890266113
bt rax, rcx
jb SHORT $LN3@IsSymbol
$LN5@IsSymbol:
xor al, al
ret 0
$LN3@IsSymbol:
mov al, 1
ret 0
bool IsSymbol(char) ENDP
Clang also generates simple code, although it uses a 93 element jump table instead of testing against a bit mask. While this may be compact code this table takes up 372 bytes of space, which will take up instruction cache space and could affect performance.
IsSymbol(char):
add edi, -33
cmp edi, 93
ja .LBB1_2
mov al, 1
lea rcx, [rip + .LJTI1_0]
movsxd rdx, dword ptr [rcx + 4*rdi]
add rdx, rcx
jmp rdx
.LBB1_3:
ret
.LBB1_2:
xor eax, eax
ret
.LJTI1_0:
.long .LBB1_3-.LJTI1_0 # When false
.long .LBB1_2-.LJTI1_0 # When true
# ...
Much like in the previous unordered IsWhitespace
function it turns out that GCC doesn’t like the IsSymbol
C++ code as written and produces quite a long and branchy sequence of assembly.
The actual assembly code is a bit of a mess, doing a lot of individual tests, some range tests, and a bit mask test. Though the bit mask only has 3 bits set, meaning that it’s only testing for 3 characters even though it could test nearly the entire range of symbols using it (as the MSVC assembly code does).
Just for fun I reconstructed the C/C++ code from the assembly and it looks something like this: After doing this the best that I can tell is that GCC is grouping contiguous sequences together where it can find them but otherwise attempts to perform the tests in order. So if the characters tested were more randomly organised then the code would end up looking quite different (in GCC).Expand GCC assembly code
IsSymbol(char):
cmp dil, 33
je .L12
lea eax, [rdi-64]
cmp al, 62
ja .L18
movabs rdx, 4611686022722355201
bt rdx, rax
setc al
test al, al
je .L19
.L5:
ret
.L19:
cmp dil, 95
jg .L10
mov eax, 1
cmp dil, 90
jg .L5
.L8:
and edi, -17
cmp dil, 47
sete al
ret
.L18:
cmp dil, 95
jg .L8
cmp dil, 46
jg .L11
mov eax, 1
cmp dil, 33
jg .L5
jmp .L8
.L10:
lea edx, [rdi-123]
mov eax, 1
cmp dl, 2
ja .L8
ret
.L11:
lea eax, [rdi-58]
cmp al, 4
ja .L8
.L12:
mov eax, 1
ret
bool IsSymbol(char c)
{
if (c == 33) // '!'
{
return true;
}
if (c - 64 <= 62)
{
if (((1 << (c - 64)) & 4611686022722355201) != 0) // '@', '`', '~'
{
return true;
}
if (c > 95)
{
if (c - 123 <= 2) // '{', '|', '}'
{
return true;
}
}
else
{
if (c > 90) // '[', '\\', ']', '^', '_'
{
return true;
}
}
}
else
{
if (c <= 95)
{
if (c > 46)
{
if (c - 58 <= 4) // ':', ';', '<', '=', '>', '?'
{
return true;
}
}
else
{
if (c > 33) // '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.'
{
return true;
}
}
}
}
return (c & 0xEF) == 47; // '/'
}
Using a Switch
Given that we’re testing so many different characters it may seem more natural to use a switch
statement for this instead, so converting the previous implementation gives us this:
bool IsSymbol_switch(char c)
{
switch (c)
{
case '~': case '`': case '!': case '@':
case '#': case '$': case '%': case '^':
case '&': case '*': case '(': case ')':
case '_': case '-': case '+': case '=':
case '[': case ']': case '{': case '}':
case '|': case '\\': case ';': case ':':
case '\'': case '"': case ',': case '.':
case '<': case '>': case '/': case '?':
return true;
default:
return false;
}
}
This greatly simplifies the assembly code that GCC emits, turning it into a far simpler set of comparisons and branches. My intuition is that in a switch statement the compiler is more free to re-order case statements giving it more opportunity to optimise.
IsSymbol_switch(char):
cmp dil, 96
jg .L28
cmp dil, 90
jg .L31
cmp dil, 47
jg .L30
cmp dil, 32
setg al
ret
.L28:
sub edi, 123
cmp dil, 3
setbe al
ret
.L30:
sub edi, 58
cmp dil, 6
setbe al
ret
.L31:
mov eax, 1
ret
While MSVC now generates code that uses an indirect jump table. It stores the jump addresses in a smaller table, and then has the main 93 element table store an index to the smaller table. This means that there are two table lookups per character, but the tables are about a quarter the size of the Clang version.
bool IsSymbol_switch(char) PROC
movsx eax, cl
add eax, -33
cmp eax, 93
ja SHORT $LN36@IsSymbol_s
lea rdx, OFFSET FLAT:__ImageBase
cdqe
movzx eax, BYTE PTR $LN38@IsSymbol_s[rdx+rax]
mov ecx, DWORD PTR $LN39@IsSymbol_s[rdx+rax*4]
add rcx, rdx
jmp rcx
$LN4@IsSymbol_s:
mov al, 1
ret 0
$LN36@IsSymbol_s:
xor al, al
ret 0
npad 2
$LN39@IsSymbol_s:
DD $LN4@IsSymbol_s
DD $LN36@IsSymbol_s
$LN38@IsSymbol_s:
DB 0 # When true
DB 1 # When false
# ...
Interestingly enough the Clang version remains the same between the two C++ implementations, suggesting that somewhere along the way it converts both versions of the code to a common sequence (I suspect this is likely in the intermediate representation that LLVM uses).
Reordering Compared Values
Now just as reordering comparisons in the shorter IsWhitespace
case helped generate better assembly code, I wanted to see what effect ordering the characters would have in the more complete IsSymbol
case. So sorting all of the symbols and updating the C++ code gave me this:
bool IsSymbol_ordered(char c)
{
return c == '!' || c == '"' || c == '#' || c == '$'
|| c == '%' || c == '&' || c == '\'' || c == '('
|| c == ')' || c == '*' || c == '+' || c == ','
|| c == '-' || c == '.' || c == '/'
|| c == ';' || c == ':' || c == '<' || c == '='
|| c == '>' || c == '?' || c == '@'
|| c == '[' || c == '\\' || c == ']' || c == '^'
|| c == '_' || c == '`'
|| c == '{' || c == '|' || c == '}' || c == '~'
;
}
For MSVC it ended up with almost identical code as the original, with some of the tests swapped around but otherwise performing the same tests.
bool IsSymbol_ordered(char) PROC
lea eax, DWORD PTR [rcx-33]
cmp al, 63
ja SHORT $LN7@TestSymbol
mov rdx, -288230371890266113
bt rdx, rax
jb SHORT $LN5@TestSymbol
$LN7@TestSymbol:
sub cl, 123
cmp cl, 3
jbe SHORT $LN5@TestSymbol
xor al, al
ret 0
$LN5@TestSymbol:
mov al, 1
ret 0
bool IsSymbol_ordered(char) ENDP
GCC ended up with the same comparison functions but just organised differently, and it even generated an identical bit mask to test against.
IsSymbol_ordered(char):
lea eax, [rdi-33]
cmp al, 63
jbe .L26
.L21:
sub edi, 123
cmp dil, 3
setbe dl
mov eax, edx
ret
.L26:
movabs rcx, -288230371890266113
mov edx, 1
bt rcx, rax
jnc .L21
mov eax, edx
ret
With Clang being the odd one out generating two more range comparisons before performing a similar but different bit mask test.
IsSymbol_ordered(char):
lea ecx, [rdi - 33]
mov al, 1
cmp cl, 15
jb .LBB2_5
movsx ecx, dil
lea edx, [rcx - 91]
cmp edx, 35
ja .LBB2_2
movabs rsi, 64424509503
bt rsi, rdx
jb .LBB2_5
.LBB2_2:
add ecx, -58
cmp ecx, 7
jae .LBB2_3
.LBB2_5:
ret
.LBB2_3:
xor eax, eax
ret
So we see that there is a vast difference between the assembly code generated when the characters being compared are sorted versus when they are unsorted. The best reason I can think of that this is the case is because of the short circuit behaviour of C++’s logical operators (||
and &&
). These effectively imply an ordering to the tests where in this case we don’t really need that.
Reordering the Switch Statement
Now the final thing to check is to see what happens when we reorder the case statements within the switch. The code should look something like:
bool IsSymbol_switch_ordered(char c)
{
switch (c)
{
case '!': case '"': case '#': case '$':
case '%': case '&': case '\'': case '(':
case ')': case '*': case '+': case ',':
case '-': case '.': case '/': case ':':
case ';': case '<': case '=': case '>':
case '?': case '@': case '[': case '\\':
case ']': case '^': case '_': case '`':
case '{': case '|': case '}': case '~':
return true;
default:
return false;
}
}
And the answer is absolutely nothing!
The generated assembly code for this function with the sorted case statements is exactly the same as the unordered switch function shown before for every compiler tested. This confirms my intuition from before that the compiler is free to reorder the case statements in this situation.
Analysis
There’s a few things that we can learn from this investigation, even before the other articles are written. Though the performance evaluation will have to wait until the benchmarks are done.
-
All compilers can create small and efficient bits of code from these simple character classification functions. They can combine multiple tests written in C++ into a single range or bit test down at the assembly level.
So in general terms do not worry about the performance of such functions but instead write simple and clear code, as that also makes it easier for the compiler to optimise.
If you’re in doubt or curious then check with Compiler Explorer.
-
If the order of the comparisons is unimportant then sorting the values being compared will give better code generation than having them in a random order. This will more easily allow the compiler to implement the comparisons as range or bit mask tests rather than single character tests.
This also applies to testing ranges as keeping the begin and end ranges ordered helps both humans read the code and the compiler optimise the code.
If code size is not an issue, or you want to avoid sorting the comparisons each time, then using a switch statement might work better for you.
Of course if you know the distribution of the values in your data then place the most common tests first, because the compiler should retain that order and the code should run faster in practice. Of course you should test and measure to get a proper understanding.