In the previous article (“Searching for a better indexOf”) we studied various methods of searching for a byte sequence in another byte sequence in Java. We want to try the same in C++. All of our Java approaches apply in C++ as well, but the common belief is that you can do better in C++. Not because C++ compilers are better (these days that is a hotly debated question), but because C++ allows closer access to the computer’s low-level operations, such as direct in-line assembly programming.
Specifically, one of the options we have in mind is to find the first character of the pattern using some techniques not available in Java, and then compare the patterns. If this can be performed very fast, we can forget about our fancy search methods, such as shift tables and search trees.
So today we’ll be studying a very narrow technical question: how fast can we find a single character in a long character array? The problem looks very simple; some of the solutions we’ll be looking at, however, are not simple at all.
Obviously, there are readily available library functions for the job, which we are unlikely to outperform – they are usually well-designed and highly optimised. We’ll measure them, too, and check how well our DIY implementations approach them.
Just for a change, we’ll write our tests in C, or, more precisely, in C99.
The code is available in this repository.
The test
The library function in C to search for a character in a memory block looks like this:
Here count is the number of bytes to look at. The function returns the position of the character if found, or NULL if not. We’ll follow the same signature; our searchers will satisfy the same type:
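For reference, here is the standard prototype together with a possible searcher type (the typedef name is my own, not necessarily the one used in the repository):

```c
#include <stddef.h>
#include <string.h>

/* The standard library prototype (declared in <string.h>):
   void *memchr(const void *ptr, int ch, size_t count); */

/* Our searchers will follow the same shape; the typedef name
   is an assumption, not necessarily the repository's. */
typedef const char* (*Searcher)(const char* data, char c, size_t count);
```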
We’ll allocate a big enough block, fill it with some non-matching data, and run several tests where the char to search will be placed at various distances from the start. The measured value will be the time it took to search, in nanoseconds per byte searched.
One important variable in a test like this is data alignment. It affects data reading with all techniques except single-byte access. It is especially important for SSE, where most instructions require pointer alignment by 16. That’s why we’ll run an alignment-averaging test: we’ll allocate our array at a 64-byte boundary (using _mm_malloc), and run our tests with alignments 0 to 63, averaging the results.
Another variable is the exact choice of the text length. Some solutions involve loop unrolling or multi-byte operations and may be sensitive to the alignment of this length. That’s why we’ll run tests with multiple lengths around the published central value, and average the results.
Strictly speaking, another variable is the size of the entire array: if we place the required byte at position 4, it may matter whether we call memchr with a total length of 8 or of 8192; however, we’ll ignore this for now and always call the function with some big number.
The null test
First, we want to know the cost of our measurement framework. We’ll use a null searcher for that:
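A sketch of such a null searcher (the repository version may differ in detail):

```c
#include <stddef.h>

/* A searcher that does nothing: it measures the framework overhead only */
static const char* null_index(const char* data, char c, size_t count) {
    (void)data; (void)c; (void)count;
    return NULL;
}
```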
Here are the times (the column headers are the numbers of bytes searched):
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Null | 0.63 | 0.14 | 0.035 | 0.0086 | 0.0022 | 0.0005 | 0.0001 |
The simple index-based searcher
This is, probably, the simplest solution one can think of:
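A sketch of what this searcher looks like:

```c
#include <stddef.h>

/* Scan the array by index, one byte at a time */
static const char* simple_index(const char* data, char c, size_t count) {
    for (size_t i = 0; i < count; i++)
        if (data[i] == c) return data + i;
    return NULL;
}
```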
Here are the times:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Null | 0.63 | 0.14 | 0.035 | 0.0086 | 0.0022 | 0.0005 | 0.0001 |
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
As expected, the test framework is introducing significant overhead on small arrays, and has virtually no influence on big ones.
Here is the percentage of time the test framework takes:
Fraction | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Test code | 50.8% | 23.0% | 7.4% | 2.0% | 0.5% | 0.1% | 0.02% |
We’ll have to keep this overhead in mind when working with small input sizes.
In absolute terms, the results don’t look great: 0.42 ns is about one CPU cycle. Surely we must be able to do better than this.
The pointer-based simple searcher
These days, using indices is considered normal, thanks to modern processor designs, which make indexing cheap, and to sophisticated compilers, which can convert indexing into pointer manipulation and the other way around, whichever is best for the target architecture.

In the old days, however, pointer arithmetic was praised as the winning optimisation technique and one of the main selling points of a language like C compared to something like Pascal, Ada, Algol-68 or PL/I (today, Java would top this list).
Let’s try using pointer arithmetic and see if it makes any difference:
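A possible shape of this searcher (a sketch, not necessarily the exact repository code):

```c
#include <stddef.h>

/* Advance a pointer and count down, instead of indexing */
static const char* simple_ptr(const char* data, char c, size_t count) {
    const char* p = data;
    while (count--) {
        if (*p == c) return p;
        ++p;
    }
    return NULL;
}
```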
And here are the results:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
Simple Pointer | 1.20 | 0.61 | 0.47 | 0.43 | 0.43 | 0.42 | 0.42 |
There is no difference outside of the measurement error margin. Let’s try another approach:
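One natural variant compares the running pointer against an end pointer instead of counting down; a sketch (the repository version may differ):

```c
#include <stddef.h>

/* Compare against a precomputed end pointer */
static const char* simple_ptr2(const char* data, char c, size_t count) {
    const char* end = data + count;
    for (const char* p = data; p != end; p++)
        if (*p == c) return p;
    return NULL;
}
```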
The results are a bit better on small sizes:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple Pointer | 1.20 | 0.61 | 0.47 | 0.43 | 0.43 | 0.42 | 0.42 |
Simple Pointer v2 | 1.17 | 0.53 | 0.43 | 0.41 | 0.42 | 0.41 | 0.41 |
The improvement, however, isn’t worth bothering with. It is really a matter of taste which approach to choose.
The standard searcher
Now, having tested the simplest approaches, let’s see what the standard library has to offer.
We’ll measure memchr:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
The results are much, much better. The speedup grows with array size, eventually reaching 15 times. The absolute result is also impressive: our processor runs at 2.4 GHz, so 0.027 ns/byte means 0.065 cycles/byte, or 15.4 bytes per cycle. This looks so fast that it’s unlikely to be beaten, no matter what we do. It also suggests that SSE is involved, so we’ll try that later.
How close can we come to these results with home-made searchers?
The SCAS-based searcher
Since day one of the Intel x86 family of processors (namely, since the 8086), there has been a string search instruction, SCAS. It compares al with the byte pointed to by [edi], increments that pointer and decrements cx, making it suitable for use with the REP prefix.
There are rumours that these instructions are not so efficient anymore. However, there are cases when REP MOVS is still the most efficient way to copy a block of bytes (although even there it is MOVSQ rather than MOVSB). Let’s measure. We’ll write the function straight in assembly:
This can probably be improved by replacing the conditional jump with cmove or setz, but that instruction is only executed once, after the end of the search. It can only affect performance on very short repetition counts. Here are the results we get:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
SCAS | 7.12 | 2.26 | 1.19 | 0.93 | 0.86 | 0.85 | 0.85 |
This solution is really slow, much slower even than the simple loop. It gets a bit faster on longer arrays, but it’s unlikely ever to catch up. This probably means that the rumours were correct, and we can forget about SCAS-based solutions. This instruction is there as a 40-year-old legacy and is not intended for any practical use today. So enjoy your retirement, good old SCAS.
The simple DWORD searcher
This is a variation of the Simple Pointer searcher, reading four bytes from memory at a time instead of one:
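A sketch of such a searcher (the exact repository code may differ); the bytes of the loaded word are examined from the low end, which is where the little-endian assumption comes in:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86-specific: relies on little-endian byte order */
static const char* simple_dword(const char* data, char c, size_t count) {
    const unsigned char uc = (unsigned char)c;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, data, 4);           /* alignment-safe 4-byte load */
        for (int i = 0; i < 4; i++) {  /* bytes come out low to high */
            if ((unsigned char)(v & 0xFF) == uc) return data + i;
            v >>= 8;
        }
        data += 4;
        count -= 4;
    }
    while (count--) {                  /* leftovers, one byte at a time */
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```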
This version is x86-specific. It doesn’t put any alignment requirements on the input (reading data using memcpy takes care of that; see “A bug story: data alignment on x86”). However, it makes use of the little-endian nature of the x86. Additional checks for endianness would be necessary if we ever wanted to turn it into an industrial solution, and it’s not going to be completely portable, for neither C99 nor C++ has good built-in endianness support.
Let’s check the speed:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
Simple dword | 0.99 | 0.52 | 0.42 | 0.40 | 0.41 | 0.39 | 0.39 |
The speed is a bit higher (10-20%). This shows that even reading from well-cached memory isn’t as fast as accessing the registers. However, this improvement is clearly not sufficient.
Note that this solution (and most of the solutions that follow) is very strict regarding its memory access. It never reads a single byte past the memory area it was called to search in (or, for that matter, before that area – we’ll see later that this can be useful, too). If we could be certain that there are always a few bytes available past the end, we could avoid the trailing loop. This, however, would only make a visible difference on very short strings.
The DWORD searcher with zeroes detection
In the previous solution we read a double-word (32 bits) from memory and then fetched the bytes one by one from that double-word to find a match. We could save time if we could quickly establish whether there is any match in the double-word at all.
Let’s first make a mask containing the byte we are searching for in its every byte, and XOR the loaded word with that mask:
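A sketch of the mask construction (multiplying by 0x01010101 replicates the byte into all four lanes):

```c
#include <stdint.h>

/* Replicate the byte into every byte of a 32-bit word */
static uint32_t make_mask(unsigned char c) {
    return (uint32_t)c * 0x01010101u;
}

/* Usage: any byte of w equal to c becomes zero in v
   uint32_t v = w ^ make_mask(c); */
```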
Now we need to find out if any byte in v is zero. The answer comes from the famous Bit Twiddling Hacks page.
This page offers two solutions. The first one works like this:

- first, AND each byte with 0x7F;
- this allows adding 0x7F to the resulting bytes without producing carry bits;
- the high bit of the result will be set iff any of the lower 7 bits of v were non-zero;
- we OR it with the source (v); now the high bit is 1 iff the corresponding byte in v is non-zero;
- finally, leave only the high bit and invert it;
- now bytes that were zero in v have value 0x80 and all the other bytes are zeroes. The entire doubleword is non-zero iff v contains a zero byte.
Here is the formula, which uses five operations:
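This is the formula from the Bit Twiddling Hacks page, as a macro:

```c
#include <stdint.h>

/* Five operations (AND, ADD, OR, OR, NOT): bytes that are zero in v
   turn into 0x80; all other bytes become 0x00 */
#define hasZeroByte(v) \
    (~((((v) & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | (v) | 0x7F7F7F7Fu))
```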
The second solution is less accurate but uses only four operations:
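The four-operation formula from the same page:

```c
#include <stdint.h>

/* Four operations (SUB, NOT, AND, AND): may also set 0x80 in bytes
   above the first zero byte, due to borrow propagation */
#define hasZeroByte2(v) (((v) - 0x01010101u) & ~(v) & 0x80808080u)
```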
Again, we treat the high bit of every byte separately. The ~(v) & 0x80808080 portion requires it to be zero in order to return 0x80. The (v) - 0x01010101 part subtracts one from every byte, which sets its high bit to one in three cases:
- it was one before
- the whole byte was zero
- the byte was one and there was a borrow bit from the previous byte.
In the last two cases a borrow bit is set.
The borrow bits can distort the picture, but for them to appear in the first place, a zero byte must have occurred somewhere before. This means that this method also allows us to establish whether the source has zero bytes, but does not show where they are. Here are some examples:
v | hasZeroByte | hasZeroByte2 |
---|---|---|
00000000 | 80808080 | 80808080 |
10100020 | 00008000 | 00008000 |
01010100 | 00000080 | 80808080 |
01000001 | 00808000 | 80808000 |
01010001 | 00008000 | 80808000 |
The second method is sufficient for our purpose, so our next solution will be based on it:
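A sketch of this searcher, using the four-operation check (the name follows the result table; the repository code may differ):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define hasZeroByte2(v) (((v) - 0x01010101u) & ~(v) & 0x80808080u)

static const char* dword(const char* data, char c, size_t count) {
    uint32_t mask = (uint32_t)(unsigned char)c * 0x01010101u;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, data, 4);
        if (hasZeroByte2(v ^ mask)) {   /* some byte matches: find which */
            for (int i = 0; i < 4; i++)
                if (data[i] == c) return data + i;
        }
        data += 4;
        count -= 4;
    }
    while (count--) {                   /* leftovers */
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```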
The results, however, haven’t improved:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
Simple dword | 0.99 | 0.52 | 0.42 | 0.40 | 0.41 | 0.39 | 0.39 |
Dword | 1.51 | 0.66 | 0.48 | 0.45 | 0.39 | 0.38 | 0.38 |
The BSF Dword searcher
When we finally find a doubleword that contains our byte, is there any way to locate this byte quickly, without running a loop? The methods described above produce 0x80 in the bytes where zeroes were. The first one writes 0x00 in all the other bytes, while the second one may fill some of the others with 0x80 as well. This, however, requires a borrow bit to be generated, which only happens to the left (that is, on the higher side) of the zero byte. This means that on little-endian processors both methods are suitable for locating the first zero byte. All that’s needed is to find the least significant non-zero byte in the computed value. Bit Twiddling Hacks has code for that, too, but these days it is unnecessary, since we have an efficient CPU instruction, BSF (bit scan forward). It was not always fast, but now it is. It returns the position of the rightmost (least significant) bit set in a number.
Here bsf() is our own wrapper that takes care of the different intrinsics in different compilers:
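A possible shape of the wrapper and of the searcher built on it (the GCC branch uses __builtin_ctz; the repository also covers other compilers, e.g. via _BitScanForward on MSVC):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline unsigned bsf(uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_ctz(x);   /* undefined for x == 0 */
#else
    unsigned n = 0;                       /* portable fallback */
    while (!(x & 1)) { x >>= 1; ++n; }
    return n;
#endif
}

static const char* dword_bsf(const char* data, char c, size_t count) {
    uint32_t mask = (uint32_t)(unsigned char)c * 0x01010101u;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, data, 4);
        uint32_t zeroes = ((v ^ mask) - 0x01010101u)
                        & ~(v ^ mask) & 0x80808080u;
        /* lowest 0x80 marks the first match (little-endian) */
        if (zeroes) return data + bsf(zeroes) / 8;
        data += 4;
        count -= 4;
    }
    while (count--) {
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```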
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple dword | 0.99 | 0.52 | 0.42 | 0.40 | 0.41 | 0.39 | 0.39 |
Dword | 1.51 | 0.66 | 0.48 | 0.45 | 0.39 | 0.38 | 0.38 |
Dword BSF | 1.28 | 0.58 | 0.40 | 0.36 | 0.32 | 0.29 | 0.29 |
This trick helped, except for very small sizes.
Quadword searchers
On a 64-bit architecture we can do the same using 64-bit words (qwords). The code looks pretty much the same, so we’ll only show one piece here – that of qword_bsf:
This code makes use of the dword_bsf implementation for leftovers (when there are fewer than 8 bytes left), so it will benefit from a good inline engine, if present in the compiler.
Here are the results:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Simple qword | 0.93 | 0.46 | 0.42 | 0.41 | 0.42 | 0.41 | 0.41 |
Qword | 1.59 | 0.49 | 0.24 | 0.19 | 0.16 | 0.15 | 0.15 |
Qword BSF | 1.23 | 0.42 | 0.19 | 0.14 | 0.13 | 0.13 | 0.13 |
The simple qword is running pretty much at the same speed as the simple dword, if not slower. However, the other two show good improvement, especially the BSF version. So wider memory access plus wider operations really help. This makes it attractive to try SSE.
The SSE searcher
The SSE (starting from SSE2) can operate on integer components of various sizes, including individual bytes. We can thus load 16 bytes into an SSE register and compare them all at once with the character we search for. The compare instruction (PCMPEQB) sets the corresponding bytes in the result to 0xFF if equality is detected and to 0 if not. To locate the first equal byte, we apply PMOVMSKB, which sets bits in a normal register to one or zero depending on the high bit of each byte, and then apply BSF to that register.

The endianness works out just right: the byte in an XMM register that comes first in memory produces the lowest bit, for which BSF will return zero.
A comment must be made on data alignment. Generally, SSE prefers 16-byte alignment. All operations with operands in memory and all normal loads (MOVAPS, MOVDQA) require this alignment, and only special unaligned loads (MOVUPS, MOVDQU) can do without it. In older generations of processors, MOVUPS was much slower than MOVAPS, even if the data was aligned. That is no longer the case, with one exception: crossing cache line boundaries is still a bit slow. In this version we’ll go for simplicity and use unaligned loads:
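A sketch of this searcher (the repository version delegates leftovers to qword_bsf; a byte loop is used here for brevity):

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 */

static const char* sse(const char* data, char c, size_t count) {
    __m128i pattern = _mm_set1_epi8(c);          /* c in all 16 bytes */
    while (count >= 16) {
        __m128i x = _mm_loadu_si128((const __m128i*)data);
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        if (mask) return data + __builtin_ctz(mask);
        data += 16;
        count -= 16;
    }
    while (count--) {                            /* leftovers */
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```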
This function reverts to qword_bsf for leftovers.
Here are the times:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Qword BSF | 1.23 | 0.42 | 0.19 | 0.14 | 0.13 | 0.13 | 0.13 |
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse | 1.01 | 0.28 | 0.11 | 0.065 | 0.044 | 0.041 | 0.040 |
The times look very good and approach those of memchr (beating it on small sizes).
The assembly code of this function is surprisingly long, due to GCC’s loop unrolling. It doesn’t contain calls to other functions, which means that the inline engine performed well.
The unlimited SSE searcher
All functions shown above are written conservatively with regards to the upper boundary of the source array. They don’t access any memory beyond that boundary. If we know that our RAM does not end there, and some bytes are available for reading after the end pointer, we can save on complexity:
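A sketch of this version; note the assumption, stated in the comment, that bytes past the end are readable:

```c
#include <stddef.h>
#include <emmintrin.h>

/* Assumes at least 15 readable bytes past data + count;
   not strictly standard-conforming */
static const char* sse_unlimited(const char* data, char c, size_t count) {
    __m128i pattern = _mm_set1_epi8(c);
    for (size_t i = 0; i < count; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i*)(data + i));
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        if (mask) {
            size_t pos = i + (size_t)__builtin_ctz(mask);
            /* a match past the end doesn't count */
            return pos < count ? data + pos : NULL;
        }
    }
    return NULL;
}
```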
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
sse | 1.01 | 0.28 | 0.11 | 0.065 | 0.044 | 0.041 | 0.040 |
sse_unlimited | 0.88 | 0.26 | 0.11 | 0.068 | 0.050 | 0.049 | 0.048 |
Except for very small arrays, this version didn’t produce any significant improvement, and it’s not quite correct as far as the C standard goes, so we won’t use it in practice. It is still unclear, however, why this happened.
Aligned SSE version
As discussed before, aligned access to memory in SSE operations can potentially be faster than unaligned one, due to better code generation (wider choice of instructions), lack of cache boundary crossing, and, sometimes, naturally (as a feature of the processor). Let’s try to align the pointer before running the main loop. We’ll use the unaligned version for the first loop iteration:
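A sketch of the aligned version (the first, possibly unaligned, 16 bytes are scanned with a loadu; the main loop then uses aligned loads):

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

static const char* sse_aligned(const char* data, char c, size_t count) {
    __m128i pattern = _mm_set1_epi8(c);
    size_t misalign = (uintptr_t)data & 15;
    if (misalign != 0 && count >= 16) {
        /* one unaligned iteration up to the next 16-byte boundary */
        __m128i x = _mm_loadu_si128((const __m128i*)data);
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        if (mask) return data + __builtin_ctz(mask);
        size_t step = 16 - misalign;
        data += step;
        count -= step;
    }
    while (count >= 16) {
        /* data is now 16-aligned: use the aligned load */
        __m128i x = _mm_load_si128((const __m128i*)data);
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        if (mask) return data + __builtin_ctz(mask);
        data += 16;
        count -= 16;
    }
    while (count--) {
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```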
There is some improvement, although not very big:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
sse | 1.01 | 0.28 | 0.11 | 0.065 | 0.044 | 0.041 | 0.040 |
sse_aligned | 1.00 | 0.27 | 0.13 | 0.062 | 0.043 | 0.039 | 0.036 |
Chasing the standard library: unrolling the loop
The standard library’s memchr is still faster, despite all our efforts. What do they do there that makes it so fast?
To see the code of memchr, we run our test program in gdb and print the disassembly of this function (we could instead download the source from the GCC code repository, but it wouldn’t make much difference, since the function is written in assembly). The code is rather long (see here), and generally follows the pattern of our aligned SSE solution, but it contains a portion that looks particularly interesting:
The loop is unrolled four times: we load four SSE registers and perform the comparisons in parallel. Then we combine the four comparison results (as we remember, these are SSE values where 255 indicates equal bytes and 0 means unequal) by performing a byte MAX operation (PMAXUB). Frankly, a bitwise OR (the POR instruction) seems more natural for this purpose, but PMAXUB also works, and there is no performance difference.

Aligned access is also used (the MOVDQA instruction).
Let’s build a similar solution on top of our “Aligned SSE”:
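A sketch of the 4× unrolled main loop with the PMAXUB combine (the alignment prologue and leftover handling are simplified here for brevity):

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

static const char* sse_aligned64(const char* data, char c, size_t count) {
    __m128i pattern = _mm_set1_epi8(c);
    /* main loop: four 16-byte blocks per iteration, aligned loads */
    while (count >= 64 && ((uintptr_t)data & 15) == 0) {
        __m128i e0 = _mm_cmpeq_epi8(
            _mm_load_si128((const __m128i*)data), pattern);
        __m128i e1 = _mm_cmpeq_epi8(
            _mm_load_si128((const __m128i*)(data + 16)), pattern);
        __m128i e2 = _mm_cmpeq_epi8(
            _mm_load_si128((const __m128i*)(data + 32)), pattern);
        __m128i e3 = _mm_cmpeq_epi8(
            _mm_load_si128((const __m128i*)(data + 48)), pattern);
        /* PMAXUB of the four results is non-zero iff any contains 0xFF */
        __m128i any = _mm_max_epu8(_mm_max_epu8(e0, e1),
                                   _mm_max_epu8(e2, e3));
        if (_mm_movemask_epi8(any)) {
            unsigned m0 = (unsigned)_mm_movemask_epi8(e0);
            if (m0) return data + __builtin_ctz(m0);
            unsigned m1 = (unsigned)_mm_movemask_epi8(e1);
            if (m1) return data + 16 + __builtin_ctz(m1);
            unsigned m2 = (unsigned)_mm_movemask_epi8(e2);
            if (m2) return data + 32 + __builtin_ctz(m2);
            unsigned m3 = (unsigned)_mm_movemask_epi8(e3);
            return data + 48 + __builtin_ctz(m3);
        }
        data += 64;
        count -= 64;
    }
    /* leftovers: single 16-byte blocks, then single bytes */
    while (count >= 16) {
        __m128i x = _mm_loadu_si128((const __m128i*)data);
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        if (mask) return data + __builtin_ctz(mask);
        data += 16;
        count -= 16;
    }
    while (count--) {
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```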
On long inputs this version is faster than our previous SSE-based solutions, but still doesn’t reach memchr (except at length 256):
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse | 1.01 | 0.28 | 0.11 | 0.065 | 0.044 | 0.041 | 0.040 |
sse_aligned | 1.00 | 0.27 | 0.13 | 0.062 | 0.043 | 0.039 | 0.036 |
sse_aligned64 | 1.14 | 0.33 | 0.12 | 0.053 | 0.037 | 0.034 | 0.034 |
The AVX solutions
All our SSE solutions can be easily converted to AVX. Let’s try this. We’ll only show the simplest of the AVX solutions, the others can be seen in the repository:
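A sketch of the simplest AVX searcher (PCMPEQB and PMOVMSKB have 256-bit counterparts in AVX2; the target attribute lets GCC compile this without a global -mavx2 flag):

```c
#include <stddef.h>
#include <immintrin.h>   /* AVX2 */

__attribute__((target("avx2")))
static const char* avx(const char* data, char c, size_t count) {
    __m256i pattern = _mm256_set1_epi8(c);       /* c in all 32 bytes */
    while (count >= 32) {
        __m256i x = _mm256_loadu_si256((const __m256i*)data);
        unsigned mask = (unsigned)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(x, pattern));
        if (mask) return data + __builtin_ctz(mask);
        data += 32;
        count -= 32;
    }
    while (count--) {                            /* leftovers */
        if (*data == c) return data;
        ++data;
    }
    return NULL;
}
```

(The real repository versions mirror the SSE family: aligned and aligned64 variants exist as well.)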
The results are quite a bit better than those of SSE, and generally better than memchr. This, however, shouldn’t make us too proud, since we were not using the very latest version of GNU C. Surely later versions contain a version of that library optimised for AVX.
Here are the times:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
sse_aligned64 | 1.14 | 0.33 | 0.12 | 0.053 | 0.037 | 0.034 | 0.034 |
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
avx | 1.18 | 0.27 | 0.10 | 0.045 | 0.028 | 0.024 | 0.023 |
avx_aligned | 1.14 | 0.26 | 0.12 | 0.049 | 0.028 | 0.020 | 0.019 |
avx_aligned64 | 1.15 | 0.26 | 0.12 | 0.041 | 0.024 | 0.018 | 0.017 |
The standard library revisited
We managed to outperform the standard memchr using an unfair advantage (AVX), but haven’t beaten it using normal SSE code, which raises a question: what exactly is the magic in that code?
The only feature of that code that we paid attention to so far was the unrolled loop where it tested four 16-byte blocks at a time. A deeper inspection of this code reveals two more features:

- the code does not contain any single-byte, four-byte or eight-byte operations; everything happens in SSE. No special treatment of leftovers is visible in this code;
- the code starts with a strange sequence:
In other words, we check the alignment of the source pointer with respect to a 64-byte boundary. Why do we do it?
The code turned out to be more interesting and more complex than it looked at first. As it is written in assembly, it can’t be directly converted into C. The closest we could get is shown here:
So what is actually going on here?
1) We check how the pointer is aligned against a 64-byte boundary. The idea here is that, since this is a built-in function, and it is written in assembly anyway, it does not have to follow the C standard strictly. Specifically, it makes use of the fact that the available memory does not start or end at an address not aligned by 64 (actually, the memory boundary is always aligned by 4096, so perhaps we could save there). So, if the alignment is 48 or less, there are definitely 16 bytes available, and we can read them using MOVDQU. We must handle correctly the case when the string is shorter than 16: that’s why we return NULL if the result of bsf is bigger than count.

2) If the alignment is above 48, we read the 16-byte vector at alignment 48 (actually reading data before our pointer). Here we must deal with possible matches before the start (for that we shift eq_mask right by align), and after the end (for that we compare the result of bsf() with count).

3) In both cases, unless the end of the input is detected, we end up at the next 16-byte boundary (in the first case some of the bytes immediately after that boundary have already been tested, so we will perform some redundant computations).

4) Then we test the next 64 bytes, one 16-byte block at a time, using aligned access.

5) If a match hasn’t been found, we read enough 16-byte blocks to get to 64-byte alignment.

6) Then we run the loop reading 64 bytes at a time (as implemented in the sse_aligned64 version).

7) Finally, we deal with the leftovers (up to 32 bytes, possibly followed by up to 32 more).
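The prologue described in steps 1) and 2) can be sketched in C like this (the rest of the search is delegated to the library memchr just to keep the sketch short; reading before the pointer is technically outside the C standard, as discussed):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

/* Handle the first (possibly unaligned) bytes without ever touching
   another 64-byte line past the data */
static const char* sse_memchr_prologue(const char* data, char c,
                                       size_t count) {
    if (count == 0) return NULL;
    __m128i pattern = _mm_set1_epi8(c);
    size_t align = (uintptr_t)data & 63;
    unsigned eq_mask;
    size_t scanned;               /* how many bytes of data we covered */
    if (align <= 48) {
        /* the whole 16-byte read stays within the current 64-byte line */
        __m128i x = _mm_loadu_si128((const __m128i*)data);
        eq_mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern));
        scanned = 16;
    } else {
        /* read the last 16 bytes of the line, starting before the
           pointer, and shift out the matches that precede it */
        size_t back = align - 48;
        __m128i x = _mm_loadu_si128((const __m128i*)(data - back));
        eq_mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(x, pattern))
                  >> back;
        scanned = 16 - back;
    }
    if (eq_mask) {
        size_t pos = (size_t)__builtin_ctz(eq_mask);
        return pos < count ? data + pos : NULL;  /* match past the end? */
    }
    if (scanned >= count) return NULL;
    return memchr(data + scanned, c, count - scanned);
}
```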
Here are the results, which leave mixed feelings. The code is faster than memchr on small sizes, but loses to it on big ones:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
sse_aligned64 | 1.14 | 0.33 | 0.12 | 0.053 | 0.037 | 0.034 | 0.034 |
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse_memchr | 1.19 | 0.30 | 0.092 | 0.058 | 0.038 | 0.033 | 0.033 |
It looks like re-writing the carefully written assembly code in C makes it worse. What if we simplify it?
Here we don’t insist that the 64-byte loop runs over the 64-byte-aligned memory blocks. We also relaxed our alignment requirements from 64 to 4096 for the original test.
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse_memchr | 1.19 | 0.30 | 0.092 | 0.058 | 0.038 | 0.033 | 0.033 |
sse_memchr2 | 1.11 | 0.32 | 0.11 | 0.052 | 0.036 | 0.033 | 0.033 |
The results got slightly worse on sizes 16 and 64, but either improved or stayed the same on all the other sizes, which makes the simplified version viable. It still, however, loses to memchr on long inputs. Why?
Loop unrolling
The code generated for sse_memchr and sse_memchr2 differs quite a bit from that of memchr. The original code is much neater. It is also shorter (although not by much). Let’s show the code size for all the versions presented so far:
Searcher | Code size, bytes |
---|---|
simple_index | 385 |
simple_ptr | 385 |
simple_ptr2 | 241 |
memchr | 836 |
index_scas | 14 |
simple_dword | 977 |
dword | 956 |
dword_bsf | 900 |
simple_qword | 997 |
qword | 1200 |
qword_bsf | 848 |
sse | 1157 |
sse_unlimited | 599 |
sse_aligned | 1716 |
sse_aligned64 | 2326 |
sse_memchr | 1221 |
sse_memchr2 | 1220 |
Obviously, we don’t expect the shortest code to necessarily be the fastest. The index_scas searcher is the most obvious example of the opposite. Our SSE searchers, compared to the simple ones, also demonstrate that longer code is often faster. However, sse_memchr implements the same algorithm as memchr, and yet is 50% longer. This shows that hand-written code can still outperform the code produced even by the best compiler (and GCC is definitely very good).
Why is sse_memchr so long? A brief code inspection shows that the main loop (the one that iterates over 64-byte blocks) has been unrolled. Generally, loop unrolling is a very powerful tool, but here we have already unrolled the main loop 4 times by hand – is there really a point in unrolling it further, dealing with all the possible leftovers?
Let’s try to disable loop unrolling forcibly, using GCC pragmas:
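The GCC construct in question looks like this (shown here as a sketch; the exact placement in the repository may differ):

```c
/* disable GCC's automatic loop unrolling for the functions that follow */
#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")

/* ... searcher definitions go here ... */

#pragma GCC pop_options
```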
We’ll do it for both sse_memchr and sse_memchr2:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse_memchr | 1.19 | 0.30 | 0.092 | 0.058 | 0.038 | 0.033 | 0.033 |
sse_memchr_nounroll | 1.18 | 0.29 | 0.091 | 0.056 | 0.033 | 0.029 | 0.024 |
sse_memchr2 | 1.11 | 0.32 | 0.11 | 0.052 | 0.036 | 0.033 | 0.033 |
sse_memchr2_nounroll | 1.11 | 0.32 | 0.10 | 0.049 | 0.033 | 0.027 | 0.024 |
Finally, we’ve done it. The SSE-based version written in C is faster than the one carefully written by hand in assembly. Moreover, the simplified version (sse_memchr2) doesn’t, generally, perform worse than the original sse_memchr. So this will be our version of choice unless we go for AVX.
The sizes also look better: 775 bytes for sse_memchr_nounroll and 472 for sse_memchr2_nounroll. The C compiler produced shorter code than the assembly!
The AVX version of the standard library
The same approach can be applied to AVX. Here is the AVX version of the sse_memchr2 code, as the shorter one (the macro definitions are omitted here):
The original avx_memchr can be seen in the repository.
And here are the times, compared to the previous results:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
avx_aligned64 | 1.15 | 0.26 | 0.12 | 0.041 | 0.024 | 0.018 | 0.017 |
avx_memchr | 1.24 | 0.29 | 0.085 | 0.045 | 0.026 | 0.018 | 0.017 |
avx_memchr_nounroll | 1.24 | 0.29 | 0.088 | 0.042 | 0.027 | 0.024 | 0.019 |
avx_memchr2 | 1.23 | 0.28 | 0.098 | 0.041 | 0.025 | 0.018 | 0.017 |
avx_memchr2_nounroll | 1.23 | 0.28 | 0.092 | 0.038 | 0.026 | 0.023 | 0.019 |
Surprisingly, the no-unroll versions aren’t faster than the unrolled ones, and, in general, these versions aren’t faster than avx_aligned64. I won’t investigate this further, though; we have already achieved a lot.
Summary
Here is the total summary of all the methods considered so far, excluding AVX:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
Null | 0.63 | 0.14 | 0.035 | 0.009 | 0.002 | 0.000 | 0.000 |
Simple | 1.24 | 0.61 | 0.47 | 0.44 | 0.44 | 0.43 | 0.42 |
Simple Ptr | 1.20 | 0.61 | 0.47 | 0.43 | 0.43 | 0.42 | 0.42 |
Simple Ptr2 | 1.17 | 0.53 | 0.43 | 0.41 | 0.42 | 0.41 | 0.41 |
SCAS | 7.12 | 2.26 | 1.19 | 0.93 | 0.86 | 0.85 | 0.85 |
Simple dword | 0.99 | 0.52 | 0.42 | 0.40 | 0.41 | 0.39 | 0.39 |
Dword | 1.51 | 0.66 | 0.48 | 0.45 | 0.39 | 0.38 | 0.38 |
Dword BSF | 1.28 | 0.58 | 0.40 | 0.36 | 0.32 | 0.29 | 0.29 |
Simple qword | 0.93 | 0.46 | 0.42 | 0.41 | 0.42 | 0.41 | 0.41 |
Qword | 1.59 | 0.49 | 0.24 | 0.19 | 0.16 | 0.15 | 0.15 |
Qword BSF | 1.23 | 0.42 | 0.19 | 0.14 | 0.13 | 0.13 | 0.13 |
memchr | 1.25 | 0.32 | 0.10 | 0.063 | 0.035 | 0.031 | 0.027 |
sse | 1.01 | 0.28 | 0.11 | 0.065 | 0.044 | 0.041 | 0.040 |
sse_unlimited | 0.88 | 0.26 | 0.11 | 0.068 | 0.050 | 0.049 | 0.048 |
sse_aligned | 1.00 | 0.27 | 0.13 | 0.062 | 0.043 | 0.039 | 0.036 |
sse_aligned64 | 1.14 | 0.33 | 0.12 | 0.053 | 0.037 | 0.034 | 0.034 |
sse_memchr | 1.19 | 0.30 | 0.092 | 0.058 | 0.038 | 0.033 | 0.033 |
sse_memchr2 | 1.11 | 0.32 | 0.11 | 0.052 | 0.036 | 0.033 | 0.033 |
sse_memchr_nounroll | 1.18 | 0.29 | 0.091 | 0.056 | 0.033 | 0.029 | 0.024 |
sse_memchr2_nounroll | 1.11 | 0.32 | 0.10 | 0.049 | 0.033 | 0.027 | 0.024 |
As usual, the first, the second and the third place in each category are marked green, yellow and red.
And here are the AVX ones:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
avx | 1.18 | 0.27 | 0.10 | 0.045 | 0.028 | 0.024 | 0.023 |
avx_aligned | 1.14 | 0.26 | 0.12 | 0.049 | 0.028 | 0.020 | 0.019 |
avx_aligned64 | 1.15 | 0.26 | 0.12 | 0.041 | 0.024 | 0.018 | 0.017 |
avx_memchr | 1.24 | 0.29 | 0.085 | 0.045 | 0.026 | 0.018 | 0.017 |
avx_memchr_nounroll | 1.24 | 0.29 | 0.088 | 0.042 | 0.027 | 0.024 | 0.019 |
avx_memchr2 | 1.23 | 0.28 | 0.098 | 0.041 | 0.025 | 0.018 | 0.017 |
avx_memchr2_nounroll | 1.23 | 0.28 | 0.092 | 0.038 | 0.026 | 0.023 | 0.019 |
With the exception of size 4, the AVX versions are always faster than the SSE ones, although they are never twice as fast, as one might have expected.
In absolute terms, the results don’t look bad: the achieved speed was about 41 bytes per nanosecond (17 bytes per cycle) for SSE, and a bit higher for AVX (59 and 24 bytes).
When we were searching for byte strings in Java (in “Searching for a better indexOf”), the typical times per byte varied depending on the nature of the data and the length of the pattern, and for small patterns (100 bytes) were about 0.04 – 0.10 ns per byte, which makes a search based on the first character a viable option. Obviously, searching for a string also requires string comparison, but it is worth trying anyway.
Results on Windows
For the sake of completeness, let’s run the same test on Windows. Unfortunately, two variables change at once. First, the compiler (Microsoft Visual Studio 2017, MSVC 19.10.25019). And then the hardware: a notebook with an i7-6820HQ @ 2.7 GHz (Windows 10).
Here are the abridged results:
Searcher | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
---|---|---|---|---|---|---|---|
null_index | 0.49 | 0.11 | 0.027 | 0.007 | 0.002 | 0.001 | 0.000 |
simple_index | 1.27 | 0.60 | 0.43 | 0.46 | 0.40 | 0.38 | 0.38 |
simple_ptr | 1.24 | 0.84 | 0.72 | 0.72 | 0.70 | 0.69 | 0.69 |
dword_bsf | 0.75 | 0.36 | 0.28 | 0.26 | 0.23 | 0.22 | 0.22 |
qword_bsf | 0.78 | 0.30 | 0.13 | 0.13 | 0.11 | 0.10 | 0.10 |
memchr | 0.71 | 0.16 | 0.072 | 0.051 | 0.045 | 0.045 | 0.043 |
sse | 0.68 | 0.18 | 0.083 | 0.045 | 0.042 | 0.041 | 0.039 |
sse_aligned64 | 0.81 | 0.22 | 0.072 | 0.032 | 0.022 | 0.022 | 0.022 |
sse_memchr2 | 0.68 | 0.21 | 0.075 | 0.032 | 0.024 | 0.022 | 0.022 |
avx | 0.78 | 0.15 | 0.051 | 0.028 | 0.030 | 0.022 | 0.020 |
avx_aligned64 | 0.71 | 0.16 | 0.069 | 0.025 | 0.013 | 0.011 | 0.012 |
avx_memchr2 | 0.68 | 0.16 | 0.063 | 0.030 | 0.016 | 0.012 | 0.011 |
The main observations are:
- in MSVC, there is a difference between indexing and pointer arithmetic – indexing works much faster. What a shame;
- the DWORD-based solution runs at more than twice the speed of the simple byte-based one, and the QWORD-based solution runs twice as fast as that;
- the version of memchr that comes with the standard library is not really that fast at all: it loses even to the very basic SSE solution (while still beating the simple solution by a factor of 9), and is twice as slow as the 64-aligned SSE version;
- our fancy SSE and AVX versions are by far the best; however, the aligned64 versions are good enough;
- specifically, the aligned64 AVX version runs twice as fast as any SSE version; one kind of expects this, but it only happens here, with MSVC on Windows.
Conclusions
- C provides more options than Java for performing low-level tasks such as searching for a character. Some of them may improve performance quite a bit;
- indexing does not differ from pointer arithmetic in performance when scanning byte arrays in C;
- simple solutions, which inspect every byte, are quite slow;
- old-style hardware acceleration (SCAS) is even slower;
- reading and processing more than one byte at a time helps, especially in combination with the BSF instruction;
- however, using SSE and AVX provides a real breakthrough. The speed can be as much as 24 times higher than that of the simple solutions;
- the memchr function from the GCC standard library is well-written in assembly and performs very well, so in most practical cases it is the best and the easiest option to use;
- we still managed to outperform it – by 10% with SSE and by 35% with AVX. However, the GCC standard library is evolving with time and is likely to catch up;
- I’m not so sure about the MSVC library, which we managed to outperform by 100% while staying with SSE; AVX offered another 100%;
- in the process we learned some tricks of working with SSE, e.g. reading extra bytes after or before the data of interest helps avoid using any non-SSE operations;
- another trick is performing several comparisons at a time, combining the results using an OR or MAX operation;
- generic loop unrolling as performed by GCC may harm performance, especially if manual unrolling has already been done.
Things to do
Now it’s time to look at string searching in C/C++.
Comments are welcome below or at reddit.