Experiments in program optimisation

Making a char searcher in C

Tags: C optimisation search

26 Sep 2019

In the previous article (“Searching for a better indexOf”) we studied various methods of searching for a byte sequence inside another byte sequence in Java. We want to try the same in C++. All of our Java approaches are applicable in C++ as well, but the common idea is that you can do better in C++. Not because C++ compilers are better (these days that is a hotly debated question), but because C++ allows better access to the computer’s low-level operations, such as direct inline assembly programming.

Specifically, one of the options we have in mind is to find the first character of the pattern using some techniques not available in Java, and then compare the patterns. If this can be performed very fast, we can forget about our fancy search methods, such as shift tables and search trees.

So, today we’ll be studying a very narrow technical question: how fast can we find a single character in a long character array? The problem looks very simple; some of the solutions we’ll be looking at, however, are not simple at all.

Obviously, there are readily available library functions for the job, which we are unlikely to outperform – they are usually well-designed and highly optimised. We’ll measure them, too, and check how well our DIY implementations approach them.

Just for a change, we’ll write our tests in C, or, more precisely, in C99.

The code is available in this repository.

The test

The library function in C to search for a character in a memory block looks like this:

void* memchr (const void* ptr, int ch, size_t count);

Here count is the number of bytes to look at. The function returns the position of the character if found, or NULL if not. We’ll follow the same signature. Our searchers will satisfy the type

typedef void* Searcher (const void* ptr, int ch, size_t count);

We’ll allocate a big enough block, fill it with some non-matching data, and run several tests where the char to search will be placed at various distances from the start. The measured value will be the time it took to search, in nanoseconds per byte searched.

One important variable in a test like this is data alignment. It affects data reading with all techniques except single-byte access. It is especially important for SSE, where most instructions require pointers aligned by 16. That’s why we’ll run an alignment-averaging test: we’ll allocate our array at a 64-byte boundary (using _mm_malloc), and run our tests at alignments 0 to 63, averaging the results.

Another variable is the exact choice of the text length. Some solutions involve loop unrolling or multi-byte operations and may be sensitive to the alignment of this length. That’s why we’ll run tests with multiple lengths around the published central value, and average the results.

Strictly speaking, another variable is the size of the entire array: if we place the required byte at position 4, it may matter whether we call memchr with a total length of 8 or of 8192; however, we’ll ignore this for now and always call the function with some big number.
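
Before we start, here is a minimal sketch of what such a measurement harness can look like. Everything in it is illustrative: the function name measure, the buffer size, the repetition count and the timing code are our assumptions, not the actual code from the repository.

/* Compile with -std=gnu99 (or define _POSIX_C_SOURCE) for clock_gettime. */
#include <immintrin.h>   /* _mm_malloc, _mm_free */
#include <string.h>
#include <time.h>

typedef void* Searcher (const void* ptr, int ch, size_t count);

/* Times one searcher for one distance to the match (the "bytes searched"
   of the tables below), averaging over source alignments 0..63.
   Returns nanoseconds per byte searched. */
static double measure (Searcher * f, size_t dist)
{
    enum { BUF = 1 << 20, REPS = 1000 };
    char * base = (char *) _mm_malloc (BUF + 64, 64);  /* 64-byte aligned */
    const void * volatile sink;                        /* keeps the calls alive */
    double total_ns = 0;

    for (size_t align = 0; align < 64; align ++) {
        char * p = base + align;
        memset (p, 'A', BUF);                          /* non-matching data */
        p [dist - 1] = 'B';                            /* the byte we search for */

        struct timespec t0, t1;
        clock_gettime (CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i ++)
            sink = f (p, 'B', BUF);                    /* count is "some big number" */
        clock_gettime (CLOCK_MONOTONIC, &t1);

        total_ns += (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    }
    (void) sink;
    _mm_free (base);
    return total_ns / REPS / 64 / dist;   /* the real harness also averages
                                             over nearby lengths */
}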

The null test

First, we want to know the cost of our measurement framework. We’ll use a null searcher for that:

void* null_index(const void* ptr, int ch, size_t count)
{
    return NULL;
}

Here are the times (the column headers are the numbers of bytes searched):

Searcher   4     16    64     256     1024    4096    16384
Null       0.63  0.14  0.035  0.0086  0.0022  0.0005  0.0001

The simple index-based searcher

This is, probably, the simplest solution one can think of:

void* simple_index (const void* ptr, int ch, size_t count)
{
    const char * p = (const char *) ptr;
    for (size_t i = 0; i < count; i ++) {
        if (p [i] == (char) ch) return (void*) (p + i);
    }
    return NULL;
}

Here are the times:

Searcher   4     16    64     256     1024    4096    16384
Null       0.63  0.14  0.035  0.0086  0.0022  0.0005  0.0001
Simple     1.24  0.61  0.47   0.44    0.44    0.43    0.42

As expected, the test framework is introducing significant overhead on small arrays, and has virtually no influence on big ones.

Here is the percentage of time the test framework takes:

Fraction    4      16     64    256   1024  4096  16384
Test code   50.8%  23.0%  7.4%  2.0%  0.5%  0.1%  0.02%

We’ll have to keep this overhead in mind when working with small input sizes.

In absolute terms, the results don’t look great: 0.42 ns is about one CPU cycle. Surely we must be able to do better than this.

The pointer-based simple searcher

These days, using indices is considered normal, thanks to modern processor design, which makes indexing cheap, and to sophisticated compilers, which can convert indexing into pointer manipulation and the other way around, whichever is best for the target architecture.

In the old days, however, pointer arithmetic was praised as the winning optimisation technique and one of the main selling points of a language like C compared to something like Pascal, Ada, Algol-68 or PL/1 (today, Java would top this list).

Let’s try using pointer arithmetic and see if it makes any difference:

void* simple_ptr(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    while (count) {
        if (*p == (char) ch) return (void*) p;
        ++p;
        --count;
    }
    return NULL;
}

And here are the results:

Searcher         4     16    64    256   1024  4096  16384
Simple           1.24  0.61  0.47  0.44  0.44  0.43  0.42
Simple Pointer   1.20  0.61  0.47  0.43  0.43  0.42  0.42

There is no difference outside of the measurement error margin. Let’s try another approach:

void* simple_ptr2(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    const char * q = p + count;
    for (; p != q; p++) {
        if (*p == (char)ch) return (void*)p;
    }
    return NULL;
}

The results are a bit better on small sizes:

Searcher            4     16    64    256   1024  4096  16384
Simple Pointer      1.20  0.61  0.47  0.43  0.43  0.42  0.42
Simple Pointer v2   1.17  0.53  0.43  0.41  0.42  0.41  0.41

The improvement, however, isn’t worth the bother. It is really a matter of taste which approach to choose.

The standard searcher

Now, having tested the simplest approaches, let’s see what the standard library has to offer. We’ll measure memchr:

Searcher   4     16    64    256    1024   4096   16384
Simple     1.24  0.61  0.47  0.44   0.44   0.43   0.42
memchr     1.25  0.32  0.10  0.063  0.035  0.031  0.027

The results are much, much better. The speedup grows with the array size, eventually reaching a factor of 15. The absolute result is also impressive: our processor runs at 2.4 GHz, so 0.027 ns/byte means 0.065 cycles/byte, or 15.4 bytes per cycle. This looks so fast that it’s unlikely to be beaten, no matter what we do. It also suggests that SSE is involved, so we’ll try that later.

How close can we come to these results with home-made searchers?

The SCAS-based searcher

Since day one of the Intel x86 family of processors (namely, since the 8086), there has been a string search instruction, SCAS. It compares al to the byte pointed to by [edi], increments that pointer and decrements cx, which makes it suitable for use with the REP prefixes.

There are rumours that these instructions are not so efficient anymore. However, there are cases when REP MOVS is still the most efficient way to copy a block of bytes (although even there it is MOVSQ rather than MOVSB). Let’s measure. We’ll write the function straight in assembly:

        .p2align 5,,31
        .globl	index_scas
        .type	index_scas, @function

# const void* index_scas(const void* ptr, int ch, size_t count)
# RDI = ptr
# RSI = ch
# RDX = count

index_scas:
        .cfi_startproc
        mov     %esi, %eax
        mov     %rdx, %rcx
        repne scasb
        jnz     not_found
        lea     -1(%rdi), %rax
        ret
not_found:
        xorq    %rax, %rax
        ret
	.cfi_endproc

This can, probably, be improved by replacing the conditional jump with cmove or setz, but this instruction is only executed once after the end of the search. It can only affect performance on very short repetition counts. Here are the results we get:

Searcher   4     16    64    256   1024  4096  16384
Simple     1.24  0.61  0.47  0.44  0.44  0.43  0.42
SCAS       7.12  2.26  1.19  0.93  0.86  0.85  0.85

The solution is really slow, much slower even than the simple loop. It gets a bit faster on longer arrays, but it’s unlikely ever to catch up. This probably means that the rumours were correct, and we can forget about SCAS-based solutions. This instruction is there as a 40-year-old legacy and is not intended for any practical use today. So enjoy your retirement, good old SCAS.

The simple DWORD searcher

This is a variation of the Simple Pointer searcher, reading four bytes from memory at a time instead of one:

void* simple_dword(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, p, 4);
        for (size_t i = 0; i < 4; i++) {
            if ((char)v == (char)ch) return (void *)(p + i);
            v >>= 8;
        }
        p += 4;
        count -= 4;
    }
    while (count) {
        if (*p == ch) return (void*)p;
        ++p;
        --count;
    }
    return NULL;
}

This version is x86-specific. It doesn’t put any alignment requirements on the input (reading data using memcpy takes care of that; see “A bug story: data alignment on x86”). However, it makes use of the little-endian nature of the x86. Some additional endianness checks would be necessary if we ever wanted to convert it into an industrial solution, and it’s not going to be completely portable, for neither C99 nor C++ has good built-in endianness support.
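
For illustration, here is the kind of check such an industrial version might start from. This is only a sketch: the helper name is ours, and a real solution would more likely rely on a compile-time test (such as GCC’s __BYTE_ORDER__) where available.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns true on little-endian targets; a decent compiler folds the
   whole function into a constant. */
static bool is_little_endian (void)
{
    const uint32_t one = 1;
    uint8_t first_byte;
    memcpy (&first_byte, &one, 1);
    return first_byte == 1;
}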

Let’s check the speed:

Searcher       4     16    64    256   1024  4096  16384
Simple         1.24  0.61  0.47  0.44  0.44  0.43  0.42
Simple dword   0.99  0.52  0.42  0.40  0.41  0.39  0.39

The speed is a bit higher (10-20%). This shows that even reading from well-cached memory isn’t as fast as accessing the registers. However, this improvement is clearly not sufficient.

Note that this solution (and most of the solutions that follow) is very strict regarding its memory access. It never reads a single byte past the memory area it was called to search in (or, for that matter, before that area – we’ll see later that this can be useful, too). If we could be certain that there are always a few bytes available past the end, we could avoid the trailing loop. This, however, would only make a visible difference on very short strings.

The DWORD searcher with zeroes detection

In the previous solution we read a double-word (32 bits) from memory and then fetched the bytes one by one from that double-word to find a match. We could save time if we could quickly establish whether there is any match in the double-word at all.

Let’s first make a mask that contains the byte we are searching for in each of its bytes, and XOR the loaded value with that mask:

    uint32_t value = ch * 0x01010101;
    uint32_t v;
    memcpy(&v, p, 4);
    v ^= value;

Now we need to find out if any byte in v is zero. The answer comes from the famous Bit Twiddling Hacks page.

This page offers two solutions. The first one uses five operations:

#define hasZeroByte(v) ~(((((v) & 0x7F7F7F7F) + 0x7F7F7F7F) | (v)) | 0x7F7F7F7F)

The second solution is less accurate but uses only four operations:

#define hasZeroByte2(v) (((v) - 0x01010101) & ~(v) & 0x80808080)

Again, we treat the high bit of every byte separately. The ~(v) & 0x80808080 portion requires the high bit of the source byte to be zero to produce 0x80. The (v) - 0x01010101 part subtracts one from every byte, which sets its high bit to one in three cases:

- the byte’s high bit was already set and the byte wasn’t exactly 0x80;
- the byte was zero, so subtracting one wraps it around to 0xFF;
- the byte was one and a borrow arrived from the byte below it.

In the last two cases a borrow bit is set.

The borrow bits can distort the picture, but for them to appear in the first place, a zero byte must have occurred somewhere before. This means that this method also allows us to establish whether the source contains zero bytes, but it does not show where they are. Here are some examples:

v         hasZeroByte  hasZeroByte2
00000000  80808080     80808080
10100020  00008000     00008000
01010100  00000080     80808080
01000001  00808000     80808000
01010001  00008000     80808000

The second method is sufficient for our purpose, so our next solution will be based on it:

void* dword(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    uint32_t value = ch * 0x01010101L;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, p, 4);
        v ^= value;
        if ((v - 0x01010101L) & ~v & 0x80808080L) {
            for (size_t i = 0; i < 4; i++) {
                if (!(char)v) return (void *)(p + i);
                v >>= 8;
            }
        }
        p += 4;
        count -= 4;
    }
    while (count) {
        if (*p == ch) return (void*)p;
        ++p;
        --count;
    }
    return NULL;
}

The results, however, haven’t improved:

Searcher       4     16    64    256   1024  4096  16384
Simple         1.24  0.61  0.47  0.44  0.44  0.43  0.42
Simple dword   0.99  0.52  0.42  0.40  0.41  0.39  0.39
Dword          1.51  0.66  0.48  0.45  0.39  0.38  0.38

The BSF Dword searcher

When we finally find a doubleword that contains our byte, is there any way to locate this byte quickly, without running a loop? The methods described above produce 0x80 in the bytes where zeroes were. The first one writes 0x00 into all the other bytes, while the second one may fill some of the others with 0x80 as well. This, however, requires a borrow bit to be generated, which only happens to the left (that is, on the higher side) of the zero byte. This means that on little-endian processors both methods are suitable for locating the first zero byte. All that’s needed is to find the least significant non-zero byte in the computed value. The Bit Twiddling Hacks page has code for that, too, but these days it is unnecessary, since we have an efficient CPU instruction, BSF (bit scan forward). It was not always fast, but now it is. It returns the position of the rightmost (least significant) bit set in a number.

void* dword_bsf(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    uint32_t value = ch * 0x01010101L;
    while (count >= 4) {
        uint32_t v;
        memcpy(&v, p, 4);
        v ^= value;
        uint32_t bits = (v - 0x01010101L) & ~v & 0x80808080L;
        if (bits) {
            return (void *)(p + (bsf(bits) / 8));
        }
        p += 4;
        count -= 4;
    }
    while (count) {
        if (*p == ch) return (void*)p;
        ++p;
        --count;
    }
    return NULL;
}

Here bsf() is our own wrapper that takes care of different intrinsics in different compilers:

size_t bsf(uint32_t bits)
{
#ifdef __GNUC__
    return __builtin_ctz(bits);
#else
#ifdef _MSC_VER
    unsigned long index;
    _BitScanForward(&index, bits);
    return index;
#else
#error Undefined BSF
#endif
#endif
}

Searcher       4     16    64    256   1024  4096  16384
Simple dword   0.99  0.52  0.42  0.40  0.41  0.39  0.39
Dword          1.51  0.66  0.48  0.45  0.39  0.38  0.38
Dword BSF      1.28  0.58  0.40  0.36  0.32  0.29  0.29

This trick helped, except for very small sizes.

Quadword searchers

On a 64-bit architecture we can do the same using 64-bit words (qwords). The code looks pretty much the same, so we’ll only show one piece here – that of qword_bsf:

void* qword_bsf(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    uint64_t value = ch * 0x0101010101010101ULL;
    while (count >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);
        v ^= value;
        uint64_t bits = (v - 0x0101010101010101ULL)
                        & ~v
                        & 0x8080808080808080ULL;
        if (bits) {
            return (void *)(p + (bsf_64(bits) / 8));
        }
        p += 8;
        count -= 8;
    }
    return dword_bsf(p, ch, count);
}

This code makes use of the dword_bsf implementation for the leftovers (when there are fewer than 8 bytes left), so it will benefit from a good inlining engine, if the compiler has one.
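
It also relies on bsf_64, a 64-bit counterpart of the bsf wrapper shown above, which the article doesn’t list. A likely implementation, mirroring bsf (and assuming an x64 build for the MSVC branch, where _BitScanForward64 is available), is:

size_t bsf_64(uint64_t bits)
{
#ifdef __GNUC__
    return __builtin_ctzll(bits);
#else
#ifdef _MSC_VER
    unsigned long index;
    _BitScanForward64(&index, bits);   /* requires <intrin.h> */
    return index;
#else
#error Undefined BSF
#endif
#endif
}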

Here are the results:

Searcher       4     16    64    256   1024  4096  16384
Simple qword   0.93  0.46  0.42  0.41  0.42  0.41  0.41
Qword          1.59  0.49  0.24  0.19  0.16  0.15  0.15
Qword BSF      1.23  0.42  0.19  0.14  0.13  0.13  0.13

The simple qword is running pretty much at the same speed as the simple dword, if not slower. However, the other two show good improvement, especially the BSF version. So wider memory access plus wider operations really help. This makes it attractive to try SSE.

The SSE searcher

The SSE (starting from SSE2) can operate on integer components of various sizes, including individual bytes. We can thus load 16 bytes into an SSE register and compare them all with the character we search. The compare instruction (PCMPEQB) sets corresponding bytes in the result to 0xFF if equality is detected and to 0 if not. To locate the first equal byte, we apply PMOVMSKB, which sets bits in a normal register to one or zero depending on the high bit of each byte, and then apply BSF to that register.

The endianness works out just right: the byte of an XMM register that comes first in memory produces the lowest bit of the mask, for which BSF will return zero.

A comment must be made on data alignment. Generally, SSE prefers 16-byte alignment. All operations with memory operands and all normal loads (MOVAPS, MOVDQA) require this alignment, and only the special unaligned loads (MOVUPS, MOVDQU) can do without it. In older generations of processors, MOVUPS was much slower than MOVAPS even when the data was aligned. This is no longer the case, with one exception: loads crossing a cache line boundary are still a bit slow. In this version we’ll go for simplicity and use unaligned loads:

void* sse(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    __m128i value = _mm_set1_epi8((char)ch);
    while (count >= 16) {
        __m128i v = _mm_loadu_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf(eq_mask));
        }
        p += 16;
        count -= 16;
    }
    return qword_bsf(p, ch, count);
}

This function reverts to qword_bsf for leftovers.

Here are the times:

Searcher    4     16    64    256    1024   4096   16384
Qword BSF   1.23  0.42  0.19  0.14   0.13   0.13   0.13
memchr      1.25  0.32  0.10  0.063  0.035  0.031  0.027
sse         1.01  0.28  0.11  0.065  0.044  0.041  0.040

The times look very good and approach those of memchr (beating it on small sizes).

The assembly code of this function is surprisingly long, due to GCC’s loop unrolling. It doesn’t contain calls to other functions, which means that the inline engine performed well.

The unlimited SSE searcher

All functions shown above are written conservatively with regards to the upper boundary of the source array. They don’t access any memory beyond that boundary. If we know that our RAM does not end there, and some bytes are available for reading after the end pointer, we can save on complexity:

void* sse_unlimited(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    __m128i value = _mm_set1_epi8((char)ch);
    for (ptrdiff_t cnt = (ptrdiff_t)count; cnt > 0; cnt -= 16, p += 16) {
        __m128i v = _mm_loadu_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            size_t offset = bsf(eq_mask);
            return offset >= (size_t) cnt ? NULL : (void *)(p + offset);
        }
    }
    return NULL;
}

Searcher        4     16    64    256    1024   4096   16384
sse             1.01  0.28  0.11  0.065  0.044  0.041  0.040
sse_unlimited   0.88  0.26  0.11  0.068  0.050  0.049  0.048

Except for very small arrays, this version didn’t produce any significant improvement, and it’s not quite correct as far as the C standard goes, so we won’t use it in practice. It is still unclear, however, why it didn’t help.

Aligned SSE version

As discussed before, aligned memory access in SSE operations can potentially be faster than unaligned access, due to better code generation (a wider choice of instructions), the absence of cache line boundary crossing and, sometimes, simply because the processor handles aligned loads faster. Let’s try to align the pointer before running the main loop. We’ll use an unaligned load for the first iteration:

void* sse_aligned(const void* ptr, int ch, size_t count)
{
    if (count < 16) return qword_bsf(ptr, ch, count);
        
    const char * p = (const char *)ptr;
    __m128i value = _mm_set1_epi8((char)ch);

    size_t align = (size_t)p & 0x0F;
    if (align > 0) {
        align = 16 - align;
        __m128i v = _mm_loadu_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf(eq_mask));
        }
        p += align;
        count -= align;
    }
    while (count >= 16) {
        __m128i v = _mm_load_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf(eq_mask));
        }
        p += 16;
        count -= 16;
    }
    return qword_bsf(p, ch, count);
}

There is some improvement, although not very big:

Searcher      4     16    64    256    1024   4096   16384
sse           1.01  0.28  0.11  0.065  0.044  0.041  0.040
sse_aligned   1.00  0.27  0.13  0.062  0.043  0.039  0.036

Chasing the standard library: unrolling the loop

The standard library’s function memchr is still faster, despite all our effort. What do they do there that makes it so fast?

To see the code of memchr, we run our test program in gdb and print the disassembly of this function (we could instead download the code from the GCC code repository, but it wouldn’t make much difference, since the code is written in assembly). The code is rather long (see here), and generally follows the pattern of our aligned SSE solution, but it contains a portion that looks particularly interesting:

   <+410>:   movdqa xmm0,XMMWORD PTR [rdi]
   <+414>:   movdqa xmm2,XMMWORD PTR [rdi+0x10]
   <+419>:   movdqa xmm3,XMMWORD PTR [rdi+0x20]
   <+424>:   movdqa xmm4,XMMWORD PTR [rdi+0x30]
   <+429>:   pcmpeqb xmm0,xmm1
   <+433>:   pcmpeqb xmm2,xmm1
   <+437>:   pcmpeqb xmm3,xmm1
   <+441>:   pcmpeqb xmm4,xmm1
   <+445>:   pmaxub xmm3,xmm0
   <+449>:   pmaxub xmm4,xmm2
   <+453>:   pmaxub xmm4,xmm3
   <+457>:   pmovmskb eax,xmm4
   <+461>:   add    rdi,0x40
   <+465>:   test   eax,eax
   <+467>:   je     0x7ffff7aa7920 <memchr+400>

The loop is unrolled four times: we load four SSE registers and perform the comparisons in parallel. Then we combine the four comparison results (as we remember, these are SSE values where 255 indicates an equal byte and 0 an unequal one) by performing a byte MAX operation (PMAXUB). Frankly, a bitwise OR (the POR instruction) seems more natural for this purpose, but PMAXUB works too, and there is no performance difference.

The aligned access is also used (MOVDQA instruction).
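
For reference, the OR-based combination mentioned above is a one-line change when written with intrinsics. This is just a sketch of the alternative (the helper name is ours), not what memchr actually uses:

#include <emmintrin.h>

/* Combine four PCMPEQB results with POR instead of PMAXUB: the returned
   mask is non-zero iff any of the 64 compared bytes matched. */
static inline int any_match_mask(__m128i m0, __m128i m1, __m128i m2, __m128i m3)
{
    __m128i any = _mm_or_si128(_mm_or_si128(m0, m1), _mm_or_si128(m2, m3));
    return _mm_movemask_epi8(any);
}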

Let’s build a similar solution on top of our “Aligned SSE”:

void* sse_aligned64(const void* ptr, int ch, size_t count)
{
    if (count < 16) return qword_bsf(ptr, ch, count);

    const char * p = (const char *)ptr;
    __m128i value = _mm_set1_epi8((char)ch);

    size_t align = (size_t)p & 0x0F;
    if (align > 0) {
        align = 16 - align;
        __m128i v = _mm_loadu_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf(eq_mask));
        }
        p += align;
        count -= align;
    }
    while (count >= 64) {
        __m128i v0 = _mm_load_si128((__m128i *) p);
        __m128i v1 = _mm_load_si128((__m128i *) (p + 16));
        __m128i v2 = _mm_load_si128((__m128i *) (p + 32));
        __m128i v3 = _mm_load_si128((__m128i *) (p + 48));

        __m128i m0 = _mm_cmpeq_epi8(v0, value);
        __m128i m1 = _mm_cmpeq_epi8(v1, value);
        __m128i m2 = _mm_cmpeq_epi8(v2, value);
        __m128i m3 = _mm_cmpeq_epi8(v3, value);

        __m128i max_val = _mm_max_epu8(_mm_max_epu8(m0, m1),
                                       _mm_max_epu8(m2, m3));

        int eq_mask = _mm_movemask_epi8(max_val);
        if (eq_mask) {
            eq_mask = _mm_movemask_epi8(m0);
            if (eq_mask) {
                return (void *)(p + bsf(eq_mask));
            }
            eq_mask = _mm_movemask_epi8(m1);
            if (eq_mask) {
                return (void *)(p + bsf(eq_mask) + 16);
            }
            eq_mask = _mm_movemask_epi8(m2);
            if (eq_mask) {
                return (void *)(p + bsf(eq_mask) + 32);
            }
            eq_mask = _mm_movemask_epi8(m3);
            return (void *)(p + bsf(eq_mask) + 48);
        }
        p += 64;
        count -= 64;
    }
    while (count >= 16) {
        __m128i v = _mm_load_si128((__m128i *) p);
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf(eq_mask));
        }
        p += 16;
        count -= 16;
    }
    return qword_bsf(p, ch, count);
}

On long inputs this version is faster than our previous SSE-based solutions, but it still doesn’t reach memchr (except at length 256):

Searcher        4     16    64    256    1024   4096   16384
memchr          1.25  0.32  0.10  0.063  0.035  0.031  0.027
sse             1.01  0.28  0.11  0.065  0.044  0.041  0.040
sse_aligned     1.00  0.27  0.13  0.062  0.043  0.039  0.036
sse_aligned64   1.14  0.33  0.12  0.053  0.037  0.034  0.034

The AVX solutions

All our SSE solutions can be easily converted to AVX. Let’s try this. We’ll only show the simplest of the AVX solutions, the others can be seen in the repository:

void* avx(const void* ptr, int ch, size_t count)
{
    const char * p = (const char *)ptr;
    __m256i value = _mm256_set1_epi8((char)ch);
    while (count >= 32) {
        __m256i v = _mm256_loadu_si256((__m256i *) p);
        int eq_mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, value));
        if (eq_mask) {
            return (void *)(p + bsf_64(eq_mask));
        }
        p += 32;
        count -= 32;
    }
    return sse(p, ch, count);
}

The results are quite a bit better than those of SSE, and generally better than memchr. This, however, mustn’t make us too proud, since we are not using the very latest version of GNU C. Surely, later versions contain a version of this library optimised for AVX.

Here are the times:

Searcher        4     16    64    256    1024   4096   16384
sse_aligned64   1.14  0.33  0.12   0.053  0.037  0.034  0.034
memchr          1.25  0.32  0.10   0.063  0.035  0.031  0.027
avx             1.18  0.27  0.10   0.045  0.028  0.024  0.023
avx_aligned     1.14  0.26  0.12   0.049  0.028  0.020  0.019
avx_aligned64   1.15  0.26  0.12   0.041  0.024  0.018  0.017

The standard library revisited

We managed to outperform the standard memchr using an unfair advantage (AVX), but haven’t beaten it using normal SSE code, which raises a question: what exactly is the magic in that code?

The only feature of that code that we paid attention to so far was the unrolled loop where it tested four 16-byte blocks at a time. A deeper inspection of this code reveals two more features:

   <+5>:     mov    rcx,rdi
             ...
   <+25>:    and    rcx,0x3f
   <+34>:    cmp    rcx,0x30
   <+38>:    ja     0x7ffff7aa7800 <memchr+112>

In other words, we check the alignment of the source pointer with respect to a 64-byte boundary. Why do we do it?

The code happened to be more interesting and more complex than it looked at first. As it is written in assembly, it can’t be directly converted into C. The closest we could get is shown here:

#define TEST16_cmp_known_nz(p, cmp) \
    do {\
        int eq_mask = _mm_movemask_epi8(cmp);\
        return (void *)((p)+bsf(eq_mask));\
    } while (0)

#define TEST16_cmp(p, cmp) \
    do {\
        int eq_mask = _mm_movemask_epi8(cmp);\
        if (eq_mask) {\
            return (void *)((p)+bsf(eq_mask));\
        }\
    } while (0)

#define TEST16_v(p, v) TEST16_cmp (p, _mm_cmpeq_epi8(v, value))
#define TEST16(p) TEST16_v (p, _mm_load_si128((__m128i *) (p)))

#define TEST16_v_count(p, v) \
    do {\
        int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));\
        if (eq_mask) {\
            size_t pos = bsf(eq_mask);\
            if (pos >= count) return NULL;\
            return (void *)(p + pos);\
        }\
    } while (0)

#define TEST16_count(p) TEST16_v_count (p, _mm_load_si128((__m128i *) (p)))

void* sse_memchr(const void* ptr, int ch, size_t count)
{
    if (count == 0) return NULL;
    const char * p = (const char*)ptr;
    __m128i value = _mm_set1_epi8((char)ch);

    size_t align = (size_t) p & 0x3F;
    if (align <= 48) {
        TEST16_v_count(p, _mm_loadu_si128((__m128i *) p));
        if (count <= 16) return NULL;
        count -= 16;
        p += 16;
        align &= 0x0F;
        p -= align;
        count += align;
    } else {
        align &= 0x0F;
        p -= align;
        __m128i v = _mm_load_si128((__m128i *) p);
        unsigned eq_mask =
            (unsigned) _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        eq_mask >>= align;
        if (eq_mask) {
            size_t pos = bsf(eq_mask);
            if (pos >= count) return NULL;
            return (void *)(p + pos + align);
        }
        count += align;
        if (count <= 16) return NULL;
        p += 16;
        count -= 16;
    }
    if (count > 64) {
        TEST16(p);
        TEST16(p + 16);
        TEST16(p + 32);
        TEST16(p + 48);
        p += 64;
        count -= 64;

        if (count > 64) {
            align = (size_t) p & 0x3F;
            if (align) {
                TEST16(p);
                TEST16(p + 16);
                TEST16(p + 32);
                TEST16(p + 48);
                p += 64;
                count -= 64;
                p -= align;
                count += align;
            }

            for (; count > 64; count -= 64, p += 64) {
                __m128i v0 = _mm_load_si128((__m128i *) p);
                __m128i v1 = _mm_load_si128((__m128i *) (p + 16));
                __m128i v2 = _mm_load_si128((__m128i *) (p + 32));
                __m128i v3 = _mm_load_si128((__m128i *) (p + 48));
                __m128i cmp0 = _mm_cmpeq_epi8(v0, value);
                __m128i cmp1 = _mm_cmpeq_epi8(v1, value);
                __m128i cmp2 = _mm_cmpeq_epi8(v2, value);
                __m128i cmp3 = _mm_cmpeq_epi8(v3, value);
                __m128i max_val = _mm_max_epu8(_mm_max_epu8(cmp0, cmp1),
                                               _mm_max_epu8(cmp2, cmp3));
                int eq_mask = _mm_movemask_epi8(max_val);
                if (eq_mask == 0) continue;

                TEST16_cmp(p, cmp0);
                TEST16_cmp(p + 16, cmp1);
                TEST16_cmp(p + 32, cmp2);
                TEST16_cmp_known_nz(p + 48, cmp3); // returns from function
            }
        }
    }
    if (count > 32) {
        TEST16(p);
        TEST16(p + 16);
        count -= 32;
        p += 32;
    }

    TEST16_count(p);
    if (count > 16) {
        count -= 16;
        TEST16_count(p + 16);
    }
    return NULL;
}

So what is actually going on here?

1) We check how the pointer is aligned against a 64-byte boundary. The idea here is that, since this is a built-in function written in assembly anyway, it does not have to follow the C standard strictly. Specifically, it makes use of the fact that accessible memory never starts or ends at an address that isn’t 64-byte aligned (actually, the memory boundary is always aligned by 4096, so perhaps we could save there). So, if the alignment is 48 or less, there are definitely 16 bytes available to read, and we can read them using MOVDQU. We must handle correctly the case where the string is shorter than 16 bytes: that’s why we return NULL if the result of bsf is not less than count.

2) If the alignment is above 48, we read the 16-byte vector at alignment 48 (actually reading data before our pointer). Here we must deal with possible matches before the start (for that we shift the eq_mask right by align), and after the end (for that we compare the result of bsf() with count).

3) In both cases, unless the end of the input is detected, we end up at the next 16-byte boundary (in the first case some of the bytes immediately after that boundary have already been tested, so we are going to perform some redundant computations).

4) Then we test the next 64 bytes, one 16-byte block at a time, using aligned access.

5) If the match hasn’t been found, we read enough 16-byte blocks to get to the 64-byte alignment.

6) Then we run the loop reading 64 bytes at a time (as implemented in sse_aligned64 version).

7) Finally, we deal with the leftovers (possible 32 bytes followed by possible up to 32 bytes).

Here are the results, which leave mixed feelings. The code is faster than memchr on small sizes, but loses to it on big ones:

Searcher        4     16    64     256    1024   4096   16384
sse_aligned64   1.14  0.33  0.12   0.053  0.037  0.034  0.034
memchr          1.25  0.32  0.10   0.063  0.035  0.031  0.027
sse_memchr      1.19  0.30  0.092  0.058  0.038  0.033  0.033

It looks like re-writing the carefully written assembly code in C makes it worse. What if we simplify it?

void* sse_memchr2(const void* ptr, int ch, size_t count)
{
    if (count == 0) return NULL;

    const char * p = (const char*)ptr;

    __m128i value = _mm_set1_epi8((char)ch);
    size_t align = (size_t)p & 0xFFF;

    if (align <= 4096 - 16) {
        TEST16_v_count(p, _mm_loadu_si128((__m128i *) p));
        if (count <= 16) return NULL;
        count -= 16;
        p += 16;

        align &= 0x0F;
        p -= align;
        count += align;
    }
    else {
        align &= 0x0F;
        p -= align;
        __m128i v = _mm_load_si128((__m128i *) p);
        unsigned eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, value));
        eq_mask >>= align;
        if (eq_mask) {
            size_t pos = bsf(eq_mask);
            if (pos >= count) return NULL;
            return (void *)(p + pos + align);
        }
        count += align;
        if (count <= 16) return NULL;
        p += 16;
        count -= 16;
    }
    for (; count >= 64; count -= 64, p += 64) {

        __m128i v0 = _mm_load_si128((__m128i *) p);
        __m128i v1 = _mm_load_si128((__m128i *) (p + 16));
        __m128i v2 = _mm_load_si128((__m128i *) (p + 32));
        __m128i v3 = _mm_load_si128((__m128i *) (p + 48));

        __m128i cmp0 = _mm_cmpeq_epi8(v0, value);
        __m128i cmp1 = _mm_cmpeq_epi8(v1, value);
        __m128i cmp2 = _mm_cmpeq_epi8(v2, value);
        __m128i cmp3 = _mm_cmpeq_epi8(v3, value);

        __m128i max_val = _mm_max_epu8(_mm_max_epu8(cmp0, cmp1),
                          _mm_max_epu8(cmp2, cmp3));
        int eq_mask = _mm_movemask_epi8(max_val);
        if (eq_mask == 0) continue;

        TEST16_cmp(p, cmp0);
        TEST16_cmp(p + 16, cmp1);
        TEST16_cmp(p + 32, cmp2);
        TEST16_cmp_known_nz(p + 48, cmp3); // returns from function
    }
    if (count >= 32) {
        TEST16(p);
        TEST16(p + 16);
        count -= 32;
        p += 32;
    }
    if (count >= 16) {
        TEST16(p);
        count -= 16;
        p += 16;
    }
    if (count > 0) {
        TEST16_count(p);
    }
    return NULL;
}

Here we don’t insist that the 64-byte loop runs over 64-byte-aligned memory blocks. We have also relaxed the alignment requirement for the initial test from 64 to 4096 bytes (the page size).

Searcher      4     16    64     256    1024   4096   16384
memchr        1.25  0.32  0.10   0.063  0.035  0.031  0.027
sse_memchr    1.19  0.30  0.092  0.058  0.038  0.033  0.033
sse_memchr2   1.11  0.32  0.11   0.052  0.036  0.033  0.033

The results got slightly worse on sizes 16 and 64, but either improved, or stayed the same on all the other sizes, which makes the simplified version viable. It still, however, loses to memchr on long inputs. Why?

Loop unrolling

The code generated for sse_memchr and sse_memchr2 differs quite a bit from that of memchr. The original code is much neater. It is also shorter (although not by much). Let’s look at the code size of all the versions presented so far:

Searcher        Code size, bytes
simple_index    385
simple_ptr      385
simple_ptr2     241
memchr          836
index_scas      14
simple_dword    977
dword           956
dword_bsf       900
simple_qword    997
qword           1200
qword_bsf       848
sse             1157
sse_unlimited   599
sse_aligned     1716
sse_aligned64   2326
sse_memchr      1221
sse_memchr2     1220

Obviously, we don’t expect the shortest code to be necessarily the fastest. The index_scas searcher is the most obvious example of the opposite. Our SSE searchers, compared to the simple ones, also demonstrate that longer code is often faster. However, sse_memchr implements the same algorithm as memchr, and yet is 50% longer. This shows that hand-written code can still outperform the code produced even by the best compiler (and GCC is definitely very good).

Why is the sse_memchr code so long? A brief inspection shows that the main loop (the one that iterates over 64-byte blocks) has been unrolled by the compiler. Generally, loop unrolling is a very powerful tool, but here we have already unrolled the main loop four times by hand – is there really a point in unrolling it further and dealing with all the possible leftovers?

Let’s try to disable loop unrolling forcibly, using GCC pragmas:

#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")

// original code

#pragma GCC pop_options

We’ll do it for both sse_memchr and sse_memchr2:

Searcher               4     16    64     256    1024   4096   16384
memchr                 1.25  0.32  0.10   0.063  0.035  0.031  0.027
sse_memchr             1.19  0.30  0.092  0.058  0.038  0.033  0.033
sse_memchr_nounroll    1.18  0.29  0.091  0.056  0.033  0.029  0.024
sse_memchr2            1.11  0.32  0.11   0.052  0.036  0.033  0.033
sse_memchr2_nounroll   1.11  0.32  0.10   0.049  0.033  0.027  0.024

Finally, we’ve done it. The SSE-based version written in C is faster than the one carefully written by hand in assembly. Moreover, the simplified version (sse_memchr2) doesn’t, generally, perform worse than the original sse_memchr. So this will be our version of choice unless we go for AVX.

The sizes also look better: 775 bytes for sse_memchr_nounroll and 472 for sse_memchr2_nounroll. The C compiler made code shorter than assembly!

The AVX version of the standard library

The same approach can be applied to AVX. Here is the AVX version of the sse_memchr2 code, as the shorter one (the macro definitions are omitted here; a reconstruction is sketched after the function):

void* avx_memchr2(const void* ptr, int ch, size_t count)
{
    if (count == 0) return NULL;

    const char * p = (const char*)ptr;

    __m256i value = _mm256_set1_epi8((char)ch);
    size_t align = (size_t)p & 0xFFF;

    if (align <= 4096-32) {
        TEST32_v_count(p, _mm256_loadu_si256((__m256i *) p));
        if (count <= 32) return NULL;
        count -= 32;
        p += 32;

        align &= 0x1F;
        p -= align;
        count += align;
    }
    else {
        align &= 0x1F;
        p -= align;
        __m256i v = _mm256_load_si256((__m256i *) p);
        unsigned eq_mask =
            (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, value));
        eq_mask >>= align;
        if (eq_mask) {
            size_t pos = bsf(eq_mask);
            if (pos >= count) return NULL;
            return (void *)(p + pos + align);
        }
        count += align;
        if (count <= 32) return NULL;
        p += 32;
        count -= 32;
    }
    for (; count >= 64; count -= 64, p += 64) {

        __m256i v0 = _mm256_load_si256((__m256i *) p);
        __m256i v1 = _mm256_load_si256((__m256i *) (p + 32));

        __m256i cmp0 = _mm256_cmpeq_epi8(v0, value);
        __m256i cmp1 = _mm256_cmpeq_epi8(v1, value);

        __m256i max_val = _mm256_max_epu8(cmp0, cmp1);
        unsigned eq_mask = (unsigned)_mm256_movemask_epi8(max_val);
        if (eq_mask == 0) continue;

        TEST32_cmp(p, cmp0);
        TEST32_cmp_known_nz(p + 32, cmp1); // returns from function
    }
    if (count > 32) {
        TEST32(p);
        count -= 32;
        p += 32;
    }
    TEST32_count(p);
    return NULL;
}
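
For completeness, the omitted TEST32 macros can be reconstructed by direct analogy with the TEST16 ones shown earlier. This is a sketch of what they presumably look like; the actual definitions are in the repository:

#define TEST32_cmp_known_nz(p, cmp) \
    do {\
        unsigned eq_mask = (unsigned) _mm256_movemask_epi8(cmp);\
        return (void *)((p)+bsf(eq_mask));\
    } while (0)

#define TEST32_cmp(p, cmp) \
    do {\
        unsigned eq_mask = (unsigned) _mm256_movemask_epi8(cmp);\
        if (eq_mask) {\
            return (void *)((p)+bsf(eq_mask));\
        }\
    } while (0)

#define TEST32_v(p, v) TEST32_cmp (p, _mm256_cmpeq_epi8(v, value))
#define TEST32(p) TEST32_v (p, _mm256_load_si256((__m256i *) (p)))

#define TEST32_v_count(p, v) \
    do {\
        unsigned eq_mask = (unsigned) _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, value));\
        if (eq_mask) {\
            size_t pos = bsf(eq_mask);\
            if (pos >= count) return NULL;\
            return (void *)(p + pos);\
        }\
    } while (0)

#define TEST32_count(p) TEST32_v_count (p, _mm256_load_si256((__m256i *) (p)))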

The original avx_memchr can be seen in the repository.

And here are the times, compared to the previous results:

Searcher               4     16    64     256    1024   4096   16384
memchr                 1.25  0.32  0.10   0.063  0.035  0.031  0.027
avx_aligned64          1.15  0.26  0.12   0.041  0.024  0.018  0.017
avx_memchr             1.24  0.29  0.085  0.045  0.026  0.018  0.017
avx_memchr_nounroll    1.24  0.29  0.088  0.042  0.027  0.024  0.019
avx_memchr2            1.23  0.28  0.098  0.041  0.025  0.018  0.017
avx_memchr2_nounroll   1.23  0.28  0.092  0.038  0.026  0.023  0.019

Surprisingly, the non-unrolled versions aren’t faster than the unrolled ones, and, in general, these versions aren’t faster than avx_aligned64. I won’t investigate this further, though; we have already achieved a lot.

Summary

Here is the total summary of all the methods considered so far, excluding AVX:

Searcher               4     16    64     256    1024   4096   16384
Null                   0.63  0.14  0.035  0.009  0.002  0.000  0.000
Simple                 1.24  0.61  0.47   0.44   0.44   0.43   0.42
Simple Ptr             1.20  0.61  0.47   0.43   0.43   0.42   0.42
Simple Ptr2            1.17  0.53  0.43   0.41   0.42   0.41   0.41
SCAS                   7.12  2.26  1.19   0.93   0.86   0.85   0.85
Simple dword           0.99  0.52  0.42   0.40   0.41   0.39   0.39
Dword                  1.51  0.66  0.48   0.45   0.39   0.38   0.38
Dword BSF              1.28  0.58  0.40   0.36   0.32   0.29   0.29
Simple qword           0.93  0.46  0.42   0.41   0.42   0.41   0.41
Qword                  1.59  0.49  0.24   0.19   0.16   0.15   0.15
Qword BSF              1.23  0.42  0.19   0.14   0.13   0.13   0.13
memchr                 1.25  0.32  0.10   0.063  0.035  0.031  0.027
sse                    1.01  0.28  0.11   0.065  0.044  0.041  0.040
sse_unlimited          0.88  0.26  0.11   0.068  0.050  0.049  0.048
sse_aligned            1.00  0.27  0.13   0.062  0.043  0.039  0.036
sse_aligned64          1.14  0.33  0.12   0.053  0.037  0.034  0.034
sse_memchr             1.19  0.30  0.092  0.058  0.038  0.033  0.033
sse_memchr2            1.11  0.32  0.11   0.052  0.036  0.033  0.033
sse_memchr_nounroll    1.18  0.29  0.091  0.056  0.033  0.029  0.024
sse_memchr2_nounroll   1.11  0.32  0.10   0.049  0.033  0.027  0.024


And here are the AVX ones:

Searcher               4     16    64     256    1024   4096   16384
avx                    1.18  0.27  0.10   0.045  0.028  0.024  0.023
avx_aligned            1.14  0.26  0.12   0.049  0.028  0.020  0.019
avx_aligned64          1.15  0.26  0.12   0.041  0.024  0.018  0.017
avx_memchr             1.24  0.29  0.085  0.045  0.026  0.018  0.017
avx_memchr_nounroll    1.24  0.29  0.088  0.042  0.027  0.024  0.019
avx_memchr2            1.23  0.28  0.098  0.041  0.025  0.018  0.017
avx_memchr2_nounroll   1.23  0.28  0.092  0.038  0.026  0.023  0.019

With the exception of size 4, the AVX versions are always faster than the SSE ones, although they are never twice as fast as one might expect.

In absolute terms, the results don’t look bad: the achieved speed was about 41 bytes per nanosecond (17 bytes per cycle) for SSE, and a bit higher for AVX (59 and 24 bytes).

When we were searching for byte strings in Java (in “Searching for a better indexOf”), the typical times per byte varied depending on the nature of the data and the length of the pattern, and for small patterns (100 bytes) were about 0.04 – 0.10 ns per byte, which makes a search based on the first character a viable option. Obviously, searching for a string also requires string comparison, but it is worth trying anyway.

Results on Windows

For the sake of completeness, let’s run the same test on Windows. Unfortunately, two variables change at once. First, the compiler (Microsoft Visual Studio 2017, MSVC 19.10.25019). Second, the hardware: a notebook with an i7-6820HQ @ 2.7 GHz (Windows 10).

Here are the abridged results:

Searcher        4     16    64     256    1024   4096   16384
null_index      0.49  0.11  0.027  0.007  0.002  0.001  0.000
simple_index    1.27  0.60  0.43   0.46   0.40   0.38   0.38
simple_ptr      1.24  0.84  0.72   0.72   0.70   0.69   0.69
dword_bsf       0.75  0.36  0.28   0.26   0.23   0.22   0.22
qword_bsf       0.78  0.30  0.13   0.13   0.11   0.10   0.10
memchr          0.71  0.16  0.072  0.051  0.045  0.045  0.043
sse             0.68  0.18  0.083  0.045  0.042  0.041  0.039
sse_aligned64   0.81  0.22  0.072  0.032  0.022  0.022  0.022
sse_memchr2     0.68  0.21  0.075  0.032  0.024  0.022  0.022
avx             0.78  0.15  0.051  0.028  0.030  0.022  0.020
avx_aligned64   0.71  0.16  0.069  0.025  0.013  0.011  0.012
avx_memchr2     0.68  0.16  0.063  0.030  0.016  0.012  0.011

The main observations are:

Conclusions

Things to do

Now it’s time to look at string searching in C/C++.

Comments are welcome below or at reddit.
