Today we’ll be searching for strings inside other strings. We all know that this topic has been studied comprehensively, and virtually all possible algorithms have already been developed and published. We’ll deliberately not look there. Our objective is not to google the best algorithm but to see what ideas naturally come to a programmer, implement them in Java, and measure performance.
The journey in front of us is rather long. We’re going to look at twenty-two different solutions.
For the impatient and the scared I suggest jumping straight to the results. The summary of all results is here. The other results of interest are Big random patterns and Big random patterns, huge data. One very good matcher suggested by a reader is here.
In my Java coding practice I once had to search for certain byte sequences inside bigger byte arrays. If these were character strings, I would have used
String.indexOf straight away, and there would have been no topic for this study. I had to implement
indexOf for byte arrays, and, as far as I know, there is no library function for that.
There are three factors that affect search algorithms.
The first one is the length of a pattern. There are fast ways to search for a single byte, and there are (as we’ll see later) fast ways to search for a megabyte-long pattern. Different algorithms prefer different pattern lengths. The patterns I had to search were 10 – 20 bytes long.
The second one is the nature of data. On one end of the spectrum lies searching for random byte
sequences inside other random byte sequences. On the other end we have regular repeating patterns, such as
aaaaaacaaaaaabaaaa. Somewhere in between lies the case where there are some patterns
of much less horrible structure, such as in natural language text. In my case, the pattern was such a text, while
the array contained both natural text and binary data. Natural text search is a good enough approximation of this problem.
The third factor is possibility of pre-processing the data. This makes three types of search problems:
1) searching for the same pattern in multiple strings (pattern pre-compilation)
2) searching for multiple patterns in the same string (indexing)
3) searching for a single ad-hoc pattern in a single ad-hoc string (both indexing and pre-compilation can work if they are fast enough, but pre-compilation seems to have a better chance).
We’ll concentrate on the first class, although the simplest solutions cover all the classes.
Here are some naming and formatting conventions used in this article:
- The string to search in we’ll call the text;
- The string to search for we’ll call the pattern;
- Our algorithms will compare the pattern with various portions of the text; the substring we’re currently busy comparing with we’ll call the current portion;
- The terms prefix and suffix will have their usual meaning: a possibly empty substring at the beginning or at the end of a given string; the string itself is a prefix and a suffix for itself;
- Prefixes and suffixes that are shorter than the string are called proper prefixes and proper suffixes;
- The longest suffix of the current portion that matches the pattern will be called the good suffix;
- The byte immediately to the left of the good suffix, if it exists, will be called the bad byte (or the bad character when describing character search).
The snapshots of search algorithms running will be shown in the following format:
Here the top line contains the current portion and some of its neighbourhood, while the second one shows our pattern, placed right under the current portion. We’ll avoid patterns starting or ending with spaces for these demonstrations.
As a text we need something with a non-trivial distribution
of characters, repeating patterns and good chances to find short strings in more than one place.
Some piece of poetry will do, so we’ll use The Tragedy of Hamlet, Prince of Denmark (the text is in the root directory of the repository).
We’ll convert it into lowercase, remove punctuation
and replace line breaks with spaces. Our text is 150128 characters long.
We’ll use one arbitrarily chosen verse as the base of our patterns:
We’ll use substrings of this pattern base of various lengths as our patterns.
Since there are very good ways to search for short patterns (such as one or two bytes), and complex algorithms are unlikely to perform well there, we’ll limit the study to substrings of length at least four.
Our pattern base is 106 bytes long. To save space, we’ll only publish the exponential results (for patterns of lengths of 4, 8, 16, 32, 64 and 106). For each of those we’ll use all possible substrings of this length and average the results. This makes the full 106-long pattern special, so the results may be a bit odd for that one. To get averaged results for long patterns, we’ll add length 96 as well.
The test will count how many times each pattern occurs in the entire text (the count must be at least one, since every pattern is itself taken from the text). We’ll measure the time spent, in nanoseconds, per byte of the text. Since the text is short and is iterated more than once, caching effects are ignored for now. We will, however, try very long data sets later.
Our matchers will extend the abstract class
with the following method:
Each class will be accompanied by a corresponding factory class – some implementations will use different classes for different patterns. The classes will contain debug code, which collects appropriate statistics; I’ll omit this code in included code snippets.
The main class of the test is
We’ll run our tests on Java 1.8.142 on Xeon® CPU E5-2620 v3 @ 2.40GHz, using Linux. The full source code is available in the repository. Now we’re ready to go.
A Simple Matcher
Repository reference: SimpleMatcher.java.
This is the simplest of all implementations. We’ll use it as the reference point: everything else will be tested against this.
compare() is defined in the base class:
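The repository contains the real code; to make the discussion concrete, here is a minimal sketch of what such a matcher can look like. The class and method names follow the article’s conventions, but the exact signatures are my assumption, not the repository code:

```java
// A sketch of the simplest matcher: try every position of the text
// and compare the pattern byte by byte (assumed interface).
class SimpleMatcher {
    private final byte[] pattern;

    SimpleMatcher(byte[] pattern) {
        this.pattern = pattern;
    }

    // compare the pattern against the portion of the text starting at offset
    static boolean compare(byte[] text, int offset, byte[] pattern) {
        for (int i = 0; i < pattern.length; i++) {
            if (text[offset + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        for (int pos = fromIndex; pos <= text.length - pattern.length; pos++) {
            if (compare(text, pos, pattern)) return pos;
        }
        return -1;
    }
}
```

Searching repeatedly with increasing fromIndex is how the test counts all occurrences of a pattern.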
Here are the times for our exponential set (nanoseconds per byte of the text):
We promised some statistics. For this matcher, we’ll calculate the average inequality index, that is, the average value of the loop index i at the exit of compare():
The reason this index is so much smaller in the case of length 106 is that in this case the first character of the pattern is fixed (d), and this letter occurs rather infrequently (3.22%), while smaller pattern lengths involve a variety of first characters.
This index is bigger than 1/27 = 0.037, the mean predicted value for an alphabet of 27 characters. The reason is the non-uniform distribution of characters, both in the text and in the pattern.
First Byte Matcher
Repository reference: FirstByteMatcher.java.
Since 90% of compare failures happen at the first byte, it looks attractive to treat this byte specially:
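A sketch of the idea, under the same kind of assumed indexOf interface as the simple matcher (the repository version may differ in details):

```java
// First-byte matcher sketch: the first byte of the pattern is cached in
// a field and tested inline, so compare() runs only when it matches.
class FirstByteMatcher {
    private final byte[] pattern;
    private final byte first;

    FirstByteMatcher(byte[] pattern) {
        this.pattern = pattern;
        this.first = pattern[0];
    }

    // compare the rest of the pattern (the first byte is already known to match)
    private boolean compareRest(byte[] text, int pos) {
        for (int i = 1; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        for (int pos = fromIndex; pos <= text.length - pattern.length; pos++) {
            if (text[pos] == first && compareRest(text, pos)) return pos;
        }
        return -1;
    }
}
```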
This is indeed faster than the simple matcher:
We even broke the one-nanosecond barrier, but we already know why this happened: the first character of the pattern is a rare letter (d).
The statistic we collect this time is the rate at which compare() is invoked, or, in other words, the frequency of the first byte matching:
First Bytes Matcher
Repository reference: FirstBytesMatcher.java.
The success of special treatment of the first byte makes us think of extending this approach to multiple first bytes. This would definitely help if we wrote in C, where reading (at least, on Intel architecture) a multi-byte word is as cheap as reading one byte. In Java, we aren’t so sure, so it needs testing. We’ll use the first four bytes:
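A sketch of this variant, with the first four pattern bytes kept in fields and tested inline (the field names a–d are my own; patterns are assumed to be at least four bytes long, as in the article’s test set):

```java
// First-bytes matcher sketch: four cached bytes checked inline,
// compare() handles the remainder of the pattern.
class FirstBytesMatcher {
    private final byte[] pattern;
    private final byte a, b, c, d;

    FirstBytesMatcher(byte[] pattern) {
        this.pattern = pattern;
        a = pattern[0]; b = pattern[1]; c = pattern[2]; d = pattern[3];
    }

    private boolean compareRest(byte[] text, int pos) {
        for (int i = 4; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        for (int pos = fromIndex; pos <= text.length - pattern.length; pos++) {
            if (text[pos] == a && text[pos + 1] == b
                && text[pos + 2] == c && text[pos + 3] == d
                && compareRest(text, pos)) return pos;
        }
        return -1;
    }
}
```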
Here are the results:
There is some improvement, although it isn’t very big. We didn’t expect much anyway, because the change only addresses 8% of all cases.
We collect the compare() rate again, and it is much lower than before:
First Bytes Matcher improved
Repository reference: FirstBytesMatcher2.java.
We can save a bit on reading buf[i+2] and buf[i+3] in the following way:
Generally, I don’t like such manual improvements, because I believe that a good compiler should be clever enough to perform them for us. However, it did help a bit:
Compile Matcher

Repository reference: CompileMatcher.java.
Let’s try a different approach and make use of the fact that we’re writing in Java. What if we create a new class for every pattern we search for, compile it, and load it? All of that must happen in the factory class. It takes time, but this time doesn’t affect our test results. There are three ways to generate a class:
- create the byte code directly;
- generate Java text and invoke the compiler from
tools.jar; that will require knowing the path to that jar;
- generate Java text and invoke javac, assuming it’s on the PATH.
We’ll go the last route as it’s the simplest one; the generator lives in the factory class. Here is an example of the generated code for some small pattern length:
Results are, however, not better than before, so we can discard this method:
This is actually quite strange. Perhaps it’s a topic for another study.
Hash Matcher

Repository reference: HashMatcher.java.
For the first time we’ll try not to improve little things here and there, but rather use some sort of non-trivial algorithm. We’ll start with a hash matcher.
The idea is to apply some function to both the pattern and the current portion to obtain their hash value. This function must be additive, i.e. allow for quick addition and removal of bytes, so that we could slide the current portion along the text. We’ll only perform full comparison when the hash values match. To make things simple, we’ll use addition as a hash function:
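A sketch of this idea with the plain sum-of-bytes hash: adding the incoming byte and subtracting the outgoing one slides the window in constant time. Names are assumed; the repository code may be structured differently:

```java
// Hash matcher sketch: a rolling additive hash (plain byte sum).
// Full comparison runs only when the hashes coincide.
class HashMatcher {
    private final byte[] pattern;
    private final int patternHash;

    HashMatcher(byte[] pattern) {
        this.pattern = pattern;
        int h = 0;
        for (byte b : pattern) h += b;
        patternHash = h;
    }

    private boolean compare(byte[] text, int pos) {
        for (int i = 0; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        int len = pattern.length;
        if (text.length - fromIndex < len) return -1;
        int h = 0;
        for (int i = fromIndex; i < fromIndex + len; i++) h += text[i];
        for (int pos = fromIndex; ; pos++) {
            if (h == patternHash && compare(text, pos)) return pos;
            if (pos + len >= text.length) return -1;
            h += text[pos + len] - text[pos]; // slide: add new byte, drop old one
        }
    }
}
```

Two hash updates and one comparison per position: exactly the per-iteration cost mentioned below.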
Here are the times:
The rather poor performance of this algorithm is surprising. Let’s look at the compare() rate statistics:
The compare rate isn’t very low: it is much higher than for FirstBytesMatcher, and some operations (two hash updates and one comparison) happen on every iteration. A better algorithm is required, one that helps reduce the number of iterations.
Last Byte Matcher
Repository reference: LastByteMatcher.java.
The idea of the last byte matcher is that when our pattern does not match the current portion, in most cases we can do better than shift the pattern by one. We can look at the last byte of the current portion, find it in the pattern and shift accordingly. This is how it works:
Let’s assume we are searching for the first line of our pattern:
doubt thou the stars are fire,
and let’s try matching it against an arbitrary portion of our text:
The strings clearly don’t match, and the last character of the pattern (e) falls against an h. Instead of shifting the pattern by 1, which would put this h against an r, we can search our pattern right to left and find the rightmost h in it, which is the one in the, 16 positions from the end. We can then shift our pattern by 16 to put the two characters in line:
Completely coincidentally, the last e fell on h once more, so we can shift by 16 again:
Here we got very lucky: e falls on n, which does not occur in our pattern. We can shift by the entire pattern length (29):
Note that shift by 1 can still occur, such as in this case:
What if the last byte matches, but not the rest? In this case we must find the second rightmost occurrence of the last character (here, the e in are) and shift there (by 5):
So here is the plan: for every possible byte we record how far from the end of the pattern it occurs, ignoring the final position for the pattern’s own last byte. For the bytes that never occur, we record the pattern length. Then we take the last byte of the current portion, look up the recorded value for that byte, and shift our pattern by that number. This is how we calculate the shifts:
And this is how we search:
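The real code is in the repository; here is a self-contained sketch of both steps (shift-table construction and the search loop), with assumed names:

```java
// Last-byte matcher sketch. shifts[] holds, for every byte value, the
// distance from the end of the pattern to that byte's rightmost
// occurrence (the final position of the pattern's last byte is ignored);
// bytes that never occur get the full pattern length.
class LastByteMatcher {
    private final byte[] pattern;
    private final int[] shifts = new int[256];

    LastByteMatcher(byte[] pattern) {
        this.pattern = pattern;
        int len = pattern.length;
        for (int i = 0; i < 256; i++) shifts[i] = len;
        for (int i = 0; i < len - 1; i++) {
            shifts[pattern[i] & 0xFF] = len - 1 - i;
        }
    }

    private boolean compare(byte[] text, int pos) {
        for (int i = 0; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        int len = pattern.length;
        byte last = pattern[len - 1];
        for (int pos = fromIndex; pos <= text.length - len; ) {
            byte b = text[pos + len - 1];
            if (b == last && compare(text, pos)) return pos;
            pos += shifts[b & 0xFF];   // always at least 1
        }
        return -1;
    }
}
```

Every table entry is at least 1, so the search always makes progress; when the last byte doesn’t occur in the pattern at all, the whole pattern length is skipped at once.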
Except for very short patterns, it really runs faster than everything we tried so far:
The one-nanosecond barrier has now been broken convincingly; this time it doesn’t depend on a lucky choice of the pattern’s characters.
The statistics of interest for this matcher are the average shift and the compare rate:
The compare rate is pretty much the same as for the first byte matcher (except for the last pattern, probably because it ends with a very frequent letter):
The average shift is quite a bit bigger than one (that’s why this matcher is so fast), but it still disappoints: sometimes it is as much as four times smaller than the pattern length. This is to be expected, though, since our alphabet is so small. The chances for a given character to be found inside a long pattern are quite high. If the text and the pattern were random byte sequences, we could be much better off with this matcher (or, rather, the same problem would require longer patterns to occur), but for our string example we need to increase the shift.
For medium-sized patterns (such as of length 16), however, the shift value is satisfactory, the speed is good, and the simplicity of this matcher makes it very attractive. I ended up using this one in my production code.
Multi-byte Matcher

Repository reference: MultiByteMatcher.java.
One way to increase the average shift value is to work with more than one byte.
Let’s look at the example we used when introducing the LastByteMatcher. We looked at the last character (h) and found it 16 positions from the end of the pattern, so we could shift the pattern by this number. The next character in the current portion doesn’t work so well, since it occurs in stars and only provides the shift value of 11. The next one (space) is even worse, as it occurs just two positions to the left. The next one (s) is no better: it gives the shift value of 6. Finally, i becomes the champion: it never occurs in the pattern before this point, so the shift is 24. There is no point looking further: the shift can’t get any better.
The last byte has the potential to produce the biggest shift (the shift can never be bigger than the character’s position from the left), that’s why if we can only use one byte, the last one is the best choice. What if we use more than one? We’ll start from the right-hand side and go left while there is a chance to improve the shift.
One downside of this approach is that instead of one array of size 256 we’ll have to keep a whole collection of those – one for each position in the pattern:
This is what matching looks like:
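A sketch of this matcher follows. Each entry of each per-position table is a lower bound on the distance to the next possible match, so taking the maximum over the inspected positions is safe; we stop as soon as the remaining positions cannot improve the shift (position r from the right can contribute at most len − r). Names are assumed; the repository code is organised differently:

```java
// Multi-byte matcher sketch: one 256-entry shift table per position
// from the right end of the current portion.
class MultiByteMatcher {
    private final byte[] pattern;
    private final int[][] shifts;

    MultiByteMatcher(byte[] pattern) {
        this.pattern = pattern;
        int len = pattern.length;
        shifts = new int[len][256];
        for (int r = 0; r < len; r++) {
            for (int b = 0; b < 256; b++) shifts[r][b] = len - r;
            // record the rightmost occurrence of each byte strictly to the
            // left of pattern position len - 1 - r
            for (int i = 0; i <= len - 2 - r; i++) {
                shifts[r][pattern[i] & 0xFF] = len - 1 - r - i;
            }
        }
    }

    private boolean compare(byte[] text, int pos) {
        for (int i = 0; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    int indexOf(byte[] text, int fromIndex) {
        int len = pattern.length;
        byte last = pattern[len - 1];
        for (int pos = fromIndex; pos <= text.length - len; ) {
            if (text[pos + len - 1] == last && compare(text, pos)) return pos;
            int shift = 1;
            // go left while there is still a chance to improve the shift
            for (int r = 0; r < len && shift < len - r; r++) {
                shift = Math.max(shift, shifts[r][text[pos + len - 1 - r] & 0xFF]);
            }
            pos += shift;
        }
        return -1;
    }
}
```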
We collect two statistics for this matcher: the average shift and the average number of iterations this shift was obtained at:
The shift looks very good (very close to the pattern length), but the iteration count is quite high, which can undermine the entire effort.
This seems indeed to be happening: the results aren’t very good.
Limited Multi-byte matcher
Repository reference: LimitedMultiByteMatcher.java.
We can limit the number of iterations in MultiByteMatcher to some small value (two or three). This will reduce both the average shift and the operation cost, and the overall performance may improve. We just copy the code for MultiByteMatcher and introduce a constructor parameter n in the factory and the matcher. The main loop will run to pattern_len - n instead of the full pattern length.
First, some statistics. Let’s start with the average shift:
As expected, the shift gradually increases with the maximal iteration count.
Here are the actual iteration counts used:
Now let’s see if smaller iteration counts compensate for smaller shifts:
It doesn’t. The limited multi-byte matcher is still slower than the ordinary last-byte matcher.
Unrolled Limited Multi-byte matcher
Repository reference: UnrolledLimitedMultiByteMatcher.java.
We don’t give up so easily. When the number of iterations of the limited multi-byte version is known and small, we can unroll the loop and, possibly, increase performance.
This is what the unrolled routine looks like for an iteration count of three:
Here are the times:
Unrolling improves times, but not enough to outperform the LastByteMatcher, except on our pattern of length 106, which is non-representative by its nature.
Search String Suffixes
Until now we looked for occurrences of single bytes in our pattern. When more than one byte was used, they were searched independently of each other. It may be more productive to search for groups of bytes, e.g. for entire suffixes of the current portion.
Let’s look at some example: we’ll search for the second line of our pattern base (which is 28 characters long) close to its presence in the text:
If we record positions of single characters only (as in LastByteMatcher), we get the shift value of 2, as this is how far the last o is found from the end of the pattern. If we work with suffixes of length 2, we’ll search for do and find it 7 characters from the end, which is better but still far from perfect. Three characters (" do") take us to the same place (the shift value of 7). Finally, four characters (e do) are not found in the pattern at all. We can shift by 25, so that the next current portion starts here:
We can do better than this if, for all suffix lengths, we record whether these suffixes occur at the beginning of the pattern. We know the pattern does not start with " do", so we can increase the shift to 26. Since the pattern starts with do, we can’t increase it more, and we can see that 26 is indeed the right value: it takes us straight to the match:
What if no suffix is found at the beginning of the pattern? Then we can shift by the entire pattern length, as in the example from the LastByteMatcher section: the suffixes th and " th" give the same shift (16), and neither of them is found at the beginning. The suffix of length 4 (s th) isn’t found at all, which allows us to shift the pattern by its entire length of 29.
What if more than one suffix is found at the beginning of the pattern? We must be conservative and use the longest one, which will result in the smallest shift. This case seems impossible at first, but it is in fact quite real: here, three suffixes are found at the beginning, the longest (t that t) being eight characters long. This means that if we get as far as the suffix ut that t, which does not occur in the pattern, we must shift by the pattern length minus 8, which is 8:
In short, here are the simple rules we can follow (assuming P is the pattern length):
- if some current portion suffix is found in the pattern at the smallest distance of N from the right end, use N as the shift value;
- if some current portion suffix of size L is not found in the pattern at all, and we have no information on the occurrence in the beginning, use P − L + 1 as the shift value;
- if some current portion suffix is not found in the pattern at all, and L is the length of the longest suffix of this suffix that occurs in the beginning (zero if none), use P − L as the shift.
Last Byte Suffix Matcher
Repository reference: LastByteSuffixMatcher.java.
Let’s start with a simplistic implementation of this idea, which only deals with the suffixes of the pattern.
Let’s go back to the
LastByteMatcher. There we used the last byte of the current portion
to calculate the shift, no matter if it matched the last byte of the pattern or not. Let’s modify
this a bit. If the last byte doesn’t match, we do exactly as before.
If it does match, we look how many more bytes match (find the good suffix), and apply the
entire procedure described above to that suffix. The good part of it is that the suffix is also a
suffix of the pattern and is therefore entirely defined by its length. We can pre-calculate whatever
we want for these suffixes. Here is what we do to initialise the matcher:
And here is the matcher:
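For reference, here is a compact self-contained sketch of the whole matcher. The initialisation below precomputes the suffix shifts by brute force following the rules above (the weak good-suffix rule); the repository version does this differently, and all names here are my own:

```java
// Last-byte suffix matcher sketch. On a last-byte mismatch we use the
// LastByteMatcher shift; on a match we find the good suffix and use a
// shift precomputed for its length.
class LastByteSuffixMatcher {
    private final byte[] pattern;
    private final int[] byteShifts = new int[256];
    private final int[] suffixShifts;   // indexed by good-suffix length

    LastByteSuffixMatcher(byte[] pattern) {
        this.pattern = pattern;
        int len = pattern.length;
        for (int i = 0; i < 256; i++) byteShifts[i] = len;
        for (int i = 0; i < len - 1; i++) byteShifts[pattern[i] & 0xFF] = len - 1 - i;
        suffixShifts = new int[len];
        for (int g = 1; g < len; g++) suffixShifts[g] = shiftFor(g);
    }

    // does the suffix of length g also occur ending at position e?
    private boolean occursAt(int g, int e) {
        for (int i = 0; i < g; i++) {
            if (pattern[e - g + 1 + i] != pattern[pattern.length - g + i]) return false;
        }
        return true;
    }

    private int shiftFor(int g) {
        int len = pattern.length;
        // rule 1: the rightmost other occurrence of the suffix
        for (int e = len - 2; e >= g - 1; e--) {
            if (occursAt(g, e)) return len - 1 - e;
        }
        // rule 3: the longest suffix of the suffix that is also a prefix
        for (int l = g - 1; l > 0; l--) {
            if (occursAt(l, l - 1)) return len - l;
        }
        return len;
    }

    int indexOf(byte[] text, int fromIndex) {
        int len = pattern.length;
        for (int pos = fromIndex; pos <= text.length - len; ) {
            int g = 0;   // length of the good suffix
            while (g < len && text[pos + len - 1 - g] == pattern[len - 1 - g]) g++;
            if (g == len) return pos;
            pos += g == 0 ? byteShifts[text[pos + len - 1] & 0xFF] : suffixShifts[g];
        }
        return -1;
    }
}
```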
Here is an example of operation of this matcher:
Here the pattern length is 16 and the good suffix is of size 3 (ove); it never occurs anywhere else in the pattern. However, this doesn’t mean we can shift by 16: a shorter suffix (ve) matches the beginning of the pattern, so the correct shift is 14:
The times, however, aren’t great:
It is easy to see why the speed hasn’t improved. Remember, in the LastByteMatcher the compare rate was 9 – 10%: the fraction of positions where the last byte matched. This is the only case where the new matcher tries to make a difference. Even if we produced very high shift values in these 10% of cases, we wouldn’t have made a big difference; but we didn’t even achieve that: when the good suffix length is exactly one, we’re back in LastByteMatcher territory, where shift values aren’t very big, as any single character is quite likely to be found in a long enough pattern.
To check this, let’s look at the statistics: the average shift in general, the average shift for suffixes of length one, and the average shift for suffixes of length greater than one (together with the same statistic for the LastByteMatcher):
|Pattern length|4|8|16|32|64|96|106|
|---|---|---|---|---|---|---|---|
|Avg shift, LastByte|3.5|6.0|9.6|13.9|19.2|22.2|24.8|
|Avg shift, size = 1|3.8|5.7|7.9|9.7|10.3|10.7|14.0|
|Avg shift, size > 1|3.9|7.5|13.7|22.9|37.2|43.4|29.9|
We see that increasing the suffix size does increase the shift, but not enough; besides, the situation when it happens is rather rare.
Next Byte Suffix Matcher
Repository reference: NextByteSuffixMatcher.java.
There are ways to improve these solutions. Let’s look at one of them.
We can combine the LastByteSuffixMatcher with the MultiByteMatcher. For that, we’ll look both at the good suffix (just like in the previous case) and at several of the bytes that follow, choosing the bigger of the two shifts. The simplest option is to look at exactly one such byte, the bad byte. For that, we need a full collection of 256-element shift arrays, one for each position where the bad byte can occur.
I’ll skip the initialisation code. The matcher code looks like this:
The shift count improved a bit:
|Pattern length|4|8|16|32|64|96|106|
|---|---|---|---|---|---|---|---|
|Avg shift, LastByteSuffix|3.5|6.0|9.6|13.9|19.2|22.2|24.8|
|Avg shift, NextByteSuffix|3.5|6.1|10.0|14.8|20.9|24.3|27.2|
This wasn’t, however, enough to achieve better performance. It stayed the same or even dropped a bit:
Possible improvement: Next Suffix Matcher
Another possible way to improve this solution is to extend the set of indexed suffixes. Currently, we only work with the direct suffixes of the pattern. We could extend this to strings of the form “a suffix of the pattern plus one arbitrary character”. This would require allocating an array of size 256 for every position in the pattern (just like what we did in the MultiByteMatcher). The shifts are quite likely to improve a lot. A big overall performance improvement is, however, highly unlikely, because these big shifts would only be produced in the 10% of cases where the last bytes match. The shifts in the other 90% of cases would be exactly the same as for LastByteMatcher. That’s why we’ll skip this solution.
If we want to get good shifts, we must really go wild and index more suffixes. What if we index everything?
Suffix Matcher

Repository reference: SuffixMatcher.java.
The procedure of searching suffixes described above was rather complex; this was due to limitations on the kind of strings we indexed, both in their length and their nature.
If we can afford to index suffixes of any length, things suddenly become much simpler. There is no such thing anymore as “a suffix matches some string in the middle of the pattern”. If it does, expand it by one byte. Eventually, we’ll either find a mismatch, or reach the beginning of the pattern.
The description of our algorithm becomes very easy and straightforward:
- let P be the length of the pattern
- let N be the length of the longest suffix of the current portion that is also a prefix of the pattern (zero if none)
- if P = N, we’ve found a match
- otherwise, shift the current portion right by P − N.
If we look at the previously considered example: there are three suffixes of the current portion that are also prefixes of the pattern, of lengths 1, 3 and 8 (the longest being t that t). Since the pattern length is 16, we must shift by 8:
We don’t need to index all the substrings of our pattern; we only need the prefixes, and there are only P of them. We could use a hash table for that, but I feel that a search tree is more appropriate. This is what we do:
This is much simpler than the initialisation of the LastByteSuffixMatcher: we don’t need to search for anything, we just index what we have. We make use of a relatively small branching factor (256).
Here are the conventions:

- a node represents a string (some substring of the pattern); the root node represents the empty string;
- if node.boundary is true, this string is a prefix of the pattern;
- if node.nodes == null, this node represents a maximal string (a string that can’t be expanded left to another substring of the pattern);
- if b is a byte and node.nodes[b] == null, this node’s string, expanded left by this byte, does not occur in the pattern.

Note that node.nodes == null can only happen if the node.boundary flag is set. The opposite is not true: a substring that is a prefix of the pattern can sometimes be expanded left to another prefix of the pattern. We’ve seen examples of this before.
This is what the search procedure now looks like:
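Since the repository code is not reproduced here, the following self-contained sketch combines both tree building and search as described. The tree indexes every prefix of the pattern, stored backwards, so it can be traversed while reading the text right to left from the end of the current portion; node and field names follow the conventions above, but the code itself is my reconstruction:

```java
// Suffix matcher sketch: index all pattern prefixes in a 256-way tree
// (bytes inserted right to left), then walk the text backwards from the
// end of the current portion, remembering the deepest boundary node.
class SuffixMatcher {
    static class Node {
        boolean boundary;      // the string ending here is a pattern prefix
        Node[] nodes;          // null for a maximal string

        Node add(byte b) {
            if (nodes == null) nodes = new Node[256];
            if (nodes[b & 0xFF] == null) nodes[b & 0xFF] = new Node();
            return nodes[b & 0xFF];
        }

        Node get(byte b) {
            return nodes == null ? null : nodes[b & 0xFF];
        }
    }

    private final int len;
    private final Node root = new Node();

    SuffixMatcher(byte[] pattern) {
        len = pattern.length;
        for (int k = 1; k <= len; k++) {       // index every prefix,
            Node node = root;
            for (int i = k - 1; i >= 0; i--) { // inserting its bytes right to left
                node = node.add(pattern[i]);
            }
            node.boundary = true;
        }
    }

    int indexOf(byte[] text, int fromIndex) {
        for (int pos = fromIndex; pos <= text.length - len; ) {
            int shift = len;                   // unsafe initial value, decreased below
            Node node = root;
            for (int i = pos + len - 1; i >= 0; i--) {
                node = node.get(text[i]);
                if (node == null) break;
                if (node.boundary) {
                    int depth = pos + len - i; // a suffix of this length is a prefix
                    if (depth == len) return pos;
                    shift = len - depth;
                }
            }
            pos += shift;
        }
        return -1;
    }
}
```

Note that there is no explicit comparison with the pattern anywhere: a match is detected purely by reaching the boundary node at full depth.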
Both initialisation and search procedures are very simple and easy to read, much simpler, for instance, than those of the LastByteSuffixMatcher.
Here are some comments on this code:
- the iterations traverse the search tree instead of comparing the current portion to the pattern; the iteration count is not affected by equality or inequality of the encountered bytes to those of the pattern;
- there is no direct comparison of the pattern to the current portion at all; a match is found when the tree is fully traversed. This means that the pattern variable is unnecessary; it is kept for debug purposes only;
- we start with an unsafe value of the shift (len) and decrease it as we proceed, when we encounter nodes marked boundary. This means that we have to go as deep into the tree as possible and are not allowed to abort iterations prematurely;
- all previous versions could be modified to use fewer than eight bits of each byte to access the arrays, so the arrays could be allocated smaller. This would decrease the average shift but could be handy when the units of text were not bytes but, say, Unicode characters. This version is different: since the tree traversal replaces string comparison, the tree must be exact. An explicit comparison of the pattern to the current portion would be required if smaller arrays were used.
Here are the stats:
Both statistics look very good. We manage to shift the pattern by very close to its full length, and to establish this shift in under three iterations.
How is the speed?
The suffix matcher is slower on short patterns and faster on long ones. The speed achieved for patterns of length 106 is very good: it is 24 times higher than that of the SimpleMatcher and 9 times higher than that of the FirstByteMatcher. The nice thing here is that this speed was achieved by a better algorithm alone, with no intricate optimisations or dirty tricks.
The last result was quite good. Can we possibly improve on it by employing the technique of generating Java code again? After all, the entire search tree is known when we create a matcher, so why not hard-code the entire tree traversal in Java?
I’ll be brief here. We tried three approaches.

The first one: generate a method for every tree node, each calling other methods as it goes deeper. We rely on the JVM to optimise these methods and perform all the necessary inlining. Here is an example of one of the methods (all the methods return the calculated shift):

The second one: put the entire search into one function, with the corresponding depth of nesting. This slows down at length 64 and fails completely at 106: the code of the method becomes too big for the Java compiler.
The third one: we’ve seen that the average iteration count is 2.7 for the longest pattern (length 106). It seems attractive to perform the first iterations in generated Java code and the rest on the tree, as before. Here is an example of the generated code:
Here are the results:
The results are not so good. The original version is the best (and, as we remember, on small patterns LastByteMatcher is even better).
Saving memory: SmallSuffixMatcher
Repository reference: SmallSuffixMatcher.java.
SuffixMatcher, being fast, has one big disadvantage: it allocates a 256-element array of objects
(one kilobyte on JVM with compressed pointers) for each internal tree node, and the upper limit
for the number of these nodes is
N = P (P − 1) / 2 + 1
where P is the pattern length. The actual values are a bit lower, but not by much: the value for our 106-byte pattern is 5353, while the limit is 5365.
This can become a problem for longer patterns, so it is attractive to reduce this number, even at the cost of some speed degradation. It may, however, even help improve the speed for very long patterns due to better caching.
Let’s implement tree nodes as different classes depending on their children number. There is a big variety of options we can take: use vectors with linear search, vectors with binary search, small hash tables and so on. For now, let’s implement a very simple solution where we define three classes of nodes:
- nodes with zero children (corresponding to the old nodes with node.nodes == null);
- nodes with exactly one child (with the obvious search procedure)
- nodes with more than one child (with a 256-long array, just like before).
We can build these nodes directly, but it felt easier to build the old-style nodes first (just using maps instead of arrays) and then convert them into the proper nodes. This is a matter of taste; practically applied solutions can do it differently.
So we rename our previous node class and use it for the initial build. ByteMap is a home-made map from bytes to nodes (in the repository). A normal Java map could be used here; I just don’t like excessive boxing.
Node class and its subclasses look like this:
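A hypothetical skeleton of these classes, following the three kinds of nodes listed above (the repository versions carry more machinery, e.g. for building from the generic tree):

```java
// Specialised tree nodes sketch: the shape of the class hierarchy,
// with a virtual get() dispatched per node kind.
abstract class Node {
    boolean boundary;              // this string is a prefix of the pattern
    abstract Node get(byte b);
}

class EmptyNode extends Node {     // zero children: a maximal string
    Node get(byte b) { return null; }
}

class SingleNode extends Node {    // exactly one child
    private final byte b;
    private final Node child;
    SingleNode(byte b, Node child) { this.b = b; this.child = child; }
    Node get(byte b) { return b == this.b ? child : null; }
}

class ArrayNode extends Node {     // full 256-way branching, as before
    final Node[] nodes = new Node[256];
    Node get(byte b) { return nodes[b & 0xFF]; }
}
```

The single shared terminal instance mentioned above would be one EmptyNode reused for every maximal string.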
There is no information inside a terminal node. That’s why we can allocate exactly one of those and re-use it (previously we couldn’t do it, because a terminal node could become internal as we built the tree).
The search procedure changes, too:
Obviously, such an amount of virtual calls should affect performance, and it did:
Results are still better than those of LastByteMatcher for long patterns, but the boundary value is now higher (96 rather than 32).
The number of allocated arrays improved dramatically: the pattern of length 106 produced 4335 instances of SingleNode and 49 of ArrayNode. Of the latter, 33 had two children, 7 had three and 4 had four.
It is unclear if these numbers justify special treatment for nodes with very small branch factors.
Possible other options
We know that for patterns of length 106 the average iteration count is 2.7. We could keep the first three levels of our tree as ArrayNodes and implement the rest using some space-saving technique. We could even hard-code this in our search routine, avoiding virtual calls at these levels. This is very speculative, as it might not work with data of a different nature. And, in any case, it won’t run faster than SuffixMatcher. After all, the purpose of this exercise is to make a matcher close to SuffixMatcher in performance, but less memory-hungry. However, there is one option that has the potential to be faster as well.
Chain Suffix Matcher

Repository reference: ChainSuffixMatcher.java.
If we dump a typical search tree now, we’ll see that there are lots of nodes of type SingleNode whose children are again SingleNodes, and so on. These SingleNodes form a chain, and the search engine traverses this chain one node at a time. How about modelling such a chain explicitly as a special type of node? This will require a change in the interface of Node: it must match itself against a byte sequence instead of a single byte:
Each node matches several incoming bytes to a portion of the tree and then calls the matcher of the subsequent node. The method returns obtained shift. This way iteration is replaced by tail recursion, so the main loop body becomes very small:
We’ll follow the same approach as for
SmallSuffixMatcher: build a generic tree first, converting it
later to the specialised tree. Here are the classes of that tree:
Most single nodes were converted to chains: our 106-byte pattern now produces 49 arrays, 9 single nodes and 107 chains of average length 49.5.
Unfortunately, the times didn’t improve:
There is still some value in this solution: it allocates less memory. Here is the amount of memory, in kilobytes, allocated for several types of matchers:
Light chain suffix matcher
Repository reference: LightChainSuffixMatcher.java.
There are multiple ways the ChainMatcher can be improved. We tried two, which are not presented here because they didn’t improve speed. They are, however, published in the repository. In one of them, tree nodes were specialised by the presence of a child; in another one, they were also specialised by the boundary flag. This complicated the code but made no performance difference.
This doesn’t mean there are no ways to improve speed. And there are definitely ways to improve
memory footprint. The table above shows very good memory use for
ChainSuffix, but only compared
to other methods. If the pattern length grows, so does the memory use, eventually making this method
inapplicable. We want to move this point as far right as possible on the pattern length scale.
One way to save memory looks almost trivial: all our chains are in fact substrings of our pattern,
so they can be represented as offsets and lengths, with no need to copy them into specially allocated
arrays. This is a second-degree improvement compared to what we've just done (remember, previously
each byte of a chain was represented by a node (a Java object), and before that, as an array
of 256 elements). Still, getting rid of these arrays looks attractive. All that's necessary is to
collect the positions of the original nodes in the pattern and, obviously, provide the chain nodes
with a reference to the pattern.
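As a sketch of this idea (with illustrative names, not the repository's actual code), a chain stored as an (offset, length) view into the pattern can compare itself against the text without any copied arrays:

```java
final class LightChainSketch {
    // Compares pattern[off .. off+len-1] right to left against the text,
    // ending at position pos. The chain itself is just the pair (off, len);
    // no bytes are copied out of the pattern.
    static boolean chainMatches(byte[] pattern, int off, int len,
                                byte[] text, int pos) {
        for (int i = len - 1; i >= 0; i--) {
            if (text[pos - (len - 1 - i)] != pattern[off + i]) return false;
        }
        return true;
    }
}
```

A chain node then stores two ints instead of its own byte array, which is where the memory saving comes from.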
I'll skip the modifications to the tree-building procedure (see the repository); here is the resulting code:
Here are the times:
And the memory use:
The memory use isn’t much lower than before, but we’ll look at this again when considering longer patterns.
There are other ways to reduce memory, e.g. to re-use some subtrees, essentially converting the tree into a DAG. This, however, falls outside the scope of this article, which has already grown too long.
Let's, however, make one last effort.
Lightest chain suffix matcher
Repository reference: LightestChainSuffixMatcher.java.
LightChainSuffixMatcher is good in speed and memory use, but lacks a good procedure for
its instantiation. The way the tree is created is clearly non-optimal: the final tree is well
optimised for space, but the initial tree is not, and, as we'll see later, it may require quite
an amount of memory. It is also not very quick to build. Even though our primary interest is in
the application of matchers and not in their instantiation, we still prefer our solution to be
practically usable for a wide range of inputs.
Let’s save some time and space by creating the tree in one step. Another change is that we’ll also insert full strings into the tree rather than bytes one by one.
Here is the base class for a tree node:
A new method
add will insert a given substring of the
pattern into this node, creating, if necessary,
more nodes and possibly replacing this node with a new one.
We'll sacrifice the
SingleNode (it will be replaced with a
ChainNode of length 1; it can be re-introduced,
but the code would become longer and less clear, with uncertain benefits). There will be three subclasses of Node:
- ArrayNode: a branch node, which has more than one child. It is currently always implemented as a 256-element array but, to save memory, could be replaced by something smaller. A null value in the array means that the corresponding byte is illegal in this state; otherwise, we can descend one byte deeper into the tree;
- ChainNode: a node representing a string of bytes; it has a single child indicating the next node to match; null means the end of the matching process;
- EmptyNode: a terminal node that indicates the end of the matching process for an ArrayNode child (null has another meaning there). This node is immutable, so exactly one instance is needed.
The boundary bit indicates that the path in the tree from the root to (but not including) this node
represents a valid prefix of the pattern. The
EmptyNode always has this bit on. The bit always
applies to the whole node; if a prefix of the pattern ends in the middle of a chain, the chain is
split in two.
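The splitting rule just described can be sketched like this (the representation and names are illustrative, not the article's actual classes):

```java
import java.util.Arrays;

final class ChainSplitSketch {
    static final class Chain {
        final byte[] bytes;
        final boolean boundary; // the path up to this node is a valid prefix
        Chain next;

        Chain(byte[] bytes, boolean boundary) {
            this.bytes = bytes; this.boundary = boundary;
        }
    }

    // Splits 'chain' so that its first k bytes stay in place and the rest
    // moves into a new child chain carrying the boundary flag: a pattern
    // prefix that ends k bytes into the chain now ends exactly at a node.
    static Chain split(Chain chain, int k) {
        Chain head = new Chain(Arrays.copyOfRange(chain.bytes, 0, k), chain.boundary);
        Chain tail = new Chain(Arrays.copyOfRange(chain.bytes, k, chain.bytes.length), true);
        tail.next = chain.next;
        head.next = tail;
        return head;
    }
}
```

Since the boundary bit applies to a node as a whole, splitting is the only way to mark a prefix that ends mid-chain.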
Here are our nodes:
ChainNode.add looks complicated but actually is not. It must cater for four major cases:
- the added string is a proper suffix of the chain: split the chain at that point to put a
boundary flag on the second chain;
- the chain is a proper suffix of the added string: add the remaining string to the chain's child, if it exists;
otherwise, make the remaining string such a child, with the boundary flag set;
- the added string is equal to the chain: if there is a next node, set its boundary flag;
- the string and the chain differ somewhere: create an
ArrayNode and insert it into the appropriate place. There are three sub-cases: the array node may replace the first byte of the chain, the last one, or some middle one, splitting the chain in two.
The tree-building procedure now looks very neat:
The matching procedure looks nice, too:
The times are very close to those of the
LightChainSuffixMatcher – removal of the
SingleNode hasn't affected them:
This was the last of the algorithms we planned to look at today. Now it’s time to look at other aspects of string search.
The primary reason why we went on this journey was the absence of an
indexOf method for byte arrays,
which was needed for my real-world problem. However, such methods do exist for Java Strings,
and we can compare them with our solutions converted to work with strings.
We'll convert some of our solutions (it's too much work to do it for all of them; besides, it feels unnecessary),
ending with Suffix. To simplify things, we'll ignore Unicode
and limit our character set to 256 characters.
Besides the matchers already mentioned, we'll add three string-specific ones:
- StandardMatcher: the same as SimpleMatcher, except that it calls the String.regionMatches() method instead of our compare(). The methods are nearly identical, except that the former is implemented right inside the String class and has direct access to its internals. This is a way to check whether that gives any advantage;
- IndexofMatcher: makes direct use of String.indexOf(), which is (as of Java 8) essentially the same as our Simple matcher;
- RegexMatcher: we compile the pattern as a regular expression and then use this pre-compiled pattern. Note that an arbitrary string isn't automatically a valid regex pattern: it may contain special characters that have their own semantics and must be properly escaped. We'll ignore this for now, as our artificial example doesn't contain such characters.
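The escaping issue mentioned for RegexMatcher has a standard solution in the JDK: Pattern.quote() wraps the string in \Q...\E so that special characters lose their regex meaning. A minimal sketch (the helper name is ours, not the article's):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteDemo {
    // Finds 'literal' in 'text' as a plain string, using a pre-compiled,
    // properly escaped regex pattern; returns the match start or -1.
    static int find(String text, String literal) {
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        return m.find() ? m.start() : -1;
    }
}
```

With Pattern.quote, the parentheses in a pattern like "(usd)" are matched literally; without it they would be parsed as a capturing group.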
Here are the times:
The first place in each column is marked green, the second yellow and the third red.
- Standard is roughly 10% faster than IndexOf, which is surprising;
- the Simple matcher does the same as the Standard one, just calling String.charAt() instead of char array access. This makes a hell of a difference: it works at half the speed;
- FirstBytes is still much faster than Simple, but slower than the matchers based on the standard library;
- the compiled regular expression matcher is really fast, quite a bit faster than all naïve matchers;
- we still managed to outperform it with both LastByte and Suffix; of these two matchers, LastByte is faster for shorter patterns, while Suffix is faster for longer ones (beginning at 32).
The Regex pattern matcher
Repository reference: RegexByteMatcher.java.
The pattern matching engine of Java is designed to handle much more complex cases than ours: it works with potentially very sophisticated regular expressions. Taking this into account, even though we managed to overtake it, this engine performed very well in our test. How was this achieved?
The pattern search routine, starting at
java.util.regex.Matcher.find(), quickly ends up at
java.util.regex.Pattern.BnM.match() (
BnM standing for the
Boyer-Moore search algorithm, 1977):
There is more code there, related to matching more complex patterns than just a simple string, which we can ignore for now. This is, however, the main execution path in our case. Here are some immediate observations on this code:
- This code looks very neat.
- Labels may look odd to someone who grew up with the “Goto statement considered harmful” maxim, but they can be used quite successfully by Java developers; after all, Java labels are not quite goto.
- The Java library developers also prefer to pre-load fields into local variables, just like we did. It's not clear if it makes any difference to the generated code, but let's not break the tradition.
- This code doesn't make use of the string internals: the pattern is converted into an int array, while the text is addressed as a CharSequence using a virtual call to charAt(). So it's possible to make a fast matcher even under these limitations.
- You can have your name attached even to such a simple algorithm if you are the first to invent it.
We didn't use this exact algorithm anywhere above; the closest to it was
NextByteSuffixMatcher. This is what the Regex code does:
1) compare the current portion to the pattern right to left and identify the good suffix (recall that this is the longest suffix that matches, and that the most common case is a suffix of length zero);
2) if this suffix spans the entire pattern, we have a match;
3) otherwise, look up the shift for this suffix in a prepared table, just as we did;
4) then look at the bad character (the first character that didn't match) and get a shift for this character. The greater of the two shifts is used.
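The steps above are the textbook Boyer-Moore scheme, and can be sketched as follows. This is a classic implementation (good-suffix table plus a single bad-character table), not the JDK's actual code:

```java
import java.util.Arrays;

final class BoyerMooreSketch {
    // Strong good-suffix preprocessing: shift[j] is the shift to apply when a
    // mismatch happens just before pattern position j (j = m means full match).
    static int[] goodSuffix(byte[] p) {
        int m = p.length;
        int[] shift = new int[m + 1];
        int[] border = new int[m + 1];
        int i = m, j = m + 1;
        border[i] = j;
        while (i > 0) {
            while (j <= m && p[i - 1] != p[j - 1]) {
                if (shift[j] == 0) shift[j] = j - i;
                j = border[j];
            }
            i--; j--;
            border[i] = j;
        }
        // Case 2: the good suffix occurs nowhere else; shift by the widest border.
        j = border[0];
        for (i = 0; i <= m; i++) {
            if (shift[i] == 0) shift[i] = j;
            if (i == j) j = border[j];
        }
        return shift;
    }

    static int indexOf(byte[] text, byte[] p) {
        int m = p.length, n = text.length;
        if (m == 0) return 0;
        int[] bad = new int[256];              // rightmost position of each byte
        Arrays.fill(bad, -1);
        for (int k = 0; k < m; k++) bad[p[k] & 0xFF] = k;
        int[] gs = goodSuffix(p);
        for (int i = 0; i <= n - m; ) {
            int j = m - 1;
            while (j >= 0 && p[j] == text[i + j]) j--;  // right-to-left compare
            if (j < 0) return i;                         // suffix spans the pattern
            // The greater of the good-suffix and bad-character shifts is used.
            i += Math.max(gs[j + 1], j - bad[text[i + j] & 0xFF]);
        }
        return -1;
    }
}
```

The bad-character term `j - bad[c]` can be zero or negative; the good-suffix term is always at least one, so the combined shift always moves forward.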
The last step differs from our implementation: we had a
separate shift table for each position; they use just one table, calculated for the last position
(exactly the same as our table for
LastByteMatcher), adjusting the shift by the bad character’s position.
The shift obtained this way, while safe, must generally be worse than the shift we got.
Let's see by how much:
Avg shift, LastByteSuffix:   3.5   6.0    9.6   13.9   19.2   22.2   24.8
Avg shift, Regex:            3.5   6.1   10.0   14.8   20.9   24.3   27.2
Avg shift, NextByteSuffix:   3.5   6.1   10.0   14.8   20.9   24.3   27.2
Here comes a surprise: the average shifts for
Regex are exactly the same as for
NextByteSuffix. Is it simply a rounding issue, the results being similar but not equal? We can print
the raw data used to calculate the averages. They are exactly the same. For instance, for a pattern length
of 16, both algorithms report a shift sum of 1514837185 over 151637812 operations.
This can't be a coincidence, and, indeed, it's not. The shift generated by multiple tables (
NextByteSuffix) can differ from the shift from a single table (
Regex) only when the latter is negative, that is,
when the bad character occurs in the good suffix.
Anything that can only be found to the left of the bad character, gets the same shifts from both algorithms:
Here the bad character is
o, which gets a shift of one from the
NextByteSuffix table;
the shift of 7 found in the
Regex table, adjusted by its position 6, also produces 1.
When the
Regex shift of the bad character is negative, the final shift is determined by the good
suffix shift. It is easy to see that this shift is never smaller than the shift of
the bad character.
Let the pattern length be l and the good suffix length be n. We have two cases:
- the good suffix is never found in the pattern left from itself. Let the longest suffix of the good suffix that is also a prefix of the pattern have length m (possibly zero). In this case this suffix produces shift l − m, which is greater than l − n, which is the maximal possible shift for the bad character (obtained when it doesn’t occur in the pattern left of itself).
Here is the example demonstrating this case, where m = 3:
The bad character is
o, and its shift is −3, so the shift of 29 is produced by the good suffix
" dou", of which the suffix
dou is a prefix of the pattern (l = 32, n = 4, m = 3).
- the good suffix is found in the pattern k ≥ n positions left from itself, which means that it produces the shift of k. In this case the bad character, being part of this suffix, is found in the pattern between positions k and k + n − 1 from the right. This makes its shift k − n to k − 1, which is less than k.
Here is the example demonstrating this case:
Here the suffix is
" do" of length 3, and its shift k = 10. The bad character is
d (occurs in this suffix),
and it is found 8 positions left from itself. The
NextByte implementation will give 8, the
Regex will produce −2,
and the total shift will be determined by the suffix shift (10).
So it is proven that the algorithm with one table produces exactly the same results as the algorithm with a table at every character position. This is remarkable, and it wasn't immediately obvious. It shows there is a reason algorithms get named after their inventors.
It's worth noticing, however, that the
Regex algorithm, even being very smart, is still a variation of
LastByteMatcher: in 90% of cases (when the last character does not match) it produces the same shift.
The reason it performs so well as a string matcher (although it lost to
LastByte, and lost big time to
Suffix on long patterns) is its simplicity.
Repository reference: RegexByteMatcher.java.
A detour into string matchers was quite productive: we've learned another good algorithm,
which is close, but not identical, to those considered before. Let's return to byte array matching
territory and try implementing this algorithm. The new matcher will get its name from its origin:
RegexByteMatcher.
The shift statistics is exactly the same as expected, which means we’ve implemented it right. The times, however, are not so good:
What could have caused such a difference? The
NextByteSuffix matcher, apart from using multiple tables, also employs
a special treatment of the last byte. This is a classic case of a partial loop unroll: if there is a big chance that
the loop is executed exactly once (which is our case: the probability is 90%), and the loop body in this case is simpler
than the general one, move it out of the loop and adjust the iteration count. Sometimes a compiler can perform this
transformation automatically, but often it does not have enough information to do so. Let's help it. We saw that
it helped previously in
FirstByteMatcher; let's try again. We'll call the new matcher
RegexByteMatcher2 (repository reference: RegexByteMatcher2.java).
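To illustrate the transformation (this is a Horspool-style sketch, not the repository code), the check of the last byte, which mismatches about 90% of the time, is pulled out of the general comparison loop:

```java
import java.util.Arrays;

final class UnrollSketch {
    // Last-byte shift table: distance from the rightmost occurrence of each
    // byte (excluding the final position) to the end of the pattern.
    static int[] shifts(byte[] p) {
        int[] s = new int[256];
        Arrays.fill(s, p.length);
        for (int i = 0; i < p.length - 1; i++) s[p[i] & 0xFF] = p.length - 1 - i;
        return s;
    }

    static int indexOf(byte[] text, byte[] p) {
        int[] shift = shifts(p);
        byte last = p[p.length - 1];      // pre-loaded once, like the 'last' variable
        int pos = p.length - 1;
        while (pos < text.length) {
            byte b = text[pos];
            if (b != last) {              // unrolled fast path: ~90% of iterations
                pos += shift[b & 0xFF];
                continue;
            }
            // Slow path: full right-to-left comparison of the rest.
            int j = p.length - 2, t = pos - 1;
            while (j >= 0 && text[t] == p[j]) { j--; t--; }
            if (j < 0) return pos - p.length + 1;
            pos += shift[b & 0xFF];
        }
        return -1;
    }
}
```

The fast path touches only one text byte and one table entry per iteration; the general loop is entered only when the last byte already matches.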
Two factors may contribute to the performance of this code: the
last variable is loaded once (and hopefully stored in a
register), and the
suffix_shifts array is not accessed when the suffix length is zero (the shift array contains zero
in this case, but it would be too much to expect the compiler to make use of that). So, this is the speed:
Performance has improved quite a bit and is now pretty much the same as that of
NextByteSuffixMatcher, but the code
is simpler. It is, however, still more complex than the simple
LastByteMatcher, and it is not faster.
We tried many options:
- the Simple matcher that uses plain string comparison: runs rather slowly;
- several FirstByte matchers, optimised for special treatment of the first bytes: a definite speed improvement;
- a Compile version that creates a Java class on the fly: marginally faster;
- the Hash matcher: the first attempt at a non-trivial algorithm; failed miserably;
- the LastByte matcher: another non-trivial algorithm and the first to produce shift values bigger than one; a big success;
- the MultiByte versions that improve shift values at the cost of more complex code and fail to improve speed;
- LastByteSuffix: the first attempt to operate on entire suffixes rather than single characters; failed due to the limited subset of cases it addressed;
- NextByteSuffix: a combination of the suffix approach and the single-character approach; worked well, but no faster than LastByte;
- Regex: a classic algorithm that has a name and a Wiki page; it is a variation of NextByteSuffix and, once optimised into Regex2, performs as fast as that one;
- Suffix: a generalised algorithm operating on suffixes and addressing all the cases; a good success on longer patterns, but it suffers there from high memory demand;
- several compiled versions of suffix matchers, which failed to improve speed (only one is included in the table, to keep it shorter);
- SmallSuffix: a way to reduce the memory requirements of Suffix at the cost of some performance loss;
- ChainSuffix: a further reduction of memory requirements; the same or slightly worse performance than SmallSuffix;
- LightChainSuffix: an improvement of the memory use of ChainSuffix; pretty much the same performance on the tested pattern lengths;
- LightestChainSuffix: an identical matching algorithm to the previous one, but with a better procedure for constructing the matcher.
Here are all the results:
The colour scheme is the same as before (green – yellow – red for the first, second and third place).
The compiled versions are excluded from the competition as very exotic.
LastByteMatcher and its
variants are clearly winning on shorter patterns, while the
SuffixMatcher family takes over
on the longer ones.
It is remarkable that the biggest improvements were obtained by algorithmic changes rather than
low-level optimisations. It is also worth noticing that the generic
Suffix matcher was both neater
and faster than the specialised ones.
Note: Since publishing of this article, a new, better matcher has been suggested, see Update 2.
Until now we've been working with natural texts, with significant simplifications (lowercase, no punctuation, no line breaks, a total alphabet size of 27). How will the results change if our text and our pattern become totally random sequences of bytes?
We'll generate a text a bit bigger than before (4M) and use some position inside it as a source for our patterns. Since the data are random, we don't need very many patterns to get an average value: we'll just use four. We won't run all our algorithms now, just the most significant ones:
Surprisingly, the top-performing solutions haven't changed.
Suffix is the best, although its
advantage over
LastByte is small.
All the simple solutions fall way behind, so better algorithms are beneficial for random data as well.
Let’s try longer patterns:
SuffixMatcher failed on long patterns due to a shortage of RAM, so it was good that we bothered
to implement the more compact
ChainSuffix matchers. They started slowing down due to a shortage
of L3 cache, and will eventually also fail at even longer patterns. However, on the current patterns they
perform quite well. Here are the amounts of allocated memory, in megabytes:
Big random patterns
How about even longer patterns? The
SmallSuffix is already about to fail.
The chain suffix matchers look much better, but two of them suffer from the same problem:
ChainSuffix and LightChainSuffix first build a classic search tree, which takes a lot of
space. Here is the amount of memory this tree takes, in megabytes:
This is where our effort to make
LightestChainSuffixMatcher pays off:
of the tree-based matchers, it is the only one that can survive doubling
of the pattern size. Obviously, the less efficient but less memory-hungry ones can, too. Let's try to grow
our pattern to one megabyte. First of all, let's check if using
LightestChain is feasible at all.
We know it builds a big tree: how long does that take?
The time of 310 ms isn't insignificant, but it is still appropriate for practical use. Now let's check the matching times:
The results are somewhat disappointing: starting impressively, at nearly ten times the speed of its rivals, the
LightestChain matcher eventually (at 256K-byte patterns) loses its advantage,
and it's easy to see why. Let's look at the memory consumption of the matchers:
(The LastByte matchers use exactly 1064 bytes.)
We ran out of L3 cache, which on our machine is only 15 MB per chip. Our text (4 MB) fits in the cache, even together with the data for the simple matchers. That's why, even with higher iteration counts, they outperform the chain matcher.
Big random patterns, huge data
Finally, let’s run the ultimate test: we’ll search for long patterns (16K to 1M) in a sixteen gigabyte text. Technically this text is arranged as 16 arrays of one gigabyte each, one of which contains the pattern. The search happens exactly once, to prevent caching.
It looks strange at first that the lightest chain suffix matcher is so much faster now than when the text was cached. Most probably this is due to a much lower success rate (previously the pattern was found once in every four megabytes of scanned text), and it is the matches that are slow.
The performance advantage of this matcher on the uncached data is impressive. When the pattern does not match the current portion, we shift the pattern by the value that is very close to the length of the pattern, avoiding a lot of uncached reads. Instrumenting the matcher code with a counter shows that matching each gigabyte of the text involves reading about 3K bytes there. The total number of bytes read from all 16 gigabytes is 2150497, of which 2100280 were read when processing the part that contained a match.
This makes it a good solution for searching in files, too. So if one has a multi-terabyte file and needs to find a one-megabyte fragment inside it, this is the way to go. It doesn't, however, seem like a realistic practical problem.
LightestChainSuffixMatcher has weak points, too. The tree for a one-megabyte pattern
occupies 137 Mbytes, so searching for much longer patterns requires radical space reduction.
The time to build a tree is small, but not completely insignificant: for the same tree it is 310 ms.
Some improvement may be needed there, too.
On reddit, user pgris
asked how these algorithms would perform on UTF-8 data, e.g. a long Chinese text. This is a very
valid question. First of all, will the algorithms work at all? Characters in UTF-8 take more than one
byte, and there is a risk of misaligning the pattern (trying to match the beginning of one character to the
middle of another). Fortunately, the design of UTF-8 makes this situation impossible. The first
byte of each multi-byte character has high bits
11, while subsequent bytes start with
10 (and single-byte characters start with 0).
This makes it possible to search for UTF-8 strings just like ordinary byte blobs.
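A quick demonstration of this alignment argument: a UTF-8 encoded substring matches only at a character boundary, so a plain byte search (here a naive one, for brevity) lands exactly where the character starts:

```java
import java.nio.charset.StandardCharsets;

final class Utf8SearchDemo {
    // A naive byte-array indexOf, sufficient to illustrate the point.
    static int indexOf(byte[] text, byte[] p) {
        outer:
        for (int i = 0; i + p.length <= text.length; i++) {
            for (int j = 0; j < p.length; j++) {
                if (text[i + j] != p[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "三国演义".getBytes(StandardCharsets.UTF_8);
        byte[] pattern = "演义".getBytes(StandardCharsets.UTF_8);
        // Each of these characters occupies 3 bytes in UTF-8, so the match
        // starts at byte offset 6, which is a character boundary.
        System.out.println(indexOf(text, pattern)); // prints 6
    }
}
```

A continuation byte (high bits 10) can never equal a lead byte (high bits 11 or 0), so no false match can start in the middle of a character.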
Performance, however, is a different issue. Multi-byte characters may affect some of the algorithms negatively. For instance, if Chinese characters use a small set of first bytes, it will slow down the first byte matcher. If some last bytes are very frequent, the last byte matcher will suffer, and so on. Let's try it.
On the Loyal Books site I found a long enough (1.8M) Chinese text: “The Three Kingdoms Romance” by Guanzhong Luo. The text contains Chinese characters, some punctuation and line breaks. It is in the repository under the name Book-23950.txt. We'll use some arbitrarily chosen line as our pattern. Its length is 114 bytes (38 characters). When searching for substrings, their lengths must be adjusted so that we search for full characters.
Here are the times:
LastByte and its family (such as
Regex) performed very well, while
Suffix and friends fell a bit behind (although they don't look too bad, either). Perhaps they need longer
patterns to excel in the UTF-8 case.
Another important feature of UTF-8 searching is Unicode normalisation, but that falls beyond the limits of our study. The same applies to case-insensitive search. After all, initially we only wanted to search for bytes…
Update 2: TwoByteHashShiftMatcher
The idea is similar to
LastByteMatcher, except that instead of the single last byte we look at the
suffix of length 2. The shifts for these suffixes are pre-calculated and stored in an array.
Storing shifts for all possible suffixes can be expensive (we'd need an array of 64K elements, or
256K bytes, which is still smaller than many of our solutions), so we save space by using hash values.
Of all possible hash functions, the simplest one is used:
Varying the shift, we can change the table size and the algorithm's performance.
The matcher looks like this (full text in repository):
Another difference from
LastByteMatcher is that it doesn't compare the last bytes to those of the pattern;
instead, it uses a shift value of zero to detect that they match. This reduces the shift achieved
in that case (one instead of the shift to the second rightmost occurrence of these two bytes),
but this happens very rarely.
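Based on this description, a sketch might look as follows. The hash function, the table layout and the names (apart from p2, which follows the article) are assumptions; the actual code is in the repository:

```java
import java.util.Arrays;

final class TwoByteHashSketch {
    final int p2;        // hash shift: the table has 1 << (8 + p2) entries
    final int[] shifts;  // shift per hashed two-byte suffix; 0 = possible match
    final byte[] pattern;

    // Requires pattern.length >= 2.
    TwoByteHashSketch(byte[] pattern, int p2) {
        this.pattern = pattern;
        this.p2 = p2;
        shifts = new int[1 << (8 + p2)];
        Arrays.fill(shifts, pattern.length - 1); // pair absent: near-full shift
        // Later (more rightward) pairs overwrite with smaller, always-safe shifts.
        for (int i = 0; i + 1 < pattern.length; i++) {
            shifts[hash(pattern[i], pattern[i + 1])] = pattern.length - 2 - i;
        }
    }

    int hash(byte b1, byte b2) {
        return ((b1 & 0xFF) << p2) ^ (b2 & 0xFF);
    }

    int indexOf(byte[] text) {
        int len = pattern.length;
        for (int pos = len - 1; pos < text.length; ) {
            int shift = shifts[hash(text[pos - 1], text[pos])];
            if (shift != 0) { pos += shift; continue; }
            // The two-byte suffix (or a hash collision) matches: compare fully.
            int j = len - 1, t = pos;
            while (j >= 0 && text[t] == pattern[j]) { j--; t--; }
            if (j < 0) return pos - len + 1;
            pos++;    // the rare case described above: shift by one
        }
        return -1;
    }
}
```

Note that hash collisions are harmless: a colliding pair can only make a stored shift smaller (or trigger an extra full comparison), never unsafe.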
This matcher works very fast:
The results improve as the shift_value (called
p2 in the program) grows, and after value 1 they
are better than both
LastByte and
Suffix. The improvement rate slows down after two bits and
stops after 5. To see why, let's look at the average shift values:
The shifts are worse than those of
Suffix but better than those of
LastByte. They gradually improve with the
shift_value but saturate at 5. With the 5-bit shift (meaning 13 bits in total and
an array size of 8192 elements) the solution works just as well as with the full 8 bits – that's probably due to
our small, 27-character alphabet.
The compare() rate varies between 0.9% (for long patterns at 5 bits) and 3% (for short patterns at 0 bits),
with a typical value of 1.5% in the middle.
Now let’s look at the binary data.
The results (except for very small patterns) are better than
LastByte but slightly worse than
Suffix.
Still, this makes the matcher very attractive because of its simplicity. A shift value of 2 is
enough. Here are the shift factors for it, which are very good:
Now, the big patterns:
Again, it performs better than our previous solutions.
Finally, the huge test:
Only now has the
Lightest matcher managed to catch up with the two-byte hashing one. However, while
Lightest is approaching its limit of applicability at a pattern length of 1M, the two-byte one
has no such limit. Besides, there is a chance that for the huge-array case the two-byte matcher can be upgraded to
a three-byte or four-byte one, using some good enough hash function.
- We've implemented a version of String.indexOf from scratch for byte arrays;
- various optimisation techniques can make the implementation quite a bit faster (two to three times);
- however, the algorithmic improvements often outperform code optimisations;
- when working with strings, Java has a competitive advantage (access to the String internals);
- however, it is still possible to perform better, at least in some cases (long patterns);
- the Java regular expression library is very good and performs very well; however, we managed to win over that library;
- I used the LastByte matcher in the production code; it offered the best “value for money” (performance-to-complexity) ratio in my particular case. Other cases may put other matchers in front; there is never a definite winner;
- the suffix-based family of matchers works very well on a wide range of pattern lengths, but hits the memory limit on very long ones, where some additional work is required to save memory;
- we tested two major cases: data with a small alphabet and some repeating patterns (simplified natural language), and completely random data. We didn't cover very exotic regular patterns; other algorithms may perform better in those cases;
- although generation of class files on the fly seems like the way to go in Java, it never worked: the performance was consistently worse than that achieved by normal Java code;
- our starting point was searching for small to medium-sized patterns in not-so-big byte arrays. In the process we accidentally came up with an algorithm suitable for searching for big patterns in huge arrays or even in files, which works 1000+ times faster than any alternative. There doesn't seem to be a demand for that, though;
- a solution was suggested that combines the simplicity of LastByte with the performance of Suffix by indexing suffixes of length 2 (TwoByteHashShiftMatcher). This one performs like magic.
Things to do
I want to run similar tests in C++. This presents multiple challenges. On one hand, it also has
a pattern-matching library. On the other, it allows easy access to CPU instructions, which
provide both fast byte search (
SCAS) and wide comparisons (64 bits in normal mode and up to 512 bits with AVX-512).
Comments are welcome below or at reddit.