Reading time: about 7 minutes
Pattern matching with strings is a classic problem in computing. It has applications ranging from humble text editors to cutting-edge genome research in bioinformatics. The pattern matching algorithms also exhibit a curious property: approaches that work well for “typical” texts tend to perform badly on worst-case texts, and vice-versa. FJS attempts to solve this problem by combining the features of two existing algorithms.
What do I mean? Let’s look at a “typical” example problem: find all matches of at (the pattern) in Cat bath mat (the text):
You’d do OK even with a naïve algorithm that walked along the text and checked, at each position (called an alignment), if an occurrence of the pattern at started there:
But now consider this problem: find all matches of the pattern aaab in the text aaaaaaaaaaaaaa.
There are no matches, but the naïve algorithm can’t see that right away. It will check, at each position in the text, if the pattern can be found starting from that position. In each case, it won’t figure out that there is no match until it checks the last pattern letter. Let’s call the pattern p and its length m=|p|, and let’s call the text t and its length n=|t|. Since at each position in t we test all m letters in p, we see that we are doing on the order of m × n comparisons. (A few less, since the pattern is too long for a match to start at any of the last m - 1 positions in t.)
The KMP algorithm handles cases like this very nicely. It never needs to do more than 2n - m letter comparisons. To achieve this, it needs an auxilliary array for p with entries based on the following idea: suppose we are looking for the pattern onions at a particular position in the text and everything matched up until the s (which didn’t match). Consider only the part that did match, onion. Notice that it both starts and ends with on. We say that on is a border of the string onion. In fact, on is the longest possible border of onion. How can we use this information? Well instead of moving to the next letter in t and starting over like the naïve algorithm would, we can instead move down to the on at the end of the onion sequence that we know matched. Now since we know that the pattern starts with on (because it is the longest border of the part that matched), we don’t test for the o or the n and instead start by checking if the i matches. You can prove that this is safe and won’t miss any possible matches.[2, 5]
OK, so KMP lets us search any string in O(n) time (that is, time proportional to n). But it also takes time to make this auxilliary array. It turns out that if you are clever you can do this in O(m) time, so the whole thing now runs in O(n + m) time. That’s much better than the O(m × n) time that the naïve algorithm offers… or is it?
Notice that KMP fixes “weird” worst cases with periodic strings like aaaaaaaaaa. Yet typical text doesn’t look much like that, and we have that extra O(m) preprocessing time, too. The net effect is that KMP is slower than the naïve algorithm on typical texts! There are, however, a number of algorithms that address this average case. Of these, the fastest is generally considered to be the BMS[3, 4] algorithm.
Unlike KMP, BMS matches the pattern from right to left starting from the last letter. When it hits a mismatch, it jumps to the right. If the text letter that didn’t match can be found in the pattern somewhere left of the mismatch point, then that point in the pattern is aligned with the mismatch point. The real magic, though, happens when the mismatched letter does not appear in the pattern. Then the entire pattern is shifted to the point just after the mismatch, leaping m letters down the text in a single step! On a typical text, BMS will finish in a fraction of the time of KMP or the naïve algorithm. The down side is that BMS always starts over at the right end of the pattern each time the pattern moves down the text: in the worst case, it is as bad as the naïve algorithm. And like KMP, BMS needs an auxilliary array that tells it how far it can jump, but its array needs an entry for every letter in the alphabet used by the text and pattern (such as ASCII or Unicode). So instead of running in O(n + m) time it runs in O(n + α) time, where α is the number of letters in the alphabet. Large alphabets, like Unicode, threaten to overshadow the gains made during matching, but in practice there are workarounds for this.
Finally we come to FJS. The goal is to create a hybrid of KMP and BMS that combines their best properties: KMP’s worst-case performance and BMS’s average-case performance. This is not as simple as it may sound: keep in mind that they don’t even run in the same direction when checking for pattern matches! And you can’t just switch one of them around: their scan directions are critical to how they work.
To do it we need to recognize two key requirements. First, the BMS part can’t be allowed to restart from the right end of the pattern after any arbitrary mismatch, or you will lose your worst case guarantee. (That means we need to be able to use KMP’s rules “nearly always.”) Second, the majority of the benefit of BMS in typical texts comes from the pattern not matching on the first test at a given position and then making a jump. So, when we can we will start any new alignment position of the pattern to the text by testing the right end of the pattern against the corresponding text position. If it doesn’t match, we use a BMS-style jump. If it does match, then we continue matching the rest of the pattern from the left end using KMP. The when we can condition means that once we start doing KMP matching, we need to keep using it as long as there is a border for KMP to use at the next pattern position. But whenever KMP has to start over from the left end of the pattern, we get to sneak in another BMS-style test at the other end first.
Since the BMS test is only performed on the last pattern letter, never in a loop over the pattern, it can’t add more than n - m + 1 letter comparisons than KMP would have done, so FJS is still O(n) in the worst case. But remember, it uses the preprocessing arrays of both algorithms, making it run in O(n + m + α) time overall.
That said, it turns out that FJS is actually about 5-10% faster than BMS on average. This seems to come down to two factors: First, the loop for doing BMS skipping is tighter in FJS than in BMS because it only ever scans the final pattern letter. Second, the fact that the first pattern letter to be checked and the second pattern letter to be checked are further apart may make early detection of a mismatch more likely. For example, -ing is a common suffix in English. If a pattern ends in -ing then BMS needs at least 3 letter comparisons every time it is aligned with an -ing word in the text. In contrast, FJS would often detect a mismatch after only 2 letter comparisons.
Using the controls below, you can get a good sense of how the algorithm works by watching it run in slow motion on any strings you like. Two things to watch for: big jumps thanks to BMS when the last pattern letter doesn’t match, and pale green letters at the start of the pattern that indicate the parts that KMP knows will match without any testing.
This article intentionally glosses over fine details. For those with a computing science background who want more, this section summarizes some of the key properties of FJS. Proofs and further details can be found in the primary literature.
FJS uses auxilliary arrays from the KMP and BMS algorithms. KMP’s modified border array (called β′ in the journal article and code) requires m + 1 words while BMS’s skip array (called Δ) requires α words.
In the best case, FJS performs ⌊n / (m+1)⌋ letter comparisons. This occurs when, at each possible alignment, p[m] does not match the letter it is aligned with in t and the following letter (in t) does not occur in p. In the worst case, FJS performs 3n - 2m letter comparisons. The maximum number is 3n - 6m, triggered by strings of the form t = an and p = aba. In the average case, empirical testing suggests that FJS is currently the fastest exact pattern matching algorithm under most practical circumstances.
As with all algorithms derived from Boyer-Moore, FJS requires an indexed alphabet. For practical purposes, this means that the alphabet must be of finite size, it must have letters encodable as a series of integers, and it must be possible to determine the code for any letter in O(1) time.
Empirical evidence suggests that FJS is usually not the fastest algorithm for small alphabets (α < 20), although it remains competitive. For very large alphabets, preprocessing time may dominate unless n is large. An example where this may occur is 16-bit Unicode characters. One strategy for dealing with this in a language like C is to cast the strings to 8-bit arrays and ignore any matches at odd offsets. Another strategy is to create a Δ table with 256 entries and mask away the high byte of each character, effectively creating a kind of hash table. (Table entries should use the smallest jump of any character colliding with the hash value.) This may result in extra letter comparisons in practice, but does not affect correctness or worst case performance.
Code is available on GitHub under the liberal MIT license.