June 3, 2017
About 10 minutes
Visualization Algorithms Strings STEM

The FJS string matching algorithm

Find all occurrences of a pattern in a longer text, quickly

Contents

About

FJS is an algorithm which I developed as part of my Master’s degree. It is perhaps the fastest general algorithm for finding all occurrences of a pattern string in a text string.¹ This page gives an overview of how it works, including a visualization so you can see it in action.

Introduction

Finding patterns in strings is a classic problem in computing. It has applications ranging from a humble text editor’s search-and-replace feature to cutting-edge genome research in bioinformatics. Most pattern matching algorithms exhibit a curious property: algorithms that work well for typical texts tend to perform badly on worst-case texts, and vice-versa. The original impetus for FJS was to fix this by combining aspects from two existing algorithms: one that works well in worst-case scenarios and one that works well on typical texts. To my surprise, not only did FJS succeed in this regard, it turned out to be extremely fast in practice.

Before getting into the details, let’s look at examples of how a naïve algorithm might work in order to understand the main performance issues faced by these kinds of algorithms. Consider this example: find all matches of at (the pattern) in Cat bath mat (the text):

A naïve algorithm would walk along the text and check if the pattern at occurs at each possible alignment of the pattern with potential starting positions for matches in the text:

Seems OK. But now consider this worst-case example: find all matches of the pattern aaab in the text aaaaaaaaaaaaaa.

There are no matches, but the naïve algorithm can’t see that right away. It checks each possible alignment with the text to see if the pattern can be found starting from that position. In each case, it only determines that there is no match after checking every pattern letter, up to and including the b.

How bad is it? Let’s call the pattern $p$ and its length $m = | p |$ , and let’s call the text $t$ and its length $n = | t |$ . Since at each position in $t$ we test all $m$ of the letters in $p$ , we see that we are doing on the order of $m ⨯ n$ comparisons. (A few less, since the pattern is too long for a match to start at any of the last $m - 1$ positions in $t$ .)

KMP handles worst cases

The KMP² algorithm handles cases like this nicely: it never does more than $2 n - m$ letter comparisons.³ To achieve this, it uses an auxilliary array to store extra information about $p$ that exploits the following idea: Suppose we are looking for the pattern onions at a particular position in the text and everything matched up until the s, which didn’t match. Consider only the part that did match, onion. Notice that it both starts and ends with on. We say that on is a border of the string onion. In fact, on is the longest border of onion. How does KMP use this information? Instead of moving the pattern to the next letter in $t$ and then starting over from the start of $p$ like the naïve algorithm would, it can instead skip testing the starting on. Since it matched the on at the end of the previously matched onion substring, and since on is a border of that substring, we already know that the on at the start of the pattern has going to match, so those letters first two letters don’t need to be tested. The pattern is still aligned to the next possible starting position, but testing starts at the i in the pattern instead of starting over from o. By using this trick, KMP never moves “backward” in the text. That is, once it reaches a given letter in $t$ , it may re-test that letter against some pattern letter multiple times but it never re-tests any earlier letters.

One way to think about how KMP works is to imagine a finite state machine in which each pattern letter is a state. Leading from each letter-state are two transitions: one going to the next letter-state (if the pattern matches the text), and one going to a previous letter-state (if the pattern doesn’t match). The next letter-state after a mismatch can be determined as follows:
Look at the state immediately prior to the state where the mismatch happened. (For a mismatch at s, the second n state.)
Find the longest border of the pattern up to and including that state. (In this case, on is the longest border of onion.)
Take the length of this border (it can be, and often is, 0). Call this value $b$ . (The length of on is 2.)
Starting at the first letter-state, skip to the next letter state $b$ times. (Starting at o, move to the next state twice, first to n, and finally to i.) This is the state to resume matching from at the site of the mismatch.

Because KMP never moves backward in the text, it can guarantee that it can always search the entire text in time proportional to $n$ , even in those worst cases where the naïve algorithm requires time proportional to $m ⨯ n$ . Don’t forget, though, that KMP also requires that auxilliary array with information about the longest borders of the substrings. But, using some cleverness²^,³ I won’t get into here, it turns out that you can create that array in time proportional to $m$ . But $n + m$ is still a lot better than the $m ⨯ n$ that the naïve algorithm offers… or is it?

Notice that KMP fixes “weird” worst cases with periodic strings like aaaaaaaaaa. Yet typical text doesn’t look much like that, and we have that extra preprocessing time, too. All together, KMP ends up being slower than the naïve algorithm on typical texts!

BMS fixes typical cases

While KMP struggles with the average case, there are a number of algorithms that significantly boost performance on the average case (while sacrificing worst case performance on “weird” strings). Of these, one of the fastest is generally considered to be the BMS algorithm.⁴^,⁵

Unlike KMP, BMS matches the pattern from right to left, starting from the last letter. When it hits a mismatch, it jumps to the right. If the text letter after the mismatch can be found in the pattern, that point in the pattern is aligned with the text letter after the mismatch. The real magic, though, happens when the letter after the mismatch does not appear in the pattern. Then the entire pattern jumps past that point, leaping $m + 1$ letters down the text in a single step! On a typical text, BMS will finish in a fraction of the time of KMP or the naïve algorithms. The down side is that BMS always starts over at the right end of the pattern each time the pattern moves down the text: in the worst case, it is no better than the naïve algorithm.

Like KMP, BMS needs an auxilliary array, but this array stores how far it can jump on a mismatch. That means the array needs an entry for every letter in the alphabet used by the text and pattern (such as ASCII or Unicode. Instead of needing set-up time proportional to $m$ like KMP, BMS needs time proportional to $α$ , the size of the alphabet. Large alphabets, like Unicode, threaten to overshadow the gains made during matching, but in practice there are effective workarounds for this.¹

FJS, the beautiful baby

Finally we come to FJS. As stated earlier, the goal when I developed FJS was to create a hybrid of KMP and BMS that combines their best properties: KMP’s worst-case performance and BMS’s average-case performance. This is not as simple as it may sound: keep in mind that they don’t even run in the same direction when checking for pattern matches! (And you can’t just switch one of them around: their scan directions are critical to how they work.)

To do it we need to recognize two key requirements. First, the BMS part can’t be allowed to restart from the right end of the pattern after any arbitrary mismatch, or the worst case guarantee is lost. (It needs to perform KMP-style matching “nearly always”.) Second, the majority of the benefit of BMS in typical texts comes from the pattern not matching on the first test at a given position and then making a big jump. So, when it can FJS starts any new alignment position of the pattern to the text by testing the right end of the pattern against the corresponding text position. If it doesn’t match, it performs a BMS-style jump. If it does match, then it continues matching the rest of the pattern using a version of KMP. The when it can condition means that once it starts doing KMP matching, it needs to keep using it as long as there is a (non-empty) border that KMP can use at the next pattern position. But whenever KMP would start over from the left end of the pattern, FJS can sneak in another BMS-style test at the other end first.

Since the BMS test is only performed on the last pattern letter, never in a loop over the pattern, it can’t add more than $n - m + 1$ letter comparisons than KMP would have done, so FJS still runs in linear time in the worst case. But remember, it uses the preprocessing arrays of both algorithms, making it run in time proportional to $n + m + α$ overall.

That said, it turns out that FJS is actually about 5-10% faster than BMS on average.¹ This seems to come down to two factors: First, the loop for doing BMS skipping is tighter in FJS than in BMS because it only ever scans the final pattern letter. Second, the fact that the first pattern letter to be checked and the second pattern letter to be checked are further apart may make early detection of a mismatch more likely. For example, -ing is a common suffix in English. If a pattern ends in -ing then BMS needs at least 3 letter comparisons every time it is aligned with an -ing word in the text. In contrast, FJS would often detect a mismatch after only 2 letter comparisons.

See it in action

Using the controls below, you can get a good sense of how the algorithm works by watching it run in slow motion on any strings you like. Watch for pale (translucent) letters. A pale green (square) letter is a pattern letter that is known to match without comparing it to the text thanks to a KMP-style border alignment. After the algorithm completes, any text letters that remain pale were never compared at all, thanks to a BMS-style skip.

Text Pattern Speed

Letter comparisons:
n)
Matches found:

Legend

letter match

letter mismatch

not compared

not compared, known to match

pattern matched

Key properties

This article intentionally glosses over fine details. For those with a computing science background who want more, this section summarizes some of the key properties of FJS. Proofs and further details can be found in the primary literature.¹

Asymptotic complexity

Preprocessing time and space: $Θ (m + α)$
FJS uses auxilliary arrays from the KMP and BMS algorithms. KMP’s modified border array (called $β^{'}$ in the journal article and code) requires $m + 1$ words while BMS’s skip array (called $Δ$ ) requires $α$ words.

Search time: $O (n + m)$ , $Ω (\frac{n}{m})$
In the best case, FJS performs $⌊ \frac{n}{(m + 1)} ⌋$ letter comparisons. This occurs when, at each possible alignment, $p [m]$ does not match the letter it is aligned with in $t$ and the following letter (in $t$ ) does not occur in $p$ . In the worst case, FJS performs $3 n - 2 m$ letter comparisons. The maximum number is $3 n - 6$ , triggered by strings of the form $t = a^{n}$ , $p = aba$ . In the average case, empirical testing suggests that FJS is currently the fastest exact pattern matching algorithm under most practical circumstances.

Note that this section describes the complexities for finding all occurrences of the pattern, whereas some texts describe only the time needed to find the first occurrence of a pattern. The distinction is vital, since some algorithms described as running in linear time do so only for the first occurrence, but are actually $O (m n)$ overall.

Limitations

As with all algorithms derived from Boyer-Moore, FJS requires an indexed alphabet.³ For practical purposes, this means that the alphabet must be of finite size, it must have letters encodable as a series of integers, and it must be possible to determine the code for any letter in $O (1)$ time.

Empirical evidence suggests that FJS is usually not the fastest algorithm for small alphabets ( $α < 20$ ), although it remains competitive. For very large alphabets, preprocessing time may dominate unless $n$ is large. An example where this may occur is 16-bit Unicode characters. One strategy for dealing with this in a language like C is to cast the strings to 8-bit arrays and ignore any matches at odd offsets. Another strategy is to create a $Δ$ table with 256 entries and mask away the high byte of each character, effectively creating a kind of hash table. Table entries should use the smallest jump of any character colliding with the hash value. This may result in extra letter comparisons in practice, but does not affect correctness or worst case performance.

Sample code on GitHub

C and Java code are available under the liberal MIT license.

Bibliography

Frantisek Franek, Christopher G. Jennings and W. F. Smyth, A simple fast hybrid pattern-matching algorithm, J. Discrete Algorithms 5: 682–695 (December 2007). DOI: 10.1016/j.jda.2006.11.004 (Preprint PDF) ↩︎ ↩︎² ↩︎³ ↩︎⁴
Donald E. Knuth, James H. Morris and Vaughan R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing 6(2):323–350 (1977). ↩︎ ↩︎²
Bill Smyth, Computing Patterns in Strings, Addison Wesley Pearson, 2003. ↩︎ ↩︎² ↩︎³
Robert S. Boyer and J. Strother Moore, A fast string searching algorithm, Communications of the ACM 20(10):762–772 (1977). ↩︎
Daniel M. Sunday, A very fast substring search algorithm, Communications of the ACM 33(8):132–142 (1990). ↩︎

Have a comment or correction? Let me know! I don’t use an automated comment system on this site, but I do appreciate feedback and I update pages accordingly.