**Reading time:** about 7 minutes

FJS^{[1]}
is a leading algorithm to quickly find all occurrences of a pattern string in a larger text.
This page gives an overview of how it works and includes a
visualization so you can watch it work.

Pattern matching with strings is a classic problem in computing. It has applications ranging from humble text editors to cutting-edge genome research in bioinformatics. Pattern matching algorithms also exhibit a curious property: approaches that work well for “typical” texts tend to perform badly on worst-case texts, and vice versa. FJS attempts to solve this problem by combining the features of two existing algorithms.

What do I mean? Let’s look at a “typical” example problem: find all matches of
**at** (the pattern) in **Cat bath mat** (the text):

You’d do OK even with a naïve algorithm that walked along
the text and checked, at each position (called an *alignment*),
if an occurrence of the pattern **at**
started there:

But now consider this problem: find all matches of the pattern
**aaab** in the text **aaaaaaaaaaaaaa**.

There are no matches, but the naïve algorithm can’t see that right away.
It will check, at each position in the text, if the pattern can be found
starting from that position. In each case, it won’t figure out that there is no
match until it checks the last pattern letter. Let’s call the pattern *p*
and its length *m*=|*p*|, and let’s call the text *t* and
its length *n*=|*t*|. Since at each position in *t* we
test all *m* letters in *p*, we see that we are doing on the order
of *m* × *n* comparisons. (A few fewer, since the pattern is
too long for a match to start at any of the last *m* - 1 positions in
*t*.)
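The naïve algorithm is short enough to write out in full. Here is one way it might look in Python (a sketch for illustration; the function name is mine):

```python
def naive_search(p, t):
    """Find all occurrences of pattern p in text t by brute force."""
    m, n = len(p), len(t)
    matches = []
    for i in range(n - m + 1):          # each alignment of p against t
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1                      # compare letter by letter
        if j == m:                      # all m letters matched
            matches.append(i)
    return matches
```

For example, `naive_search("at", "Cat bath mat")` finds matches at positions 1, 5 and 10, while `naive_search("aaab", "a" * 14)` grinds through nearly every letter of every alignment before returning nothing.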

The KMP^{[2]}
algorithm handles cases like
this very nicely. It never needs to do more than 2*n* - *m*
letter comparisons.^{[5]}
To achieve this, it needs an auxiliary array for *p* with entries
based on the following idea:
suppose we are looking for the pattern **onions** at a particular
position in the text and everything matched up until the **s**
(which didn’t match).
Consider only the part that did match, **onion**. Notice that
it both starts and ends with **on**. We say that **on**
is a *border* of the string **onion**. In fact, **on**
is the longest possible border of **onion**. How can we use this
information? Well, instead of moving to the next letter in *t* and
starting over like the naïve algorithm would, we can instead move down to the
**on** at the end of the **onion** sequence that we
know matched. Now since we know that the pattern starts with **on**
(because it is the longest border of the part that matched), we don’t test
for the **o** or the **n** and instead start by
checking if the **i** matches. You can prove that this is safe
and won’t miss any possible matches.^{[2, 5]}
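The border idea itself is easy to demonstrate with a tiny brute-force helper. (This is only for illustration; KMP computes borders far more cleverly, as described below.)

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix."""
    for k in range(len(s) - 1, 0, -1):   # try the longest candidate first
        if s[:k] == s[-k:]:
            return k
    return 0
```

Sure enough, `longest_border("onion")` is 2, corresponding to the border **on**.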

OK, so KMP lets us search any string in O(*n*) time
(that is, time proportional to *n*).
But it also takes time to make this auxiliary array. It turns out that if
you are clever you can do this in O(*m*) time, so the whole thing
now runs in O(*n* + *m*) time. That’s much better than the
O(*m* × *n*) time that the naïve algorithm offers…
or is it?
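To make this concrete, here is a sketch of the textbook KMP preprocessing and search. Note that the journal article uses a slightly modified border array (β′); this is the standard version, with identifiers of my own choosing:

```python
def border_array(p):
    """beta[j] = length of the longest border of p[:j]; beta[0] = -1 sentinel."""
    m = len(p)
    beta = [-1] * (m + 1)
    k = -1
    for j in range(m):
        while k >= 0 and p[j] != p[k]:
            k = beta[k]          # fall back to a shorter border
        k += 1
        beta[j + 1] = k
    return beta

def kmp_search(p, t):
    """Find all occurrences of p in t in O(n + m) time."""
    beta = border_array(p)
    matches, j = [], 0
    for i, c in enumerate(t):
        while j >= 0 and c != p[j]:
            j = beta[j]          # reuse the longest border of the matched part
        j += 1
        if j == len(p):          # full match ending at position i
            matches.append(i - len(p) + 1)
            j = beta[j]
    return matches
```

On `"onion"`, `border_array` ends with the entry 2 — the border **on** from the example above — and the search loop never re-examines a text letter it has already matched.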

Notice that KMP fixes “weird” worst cases with periodic strings like
**aaaaaaaaaa**. Yet typical text doesn’t look much like that,
and we have that extra O(*m*) preprocessing time, too.
The net effect is that KMP is slower than the naïve algorithm on typical texts!
There are, however, a number of algorithms that address this average case.
Of these, the fastest is generally considered to be the
BMS^{[3, 4]}
algorithm.

Unlike KMP, BMS matches the pattern from right to left starting from the
last letter. When it hits a mismatch, it jumps to the right. If the text letter
that didn’t match can be found in the pattern somewhere left of the mismatch point, then
that point in the pattern is aligned with the mismatch point. The real magic,
though, happens when the mismatched letter does *not* appear in the pattern.
Then the entire pattern is shifted to the point just after the mismatch, leaping
*m* letters down the text in a single step! On a typical text, BMS
will finish in a fraction of the time of KMP or the naïve algorithm.
The downside is that BMS always starts over at the right end of the pattern
each time the pattern moves down the text: in the worst case, it is as bad as
the naïve algorithm.
And like KMP, BMS needs an auxiliary array that tells it how far it can jump,
but its array needs an entry for every letter in the *alphabet*
used by the text and pattern (such as ASCII or Unicode).
So instead of running in O(*n* + *m*) time it runs in
O(*n* + *α*) time, where *α* is the
number of letters in the alphabet. Large alphabets, like Unicode,
threaten to overshadow the gains made during matching, but in practice there are
workarounds for this.^{[1]}
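In code, the BMS scheme might look like the following sketch. (This follows Sunday’s variant, which picks its jump from the letter just *past* the current window; the identifiers are mine.)

```python
def bms_search(p, t):
    """Boyer–Moore–Sunday sketch: scan right to left, jump using the
    letter just past the current window."""
    m, n = len(p), len(t)
    # delta[c]: shift that lines c up with its rightmost occurrence in p.
    delta = {c: m - k for k, c in enumerate(p)}   # rightmost occurrence wins
    matches = []
    i = 0
    while i + m <= n:
        j = m - 1
        while j >= 0 and t[i + j] == p[j]:        # right-to-left scan
            j -= 1
        if j < 0:
            matches.append(i)
        if i + m < n:
            # Jump by delta of the letter just past the window;
            # a letter absent from p allows the full m + 1 leap.
            i += delta.get(t[i + m], m + 1)
        else:
            break
    return matches
```

Searching for `"aaab"` in `"a" * 14` now takes only a handful of comparisons: every window fails on its last letter and the pattern leaps two positions at a time.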

Finally we come to FJS. The goal is to create a hybrid of KMP and BMS that combines their best properties: KMP’s worst-case performance and BMS’s average-case performance. This is not as simple as it may sound: keep in mind that they don’t even run in the same direction when checking for pattern matches! And you can’t just switch one of them around: their scan directions are critical to how they work.

To do it we need to recognize two key requirements. First, the BMS part can’t
be allowed to restart from the right end of the pattern after any arbitrary
mismatch, or you will lose your worst case guarantee. (That means we need to
be able to use KMP’s rules “nearly always.”) Second, the majority of the
benefit of BMS in typical texts comes from the pattern *not* matching on the
first test at a given position and then making a jump.
So, *when we can* we will start any new alignment position of the pattern
to the text by testing the right end of the pattern against the corresponding
text position. If it *doesn’t* match, we use a BMS-style jump. If it *does*
match, then we continue matching the rest of the pattern from the left end
using KMP. The *when we can* condition means that once we start
doing KMP matching, we need to keep using it as long as there is a border for
KMP to use at the next pattern position. But whenever KMP has to start over from
the left end of the pattern, we get to sneak in another BMS-style test at the
other end first.
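Putting the pieces together, here is one way the hybrid strategy just described might be sketched. This follows the prose above rather than the tuned implementation in the paper (which uses the modified border array β′ and a more careful inner loop); all identifiers are mine:

```python
def fjs_search(p, t):
    """Sketch of the FJS strategy: a BMS-style test and jump whenever a
    fresh alignment begins, KMP-style matching the rest of the time."""
    m, n = len(p), len(t)
    matches = []
    if m == 0 or n < m:
        return matches

    # KMP preprocessing: border array (beta[0] = -1 sentinel).
    beta = [-1] * (m + 1)
    k = -1
    for j in range(m):
        while k >= 0 and p[j] != p[k]:
            k = beta[k]
        k += 1
        beta[j + 1] = k

    # BMS preprocessing: shift table (rightmost occurrence wins).
    delta = {c: m - k for k, c in enumerate(p)}

    i = j = 0          # i: alignment of p in t; j: letters known to match
    while i + m <= n:
        if j == 0:
            # Fresh alignment: test the *last* pattern letter first.
            if t[i + m - 1] != p[m - 1]:
                nxt = t[i + m] if i + m < n else None
                i += delta.get(nxt, m + 1)       # BMS-style jump
                continue
            # Last letter matched; check the rest from the left end.
            while j < m - 1 and t[i + j] == p[j]:
                j += 1
            if j == m - 1:
                matches.append(i)
                j = m                            # full match
        else:
            # KMP mode: extend the prefix of length j already matched.
            while j < m and t[i + j] == p[j]:
                j += 1
            if j == m:
                matches.append(i)
        # Shift using the longest border of the matched part.
        b = beta[j]
        i += j - b if j > 0 else 1
        j = max(b, 0)
    return matches
```

On `"aaab"` versus `"a" * 14` this sketch jumps along like BMS, while on periodic inputs such as `"aa"` in `"aaaa"` the border shifts keep it from re-testing letters, KMP-style.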

Since the BMS test is only performed on the last pattern letter,
never in a loop over the pattern, it can’t add more than
*n* - *m* + 1 letter comparisons beyond what KMP alone
would have done, so FJS is still O(*n*) in the worst case.
But remember, it uses the preprocessing arrays of both algorithms, making it
run in O(*n* + *m* + *α*)
time overall.

That said, it turns out that FJS is actually about 5-10% *faster*
than BMS on average.^{[1]}
This seems to come down to two factors: First, the loop for doing BMS skipping
is tighter in FJS than in BMS because it only ever scans the final pattern letter.
Second, the fact that the first pattern letter to be checked
and the second pattern letter to be checked are further apart may make
early detection of a mismatch more likely. For example, **-ing**
is a common suffix in English. If a pattern ends in **-ing** then
BMS needs at least 3 letter comparisons every time it is aligned with
an **-ing** word in the text. In contrast, FJS would often
detect a mismatch after only 2 letter comparisons.

Using the controls below, you can get a good sense of how the algorithm works by watching it run in slow motion on any strings you like. Two things to watch for: big jumps thanks to BMS when the last pattern letter doesn’t match, and pale green letters at the start of the pattern that indicate the parts that KMP knows will match without any testing.


This article intentionally glosses over fine details.
For those with a computing science background who want more, this section
summarizes some of the key properties of FJS. Proofs and further details can be
found in the primary literature.^{[1]}

FJS uses auxiliary arrays from the KMP and BMS algorithms.
KMP’s modified border array (called β′ in the journal article and code)
requires *m* + 1 words while BMS’s skip array
(called Δ) requires *α* words.

In the best case, FJS performs ⌊*n* / (*m*+1)⌋ letter comparisons.
This occurs when, at each possible alignment, *p*[*m*] does not match
the letter it is aligned with in *t* and the following letter (in *t*)
does not occur in *p*. In the worst case, FJS performs 3*n* - 2*m*
letter comparisons. The maximum, 3*n* - 6, is triggered by strings of the form
*t* = a^{n} and *p* = aba.
In the average case, empirical testing suggests that FJS is currently the
fastest exact pattern matching algorithm under most practical circumstances.

As with all algorithms derived from Boyer-Moore, FJS requires an indexed
alphabet.^{[5]} For practical purposes, this means
that the alphabet must be of finite size, it must have letters encodable
as a series of integers, and it must be possible to determine the
code for any letter in O(1) time.

Empirical evidence suggests that FJS is usually not the fastest algorithm for
small alphabets (*α* < 20), although it remains competitive.
For very large alphabets, preprocessing time may dominate unless *n* is large. An example where this may
occur is 16-bit Unicode characters. One strategy for dealing with this in
a language like C is to cast the strings to 8-bit arrays and ignore any matches
at odd offsets. Another strategy is to create a Δ table with 256 entries and mask away
the high byte of each character, effectively creating a kind of hash table.
(Table entries should use the smallest jump of
any character colliding with the hash value.) This may result in extra
letter comparisons in practice, but does not affect correctness or worst case
performance.
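The hashed Δ strategy can be sketched as follows (Python for illustration, although the published code is C; the function name and default table size are my own):

```python
def hashed_delta(p, table_size=256):
    """Build a BMS-style shift table hashed into table_size buckets.
    Colliding letters keep the smallest (safest) jump."""
    m = len(p)
    delta = [m + 1] * table_size             # default: letter not in p
    for k, c in enumerate(p):
        h = ord(c) % table_size              # mask away the high bits
        delta[h] = min(delta[h], m - k)      # a collision never overshoots
    return delta
```

Because a collision can only shorten a jump, never lengthen it, no match can be skipped; the cost is the occasional jump that is smaller than strictly necessary.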

Code is available on GitHub under the liberal MIT license.

1. Frantisek Franek, Christopher G. Jennings and W. F. Smyth,
   **A simple fast hybrid pattern-matching algorithm**, J. Discrete Algorithms 5:682–695 (December 2007). DOI: 10.1016/j.jda.2006.11.004
2. Donald E. Knuth, James H. Morris and Vaughan R. Pratt,
   **Fast pattern matching in strings**, SIAM Journal on Computing 6(2):323–350 (1977).
3. Robert S. Boyer and J. Strother Moore,
   **A fast string searching algorithm**, Communications of the ACM 20(10):762–772 (1977).
4. Daniel M. Sunday,
   **A very fast substring search algorithm**, Communications of the ACM 33(8):132–142 (1990).
5. Bill Smyth,
   **Computing Patterns in Strings**, Addison Wesley Pearson, 2003.