Algorithm FJS

A Fast Linear Time Exact Pattern Matching Algorithm

Send Feedback      Home Page > Algorithm FJS

Introduction

This page features a brief description and animated visualization applet of an exact pattern matching algorithm called FJS [3], which I co-discovered with Franya Franek and W. F. Smyth. Exact pattern matching algorithms find all instances of one string, called the pattern, in another string, called the text. Algorithm FJS, which represents an improvement over an earlier version first described in my MSc thesis [5], combines the best properties of the Knuth-Morris-Pratt (KMP) [6], Sunday (BMS) [8] and (indirectly) Boyer-Moore (BM) [1] algorithms. In emprical experiments, FJS consistently outperformed several leading algorithms for alphabet sizes of about 20 letters and up, and remained competitive for smaller alphabets [3]. The following table summarizes the key properties of Algorithm FJS when searching for all occurrences of a pattern p[1..m] in a text x[1..n], both drawn from an alphabet of α symbols:

Summary of Key Properties

Note that the asymptotic complexities listed below are for finding all instances of a pattern string in a text; many presentations of pattern matching algorithms instead describe the complexity of finding only the first instance. 

Preprocessing Time
Θ(m+α) The algorithm uses preprocessing arrays from the KMP and BMS algorithms.  In the applet we call these β′ and Δ, respectively, following the notation of our journal article [3]. (β′ is the modified border array used by KMP, while Δ is the shift array used by BMS.)
Preprocessing Space
Θ(m+α) The extra space required for preprocessing is that needed for the two arrays β′ (m+1 words) and Δ (α words).
Searching Time
Ω(n/m) In the best case, FJS performs n ÷ (m+1) letter comparisons.  This occurs when, at the start of each possible match (alignment):
  •  p[m] does not match its aligned letter in x, and
  • the following letter (in x) does not occur in p.
O(n+m) In the worst case, FJS performs at most 3n-2m letter comparisons.  Strings of the form x = an, p = aba require the greatest possible number of comparisons (3n-6).
Average Empirical testing described in [3] suggests that FJS is currently the fastest exact pattern matching algorithm under nearly all practical circumstances.  For example, it is typically 510% faster than BMS even while providing a linear worst case time guarantee.
Behaviour
Outline

The basic procedure for exact pattern matching is as follows: The pattern is initially aligned against the text so that p[1] is in line with x[1].  For each alignment, the pattern is tested against the text to determine if a match occurs at that point.  The pattern is then shifted one or more positions towards the end of the text to a new alignment.  This test-and-realign process is repeated until p[m] "slides off the end" of the text, at which point no further matches are possible.

For Algorithm FJS, a position  j in 1..m-1 is maintained; initially j=1. The algorithm proceeds as in KMP, except that at the start of each alignment p[m] is tested first if j=1. If this test fails, a BMS-style shift is performed using Δ to obtain a new potential alignment. This may be repeated any number of times until the end of the string is reached or an alignment is found where p[m] matches. Subsequently, KMP matching tests either p[j..m-1] or p[j..m], depending on whether BMS-style shifting was performed. After this attempt is made, and a match signalled if appropriate, β′ is used to compute a new j as per the KMP algorithm. If β′ returns the special value 0, the pattern is realigned to the next position down in x and j will be 1 (inducing the BMS-shift skip loop in the next iteration).  This process repeats until the end of the string is reached.

Limitations
Alphabet
Restriction
As with all BM-derived algorithms, FJS requires an indexed alphabet [7].  For practical purposes:
  • the alphabet must be of finite size
  • it must be possible to encode the letters of the alphabet as a sequence of integers 1, 2, 3, ..., α.
  • it must be possible to determine the code for any letter in constant time

ASCII and Unicode are indexed alphabets; a string consisting of pointers to data structures generally is not, since we probably don't know the set of all possible pointers that will occur ahead of time.  Although it is possible to convert any string into an equivalent string that uses an indexed alphabet, this is not usually a practical solution.

Alphabet
Size
Empirical evidence suggests that the algorithm is not the fastest for small alphabets, namely those with fewer than about 20 letters, although it remains competitive with other leading algorithms.

Demonstration Applet

Clicking on the button below will open a Java applet window with an animated visualization of Algorithm FJS.  You may enter texts and patterns of up to 60 letters each and see how FJS operates on them. Black letters in the text and pattern alignments indicate letters which have been compared for equality, while gray letters have not been tested. Blue letters in the text (top line of letters) indicate complete matches.  In the pattern alignments, green letters indicate tested letters that match the text, while crossed-out red letters indicate tested letters that don't match the text. You don't have to wait for matching to finish before changing the text, pattern, or animation speed parameters.

The applet requires that your browser support at least the Java 2 runtime.  If the button above doesn't work properly, this is almost certainly the problem. You can download the latest Java runtime from Sun to correct this.  If you are unable to view the applet, you can instead view a screen shot of a run on its default input.

Implementations

fjs-code-13.zipversion 1.3 (May 12, 2007), 18 kB

This archive contains simple implementations of FJS in C and Java as well as the source code for the demonstration applet.

More Information

Smyth's Computing Patterns in Strings [7] provides an extensive background on the subject of pattern matching in strings. Charras and Lecroq's Web site [2] provides summaries of the theoretical and practical aspects of a large number of exact pattern matching algorithms, and includes sample C implementations and applets.  For information on FJS in particular, see the paper by Franek, Jennings, and Smyth [3].

Note that an earlier presentation of FJS [4] contains an error which affects theoretical worst-case performance but not correctness.  This version fails to check whether KMP matching has made j>1 before attempting the BMS skip loop.  In certain pathological cases, this results in O(mn) character comparisons being performed.

Return to Home Page    Send Feedback

Bibliography

[1] Robert S. Boyer and J. Strother Moore, A fast string searching algorithm, Communications of the ACM 20(10):762–772 (1977).

[2] Christian Charras and Thierry Lecroq, Exact String Matching Algorithms, Laboratoire d'Informatique, Université de Rouen (1997).

[3] Frantisek Franek, Christopher G. Jennings and W. F. Smyth, A simple fast hybrid pattern-matching algorithm, J. Discrete Algorithms 5: 682–695 (2007). DOI: 10.1016/j.jda.2006.11.004 (Preprint PDF, 498 kB)

[4] Frantisek Franek, Christopher G. Jennings, and W. F. Smyth, A Simple Fast Hybrid Pattern-Matching Algorithm (preliminary version), Proc. 16th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 3537, Springer-Verlag (2005).

[5] Christopher G. Jennings, A Linear-Time Algorithm for Fast Exact Pattern Matching in Strings, MSc thesis, McMaster University (2002) 97 pp.

[6] Donald E. Knuth, James H. Morris and Vaughan R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing 6(2):323–350 (1977).

[7] Bill Smyth, Computing Patterns in Strings, Pearson Addison-Wesley (2003) 423 pp.

[8] Daniel M. Sunday, A very fast substring search algorithm, Communications of the ACM 33(8):132–142 (1990).

August 23, 2004 — Updated January 01, 2011