My research on fast multi-pattern search algorithms


I was lucky to have a chance to research a way to improve my company text parser. The key to efficient parsing is lazy parsing that parses only what we need, ignore the rest, which essentially a string matching problem. So I did research on multi-pattern search algorithms. Here are summary of the research:
Single pattern matching algorithms
I include single pattern matching algorithms as a basis for understanding multi-pattern matching algorithms.

  • Boyer-Moore (BM) uses bad character heuristic. It is the foundation of the search algorithms. It is simple, easy to implement, yet effective and fast. To learn about search algorithm, you must understand BM. Since Boyer-Moore-Horspool (BMH) is simpler than BM, I recommend to learn about BMH first.
    It is used in Snort.
     Learn more from wikipedia
  • Boyer-Moore-Horspool (BMH): It is a simplification of the Boyer-Moore algorithm.  It is amazingly simple.
    Learn more from wikipedia
  • Markatos-Anagnostakis ExB:  exclusion based string matching algorithm. It has 2 assumptions:
    1/ Most traffic will not trigger the patterns.
    2/ The text is not too big b/c the chance the pattern appears will increase as the text gets bigger. 
    This algorithm can also be used in multi-pattern search by repeating single pattern search.
  • Markatos-Anagnostakis E2xB: an enhancement to previous ExB. It uses interger for variable exists[256].
Multi-pattern matching algorithm
  • Aho-Corasick algorithm used automata approach. The automaton consumes one character from the input text each time. The character will drive the automaton to a new state which represents all the partially matched patterns.
    It is used in UNIX fgrep & egrep (with -f option) and compilers.
  • Commentz-Walter  combined Boyer-Moore technique with the Aho-Corasick algorithm.
    It is implemented in gre and GNU fgrep 2.0.
  • Baeza-Yates combined Boyer-Moore-Horspool algorithm with the Aho-Corasick algorithm.
  • Manber-Wu uses the bad character heuristic of Boyer-Moore algorithm. By noticing that the large number of patterns will decrease the chance to get bigger movement, this algorithm use a block of characters to find a movement offset.
  • Kim-Kim: compact encoding and hash to decrease the number of potential patterns.
  • Fisk-Varghese proposed Setwise Boyer-Moore-Horspool. The idea is using bad character heuristic and suffix trie.
  • Coit-Staniford-McAlerney proposed AC-BM, almost the same as Setwise Boyer-Moore-Horspool. using bad character and good prefix heuristic  plus prefix trie.
Advertisements
My research on fast multi-pattern search algorithms

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s