Skip to main content

An Optimized Pattern Matching Algorithm based on Suffix Arrays with LCP-Guided Binary Search: A Time-Efficient Approach for Large-Scale Text Processing

Research Abstract

Pattern matching is a fundamental operation in
computational biology, information retrieval, and text processing.
Despite advances in algorithms, efficiently searching large-scale
texts, particularly genomic sequences spanning billions of base
pairs, remains a significant challenge. Existing approaches face
substantial limitations including Boyer-Moore algorithm’s poor
worst-case performance of O(mn) and limited scalability on large
datasets, KMP algorithm’s high memory overhead and poor
cache locality despite O(m+n) complexity, traditional suffix array
methods’ expensive O(mlogn) search time with costly preprocessing,
and hash-based approaches like Rabin-Karp’s vulnerability
to hash collisions and poor performance on repetitive sequences
common in genomic data.
This paper presents LCP-Optimized Suffix Array (LOSA),
a high-performance pattern matching algorithm that combines
suffix arrays with Longest Common Prefix (LCP) array optimizations
and parallel processing to achieve faster search times
with scalable memory usage. LOSA reduces the worst-case search
complexity from O(m log n) to nearly O(m + log n), where
m length pattern and n length the text. This improvement
is achieved through LCP-aware skips to minimize redundant
comparisons, as well as parallelized construction and query
phases, for modern multicore systems.

Research Authors
Mohmoud K Diab, Taysir Hassan A Soliman, Amr M Mohamed, Ibrahim E Elsemman
Research Date
Research Department
Research Journal
IEEE Xplore
Research Publisher
IEEE
Research Year
2025