Finding all occurrences of a substring using suffix arrays

3 minute read

Python has no built-in function to find all occurrences of a substring. Furthermore, custom functions that use re.finditer or find are quite slow. Hence, when speed is essential, search indices are the way to go.

In this blog post, I will compare some naive approaches I found on Stack Overflow with a suffix array.

Preparing the data

First, a text dataset is needed to which we can apply the algorithms:

wget http://archives.textfiles.com/stories.zip
unzip stories.zip

Next, let us parse the data.

import re
import os

with open(os.path.join("stories", "index.html"), "r") as f:
    text = f.read()

dataset = []
for filename, title in zip(re.findall(r"HREF=\"(.*?)[\"']", text), re.findall(r"<BR><TD>(.*)\n", text)):
    filename = os.path.join("stories", filename)
    if os.path.isfile(filename):
        with open(filename, "r", encoding="ISO-8859-1") as f:
            dataset.append((filename, title.strip(), f.read().lower()))

The variable dataset contains tuples of the form (filename, title, text).

Algorithms

I will be testing the following three algorithms:

  1. re.finditer
  2. find in a while loop
  3. binary search with a suffix array

The first two algorithms are from Stack Overflow. The third one is based on the paper Suffix arrays: A new method for on-line string searches and an implementation from Algorithmic-Alley.

The following function will measure the performance of the algorithms:

from timeit import default_timer as timer
import string
import numpy as np

def eval_algo(f, dataset, queries=list(string.ascii_lowercase) + list(string.digits) + ["test", "heeeeeellllooo"], iterations=5):
    runs = []
    for run in range(iterations):
        start = timer()
        for query in queries:
            for _, _, text in dataset:
                matches = f(query, text)
        runs.append(timer() - start)
    print(np.mean(runs))

re.finditer

The function re.finditer is not that slow. The following code takes on average 1.61 seconds.

def algo1(query, text):
    return [m.start() for m in re.finditer(query, text)]

eval_algo(algo1, dataset)

find

The function find inside a while loop needs about 2.77 seconds.

def algo2(query, text):
    start = 0
    output = []
    while True:
        start = text.find(query, start)
        if start == -1:
            break
        output.append(start)
        start += len(query)

    return output

eval_algo(algo2, dataset)

Binary search with a suffix array

This algorithm is a bit more complicated, we have to first construct a suffix array. Once created, the suffix array can be reused for other searches. If you don’t know how suffix arrays work, refer to Wikipedia and the paper from above.

The following code will create the suffix array. It is not the fastest construction algorithm, but it is good enough for this blog post.

from collections import defaultdict

def suffix_array(text):
    def sort_bucket(bucket, order=1):
        d = defaultdict(list) 
        for i in bucket:
            key = text[i:i+order]
            d[key].append(i)

        result = []
        for k, v in sorted(d.items()):
            if len(v) > 1:
                result += sort_bucket(v, order*2)
            else:
                result.append(v[0])
        return result

    return sort_bucket(range(len(text)))

start = timer()
dataset_ = []
for _, _, text in dataset:
    dataset_.append((text, suffix_array(text)))
print(timer() - start)

The creation of the suffix array took about 24.56 seconds.

Now, we can perform searches on suffix_array(text). The following code makes use of two binary searches to find matches inside a text:

import operator

def find_all_matches(suffix_arr, text, query):
    def binary_search(lo, hi, op):
        m = len(query)
        while lo < hi:
            mid = (lo + hi) // 2
            suffix = suffix_arr[mid]
            if op(text[suffix:suffix+m], query):
                lo = mid + 1
            else:
                hi = mid

        return lo

    n = len(suffix_arr)
    start = binary_search(0, n, operator.lt)
    end = binary_search(start, n, operator.eq)

    return suffix_arr[start:end]

runs = []
for run in range(5):
    start = timer()
    for query in list(string.ascii_lowercase) + list(string.digits) + ["test", "heeeeeellllooo"]:
        for text, suffix_arr in dataset_:
            matches = find_all_matches(suffix_arr, text, query)
    runs.append(timer() - start)
print(np.mean(runs))

The algorithm found matches in 0.41 seconds.

Summary

As expected, suffix arrays with binary search are faster than find and re.finditer. The search algorithm has a time complexity of $\mathcal{O}(m\log n)$ where $m$ is the length of the search query and $n$ is the length of the text. LCP arrays and enhanced suffix arrays can increase the speed even further.

Algorithm Speed
re.finditer 1.61
find 2.77
suffix array 0.41

Categories:

Updated:

Comments