Almost Equal Strings // My problem in Code.cpp contest

I created the most difficult problem for May edition of Code.cpp on HackerRank. The full problem is available here. The problem got great attention, because many contestants were surprised how it is possible to "compare" two strings that fast.

The statement is quite straightforward:

We define two strings $S$ and $W$ as almost equal if and only if they have the same length and they differ on at most one position.

You are given a string $S$ of length $N$ and an integer $Q$. You have to perform $Q$ queries on $S$. Each query consists of four integers, $i$, $j$, $k$, and $l$. For each such query, your task is to decide if the substrings $S[i, j]$ and $S[k, l]$ are almost equal.

While the problem is formulated very simple, the solution is not obvious, especially because there are many queries to handle - up to $10^5$ and the input string can have at most $10^6$ letters. I will sketch two completely different approaches here.

LCP Method

First of all, we construct suffix array for string $S$. While $O(n \cdot \log n)$ algorithm is fine here, we can use a simple and famous linear one by Kärkkäinen and Sanders known as Skew algorithm

After that, we compute longest common prefix array abbreviated as $LCP$ array, where $LCP[i]$ is the longest common prefix of suffixes $i$ and $i-1$ in the suffix array.

Using our $LCP$ array and some range queries, we are able to compute the length of the longest common prefix for arbitrary suffixes of $S$, the sketch of this method is available here.

Being able to do that, let's consider any query $i, j, k, l$ and two substrings $S[i, j]$ and $S[k, l]$. Let assume that they have equal length $K$, because the other case is trivial. In order to answer, if these substrings are almost equal, first we compute the length of $LCP$ of suffixes of $S$ starting in $i$ and $k$. Let this length to be $L$. If $L \geq K$, then we know that our two substrings are exactly equal. In the other case, we know that they differ on position $L + 1$, so we compute the length of the $LCP$ of suffixes starting just after this different character and the answer to the query is positive if these suffixes have equal at least $K - L - 1$ characters, otherwise it is negative.

Time Complexity

If we use Skew algorithm and the fastest method of computing $LCP$ of arbitrary strings, we can achieve $O(n + q)$, but any algorithm running in $O(n \cdot \log n + q \cdot \log q)$ is also fine.

Hashing method

The idea is to provide a method for deciding if $S[i, j] = S[k, l]$ in constant time for any $i \leq j$ and $k \leq l$

For now, let's assume that we can do that.

Then in order to decide if $S[i, j]$ is almost equal to $S[k, l]$, first we check if they have equal length. If they have, then we check if $S[i, j] = S[k, l]$ if it is, they are exact the same strings. If not, we find the first letter on which they are different. We can do this using binary search and our method for equality of substrings on prefixes of both strings. After we find this first letter, let say it has index $x$ inside these strings, we just need to compare $S[i + x + 1, j]$ with $S[k + x + 1, l]$ and we assumed that we can do this in $O(1)$ time.

In order to do our magic comparison in constant time, first we define

$H[i] = s[0] \cdot P^{i - 1} + s[1] \cdot P^{i - 2} + ... + s[i - 1] \cdot P^1 + s[i]$

which is called polynomial hash of a string and we can compute $H[i]$ for all $0 \leq i \leq n - 1$ in $O(n)$ using Horner's method i.e:

$H[0] = s[0]$

$H[i] = P \cdot H[i - 1] + s[i]$ for $i > 0$

All these computations can be done modulo some big prime, but it is faster to use unsigned long long type here, especially for c++ which is nice in this contest, and we can avoid modulo computation.

The next step is to be able to check if $S[i, j] = S[k, l]$ and we can do this by multiplications and subtractions of some values from $H$ table, which can be done in constant time.

Of course this method is a probabilistic one and it can return false results, but we can either assume that it works for one $H$ function, or compute more than one hashing function to increase its probability. It is worth to mention that using two functions rather that one squares the probability of an error. An example of using two functions can be computing both as $H$ function given above, but for each one computing it modulo a different big prime number.

Time Complexity

Computing $H$ table takes $O(n)$ and binary search for one query takes $O(\log n)$, so the overall complexity is $O(n + q \cdot \log n)$.