Almost Equal Strings // My problem in Code.cpp contest

I created the most difficult problem for May edition of Code.cpp on HackerRank. The full problem is available here. The problem got great attention, because many contestants were surprised how it is possible to "compare" two strings that fast.

The statement is quite straightforward:

We define two strings S and W as almost equal if and only if they have the same length and they differ on at most one position.

You are given a string S of length N and an integer Q. You have to perform Q queries on S. Each query consists of four integers, i, j, k, and l. For each such query, your task is to decide if the substrings S[i, j] and S[k, l] are almost equal.

While the problem is formulated very simple, the solution is not obvious, especially because there are many queries to handle - up to 10^5 and the input string can have at most 10^6 letters. I will sketch two completely different approaches here.

LCP Method

First of all, we construct suffix array for string S. While O(n \cdot \log n) algorithm is fine here, we can use a simple and famous linear one by Kärkkäinen and Sanders known as Skew algorithm

After that, we compute longest common prefix array abbreviated as LCP array, where LCP[i] is the longest common prefix of suffixes i and i-1 in the suffix array.

Using our LCP array and some range queries, we are able to compute the length of the longest common prefix for arbitrary suffixes of S, the sketch of this method is available here.

Being able to do that, let's consider any query i, j, k, l and two substrings S[i, j] and S[k, l]. Let assume that they have equal length K, because the other case is trivial. In order to answer, if these substrings are almost equal, first we compute the length of LCP of suffixes of S starting in i and k. Let this length to be L. If L \geq K, then we know that our two substrings are exactly equal. In the other case, we know that they differ on position L + 1, so we compute the length of the LCP of suffixes starting just after this different character and the answer to the query is positive if these suffixes have equal at least K - L - 1 characters, otherwise it is negative.

Time Complexity

If we use Skew algorithm and the fastest method of computing LCP of arbitrary strings, we can achieve O(n + q), but any algorithm running in O(n \cdot \log n + q \cdot \log q) is also fine.

Hashing method

The idea is to provide a method for deciding if S[i, j] = S[k, l] in constant time for any i \leq j and k \leq l

For now, let's assume that we can do that.

Then in order to decide if S[i, j] is almost equal to S[k, l], first we check if they have equal length. If they have, then we check if S[i, j] = S[k, l] if it is, they are exact the same strings. If not, we find the first letter on which they are different. We can do this using binary search and our method for equality of substrings on prefixes of both strings. After we find this first letter, let say it has index x inside these strings, we just need to compare S[i + x + 1, j] with S[k + x + 1, l] and we assumed that we can do this in O(1) time.

In order to do our magic comparison in constant time, first we define

H[i] = s[0] \cdot P^{i - 1} + s[1] \cdot P^{i - 2} + ... + s[i - 1] \cdot P^1 + s[i]

which is called polynomial hash of a string and we can compute H[i] for all 0 \leq i \leq n - 1 in O(n) using Horner's method i.e:

H[0] = s[0]

H[i] = P \cdot H[i - 1] + s[i] for i > 0

All these computations can be done modulo some big prime, but it is faster to use unsigned long long type here, especially for c++ which is nice in this contest, and we can avoid modulo computation.

The next step is to be able to check if S[i, j] = S[k, l] and we can do this by multiplications and subtractions of some values from H table, which can be done in constant time.

Of course this method is a probabilistic one and it can return false results, but we can either assume that it works for one H function, or compute more than one hashing function to increase its probability. It is worth to mention that using two functions rather that one squares the probability of an error. An example of using two functions can be computing both as H function given above, but for each one computing it modulo a different big prime number.

Time Complexity

Computing H table takes O(n) and binary search for one query takes O(\log n), so the overall complexity is O(n + q \cdot \log n).

  • Siva Krishna

    can you please elaborate how we will find the first letter on which they are different?.

    • http://chasethered.com pkacprzak

      Sure, in which method? The hashing one?

      • Siva Krishna

        yes. Hashing one.

        • http://chasethered.com pkacprzak

          I wrote that you can use binary search for that. You pick the middle letter of the current search range and check if the prefixes in compared strings till this middle letter are the same and proceed further computation based on the result of this comparison.

  • Chris Henry

    I put a comment in the problem discussion, also.

    I'm curious as to why all the large test cases contained many duplicate queries? I was upset that I lucked out on a working solution only because I stored and matched the outputs to all already computed queries. Letting me brute force the string comparisons where I shouldn't be able to and still passing all cases with 0.85 sec left to spare.

    I wish at least one or two had all unique queries.

    • http://chasethered.com pkacprzak

      It is great that you pointed it out.

      I checked the test generator and I found out that it has a bug while generating long queries i.e. rather than generate new queries, it puts the same long query every time and since in big tests every even indexed query is a long query and odd indexed is a random one, it basically has Q / 2 random queries for which the longest common prefix is short in over 90% and just one long query with a great possibility to have a match. Please let me know if you want to see fixed testcases, then I will ask HackerRank admins if I can update them.

      Thanks a lot again and cheers!

      • Chris Henry

        If that is something that won't cause you much trouble then, sure.

        Otherwise, it isn't a big deal, I just thought you should be aware for the future. It's not a test I should have passed with the program I submitted.

        So, don't reply back and say "yes" just because you feel obligated, now.

        Thanks for responding and thanks for the problem. These were my first few problems on Hackerrank and yours made me want to continue with it.

        • http://chasethered.com pkacprzak

          Did you solve the problem during the contest for 100 score or after it finished?

          It is great to hear that, keep up your great work. I am preparing now the hardest problem for upcoming Weekly of Code on HackerRank which starts in about a month. It will be a graph problem, definitely the best which I created so far I think. Stay tuned!

          • Chris Henry

            Not till after it was over. I just learned of the site and had plans already for most of the contest duration. I've never done anything like contest challenges before, so I was a little slow on the uptake, also.

            I just went through the remaining problems without "peaking." Points don't matter in the least if that is what your asking.

            I'll be sure to check out your new problem in a month.