Tag Archives: string

Almost Equal Strings // My problem in Code.cpp contest

I created the most difficult problem for May edition of Code.cpp on HackerRank. The full problem is available here. The problem got great attention, because many contestants were surprised how it is possible to "compare" two strings that fast.

The statement is quite straightforward:

We define two strings S and W as almost equal if and only if they have the same length and they differ on at most one position.

You are given a string S of length N and an integer Q. You have to perform Q queries on S. Each query consists of four integers, i, j, k, and l. For each such query, your task is to decide if the substrings S[i, j] and S[k, l] are almost equal.

While the problem is formulated very simple, the solution is not obvious, especially because there are many queries to handle - up to 10^5 and the input string can have at most 10^6 letters. I will sketch two completely different approaches here.

LCP Method

First of all, we construct suffix array for string S. While O(n \cdot \log n) algorithm is fine here, we can use a simple and famous linear one by Kärkkäinen and Sanders known as Skew algorithm

After that, we compute longest common prefix array abbreviated as LCP array, where LCP[i] is the longest common prefix of suffixes i and i-1 in the suffix array.

Using our LCP array and some range queries, we are able to compute the length of the longest common prefix for arbitrary suffixes of S, the sketch of this method is available here.

Being able to do that, let's consider any query i, j, k, l and two substrings S[i, j] and S[k, l]. Let assume that they have equal length K, because the other case is trivial. In order to answer, if these substrings are almost equal, first we compute the length of LCP of suffixes of S starting in i and k. Let this length to be L. If L \geq K, then we know that our two substrings are exactly equal. In the other case, we know that they differ on position L + 1, so we compute the length of the LCP of suffixes starting just after this different character and the answer to the query is positive if these suffixes have equal at least K - L - 1 characters, otherwise it is negative.

Time Complexity

If we use Skew algorithm and the fastest method of computing LCP of arbitrary strings, we can achieve O(n + q), but any algorithm running in O(n \cdot \log n + q \cdot \log q) is also fine.

Hashing method

The idea is to provide a method for deciding if S[i, j] = S[k, l] in constant time for any i \leq j and k \leq l

For now, let's assume that we can do that.

Then in order to decide if S[i, j] is almost equal to S[k, l], first we check if they have equal length. If they have, then we check if S[i, j] = S[k, l] if it is, they are exact the same strings. If not, we find the first letter on which they are different. We can do this using binary search and our method for equality of substrings on prefixes of both strings. After we find this first letter, let say it has index x inside these strings, we just need to compare S[i + x + 1, j] with S[k + x + 1, l] and we assumed that we can do this in O(1) time.

In order to do our magic comparison in constant time, first we define

H[i] = s[0] \cdot P^{i - 1} + s[1] \cdot P^{i - 2} + ... + s[i - 1] \cdot P^1 + s[i]

which is called polynomial hash of a string and we can compute H[i] for all 0 \leq i \leq n - 1 in O(n) using Horner's method i.e:

H[0] = s[0]

H[i] = P \cdot H[i - 1] + s[i] for i > 0

All these computations can be done modulo some big prime, but it is faster to use unsigned long long type here, especially for c++ which is nice in this contest, and we can avoid modulo computation.

The next step is to be able to check if S[i, j] = S[k, l] and we can do this by multiplications and subtractions of some values from H table, which can be done in constant time.

Of course this method is a probabilistic one and it can return false results, but we can either assume that it works for one H function, or compute more than one hashing function to increase its probability. It is worth to mention that using two functions rather that one squares the probability of an error. An example of using two functions can be computing both as H function given above, but for each one computing it modulo a different big prime number.

Time Complexity

Computing H table takes O(n) and binary search for one query takes O(\log n), so the overall complexity is O(n + q \cdot \log n).

Codechef Long // July Challenge - GERALD09 the tiebreaker problem

The June challenge is over. I am very happy, because I finished 19th overall, which is my personal best:

July_challenge_final

While editorials for all problems are available on Codechef forum, almost every solution to the tie breaker problem is unique, so I want to share mine, because it is completely different from the one posted in the official editorial.

Problem statement

The full problem statement is here. You are given a matrix T with n \leq 15 rows and m \leq 15 colums. In addition, you are given an integer k, where 1 \leq k \leq n^2 \cdot m^2. Your task is to fill the matrix with letters from 4-letters alphabet \{A, C, G, T\} in such a way, that the number of different submatrices of matrix T is as close to k as possible - the less the difference is the greater score you have.

Solution

I used the concept of 2 dimensional variant of polynomial hashing in order to count the number of different submatrices in time O(n^2 \cdot m^2). The concept allows to compare any two submatrices in O(1) time.

Lets make some observations:

  1. It is impossible to create a matrix with less than n \cdot m different submatrices, because every 2 submatrices of different size are different. It follows that,if k \leq n \cdot m, the best you can do is to fill the matrix with only one letter, lets cay A.
  2. It is impossible to create a matrix with more than M = n \cdot (n + 1) / 2 \cdot m \cdot (m + 1) / 2 different submatrices, because this is the number of all possible submatrices. It follows that if k \geq M, the best you can do is to return a matrix with n rows and m colums with the maximal number of different submatrices.
    My approach is to precompute for every of 225 pairs (n, m) the matrix with the maximum number of different submatrices. I did that by repeatedly exchanging some random letter in a matrix and if the resulting matrix has a greater number of different submatrices than it had before, then it becomes the resulting matrix. Of course it might not return the best matrix, but if you repeat the exchange many times it should be at least quite close to the optimal matrix. It is worth to mention, that such a a process can stuck in a local maximum, so it is better to choose the result from many different runs.
  3. Since the maximum allowed source file limit on Codechef is 50kb, you can store a lot precomputed information in it. Let R := \{s : n \cdot m < s < M\}. Based on two above cases, we know that k \in R. For every pair of (n, m), I assigned to mem[n][m][s] a matrix of size n \times m with exactly s different submatrices, where s is equally distributed over R. For example, let n, m = 10, then 100 < s < 3025 and we chose to memorize matrices which differ in number of different submatrices by 500, then we compute mem[10][10][s] for s \in \{600, 1100, 1600, 2100, 2600\}. It helps us to chose the initial matrix for the beginning of the on-line computation. For example, if n, m = 10, k = 1220 and we computed mem[10][10] as above, we select mem[10][10][1100] as the initial matrix. I computed mem using the same random method as in Step 2.
  4. Lets fill the matrix with the letter A only and consider, how many different submatrices we can generate by placing letter C in the i-th row and j-th column. If we do that, then every submatrix with the upper left corner in (i_1, j_1) where i_1 \leq i and j_2 \leq j and the lower right corner in (i_2, j_2), where i_2 \geq i and j_2 \geq j is unique, because any other matrix of the same size has either no letter C or has letter C at different relative position. It follows that using that method, we can create a matrix with up to about n^2 / 4 \cdot m^2 / 4 + n \cdot m different submatrices. The natural extension is to insert letters G and T at some positions different than (i, j) in our matrix, which helps us to create a matrix with up to about 3/16 \cdot n^2 \cdot m^2 + n \cdot m different submatrices.

We are ready now for the on-line computation. We choose as the initial matrix the one, which can be generated in any of the above steps with the closest number of different submatrices to k.

My online computation consists of only one technique. For any pair of (n, m), I set the number of computation steps to c \cdot 225 / (n^2 \cdot m^2) for the largest possible c which doesn't exceed the time limit, in which I try to exchange some random elements in the matrix and hope to generate a better one.

At the beginning, I count the number of different submatrices in my initial matrix. Then, for every computations step, I exchange exactly \sqrt{n \cdot m} random elements, again count the number of different submatrices and if it generated a better matrix, I store it as a current matrix.

Conclusion

While the obvious method is to try to exchange on-line some elements in order to achieve a better solution, the crucial step is to set up the initial matrix in the best possible way. I am pretty satisfied with my solution. It results in 20th best score in that problem and gave me 19th place in the contest.