Eureka!!: Suffix Array

Suffix Array

I usually don't like to give definitions ( coz i dont remember it :P )
Today i am going to write about a very important data structure called suffix array.

you must have come across problems like finding the
-longest common substring,
-minimum lexicographic rotation of a string,
-finding the longest palindrome, (this can be done in better time by another algorithm)
-searching for a smaller string in a bigger string. ( this can be done in O(m+n) using kmp. however we can do it in O(m log n ) using suffix array. it is useful if we have to search many small strings in the large string. Then it will outspeed KMP)
-

These types of problems can be efficiently solved by a suffix tree.
http://en.wikipedia.org/wiki/Suffix_tree

However to construct a suffix tree it is very tedious and takes a long time. It is impractical to implement a suffix tree in any contest of 5-6 hours. Also at times suffix tree is slower than suffix array because of the overhead of the constants involved. Below i will give you some theory regarding the Suffix Array.

Given a string A = (a0 a1 a2 a3 ....an-1) of length n.
A suffix of string S starting at position i is Suffix(i) = (ai ai+1 ai+2 ....an-1).
There are n suffixes of string A ( starting at each index to the end ).

Let U be a set of suffixes of string A in sorted order. ie obtain all the suffixes of A and sort them to form U.

A suffix array S is an array such that S[i] gives the index of Suffix(i) in U.

for example if A = banana
suffixes of A are:
banana = Suffix(0)
anana = Suffix(1)
nana = Suffix(2)
ana = Suffix(3)
na = Suffix(4)
a = Suffix(5)

after sorting U = {
0 a
1 ana
2 anana
3 banana
4 na
5 nana
}

thus suffix array S would be,
S = { 3, 2, 5, 1, 4, 0 }

S[1] = 2 because index of Suffix(1) = anana in U is 2

Algorithm:

n ← length(A)
for i ← 0, n – 1
P(0, i) ← position of Ai in the ordered array of A‘s characters
cnt ← 1
for k ← 1, [log2n] (ceil)
for i ← 0, n – 1
L(i) ← (P(k – 1, i) , P(k – 1, i + cnt) , i )
sort L
compute P(k, i) , i = 0, n - 1
cnt ← 2 * cnt

Effectively the algorithm is O(n * log n) if we sort using bucket sort.
But sometimes we can compromise in the sorting step and the complexity O(n*log n * log n) is also acceptable.

Explanation:

Consider A is of infinite length such that a character γ is appended to A infinite times (just consider, dont to it).

let Σ be the set of all alphabets of A. Then γ < Σ i , for all i=1,2....sizeof (Σ).
ie. our new appended character's value is smaller than all the alphabets of A.

This is done because if we want to take a substring A.substr(i,j) such that i+j > len(A), then it would be possible to consider such string now. [γ is not counted else the length would be infinite].

What is P(k, i) in the above algorithm?

Take all the substring (not suffix) of A of length 2^k. If we dont have enough character for some substrings at the tail of the string then append γ in the remaining spots. Let U be a set of those substrings taken in sorted order.

As an example:
A = banana
then, substring A(3,5) = anaγγ
We dont need to write γ every time.
thus, A(3,5) = ana

P(k, i) is the index of substring A.substr(i, 2^k) in U.

let r = [log2 (len(A)) ] (ceil)
Hence 2^r >= len(A)
Also, len( P(r, 0) ) >= len(A)

When the above algorithm completes, The array P(r) is the suffix array.
You may argue that suffixes are of variable length but P(r) has substrings of equal length.
That is when γ comes handy. Since γ is smaller in value than all the alphabets, γ can be effectively ignored in all cases because in comparing 2 strings γ will play no role in changing the string comparison.

Also note that if there are two substrings starting at say m and n, such that:
A.substr(m,2^k) = A.substr(n,2^k)
then P(k,m) = P(k,n). (This is an algorithmic necessity.)

Here you can see an extremely short implementation for suffix arrays in O(n * logn * log n).

#include < cstdio >
#include < iostream >
#include < cstdlib >
using namespace std;
#define MAXN 65536
#define MAXLG 17

char A[MAXN];

struct entry {
int nr[2], p;
} L[MAXN];

int P[MAXLG][MAXN], N, i, stp, cnt;

int cmp(struct entry a, struct entry b)
{
return a.nr[0] == b.nr[0] ? (a.nr[1] < b.nr[1] ? 1 : 0) : (a.nr[0] < b.nr[0] ? 1 : 0);
}
int main(void)
{
gets(A);
for (N = strlen(A), i = 0; i < N; i ++)
P[0][i] = A[i] - 'a';
for (stp = 1, cnt = 1; cnt >> 1 < N; stp ++, cnt <<= 1)
{
for (i = 0; i < N; i ++)
{
L[i].nr[0] = P[stp - 1][i];
L[i].nr[1] = i + cnt < N ? P[stp - 1][i + cnt] : -1;
L[i].p = i;
}
sort(L, L + N, cmp);
for (i = 0; i < N; i ++)
P[stp][L[i].p] = i > 0 && L[i].nr[0] == L[i - 1].nr[0] && L[i].nr[1] == L[i - 1].nr[1] ? P[stp][L[i - 1].p] : i;
}
return 0;
}

You may optimize the space of P by defining P as
int P[2][MAXN];
since P only depends in its previous value of the first dimension. (see the above code).

Longest common prefix (LCP)

Given 2 strings, the LCP of the 2 strings is defined as the longest string that occurs at the beginning of both the strings.

To compute the LCP of suffixes of A starting at i and j, ie s1 = A(i..n) and s2 = A(j..n) where n = len(A).
We have matrix P.
- find the largest k such that P[k][i] = P[k][j],
- then we have found the prefix of length 2^k common to both s1 and s2.
- increment i and j by 2^k and repeat all over.

Algorithm:

int lcp(int x, int y)
{
int k, ret = 0;
if (x == y) return N - x;
for (k = stp - 1; k >= 0 && x < N && y < N; k --)
if (P[k][x] == P[k][y])
x += 1 << k, y += 1 << k, ret += 1 << k;
return ret;
}

We can reduce the complexity of the above algorithm to O(1) based on the following observation.
Given x < y

LCP(x,y) = minimum of { LCP(x,x+1) , LCP(x+1,x+2) ... LCP(y-1,y) }

So if we have an array that stores LCP of adjacent numbers, then LCP(x,y) = RMQ(x,y) where RMQ = range minimum query.

With some preprocessing of O(n*log n) RMQ can be done in O(1).

For more details http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor

Example Problems that can be done by suffix array

Problem 1.
Minimum lexicographic rotation of a string

Given string A, add the string to itself.
Compute the suffix array of only first half. (use the characters of second half also)
the smallest suffix gives the minimum lexicographic rotation of a string.

Problem 2.
You are given a text consisting of N characters (big and small letters of the English alphabet and
digits). A substring of this text is a subsequence of characters that appear on consecutive positions in the text. Given an integer K, find the length of the longest substring that appears in the text at least K times (1 ≤ N ≤ 16384).

Having the text’s suffixes sorted, iterate with a variable i from 0 to N – K and compute the longest
common prefix of suffix i and suffix i + K – 1. The biggest prefix found during this operation represents
the problem’s solution.

Problem 3.
Longest common substring LCS

There are given three strings S1 , S2 of lengths m and n ( 1 <= m, n <= 10000). Determine their longest common substring.
For example, if S1 = abababca and S2 = aababc their longest common substring would be ababc.

- Concatenate S1 and S2.

- Compute the suffix array.

- Compute the LCP of all adjacent suffixes.

the maximum LCP(i,i+1) is the required answer.

An important note:

We implemented suffix array in O(n*log n) but there is an important algorithm called DC3 which computes suffix arrays in O(n). Thus suffix arrays becomes as effective as suffix tree.

http://www.cs.helsinki.fi/u/tpkarkka/publications/jacm05-revised.pdf

But my principle is that we can use the data structure without knowing the mathematics behind it. So I will paste a very efficient code for DC3 below. This is not my code but I sometimes use it. It calculates suffix array in O(n).

#include

using namespace std;

const int MAX = 250052;

char str[MAX];

int c[MAX];

#define GetI() (SA12[t] < n0 ? SA12[t] * 3 + 1 : (SA12[t] - n0) * 3 + 2)

inline bool leq(int a1, int a2, int b1, int b2) {

return(a1 < b1 || (a1 == b1 && a2 <= b2));

}

inline bool leq(int a1, int a2, int a3, int b1, int b2, int b3) {

return(a1 < b1 || (a1 == b1 && leq(a2,a3, b2,b3)));

}

static void radixPass(int* a, int* b, int* r, int n, int K) {

int i, sum, t;

for(i = 0; i <= K; i++) c[i] = 0;

for(i = 0; i < n; i++) c[r[a[i]]]++;

for(i = 0, sum = 0; i <= K; i++) {

t = c[i];

c[i] = sum;

sum += t;

}

for(i = 0; i < n; i++) b[c[r[a[i]]]++] = a[i];

}

void suffixArray(int* s, int* SA, int n, int K) {

int n0 = (n+2)/3, n1 = (n+1)/3, n2 = n/3, n02 = n0+n2;

int* s12 = new int[n02 + 3];

int* SA12 = new int[n02 + 3];

int* s0 = new int[n0];

int* SA0 = new int[n0];

int i, j, name, c0, c1, c2, p, t, k;

s12[n02] = s12[n02+1] = s12[n02+2] = 0;

SA12[n02] = SA12[n02+1] = SA12[n02+2] = 0;

for(i=0, j=0; i < n+(n0-n1); i++) if(i%3 != 0) s12[j++] = i;

radixPass(s12, SA12, s+2, n02, K);

radixPass(SA12, s12, s+1, n02, K);

radixPass(s12, SA12, s, n02, K);

name = 0, c0 = -1, c1 = -1, c2 = -1;

for(i = 0; i < n02; i++) {

if(s[SA12[i]] != c0 || s[SA12[i]+1] != c1 || s[SA12[i]+2] != c2) {

name++;

c0 = s[SA12[i]];

c1 = s[SA12[i]+1];

c2 = s[SA12[i]+2];

}

if(SA12[i] % 3 == 1) s12[SA12[i]/3] = name;

else s12[SA12[i]/3 + n0] = name;

}

if(name < n02) {

suffixArray(s12, SA12, n02, name);

for(i = 0; i < n02; i++) s12[SA12[i]] = i + 1;

}

else for(i = 0; i < n02; i++) SA12[s12[i] - 1] = i;

for(i=0, j=0; i < n02; i++) if(SA12[i] < n0) s0[j++] = 3*SA12[i];

radixPass(s0, SA0, s, n0, K);

for(p=0, t=n0-n1, k=0; k < n; k++) {

i = GetI();

j = SA0[p];

if(SA12[t] < n0 ? leq(s[i], s12[SA12[t] + n0], s[j], s12[j/3]) : leq(s[i], s[i+1], s12[SA12[t]-n0+1], s[j], s[j+1], s12[j/3+n0])) {

SA[k] = i; t++;

if(t == n02) for(k++; p < n0; p++, k++) SA[k] = SA0[p];

}

else {

SA[k] = j; p++;

if(p == n0) for(k++; t < n02; t++, k++) SA[k] = GetI();

}

delete[] s12; delete[] SA12; delete[] SA0; delete[] s0;

}

int s[MAX], SA[MAX];

int main() {

int n, m, i;

gets(str);

m = -1;

for(i = 0; str[i]; i++) {

s[i] = str[i];

m = m > str[i]? m : str[i];

}

n = i;

for(i = n; i < n+3; i++) SA[i] = s[i] = 0;

suffixArray(s, SA, n, m);

for(i = 0; i < n; i++) printf("%d\n", SA[i]);

return 0;

}

*********END**********

16 comments:

PushkarFebruary 9, 2012 at 10:20 PM
Nice work man !
UnknownJune 17, 2012 at 9:33 PM
I feel a bit difficult to understand the approach of problem 1.
Can you please elaborate it a bit(a small example will be helpful),especially not able to understand the second point.
(1)Given string A, add the string to itself.
(2)Compute the suffix array of only first half. (use the characters of second half also)
(3)the smallest suffix gives the minimum lexicographic rotation of a string.

Thanks!
UnknownNovember 11, 2012 at 10:40 AM
sorry to say.. but you suffix array definition is incorrect.
see this --> http://en.wikipedia.org/wiki/Suffix_array
Αριστοφάνης ΡοντογιἀννηςFebruary 7, 2013 at 6:00 AM
Really nice work! The only think is that I doubt the correctness of the lcs algorithm you propose; could you elaborate more please?
Rahul Kumar SinghMay 2, 2013 at 8:21 PM
very easy to understand from your blog :)
MayankJune 2, 2013 at 5:04 AM
the suffix array according to the wikipedia definition of "banana" should have been {5, 3, 1, 0, 4, 2} (taking 0 indexed instead of 1 indexed given there). Can you elaborate how your suffix array comes out to be different?
UnknownJuly 21, 2013 at 1:41 AM
I think P[stp-1] gives the rank of the suffix starting at 'i'.

Suffix array should give the starting position of suffixes in lexicographical order.

P[stp-1] array gives the inverted suffix array.

Correct me if am wrong .....

A very nice tutorial :)
UnknownOctober 23, 2013 at 10:43 AM
very nice blog indeed....i have a question....how can i find the longest common string of n strings?????plz explain...better if u share a c++ implementation...thanks
godfatherMarch 11, 2014 at 9:14 AM
Eureka!! roxx..thanku u Sir...
UnknownMay 31, 2014 at 10:34 AM
This comment has been removed by the author.
UnknownJanuary 2, 2016 at 5:46 AM
you are using in built sort..so your suffx array construction takes O(nlogn*logn)..can you tell O(nlogn)using radix sort? or atleast post the link..

Tuesday, December 6, 2011

Suffix Array

16 comments: