Tuesday, December 6, 2011

Suffix Array

Suffix Array


I usually don't like to give definitions ( coz i dont remember it :P )
Today i am going to write about a very important data structure called suffix array.

you must have come across problems like finding the
-longest common substring,
-minimum lexicographic rotation of a string,
-finding the longest palindrome, (this can be done in better time by another algorithm)
-searching for a smaller string in a bigger string. ( this can be done in O(m+n) using kmp. however we can do it in O(m log n ) using suffix array. it is useful if we have to search many small strings in the large string. Then it will outspeed KMP)
-

These types of problems can be efficiently solved by a suffix tree.
http://en.wikipedia.org/wiki/Suffix_tree

However to construct a suffix tree it is very tedious and takes a long time. It is impractical to implement a suffix tree in any contest of 5-6 hours. Also at times suffix tree is slower than suffix array because of the overhead of the constants involved. Below i will give you some theory regarding the Suffix Array.


Given a string A = (a0 a1 a2 a3 ....an-1) of length n.
A suffix of string S starting at position i is Suffix(i) = (ai ai+1 ai+2 ....an-1).
There are n suffixes of string A ( starting at each index to the end ).

Let U be a set of suffixes of string A in sorted order. ie obtain all the suffixes of A and sort them to form U.

A suffix array S is an array such that S[i] gives the index of Suffix(i) in U.

for example if A = banana
suffixes of A are:
banana = Suffix(0)
anana = Suffix(1)
nana = Suffix(2)
ana = Suffix(3)
na = Suffix(4)
a = Suffix(5)


after sorting U = {
0 a
1 ana
2 anana
3 banana
4 na
5 nana
}

thus suffix array S would be,
S = { 3, 2, 5, 1, 4, 0 }

S[1] = 2 because index of  Suffix(1) = anana in U is 2





Algorithm:


n ← length(A) 
for i ← 0, n – 1
 P(0, i) ← position of Ai in the ordered array of A‘s characters
cnt ← 1 
for k ← 1, [log2n] (ceil) 
 for i ← 0, n – 1
    L(i) ← (P(k – 1, i) , P(k – 1, i + cnt)  ,  i ) 
 sort L
 compute P(k, i) , i = 0, n - 1 
cnt ← 2 * cnt



Effectively the algorithm is O(n * log n) if we sort using bucket sort.
But sometimes we can compromise in the sorting step and the complexity O(n*log n * log n) is also acceptable.




Explanation:

Consider A is of infinite length such that a character γ is appended to A infinite times (just consider, dont to it).

let Σ be the set of all alphabets of A. Then γ < Σ i  , for all i=1,2....sizeof (Σ).
ie. our new appended character's value is smaller than all the alphabets of A.


This is done because if we want to take a substring A.substr(i,j) such that i+j > len(A), then it would be possible to consider such string now. [γ is not counted else the length would be infinite].



What is P(k, i) in the above algorithm?

Take all the substring (not suffix) of A of length 2^k. If we dont have enough character for some substrings at the tail of the string then append γ in the remaining spots. Let U be a set of those substrings taken in sorted order.

As an example:
A = banana
then, substring A(3,5) = anaγγ
We dont need to write γ every time.
thus, A(3,5) = ana

P(k, i) is the index of substring A.substr(i, 2^k) in U.



let r = [log2 (len(A)) ] (ceil)
Hence 2^r >= len(A)
Also, len( P(r, 0) ) >= len(A)

When the above algorithm completes, The array P(r) is the suffix array.
You may argue that suffixes are of variable length but P(r) has substrings of equal length.
That is when γ comes handy. Since γ is smaller in value than all the alphabets, γ can be effectively ignored in all cases because in comparing 2 strings γ will play no role in changing the string comparison.



Also note that if there are two substrings starting at say m and n, such that:
A.substr(m,2^k) = A.substr(n,2^k)
then P(k,m) = P(k,n). (This is an algorithmic necessity.)



Here you can see an extremely short implementation for suffix arrays in O(n * logn * log n).


#include < cstdio >
#include < iostream >
#include < cstdlib >
using namespace std; 
#define MAXN  65536 
#define MAXLG 17

char A[MAXN];

struct entry { 
    int nr[2], p;
} L[MAXN]; 

int P[MAXLG][MAXN], N, i, stp, cnt; 

int cmp(struct entry a, struct entry b) 

    return a.nr[0] == b.nr[0] ? (a.nr[1] < b.nr[1] ? 1 : 0) : (a.nr[0] < b.nr[0] ? 1 : 0); 

int main(void) 

    gets(A);
    for (N = strlen(A), i = 0; i < N; i ++) 
        P[0][i] = A[i] - 'a'; 
    for (stp = 1, cnt = 1; cnt >> 1 < N; stp ++, cnt <<= 1)
    { 
        for (i = 0; i < N; i ++) 
        { 
            L[i].nr[0] = P[stp - 1][i]; 
            L[i].nr[1] = i + cnt < N ? P[stp - 1][i + cnt] : -1; 
            L[i].p = i;
        } 
        sort(L, L + N, cmp); 
        for (i = 0; i < N; i ++) 
            P[stp][L[i].p] = i > 0 && L[i].nr[0] == L[i - 1].nr[0] && L[i].nr[1] == L[i - 1].nr[1] ? P[stp][L[i - 1].p] : i;
    } 
    return 0;
}





You may optimize the space of P by defining P as
int P[2][MAXN];
since P only depends in its previous value of the first dimension. (see the above code).



Longest common prefix (LCP)

Given 2 strings, the LCP of the 2 strings is defined as the longest string that occurs at the beginning of both the strings.

To compute the LCP of suffixes of A starting at i and j, ie s1 = A(i..n) and s2 = A(j..n) where n = len(A).
We have matrix P.
- find the largest k such that P[k][i] = P[k][j],
- then we have found the prefix of length 2^k common to both s1 and s2.
- increment i and j by 2^k and repeat all over.


Algorithm:


int lcp(int x, int y) 

    int k, ret = 0; 
    if (x == y) return N - x;
    for (k = stp - 1; k >= 0 && x < N && y < N; k --) 
        if (P[k][x] == P[k][y]) 
            x += 1 << k, y += 1 << k, ret += 1 << k; 
    return ret; 
}




We can reduce the complexity of the above algorithm to O(1) based on the following observation.
Given x < y





LCP(x,y) = minimum of { LCP(x,x+1) , LCP(x+1,x+2) ... LCP(y-1,y) }

So if we have an array that stores LCP of adjacent numbers, then LCP(x,y) = RMQ(x,y) where RMQ = range minimum query.

With some preprocessing of O(n*log n) RMQ can be done in O(1).

For more details http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor








Example Problems that can be done by suffix array

Problem 1.
Minimum lexicographic rotation of a string

Given string A, add the string to itself.
Compute the suffix array of only first half. (use the characters of second half also)
the smallest suffix gives the minimum lexicographic rotation of a string.



Problem 2.
You are given a text consisting of N characters (big and small letters of the English alphabet and
digits). A substring of this text is a subsequence of characters that appear on consecutive positions in the text. Given an integer K, find the length of the longest substring that appears in the text at least K times (1 ≤ N ≤ 16384).



Having the text’s suffixes sorted, iterate with a variable i from 0 to N – K and compute the longest
common prefix of suffix i and suffix i + K – 1. The biggest prefix found during this operation represents
the problem’s solution.






Problem 3.
Longest common substring LCS


There are given three strings S1 , S2 of lengths m and n ( 1 <= m, n <= 10000). Determine their longest common substring. 
For example, if S1 = abababca  and S2 = aababc  their longest common substring would be ababc.


- Concatenate S1 and S2.
- Compute the suffix array.
- Compute the LCP of all adjacent suffixes.
the maximum LCP(i,i+1) is the required answer.





An important note:
We implemented suffix array in O(n*log n) but there is an important algorithm called DC3 which computes suffix arrays in O(n). Thus suffix arrays becomes as effective as suffix tree.




But my principle is that we can use the data structure without knowing the mathematics behind it. So I will paste a very efficient code for DC3 below. This is not my code but I sometimes use it. It calculates suffix array in O(n).


#include
#include
using namespace std;
const int MAX = 250052;
char str[MAX];
int c[MAX];
#define GetI() (SA12[t] < n0 ? SA12[t] * 3 + 1 : (SA12[t] - n0) * 3 + 2)
inline bool leq(int a1, int a2, int b1, int b2) {
        return(a1 < b1 || (a1 == b1 && a2 <= b2)); 
}
inline bool leq(int a1, int a2, int a3, int b1, int b2, int b3) {
        return(a1 < b1 || (a1 == b1 && leq(a2,a3, b2,b3)));
}
static void radixPass(int* a, int* b, int* r, int n, int K) {
        int i, sum, t;
        for(i = 0;  i <= K;  i++) c[i] = 0;
        for(i = 0;  i < n;  i++) c[r[a[i]]]++;
        for(i = 0, sum = 0;  i <= K;  i++) {
                t = c[i];
                c[i] = sum;
                sum += t;
        }
        for(i = 0;  i < n;  i++) b[c[r[a[i]]]++] = a[i];
}
void suffixArray(int* s, int* SA, int n, int K) {
        int n0 = (n+2)/3, n1 = (n+1)/3, n2 = n/3, n02 = n0+n2;
        int* s12 = new int[n02 + 3];
        int* SA12 = new int[n02 + 3];
        int* s0 = new int[n0];
        int* SA0 = new int[n0];
        int i, j, name, c0, c1, c2, p, t, k;
        s12[n02] = s12[n02+1] = s12[n02+2] = 0;
        SA12[n02] = SA12[n02+1] = SA12[n02+2] = 0;
        for(i=0, j=0; i < n+(n0-n1); i++) if(i%3 != 0) s12[j++] = i;
        radixPass(s12, SA12, s+2, n02, K);
        radixPass(SA12, s12, s+1, n02, K);
        radixPass(s12, SA12, s, n02, K);
        name = 0, c0 = -1, c1 = -1, c2 = -1;
        for(i = 0; i < n02; i++) {
                if(s[SA12[i]] != c0 || s[SA12[i]+1] != c1 || s[SA12[i]+2] != c2) {
                        name++;
                        c0 = s[SA12[i]];
                        c1 = s[SA12[i]+1];
                        c2 = s[SA12[i]+2];
                }
                if(SA12[i] % 3 == 1) s12[SA12[i]/3] = name;
                else s12[SA12[i]/3 + n0] = name;
        }
        if(name < n02) {
                suffixArray(s12, SA12, n02, name);
                for(i = 0; i < n02; i++) s12[SA12[i]] = i + 1;
        }
        else for(i = 0; i < n02; i++) SA12[s12[i] - 1] = i;
        for(i=0, j=0; i < n02; i++) if(SA12[i] < n0) s0[j++] = 3*SA12[i];
        radixPass(s0, SA0, s, n0, K);
        for(p=0, t=n0-n1, k=0; k < n; k++) {
                i = GetI();
                j = SA0[p];
                if(SA12[t] < n0 ? leq(s[i], s12[SA12[t] + n0], s[j], s12[j/3]) : leq(s[i], s[i+1], s12[SA12[t]-n0+1], s[j], s[j+1], s12[j/3+n0])) {
                        SA[k] = i; t++;
                        if(t == n02) for(k++; p < n0; p++, k++) SA[k] = SA0[p];
                }
                else {
                        SA[k] = j; p++;
                        if(p == n0) for(k++; t < n02; t++, k++) SA[k] = GetI();
                }
        }
        delete[] s12; delete[] SA12; delete[] SA0; delete[] s0;
}
int s[MAX], SA[MAX];
int main() {
        int n, m, i;
        gets(str);
        m = -1;
        for(i = 0; str[i]; i++) {
                s[i] = str[i];
                m = m > str[i]? m : str[i];
        }
        n = i;
        for(i = n; i < n+3; i++) SA[i] = s[i] = 0;
        suffixArray(s, SA, n, m);
        for(i = 0; i < n; i++) printf("%d\n", SA[i]);
        return 0;
}





*********END**********






16 comments:

  1. I feel a bit difficult to understand the approach of problem 1.
    Can you please elaborate it a bit(a small example will be helpful),especially not able to understand the second point.
    (1)Given string A, add the string to itself.
    (2)Compute the suffix array of only first half. (use the characters of second half also)
    (3)the smallest suffix gives the minimum lexicographic rotation of a string.

    Thanks!

    ReplyDelete
    Replies
    1. 1. By add i mean concatenate.
      2. do the for loop " for k ← 1, [log2n] (ceil) - 1 ". We want to find Minimum lexicographic rotation of a string. Hence we should sort only the suffixes of first half of the string. Suffixes of strings on the other half have fewer than n characters and hence doesn't qualify for a lexicographic rotated string.
      In other words we could compute suffix array of entire string and then take the minimum value of first half only of the array.

      Delete
  2. sorry to say.. but you suffix array definition is incorrect.
    see this --> http://en.wikipedia.org/wiki/Suffix_array

    ReplyDelete
    Replies
    1. i dont know if it just me, but i dont see the difference in the definition. It is just another way of saying the same thing.

      Delete
  3. Really nice work! The only think is that I doubt the correctness of the lcs algorithm you propose; could you elaborate more please?

    ReplyDelete
    Replies
    1. Oh that is quite simple.
      Let string s be common between string A and string B.
      A = a1 a2 ... ai s aj aj+1...
      B = b1 b2 ...bk s bl bl+1...

      after concatenation it becomes
      AB
      = a1 a2 ... ai s aj aj+1...b1 b2 ...bk s bl bl+1...

      if we look at 2 suffixes
      s aj aj+1...b1 b2 ...bk s bl bl+1...
      s bl bl+1...

      these 2 suffixes will always occur adjacent to each other after all the suffixes are sorted.
      the longest common prefix is 's' which is the LCS of original problem.

      Delete
    2. the proof is not entirely sufficient. We need to prove that s is the largest possible string that can be common but you get the idea.

      Delete
  4. very easy to understand from your blog :)

    ReplyDelete
  5. the suffix array according to the wikipedia definition of "banana" should have been {5, 3, 1, 0, 4, 2} (taking 0 indexed instead of 1 indexed given there). Can you elaborate how your suffix array comes out to be different?

    ReplyDelete
    Replies
    1. Our definition of Suffix Array are different.
      I defined the Suffix array S[i] as index of suffix[i] in U.
      Wikipedia defines S[i] as index of ith smallest suffix (ie. ith suffix of U) in array A.
      Essentially both definitions are right and can be interchangeably used.
      It is possible to convert from one definition to another in O(n).
      I hope that answers your question.

      Delete
  6. I think P[stp-1] gives the rank of the suffix starting at 'i'.

    Suffix array should give the starting position of suffixes in lexicographical order.

    P[stp-1] array gives the inverted suffix array.

    Correct me if am wrong .....

    A very nice tutorial :)

    ReplyDelete
  7. very nice blog indeed....i have a question....how can i find the longest common string of n strings?????plz explain...better if u share a c++ implementation...thanks

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. you are using in built sort..so your suffx array construction takes O(nlogn*logn)..can you tell O(nlogn)using radix sort? or atleast post the link..

    ReplyDelete