Thursday, May 31, 2012

Finding palindromes efficiently

Using the palindromes in palindromes property, I now develop an algorithm for finding palindromes that requires a number of steps approximately equal to the length of its input. This linear-time algorithm can be used to find palindromes in documents of any size, even in DNA strings. Finding palindromes in a string of length $20,000,000$ using this algorithm requires a number of seconds on a modern laptop. It is impossible to find palindromes significantly faster, unless you have a machine with multiple cores, and use a parallel algorithm.

The program for efficiently finding palindromes is only about 25 lines long, but for formatting purposes it extends to over 80 lines in this blog post. Although the program is short, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The naive algorithm for finding palindromes discussed in a previous blog post uses the following top-level definition for finding maximal palindromes:


< maximalPalindromes :: Array Int Char -> [Int]
< maximalPalindromes a  =   
<   let (first,last)  =  bounds a
<       centers       =  [0 .. 2*(last-first+1)]
<   in  map (lengthPalindromeAround a) centers

The reason why this algorithm is naive is that lengthPalindromeAround calculates the maximal palindrome around a center independently of the palindromes calculated previously. I now change this by calculating the maximal palindromes from left to right around the centers of a string. In this calculation I either extend a palindrome around a center, or I move the center around which I determine the maximal palindrome rightwards because I have found a maximal palindrome around a center. So I replace the above definition by

> maximalPalindromes :: Array Int Char -> [Int]
> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>   in  reverse (extendPalindrome a first 0 [])

Before I introduce and explain function extendPalindrome, I give an example of how the algorithm works. Suppose I want to find the maximal palindromes in the string "yabadabadoo". The algorithm starts by finding the maximal palindrome around the position in front of the string, which cannot be anything else than the empty string. It moves the position around which to find the maximal palindrome one step to point to the `y'. The maximal palindrome around this position is "y", since there is no character in front of it. It again moves the position around which to find palindromes one step to point to the position in between `y' and `a'. Since `y' and `a' are different, the maximal palindrome around this position is the empty string. Moving the center to `a', it finds that "a" is the maximal palindrome around this center, since `y' and `b' are different. The maximal palindrome around the next center in between `a' and `b' is again the empty string. Moving the center to `b', it can extend the current longest palindrome "b" around this center, since both before and after `b' it finds an `a'. It cannot further extend the palindrome "aba", since `y' and `d' are different. To determine the maximal palindrome around the center in between `b' and `a', the next center position, it uses the fact that "aba" is a palindrome, and that it already knows that the maximal palindrome around the center in between `a' and `b' is the empty string. Using the palindromes in palindromes property, it finds that the maximal palindrome around the position in between `b' and `a' is also the empty string, without having to look at the `b' and the `a'. To determine the maximal palindrome around the next center position on the last `a' of "aba", it has to determine if `d' equals `b', which it doesn't of course. Also here it uses the palindromes in palindromes property. Since "a" is the maximal palindrome around the center of the first `a' in "aba", and it reaches until the start of the palindrome "aba", I have to determine if the palindrome "a" around the second `a' can be extended. I won't describe all steps extendPalindrome takes in detail, but only give one more detail I already described above: the second occurrence of the palindrome "aba" in "yabadabadoo" is not found by extending the palindrome around its center, but by using the palindromes in palindromes property to find "aba" a second time in "abadaba".

Function extendPalindrome takes four arguments. The first argument is the array a in which we are determining maximal palindromes. The second argument is the position in the array directly after the longest palindrome around the current center (the longest palindrome around the center before the first symbol has length $0$, so the position directly after the empty palindrome around the first center is the first position in the array). I will call this the current rightmost position. The third argument is the length of the current longest palindrome around that center (starting with 0), and the fourth and final argument is a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order (starting with the empty list []). It returns the list of lengths of maximal palindromes around the centers of the array, in reverse order. Applying function reverse to the result gives the maximal palindromes in the right order. The function extendPalindrome maintains the invariant that the current palindrome is the longest palindrome that reaches until the current rightmost position.

There are three cases to be considered in function extendPalindrome.

  • If the current position is after the end of the array, so rightmost is greater than last, I cannot extend the current palindrome anymore, and it follows that it is maximal. This corresponds to the null after case in the definition of maximal palindromes. It only remains to find the maximal palindromes around the centers between the current center and the end of the array, for which I use the function finalPalindromes.
  • If the current palindrome extends to the start of the array, or it cannot be extended, it is also maximal, and I add it to the list of maximal palindromes found. This case corresponds to the null after || last before /= head after case in the definition of maximal palindromes. I then determine the maximal palindrome around the following center by means of the function moveCenter.
  • If the element at the current rightmost position in the array equals the element before the current palindrome I extend the current palindrome.
In the following Haskell code, I list the arguments to a function below the function, since they don't fit on a single line anymore. I have also included some comments in the code, which are the pieces of text appearing after --.

> extendPalindrome 
>   a 
>   rightmost 
>   currentPalindrome 
>   currentMaximalPalindromes
>   | rightmost > last                                   =  
>       -- reached the end of the array
>       finalPalindromes 
>         currentPalindrome 
>         currentMaximalPalindromes 
>         (currentPalindrome:currentMaximalPalindromes)
>   | rightmost-currentPalindrome == first ||
>     a!rightmost /= a!(rightmost-currentPalindrome-1)   =  
>       -- the current palindrome extends to the start 
>       -- of the array, or it cannot be extended
>       moveCenter 
>         a 
>         rightmost 
>         (currentPalindrome:currentMaximalPalindromes) 
>         currentMaximalPalindromes 
>         currentPalindrome 
>   | otherwise                                          =  
>       -- the current palindrome can be extended
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         (currentPalindrome+2) 
>         currentMaximalPalindromes      
>   where  (first,last)  =  bounds a

In two of the three cases, function extendPalindrome finds a maximal palindrome, and goes on to the next center by means of function finalPalindromes or moveCenter. In the other case it extends the current palindrome, and moves the rightmost position one further to the right.

Function moveCenter moves the center around which the algorithm determines the maximal palindrome. In this function I make essential use of the palindromes in palindromes property. It takes the array as argument, the current rightmost position in the array, the list of maximal palindromes to be extended, the list of palindromes around centers before the center of the current palindrome, and the number of centers in between the center of the current palindrome and the rightmost position. It uses the palindromes in palindromes property to calculate the longest palindrome around the center after the center of the current palindrome.

  • If the last center is on the last element, there is no center in between the rightmost position and the center of the current palindrome. I call extendPalindrome with rightmost position one more than the previous position, and a current palindrome of length 1.
  • If the previous element in the list of maximal palindromes reaches exactly to the left end of the current palindrome, I use the palindromes in palindromes property of palindromes to find the next current palindrome using extendPalindrome.
  • In the other case, I have found the longest palindrome around a center, add that to the list of maximal palindromes, and proceed by moving the center one position, and calling moveCenter again. I only know that the previous element in the list of maximal palindromes does not reach exactly to the left end of the current palindrome, so it might be either shorter or longer. If it is longer, I need to cut off the new maximal palindrome found, so that it reaches exactly to the current rightmost position.

> moveCenter 
>   a 
>   rightmost 
>   currentMaximalPalindromes 
>   previousMaximalPalindromes 
>   nrOfCenters
>   | nrOfCenters == 0                                  =  
>       -- the last center is on the last element: 
>       -- try to extend the palindrome of length 1
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         1 
>         currentMaximalPalindromes
>   | nrOfCenters-1 == head previousMaximalPalindromes  =  
>       -- the previous maximal palindrome reaches 
>       -- exactly to the end of the last current 
>       -- palindrome. Use the palindromes in palindromes 
>       -- property to extend the current palindrome
>       extendPalindrome 
>         a 
>         rightmost 
>         (head previousMaximalPalindromes) 
>         currentMaximalPalindromes
>   | otherwise                                         =  
>       -- move the center one step. Add the length of 
>       -- the longest palindrome to the maximal
>       -- palindromes
>       moveCenter 
>         a 
>         rightmost 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes) 
>         (tail previousMaximalPalindromes) 
>         (nrOfCenters-1)

In the first case, function moveCenter moves the rightmost position one to the right. In the second case it calls extendPalindrome to find the maximal palindrome around the next center, and in the third case it adds a maximal palindrome to the list of maximal palindromes, and moves the center of the current palindromes one position to the right.

Function finalPalindromes calculates the lengths of the longest palindromes around the centers that come after the center of the current palindrome of the array. These palindromes are again obtained by using the palindromes in palindromes property. Function finalPalindromes is called when we have reached the end of the array, so it is impossible to extend a palindrome. We iterate over the list of maximal palindromes, and use the palindromes in palindromes property to find the maximal palindrome at the final centers. As in the function moveCenter, if the previous element in the list of maximal palindromes reaches before the left end of the current palindrome, I need to cut off the new maximal palindrome found, so that it reaches exactly to the end of the array.


> finalPalindromes 
>   nrOfCenters     
>   previousMaximalPalindromes 
>   currentMaximalPalindromes   
>   | nrOfCenters == 0  =  currentMaximalPalindromes
>   | otherwise         =
>       finalPalindromes 
>         (nrOfCenters-1) 
>         (tail previousMaximalPalindromes) 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes)

In each step, function finalPalindromes adds a maximal palindrome to the list of maximal palindromes, and moves on to the next center.

I have discussed the number of steps this algorithm takes for each function. At a global level, this algorithm either extends the current palindrome, and moves the rightmost position in the array, or it extends the list of lengths of maximal palindromes, and moves the center around which we determine the maximal palindrome. If the length of the input array is $n$, the number of steps the algorithm is $n$ for the number of moves of the rightmost position, plus $2n+1$ for the number of center positions. This is a linear-time algorithm.

4 comments:

  1. I've finally determined your algorithm after playing with a small example:) Yours implicitly maintains the current center and it moves rightward upon calling `moveCenter'. This approach is Manacher's algorithm in essence but sorry I think a comprehension of the latter is essential to understand your article...

    ReplyDelete
  2. Indeed, the main idea already appears in Manacher's paper. My work extends it to non-initial palindromes, and to palindromes of both even and odd length. I was hoping this blog post, together with the one on palindromes in palindromes, would be sufficient to understand the main idea. What do you think is missing from the explanation?

    ReplyDelete
  3. I've added an example in which I show how the algorithm calculates palindromes in the string "yabadabadoo" to the blog post.

    ReplyDelete
  4. The example helps a lot. Thanks!

    Also, thanks for the great article. I couldn't access your original paper due to a paywall. This helped me a lot.

    ReplyDelete