Sunday, October 28, 2012

The history of finding palindromes

I have given several programs to find palindromes in the blog posts on this blog. Who was the first to construct those? The history of finding palindromes is not as long as the history of palindromes themselves, since, obviously, computers did not exist in the time of the ancient Greeks. Using a computer to find palindromes has probably been done since the 1970s. Before that, palindromes have been used to illustrate the power of computing models. In this blogpost I will go through the history of finding palindromes using a computer.

The oldest references to papers in computer science I could find mentioning palindromes (or, actually, `strings satisfying the symmetry predicate') were published in 1965 in Russian by Freivald and Bārzdiņš. Since I cannot read Russian, I depend on the description of these papers in a review from Albert Mullin in the Journal of Symbolic Logic in 1970. Both Freivald and Bārzdiņš show how to determine whether or not a string is a palindrome using different variants of a Turing machine. Turing machines are named after Alan Turing, who was born exactly 100 years ago.

In the 1930s, Turing devised a model of a machine to prove a fundamental mathematical theorem about the Entscheidungsproblem. The gist of this theorem is that you cannot solve all problems by means of a computer. In particular, you cannot use a computer to determine if an arbitrary program returns it result in a reasonable amount of time. The Turing machine model has had a profound influence on real computing machines built in the 1940s and 1950s. It has also been used a lot in Computer Science, the scientific field that arose since the construction of the first computers. For a long time, the fields of algorithms and complexity theory, which study algorithms and their efficiency, have used Turing machines to show the efficiency of particular algorithms. Modern computers and Turing machines are very dissimilar, and the complexity of software is often determined by factors not present in Turing machines. Probably for that reason, Turing machines are not used much anymore in Computer Science, although they still appear as examples of computing models.

The most basic form of a Turing machine has a tape on which it can write symbols, a table that tells which actions to take, using a head that can move along the tape, and a state register, which contains the different states the machine can be in, such as a start state, an end state, or a state in between. The machine can read a symbol from the type, write a symbol on the tape, and move the head along the tape. Freivald and Bārzdiņš show that to test whether an input word is a palindrome requires a number of steps that is quadratic in the length of the input string on such a Turing machine. Eight years later in 1973, Slisenko showed that if you are allowed to use more than one head on a tape, or multiple tapes, than a palindrome can be determined in a number of steps linear in the length of the input string. The English translation of the paper `Recognizing a symmetry predicate by multihead Turing machines with input' is a 183 page, dense mathematical text.

Slisenko announced the result already in 1969, and given the form and the length of his paper (with 183 pages this is more a book than a paper), I find it highly likely that it took him a couple of years to write it. Slisenko's result was a surprisingly strong result, and people started to study and trying to improve upon it. One of these people was Zvi Galil, then a postdoc IBM Yorktown Heights.
He had a hard time understanding Slisenko's paper, which I fully understand, but in 1978 he obtained the same result, only he needed just 18 pages to describe it. The problem of finding palindromes efficiently started off an area within Computer Science now known as stringology, which studies strings and their properties, and algorithms on strings.

Freivald, Bārzdiņš, Slisenko, and Galil describe algorithms for recognizing palindromes on Turing machines with a varying number of tapes. On other machine models, palindromes have been a popular example too. Stephen Cole showed how to recognize palindromes on iterative arrays of finite-state machines in 1964. Alvy Ray Smith III used cellular automata to recognize palindromes in 1972. Also in the 1970s, Freivald and Yao, independently, looked at recognizing palindromes by means of probabilistic Turing machines. Looking at these results from a distance of around 40 years, it seems like it was a sport to study programming with various restrictions, like not using your left hand, or tying your legs together.

Some of the machine models on which algorithms for palindromes were developed have rather artificial restrictions, which gives rather artificial algorithms for finding palindromes. But the algorithms on some of the more realistic machine models, such as the Random Access Machine (RAM) model, contain the essential components of the algorithm I give in the blog post on finding palindromes efficiently. Manacher gives a linear-time algorithm on the RAM computing model finding the smallest initial palindrome of even length. He also describes how to adjust his algorithm in order to find the smallest initial palindrome of odd length longer than 3. He did not realize that his approach could be used to find all maximal palindromes in a string in linear time. Zvi Galil and Joel Seiferas did, in 1976. They wrote a paper, titled `Recognizing certain repetitions and reversals within strings', in which they develop an algorithm that finds all maximal palindromes in a string in linear time. As far as I know, this is the first description of the algorithm I give in the blog post on finding palindromes efficiently.

Twelve years later I rediscovered this algorithm. I published a paper on `The derivation of on-line algorithms, with an application to finding palindromes', in which I show how to obtain the efficient algorithm for finding palindromes from the naive algorithm for finding palindromes using algebraic reasoning. The method at which I arrive at the algorithm is completely different from the way Galil and Seiferas present their version. I presented the algorithm and the method I use to construct it to several audiences. I was only 22 years old at the time, and I am afraid my presentation skills were limited. Several people that saw me presenting the efficient algorithm for finding palindromes thought they could do better. An example can be found in the book `Beauty is our business' (which isn't about models, or escort services, but about Computer Science, and dedicated to the Dutch Turing award winning computer scientist Edsger Dijkstra on his sixtieth birthday), in which Jan Tijmen Udding derives the same algorithm using Dijkstra's calculus for calculating programs.

Zvi Galil developed algorithms for almost all problems related to palindromes in his series of papers on palindromes. In 1985, he even developed an algorithm for finding palindromes using a parallel machine. Turing machines, and the other machine models mentioned above, are sequential machines: computations are performed in sequence, and not at the same time. Parallel machines allow computations to be performed at the same time, in parallel. In the previous century few parallel machines were available, but nowadays, even the laptop on which I am typing this text has four cores, and can do many things in parallel. The importance of parallel machines has increased considerably, also because it is expected that increasing the speed of computers by increasing the clock speed is not going to work anymore in the next couple of years. Extra power of computers has to come from the possibilities offered by using multiple cores for computations. To put these cores to good use, we need efficient parallel algorithms for the problems we want to solve. Apostolico, Breslauer, and Galil improved upon Galil's first parallel algorithm for finding palindromes in their paper on `Parallel detection of all palindromes in a string' in 1995. When finding palindromes in DNA, such parallel algorithms might be very useful. I am not aware of any experience report on using parallel algorithms to find palindromes in DNA, but I hope to try this out myself soon, and report on it on this blog.

Since the discovery that palindromes appear in DNA, people working on bioinformatics have developed algorithms for finding palindromes in DNA. The first description of these algorithms I could find are in the book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, by Dan Gusfield.

This is one of the standard works in computational biology. Gusfield shows how to compute all maximal palindromes in a string using a suffix tree. He then moves on to discuss gapped palindromes (called separate palindromes in Gusfield's book), and approximate palindromes, for which he leaves it to the reader to construct an algorithm that returns approximate palindrome in time linear in the product of the maximum number of errors and the length of the input. After Gusfield, several other scientist have worked on finding gapped and approximate palindromes, sometimes improving on or generalizing Gusfield's results. For example, as I already mentioned in the blog post on palindromes with errors or gaps, Porto and Barbosa have developed an efficient algorithm that finds approximate palindromes in which also symbols may be inserted or deleted.

Are all problems related to finding palindromes solved? I think many are, and for most practical purposes, efficient algorithms for finding palindromes are available. From an algorithm-design perspective I think there are still some open questions. The central concept I use in the design of the efficient algorithm for finding palindromes is the palindromes in palindromes property. I think this property should also play a central role in efficient algorithms for finding gapped and approximate palindromes. I have tried for quite a while to design algorithms for these problems using the palindromes in palindromes concept, but failed. Anyone?

Tuesday, September 25, 2012

Finding palindromes in the Y chromosome

To test the program I obtain by combining palindrome finding with errors and gaps, I downloaded a copy of the Y chromosome. The first paper that described the occurrences of huge palindromes in the male chromosome referred to in my blog post on palindromes in DNA is `The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes'. This paper appeared in Nature in 2003, and the Nature web pages provide a link to the Y chromosome studied in the paper. The paper caused quite a stir. It has been cited more than a thousand times in other scientific papers. A thousand citations is a lot: none of my direct colleagues in computer science ever wrote an article with so many citations.

Since it is computationally infeasible to test the program for finding palindromes with errors and gaps with large numbers of allowed errors, I looked up the information about palindromes in the paper. The paper contains a table that gives information about the eight palindromes the authors found in chromosome Y.

Palindrome Arm length (kb) Arm-to-arm identity (%) Spacer length (kb) Palindrome span (kb)
P1 1,450 99.97 2.1 2,902
P1.1 9,9 99.95 3.9 24
P1.2 9,9 99.95 3.9 24
P2 122 99.97 2.1 246
P3 283 99.94 170 736
P4 190 99.98 40 419
P5 496 99.98 3.5 996
P6 110 99.97 46 266
P7 8.7 99.97 12.6 30
P8 36 99.997 3.4 75

Palindrome P1 consists of almost 3 million symbols, and the arm-to-arm identity is 99.97 percent. An arm has length almost one and a half million, so the number of errors is around 450. I assume that the errors are more or less evenly distributed in the palindrome, which implies that in the central 10,000 symbols of an arm of a palindrome, after the gap, about 3 errors would occur. So instead of trying to find the long palindrome with possibly 450 errors and a gap of size around 2100 symbols, I instead try to find palindromes with a gap of size 2200 (to be on the safe side) of length at least 5000 (to be on the safe side) with at most 5 errors (to be on the safe side). To my surprise, I do not find a single palindromes that satisfies these requirements. This long palindrome with a gap of size 2100 does not seem to appear in the Y chromosome of the authors. Since P2 has the same gap size, it does not appear either. Hmm, what's wrong here? Have I made a mistake?

Lets have a look at the other palindromes reported in the table. The next smallest gap size reported in the above table is 3.4 kb. If I try to find palindromes with a gap of size 3400, I get two hits. Around position 18905096 I find a palindrome of length 32056 (including the gap) if I allow for 8 errors. If I try to reduce the gap length, I find that I can take a gap length of 2320, and still find the same palindrome. So maybe this is one of the palindromes with a gap of around 2.1 kb? The arm length is (32056-2320)/2 = 14868, which doesn't correspond at all with the reported arm lengths. I also find a palindrome around position 9806955. This palindrome has length 17304, and 5 errors. This palindrome indeed has a gap of size 3400: reducing the gap size leads to much shorter palindromes. But the palindrome around this position is much shorter than the reported length in Nature. I experimented with different gap sizes and numbers of allowed errors, and found the following palindromes in the Y chromosome:

Center Length Errors Gap length
9806955 13904 5 3400
12078263 43356 1 151498
13738581 34452 0 39262
14445793 64738 5 132950
17989037 14934 5 297114
18899672 3364 7 39540
18905096 29736 8 2320
20530881 24374 2 157760

Since my findings are very different from the results reported in Nature, I contacted the group responsible for the paper, the Whitehead Institute at the Department of Biology at the MIT. The research scientist that answered my questions was kind enough to provide me with the positions at which the palindromes occur in the Y chromosome. Given the positions, I tried to compare the arms by hand (or eye, actually). I got nowhere: the start and end of the arms did not resemble each other at all.

The final clue I needed was that it is not enough to discard errors constituted by two symbols that do not match in a palindrome. Sometimes I also have to delete symbols in one arm of a palindrome, or, equivalently, add symbols to the other arm. With this last clue, I selected the reported palindromic arms in the Y chromosome, and compared them. It turns out that in each case I only have to delete or insert a couple of DNA symbols to obtain a palindrome. The table of palindromes I find are the palindromes P8 to P1 reported in Nature, with P7 missing. My seventh palindrome does not appear in the list of positions I received, but this is probably one of P1.1 or P1.2, of which I didn't receive the positions.

After months of playing with the Y chromosome in some of my spare time, I now finally find the palindromes I was after. If I would have to do the same thing again for another DNA string, I would follow the same approach:

  • use the program that finds palindromes with gaps and errors with various gap sizes and numbers of allowed errors to find centers of palindrome candidates
  • hand align the sequences around the centers of palindrome candidates to determine the length of the palindromes around these centers, given some number of allowed deletions or insertions
I don't think another approach is feasible given the current speed of computers.

Wednesday, September 12, 2012

Palindromes with errors or gaps

Sometimes I not only want to find perfect palindromes, but also palindromes that contain a limited number of errors. A palindrome with a limited number of errors is often called an approximate palindrome. For example, in the book Judges in the King James Bible, verse 19:9 reads:

  And when the man rose up to depart, he, and his 
  concubine, and his servant, his father in law, the 
  damsel's father, said unto him, Behold, now the day 
  draweth toward evening, I pray you tarry all night: 
  behold, the day groweth to an end, lodge here, that 
  thine heart may be merry; and to morrow get you early 
  on your way, that thou mayest go home.
The substring "draweth toward" is a text palindrome with one error: the `e' and the `o' don't match. To be precise, a string s is an approximate palindrome with k errors, if it satisfies the predicate approximatePalindrome k s. I define approximatePalindrome as a program by taking the first half of both the input string and its reverse, and comparing each of the elements of these strings pairwise. I don't have to compare the second halves of these strings, since that would give the same result as the comparison of the first halves, and mismatches would be counted twice.

> approximatePalindrome k s = 
>   let half = div (length s) 2
>       approximatePalindrome' k []     []     = k>=0
>       approximatePalindrome' k (x:xs) (y:ys)  
>         | x == y    = approximatePalindrome' k     xs ys
>         | k >= 0    = approximatePalindrome' (k-1) xs ys
>         | otherwise = False
>   in  approximatePalindrome' k 
>         (take half s) 
>         (take half (reverse s)) 

A palindrome with a gap is a palindrome in which a gap of a particular size in the middle is ignored. An example of a palindrome with a gap is found in Revelations, where verses 20:7-8 read:

  And when the thousand years are expired, Satan shall 
  be loosed out of his prison, And shall go out to 
  deceive the nations which are in the four quarters of 
  the earth, Gog, and Magog, to gather them together to 
  battle: the number of whom is as the sand of the sea.
Here "Gog, and Magog" is a text palindrome with a gap of length three in the middle: the `n' and the `M' around the central `d' don't match. Since the gap appears in the middle of the string, the length of the gap is odd if the length of the palindrome is odd, and even if the length of the palindrome is even. To be precise, a string s is a palindrome with a gap of length g in the middle, if it satisfies the predicate gappedPalindrome g s:

> gappedPalindrome g s = 
>   let ls            = length s
>       removeGap g s = 
>           let armLength     = div (ls - g) 2
>               (before,rest) = splitAt armLength s
>               (gap,after)   = splitAt g rest
>           in  before ++ after 
>   in  if    g < ls && odd g == odd ls 
>       then  palindrome (removeGap g s)
>       else  error "gappedPalindrome: incorrect gap length"
 
If the gap is shorter than the length of the input string, and both the gap and input string have even or odd length, I remove the gap from the middle of the string, and check that the remaining string is a palindrome. I remove the gap by taking half of the elements of the input string, minus half of the gap elements, then dropping the gap, and then taking the rest of the string.

To find palindromes with errors or gaps, I adapt my software for finding palindromes.

It is quite easy to adapt the quadratic-time algorithm for finding palindromes to also find approximate palindromes, and palindromes with gaps. In this blog post I show how to adapt the quadratic-time algorithm that finds palindromes in DNA. Adapting the quadratic-time algorithms for finding exact or text palindromes is done in exactly the same way. To find approximate palindromes in DNA, I pass an extra argument k, representing the maximum number of errors allowed, to the function qMaximalPalindromesDNA, which I now call approximatePalindromesDNA. The only function I have to adapt is the function lengthPalindromeAroundDNA:


> lengthApproximatePalindromeAroundDNA 
>   input lastPos k start end  
>   |  start < 0 || end > lastPos                =  end-start-1
>   |  B.index input start =:= B.index input end =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          k 
>          (start-1) 
>          (end+1) 
>   |  k > 0                                     =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          (k-1) 
>          (start-1) 
>          (end+1) 
>   |  otherwise                                 =  end-start-1

The time complexity of approximatePalindromesDNA is related to the sum of the lengths of the approximate palindromes found in the input. As I described in the blog post on finding palindromes in DNA, this is not a problem for perfect palindromes in DNA. In the case of approximate palindromes, the number of allowed errors has a big influence on the time complexity. If I increase the number of allowed errors by one, the length of all palindromes increases with at least two, since a single error represents two symbols, and since after the error there might be matching symbols. For example, if I allow for 500 errors to find a palindrome consisting of almost $3$ million symbols, each approximate palindrome in the input has at least length $1000$, except for the palindromes at the start and the end of the string. The average approximate palindrome is quite a bit longer, since the chance for an accidental match is one in four. Since each palindrome has length on average more than $1250$, applying the function approximatePalindromesDNA to a DNA string of length $25$ million requires more than $30$ billion steps only to compute the lengths of the approximate palindromes. This number is very challenging for my laptop, and until now I haven't been able to convince my machine to calculate these values for me, despite letting it run for several nights.

To find palindromes with gaps in DNA, I use an extra argument g, which represents the length of the gap in the middle of the palindrome which I can ignore. A gap may be odd or even, and in the former case I get a palindrome around a center on a symbol in a DNA string, instead of in between two symbols. So where an exact palindromes in a DNA string has its center in between two symbols, a gapped palindrome may have a center in between two symbols or on a symbol, depending on whether or not the gap is even. The function qMaximalPalindromesDNA, which is now called gappedMaximalPalindromes, is the only function I have to adapt. For each center, I have to subtract half of the gap of the center for the start position, and add half of the gap to the end position. When doing so, I have to check that these positions do not point before the beginning of the string or after the end of the string. If they do, I have to replace half of the gap by the appropriate smaller number.


> gappedMaximalPalindromesDNA input g  =
>   let halfg       =  div g 2
>       lengthInput =  B.length input
>       centers     =  if   even g 
>                      then [0 .. lengthInput]
>                      else [0 .. lengthInput-1]
>       halfg' c    |  c < halfg             = c
>                   |  c+halfg > lengthInput = lengthInput-c
>                   |  otherwise             = halfg
>       left c      =  c-1-halfg' c
>       right c     =  if   even g
>                      then c+halfg' c
>                      else c+1+halfg' c
>   in  map 
>         (\c -> lengthPalindromeAroundDNA 
>                  input 
>                  (lengthInput-1)
>                  (left c)
>                  (right c)
>         ) 
>         centers

By combining the functions lengthApproximatePalindromeAroundDNA and gappedMaximalPalindromesDNA I obtain a function for determining palindromes with both gaps in the middle, and errors in the palindromic arms.

Adapting the quadratic-time algorithm for finding palindromes to also allow for errors and gaps in palindromes is not very hard. How about adapting the linear-time algorithm for finding palindromes?

Adapting the linear-time algorithm to account for errors and gaps is very hard. I spent a long time studying the problem of finding palindromes with a given maximum number of errors in time linear in some combination of the number of errors and the length of the input, but failed to find a satisfying solution. During a couple of long trips I drew many ideas in my notebook to adapt the main components of the linear-time algorithm, the longest tail palindrome and the list of maximal palindromes with centers before the longest tail palindrome, to allow for errors and gaps. In vain.

I tried to adapt the linear-time algorithm for finding palindromes to also find approximate palindromes before I had realized that the quadratic-time algorithm returns its results much faster on DNA strings than the linear-time algorithm. My frustrations about not being able to find a solution have almost disappeared now.

In the scientific literature I found two solutions to the problem of finding approximate palindromes in time linear in some combination of the input and the number of allowed errors. The first solution uses so-called suffix trees. A suffix tree is an advanced data structure that can be used to find all kinds of structures in strings, among which palindromes. I compared the efficiency of my linear-time algorithm for finding exact palindromes with an algorithm using suffix trees, and my linear-time algorithm was far superior. The solutions for finding approximate palindromes using suffix trees use even more arguments, and they will run even slower than the algorithm for finding exact palindromes in strings using suffix trees. The quadratic-time solution for finding approximate palindromes with gaps is no more complicated than the above programs, so although I did not perform any tests, I am pretty confident that the above quadratic-time algorithm for finding approximate palindromes runs much faster than the solutions using suffix trees when applied to DNA strings. The second solution uses advanced algorithms on graphs to determine edit actions that turn one string into another string. I'm afraid I don't understand the details sufficiently well to discuss the algorithm here. The time-complexity of this algorithm is related to $k^2 \times n$, where $k$ is the number of allowed errors, and $n$ the length of the input string. For the DNA string of length $25$ million and the $500$ errors used as examples above, this would amount to more than $6$ trillion steps. My machine would just laugh at it.

Thursday, August 16, 2012

Finding palindromes in DNA

Palindromes play an important role in DNA, and many geneticists want to find palindromes in the DNA they have. To find palindromes in DNA, I cannot directly apply the algorithm for finding palindromes described in a previous blog post, since the equality used for palindromes in DNA differs from the normal equality. A string is a palindrome in DNA if each symbol in the reversed string is the negation of the symbol in the original string, where `A' is the negation of `T' and vice versa, and `C' is the negation of `G', and vice versa. The operator =:= introduced in my blog post on What is a palindrome II? A specification in Haskell implements this equality. By negating the result of =:= I get the inequality operator =/= on DNA symbols.

A first attempt to find palindromes in DNA is to adapt the various components to deal with DNA in the functions extendPalindrome, moveCenter, and finalPalindromes introduced in the blog post on finding palindromes efficiently. In the function extendPalindrome the expression a!... /= a!..., where the dots should be replaced by the appropriate indices in the input, is replaced by the expression a!... =/= a!... using the DNA inequality operator. This is sufficient to turn the function extendPalindrome into a function that finds palindromes in DNA. The functions moveCenter and finalPalindromes do not use the (in)equality operator, so you might think that this is all that is needed to obtain a program for finding palindromes in DNA.

There is one more location in the algorithm for finding palindromes where I use a property of palindromes that does not hold for DNA. In the function moveCenter I call function extendPalindrome with a new "current palindrome" of length one. But a DNA string of length one is not a palindrome, since if I reverse it, I get the same string, and to be a palindrome, the string has to contain the complement symbol. Every DNA string of odd length is not a palindrome, since when such a string is reversed, the symbol in the middle is compared against itself, and DNA equality fails. To find palindromes in a DNA string, I only have to look at the even positions in the string, which correspond to the positions in between symbols. The palindromes around odd positions on the symbols always have length zero.

I adapt the algorithm for efficiently finding palindromes to only find palindromes around even positions. Besides adapting the function extendPalindrome with a different inequality operator, I adapt the function moveCenter by calling extendPalindrome with a new longest current palindrome of length zero, and by counting the number of centers down with steps of size two instead of one, since I am not interested in the odd center positions anymore. These are all changes needed, and I will not give the resulting algorithm. The interested reader can find them in my software package for finding palindromes.

In spring 2012 I was contacted by an Indian scientist who was developing an algorithm for finding palindromes in DNA. I soon found out that her algorithm was similar to my naive, quadratic-time, algorithm for finding maximal palindromes. What worried me was that her algorithm seemed to run faster than my efficient algorithm for finding palindromes, adapted to finding palindromes in DNA. By thinking that my efficient algorithm would run faster, I had made the classical mistake of confusing worst-case time complexity with practice.

The worst-case time complexity of my efficient algorithm for finding palindromes is linear. The number of steps the algorithm takes is linear in the length of the input string. The worst case time-complexity of the naive algorithm is quadratic. A string in which all substrings are palindromes leads to "worst case" time behaviour. An example of a DNA string satisfying this property is a string looking like "ATATATAT...". For a string consisting of three million copies of "AT", the efficient algorithm returns the length of all maximal palindromes about equally fast as for the the Amycolatopsis marina, which is about equally long, but the quadratic time algorithm never finishes on the repetitions of ATs. Strings like "ATATATAT..." appear in DNA, but they are short. For example, the longest maximal palindrome in chromosome 18 consisting of just "ATAT..." has length 66. The longest overall maximal palindrome has length 76. The majority of the maximal palindromes is much shorter. The time-complexity of the naive algorithm for finding palindromes is the sum of the length of the palindromes found in the argument string. If the maximum length of a palindrome is 76, then the number of steps performed by the naive algorithm is at most 76 times the length of the input string. Since chromosome 18 consists of almost 75 million symbols, this is more than five and a half billion, which is quite a lot, but doable for a modern laptop. Indeed, if I measure the running time of the quadratic and linear-time solutions on chromosome 18, I find that the quadratic solution is about four to five times faster than the linear solution. To be precise, on my MacBook Pro with 2.4 GHz Intel Core i5 with 4GB memory, with very few other processes running, the linear-time solution requires slightly more than 30 minutes to analyse the almost 75 megabytes of chromosome 18, whereas the quadratic solution uses around 6 minutes. Why is it faster?

It is hard to exactly determine what one program makes faster than another. The linear time solution uses quite a few arguments, in particular the list of maximal palindromes found until now. This is the central idea to obtain linear-time behaviour, but it also requires constructing, passing on, inspecting, and deconstructing this list to determine maximal palindromes with centers to the right of the current longest tail palindrome. I have experimented with several kinds of representations for the list of maximal palindromes used as argument to the functions extendPalindrome and moveCenter, but none of these solutions can compare to the quadratic solution when looking at running times on large DNA strings. An important motivation for developing a linear-time algorithm for finding palindromes turns out not to be valid at all...

The quadratic-time solution for finding palindromes in DNA is very similar to the naive algorithm for finding maximal palindromes. There are three differences. First, the input to the function maximalPalindromes, which I now call qMaximalPalindromesDNA, is of type ByteString instead of Array Int Char. The type ByteString is a data structure specifically designed to efficiently deal with strings of which the length is known. It is very similar to the type String, but functions like read for reading strings, map for mapping a function over a string, length for determining the length of a stringdefined in the module ByteString, all are usually much faster than their equivalents on String, or even on Array Int Char. I write B.ByteString, B.length, and so on, to make explicit that these components come from the module ByteString:


> import qualified Data.ByteString as B

Second, in the function qMaximalPalindromesDNA I only consider positions in between symbols around which to find maximal palindromes, since palindromes in DNA have even length. The length of the list of centers has length one plus the length of the input, to account for the center at the end of the input.

> qMaximalPalindromesDNA :: B.ByteString -> [Int]
> qMaximalPalindromesDNA input = 
>   let lengthInput =  B.length input
>       centers     =  [0 .. lengthInput]
>       lastPos     =  lengthInput - 1
>   in map 
>        (\c -> lengthPalindromeAroundDNA 
>                 input 
>                 lastPos 
>                 (c-1) 
>                 c
>        ) 
>        centers

The third difference is that function lengthPalindromeAroundDNA doesn't distinguish between even-length and odd-length palindromes anymore, since all palindromes have even length, and that it needs to use the equality operator for DNA.

> lengthPalindromeAroundDNA input lastPos start end  
>   | start < 0 || end > lastPos                 =  end-start-1
>   | B.index input start =:= B.index input end  =  
>       lengthPalindromeAroundDNA 
>         input 
>         lastPos 
>         (start-1) 
>         (end+1) 
>   | otherwise                                  =  end-start-1

This solves the problem of finding palindromes in DNA. But I cannot yet use this algorithm to find the huge palindromes in the Y chromosome, since those palindromes have some noise in the middle, and the palindromic arms contain a few errors. I will look into this in a following blog post.

Sunday, July 22, 2012

Palindromes in DNA

DNA strings are often millions of letters long. For example, the copy I have of the DNA of the Amycolatopsis marina, a bacteria discovered in 2009, consists of $6,503,724$ characters.

Bacteria are of course some of the smallest living beings. The DNA in the copy I have of the human chromosome 18 is an astounding $74,657,229$ characters long. Chromosome 18 is one of the 23 chromosomes of human beings, and is estimated to contain in between $300$ and $500$ genes. A gene is responsible for part of our functioning as human beings. An example of a gene contained within chromosome 18 is the NPC1 gene, named after the Niemann-Pick disease, type C1. This gene is named after the disease you get when you have an error on the gene. Experiments with mice show that this gene probably controls the appetite, and people with a mutation on the NPC1 gene often suffer from obesitas. Interestingly, the same mutation on NPC1 might make you immune for the ebola virus, one of the deadliest known viruses. So the mutation is partly a blessing in disguise.

If I search for the keyword palindrome in the electronic publications available at the library of my university, I get more than $500$ hits. The first ten of these hits are all about palindromes in DNA. My guess is that at least $90$% of the list of publications are about palindromes in DNA. So what is the interest in palindromes in DNA? Surely, if your strings only contain `A', `T', `C', and `G' you are likely to get many palindromes?

Let us look at how many palindromes we expect to find in the DNA of the Amycolatopsis marina. Suppose I want to find palindromes of length twelve. I calculate the chance that an arbitrary DNA string of length twelve is a palindrome as follows. The first six characters of the string don't matter. The seventh character needs to match with the sixth character, for which we have a chance of one in four. Remember that in DNA, `A' matches with `T' and vice versa, but both `A' and `T' do not match with themselves, or with `C' and `G'. The eighth character needs to match with the fifth character, for which we also have a chance of one in four. This goes on until the twelfth character, so I get a chance of \[ \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} = (\frac{1}{4})^6 = \frac{1}{4^6} = \frac{1}{4096} \] Since the Amycolatopsis marina is $6,503,724$ characters long, there are $6,503,713$ substrings of length twelve. Multiplying this number with the chance that it is a palindrome, I expect to get $1,589$ palindromes. Using my algorithm for finding palindromes I get $1,784$ palindromes. This is slightly above $10$% more than expected, but that might be an accident. If I look at the palindromes of length fourteen in the $18$th human chromosome, using a similar calculation I expect to find $4,556$ palindromes. My algorithm finds $25,323$. More than five times as many palindromes as expected! This is not an accident anymore. Palindromes play a role in DNA. But what role?

Palindromes perform several tasks in human DNA. I will discuss one particularly intriguing task in this blog post.

Why do we humans have sex? I can answer this question from several perspectives. The most common answer will probably mention love, pleasure, or children. Looking at the question from a biological perspective, the answer discusses the biological advantages of having sex. Our children get their genes from two parents. They get two complete copies of human DNA, and merge these copies in some way to make them the way they are. In this merging process, damaged DNA from one parent can be replaced by functioning DNA from the other parent. There is a lot of DNA to repair: every day, each cell may be damaged at one million different places. Most of this damage is harmless, since a lot of DNA does not seem to have any function at all, but some kinds of damage may cause a lot of problems. Being able to repair non-functioning DNA when combining DNA from parents is essential for keeping the human race in a good state. The American geneticist Hermann Joseph Muller used this argument to explain why sexual reproduction is favored over asexual reproduction in organisms. When an organism reproduces asexually it passes all its DNA errors on to its offspring, without a possibility to repair them, eventually leading to the extinction of the population. The process has been dubbed Muller's ratchet, after the ratchet device, which can only turn in one direction. This is the theory, practice is slightly more complicated.

Muller's ratchet should already be at work in humans, since there is one human chromosome in which no combination of parental genes takes place: chromosome 23. Chromosome 23 determines whether we become a man or a woman. A man gets a copy of the chromosome called X from his mother and a copy called Y from his father. A woman gets an X from her mother and an X from her father. A woman can merge both X's and repair possible errors, but a man has two different copies of the chromosome, and has no possibility to combine, let alone repair, them. The Y is passed on from father to son, with no involvement of women. Muller's ratchet theory says the genes on the chromosome that make a man should deteriorate quickly, and men should soon become extinct.

There is no sign of men becoming extinct. Apparently there are some other mechanisms at work to repair DNA on the Y chromosome. If I cannot obtain a copy of some piece of DNA from my mother, maybe I can store copies of the DNA in the Y chromosome itself? If I maintain two copies, I can always check if a piece of DNA is correct by comparing it against the copy of this piece of DNA. This is the mechanism used by the Y chromosome, where the copies are stored as palindromes, with some noise in the middle of these palindromes. Such palindromes with gaps in the middle are often called inverted repeats in biology. The Y chromosome contains eight huge palindromes, the longest of which consists of almost three million characters. Around 25% of the Y chromosome consists of palindromes. The DNA in the palindromes carries genes for describing the male testes. So the mechanism by means of which men survive is called palindrome...

Sunday, July 1, 2012

Other implementations and solutions

Next year it is 25 years ago since I constructed an algorithm for finding palindromes efficiently. I was 22 years old then, and had just started as a PhD student. I was quite excited about having solved the problem of finding palindromes efficiently, and wrote a scientific paper about it. My excitement wasn't shared by the scientific community though. In the last twenty-five years this paper has been cited less than ten times, and appears at the bottom end of my most cited papers list.

In 2007 we developed the ICFP programming contest. The ICFP programing contest is a very challenging contest, in which thousands of programmers try to show off their programming talents in their programming language of choice. Our contest asked participants to transform the left picture below to the right picture using as few commands as possible. We included a problem related to palindromes in our contest, since this was my pet-problem ever since 1988. After the contest I explained how to solve the palindrome problem efficiently in a blog post.












When some years later I looked at the number of hits the contest pages received, I found that each month, thousands of people are reading the blog message about finding palindromes. Why does this page attract so many visitors?

The palindrome problem is a common question in interviews for jobs for software developers, so the blog post attracts software developers looking for a new job, and preparing themselves for an interview. Another reason the blog post attracts visitors is that I think quite a few people are genuinely interested in the problem and its solution. The concept of palindromes appears in almost all languages, which means that the question of finding palindromes is asked all over the world. The blog post indeed attracts visitors from all over the world. The last 100 (July 1, 2012) visitors come from all continents except Australia.

Many people ask the question of how to find palindromes, but also many people try to answer the question. You can find hundreds of solutions for finding palindromes on the internet. Some of these are variants of my linear-time solution, others are more naive quadratic or sometimes even cubic-time solutions. Below I give the list I found, ordered on the programming language used for the implementation. If there exists a linear-time implementation, I don't list less efficient solutions.

C The same linear-time solution as my blog post.
C++ 1 The same linear-time solution as my blog post. This post has an extensive description of the palindromes in palindromes property, including some pictures which try to explain it.
C++ 2 A C++ implementation of Manacher's algorithm.
C++ 3 The same linear-time solution as my blog post in C++ by Fernando Pelliccioni.
C# As far as I can see, this is a quadratic-time solution.
Factor A cubic-time solution.
F# A quadratic-time solution.
Go A quadratic-time solution, by Lars Björnfot.
Haskell Just as the program for finding palindromes described in my blog post, this blog post describes a linear-time program for finding palindromes in Haskell. The interesting aspect of this solution is that it returns results lazily: as soon as it finds a maximal palindrome, it writes it to the output.
Java 1 A quadratic-time solution.
Java 2 Another quadratic-time solution in Java, by Marcello de Sales.
Java 3 Yet another quadratic-time solution in Java, by Mohit Bhandari.
Matlab A quadratic-time solution, by Lalit Mohan.
PHP 1 As far as I can see this is a quadratic-time solution, by Marc Donaldson.
PHP 2 This is a quadratic- or cubic-time solution, by Joseba Bikandi. You can try it on-line.
Python The same linear-time solution as my blog post, developed by Fred Akalin. This post contains an extensive description of the palindromes in palindromes property too.
R This page describes the interface of functionality for finding palindromes in DNA strings. It doesn't say how efficient the software is.
Ruby 1 A quadratic-time solution, by Matthew Kerry.
Ruby 2 Another quadratic-time solution in Ruby, by Rick DeNatale. The post and the comments mention some more Ruby implementations.
Ruby 3 Yet another quadratic-time solution in Ruby, by Mitchell Fang.
Scheme A quadratic-time solution.
I could only find linear-time implementations for finding palindromes in C, C++, Haskell, and Python. Please let me know if you find better or alternative implementations for finding palindromes. This list easily gets outdated, so please keep me informed.

Friday, June 29, 2012

The Sator square

The Sator square is a square that reads the same forwards, backwards, upwards and downwards:

Sator
Arepo
Tenet
Opera
Rotas

The square has been found at many places around the world: on amulets, carved into walls or stones, written on disks used for extinguishing fire, and so on. Its first known appearance is in the ruins of Pompeii. The square was found in a building that was decorated in a style that became popular since the year 50. Pompeii was covered in ash from the Vesuvius in the year 79, and it follows that this example is almost 2000 years old. This makes the square one of the oldest known palindromes.

What does the Sator square mean? Except for `Arepo', scholars mostly agree on the meaning of the individual words. `Sator' is the seeder, sower, or begetter. It is also used as a metaphor for God, not necessarily Christian, but also Roman. `Tenet' is a verb, and means it, he or she holds. `Opera' is often considered to mean with effort. It is related to opus or opera, which means work. Finally, `Rotas' is generally considered a noun, meaning the wheels. But it might also be a verb meaning turn. Nobody has ever found a word in Roman, Greek, Etruscan, or any Indo-European language that explains `Arepo'. Some people think that it is a name, others that it stands for plough, misspelled to fit the palindrome, and yet others that it is just nonsense. For a long period, people thought the square had a Christian meaning by rearranging the letters into two Pater Nosters in the form of a cross. The Pre-Christian Pompeii find crushed this theory. If Arepo is a name, the Sator square might read: `The sower Arepo holds the wheels with effort', and if it stands for plough, you might get: `God holds the plough, but you turn the furrows'. According to John Cullen, this could have been a motto for farmers in Roman times. Most explanations I have read call the Sator square an ancient meme, with as much meaning as the sentence `all your base are belong to us'. In 1937, the Italian Antonio Ferrua probably gives the best explanation for what the square means: `Esattamente quello che si vuole'(!) E basta di questo argumento (it means exactly what you want it to mean. And so much for that argument!). The Sator square is an example of a meme that went viral long before the internet.

Here is an instance of the square I found in the library in Skara, Sweden, written by an unknown author probably in Stockholm, Sweden, in the year 1722. It starts with Rotas instead of Sator, just as the first appearances of the square. For some reason the version starting with Sator has become more popular.

Because it is a palindrome, the Sator square was thought to have many healing effects, curing snake-bites, headaches, jaundice, and many other illnesses. Medieval books mention the square as a cure for fever and insanity. Interestingly, although we now know that using words to cure an illness is of little help, we humans do use palindromes to repair our body. I will say more about this in a later blog post on palindromes in DNA.

The amount of human effort gone into explaining the meaning of the Sator square is unbelievable. Since 1889, when Francis John Haverfield described a find of the square in a Romano-Bristish building in Cirencester, there has been a steady stream of articles on the Sator square, and the total number of articles easily surpasses a hundred. Rose Mary Sheldon recently published a 54-page annotated biography of the literature on the Sator square: The Sator rebus: An unsolved cryptogram? Charles Douglas Gunn wrote his PhD thesis on the square at Yale in 1969. He suggests that the square was written by a Roman who wanted to take palindromic squares one step further from the misformed four-letter word square Roma tibi subito montibus ibit amor, meaning `For by my efforts you are about to reach Rome, the object of your travel'. He wrote software to generate all possible five-letter Latin word squares. These squares take up more than a hundred pages in his thesis. He concludes that the Sator square is the best.

There are not many other examples of squares like the Sator square, consisting of five words that together constitute a meaningful palindrome. Piet Burger discovered the following Dutch square in 1990.

Sneed
nette
etste
etten
deens

Its meaning is so far-fetched that I won't try to give an English translation.

Thursday, May 31, 2012

Finding palindromes efficiently

Using the palindromes in palindromes property, I now develop an algorithm for finding palindromes that requires a number of steps approximately equal to the length of its input. This linear-time algorithm can be used to find palindromes in documents of any size, even in DNA strings. Finding palindromes in a string of length $20,000,000$ using this algorithm requires a number of seconds on a modern laptop. It is impossible to find palindromes significantly faster, unless you have a machine with multiple cores, and use a parallel algorithm.

The program for efficiently finding palindromes is only about 25 lines long, but for formatting purposes it extends to over 80 lines in this blog post. Although the program is short, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The naive algorithm for finding palindromes discussed in a previous blog post uses the following top-level definition for finding maximal palindromes:


< maximalPalindromes :: Array Int Char -> [Int]
< maximalPalindromes a  =   
<   let (first,last)  =  bounds a
<       centers       =  [0 .. 2*(last-first+1)]
<   in  map (lengthPalindromeAround a) centers

The reason why this algorithm is naive is that lengthPalindromeAround calculates the maximal palindrome around a center independently of the palindromes calculated previously. I now change this by calculating the maximal palindromes from left to right around the centers of a string. In this calculation I either extend a palindrome around a center, or I move the center around which I determine the maximal palindrome rightwards because I have found a maximal palindrome around a center. So I replace the above definition by

> maximalPalindromes :: Array Int Char -> [Int]
> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>   in  reverse (extendPalindrome a first 0 [])

Before I introduce and explain function extendPalindrome, I give an example of how the algorithm works. Suppose I want to find the maximal palindromes in the string "yabadabadoo". The algorithm starts by finding the maximal palindrome around the position in front of the string, which cannot be anything else than the empty string. It moves the position around which to find the maximal palindrome one step to point to the `y'. The maximal palindrome around this position is "y", since there is no character in front of it. It again moves the position around which to find palindromes one step to point to the position in between `y' and `a'. Since `y' and `a' are different, the maximal palindrome around this position is the empty string. Moving the center to `a', it finds that "a" is the maximal palindrome around this center, since `y' and `b' are different. The maximal palindrome around the next center in between `a' and `b' is again the empty string. Moving the center to `b', it can extend the current longest palindrome "b" around this center, since both before and after `b' it finds an `a'. It cannot further extend the palindrome "aba", since `y' and `d' are different. To determine the maximal palindrome around the center in between `b' and `a', the next center position, it uses the fact that "aba" is a palindrome, and that it already knows that the maximal palindrome around the center in between `a' and `b' is the empty string. Using the palindromes in palindromes property, it finds that the maximal palindrome around the position in between `b' and `a' is also the empty string, without having to look at the `b' and the `a'. To determine the maximal palindrome around the next center position on the last `a' of "aba", it has to determine if `d' equals `b', which it doesn't of course. Also here it uses the palindromes in palindromes property. Since "a" is the maximal palindrome around the center of the first `a' in "aba", and it reaches until the start of the palindrome "aba", I have to determine if the palindrome "a" around the second `a' can be extended. I won't describe all steps extendPalindrome takes in detail, but only give one more detail I already described above: the second occurrence of the palindrome "aba" in "yabadabadoo" is not found by extending the palindrome around its center, but by using the palindromes in palindromes property to find "aba" a second time in "abadaba".

Function extendPalindrome takes four arguments. The first argument is the array a in which we are determining maximal palindromes. The second argument is the position in the array directly after the longest palindrome around the current center (the longest palindrome around the center before the first symbol has length $0$, so the position directly after the empty palindrome around the first center is the first position in the array). I will call this the current rightmost position. The third argument is the length of the current longest palindrome around that center (starting with 0), and the fourth and final argument is a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order (starting with the empty list []). It returns the list of lengths of maximal palindromes around the centers of the array, in reverse order. Applying function reverse to the result gives the maximal palindromes in the right order. The function extendPalindrome maintains the invariant that the current palindrome is the longest palindrome that reaches until the current rightmost position.

There are three cases to be considered in function extendPalindrome.

  • If the current position is after the end of the array, so rightmost is greater than last, I cannot extend the current palindrome anymore, and it follows that it is maximal. This corresponds to the null after case in the definition of maximal palindromes. It only remains to find the maximal palindromes around the centers between the current center and the end of the array, for which I use the function finalPalindromes.
  • If the current palindrome extends to the start of the array, or it cannot be extended, it is also maximal, and I add it to the list of maximal palindromes found. This case corresponds to the null after || last before /= head after case in the definition of maximal palindromes. I then determine the maximal palindrome around the following center by means of the function moveCenter.
  • If the element at the current rightmost position in the array equals the element before the current palindrome I extend the current palindrome.
In the following Haskell code, I list the arguments to a function below the function, since they don't fit on a single line anymore. I have also included some comments in the code, which are the pieces of text appearing after --.

> extendPalindrome 
>   a 
>   rightmost 
>   currentPalindrome 
>   currentMaximalPalindromes
>   | rightmost > last                                   =  
>       -- reached the end of the array
>       finalPalindromes 
>         currentPalindrome 
>         currentMaximalPalindromes 
>         (currentPalindrome:currentMaximalPalindromes)
>   | rightmost-currentPalindrome == first ||
>     a!rightmost /= a!(rightmost-currentPalindrome-1)   =  
>       -- the current palindrome extends to the start 
>       -- of the array, or it cannot be extended
>       moveCenter 
>         a 
>         rightmost 
>         (currentPalindrome:currentMaximalPalindromes) 
>         currentMaximalPalindromes 
>         currentPalindrome 
>   | otherwise                                          =  
>       -- the current palindrome can be extended
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         (currentPalindrome+2) 
>         currentMaximalPalindromes      
>   where  (first,last)  =  bounds a

In two of the three cases, function extendPalindrome finds a maximal palindrome, and goes on to the next center by means of function finalPalindromes or moveCenter. In the other case it extends the current palindrome, and moves the rightmost position one further to the right.

Function moveCenter moves the center around which the algorithm determines the maximal palindrome. In this function I make essential use of the palindromes in palindromes property. It takes the array as argument, the current rightmost position in the array, the list of maximal palindromes to be extended, the list of palindromes around centers before the center of the current palindrome, and the number of centers in between the center of the current palindrome and the rightmost position. It uses the palindromes in palindromes property to calculate the longest palindrome around the center after the center of the current palindrome.

  • If the last center is on the last element, there is no center in between the rightmost position and the center of the current palindrome. I call extendPalindrome with rightmost position one more than the previous position, and a current palindrome of length 1.
  • If the previous element in the list of maximal palindromes reaches exactly to the left end of the current palindrome, I use the palindromes in palindromes property of palindromes to find the next current palindrome using extendPalindrome.
  • In the other case, I have found the longest palindrome around a center, add that to the list of maximal palindromes, and proceed by moving the center one position, and calling moveCenter again. I only know that the previous element in the list of maximal palindromes does not reach exactly to the left end of the current palindrome, so it might be either shorter or longer. If it is longer, I need to cut off the new maximal palindrome found, so that it reaches exactly to the current rightmost position.

> moveCenter 
>   a 
>   rightmost 
>   currentMaximalPalindromes 
>   previousMaximalPalindromes 
>   nrOfCenters
>   | nrOfCenters == 0                                  =  
>       -- the last center is on the last element: 
>       -- try to extend the palindrome of length 1
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         1 
>         currentMaximalPalindromes
>   | nrOfCenters-1 == head previousMaximalPalindromes  =  
>       -- the previous maximal palindrome reaches 
>       -- exactly to the end of the last current 
>       -- palindrome. Use the palindromes in palindromes 
>       -- property to extend the current palindrome
>       extendPalindrome 
>         a 
>         rightmost 
>         (head previousMaximalPalindromes) 
>         currentMaximalPalindromes
>   | otherwise                                         =  
>       -- move the center one step. Add the length of 
>       -- the longest palindrome to the maximal
>       -- palindromes
>       moveCenter 
>         a 
>         rightmost 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes) 
>         (tail previousMaximalPalindromes) 
>         (nrOfCenters-1)

In the first case, function moveCenter moves the rightmost position one to the right. In the second case it calls extendPalindrome to find the maximal palindrome around the next center, and in the third case it adds a maximal palindrome to the list of maximal palindromes, and moves the center of the current palindromes one position to the right.

Function finalPalindromes calculates the lengths of the longest palindromes around the centers that come after the center of the current palindrome of the array. These palindromes are again obtained by using the palindromes in palindromes property. Function finalPalindromes is called when we have reached the end of the array, so it is impossible to extend a palindrome. We iterate over the list of maximal palindromes, and use the palindromes in palindromes property to find the maximal palindrome at the final centers. As in the function moveCenter, if the previous element in the list of maximal palindromes reaches before the left end of the current palindrome, I need to cut off the new maximal palindrome found, so that it reaches exactly to the end of the array.


> finalPalindromes 
>   nrOfCenters     
>   previousMaximalPalindromes 
>   currentMaximalPalindromes   
>   | nrOfCenters == 0  =  currentMaximalPalindromes
>   | otherwise         =
>       finalPalindromes 
>         (nrOfCenters-1) 
>         (tail previousMaximalPalindromes) 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes)

In each step, function finalPalindromes adds a maximal palindrome to the list of maximal palindromes, and moves on to the next center.

I have discussed the number of steps this algorithm takes for each function. At a global level, this algorithm either extends the current palindrome, and moves the rightmost position in the array, or it extends the list of lengths of maximal palindromes, and moves the center around which we determine the maximal palindrome. If the length of the input array is $n$, the number of steps the algorithm is $n$ for the number of moves of the rightmost position, plus $2n+1$ for the number of center positions. This is a linear-time algorithm.

Tuesday, May 22, 2012

Sotades

His name has been mentioned already a couple of times: Sotades reportedly wrote some of the first palindromes.

Sotades was an Ancient Greek poet, living in the third century before Christ. He lived in Alexandria during the reign of Ptolemy II Philadelphus (285 BC-246 BC). Ptolemy II was born in Kos in about 309 BC. As a youth, he enjoyed the best tutors. The practice of getting the best scholars or poets available to educate the crown prince was something that Ptolemy II's father (also called Ptolemy, which explains the II in his son's name), had the occasion to observe in Macedonia, where the young Alexander was taught by no less a figure than Aristotle himself. Ptolemy II and his father set up the Alexandrian University and Library, and attracted the most learned men of the times to Alexandria. Ptolemy II provided these scholars with a life free from want and taxes, allowing them to study, write, perform research, and lecture in their respective disciplines.

Ptomely II was married to Arsinoe (I), and got three children with her. He had also a sister called Arsinoe (II), who was married to a Greek king. Sister Arsinoe was a scheming woman. She had the oldest son of her husband, by a different woman, murdered to position her own children for the throne. When her husband was killed in the battle that ensued after the widow of the murdered son fled to a neighboring country, she married her half-brother (also called Ptolemy), who claimed the same throne as her husband had had. When her new husband became too powerful, she and her sons conspired against him. They were found out. Two sons got killed, and she and her eldest son (also called Ptolemy) fled to her full brother Ptolemy II in Egypt. In Eqypt she managed to get rid of Ptolemy's first wife Arsinoe I by accusing her of conspiring against her husband. Arsinoe I was sent in exile, and Arsinoe II married her brother.

Alexandria not only attracted scholars, but also poets and satirists, such as Sotades. One of his poems attacked Ptolemy II's marriage to his sister Arsinoe, from which came the infamous line: "You're sticking your prick in an unholy hole." For this, Sotades was imprisoned, but he escaped to the island of Kaunos, where he was captured by Ptolemy's admiral, shut up in a leaden chest, and thrown into the sea.

Sotades wrote obscene verses, in particular about the love between an older man and an adolescent boy. He invented his own metre for such verses, the sotadic metre. This metre spread around the world, and was used by other satirists to hint at an obscene or satirical interpretation. For example, Ennius wrote his `Sota' already in the same century in southern Italy. Sotades' verses could be read both forwards and backwards, where the backwards reading was the same, or contained an obscene interpretation. Unfortunately, none of the palindromic verses of Sotades survived.

Tuesday, May 15, 2012

Palindromes in palindromes

In a previous blog post, I develop a quadratic time algorithm for finding the length of the maximal palindromes around all centers of a string. Although it is a lot faster than the naive algorithm that finds all palindromes in a string, it is still not very useful when trying to find palindromes in DNA, or in a text as long as the Bible.

To design an efficient algorithm for finding palindromes, I reuse as much information as possible when calculating the length of maximal palindromes around centers. In this blog post, I will explain the central idea that allows me to reuse previously calculated maximal palindromes.

In the naive algorithm for finding the maximal palindromes around all centers of a string, I start with calculating the maximal palindrome around the first, leftmost, center, then the palindrome around the next center on the first character, then the palindrome around the center in between the first two characters, and so on. I calculate the palindromes from left to right. In the illustration below I depict a string with the letters blackened out. But you may assume that it represents the string "yabadabadoo". In this string, p is the maximal palindrome around center b. For example, p might be "abadaba", with b equal to $9$. I assume p is the longest palindrome that reaches to its rightmost position, so that no palindrome around a center before b reaches there.

I calculate the maximal palindromes from left to right, so I have already calculated the maximal palindrome around center a, which appears within palindrome p. I call the maximal palindrome around center a q. In the "yabadabadoo" example a might be center $5$, with "aba" as maximal palindrome q around it. In the illustration q is completely within p, but it might also extend to before p. It cannot extend after p, since then there would be a longer palindrome that extends to the rightmost position of p. Center a' is the center I get when mirroring center a with respect to center b: b plus the difference between b and a. This would be $13$ ($=9+(9-5)$) in our example. What is the maximal palindrome around a'?

Since p is a palindrome, the string with the same length as q around a' is reverse q, provided q is completely within p. If q extends to before p, we cut it off at the start of p, and remove an equally long part at the end of q. Since q is a palindrome, reverse q equals q, and is hence also a palindrome. I will call this palindrome q' to distinguish it from q. Is it the maximal palindrome around a'?

To find out whether or not q' is the maximal palindrome around a', I determine whether or not the characters before and after q' are the same. If the maximal palindrome around center a does not extend to the start of p, I know that the characters around q are different. Since the same characters appear around the palindrome q', it is the maximal palindrome around center a'. In this case I don't even have to look at the characters before and after q': the fact that q does not extend to the start of p gives me enough information to determine that q' is maximal. If the maximal palindrome around center q does extend to the start of p, as in our example, where "abadaba" starts with "aba", or even before, I have to compare the character before q' with the character after p. If these are different, q' is the maximal palindrome around a', otherwise, I have to investigate further to find out the maximal palindrome around a'. I will do this in the blog post that discusses an efficient algorithm for finding the length of all maximal palindromes in a string. In our example, "aba" is not the maximal palindrome around center a', since the character 'd' appears both before and after q'

Wednesday, May 9, 2012

The word `palindrome'

The word `palindrome' comes from the Greek παλινδρόμους. The word παλιν translates to `again', and the word δρόμους (or δρόμος) to `way' or `direction', and the combination is often translated to `running back again'. In the 17th century, the English poet Ben Johnson connected the word palindrome to the concept of a word or a sequence of words that reads, letter for letter, the same backwards as forwards. The palindrome concept existed long before the 17th century. Reportedly, the first recorded palindromes were written by Sotades in (again) Greece in the third century B.C.

`Palindrome' is an old word, which already appears in some early Greek texts. For example, it appears in the work Timon of Lucian of Samosata.

Lucian was a rhetorician and satirist living in the 2nd century A.D. He wrote more than eighty works, among which Timon, sometimes also translated as The Misanthrope. Shakespeare's play Timon of Athens has been influenced by the work of Lucian.

Timon of Athens was a citizen of Athens during the era of the Peloponnesian War between Athens and Sparta (431 BC–404 BC). He was wealthy and lavished his money on flattering friends. When funds ran out, friends deserted and Timon was reduced to working in the fields. One day, he found a pot of gold and soon his unreliable friends were back. This time, he drove them away with dirt clods. Timon became an angry despiser of mankind, and his reputation for misanthropy grew to legendary status.

He can't have been nice company: a typical sentence of the misanthropic Timon in Lucian's work is `Go your ways, then, Hermes, and take Plutus back to Zeus. I am quite content to let every man of them go hang.' Hermes is the messenger god, en Plutus the god of wealth. In Greek, this sentence reads: ὥστε παλίνδρομος ἄπιθι, ὦ Ἑρμῆ, τὸν Πλοῦτον ἐπανάγων τῷ Διί: ἐμοὶ δὲ τοῦτο ἱκανὸν ἦν, πάντας ἀνθρώπους ἡβηδὸν οἰμώζειν ποιῆσαι. The second word is the occurrence of `palindrome'. It is not translated by `running back again', but clearly something goes back, or returns, here.

Another occurrence of `palindrome' is found in Diogenes Laërtius' book `Lives of Eminent Philosophers', which was probably published somewhere in the 3rd century A.D. Laërtius introduces Aristippus in Chapter 8.

Aristippus was drawn to Athens by the fame of Socrates. He became a lecturer himself, and was one of the first to both pay and charge fees for lecturing. `And on one occasion the sum of twenty minae which he had sent was returned to him, Socrates declaring that the supernatural sign would not let him take it; the very offer, in fact, annoyed him.' In Greek, this reads: καί ποτε πέμψας αὐτῷ μνᾶς εἴκοσι παλινδρόμους ἀπέλαβεν, εἰπόντος Σωκράτους τὸ δαιμόνιον αὐτῷ μὴ ἐπιτρέπειν: ἐδυσχέραινε γὰρ ἐπὶ τούτῳ. The 7th word of this sentence is `palindrome' in Greek. Again, it is not translated exactly as `running back again', but something is returned.

Although the roots of the word `palindrome' are in Greek, the Greek themselves describe the palindrome concept by karkinikê epigrafê (καρκινικὴ επιγραφή; "crab inscription"), or simply karkinoi (καρκίνοι; "crabs"), alluding to the movement of crabs.

Thursday, May 3, 2012

A naive algorithm for finding maximal palindromes

The previous blog post discusses why it is sufficient to calculate the length of the maximal palindrome around each center position in a string if I want to find palindromes in a string. In this blog post I will describe a naive algorithm for finding the length of all maximal palindromes in a string. This algorithm is slightly less naive than the naive algorithm for finding palindromes given in an earlier blog post. The latter algorithm requires a supercomputer for a couple of days to find palindromes in a long string, the algorithm I will develop in this blogpost requires several days of computing power of the laptop on which I am typing this post. In numbers, this algorithm is about a milliard times faster on such a long string.

Given a string as input, I want to find the list of lengths of maximal palindromes around all centers of the string. I will use the function maximalPalindromes for this purpose. It has the following type


< maximalPalindromes :: String -> [Int]

I want to find the length of the maximal palindrome around each center in a string. I will do this by trying to extend the trivial palindromes consisting of either a single letter (for odd centers) or of the empty string (for even centers) around each center. To extend a palindrome, I have to compare the characters before and after the current palindrome. It would be helpful if I had random access into the string, so that looking up the character at a particular position in a string can be done in constant time. Since an array allows for constant time lookup, I change the input type of the maximalPalindromes function to an array. I have to import the module Data.Array to use arrays.

> import Data.Array
>
> maximalPalindromes :: Array Int Char -> [Int]

If I change my input type from strings to arrays, I have to convert an input string into an array. The Data.Array module contains a function just for this purpose: the function listArray creates an array from a string together with a pair of indices for the first and last position in the string. Function maximalPalindromes calculates the following lengths of maximal palindromes in the string "abb":

?maximalPalindromes (listArray (0,2) "abb")
[0,1,0,1,2,1,0]

Function maximalPalindromes calculates the length of maximal palindromes by first calculating all center positions of an input array, and then the length of the maximal palindrome around each of these centers. The center positions of an array with first and last as bounds are the elements in the list [0 .. 2*(last-first+1)]. I will always use arrays that start at index 0, which implies that the length of an array is last+1.


> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>       centers       =  [0 .. 2*(last-first+1)]
>   in  map (lengthPalindromeAround a) centers

Function lengthPalindromeAround takes an array and a center position, and calculates the length of the longest palindrome around that position. Center positions do not exactly correspond to indices in the array, since they not only describe locations of characters, but also locations in between characters. To obtain an array index from a center position I divide it by two. If the center position is even, it is in between two characters, and if I divide it by two, I get an array index that points at the character after the center position. I compare the character pointed to by the index with the character before it to find out if the (empty) palindrome can be extended. If it can, I subtract one from the starting position, and add one to the end position, and try to extend the palindrome with the new indices. If the palindrome cannot be extended, because the elements around it are different, or the current palindrome starts at the start or ends at the end of the array, I return the length of the maximal palindrome found, which is the difference between the end and the start position minus one. For odd center positions I can start with the indices before and after the character pointed to by the index obtained by dividing the center position by two, rounding down.

> lengthPalindromeAround  ::  Array Int Char -> Int -> Int
> lengthPalindromeAround a center 
>   | even center = 
>       lengthPalindrome (first+c-1) (first+c) 
>   | odd  center = 
>       lengthPalindrome (first+c-1) (first+c+1) 
>   where  c             =  div center 2
>          (first,last)  =  bounds a
>          lengthPalindrome start end  = 
>             if   start < 0 
>               || end > last-first 
>               || a!start /= a!end
>             then end-start-1
>             else lengthPalindrome (start-1) (end+1) 

For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.

Friday, April 27, 2012

Maximal palindromes

The algorithm for finding palindromic substrings described in my previous blog post has two major problems. First, if I want to find the palindromes in a string of substantial length, it will take a normal computer many years to calculate the result. I didn't mention the second problem explicitly in my previous blog post, but it is pretty obvious. The number of palindromic substrings returned by the algorithm is at least as large as the length of the string, since all single letters are palindromes, and there might be many more. If the string contains a palindrome of length $1,000,000$, then the algorithm will also return the palindrome of length $999,998$ obtained by removing the first and last element of the million letter palindrome, the palindromes of length $999,996$, $999,994$, and so on. This implies that the total number of palindromes occurring in a long string is huge, and I can easily drown in the palindromes returned.

Of the series of palindromes of length $1,000,000$, $999,998$, $999,996$ etc, all palindromes have the same center, and all shorter palindromes can be derived from the longest palindrome by removing equally many characters from the front and from the end. In the string "yabadabadoo", the palindromes around the center at the second occurrence of a 'b', are "b", "aba", and "dabad". "dabad" is the maximal palindrome around this center, since its extension "adabado" is not a palindrome. Maximality does not imply it is the longest palindromic substring, it only is the maximal palindrome around a particular center in the string. For example, in "yabadabadoo", "abadaba" is a longer palindrome, but with a different center. A center position is either on a letter, as in "dabad" on 'b', or in between two letters, as in "oo", where the center is in between the two o's. The string "yabadabadoo" has $23$ centers: one on each letter (of which there are eleven), one before each letter (another eleven), and one after the last letter of the string. If I assign $0$ to the center before the first letter, $1$ to the center on the first letter, $2$ to the center after the first letter, and so on, the maximal palindrome "dabad" has center $13$, and the maximal palindrome "abadaba" $9$. For a string of length $n$, there are $2n+1$ center positions in the string. Since we can derive all shorter palindromes around a center from the maximal palindrome around the center, it is sufficient to calculate the maximal palindromes around the centers in a string. So instead of calculating all palindromic substrings of a string, I calculate all maximal palindromic substrings of a string.

There are just as many maximal palindromes in a string as there are center positions, so there still are quite a few maximal palindromes. But the number of maximal palindromes might be substantially lower than the number of (not necessarily maximal) palindromic substrings in a string. Take for example the string of length $n$ just containing the character 'a'. The number of maximal palindromes is equal to the number of center positions in the string, $2n+1$. The total number of palindromes in the string is the number of substrings of the string, since all substrings consist of only 'a's, and hence all are palindromes. Since substrings is defined as concat . map tails . inits, I can calculate the total number of substrings if I know how many initial and final substrings appear in a string of length $n$. Function inits returns $n+1$ substrings, of length $0...n$, respectively. Similarly, for a string of length $n$, function tails returns $n+1$ substrings, also of length $0...n$. So the total number of substrings is: \[ \sum\limits_{i=0}^{n} (i+1) = \frac{n(n+1)}{2} + n+1 = \frac{(n+2)(n+1)}{2} = \frac{n^2}{2}+\frac{3}{2}n+1\] So in the string "aaaaaaaaaa" of length $10$, there are $21$ maximal palindromes, and $66$ palindromic substrings.

If I know the center and the length of a palindrome, I can recover the palindrome if I have the original string. For example, the maximal palindrome of length $5$ around center $13$ in the string "yabadabadoo" is the string "dabad", and the maximal palindrome of length $7$ around center $9$ is the string "abadaba". So to find maximal palindromes, it suffices to find their center and length.

Given a string, a center within the string, and a lengthp denoting the length of the maximal palindrome in the string around the center, I can use a Haskell program to check if the length is indeed the length of the maximal palindrome around the center in the string. To determine the substring denoted by a center and a length, I first split the string into the initial part of length div (center-lengthp) 2, the part before the maximal palindrome, and the following lengthp letters. Function div divides its first argument by its second argument, throwing away the remainder. Function splitAt takes a positive integer $n$ and a list, and splits the list into two lists. The first list contains the first $n$ letters, and the second list contains the rest of the letters. The lengthp letters after before should form a palindrome. The letters around it, the last letter of the part of the string before it and the head of the string after it, should be different to make it maximal. Functions last and head are predefined functions which return the last or head element of a list, respectively. The palindrome is maximal too if at least one of the strings before or after it is empty.


> maximalPalindrome string center lengthp  =
>   let (before,rest)  =  splitAt 
>                           (div (center-lengthp) 2) 
>                           string
>       (p,after)      =  splitAt lengthp rest
>   in    length p    ==  lengthp 
>      && odd center  ==  odd lengthp
>      && palindrome p 
>      && (  null before 
>         || null after 
>         || last before /= head after
>         )

The function maximalPalindrome uses two sanity checks. It first checks that there indeed exists a substring of length lengthp around center by means of length p == lengthp, where function length is a predefined function that returns the length of a list. The other sanity check checks that the maximal palindrome around an even center has even length, and a maximal palindrome around an odd center has odd length, by means of odd center == odd lengthp, which equals True only if both center and lengthp are odd or both are even. Function odd is a predefined function that determines whether or not a number is odd. The logic or operator || returns True if one of its arguments is True. If its left-hand side argument is True it doesn't evaluate its right-hand argument, which is just as well for the above definition, since evaluating last before if before is empty (null before) would lead to a undefined value.

In the next blog post I will show how to calculate the length of all maximal palindromes in a string.