Thursday, August 16, 2012

Finding palindromes in DNA

Palindromes play an important role in DNA, and many geneticists want to find palindromes in the DNA they have. To find palindromes in DNA, I cannot directly apply the algorithm for finding palindromes described in a previous blog post, since the equality used for palindromes in DNA differs from the normal equality. A string is a palindrome in DNA if each symbol in the reversed string is the negation of the symbol in the original string, where `A' is the negation of `T' and vice versa, and `C' is the negation of `G', and vice versa. The operator =:= introduced in my blog post on What is a palindrome II? A specification in Haskell implements this equality. By negating the result of =:= I get the inequality operator =/= on DNA symbols.

A first attempt to find palindromes in DNA is to adapt the various components to deal with DNA in the functions extendPalindrome, moveCenter, and finalPalindromes introduced in the blog post on finding palindromes efficiently. In the function extendPalindrome the expression a!... /= a!..., where the dots should be replaced by the appropriate indices in the input, is replaced by the expression a!... =/= a!... using the DNA inequality operator. This is sufficient to turn the function extendPalindrome into a function that finds palindromes in DNA. The functions moveCenter and finalPalindromes do not use the (in)equality operator, so you might think that this is all that is needed to obtain a program for finding palindromes in DNA.

There is one more location in the algorithm for finding palindromes where I use a property of palindromes that does not hold for DNA. In the function moveCenter I call function extendPalindrome with a new "current palindrome" of length one. But a DNA string of length one is not a palindrome, since if I reverse it, I get the same string, and to be a palindrome, the string has to contain the complement symbol. Every DNA string of odd length is not a palindrome, since when such a string is reversed, the symbol in the middle is compared against itself, and DNA equality fails. To find palindromes in a DNA string, I only have to look at the even positions in the string, which correspond to the positions in between symbols. The palindromes around odd positions on the symbols always have length zero.

I adapt the algorithm for efficiently finding palindromes to only find palindromes around even positions. Besides adapting the function extendPalindrome with a different inequality operator, I adapt the function moveCenter by calling extendPalindrome with a new longest current palindrome of length zero, and by counting the number of centers down with steps of size two instead of one, since I am not interested in the odd center positions anymore. These are all changes needed, and I will not give the resulting algorithm. The interested reader can find them in my software package for finding palindromes.

In spring 2012 I was contacted by an Indian scientist who was developing an algorithm for finding palindromes in DNA. I soon found out that her algorithm was similar to my naive, quadratic-time, algorithm for finding maximal palindromes. What worried me was that her algorithm seemed to run faster than my efficient algorithm for finding palindromes, adapted to finding palindromes in DNA. By thinking that my efficient algorithm would run faster, I had made the classical mistake of confusing worst-case time complexity with practice.

The worst-case time complexity of my efficient algorithm for finding palindromes is linear. The number of steps the algorithm takes is linear in the length of the input string. The worst case time-complexity of the naive algorithm is quadratic. A string in which all substrings are palindromes leads to "worst case" time behaviour. An example of a DNA string satisfying this property is a string looking like "ATATATAT...". For a string consisting of three million copies of "AT", the efficient algorithm returns the length of all maximal palindromes about equally fast as for the the Amycolatopsis marina, which is about equally long, but the quadratic time algorithm never finishes on the repetitions of ATs. Strings like "ATATATAT..." appear in DNA, but they are short. For example, the longest maximal palindrome in chromosome 18 consisting of just "ATAT..." has length 66. The longest overall maximal palindrome has length 76. The majority of the maximal palindromes is much shorter. The time-complexity of the naive algorithm for finding palindromes is the sum of the length of the palindromes found in the argument string. If the maximum length of a palindrome is 76, then the number of steps performed by the naive algorithm is at most 76 times the length of the input string. Since chromosome 18 consists of almost 75 million symbols, this is more than five and a half billion, which is quite a lot, but doable for a modern laptop. Indeed, if I measure the running time of the quadratic and linear-time solutions on chromosome 18, I find that the quadratic solution is about four to five times faster than the linear solution. To be precise, on my MacBook Pro with 2.4 GHz Intel Core i5 with 4GB memory, with very few other processes running, the linear-time solution requires slightly more than 30 minutes to analyse the almost 75 megabytes of chromosome 18, whereas the quadratic solution uses around 6 minutes. Why is it faster?

It is hard to exactly determine what one program makes faster than another. The linear time solution uses quite a few arguments, in particular the list of maximal palindromes found until now. This is the central idea to obtain linear-time behaviour, but it also requires constructing, passing on, inspecting, and deconstructing this list to determine maximal palindromes with centers to the right of the current longest tail palindrome. I have experimented with several kinds of representations for the list of maximal palindromes used as argument to the functions extendPalindrome and moveCenter, but none of these solutions can compare to the quadratic solution when looking at running times on large DNA strings. An important motivation for developing a linear-time algorithm for finding palindromes turns out not to be valid at all...

The quadratic-time solution for finding palindromes in DNA is very similar to the naive algorithm for finding maximal palindromes. There are three differences. First, the input to the function maximalPalindromes, which I now call qMaximalPalindromesDNA, is of type ByteString instead of Array Int Char. The type ByteString is a data structure specifically designed to efficiently deal with strings of which the length is known. It is very similar to the type String, but functions like read for reading strings, map for mapping a function over a string, length for determining the length of a stringdefined in the module ByteString, all are usually much faster than their equivalents on String, or even on Array Int Char. I write B.ByteString, B.length, and so on, to make explicit that these components come from the module ByteString:


> import qualified Data.ByteString as B

Second, in the function qMaximalPalindromesDNA I only consider positions in between symbols around which to find maximal palindromes, since palindromes in DNA have even length. The length of the list of centers has length one plus the length of the input, to account for the center at the end of the input.

> qMaximalPalindromesDNA :: B.ByteString -> [Int]
> qMaximalPalindromesDNA input = 
>   let lengthInput =  B.length input
>       centers     =  [0 .. lengthInput]
>       lastPos     =  lengthInput - 1
>   in map 
>        (\c -> lengthPalindromeAroundDNA 
>                 input 
>                 lastPos 
>                 (c-1) 
>                 c
>        ) 
>        centers

The third difference is that function lengthPalindromeAroundDNA doesn't distinguish between even-length and odd-length palindromes anymore, since all palindromes have even length, and that it needs to use the equality operator for DNA.

> lengthPalindromeAroundDNA input lastPos start end  
>   | start < 0 || end > lastPos                 =  end-start-1
>   | B.index input start =:= B.index input end  =  
>       lengthPalindromeAroundDNA 
>         input 
>         lastPos 
>         (start-1) 
>         (end+1) 
>   | otherwise                                  =  end-start-1

This solves the problem of finding palindromes in DNA. But I cannot yet use this algorithm to find the huge palindromes in the Y chromosome, since those palindromes have some noise in the middle, and the palindromic arms contain a few errors. I will look into this in a following blog post.

Sunday, July 22, 2012

Palindromes in DNA

DNA strings are often millions of letters long. For example, the copy I have of the DNA of the Amycolatopsis marina, a bacteria discovered in 2009, consists of $6,503,724$ characters.

Bacteria are of course some of the smallest living beings. The DNA in the copy I have of the human chromosome 18 is an astounding $74,657,229$ characters long. Chromosome 18 is one of the 23 chromosomes of human beings, and is estimated to contain in between $300$ and $500$ genes. A gene is responsible for part of our functioning as human beings. An example of a gene contained within chromosome 18 is the NPC1 gene, named after the Niemann-Pick disease, type C1. This gene is named after the disease you get when you have an error on the gene. Experiments with mice show that this gene probably controls the appetite, and people with a mutation on the NPC1 gene often suffer from obesitas. Interestingly, the same mutation on NPC1 might make you immune for the ebola virus, one of the deadliest known viruses. So the mutation is partly a blessing in disguise.

If I search for the keyword palindrome in the electronic publications available at the library of my university, I get more than $500$ hits. The first ten of these hits are all about palindromes in DNA. My guess is that at least $90$% of the list of publications are about palindromes in DNA. So what is the interest in palindromes in DNA? Surely, if your strings only contain `A', `T', `C', and `G' you are likely to get many palindromes?

Let us look at how many palindromes we expect to find in the DNA of the Amycolatopsis marina. Suppose I want to find palindromes of length twelve. I calculate the chance that an arbitrary DNA string of length twelve is a palindrome as follows. The first six characters of the string don't matter. The seventh character needs to match with the sixth character, for which we have a chance of one in four. Remember that in DNA, `A' matches with `T' and vice versa, but both `A' and `T' do not match with themselves, or with `C' and `G'. The eighth character needs to match with the fifth character, for which we also have a chance of one in four. This goes on until the twelfth character, so I get a chance of \[ \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} = (\frac{1}{4})^6 = \frac{1}{4^6} = \frac{1}{4096} \] Since the Amycolatopsis marina is $6,503,724$ characters long, there are $6,503,713$ substrings of length twelve. Multiplying this number with the chance that it is a palindrome, I expect to get $1,589$ palindromes. Using my algorithm for finding palindromes I get $1,784$ palindromes. This is slightly above $10$% more than expected, but that might be an accident. If I look at the palindromes of length fourteen in the $18$th human chromosome, using a similar calculation I expect to find $4,556$ palindromes. My algorithm finds $25,323$. More than five times as many palindromes as expected! This is not an accident anymore. Palindromes play a role in DNA. But what role?

Palindromes perform several tasks in human DNA. I will discuss one particularly intriguing task in this blog post.

Why do we humans have sex? I can answer this question from several perspectives. The most common answer will probably mention love, pleasure, or children. Looking at the question from a biological perspective, the answer discusses the biological advantages of having sex. Our children get their genes from two parents. They get two complete copies of human DNA, and merge these copies in some way to make them the way they are. In this merging process, damaged DNA from one parent can be replaced by functioning DNA from the other parent. There is a lot of DNA to repair: every day, each cell may be damaged at one million different places. Most of this damage is harmless, since a lot of DNA does not seem to have any function at all, but some kinds of damage may cause a lot of problems. Being able to repair non-functioning DNA when combining DNA from parents is essential for keeping the human race in a good state. The American geneticist Hermann Joseph Muller used this argument to explain why sexual reproduction is favored over asexual reproduction in organisms. When an organism reproduces asexually it passes all its DNA errors on to its offspring, without a possibility to repair them, eventually leading to the extinction of the population. The process has been dubbed Muller's ratchet, after the ratchet device, which can only turn in one direction. This is the theory, practice is slightly more complicated.

Muller's ratchet should already be at work in humans, since there is one human chromosome in which no combination of parental genes takes place: chromosome 23. Chromosome 23 determines whether we become a man or a woman. A man gets a copy of the chromosome called X from his mother and a copy called Y from his father. A woman gets an X from her mother and an X from her father. A woman can merge both X's and repair possible errors, but a man has two different copies of the chromosome, and has no possibility to combine, let alone repair, them. The Y is passed on from father to son, with no involvement of women. Muller's ratchet theory says the genes on the chromosome that make a man should deteriorate quickly, and men should soon become extinct.

There is no sign of men becoming extinct. Apparently there are some other mechanisms at work to repair DNA on the Y chromosome. If I cannot obtain a copy of some piece of DNA from my mother, maybe I can store copies of the DNA in the Y chromosome itself? If I maintain two copies, I can always check if a piece of DNA is correct by comparing it against the copy of this piece of DNA. This is the mechanism used by the Y chromosome, where the copies are stored as palindromes, with some noise in the middle of these palindromes. Such palindromes with gaps in the middle are often called inverted repeats in biology. The Y chromosome contains eight huge palindromes, the longest of which consists of almost three million characters. Around 25% of the Y chromosome consists of palindromes. The DNA in the palindromes carries genes for describing the male testes. So the mechanism by means of which men survive is called palindrome...

Sunday, July 1, 2012

Other implementations and solutions

Next year it is 25 years ago since I constructed an algorithm for finding palindromes efficiently. I was 22 years old then, and had just started as a PhD student. I was quite excited about having solved the problem of finding palindromes efficiently, and wrote a scientific paper about it. My excitement wasn't shared by the scientific community though. In the last twenty-five years this paper has been cited less than ten times, and appears at the bottom end of my most cited papers list.

In 2007 we developed the ICFP programming contest. The ICFP programing contest is a very challenging contest, in which thousands of programmers try to show off their programming talents in their programming language of choice. Our contest asked participants to transform the left picture below to the right picture using as few commands as possible. We included a problem related to palindromes in our contest, since this was my pet-problem ever since 1988. After the contest I explained how to solve the palindrome problem efficiently in a blog post.












When some years later I looked at the number of hits the contest pages received, I found that each month, thousands of people are reading the blog message about finding palindromes. Why does this page attract so many visitors?

The palindrome problem is a common question in interviews for jobs for software developers, so the blog post attracts software developers looking for a new job, and preparing themselves for an interview. Another reason the blog post attracts visitors is that I think quite a few people are genuinely interested in the problem and its solution. The concept of palindromes appears in almost all languages, which means that the question of finding palindromes is asked all over the world. The blog post indeed attracts visitors from all over the world. The last 100 (July 1, 2012) visitors come from all continents except Australia.

Many people ask the question of how to find palindromes, but also many people try to answer the question. You can find hundreds of solutions for finding palindromes on the internet. Some of these are variants of my linear-time solution, others are more naive quadratic or sometimes even cubic-time solutions. Below I give the list I found, ordered on the programming language used for the implementation. If there exists a linear-time implementation, I don't list less efficient solutions.

C The same linear-time solution as my blog post.
C++ 1 The same linear-time solution as my blog post. This post has an extensive description of the palindromes in palindromes property, including some pictures which try to explain it.
C++ 2 A C++ implementation of Manacher's algorithm.
C++ 3 The same linear-time solution as my blog post in C++ by Fernando Pelliccioni.
C# As far as I can see, this is a quadratic-time solution.
Factor A cubic-time solution.
F# A quadratic-time solution.
Go A quadratic-time solution, by Lars Björnfot.
Haskell Just as the program for finding palindromes described in my blog post, this blog post describes a linear-time program for finding palindromes in Haskell. The interesting aspect of this solution is that it returns results lazily: as soon as it finds a maximal palindrome, it writes it to the output.
Java 1 A quadratic-time solution.
Java 2 Another quadratic-time solution in Java, by Marcello de Sales.
Java 3 Yet another quadratic-time solution in Java, by Mohit Bhandari.
Matlab A quadratic-time solution, by Lalit Mohan.
PHP 1 As far as I can see this is a quadratic-time solution, by Marc Donaldson.
PHP 2 This is a quadratic- or cubic-time solution, by Joseba Bikandi. You can try it on-line.
Python The same linear-time solution as my blog post, developed by Fred Akalin. This post contains an extensive description of the palindromes in palindromes property too.
R This page describes the interface of functionality for finding palindromes in DNA strings. It doesn't say how efficient the software is.
Ruby 1 A quadratic-time solution, by Matthew Kerry.
Ruby 2 Another quadratic-time solution in Ruby, by Rick DeNatale. The post and the comments mention some more Ruby implementations.
Ruby 3 Yet another quadratic-time solution in Ruby, by Mitchell Fang.
Scheme A quadratic-time solution.
I could only find linear-time implementations for finding palindromes in C, C++, Haskell, and Python. Please let me know if you find better or alternative implementations for finding palindromes. This list easily gets outdated, so please keep me informed.

Friday, June 29, 2012

The Sator square

The Sator square is a square that reads the same forwards, backwards, upwards and downwards:

Sator
Arepo
Tenet
Opera
Rotas

The square has been found at many places around the world: on amulets, carved into walls or stones, written on disks used for extinguishing fire, and so on. Its first known appearance is in the ruins of Pompeii. The square was found in a building that was decorated in a style that became popular since the year 50. Pompeii was covered in ash from the Vesuvius in the year 79, and it follows that this example is almost 2000 years old. This makes the square one of the oldest known palindromes.

What does the Sator square mean? Except for `Arepo', scholars mostly agree on the meaning of the individual words. `Sator' is the seeder, sower, or begetter. It is also used as a metaphor for God, not necessarily Christian, but also Roman. `Tenet' is a verb, and means it, he or she holds. `Opera' is often considered to mean with effort. It is related to opus or opera, which means work. Finally, `Rotas' is generally considered a noun, meaning the wheels. But it might also be a verb meaning turn. Nobody has ever found a word in Roman, Greek, Etruscan, or any Indo-European language that explains `Arepo'. Some people think that it is a name, others that it stands for plough, misspelled to fit the palindrome, and yet others that it is just nonsense. For a long period, people thought the square had a Christian meaning by rearranging the letters into two Pater Nosters in the form of a cross. The Pre-Christian Pompeii find crushed this theory. If Arepo is a name, the Sator square might read: `The sower Arepo holds the wheels with effort', and if it stands for plough, you might get: `God holds the plough, but you turn the furrows'. According to John Cullen, this could have been a motto for farmers in Roman times. Most explanations I have read call the Sator square an ancient meme, with as much meaning as the sentence `all your base are belong to us'. In 1937, the Italian Antonio Ferrua probably gives the best explanation for what the square means: `Esattamente quello che si vuole'(!) E basta di questo argumento (it means exactly what you want it to mean. And so much for that argument!). The Sator square is an example of a meme that went viral long before the internet.

Here is an instance of the square I found in the library in Skara, Sweden, written by an unknown author probably in Stockholm, Sweden, in the year 1722. It starts with Rotas instead of Sator, just as the first appearances of the square. For some reason the version starting with Sator has become more popular.

Because it is a palindrome, the Sator square was thought to have many healing effects, curing snake-bites, headaches, jaundice, and many other illnesses. Medieval books mention the square as a cure for fever and insanity. Interestingly, although we now know that using words to cure an illness is of little help, we humans do use palindromes to repair our body. I will say more about this in a later blog post on palindromes in DNA.

The amount of human effort gone into explaining the meaning of the Sator square is unbelievable. Since 1889, when Francis John Haverfield described a find of the square in a Romano-Bristish building in Cirencester, there has been a steady stream of articles on the Sator square, and the total number of articles easily surpasses a hundred. Rose Mary Sheldon recently published a 54-page annotated biography of the literature on the Sator square: The Sator rebus: An unsolved cryptogram? Charles Douglas Gunn wrote his PhD thesis on the square at Yale in 1969. He suggests that the square was written by a Roman who wanted to take palindromic squares one step further from the misformed four-letter word square Roma tibi subito montibus ibit amor, meaning `For by my efforts you are about to reach Rome, the object of your travel'. He wrote software to generate all possible five-letter Latin word squares. These squares take up more than a hundred pages in his thesis. He concludes that the Sator square is the best.

There are not many other examples of squares like the Sator square, consisting of five words that together constitute a meaningful palindrome. Piet Burger discovered the following Dutch square in 1990.

Sneed
nette
etste
etten
deens

Its meaning is so far-fetched that I won't try to give an English translation.

Thursday, May 31, 2012

Finding palindromes efficiently

Using the palindromes in palindromes property, I now develop an algorithm for finding palindromes that requires a number of steps approximately equal to the length of its input. This linear-time algorithm can be used to find palindromes in documents of any size, even in DNA strings. Finding palindromes in a string of length $20,000,000$ using this algorithm requires a number of seconds on a modern laptop. It is impossible to find palindromes significantly faster, unless you have a machine with multiple cores, and use a parallel algorithm.

The program for efficiently finding palindromes is only about 25 lines long, but for formatting purposes it extends to over 80 lines in this blog post. Although the program is short, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The naive algorithm for finding palindromes discussed in a previous blog post uses the following top-level definition for finding maximal palindromes:


< maximalPalindromes :: Array Int Char -> [Int]
< maximalPalindromes a  =   
<   let (first,last)  =  bounds a
<       centers       =  [0 .. 2*(last-first+1)]
<   in  map (lengthPalindromeAround a) centers

The reason why this algorithm is naive is that lengthPalindromeAround calculates the maximal palindrome around a center independently of the palindromes calculated previously. I now change this by calculating the maximal palindromes from left to right around the centers of a string. In this calculation I either extend a palindrome around a center, or I move the center around which I determine the maximal palindrome rightwards because I have found a maximal palindrome around a center. So I replace the above definition by

> maximalPalindromes :: Array Int Char -> [Int]
> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>   in  reverse (extendPalindrome a first 0 [])

Before I introduce and explain function extendPalindrome, I give an example of how the algorithm works. Suppose I want to find the maximal palindromes in the string "yabadabadoo". The algorithm starts by finding the maximal palindrome around the position in front of the string, which cannot be anything else than the empty string. It moves the position around which to find the maximal palindrome one step to point to the `y'. The maximal palindrome around this position is "y", since there is no character in front of it. It again moves the position around which to find palindromes one step to point to the position in between `y' and `a'. Since `y' and `a' are different, the maximal palindrome around this position is the empty string. Moving the center to `a', it finds that "a" is the maximal palindrome around this center, since `y' and `b' are different. The maximal palindrome around the next center in between `a' and `b' is again the empty string. Moving the center to `b', it can extend the current longest palindrome "b" around this center, since both before and after `b' it finds an `a'. It cannot further extend the palindrome "aba", since `y' and `d' are different. To determine the maximal palindrome around the center in between `b' and `a', the next center position, it uses the fact that "aba" is a palindrome, and that it already knows that the maximal palindrome around the center in between `a' and `b' is the empty string. Using the palindromes in palindromes property, it finds that the maximal palindrome around the position in between `b' and `a' is also the empty string, without having to look at the `b' and the `a'. To determine the maximal palindrome around the next center position on the last `a' of "aba", it has to determine if `d' equals `b', which it doesn't of course. Also here it uses the palindromes in palindromes property. Since "a" is the maximal palindrome around the center of the first `a' in "aba", and it reaches until the start of the palindrome "aba", I have to determine if the palindrome "a" around the second `a' can be extended. I won't describe all steps extendPalindrome takes in detail, but only give one more detail I already described above: the second occurrence of the palindrome "aba" in "yabadabadoo" is not found by extending the palindrome around its center, but by using the palindromes in palindromes property to find "aba" a second time in "abadaba".

Function extendPalindrome takes four arguments. The first argument is the array a in which we are determining maximal palindromes. The second argument is the position in the array directly after the longest palindrome around the current center (the longest palindrome around the center before the first symbol has length $0$, so the position directly after the empty palindrome around the first center is the first position in the array). I will call this the current rightmost position. The third argument is the length of the current longest palindrome around that center (starting with 0), and the fourth and final argument is a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order (starting with the empty list []). It returns the list of lengths of maximal palindromes around the centers of the array, in reverse order. Applying function reverse to the result gives the maximal palindromes in the right order. The function extendPalindrome maintains the invariant that the current palindrome is the longest palindrome that reaches until the current rightmost position.

There are three cases to be considered in function extendPalindrome.

  • If the current position is after the end of the array, so rightmost is greater than last, I cannot extend the current palindrome anymore, and it follows that it is maximal. This corresponds to the null after case in the definition of maximal palindromes. It only remains to find the maximal palindromes around the centers between the current center and the end of the array, for which I use the function finalPalindromes.
  • If the current palindrome extends to the start of the array, or it cannot be extended, it is also maximal, and I add it to the list of maximal palindromes found. This case corresponds to the null after || last before /= head after case in the definition of maximal palindromes. I then determine the maximal palindrome around the following center by means of the function moveCenter.
  • If the element at the current rightmost position in the array equals the element before the current palindrome I extend the current palindrome.
In the following Haskell code, I list the arguments to a function below the function, since they don't fit on a single line anymore. I have also included some comments in the code, which are the pieces of text appearing after --.

> extendPalindrome 
>   a 
>   rightmost 
>   currentPalindrome 
>   currentMaximalPalindromes
>   | rightmost > last                                   =  
>       -- reached the end of the array
>       finalPalindromes 
>         currentPalindrome 
>         currentMaximalPalindromes 
>         (currentPalindrome:currentMaximalPalindromes)
>   | rightmost-currentPalindrome == first ||
>     a!rightmost /= a!(rightmost-currentPalindrome-1)   =  
>       -- the current palindrome extends to the start 
>       -- of the array, or it cannot be extended
>       moveCenter 
>         a 
>         rightmost 
>         (currentPalindrome:currentMaximalPalindromes) 
>         currentMaximalPalindromes 
>         currentPalindrome 
>   | otherwise                                          =  
>       -- the current palindrome can be extended
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         (currentPalindrome+2) 
>         currentMaximalPalindromes      
>   where  (first,last)  =  bounds a

In two of the three cases, function extendPalindrome finds a maximal palindrome, and goes on to the next center by means of function finalPalindromes or moveCenter. In the other case it extends the current palindrome, and moves the rightmost position one further to the right.

Function moveCenter moves the center around which the algorithm determines the maximal palindrome. In this function I make essential use of the palindromes in palindromes property. It takes the array as argument, the current rightmost position in the array, the list of maximal palindromes to be extended, the list of palindromes around centers before the center of the current palindrome, and the number of centers in between the center of the current palindrome and the rightmost position. It uses the palindromes in palindromes property to calculate the longest palindrome around the center after the center of the current palindrome.

  • If the last center is on the last element, there is no center in between the rightmost position and the center of the current palindrome. I call extendPalindrome with rightmost position one more than the previous position, and a current palindrome of length 1.
  • If the previous element in the list of maximal palindromes reaches exactly to the left end of the current palindrome, I use the palindromes in palindromes property of palindromes to find the next current palindrome using extendPalindrome.
  • In the other case, I have found the longest palindrome around a center, add that to the list of maximal palindromes, and proceed by moving the center one position, and calling moveCenter again. I only know that the previous element in the list of maximal palindromes does not reach exactly to the left end of the current palindrome, so it might be either shorter or longer. If it is longer, I need to cut off the new maximal palindrome found, so that it reaches exactly to the current rightmost position.

> moveCenter 
>   a 
>   rightmost 
>   currentMaximalPalindromes 
>   previousMaximalPalindromes 
>   nrOfCenters
>   | nrOfCenters == 0                                  =  
>       -- the last center is on the last element: 
>       -- try to extend the palindrome of length 1
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         1 
>         currentMaximalPalindromes
>   | nrOfCenters-1 == head previousMaximalPalindromes  =  
>       -- the previous maximal palindrome reaches 
>       -- exactly to the end of the last current 
>       -- palindrome. Use the palindromes in palindromes 
>       -- property to extend the current palindrome
>       extendPalindrome 
>         a 
>         rightmost 
>         (head previousMaximalPalindromes) 
>         currentMaximalPalindromes
>   | otherwise                                         =  
>       -- move the center one step. Add the length of 
>       -- the longest palindrome to the maximal
>       -- palindromes
>       moveCenter 
>         a 
>         rightmost 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes) 
>         (tail previousMaximalPalindromes) 
>         (nrOfCenters-1)

In the first case, function moveCenter moves the rightmost position one to the right. In the second case it calls extendPalindrome to find the maximal palindrome around the next center, and in the third case it adds a maximal palindrome to the list of maximal palindromes, and moves the center of the current palindromes one position to the right.

Function finalPalindromes calculates the lengths of the longest palindromes around the centers that come after the center of the current palindrome of the array. These palindromes are again obtained by using the palindromes in palindromes property. Function finalPalindromes is called when we have reached the end of the array, so it is impossible to extend a palindrome. We iterate over the list of maximal palindromes, and use the palindromes in palindromes property to find the maximal palindrome at the final centers. As in the function moveCenter, if the previous element in the list of maximal palindromes reaches before the left end of the current palindrome, I need to cut off the new maximal palindrome found, so that it reaches exactly to the end of the array.


> finalPalindromes 
>   nrOfCenters     
>   previousMaximalPalindromes 
>   currentMaximalPalindromes   
>   | nrOfCenters == 0  =  currentMaximalPalindromes
>   | otherwise         =
>       finalPalindromes 
>         (nrOfCenters-1) 
>         (tail previousMaximalPalindromes) 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes)

In each step, function finalPalindromes adds a maximal palindrome to the list of maximal palindromes, and moves on to the next center.

I have discussed the number of steps this algorithm takes for each function. At a global level, this algorithm either extends the current palindrome, and moves the rightmost position in the array, or it extends the list of lengths of maximal palindromes, and moves the center around which we determine the maximal palindrome. If the length of the input array is $n$, the number of steps the algorithm is $n$ for the number of moves of the rightmost position, plus $2n+1$ for the number of center positions. This is a linear-time algorithm.

Tuesday, May 22, 2012

Sotades

His name has been mentioned already a couple of times: Sotades reportedly wrote some of the first palindromes.

Sotades was an Ancient Greek poet, living in the third century before Christ. He lived in Alexandria during the reign of Ptolemy II Philadelphus (285 BC-246 BC). Ptolemy II was born in Kos in about 309 BC. As a youth, he enjoyed the best tutors. The practice of getting the best scholars or poets available to educate the crown prince was something that Ptolemy II's father (also called Ptolemy, which explains the II in his son's name), had the occasion to observe in Macedonia, where the young Alexander was taught by no less a figure than Aristotle himself. Ptolemy II and his father set up the Alexandrian University and Library, and attracted the most learned men of the times to Alexandria. Ptolemy II provided these scholars with a life free from want and taxes, allowing them to study, write, perform research, and lecture in their respective disciplines.

Ptomely II was married to Arsinoe (I), and got three children with her. He had also a sister called Arsinoe (II), who was married to a Greek king. Sister Arsinoe was a scheming woman. She had the oldest son of her husband, by a different woman, murdered to position her own children for the throne. When her husband was killed in the battle that ensued after the widow of the murdered son fled to a neighboring country, she married her half-brother (also called Ptolemy), who claimed the same throne as her husband had had. When her new husband became too powerful, she and her sons conspired against him. They were found out. Two sons got killed, and she and her eldest son (also called Ptolemy) fled to her full brother Ptolemy II in Egypt. In Eqypt she managed to get rid of Ptolemy's first wife Arsinoe I by accusing her of conspiring against her husband. Arsinoe I was sent in exile, and Arsinoe II married her brother.

Alexandria not only attracted scholars, but also poets and satirists, such as Sotades. One of his poems attacked Ptolemy II's marriage to his sister Arsinoe, from which came the infamous line: "You're sticking your prick in an unholy hole." For this, Sotades was imprisoned, but he escaped to the island of Kaunos, where he was captured by Ptolemy's admiral, shut up in a leaden chest, and thrown into the sea.

Sotades wrote obscene verses, in particular about the love between an older man and an adolescent boy. He invented his own metre for such verses, the sotadic metre. This metre spread around the world, and was used by other satirists to hint at an obscene or satirical interpretation. For example, Ennius wrote his `Sota' already in the same century in southern Italy. Sotades' verses could be read both forwards and backwards, where the backwards reading was the same, or contained an obscene interpretation. Unfortunately, none of the palindromic verses of Sotades survived.

Tuesday, May 15, 2012

Palindromes in palindromes

In a previous blog post, I develop a quadratic time algorithm for finding the length of the maximal palindromes around all centers of a string. Although it is a lot faster than the naive algorithm that finds all palindromes in a string, it is still not very useful when trying to find palindromes in DNA, or in a text as long as the Bible.

To design an efficient algorithm for finding palindromes, I reuse as much information as possible when calculating the length of maximal palindromes around centers. In this blog post, I will explain the central idea that allows me to reuse previously calculated maximal palindromes.

In the naive algorithm for finding the maximal palindromes around all centers of a string, I start with calculating the maximal palindrome around the first, leftmost, center, then the palindrome around the next center on the first character, then the palindrome around the center in between the first two characters, and so on. I calculate the palindromes from left to right. In the illustration below I depict a string with the letters blackened out. But you may assume that it represents the string "yabadabadoo". In this string, p is the maximal palindrome around center b. For example, p might be "abadaba", with b equal to $9$. I assume p is the longest palindrome that reaches to its rightmost position, so that no palindrome around a center before b reaches there.

I calculate the maximal palindromes from left to right, so I have already calculated the maximal palindrome around center a, which appears within palindrome p. I call the maximal palindrome around center a q. In the "yabadabadoo" example a might be center $5$, with "aba" as maximal palindrome q around it. In the illustration q is completely within p, but it might also extend to before p. It cannot extend after p, since then there would be a longer palindrome that extends to the rightmost position of p. Center a' is the center I get when mirroring center a with respect to center b: b plus the difference between b and a. This would be $13$ ($=9+(9-5)$) in our example. What is the maximal palindrome around a'?

Since p is a palindrome, the string with the same length as q around a' is reverse q, provided q is completely within p. If q extends to before p, we cut it off at the start of p, and remove an equally long part at the end of q. Since q is a palindrome, reverse q equals q, and is hence also a palindrome. I will call this palindrome q' to distinguish it from q. Is it the maximal palindrome around a'?

To find out whether or not q' is the maximal palindrome around a', I determine whether or not the characters before and after q' are the same. If the maximal palindrome around center a does not extend to the start of p, I know that the characters around q are different. Since the same characters appear around the palindrome q', it is the maximal palindrome around center a'. In this case I don't even have to look at the characters before and after q': the fact that q does not extend to the start of p gives me enough information to determine that q' is maximal. If the maximal palindrome around center q does extend to the start of p, as in our example, where "abadaba" starts with "aba", or even before, I have to compare the character before q' with the character after p. If these are different, q' is the maximal palindrome around a', otherwise, I have to investigate further to find out the maximal palindrome around a'. I will do this in the blog post that discusses an efficient algorithm for finding the length of all maximal palindromes in a string. In our example, "aba" is not the maximal palindrome around center a', since the character 'd' appears both before and after q'