Tuesday, May 15, 2012

Palindromes in palindromes

In a previous blog post, I develop a quadratic time algorithm for finding the length of the maximal palindromes around all centers of a string. Although it is a lot faster than the naive algorithm that finds all palindromes in a string, it is still not very useful when trying to find palindromes in DNA, or in a text as long as the Bible.

To design an efficient algorithm for finding palindromes, I reuse as much information as possible when calculating the length of maximal palindromes around centers. In this blog post, I will explain the central idea that allows me to reuse previously calculated maximal palindromes.

In the naive algorithm for finding the maximal palindromes around all centers of a string, I start with calculating the maximal palindrome around the first, leftmost, center, then the palindrome around the next center on the first character, then the palindrome around the center in between the first two characters, and so on. I calculate the palindromes from left to right. In the illustration below I depict a string with the letters blackened out. But you may assume that it represents the string "yabadabadoo". In this string, p is the maximal palindrome around center b. For example, p might be "abadaba", with b equal to $9$. I assume p is the longest palindrome that reaches to its rightmost position, so that no palindrome around a center before b reaches there.

I calculate the maximal palindromes from left to right, so I have already calculated the maximal palindrome around center a, which appears within palindrome p. I call the maximal palindrome around center a q. In the "yabadabadoo" example a might be center $5$, with "aba" as maximal palindrome q around it. In the illustration q is completely within p, but it might also extend to before p. It cannot extend after p, since then there would be a longer palindrome that extends to the rightmost position of p. Center a' is the center I get when mirroring center a with respect to center b: b plus the difference between b and a. This would be $13$ ($=9+(9-5)$) in our example. What is the maximal palindrome around a'?

Since p is a palindrome, the string with the same length as q around a' is reverse q, provided q is completely within p. If q extends to before p, we cut it off at the start of p, and remove an equally long part at the end of q. Since q is a palindrome, reverse q equals q, and is hence also a palindrome. I will call this palindrome q' to distinguish it from q. Is it the maximal palindrome around a'?

To find out whether or not q' is the maximal palindrome around a', I determine whether or not the characters before and after q' are the same. If the maximal palindrome around center a does not extend to the start of p, I know that the characters around q are different. Since the same characters appear around the palindrome q', it is the maximal palindrome around center a'. In this case I don't even have to look at the characters before and after q': the fact that q does not extend to the start of p gives me enough information to determine that q' is maximal. If the maximal palindrome around center a does extend to the start of p, as in our example, where "abadaba" starts with "aba", or even before, I have to compare the character before q' with the character after p. If these are different, q' is the maximal palindrome around a'; otherwise, I have to investigate further to find the maximal palindrome around a'. I will do this in the blog post that discusses an efficient algorithm for finding the length of all maximal palindromes in a string. In our example, "aba" is not the maximal palindrome around center a', since the character 'd' appears both before and after q'.
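The mirroring argument can be checked on the "yabadabadoo" example with a small sketch. The names lengthAround, mirror and mirroredLength are my own illustrative choices and not part of the efficient algorithm. The sketch works on plain lists for brevity, and mirroredLength assumes that center a lies within p, at or before center b.

```haskell
-- Naive length of the maximal palindrome around a center (0 .. 2n),
-- computed directly on a list for simplicity.
lengthAround :: String -> Int -> Int
lengthAround s center = go start end
  where
    n     = length s
    c     = center `div` 2
    start = c - 1
    end   = if even center then c else c + 1
    go i j
      | i < 0 || j >= n || s !! i /= s !! j = j - i - 1
      | otherwise                           = go (i - 1) (j + 1)

-- Mirror center a with respect to center b.
mirror :: Int -> Int -> Int
mirror b a = b + (b - a)

-- Length of q': the palindrome q around a (of length lenQ), cut off at
-- the start of the palindrome p around b (of length lenP).
-- Assumes a lies within p and a <= b.
mirroredLength :: Int -> Int -> Int -> Int -> Int
mirroredLength b lenP a lenQ = min lenQ (lenP - (b - a))
```

With b equal to 9 and a equal to 5, mirror 9 5 gives 13 and mirroredLength 9 7 5 3 gives 3, while lengthAround "yabadabadoo" 13 gives 5: q reaches the start of p, so q' is only a lower bound. With a equal to 7, q has length 1 and does not reach the start of p, and the mirrored length 1 is exactly the length of the maximal palindrome around center 11.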

Wednesday, May 9, 2012

The word `palindrome'

The word `palindrome' comes from the Greek παλινδρόμους. The word παλιν translates to `again', and the word δρόμους (or δρόμος) to `way' or `direction', and the combination is often translated to `running back again'. In the 17th century, the English poet Ben Jonson connected the word palindrome to the concept of a word or a sequence of words that reads, letter for letter, the same backwards as forwards. The palindrome concept existed long before the 17th century. Reportedly, the first recorded palindromes were written by Sotades in (again) Greece in the third century B.C.

`Palindrome' is an old word, which already appears in some early Greek texts. For example, it appears in the work Timon of Lucian of Samosata.

Lucian was a rhetorician and satirist living in the 2nd century A.D. He wrote more than eighty works, among which Timon, sometimes also translated as The Misanthrope. Shakespeare's play Timon of Athens has been influenced by the work of Lucian.

Timon of Athens was a citizen of Athens during the era of the Peloponnesian War between Athens and Sparta (431 BC–404 BC). He was wealthy and lavished his money on flattering friends. When funds ran out, friends deserted and Timon was reduced to working in the fields. One day, he found a pot of gold and soon his unreliable friends were back. This time, he drove them away with dirt clods. Timon became an angry despiser of mankind, and his reputation for misanthropy grew to legendary status.

He can't have been nice company: a typical sentence of the misanthropic Timon in Lucian's work is `Go your ways, then, Hermes, and take Plutus back to Zeus. I am quite content to let every man of them go hang.' Hermes is the messenger god, and Plutus the god of wealth. In Greek, this sentence reads: ὥστε παλίνδρομος ἄπιθι, ὦ Ἑρμῆ, τὸν Πλοῦτον ἐπανάγων τῷ Διί: ἐμοὶ δὲ τοῦτο ἱκανὸν ἦν, πάντας ἀνθρώπους ἡβηδὸν οἰμώζειν ποιῆσαι. The second word is the occurrence of `palindrome'. It is not translated by `running back again', but clearly something goes back, or returns, here.

Another occurrence of `palindrome' is found in Diogenes Laërtius' book `Lives of Eminent Philosophers', which was probably published somewhere in the 3rd century A.D. Laërtius introduces Aristippus in Chapter 8.

Aristippus was drawn to Athens by the fame of Socrates. He became a lecturer himself, and was one of the first to both pay and charge fees for lecturing. `And on one occasion the sum of twenty minae which he had sent was returned to him, Socrates declaring that the supernatural sign would not let him take it; the very offer, in fact, annoyed him.' In Greek, this reads: καί ποτε πέμψας αὐτῷ μνᾶς εἴκοσι παλινδρόμους ἀπέλαβεν, εἰπόντος Σωκράτους τὸ δαιμόνιον αὐτῷ μὴ ἐπιτρέπειν: ἐδυσχέραινε γὰρ ἐπὶ τούτῳ. The 7th word of this sentence is `palindrome' in Greek. Again, it is not translated exactly as `running back again', but something is returned.

Although the roots of the word `palindrome' are in Greek, the Greeks themselves describe the palindrome concept by karkinikê epigrafê (καρκινικὴ επιγραφή; "crab inscription"), or simply karkinoi (καρκίνοι; "crabs"), alluding to the movement of crabs.

Thursday, May 3, 2012

A naive algorithm for finding maximal palindromes

The previous blog post discusses why it is sufficient to calculate the length of the maximal palindrome around each center position in a string if I want to find palindromes in a string. In this blog post I will describe a naive algorithm for finding the length of all maximal palindromes in a string. This algorithm is slightly less naive than the naive algorithm for finding palindromes given in an earlier blog post. The latter algorithm requires a supercomputer for a couple of days to find palindromes in a long string; the algorithm I will develop in this blog post requires several days of computing power of the laptop on which I am typing this post. In numbers, this algorithm is about a milliard times faster on such a long string.

Given a string as input, I want to find the list of lengths of maximal palindromes around all centers of the string. I will use the function maximalPalindromes for this purpose. It has the following type:


< maximalPalindromes :: String -> [Int]

I want to find the length of the maximal palindrome around each center in a string. I will do this by trying to extend the trivial palindromes consisting of either a single letter (for odd centers) or of the empty string (for even centers) around each center. To extend a palindrome, I have to compare the characters before and after the current palindrome. It would be helpful if I had random access into the string, so that looking up the character at a particular position in a string can be done in constant time. Since an array allows for constant time lookup, I change the input type of the maximalPalindromes function to an array. I have to import the module Data.Array to use arrays.

> import Data.Array
>
> maximalPalindromes :: Array Int Char -> [Int]

If I change my input type from strings to arrays, I have to convert an input string into an array. The Data.Array module contains a function just for this purpose: the function listArray creates an array from a string together with a pair of indices for the first and last position in the string. Function maximalPalindromes calculates the following lengths of maximal palindromes in the string "abb":

?maximalPalindromes (listArray (0,2) "abb")
[0,1,0,1,2,1,0]
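Writing the bounds by hand is error-prone for longer strings. A minimal sketch of a conversion helper follows; the name stringToArray is my own and is not used elsewhere in this post. It assumes the 0-based indexing used throughout.

```haskell
import Data.Array

-- Hypothetical helper: build a 0-based array from a string.
-- For the empty string the bounds are (0,-1), which denotes an empty array.
stringToArray :: String -> Array Int Char
stringToArray s = listArray (0, length s - 1) s
```

With this helper, stringToArray "abb" is the same array as listArray (0,2) "abb".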

Function maximalPalindromes calculates the length of maximal palindromes by first calculating all center positions of an input array, and then the length of the maximal palindrome around each of these centers. The center positions of an array with first and last as bounds are the elements in the list [0 .. 2*(last-first+1)]. I will always use arrays that start at index 0, which implies that the length of an array is last+1.


> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>       centers       =  [0 .. 2*(last-first+1)]
>   in  map (lengthPalindromeAround a) centers

Function lengthPalindromeAround takes an array and a center position, and calculates the length of the longest palindrome around that position. Center positions do not exactly correspond to indices in the array, since they not only describe locations of characters, but also locations in between characters. To obtain an array index from a center position I divide it by two. If the center position is even, it is in between two characters, and if I divide it by two, I get an array index that points at the character after the center position. I compare the character pointed to by the index with the character before it to find out if the (empty) palindrome can be extended. If it can, I subtract one from the starting position, and add one to the end position, and try to extend the palindrome with the new indices. If the palindrome cannot be extended, because the elements around it are different, or the current palindrome starts at the start or ends at the end of the array, I return the length of the maximal palindrome found, which is the difference between the end and the start position minus one. For odd center positions I can start with the indices before and after the character pointed to by the index obtained by dividing the center position by two, rounding down.

> lengthPalindromeAround  ::  Array Int Char -> Int -> Int
> lengthPalindromeAround a center 
>   | even center = 
>       lengthPalindrome (first+c-1) (first+c) 
>   | odd  center = 
>       lengthPalindrome (first+c-1) (first+c+1) 
>   where  c             =  div center 2
>          (first,last)  =  bounds a
>          lengthPalindrome start end  = 
>             if   start < first 
>               || end > last 
>               || a!start /= a!end
>             then end-start-1
>             else lengthPalindrome (start-1) (end+1) 

For each position, this function may take a number of steps linear in the length of the array, so this is a quadratic-time algorithm.

Friday, April 27, 2012

Maximal palindromes

The algorithm for finding palindromic substrings described in my previous blog post has two major problems. First, if I want to find the palindromes in a string of substantial length, it will take a normal computer many years to calculate the result. I didn't mention the second problem explicitly in my previous blog post, but it is pretty obvious. The number of palindromic substrings returned by the algorithm is at least as large as the length of the string, since all single letters are palindromes, and there might be many more. If the string contains a palindrome of length $1,000,000$, then the algorithm will also return the palindrome of length $999,998$ obtained by removing the first and last element of the million letter palindrome, the palindromes of length $999,996$, $999,994$, and so on. This implies that the total number of palindromes occurring in a long string is huge, and I can easily drown in the palindromes returned.

Of the series of palindromes of length $1,000,000$, $999,998$, $999,996$ etc, all palindromes have the same center, and all shorter palindromes can be derived from the longest palindrome by removing equally many characters from the front and from the end. In the string "yabadabadoo", the palindromes around the center at the second occurrence of a 'b', are "b", "aba", and "dabad". "dabad" is the maximal palindrome around this center, since its extension "adabado" is not a palindrome. Maximality does not imply it is the longest palindromic substring, it only is the maximal palindrome around a particular center in the string. For example, in "yabadabadoo", "abadaba" is a longer palindrome, but with a different center. A center position is either on a letter, as in "dabad" on 'b', or in between two letters, as in "oo", where the center is in between the two o's. The string "yabadabadoo" has $23$ centers: one on each letter (of which there are eleven), one before each letter (another eleven), and one after the last letter of the string. If I assign $0$ to the center before the first letter, $1$ to the center on the first letter, $2$ to the center after the first letter, and so on, the maximal palindrome "dabad" has center $13$, and the maximal palindrome "abadaba" $9$. For a string of length $n$, there are $2n+1$ center positions in the string. Since we can derive all shorter palindromes around a center from the maximal palindrome around the center, it is sufficient to calculate the maximal palindromes around the centers in a string. So instead of calculating all palindromic substrings of a string, I calculate all maximal palindromic substrings of a string.

There are just as many maximal palindromes in a string as there are center positions, so there still are quite a few maximal palindromes. But the number of maximal palindromes might be substantially lower than the number of (not necessarily maximal) palindromic substrings in a string. Take for example the string of length $n$ just containing the character 'a'. The number of maximal palindromes is equal to the number of center positions in the string, $2n+1$. The total number of palindromes in the string is the number of substrings of the string, since all substrings consist of only 'a's, and hence all are palindromes. Since substrings is defined as concat . map tails . inits, I can calculate the total number of substrings if I know how many initial and final substrings appear in a string of length $n$. Function inits returns $n+1$ substrings, of length $0...n$, respectively. Similarly, for a string of length $n$, function tails returns $n+1$ substrings, also of length $0...n$. So the total number of substrings is: \[ \sum\limits_{i=0}^{n} (i+1) = \frac{n(n+1)}{2} + n+1 = \frac{(n+2)(n+1)}{2} = \frac{n^2}{2}+\frac{3}{2}n+1\] So in the string "aaaaaaaaaa" of length $10$, there are $21$ maximal palindromes, and $66$ palindromic substrings.
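The two counts can be checked with a short sketch. It uses inits and tails from Data.List instead of the definitions given in the earlier post, and the names substringCount and centerCount are my own.

```haskell
import Data.List (inits, tails)

-- All substrings, including the repeated empty ones, as in the earlier post.
substrings :: [a] -> [[a]]
substrings = concat . map tails . inits

-- Closed form derived in the text: (n+2)(n+1)/2.
substringCount :: Int -> Int
substringCount n = (n + 2) * (n + 1) `div` 2

-- Number of center positions, and hence of maximal palindromes: 2n+1.
centerCount :: Int -> Int
centerCount n = 2 * n + 1
```

For the string of ten 'a's, length (substrings (replicate 10 'a')) and substringCount 10 both give 66, and centerCount 10 gives 21.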

If I know the center and the length of a palindrome, I can recover the palindrome if I have the original string. For example, the maximal palindrome of length $5$ around center $13$ in the string "yabadabadoo" is the string "dabad", and the maximal palindrome of length $7$ around center $9$ is the string "abadaba". So to find maximal palindromes, it suffices to find their center and length.
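Recovering the palindrome from a center and a length can be sketched as follows. The name palindromeAt is my own; the function assumes that the center and the length have the same parity and denote a substring that lies within the string.

```haskell
-- Hypothetical helper: the substring of the given length around the given
-- center. The substring starts at index (center - len) `div` 2.
palindromeAt :: String -> Int -> Int -> String
palindromeAt string center len =
  take len (drop ((center - len) `div` 2) string)
```

For example, palindromeAt "yabadabadoo" 13 5 gives "dabad", and palindromeAt "yabadabadoo" 9 7 gives "abadaba".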

Given a string, a center within the string, and a lengthp denoting the length of the maximal palindrome in the string around the center, I can use a Haskell program to check if the length is indeed the length of the maximal palindrome around the center in the string. To determine the substring denoted by a center and a length, I first split the string into the initial part of length div (center-lengthp) 2, the part before the maximal palindrome, and the rest of the string. I then split off the first lengthp letters of the rest. Function div divides its first argument by its second argument, throwing away the remainder. Function splitAt takes a non-negative integer $n$ and a list, and splits the list into two lists. The first list contains the first $n$ letters, and the second list contains the rest of the letters. The lengthp letters that follow the part named before should form a palindrome. The letters around it, the last letter of the part of the string before it and the head of the string after it, should be different to make it maximal. Functions last and head are predefined functions which return the last or head element of a list, respectively. The palindrome is maximal too if at least one of the strings before or after it is empty.


> maximalPalindrome string center lengthp  =
>   let (before,rest)  =  splitAt 
>                           (div (center-lengthp) 2) 
>                           string
>       (p,after)      =  splitAt lengthp rest
>   in    length p    ==  lengthp 
>      && odd center  ==  odd lengthp
>      && palindrome p 
>      && (  null before 
>         || null after 
>         || last before /= head after
>         )

The function maximalPalindrome uses two sanity checks. It first checks that there indeed exists a substring of length lengthp around center by means of length p == lengthp, where function length is a predefined function that returns the length of a list. The other sanity check checks that the maximal palindrome around an even center has even length, and a maximal palindrome around an odd center has odd length, by means of odd center == odd lengthp, which equals True only if both center and lengthp are odd or both are even. Function odd is a predefined function that determines whether or not a number is odd. The logical or operator || returns True if one of its arguments is True. If its left-hand side argument is True it doesn't evaluate its right-hand argument, which is just as well for the above definition, since evaluating last before if before is empty (null before) would lead to an undefined value.

In the next blog post I will show how to calculate the length of all maximal palindromes in a string.

Wednesday, April 18, 2012

A naive algorithm for finding palindromes

It is easy to check by hand that a sentence like "Madam, I'm Adam" is a palindrome if you ignore whitespace and punctuation characters. But checking that the world's longest palindrome, consisting of 17,826 words, is indeed a palindrome is another matter. For such a string, I need software to check that it is a palindrome.

I have already defined the necessary software to check that a string is a palindrome. The palindrome (or textPalindrome or dnaPalindrome) function defined in the previous blog post does the job. Given an implementation of the function reverse that takes about as many steps as the length of the string it reverses, the palindrome function determines whether or not a string is a palindrome in about the same number of steps. It is impossible to do this significantly faster on a computer.
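A reverse that appends each element with ++ takes a number of steps quadratic in the length of the string, so the linear-time assumption deserves a sketch. The version below uses an accumulator; the names reverseLinear and palindromeLinear are my own and only serve as an illustration.

```haskell
-- Hypothetical name: reverse with an accumulator. Each element is moved
-- to the front of the accumulator once, so the number of steps is linear.
reverseLinear :: [a] -> [a]
reverseLinear = go []
  where
    go acc []     = acc
    go acc (x:xs) = go (x : acc) xs

-- The palindrome check, using the linear-time reverse.
palindromeLinear :: Eq a => [a] -> Bool
palindromeLinear xs = xs == reverseLinear xs
```

Here palindromeLinear "abba" evaluates to True and palindromeLinear "yabadabadoo" to False, just like the palindrome function.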

Problem solved?

Well, it depends on what you want. If you want to check for a complete string whether or not it is a palindrome, then the problem is solved. But often you want to find occurrences of palindromes inside another, possibly non-palindromic, string. For example, you might want to find the about 5,700,000 characters that together form eight palindromes, in a string of about 20,000,000 characters representing the male-specific region of the Y chromosome. Or you might want to find the longest palindrome in the Bible. These are different problems, for which we need more than just the palindrome function.

To find palindromes in the male-specific region of the Y chromosome, I have to consider all substrings of the DNA string, and check if each of the substrings is a palindrome. How can I find all substrings that are palindromes in a string? A string may contain many substrings that are palindromes. Just the string "abba" contains the palindromes "a" (twice), "b" (twice), "bb", and "abba", and we also consider the empty substring, which appears a lot, a palindrome.

Finding all palindromic substrings is specified by:


> palindromes  :: String -> [String]
> palindromes  =  filter palindrome . substrings

The first line gives the type of the function palindromes. Function palindromes takes a value of type String as argument, and returns a list of palindromic strings, so a value of type [String]. The argument type is to the left of the arrow ->, the result type to the right. Function substrings calculates all substrings of a string. The function substrings is composed with the function filter palindrome using the composition operator, which is written as a dot (.). filter palindrome removes all substrings that are not palindromes. It remains to define function substrings. Function substrings is defined in terms of two helper functions inits and tails. Function inits returns the list of all initial substrings of a string. For example, inits "abba" gives ["", "a", "ab", "abb", "abba"]. Similarly, function tails returns the list of all final substrings of a string, so tails "abba" gives ["abba", "bba", "ba", "a", ""].

> inits []      =  [[]]
> inits (x:xs)  =  [[]] ++ map (x:) (inits xs)

> tails []      =  [[]]
> tails (x:xs)  =  [x:xs] ++ tails xs

The empty list has just one initial list, namely the empty list. The initial lists of a list consisting of an element x followed by a list xs are calculated by first calculating the initial lists of xs, prepending x to each of these lists by means of the function (x:), and adding the empty initial list. The empty list also has just one final list, namely the empty list. The final lists of a list consisting of an element x followed by a list xs are calculated by first calculating the final lists of xs, and adding the complete list x:xs. Function substrings is now defined by either taking all initial substrings of all final substrings, or all final substrings of all initial substrings. It doesn't matter much which I take, so I define

> substrings = concat . map tails . inits

The standard function concat takes a list of lists, and flattens it into a single list. So concat [[3,1],[2,4]] results in [3,1,2,4]. It removes one nesting level. Applying function substrings to "abba" gives ["", "a", "", "ab", "b", "", "abb", "bb", "b", "", "abba", "bba", "ba", "a", ""].

Although we now have a program to find all palindromes in a string, it is of no use when the string in which you want to find palindromes is very large. You might just as well try to find the palindromes by hand, since waiting for the program to find them is easily going to take years. To see just how impossible it is to find palindromes in a string of length 20,000,000 using this program, consider the amount of computation necessary to find them. Since we want to find palindromic substrings of this string, we have to calculate the substrings first. We have a single substring of length 20,000,000, two substrings of length 19,999,999 each, three of length 19,999,998, four of length 19,999,997, and so on. So the length of all substrings we need to consider when looking for palindromes is \[ 1 \times n + 2 \times (n-1) + 3 \times (n-2) + \ldots + n \times 1 \] where $n$ is the length of the string in which we are looking for palindromes. I will use some simple arithmetic to obtain a value that is easier to calculate. The above expression can be reformulated to: \[ \sum\limits_{i=1}^{n} i \times (n+1-i) \] Now I apply a law which says that constant factors can be moved outside a sum, so \[ \sum\limits_{i=1}^{n} i \times k = k \times \sum\limits_{i=1}^{n} i \] where I instantiate $k$ with $n+1$, and I apply laws for sums of consecutive integers and of their squares, namely \[ \sum\limits_{k=1}^n k = \frac{n(n+1)}{2} \] and \[ \sum\limits_{k=1}^n k^2 = \frac{2n^3 + 3n^2 + n}{6} \] and some simple arithmetic to find that the above expression equals: \[ \frac{n^3+3n^2+2n}{6} \] So given a string of length $n$, the total length of all substrings that appear in the string is $\frac{n^3+3n^2+2n}{6}$, which is roughly $\frac{n^3}{6}$ for large $n$.
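The simple arithmetic can be spelled out as follows. This is only a worked version of the steps named in the paragraph above, using the summation laws as given there. \[ \begin{aligned} \sum\limits_{i=1}^{n} i \times (n+1-i) &= (n+1) \times \sum\limits_{i=1}^{n} i - \sum\limits_{i=1}^{n} i^2 \\ &= \frac{n(n+1)^2}{2} - \frac{2n^3+3n^2+n}{6} \\ &= \frac{3n^3+6n^2+3n - 2n^3-3n^2-n}{6} \\ &= \frac{n^3+3n^2+2n}{6} \end{aligned} \] As a check, for the string "abba" with $n=4$ the sum is $1 \times 4 + 2 \times 3 + 3 \times 2 + 4 \times 1 = 20$, and the closed form gives $\frac{64+48+8}{6} = 20$ as well.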

Let me return now to the DNA string mentioned at the start of this blog post. Since its length is around 20,000,000, the total length of the substrings that appear inside it amounts to 1,333,333,533,333,340,000,000. This is more than a trilliard characters to be inspected! Suppose, unrealistically, that I can use the fastest supercomputer available on earth as of 2011, the Fujitsu K computer, named after the Japanese word for 10 billiard, kei.

The kei computer is actually not so much a computer, but a collection of over 80,000 computers put together. The speed of a supercomputer is specified as the number of basic operations it can perform per second. The speed of the K computer is 10.51 petaflops, where a petaflop is a billiard of such basic operations per second. So the K computer is appropriately but modestly named after its speed. If a single basic operation is enough to compare two characters (which it isn't), then the K computer would need $1,333,333$ (the total length of the substrings divided by a billiard) divided by $10.51$ (the number of billiard operations the K computer can perform per second) seconds, or about one and a half days, to calculate these comparisons. A normal computer would spend its entire lifetime on this problem, and not finish it.
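The arithmetic behind this estimate can be reproduced with a few lines of Haskell. The names totalSubstringLength and secondsNeeded are my own, and the sketch keeps the unrealistic assumption from the text that one basic operation inspects one character.

```haskell
-- Total length of all substrings of a string of length n,
-- using the closed form (n^3 + 3n^2 + 2n) / 6.
totalSubstringLength :: Integer -> Integer
totalSubstringLength n = (n ^ 3 + 3 * n ^ 2 + 2 * n) `div` 6

-- Seconds needed to inspect all these characters at a given speed,
-- assuming (unrealistically) one basic operation per character.
secondsNeeded :: Integer -> Double -> Double
secondsNeeded n opsPerSecond =
  fromIntegral (totalSubstringLength n) / opsPerSecond
```

Here totalSubstringLength 20000000 gives 1333333533333340000000, and secondsNeeded 20000000 10.51e15 is a bit under 127,000 seconds, which is about one and a half days.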

That is not good enough.

Sunday, April 15, 2012

What is a palindrome II? A specification in Haskell

In this blog post I will describe the concept of a palindrome more precisely than in my last blog post, by giving the property that determines whether or not a sequence of symbols is a palindrome in the functional programming language Haskell.

Why is a specification of a palindrome as a program more precise than the specification given in text in the previous blog post, which was taken from the Oxford English Dictionary (OED)? And why should someone interested in the concept of palindromes be interested in a program that describes the concept? These are philosophical questions, to which computer science helps to give an answer. The purpose of a definition is to distinguish certain things from other things. The definition of a palindrome distinguishes palindromic words or sequences of words from non-palindromic ones. How do I determine if a sequence of words is a palindrome? The textual definition in the OED says that it should read the same backwards as forwards, letter for letter. The first example in the OED is "Lewd did I live, and evil I did dwel", which doesn't read the same backwards and forwards, unless capital and lowercase letters are considered equal, and commas are ignored. Hmmm. Are there any more exceptions I should know of?

A much better way to define the concept of palindromes is to provide a program, which takes a string as input, and which returns either yes or no, determining whether or not the input string is a palindrome. The OED definition leads to three different programs for determining whether or not an input string is a palindrome: a definition which compares strings letter for letter, and only accepts strings that are literally the same forwards and backwards, a definition which ignores punctuation and capitalization, and a definition which compares DNA strings. Such a program is a much more precise way to define what it means to be a palindrome. To determine if a string is a palindrome it suffices to run the program on the string. To study the concept of palindromes, it suffices to study the program. No surprises.

Not just palindromes profit from being defined by means of a program, many other concepts, ideas, and regulations would be better off being specified as a program. Besides the obvious anagrams, pangrams, ambigrams, and so on, tax laws, testaments, exam regulations, and anything that follows some rules is best described by means of a program. A program is precise about the rules, the exceptions, and when they apply. If a concept cannot be described precisely, using a program becomes harder. I wouldn't know how to construct a Turing test: a program to determine whether or not a species is intelligent. And even a simple concept like a chair is not easily captured in a program.

Haskell is a programming language celebrating its 25th anniversary in 2012. It was conceived at an academic conference on functional programming in 1987, in an attempt to combine efforts in developing a purely functional programming language. In the nineties of the last century it was used for research on advanced programming language concepts, and in teaching at many universities. For a number of years now it has been gaining popularity in industry, in particular in the investment banking industry, where it is used to specify and implement involved mathematical financial models. I use Haskell because it allows me to describe ideas precisely and concisely. I will introduce the necessary Haskell concepts as I go.

This blog post itself is a literate Haskell program. If you save it as a file and make sure its name ends in .lhs, you can load it in ghci, an interpreter for Haskell programs, and use the definitions given in this post. To make this work I need the following two lines.


> import Prelude hiding (reverse)
> import Data.Char hiding (isLetter)

The first line says that I can use all the standard Haskell functions except the function reverse, which I am going to define in this blog post. The second line says that I can use all basic functions on characters, such as the function isSpace, which tells whether or not a character is a space. The function isLetter is excluded, since I am going to define that function in this blog post too.

I use the following notation for a list of characters. The empty list of characters is denoted by [] and a symbol x followed by a list of characters xs is denoted by x:xs. An individual character is surrounded by quotes to distinguish it from arbitrary variables like x. For example, in this notation, I can write "refer" as 'r':'e':'f':'e':'r':[]. Furthermore, I can concatenate two lists of characters by means of the operator ++, so that "Madam, " ++ "I'm Adam" is "Madam, I'm Adam".

How can I determine whether or not a string (a list of characters) is a palindrome? The simplest method is to reverse the string and to compare the result with the original string. So the string xs is a palindrome (palindrome xs), if xs is equal to its reverse: xs == reverse xs, where xs == ys is True only when the strings xs and ys are exactly equal.


> palindrome xs  =  xs == reverse xs

This equation is actually a program in Haskell. We can load this program in ghci and run it. For example, palindrome "abba" evaluates, unsurprisingly, to True, and palindrome "yabadabadoo" evaluates to False.

How do I compute the reverse of a string? For this purpose, I define a recursive function: if I know how to calculate the reverse of the empty string, and if I know how to calculate the reverse of x:xs, given that I know how to calculate the reverse of xs, then I can use a computer to calculate the reverse of any string. The two cases for reverse are not hard: the reverse of the empty string is the empty string, and the reverse of x:xs is the reverse of xs followed by the string consisting of the single character x:


> reverse []      =  []
> reverse (x:xs)  =  reverse xs ++ [x]

Here [x] is the list containing the single element x. For a concrete character such a list can also be written as a string: ['r'] is the same as "r". This completely defines what it means to be a palindrome.

The above definition only qualifies strings as palindromes when they are exactly equal when reversed. So "Madam, I'm Adam" does not pass this test. For this string to also pass the palindrome test, I slightly adapt the definition. I now say that a string is a palindrome if it is equal to its reverse after throwing away all punctuation symbols such as spaces, commas, periods, etc., and after turning all characters into lower case characters. I call such a palindrome a textPalindrome.


> textPalindrome xs  =  let  ys  =  filter isLetter xs
>                            zs  =  map toLower ys
>                       in   zs == reverse zs
>
> isLetter l = not (isPunctuation l) && not (isSpace l)

Haskell's let ... in ... construct allows me to introduce new definitions after the let, which I can use in the expression after the in. Here I use two intermediate results, ys and zs. By filtering the original string xs with the predicate isLetter, I obtain a new string ys in which no punctuation or space characters appear anymore. So "Madam, I'm Adam" is turned into "MadamImAdam". Function filter is a standard function in Haskell which takes a predicate p and a list xs, and keeps all the elements of xs that satisfy the predicate p.

The function isLetter uses two basic Haskell functions on characters, isPunctuation and isSpace. isPunctuation returns True for all punctuation characters such as the dot, comma, and exclamation mark, and False for all non-punctuation characters, such as letters and spaces. Function isSpace works similarly on space characters. Function isLetter takes a character, and checks that it is not a punctuation character, and (denoted by &&) not a space character.

After filtering out all non-letter characters from xs giving ys, I map each letter in ys to its lower case by means of map toLower ys. So "MadamImAdam" is turned into "madamimadam". map is also a standard Haskell function, which takes a function as argument, toLower in this example, and applies this function to all characters in a string. The function toLower turns a capital letter into lowercase, and does nothing to lowercase letters. Function textPalindrome accepts all palindromes, irrespective of their punctuation, spacing, and capitalization.
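The steps of textPalindrome can be followed one at a time on the running example. This is only a sketch: the names lettersOnly, lowered and isTextPalindrome are introduced here for illustration, and the block uses the reverse from the standard Prelude instead of the one defined in this post, so that it can be loaded on its own.

```haskell
import Data.Char (isPunctuation, isSpace, toLower)

-- Same predicate as in the text. Data.Char's own isLetter is not
-- imported above, so there is no name clash.
isLetter :: Char -> Bool
isLetter l = not (isPunctuation l) && not (isSpace l)

-- Step 1: drop punctuation and spaces.
lettersOnly :: String
lettersOnly = filter isLetter "Madam, I'm Adam"   -- "MadamImAdam"

-- Step 2: turn every remaining character into lower case.
lowered :: String
lowered = map toLower lettersOnly                 -- "madamimadam"

-- Step 3: compare with the reverse (Prelude's reverse here).
isTextPalindrome :: Bool
isTextPalindrome = lowered == reverse lowered     -- True
```

One consequence of defining isLetter by exclusion is that it also keeps characters that are neither punctuation nor spaces, such as digits: isLetter '1' is True.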

Neither palindrome nor textPalindrome can be used to check if a DNA sequence is a palindrome, because the symbol 'A' should be considered equal to 'T', and 'C' to 'G', and the equality == used in the definition of palindrome considers them different. We need to change the equality function to compare characters in DNA strings. We define the DNA character equality function =:= by


> 'A' =:= 'T'  =  True
> 'T' =:= 'A'  =  True
> 'C' =:= 'G'  =  True
> 'G' =:= 'C'  =  True
> _   =:= _    =  False

We use this new equality function in a definition of dnaPalindrome for sequences of DNA symbols. We pairwise combine the elements of xs and reverse xs with the equality function =:= using the function zipWith. zipWith is another standard function, which takes an operator and two lists, and "zips" the two lists with the operator. Thus we obtain a list of comparisons, which we fold to a single result by requiring that each element in the list is True. Function and takes a list of boolean values, and returns True only if all elements in the list are True.

> dnaPalindrome xs  =  and (zipWith (=:=) xs (reverse xs))
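The definition can be tried on a few short sequences. The sketch below repeats =:= and dnaPalindrome so that it stands alone, adds type signatures that are not in the original, and uses the reverse from the standard Prelude. The sample inputs are chosen here purely for illustration.

```haskell
-- DNA character equality as defined in the text.
(=:=) :: Char -> Char -> Bool
'A' =:= 'T'  =  True
'T' =:= 'A'  =  True
'C' =:= 'G'  =  True
'G' =:= 'C'  =  True
_   =:= _    =  False

-- Pair each symbol with the symbol at the mirrored position.
dnaPalindrome :: String -> Bool
dnaPalindrome xs  =  and (zipWith (=:=) xs (reverse xs))

-- A few sample results:
--   dnaPalindrome "ACCTAGGT"  is True
--   dnaPalindrome "GAATTC"    is True
--   dnaPalindrome "ACCT"      is False  ('C' is paired with 'C')
--   dnaPalindrome "ACA"       is False  (the middle 'C' is paired with itself)
--   dnaPalindrome "acgt"      is False  (only upper case symbols match)
```

Because no symbol is =:= to itself, a non-empty sequence of odd length never passes this test: its middle symbol is always compared with itself.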

Thus we have three functions for checking whether or not a string is a palindrome: palindrome for strings that are exactly equal when reversed, textPalindrome for strings that are exactly equal when reversed modulo punctuation and space characters, and dnaPalindrome for DNA strings.

Sunday, April 8, 2012

What is a palindrome?

The Oxford English Dictionary describes a palindrome as "a word or a sequence of words that reads, letter for letter, the same backwards as forwards." In extended use it appears in:

  • Music. A piece of music in which the second half is a retrograde repetition of the first half; the retrograde itself.
  • Numbers. A number, or a date expressed numerically, that is unchanged when the order of its digits is reversed.
  • Biology. A nucleic acid sequence that is identical to its complementary sequence when each is read in the same direction (which is usually the direction called 5' - 3').

I will return to these definitions in later blog posts.

To determine whether a sequence of symbols is a palindrome I need to know what the symbols are from which the palindrome is composed, which symbols I can ignore, and when two symbols are considered equal.

The above definitions tell me that the symbols may be single characters, notes, digits, or elements of DNA. But even complete words, or sentences, are used in palindromes. J.A. Lindon, who wrote many palindromes, published the following poem in 1955:

As I was passing near the jail
I met a man, but hurried by.
His face was ghastly, grimly pale.
He had a gun: I wondered why,
His face was ghastly, grimly pale,
I met a man, but hurried by,
As I was passing near the jail.

Reportedly, the first recorded palindromes, written by Sotades in Greece in the third century BC, are also on the level of sentences.

A palindrome is a simple kind of mathematical symmetry. Where symmetries are exact and the smallest imperfection disqualifies a symmetry, palindromes in which the symbols are characters are often not exact symmetries. The word "refer" reads "refer" when you read it backwards, but the sequence of words in Adam's introduction "Madam, I'm Adam" reads "madA m'I ,madaM" when reversed, which clearly is not the same. Thus it would not qualify as a palindrome if we are very strict, but when talking about palindromes in languages, capitalization, spaces, and punctuation symbols are almost always ignored, or even changed.

In some situations, even different symbols are considered equal when finding palindromes. In DNA, a double helix is formed by two paired strands of nucleotides that run in opposite directions, and the nucleotides always pair in the same way (Adenine (A) with Thymine (T); Cytosine (C) with Guanine (G)). A (single-stranded) nucleotide sequence is said to be a palindrome if it is equal to its reverse complement. For example, the DNA sequence ACCTAGGT is palindromic because its nucleotide-by-nucleotide complement is TGGATCCA, and reversing the order of the nucleotides in the complement gives the original sequence.
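The reverse-complement check described above can be sketched in a few lines of Haskell. The function names complement, reverseComplement and isReverseComplementPalindrome are my own choices for this illustration, and leaving characters other than A, C, G and T unchanged is an assumption, not something the definition prescribes.

```haskell
-- Complement of a single nucleotide. Any other character is returned
-- unchanged, which is an assumption made for this sketch.
complement :: Char -> Char
complement 'A' = 'T'
complement 'T' = 'A'
complement 'C' = 'G'
complement 'G' = 'C'
complement c   = c

-- Complement every nucleotide, then reverse the order.
reverseComplement :: String -> String
reverseComplement = reverse . map complement

-- A sequence is palindromic if it equals its reverse complement.
isReverseComplementPalindrome :: String -> Bool
isReverseComplementPalindrome xs = xs == reverseComplement xs

-- For the example in the text:
--   map complement "ACCTAGGT"     is "TGGATCCA"
--   reverseComplement "ACCTAGGT"  is "ACCTAGGT"
```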