Friday, April 27, 2012

Maximal palindromes

The algorithm for finding palindromic substrings described in my previous blog post has two major problems. First, if I want to find the palindromes in a string of substantial length, it will take a normal computer many years to calculate the result. I didn't mention the second problem explicitly in my previous blog post, but it is pretty obvious. The number of palindromic substrings returned by the algorithm is at least as large as the length of the string, since all single letters are palindromes, and there might be many more. If the string contains a palindrome of length $1,000,000$, then the algorithm will also return the palindrome of length $999,998$ obtained by removing the first and last element of the million letter palindrome, the palindromes of length $999,996$, $999,994$, and so on. This implies that the total number of palindromes occurring in a long string is huge, and I can easily drown in the palindromes returned.

Of the series of palindromes of length $1,000,000$, $999,998$, $999,996$, etc., all palindromes have the same center, and all shorter palindromes can be derived from the longest palindrome by removing equally many characters from the front and from the end. In the string "yabadabadoo", the palindromes around the center at the second occurrence of a 'b' are "b", "aba", and "dabad". "dabad" is the maximal palindrome around this center, since its extension "adabado" is not a palindrome. Maximality does not imply that it is the longest palindromic substring; it is only the maximal palindrome around a particular center in the string. For example, in "yabadabadoo", "abadaba" is a longer palindrome, but with a different center. A center position is either on a letter, as in "dabad" on 'b', or in between two letters, as in "oo", where the center is in between the two o's. The string "yabadabadoo" has $23$ centers: one on each letter (of which there are eleven), one before each letter (another eleven), and one after the last letter of the string. If I assign $0$ to the center before the first letter, $1$ to the center on the first letter, $2$ to the center after the first letter, and so on, the maximal palindrome "dabad" has center $13$, and the maximal palindrome "abadaba" has center $9$. For a string of length $n$, there are $2n+1$ center positions in the string. Since we can derive all shorter palindromes around a center from the maximal palindrome around the center, it is sufficient to calculate the maximal palindromes around the centers in a string. So instead of calculating all palindromic substrings of a string, I calculate all maximal palindromic substrings of a string.
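The center numbering and the way shorter palindromes follow from a maximal one can be sketched in a few lines of Haskell. The names centerCount, centerOnLetter, and sameCenter are hypothetical helpers introduced here for illustration; they are not part of the program developed in this series.

```haskell
-- Hypothetical helper names, for illustration only.

-- Number of center positions in a string: one on each letter,
-- one before each letter, and one after the last letter.
centerCount :: String -> Int
centerCount s = 2 * length s + 1

-- Center number of the letter at 0-based index i.
centerOnLetter :: Int -> Int
centerOnLetter i = 2 * i + 1

-- Given a non-empty palindrome, all non-empty palindromes with the same
-- center, longest first, obtained by stripping one character from each end.
sameCenter :: String -> [String]
sameCenter p
  | length p <= 2 = [p]
  | otherwise     = p : sameCenter (init (drop 1 p))
```

With these definitions, centerOnLetter 6 gives the center $13$ of the second 'b' in "yabadabadoo" (index 6 when counting from 0), and sameCenter "dabad" lists "dabad", "aba", and "b".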

There are just as many maximal palindromes in a string as there are center positions, so there still are quite a few maximal palindromes. But the number of maximal palindromes might be substantially lower than the number of (not necessarily maximal) palindromic substrings in a string. Take for example the string of length $n$ just containing the character 'a'. The number of maximal palindromes is equal to the number of center positions in the string, $2n+1$. The total number of palindromes in the string is the number of substrings of the string, since all substrings consist of only 'a's, and hence all are palindromes. Since substrings is defined as concat . map tails . inits, I can calculate the total number of substrings if I know how many initial and final substrings appear in a string of length $n$. Function inits returns $n+1$ substrings, of length $0...n$, respectively. Similarly, for a string of length $n$, function tails returns $n+1$ substrings, also of length $0...n$. So the total number of substrings is: \[ \sum\limits_{i=0}^{n} (i+1) = \frac{n(n+1)}{2} + n+1 = \frac{(n+2)(n+1)}{2} = \frac{n^2}{2}+\frac{3}{2}n+1\] So in the string "aaaaaaaaaa" of length $10$, there are $21$ maximal palindromes, and $66$ palindromic substrings.
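The two counts for the all-'a' string can be checked with a small sketch. It assumes that inits and tails from Data.List behave like the definitions given in the previous post, and the names substringsOf, substringCount, and maximalCount are hypothetical.

```haskell
import Data.List (inits, tails)

-- Same shape as the definition of substrings from the previous post,
-- but built on the library versions of inits and tails.
substringsOf :: String -> [String]
substringsOf = concat . map tails . inits

-- Closed form from the text: (n+2)(n+1)/2 substrings,
-- counting every occurrence of the empty substring.
substringCount :: Int -> Int
substringCount n = (n + 2) * (n + 1) `div` 2

-- Number of center positions, and hence of maximal palindromes.
maximalCount :: Int -> Int
maximalCount n = 2 * n + 1
```

For the string replicate 10 'a', length (substringsOf (replicate 10 'a')) and substringCount 10 both give 66, while maximalCount 10 gives 21.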

If I know the center and the length of a palindrome, I can recover the palindrome if I have the original string. For example, the maximal palindrome of length $5$ around center $13$ in the string "yabadabadoo" is the string "dabad", and the maximal palindrome of length $7$ around center $9$ is the string "abadaba". So to find maximal palindromes, it suffices to find their center and length.
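Recovering a palindrome from its center and length could look as follows. This is a minimal sketch: substringAround is a hypothetical name, and it assumes that center and len have the same parity and that len is at most center.

```haskell
-- Hypothetical helper: the substring of length len around a center,
-- using the center numbering of this post (0 is before the first letter).
-- Assumes center and len have the same parity and len <= center.
substringAround :: String -> Int -> Int -> String
substringAround string center len =
  take len (drop ((center - len) `div` 2) string)
```

For example, substringAround "yabadabadoo" 13 5 evaluates to "dabad", and substringAround "yabadabadoo" 9 7 evaluates to "abadaba".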

Given a string, a center within the string, and a number lengthp denoting the claimed length of the maximal palindrome around that center, I can use a Haskell program to check that lengthp is indeed the length of the maximal palindrome around the center in the string. To determine the substring denoted by a center and a length, I first split the string into an initial part before of length div (center-lengthp) 2, the part before the maximal palindrome, and the rest. I then split off the first lengthp letters of the rest, which gives the candidate palindrome p and the remaining string after. Function div divides its first argument by its second argument, throwing away the remainder. Function splitAt takes a non-negative integer $n$ and a list, and splits the list into two lists. The first list contains the first $n$ letters, and the second list contains the rest of the letters. The lengthp letters that follow before should form a palindrome. The letters around it, the last letter of before and the first letter of after, should be different to make it maximal. Functions last and head are predefined functions which return the last and the first element of a list, respectively. The palindrome is also maximal if at least one of the strings before or after it is empty.

> maximalPalindrome string center lengthp  =
>   let (before,rest)  =  splitAt 
>                           (div (center-lengthp) 2) 
>                           string
>       (p,after)      =  splitAt lengthp rest
>   in    length p    ==  lengthp 
>      && odd center  ==  odd lengthp
>      && palindrome p 
>      && (  null before 
>         || null after 
>         || last before /= head after
>         )

The function maximalPalindrome uses two sanity checks. It first checks that there indeed exists a substring of length lengthp around center by means of length p == lengthp, where function length is a predefined function that returns the length of a list. The other sanity check verifies that the maximal palindrome around an even center has even length, and a maximal palindrome around an odd center has odd length, by means of odd center == odd lengthp, which equals True only if both center and lengthp are odd or both are even. Function odd is a predefined function that determines whether or not a number is odd. The logical or operator || returns True if at least one of its arguments is True. If its left-hand argument is True it doesn't evaluate its right-hand argument, which is just as well for the above definition, since evaluating last before when before is empty (null before), or head after when after is empty, would lead to an undefined value.
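To see the check at work, here is a standalone sketch that restates the definitions with type signatures (using the standard reverse) and applies them to "yabadabadoo". The list checks is a hypothetical addition; its expected results were worked out by hand from the example earlier in this post.

```haskell
-- Restated from the text so that this block stands alone.
palindrome :: String -> Bool
palindrome xs = xs == reverse xs

maximalPalindrome :: String -> Int -> Int -> Bool
maximalPalindrome string center lengthp =
  let (before, rest) = splitAt (div (center - lengthp) 2) string
      (p, after)     = splitAt lengthp rest
  in  length p == lengthp
      && odd center == odd lengthp
      && palindrome p
      && (null before || null after || last before /= head after)

-- Hypothetical examples; every entry should be True.
checks :: [Bool]
checks =
  [ maximalPalindrome "yabadabadoo" 13 5        -- "dabad" is maximal
  , maximalPalindrome "yabadabadoo" 9 7         -- "abadaba" is maximal
  , not (maximalPalindrome "yabadabadoo" 13 3)  -- "aba" extends to "dabad"
  , not (maximalPalindrome "yabadabadoo" 13 4)  -- parity mismatch
  ]
```

One caveat when experimenting: the definition seems to assume that lengthp is at most center. For a negative offset splitAt returns an empty before, so maximalPalindrome "aba" 1 3 evaluates to True, although no palindrome of length 3 is centered on the first letter.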

In the next blog post I will show how to calculate the length of all maximal palindromes in a string.

Wednesday, April 18, 2012

A naive algorithm for finding palindromes

It is easy to check by hand that a sentence like "Madam, I'm Adam" is a palindrome if you ignore whitespace and punctuation characters. But checking that the world's longest palindrome, consisting of 17,826 words, is indeed a palindrome is another matter. For such a string, I need software to check that it is a palindrome.

I have already defined the necessary software to check that a string is a palindrome. The palindrome (or textPalindrome or dnaPalindrome) function defined in the previous blog post does the job. Given an implementation of the function reverse that takes about as many steps as the length of the string it reverses, the palindrome function determines whether or not a string is a palindrome in about the same number of steps. It is impossible to do this significantly faster on a computer.

Problem solved?

Well, it depends on what you want. If you want to check for a complete string whether or not it is a palindrome, then the problem is solved. But often you want to find occurrences of palindromes inside another, possibly non-palindromic, string. For example, you might want to find the about 5,700,000 characters that together form eight palindromes, in a string of about 20,000,000 characters representing the male-specific region of the Y chromosome. Or you might want to find the longest palindrome in the Bible. These are different problems, for which we need more than just the palindrome function.

To find palindromes in the male-specific region of the Y chromosome, I have to consider all substrings of the DNA string, and check if each of the substrings is a palindrome. How can I find all substrings that are palindromes in a string? A string may contain many substrings that are palindromes. Just the string "abba" contains the palindromes "a" (twice), "b" (twice), "bb", and "abba". We also consider the empty substring, which occurs many times, to be a palindrome.

Finding all palindromic substrings is specified by:

> palindromes  :: String -> [String]
> palindromes  =  filter palindrome . substrings

The first line gives the type of the function palindromes. Function palindromes takes a value of type String as argument, and returns a list of palindromic strings, so a value of type [String]. The argument type is to the left of the arrow ->, the result type to the right. Function substrings calculates all substrings of a string. The function substrings is composed with the function filter palindrome using the composition operator . represented by the dot symbol. filter palindrome removes all substrings that are not palindromes. It remains to define function substrings. Function substrings is defined in terms of two helper functions inits and tails. Function inits returns the list of all initial substrings of a string. For example, inits "abba" gives ["", "a", "ab", "abb", "abba"]. Similarly, function tails returns the list of all final substrings of a string, so tails "abba" gives ["abba", "bba", "ba", "a", ""].

> inits []      =  [[]]
> inits (x:xs)  =  [[]] ++ map (x:) (inits xs)

> tails []      =  [[]]
> tails (x:xs)  =  [x:xs] ++ tails xs

The empty list has just one initial list, namely the empty list. The initial lists of a list consisting of an element x followed by a list xs are calculated by first calculating the initial lists of xs, prepending x to each of these lists by means of the function (x:), and adding the empty initial list. The empty list also has just one final list, namely the empty list. The final lists of a list consisting of an element x followed by a list xs are calculated by first calculating the final lists of xs, and adding the complete list x:xs. Function substrings is now defined by either taking all initial substrings of all final substrings, or all final substrings of all initial substrings. It doesn't matter much which I take, so I define

> substrings = concat . map tails . inits

The standard function concat takes a list of lists, and flattens it into a single list. So concat [[3,1],[2,4]] results in [3,1,2,4]. It removes one nesting level. Applying function substrings to "abba" gives ["", "a", "", "ab", "b", "", "abb", "bb", "b", "", "abba", "bba", "ba", "a", ""].
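Putting the pieces together, the specification can be tried out on "abba". The sketch below is standalone: it assumes the library versions of inits, tails, and reverse instead of the definitions in this series, and distinctPalindromes is a hypothetical helper that drops empty strings and duplicates.

```haskell
import Data.List (inits, tails, nub)

palindrome :: String -> Bool
palindrome xs = xs == reverse xs

substrings :: String -> [String]
substrings = concat . map tails . inits

palindromes :: String -> [String]
palindromes = filter palindrome . substrings

-- Hypothetical helper: the distinct non-empty palindromic substrings,
-- in order of first appearance.
distinctPalindromes :: String -> [String]
distinctPalindromes = nub . filter (not . null) . palindromes
```

Here palindromes "abba" returns eleven strings: five empty strings, "a" twice, "b" twice, "bb", and "abba". distinctPalindromes "abba" reduces this to ["a", "b", "bb", "abba"].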

Although we now have a program to find all palindromes in a string, it is of no use when the string in which you want to find palindromes is very large. You might just as well try to find the palindromes by hand, since waiting for the program to find them is easily going to take years. To see just how impossible it is to find palindromes in a string of length 20,000,000 using this program, consider the amount of computation necessary to find them. Since we want to find palindromic substrings of this string, we have to calculate the substrings first. We have a single substring of length 20,000,000, two substrings of length 19,999,999 each, three of length 19,999,998, four of length 19,999,997, and so on. So the length of all substrings we need to consider when looking for palindromes is \[ 1 \times n + 2 \times (n-1) + 3 \times (n-2) + \ldots + n \times 1 \] where $n$ is the length of the string in which we are looking for palindromes. I will use some simple arithmetic to obtain a value that is easier to calculate. The above expression can be reformulated to: \[ \sum\limits_{i=1}^{n} i \times (n+1-i) \] Now I apply a law which says that constant factors can be moved outside a sum, so \[ \sum\limits_{i=1}^{n} i \times k = k \times \sum\limits_{i=1}^{n} i \] where I instantiate $k$ with $n+1$, and I apply laws for sums of consecutive integers, namely \[ \sum\limits_{k=1}^n k = \frac{n(n+1)}{2} \\ \sum\limits_{k=1}^n k^2 = \frac{2n^3 + 3n^2 + n}{6} \] and some simple arithmetic to find that the above expression equals: \[ \frac{n^3+3n^2+2n}{6} \] So given a string of length $n$, the length of all substrings that appear in the string is $\frac{n^3+3n^2+2n}{6}$, which is roughly $\frac{n^3}{6}$, so it grows with the cube of the length of the string.
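The arithmetic behind this closed form can be spelled out by writing $i \times (n+1-i)$ as $(n+1) \times i - i^2$ and applying the two summation laws:

\[
\begin{aligned}
\sum\limits_{i=1}^{n} i \times (n+1-i)
  &= (n+1) \times \sum\limits_{i=1}^{n} i - \sum\limits_{i=1}^{n} i^2 \\
  &= \frac{n(n+1)^2}{2} - \frac{2n^3+3n^2+n}{6} \\
  &= \frac{3n^3+6n^2+3n-2n^3-3n^2-n}{6} \\
  &= \frac{n^3+3n^2+2n}{6}
\end{aligned}
\]

As a small check, for "abba" with $n = 4$ the sum is $1 \times 4 + 2 \times 3 + 3 \times 2 + 4 \times 1 = 20$, and the closed form gives $\frac{64+48+8}{6} = 20$ as well.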

Let me return now to the DNA string mentioned at the start of this blog post. Since its length is around 20,000,000, the total length of the substrings that appear inside it amounts to 1,333,333,533,333,340,000,000. This is more than a trilliard characters to be inspected! Suppose, unrealistically, that I can use the fastest supercomputer available on earth as of 2011, the Fujitsu K computer, named after the Japanese word for 10 billiard, kei.

The K computer is actually not so much a single computer as a collection of over 80,000 computers put together. The speed of a supercomputer is specified as the number of basic operations it can perform per second. The speed of the K computer is 10.51 petaflops, where a petaflop is a billiard of such basic operations per second. So the K computer is appropriately but modestly named after its speed. If a single basic operation were enough to compare two characters (which it isn't), then the K computer would need about $1,333,333 / 10.51 \approx 126,863$ seconds, or about one and a half days, to calculate these comparisons. Here $1,333,333$ is the total length of the substrings divided by a billiard, and $10.51$ is the number of billiard operations the K computer can perform per second. A normal computer would spend its entire lifetime on this problem, and not finish it.
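The numbers in this estimate can be reproduced with a short sketch. The function names are hypothetical, and the rate of one character comparison per basic operation is the same unrealistic assumption as in the text.

```haskell
-- Total length of all substrings of a string of length n (closed form).
totalLength :: Integer -> Integer
totalLength n = (n * n * n + 3 * n * n + 2 * n) `div` 6

-- The same quantity computed directly from the sum, for small n only.
totalLengthSum :: Integer -> Integer
totalLengthSum n = sum [i * (n + 1 - i) | i <- [1 .. n]]

-- Seconds needed at a given number of operations per second,
-- assuming (unrealistically) one operation per character.
secondsAt :: Double -> Integer -> Double
secondsAt opsPerSecond chars = fromInteger chars / opsPerSecond

-- Estimate for a string of length 20,000,000 at 10.51 petaflops.
kComputerSeconds :: Double
kComputerSeconds = secondsAt 10.51e15 (totalLength 20000000)
```

Here totalLength 20000000 evaluates to 1333333533333340000000, and kComputerSeconds comes out at roughly 126,863 seconds, which is a little under one and a half days.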

That is not good enough.

Sunday, April 15, 2012

What is a palindrome II? A specification in Haskell

In this blog post I will describe the concept of a palindrome more precisely than in my last blog post, by giving the property that determines whether or not a sequence of symbols is a palindrome in the functional programming language Haskell.

Why is a specification of a palindrome as a program more precise than the specification given in text in the previous blog post, which was taken from the Oxford English Dictionary (OED)? And why should someone interested in the concept of palindromes be interested in a program that describes the concept? These are philosophical questions, to which computer science helps to give an answer. The purpose of a definition is to distinguish certain things from other things. The definition of a palindrome distinguishes palindromic words or sequences of words from non-palindromic ones. How do I determine if a sequence of words is a palindrome? The textual definition in the OED says that it should read the same backwards as forwards, letter for letter. The first example in the OED is "Lewd did I live, and evil I did dwel", which doesn't read the same backwards and forwards, unless capital and lowercase letters are considered equal, and commas are ignored. Hmmm. Are there any more exceptions I should know of?

A much better way to define the concept of palindromes is to provide a program, which takes a string as input, and which returns either yes or no, determining whether or not the input string is a palindrome. The OED definition leads to three different programs for determining whether or not an input string is a palindrome: a definition which compares strings letter for letter, and only accepts strings that are literally the same forwards and backwards, a definition which ignores punctuation and capitalization, and a definition which compares DNA strings. Such a program is a much more precise way to define what it means to be a palindrome. To determine if a string is a palindrome it suffices to run the program on the string. To study the concept of palindromes, it suffices to study the program. No surprises.

Palindromes are not the only concept to profit from being defined by means of a program: many other concepts, ideas, and regulations would be better off being specified as a program. Besides the obvious anagrams, pangrams, ambigrams, and so on, tax laws, testaments, exam regulations, and anything that follows some rules is best described by means of a program. A program is precise about the rules, the exceptions, and when they apply. If a concept cannot be described precisely, using a program becomes harder. I wouldn't know how to construct a Turing test: a program to determine whether or not a species is intelligent. And even a simple concept like a chair is not easily captured in a program.

Haskell is a programming language celebrating its 25th anniversary in 2012. It was conceived at an academic conference on functional programming in 1987, in an attempt to combine efforts in developing a purely functional programming language. In the nineties of the last century it was used for research on advanced programming language concepts, and in teaching at many universities. For a number of years now it has been gaining popularity in industry, in particular in the investment banking industry, where it is used to specify and implement involved mathematical financial models. I use Haskell because it allows me to describe ideas precisely and concisely. I will introduce the necessary Haskell concepts as I go.

This blog post itself is a literate Haskell program. If you save it as a file and make sure its name ends in .lhs, you can load it in ghci, an interpreter for Haskell programs, and use the definitions given in this post. To make this work I need the following two lines.

> import Prelude hiding (reverse)
> import Data.Char hiding (isLetter)

The first line says that I can use all the standard Haskell functions except the function reverse, which I am going to define in this blog post. The second line says that I can use all basic functions on characters, such as the function isSpace, which tells whether or not a character is a space. The function isLetter is excluded, since I am going to define that function in this blog post too.

I use the following notation for a list of characters. The empty list of characters is denoted by [] and a symbol x followed by a list of characters xs is denoted by x:xs. An individual character is surrounded by quotes to distinguish it from arbitrary variables like x. For example, in this notation, I can write "refer" as 'r':'e':'f':'e':'r':[]. Furthermore, I can concatenate two lists of characters by means of the operator ++, so that "Madam, " ++ "I'm Adam" is "Madam, I'm Adam".

How can I determine whether or not a string (a list of characters) is a palindrome? The simplest method is to reverse the string and to compare the result with the original string. So the string xs is a palindrome (palindrome xs), if xs is equal to its reverse: xs == reverse xs, where xs == ys is True only when the strings xs and ys are exactly equal.

> palindrome xs  =  xs == reverse xs

This equation is actually a program in Haskell. We can load this program in ghci and run it. For example, palindrome "abba" evaluates, unsurprisingly, to True, and palindrome "yabadabadoo" evaluates to False.

How do I compute the reverse of a string? For this purpose, I define a recursive function: if I know how to calculate the reverse of the empty string, and if I know how to calculate the reverse of x:xs, given that I know how to calculate the reverse of xs, then I can use a computer to calculate the reverse of any string. The two cases for reverse are not hard: the reverse of the empty string is the empty string, and the reverse of x:xs is the reverse of xs followed by the string consisting of the single character x:

> reverse []      =  []
> reverse (x:xs)  =  reverse xs ++ [x]

Here [x] is the list containing the single element x; for a concrete character such as 'r', the list ['r'] can also be written "r". Together with the definition of palindrome, this completely defines what it means to be a palindrome.
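The definition of reverse above is easy to read, but since ++ walks through its entire left argument, reversing a string of length $n$ this way takes on the order of $n^2$ steps. A common alternative collects the result in an accumulating parameter and needs only about $n$ steps. The sketch below uses the hypothetical names reverseLinear and palindromeLinear so that it does not clash with the definitions in this post.

```haskell
-- Hypothetical name: a reverse that takes about as many steps as the
-- length of its input. The accumulator acc holds the elements seen
-- so far, in reverse order.
reverseLinear :: [a] -> [a]
reverseLinear = go []
  where
    go acc []       = acc
    go acc (x : xs) = go (x : acc) xs

-- The palindrome check on top of the linear reverse.
palindromeLinear :: Eq a => [a] -> Bool
palindromeLinear xs = xs == reverseLinear xs
```

Both versions give the same results: reverseLinear "abc" is "cba", palindromeLinear "abba" is True, and palindromeLinear "yabadabadoo" is False.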

The above definition only qualifies strings as palindromes when they are exactly equal when reversed. So "Madam, I'm Adam" does not pass this test. For this string to also pass the palindrome test, I slightly adapt the definition. I now say that a string is a palindrome if it is equal to its reverse after throwing away all spaces and punctuation symbols such as commas, periods, etc., and after turning all characters into lower case characters. I call such a palindrome a textPalindrome.

> textPalindrome xs  =  let  ys  =  filter isLetter xs
>                            zs  =  map toLower ys
>                       in   zs == reverse zs
> isLetter l = not (isPunctuation l) && not (isSpace l)

Haskell's let ... in ... construct allows me to introduce new definitions after the let, which I can use in the definitions after the in. Here I use two intermediate results ys and zs. By filtering the original string xs with the predicate isLetter, I obtain a new string ys in which no punctuation or space characters appear anymore. So "Madam, I'm Adam" is turned into "MadamImAdam". Function filter is a standard function in Haskell which takes a predicate p and a list xs, and keeps all the elements of xs that satisfy the predicate p. The function isLetter uses two basic Haskell functions on characters, isPunctuation and isSpace. isPunctuation returns True for all punctuation characters such as the dot, comma, apostrophe, and exclamation mark, and False for all non-punctuation characters, such as letters and spaces. Function isSpace works similarly on space characters. Function isLetter takes a character, and checks that it is not a punctuation character, and (denoted by &&) not a space character. After filtering out all non-letter characters from xs giving ys, I map each letter in ys to its lower case by means of map toLower ys. So "MadamImAdam" is turned into "madamimadam". map is also a standard Haskell function, which takes a function as argument, toLower in this example, and applies this function to all characters in a string. The function toLower turns a capital letter into lowercase, and does nothing to lowercase letters. Function textPalindrome accepts all palindromes, irrespective of their punctuation.
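The two intermediate steps can also be written as a single pipeline, which makes it easy to inspect the normalised string. The sketch below stands alone; the primed names and normalise are hypothetical and only serve to keep it apart from the definitions above.

```haskell
import Data.Char (isPunctuation, isSpace, toLower)

-- Restated from the text under a primed (hypothetical) name.
isLetter' :: Char -> Bool
isLetter' c = not (isPunctuation c) && not (isSpace c)

-- Hypothetical helper: drop punctuation and spaces, then lowercase.
normalise :: String -> String
normalise = map toLower . filter isLetter'

-- Same check as textPalindrome, expressed via normalise.
textPalindrome' :: String -> Bool
textPalindrome' xs = let zs = normalise xs in zs == reverse zs
```

With this, normalise "Madam, I'm Adam" gives "madamimadam" and textPalindrome' "Madam, I'm Adam" is True. Note that this isLetter' keeps digits, since they are neither punctuation nor space characters.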

Both palindrome and textPalindrome cannot be used to check if a DNA sequence is a palindrome, because the symbol 'A' should be considered equal to 'T', and 'C' to 'G', and the equality == used in the definition of palindrome considers them different. We need to change the equality function to compare characters in DNA strings. We define the DNA character equality function =:= by

> 'A' =:= 'T'  =  True
> 'T' =:= 'A'  =  True
> 'C' =:= 'G'  =  True
> 'G' =:= 'C'  =  True
> _   =:= _    =  False

We use this new equality function in a definition of dnaPalindrome for sequences of DNA symbols. We pairwise combine the elements of xs and reverse xs with the equality function =:= using the function zipWith. zipWith is another standard function, which takes an operator and two lists, and `zips' the two lists with the operator. Thus we obtain a list of comparisons, which we fold to a single result by requiring that each element in the list is True. Function and takes a list of boolean values, and returns True only if all elements in the list are True.

> dnaPalindrome xs  =  and (zipWith (=:=) xs (reverse xs))

Thus we have three functions for checking whether or not a string is a palindrome: palindrome for strings that are exactly equal when reversed, textPalindrome for strings that are exactly equal when reversed modulo punctuation and space characters, and dnaPalindrome for DNA strings.
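For DNA strings there is an equivalent way to phrase the check: a sequence is a palindrome if it equals its own reverse complement. The sketch below uses hypothetical names (complement, reverseComplement, dnaPalindromeRC) and the standard reverse. It agrees with dnaPalindrome for strings that contain only the symbols A, C, G, and T; other symbols are left unchanged by complement, which is an assumption of this sketch.

```haskell
-- Complement of a single nucleotide; any other symbol is left
-- unchanged in this sketch.
complement :: Char -> Char
complement 'A' = 'T'
complement 'T' = 'A'
complement 'C' = 'G'
complement 'G' = 'C'
complement c   = c

-- Complement every nucleotide and reverse the order.
reverseComplement :: String -> String
reverseComplement = reverse . map complement

-- A DNA string is palindromic if it equals its reverse complement.
dnaPalindromeRC :: String -> Bool
dnaPalindromeRC xs = xs == reverseComplement xs
```

For the sequence ACCTAGGT, map complement gives "TGGATCCA" and reverseComplement gives back "ACCTAGGT", so dnaPalindromeRC "ACCTAGGT" is True.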

Sunday, April 8, 2012

What is a palindrome?

The Oxford English Dictionary describes a palindrome as "a word or a sequence of words that reads, letter for letter, the same backwards as forwards." In extended use it appears in:

  • Music. A piece of music in which the second half is a retrograde repetition of the first half; the retrograde itself.
  • Numbers. A number, or a date expressed numerically, that is unchanged when the order of its digits is reversed.
  • Biology. A nucleic acid sequence that is identical to its complementary sequence when each is read in the same direction (which is usually the direction called 5' - 3').

I will return to these definitions in later blog posts.

To determine whether a sequence of symbols is a palindrome I need to know what the symbols are from which the palindrome is composed, which symbols I can ignore, and when two symbols are considered equal.

The above definitions tell me that the symbols may be single characters, notes, digits, or elements of DNA. But even complete words, or sentences, are used in palindromes. J.A. Lindon, who wrote many palindromes, published the following poem in 1955:

As I was passing near the jail
I met a man, but hurried by.
His face was ghastly, grimly pale.
He had a gun: I wondered why,
His face was ghastly, grimly pale,
I met a man, but hurried by,
As I was passing near the jail.

Reportedly, the first recorded palindromes, written by Sotades in Greece in the third century BC, are also on the level of sentences.

A palindrome is a simple kind of mathematical symmetry. Where symmetries are exact and the smallest imperfection disqualifies a symmetry, palindromes in which the symbols are characters are often not exact symmetries. The word "refer" reads "refer" when you read it backwards, but the sequence of words in Adam's introduction "Madam, I'm Adam" reads "madA m'I ,madaM" when reversed, which clearly is not the same. Thus it would not qualify as a palindrome if we are very strict, but when talking about palindromes in languages, capitalization, spaces, and punctuation symbols are almost always ignored, or even changed.

In some situations, even different symbols are considered equal when finding palindromes. In DNA, a double helix is formed by two paired strands of nucleotides that run in opposite directions, and the nucleotides always pair in the same way (Adenine (A) with Thymine (T); Cytosine (C) with Guanine (G)). A (single-stranded) nucleotide sequence is said to be a palindrome if it is equal to its reverse complement. For example, the DNA sequence ACCTAGGT is palindromic because its nucleotide-by-nucleotide complement is TGGATCCA, and reversing the order of the nucleotides in the complement gives the original sequence.