Sunday, July 22, 2012

Palindromes in DNA

DNA strings are often millions of letters long. For example, my copy of the DNA of Amycolatopsis marina, a bacterium discovered in 2009, consists of $6,503,724$ characters.

Bacteria are of course some of the smallest living beings. The DNA in my copy of human chromosome 18 is an astounding $74,657,229$ characters long. Chromosome 18 is one of the 23 chromosomes of human beings, and is estimated to contain between $300$ and $500$ genes. A gene is responsible for part of our functioning as human beings. An example of a gene on chromosome 18 is the NPC1 gene, named after Niemann-Pick disease, type C1, the disease you get when this gene contains an error. Experiments with mice suggest that this gene plays a role in controlling appetite, and people with a mutation in the NPC1 gene often suffer from obesity. Interestingly, the same mutation in NPC1 might make you immune to the Ebola virus, one of the deadliest known viruses. So the mutation is partly a blessing in disguise.

If I search for the keyword palindrome in the electronic publications available at the library of my university, I get more than $500$ hits. The first ten of these hits are all about palindromes in DNA. My guess is that at least $90$% of these publications are about palindromes in DNA. So why the interest in palindromes in DNA? Surely, if your strings only contain `A', `T', `C', and `G', you are likely to get many palindromes?

Let us look at how many palindromes we expect to find in the DNA of Amycolatopsis marina. Suppose I want to find palindromes of length twelve. I calculate the chance that an arbitrary DNA string of length twelve is a palindrome as follows. The first six characters of the string don't matter. The seventh character needs to match the sixth character, for which we have a chance of one in four. Remember that in DNA, `A' matches `T' and vice versa, and `C' matches `G' and vice versa; no character matches itself. The eighth character needs to match the fifth character, for which we also have a chance of one in four. This goes on until the twelfth character, so I get a chance of \[ \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} = \left(\frac{1}{4}\right)^6 = \frac{1}{4^6} = \frac{1}{4096} \]

Since the DNA of Amycolatopsis marina is $6,503,724$ characters long, it has $6,503,724 - 12 + 1 = 6,503,713$ substrings of length twelve. Multiplying this number by the chance that a substring is a palindrome, I expect to get about $1,588$ palindromes. Using my algorithm for finding palindromes I get $1,784$ palindromes. This is about $12$% more than expected, but that might be an accident.

If I look at the palindromes of length fourteen in the $18$th human chromosome, a similar calculation, now with a chance of $\frac{1}{4^7}$, tells me to expect $4,556$ palindromes. My algorithm finds $25,323$. More than five times as many palindromes as expected! This is not an accident anymore. Palindromes play a role in DNA. But what role?
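The calculation above can be sketched in a few lines of Python. This is only an illustration and not the linear-time algorithm I used for the counts: the function names are made up for this example, the expectation assumes that all four letters are equally likely and independent, and the counting function simply checks every window, which is fine for short fixed lengths.

```python
from fractions import Fraction

# Watson-Crick pairs: each letter matches exactly one other letter.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_dna_palindrome(s: str) -> bool:
    """True if s reads the same as its reversed complement.

    Odd-length strings are never DNA palindromes, because the middle
    letter would have to match itself.
    """
    n = len(s)
    if n % 2 != 0:
        return False
    return all(COMPLEMENT.get(s[i]) == s[n - 1 - i] for i in range(n // 2))

def expected_palindromes(dna_length: int, k: int) -> Fraction:
    """Expected number of length-k palindromes in uniformly random DNA.

    There are dna_length - k + 1 windows, and each of the k/2 letters in
    the second half of a window matches with chance 1/4.
    """
    windows = dna_length - k + 1
    return Fraction(windows, 4 ** (k // 2))

def count_palindromes(dna: str, k: int) -> int:
    """Count the length-k windows of dna that are DNA palindromes."""
    return sum(is_dna_palindrome(dna[i:i + k]) for i in range(len(dna) - k + 1))
```

For example, `float(expected_palindromes(6503724, 12))` is roughly `1587.8`, and `count_palindromes(dna, 12)` gives the observed count to compare it with, where `dna` is the genome read into a string.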

Palindromes perform several tasks in human DNA. I will discuss one particularly intriguing task in this blog post.

Why do we humans have sex? I can answer this question from several perspectives. The most common answer will probably mention love, pleasure, or children. Looking at the question from a biological perspective, the answer discusses the biological advantages of having sex. Our children get their genes from two parents. They get two complete copies of human DNA, and merge these copies in some way to make them the way they are. In this merging process, damaged DNA from one parent can be replaced by functioning DNA from the other parent. There is a lot of DNA to repair: every day, each cell may be damaged at one million different places. Most of this damage is harmless, since a lot of DNA does not seem to have any function at all, but some kinds of damage may cause a lot of problems. Being able to repair non-functioning DNA when combining DNA from parents is essential for keeping the human race in a good state. The American geneticist Hermann Joseph Muller used this argument to explain why sexual reproduction is favored over asexual reproduction in organisms. When an organism reproduces asexually it passes all its DNA errors on to its offspring, without a possibility to repair them, eventually leading to the extinction of the population. The process has been dubbed Muller's ratchet, after the ratchet device, which can only turn in one direction. That is the theory; practice is slightly more complicated.

Muller's ratchet should already be at work in humans, since there is one human chromosome in which no combination of parental genes takes place: chromosome 23. Chromosome 23 determines whether we become a man or a woman. A man gets a copy of the chromosome called X from his mother and a copy called Y from his father. A woman gets an X from her mother and an X from her father. A woman can merge both X's and repair possible errors, but a man has two different copies of the chromosome, and has no possibility to combine, let alone repair, them. The Y is passed on from father to son, with no involvement of women. Muller's ratchet theory says the genes on the chromosome that make a man should deteriorate quickly, and men should soon become extinct.

There is no sign of men becoming extinct. Apparently there are some other mechanisms at work to repair DNA on the Y chromosome. If I cannot obtain a copy of some piece of DNA from my mother, maybe I can store copies of the DNA in the Y chromosome itself? If I maintain two copies, I can always check whether a piece of DNA is correct by comparing it against its copy. This is the mechanism used by the Y chromosome, where the copies are stored as palindromes, with some noise in the middle of these palindromes. Such palindromes with gaps in the middle are often called inverted repeats in biology. The Y chromosome contains eight huge palindromes, the longest of which consists of almost three million characters. Around 25% of the Y chromosome consists of palindromes. The DNA in the palindromes carries genes involved in the development of the testes. So the mechanism by means of which men survive is called palindrome...
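To make the idea of a palindrome with a gap concrete, here is a small Python sketch. It is a toy model under my own assumptions, not a description of the biological repair process: the function names are invented for this example, the position of the gap is assumed to be known, and `arm_mismatches` assumes both arms have the same length. The two arms of an inverted repeat are each other's reversed complement, so comparing one arm with the reversed complement of the other shows where the two copies disagree.

```python
# Watson-Crick pairs: each letter matches exactly one other letter.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s: str) -> str:
    """Reverse s and replace every letter by its matching letter.

    Raises KeyError for letters other than A, T, C and G.
    """
    return "".join(COMPLEMENT[c] for c in reversed(s))

def arm_length(dna: str, gap_start: int, gap_end: int) -> int:
    """Length of the matching arms around the gap dna[gap_start:gap_end].

    Starting at the two edges of the gap, walk outwards for as long as
    the letter on the left matches the letter on the right.
    """
    left, right = gap_start - 1, gap_end
    length = 0
    while left >= 0 and right < len(dna) and COMPLEMENT.get(dna[left]) == dna[right]:
        left -= 1
        right += 1
        length += 1
    return length

def arm_mismatches(left_arm: str, right_arm: str) -> list[int]:
    """Positions in left_arm that disagree with the copy stored in right_arm."""
    copy = reverse_complement(right_arm)
    return [i for i, (a, b) in enumerate(zip(left_arm, copy)) if a != b]
```

In the string `"GGACTTTTAGTCC"` the arms `"GGACT"` and `"AGTCC"` surround the gap `"TTT"`, so `arm_length("GGACTTTTAGTCC", 5, 8)` is `5`. If one letter of an arm gets damaged, `arm_mismatches` reports its position, which is the comparison against a second copy described above.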

Sunday, July 1, 2012

Other implementations and solutions

Next year it will be 25 years since I constructed an algorithm for finding palindromes efficiently. I was 22 years old then, and had just started as a PhD student. I was quite excited about having solved the problem of finding palindromes efficiently, and wrote a scientific paper about it. My excitement wasn't shared by the scientific community though. In the last twenty-five years this paper has been cited fewer than ten times, and it appears at the bottom end of my most cited papers list.

In 2007 we developed the ICFP programming contest. The ICFP programming contest is a very challenging contest, in which thousands of programmers try to show off their programming talents in their programming language of choice. Our contest asked participants to transform a given source picture into a given target picture using as few commands as possible. We included a problem related to palindromes in our contest, since this had been my pet problem ever since 1988. After the contest I explained how to solve the palindrome problem efficiently in a blog post.

When some years later I looked at the number of hits the contest pages received, I found that each month, thousands of people read the blog post about finding palindromes. Why does this page attract so many visitors?

The palindrome problem is a common question in job interviews for software developers, so the blog post attracts software developers who are looking for a new job and preparing themselves for an interview. Another reason the blog post attracts visitors is that I think quite a few people are genuinely interested in the problem and its solution. The concept of palindromes appears in almost all languages, which means that the question of finding palindromes is asked all over the world. The blog post indeed attracts visitors from all over the world. The last 100 visitors (as of July 1, 2012) come from all continents except Australia.

Many people ask the question of how to find palindromes, but many people also try to answer it. You can find hundreds of solutions for finding palindromes on the internet. Some of these are variants of my linear-time solution, others are more naive quadratic or sometimes even cubic-time solutions. Below I give the list I found, ordered by the programming language used for the implementation. If there exists a linear-time implementation, I don't list less efficient solutions.

C The same linear-time solution as my blog post.
C++ 1 The same linear-time solution as my blog post. This post has an extensive description of the palindromes in palindromes property, including some pictures which try to explain it.
C++ 2 A C++ implementation of Manacher's algorithm.
C++ 3 The same linear-time solution as my blog post in C++ by Fernando Pelliccioni.
C# As far as I can see, this is a quadratic-time solution.
Factor A cubic-time solution.
F# A quadratic-time solution.
Go A quadratic-time solution, by Lars Björnfot.
Haskell Like the program for finding palindromes described in my blog post, this blog post describes a linear-time program for finding palindromes in Haskell. The interesting aspect of this solution is that it returns results lazily: as soon as it finds a maximal palindrome, it writes it to the output.
Java 1 A quadratic-time solution.
Java 2 Another quadratic-time solution in Java, by Marcello de Sales.
Java 3 Yet another quadratic-time solution in Java, by Mohit Bhandari.
Matlab A quadratic-time solution, by Lalit Mohan.
PHP 1 As far as I can see this is a quadratic-time solution, by Marc Donaldson.
PHP 2 This is a quadratic- or cubic-time solution, by Joseba Bikandi. You can try it online.
Python The same linear-time solution as my blog post, developed by Fred Akalin. This post contains an extensive description of the palindromes in palindromes property too.
R This page describes the interface of functionality for finding palindromes in DNA strings. It doesn't say how efficient the software is.
Ruby 1 A quadratic-time solution, by Matthew Kerry.
Ruby 2 Another quadratic-time solution in Ruby, by Rick DeNatale. The post and the comments mention some more Ruby implementations.
Ruby 3 Yet another quadratic-time solution in Ruby, by Mitchell Fang.
Scheme A quadratic-time solution.

I could only find linear-time implementations for finding palindromes in C, C++, Haskell, and Python. Please let me know if you find better or alternative implementations for finding palindromes. This list easily gets outdated, so please keep me informed.
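
To illustrate the difference between the quadratic and the linear-time solutions in the list, here is a sketch of both in Python for ordinary text palindromes. The first function is the naive approach of expanding around every center. The second is Manacher's algorithm as it is commonly presented, which is related to but not the same program as the one in my blog post. The function names are chosen for this example only, and both functions return the leftmost longest palindromic substring.

```python
def longest_palindrome_quadratic(text: str) -> str:
    """Expand around each of the 2n-1 centers: quadratic time in the worst case."""
    best_start, best_len = 0, 0
    for c in range(2 * len(text) - 1):
        lo = c // 2          # even c: center on a letter
        hi = lo + c % 2      # odd c: center between two letters
        while lo >= 0 and hi < len(text) and text[lo] == text[hi]:
            lo -= 1
            hi += 1
        length = hi - lo - 1
        if length > best_len:
            best_start, best_len = lo + 1, length
    return text[best_start:best_start + best_len]

def longest_palindrome_linear(text: str) -> str:
    """Manacher's algorithm: linear time by reusing palindromes in palindromes."""
    # Put a separator (None) around every letter, so that palindromes of
    # even and odd length both have a center position.
    t = [None] * (2 * len(text) + 1)
    t[1::2] = text
    n = len(t)
    radius = [0] * n         # radius[i]: length of the palindrome centered at i
    center = right = 0       # rightmost palindrome found so far ends at `right`
    for i in range(n):
        if i < right:
            # The mirror image of i inside the palindrome around `center`
            # gives a lower bound for the radius at i.
            radius[i] = min(right - i, radius[2 * center - i])
        lo, hi = i - radius[i] - 1, i + radius[i] + 1
        while lo >= 0 and hi < n and t[lo] == t[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    best = max(range(n), key=radius.__getitem__)
    start = (best - radius[best]) // 2
    return text[start:start + radius[best]]
```

On a string such as `"a" * 5000` the quadratic version expands almost fully around every center, while the linear version never moves `right` backwards, so the total number of successful comparisons in the inner loop is bounded by the length of `t`.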