To test the program I obtain by combining palindrome finding with errors and gaps, I downloaded a copy of the Y chromosome. The first paper that described the occurrences of huge palindromes in the male chromosome referred to in my blog post on palindromes in DNA is `The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes'. This paper appeared in Nature in 2003, and the Nature web pages provide a link to the Y chromosome studied in the paper. The paper caused quite a stir. It has been cited more than a thousand times in other scientific papers. A thousand citations is a lot: none of my direct colleagues in computer science ever wrote an article with so many citations.
Since it is computationally infeasible to test the program for finding palindromes with errors and gaps with large numbers of allowed errors, I looked up the information about palindromes in the paper. The paper contains a table that gives information about the eight palindromes the authors found in chromosome Y.
Palindrome | Arm length (kb) | Arm-to-arm identity (%) | Spacer length (kb) | Palindrome span (kb) |
P1 | 1,450 | 99.97 | 2.1 | 2,902 |
P1.1 | 9,9 | 99.95 | 3.9 | 24 |
P1.2 | 9,9 | 99.95 | 3.9 | 24 |
P2 | 122 | 99.97 | 2.1 | 246 |
P3 | 283 | 99.94 | 170 | 736 |
P4 | 190 | 99.98 | 40 | 419 |
P5 | 496 | 99.98 | 3.5 | 996 |
P6 | 110 | 99.97 | 46 | 266 |
P7 | 8.7 | 99.97 | 12.6 | 30 |
P8 | 36 | 99.997 | 3.4 | 75 |
Palindrome P1 consists of almost 3 million symbols, and the arm-to-arm identity is 99.97 percent. An arm has length almost one and a half million, so the number of errors is around 450. I assume that the errors are more or less evenly distributed in the palindrome, which implies that in the central 10,000 symbols of an arm of a palindrome, after the gap, about 3 errors would occur. So instead of trying to find the long palindrome with possibly 450 errors and a gap of size around 2100 symbols, I instead try to find palindromes with a gap of size 2200 (to be on the safe side) of length at least 5000 (to be on the safe side) with at most 5 errors (to be on the safe side). To my surprise, I do not find a single palindromes that satisfies these requirements. This long palindrome with a gap of size 2100 does not seem to appear in the Y chromosome of the authors. Since P2 has the same gap size, it does not appear either. Hmm, what's wrong here? Have I made a mistake?
Lets have a look at the other palindromes reported in the table. The next smallest gap size reported in the above table is 3.4 kb. If I try to find palindromes with a gap of size 3400, I get two hits. Around position 18905096 I find a palindrome of length 32056 (including the gap) if I allow for 8 errors. If I try to reduce the gap length, I find that I can take a gap length of 2320, and still find the same palindrome. So maybe this is one of the palindromes with a gap of around 2.1 kb? The arm length is (32056-2320)/2 = 14868, which doesn't correspond at all with the reported arm lengths. I also find a palindrome around position 9806955. This palindrome has length 17304, and 5 errors. This palindrome indeed has a gap of size 3400: reducing the gap size leads to much shorter palindromes. But the palindrome around this position is much shorter than the reported length in Nature. I experimented with different gap sizes and numbers of allowed errors, and found the following palindromes in the Y chromosome:
Center | Length | Errors | Gap length |
9806955 | 13904 | 5 | 3400 |
12078263 | 43356 | 1 | 151498 |
13738581 | 34452 | 0 | 39262 |
14445793 | 64738 | 5 | 132950 |
17989037 | 14934 | 5 | 297114 |
18899672 | 3364 | 7 | 39540 |
18905096 | 29736 | 8 | 2320 |
20530881 | 24374 | 2 | 157760 |
Since my findings are very different from the results reported in Nature, I contacted the group responsible for the paper, the Whitehead Institute at the Department of Biology at the MIT. The research scientist that answered my questions was kind enough to provide me with the positions at which the palindromes occur in the Y chromosome. Given the positions, I tried to compare the arms by hand (or eye, actually). I got nowhere: the start and end of the arms did not resemble each other at all.
The final clue I needed was that it is not enough to discard errors constituted by two symbols that do not match in a palindrome. Sometimes I also have to delete symbols in one arm of a palindrome, or, equivalently, add symbols to the other arm. With this last clue, I selected the reported palindromic arms in the Y chromosome, and compared them. It turns out that in each case I only have to delete or insert a couple of DNA symbols to obtain a palindrome. The table of palindromes I find are the palindromes P8 to P1 reported in Nature, with P7 missing. My seventh palindrome does not appear in the list of positions I received, but this is probably one of P1.1 or P1.2, of which I didn't receive the positions.
After months of playing with the Y chromosome in some of my spare time, I now finally find the palindromes I was after. If I would have to do the same thing again for another DNA string, I would follow the same approach:
- use the program that finds palindromes with gaps and errors with various gap sizes and numbers of allowed errors to find centers of palindrome candidates
- hand align the sequences around the centers of palindrome candidates to determine the length of the palindromes around these centers, given some number of allowed deletions or insertions