# bioalgorithms.info

You communicate algorithms by using psuedocode and English. A point made early in Chapter 2 is that computers can't use pseudocode, let alone English. You have to implement the algorithm in a computer language so that the computer can run the algorithm on real input. Unfortunately, there are a large number of types of computers (Macintosh, Windows, Linux, and so on) and a large number of languages (C, C++, BASIC, Java, Perl, Python, and so on). If you write a C program on the Macintosh, you may not be able to use it under Windows, and vice versa. We have decided to use Java as a programming language for these algorithms, because it can be written on any type of computer and run on any other type of computer.

## Longest Common Subsequence

I've noticed a large number of people searching Google for "LCS implementation java". Actually, I think it's just one guy who needs the code, but he can't find it. For some reason, Google kicks this site back to him as a result. For a long time, I didn't bother to put an implementation of the LCS algorithm here because it's a piece of pseudocode in the book, but I feel kind of bad for that one guy who needs the code so badly yet can't find it, so I wrote it in three languages. I will assume that if you are downloading one of these versions then you know how to compile and run programs written in one of these three languages. If not, having code for the longest common subsequence is probably the least of your worries.

## Completed projects

 #11. Spectral Alignment version 0.2 download |web Robin Friedman, Neil Jones 6/29/04 posted 7/4/04

## Project descriptions

### Project 1. Brute Force Partial Digest Problem

#### Background

Read sections 4.1-4.3. Be sure you understand the formulation of the partial digest problem, and both the na\"{i}ve (slow) brute force algorithm, and the practical branch and bound algorithm. In particular, know why the branch-and-bound solution should operate faster than the brute force algorithms.

#### Implementation

Implement the (recursive) branch-and-bound algorithm listed in section 4.3. The input to the algorithm itself is a multiset of partial digest fragment lengths (which are integers). The output should be a list of restriction sites (also integers).

Also implement a simple class that can accept a length of DNA and a set of restriction sites on the DNA (as integers) and generate the partial digest fragment lengths. When experimenting with the algorithm it's often useful to construct a model of DNA rather than partial digest fragment lengths, which would require manual computation.

### Project 2. Deciphering codes

#### Background

Many problems in bioinformatics, such as discovering regulatory motifs, can be motivated by deciphering text codes.

The encrypted text writen by Captain Kidd in section 4.4 uses a simple character substitution scheme. Mr. Legrand solves this with his knowledge that the most frequent word in English is 'the'. Had Legrand had access to a computer, he could have used more general knowledge of English to decode the text. For example, since 'the' is just a combination of three letters, called a 3-tuple, he could have compared the distribution of all 3-tuples throughout all English text to the distribution of 3-tuples in the encrypted text. Likewise, this can be done with all 2-tuples, or even by simply comparing the distribution of single letters.

#### Implementation

Write a program to find the distribution of all k-tuples in a corpus of English text. The choice of k should be left to the user, but should be less than, say, 10. For simplicity, remove all punctuation and spaces, and convert all letters to lower case. As a test case, apply this to Captain Kidd's text to see the best way to decipher it. Keep in mind that not all k-tuples will necessarily appear in the corpus.

### Project 3. Motif Search

#### Background

Regulatory motifs are the short sequences flanking the genes that are responsible for regulation of transcription when certain proteins, called transcription factors, bind to these motifs. Motif finding is the problem of discovering such motifs without any prior knowledge of how the motifs look. We consider several approaches to Motif Finding that range from infeasible to fast but inaccurate.

#### Implementation

Implement the brute-force-median-string algorithm and the branch-and-bound median string algorithm described in chapter 4. Also implement the Greedy Motif Search algorithm. The brute force median string and greedy motif search algorithms have not been implemented yet, so you'll be doing this from scratch. Assume that the input sequences will be written in the DNA alphabet.

The input to a "Motif Finding Algorithm" should be a list of sequences, and an integer parameter, w, denoting the width of the motif the algorithm is looking for. The output of the algorithm is another set of sequences, each sequence being w letters long.

A popular format for multiple sequence data is the Fasta format. Poke around the biojava code to see how you can read Fasta format without actually writing the fasta reading code. While Fasta is an easy format, there are other formats that aren't so simple and being able to reuse someone else's code will save a lot of time.

Your test cases should be fairly small and simple so that they can run in a reasonable amount of time. It will be inconvenient for you to hardcode your test cases into each algorithm class---you can create a "unit test class" that does nothing but run a number of different test cases on the various algorithms you've implemented.

### Project 4. Greedy algorithms and Genome Rearrangements

#### Background

The discovery that many species with similar genomes differ in gene ordering has lead to the study of the genome rearrangement problem. Research in this area has lead to the discovery of ``evolutionary hotspots'' between human and mouse as well as metrics for phylogenetic tree reconstruction (a topic explored in Chapter 10). A common formulation of solving genome rearrangements is NP-complete, and so greedy approximations are frequently used to get suboptimal answers in a reasonable time.

#### Practical assignment

Implement the pseudocode for SimpleReversalSort and BreakPointReversalSort (but read carefully the paragraph after the proof of the lemma, since it demonstrates a potential bug in the BreakPointReversalSort algorithm!). The input to your program should be a permutation of numbers \$1 \ldots n\$. The algorithm should return all of the permutations that it goes through in order. one line after each reversal. Design several test cases, as BreakPointReversalSort is a simple algorithm, but surprisingly easy to get wrong.

### Project 5. Pairwise Alignment Using Dynamic Programming

#### Background

Dynamic programming provides a framework for understanding DNA sequence comparison algorithms, many of which have been used by biologists to make important inferences about gene function and evolutionary history.

#### Implementation

This particular implementation project has several implementations and may take several weeks to write, debug, and test.

Implement the longest common subsequence (LCS) algorithm described in section 6.5. Given two seqences, your program should output a dynamic programming table or grid, and a final alignment. The input to your algorithm will be two sequences. The output will be the LCS of the two strings, but this is actually kind of tricky. The LCS is simply a string of characters, but it should also be able to give the coordinates of each string in the LCS to each of the input strings.

Next, implement the following dynammic programming algorithms:

• global alignment;
• local alignment;
• alignment with affine gap penalties.
Each of these should take two sequences and a scoring matrix.

Finally, implement linear-space versions of all of the above.

Biojava has some classes already built for these types of problems. Use anything that biojava has implemented, except for the algorithm itself.

### Project 6. Multiple Alignment

#### Background

Multiple alignment of more than two sequences using the dynamic programming alignment algorithms that work for two sequences ends up in an exponential algorithm. Several heuristics have been proposed.

#### Implementation

Implement the dynamic multiple alignment algorithm for n DNA sequences, where n is a parameter. Accept a scoring matrix as an input, but if one is not supplied use the following method to optimize for the maximum number of shared nucleotides at an alignment position: score {A,A,A} = 3, score {A,A,T} = 2, score {A,C,T} = 1.

Next, implement the more efficient greedy pairwise multiple alignment algorithm.

### Project 7. Exon Chaining

#### Background

Gene prediction is the problem of examining a DNA sequences and determining which parts of that sequence are genes. A number of excellent approaches are in use at NCBI and other sequence databases to automatically annotate large genomes. We explore a couple of approaches in Chapter 6.

#### Implementation

Implement the exon chaining algorithm listed in section 6.13. The input to the algorithm should be a list of weighted intervals, and the output should be the path of intervals with maximum score.

Figure 6.26 has a test case that you can reuse. You should probably also test some trivial cases (empty inputs, single inputs, a single path, etc).

### Project 8. A Puzzle

#### Background

Much of the mammalian genome is covered by repeats, stretches of DNA that don't seem to serve any purpose and are replicated many times. These repeats confound the process of DNA sequencing because they create a lot of confusion in where a particular chunk of sequence came from. Similar (but not directly related) to DNA sequence assembly with repeats is the Triazzle puzzle described in section 8.9.

#### Implementation

Formulate the Triazzle problem under a graph framework, and figure out an algorithm for solving it. Implement a randomized algorithm that assembles the puzzle (in a random fashion) and explore how long it will take it to succeed. You may want to borrow one or more of the actual puzzles to study while you write this algorithm.

The input to the algorithm should be a list of pieces (how these are described is part of the problem) and some sort of description of the board. The output should be an assignment of pieces to spots on the board.

This implementation will require an implementation of the triazzle puzzle. Try to make the triazzle board design so that it can accomodae pieces that aren't triangular (there is a hexagonal triazzle).

There's an obvious temptation to turn this into a GUI, complete with a picture of the board. This sounds difficult. If you want to try it, though, don't let me stop you.

### Project 9. DNA Sequencing Using Graph Algorithms

#### Background

Graph algorithms can be helpful in the problem of measuring the sequence of a piece of DNA. Typically, a long DNA segment is broken into pieces, the sequence of each of the pieces is measured, and then the larger segment's sequence is calculated. One sequencing technology that isn't used very much but is still instructive to study is Sequencing By Hybridization, covered in Chapter 8.

#### Implementation

Given a (relatively short) DNA sequence, implement an algorithm that can generate a list of all \$l\$-tuples in the sequence, and from that list, the overlap graph.

Implement another algorithm that calculates an Eulerian path through the overlap graph.

### Project 10. Protein Sequencing Using Graph Algorithms

#### Background

The determination of protein sequences is a vital component to medical diagnostics and functional biological analysis. The technology used to sequence proteins are substantially more complex than the technology used to sequence DNA. Chapter 8 explores protein sequencing in some detail.

#### Implementation

Implement a program to construct the spectrum graph described in section 8.12 given the sequence of a protein. You'll need to research the molecular masses of amino acids. Refer to section 8.12 for the definition of spectrum graph and how it may be used to potentially solve peptide sequencing problem.

Implement an algorithm that, given a perfect spectrum graph from a peptide sequence with repeats, outputs all potential sequences. Note that perfect means N-terminal ions only.

### Project 11. Spectral Alignment

#### Background

One problem with the de novo sequencing strategy given in the book in chapter 8 (via the spectrum graph) is that it is valid only for spectra that are complete and have no C-terminal ions. This never happens, and altering the algorithm to handle C-terminal ions as well is not easy. It is possible, though, but the naive formulation of the problem is NP-complete.

Instead, we'll consider the problem of database search. In database search (for protein sequencing), you measure the mass spectrum of a protein, and then search in a database for a good match to a known protein. The details are explained in chapter 8, sections 8.13-8.15.

#### Implementation

Implement the spectral alignment and convolution algorithms of Chapter 8.

### Project 12. Combinatorial Pattern Matching

#### Background

Biologists are often interested in finding matches of short sequences against entire genomes. Sophisticated data structures can greatly assist in this task.

#### Implementation

Implement an algorithm for constructing a keyword tree as described in section 9.4. The input should be a set of sequences. The output should be an a keyword tree.

### Project 13. Clustering

#### Background

Many of the new biological experimental techniques generate vast amounts of data. When viewed as a whole these data can be perplexing, however they are more sensible when alike data is clustered into groups.

#### Implementation

Implement k-means clustering. As a test case, use the points listed in figure 10.1.