Practical Problem Set #7: Hidden Markov Models

Using the HMM approach, you will now revisit the dishonest casino mentioned in the class handout. With a computer implementation of the Viterbi algorithm, we can now decode longer sequences of coin tossing.

Fortunately, the umdhmm package, a set of C programs that implement the Viterbi and forward-backward algorithms, is available from http://www.cfar.umd.edu/~kanungo/software/software.html. You can uncompress and compile it under Linux/Unix/Windows/Macintosh. The README file in the package explained how to specify a HMM and how to run the standard HMM algorithms. There is also a tutorial (hmmtut.pdf) in the zip file for additional background information on HMMs.

  1. Using the Hidden Markov Model (HMM) in Figure 11.1 in the handout (also shown above representing a dishonest casino), decode the following sequence of coin tosses (i.e., compute the most probable sequence of states that generates the sequence of coin tosses). You should fill up a 2-by-10 dynamic programming table. HHHHHTTTTT What if the transition probabilities from F to B and from B to F were 3/10? How would you interpret these two results (you may want to use Excel to speed up your calculations)?
  2. With the HMM shown in the figure above, with the HMM in Figure 11.1, what is the probability that the "T" at the seventh position is generated by the biased coin?
  3. Compare the two probabilistic models we have learned: profile HMMs and PSSMs. Which one is more general? Can you devise a profile HMM to emulate a PSSM? Can a PSSM emulate any profile HMM?
  4. Write a text file specifying the HMM in the dishonest casino, Figure 11.1, following the format explained in the umdhmm README file so that your HMM is understandable by the programs. Next, use the HMM to find the most probable sequences of hidden states that generates the following sequences of coin tosses:
  5. We will use the umdhmm package to help us to identify CG-islands in a genomic sequence.

    The dinucleotide transition probabilities in CG-islands are different from that in non-CG-islands. The following transition probability tables (page 50 in Durbin et al's book) are obtained based on the statistics of annotated genomic sequences:

    Transition probabilities outside a CG-island.

    Transition probabilities inside a CG-island.

    +

    A

    C

    G

    T

    A

    0.180

    0.274

    0.426

    0.120

    C

    0.171

    0.368

    0.274

    0.188

    G

    0.161

    0.339

    0.375

    0.125

    T

    0.079

    0.355

    0.384

    0.182

    -

    A

    C

    G

    T

    A

    0.300

    0.205

    0.285

    0.210

    C

    0.322

    0.298

    0.078

    0.302

    G

    0.248

    0.246

    0.298

    0.208

    T

    0.177

    0.239

    0.292

    0.292

    Your HMM would have a group of 4 states A+, C+, G+, and T+ which emit A, C, G, and T respectively in CG-islands, and a group of another 4 states A-, C-, G-, and T- correspondingly toin normal genomic regions. The above 2 tables specify the transition probabilities within each group. Now it is your task to design the transition probabilities between the states across groups.

Multiple sequence based search

In the C. elegans (worm) genome, several large paralogous gene families that were first thought to be nematode specific have since been classified as putative G-protein coupled receptors (GPCRs). Detecting similarity between these nematode sequences and known GPCRs in other organisms is a nontrivial sequence analysis task. Here we arbitrarily choose the putative GPCR gene sra-4 (Wormpep AH6.8; SWISS-PROT Q09206; 329 aa) as an example. The task is to find a significant similarity between AH6.8 and a protein of known function in another organism. (Please note the Resources section after the Procedures which has links and information about how to use the tools necessary to complete this part)

  1. Obtaining the sequence. Go to SwissProt and retrieve the amino acid sequence for Q09206. Be sure to read the annotations of the sequence before you do any further searches. Is there any experimental evidence that this protein is a GPCR?
  2. Initial BLAST: To quickly find any proteins similar to AH6.8, you can run a BLASTP search against the nr database using the WWW BLAST at NCBI. How many hits do you find with a significant E-value (<0.01)? How many among them are non-worm genes? What do you think of the significance of these non-worm hits? What conclusions can you draw from this initial BLAST result?
  3. Sequence gathering: The first step for further analysis is to more carefully define a nonredundant set of sequences that belong to the same family as AH6.8. You could have used the collection of significant hits returned from your BLASTP search. To be more careful, you want to use the Wormpep database, an authoritative nonredundant source of nematode predicted protein sequences. Blast AH6.8 against the Wormpep. How many hits are significant (this time you want to be more careful, so you use the E-value cutoff of 10-6)? As a crude protection against erroneous computational gene predictions, you want to exclude sequences that are too long or too short. How many sequences are shorter than 200 aa or longer than 500 aa? Removing these sequences, you can proceed to the next step with a clean set of sequences.
  4. Multiple sequence alignment. The next step is to produce a multiple alignment. You will use ClustalW a popular program for multiple alignment. You can take a quick look at the alignment in the graphical display of ClustalX (part of the ClustalW package), just to make sure the result makes sense. Save your result as worm.aln and proceed to the next step.
  5. Profile searches. Construct a profile HMM of the multiple alignment, and to search it against the sequence database. You will use the HMMER program.

    The hmmbuild command builds a profile "worm.hmm" from the alignment, taking a few seconds. The hmmcalibrate command automatically estimates some parameters needed for calculating accurate E-values in database searches, taking several minutes. The hmmsearch command searches Swissprot with using the profile, which can take several hours. The output is a ranked list of hits, giving E-values. Go through your hmmsearch result, can you find any non-worm GPCRs (with significant E-values)? The E-value in hmmsearch is different from that in the blast search: usually an E-value of 0.05 is already a marginal but significant result.

    You can also run all the above programs in the www version of HMMER. But this is not encouraged because the server can only process 1-2 external jobs at one time.

  6. PSI-BLAST. As a comparison, you also run PSI-BLAST to search AH6.8 against nr. Do you find non-worm GPCR genes? Do you find any non-GPCR genes in your result; what are they? Can you explain why PSI-BLAST includes genes of different functions? Do you think this an artifact of PSI-BLAST or do these non-GPCR genes suggest an alternative function of AH6.8?

Resources

  1. HMMER: Sean Eddy's HMMER website http://hmmer.wustl.edu/. If you don't have an account on bioinf, You can download to your local machine and run it.
  2. Web version of HMMER at http://bioweb.pasteur.fr/seqanal/motif/hmmer-uk.html. Easy to use. But can be very slow. You got to do your search early.
  3. CLUSTALW: available at http://www.ebi.ac.uk/clustalw/. SDSC biology workbench (http://workbench.sdsc.edu/) also has it.
  4. SwissProt: http://us.expasy.org/sprot/. A local copy of SwissProt on bioinf can be found at /software/bioinf/BLAST/data/swissprot.
  5. BLASTP, and PSI-BLAST are available at http://www.ncbi.nlm.nih.gov/BLAST/
  6. Wormpep: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/

Suggested Reading

Chapters 3-6. Durbin et al. Biological Sequence Analysis. 1998.