Using the HMM approach, you will now revisit
the dishonest casino mentioned in the class handout.
With a computer implementation of the Viterbi algorithm, we can now decode
longer sequences of coin tossing.
Fortunately, the umdhmm package, a
set of C programs that implement the Viterbi and forward-backward algorithms,
is available from http://www.cfar.umd.edu/~kanungo/software/software.html.
You can uncompress and compile it under Linux/Unix/Windows/Macintosh. The
README file in the package explained how to specify a HMM and how to run the
standard HMM algorithms. There is also a tutorial (hmmtut.pdf) in the zip file for
additional background information on HMMs.
Using the Hidden
Markov Model (HMM) in Figure
11.1 in the handout (also shown above representing a dishonest casino),
decode the following sequence of coin tosses (i.e., compute the most
probable sequence of states that generates the sequence of coin tosses).
You should fill up a 2-by-10 dynamic programming table. HHHHHTTTTT
What if the transition probabilities from F to B and from B to F were 3/10? How
would you interpret these two results (you may want to use Excel to speed up your calculations)?
With the HMM shown in the figure above, with the HMM in Figure
11.1, what is the probability that the "T" at the seventh
position is generated by the biased coin?
Compare the two probabilistic models we have
learned: profile HMMs and PSSMs. Which one is more general? Can you devise a profile
HMM to emulate a PSSM? Can a PSSM emulate any profile HMM?
Write a text file specifying the HMM in the dishonest casino,
Figure 11.1,
following the format explained in the umdhmm README file so that your HMM is
understandable by the programs. Next, use the
HMM to find the most probable sequences of hidden states that generates
the following sequences of coin tosses:
HHHHHTTTTT
HHHHHHTTTT
HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH
THHHHHHHHHHHHTTHTTHTHTHTHHTTHHHHHHHHHHHHHHHHH
We will use the
umdhmm package to help us to identify CG-islands in a genomic sequence.
The dinucleotide transition
probabilities in CG-islands are different from that in non-CG-islands. The
following transition probability tables (page 50 in Durbin et al's book) are
obtained based on the statistics of annotated genomic sequences:
Transition probabilities
outside a CG-island.
Transition probabilities
inside a CG-island.
+
A
C
G
T
A
0.180
0.274
0.426
0.120
C
0.171
0.368
0.274
0.188
G
0.161
0.339
0.375
0.125
T
0.079
0.355
0.384
0.182
-
A
C
G
T
A
0.300
0.205
0.285
0.210
C
0.322
0.298
0.078
0.302
G
0.248
0.246
0.298
0.208
T
0.177
0.239
0.292
0.292
Your HMM would have a group of 4
states A+, C+, G+, and T+ which
emit A, C, G, and T respectively in CG-islands, and a group of another 4
states A-, C-, G-, and T-
correspondingly toin normal genomic regions. The above
2 tables specify the transition probabilities within each group. Now it is
your task to design the transition probabilities between the states across
groups.
Describe how do
you want to design the transition probabilities between the states across
groups?
Turn your design
into the format understandable by the umdhmm programs. You may use Microsoft excel to
ease the pain of organizing tables as large as 8-by-8.
Use your HMM to
search the following stretch of genomic sequence chr22_10k.fa for CG-islands.
The number of CG-islands can vary with different parameter settings for
probabilities of switching between CG-islands and non-CG-islands. With
reasonable parameter settings 4-5 CG-islands will be found.
Your colleague asks you why don't you build an intuitive HMM that consists
of just 2 states, one emitting symbols in CG-islands and the other emitting symbols in
non-CG-islands. How do you answer this question?
Multiple sequence based search
In the C. elegans (worm) genome,
several large paralogous gene families that were first thought to be nematode
specific have since been classified as putative
G-protein coupled receptors (GPCRs). Detecting similarity between these
nematode sequences and known GPCRs in other organisms is a nontrivial
sequence analysis task. Here we arbitrarily choose the putative GPCR gene sra-4
(Wormpep AH6.8; SWISS-PROT Q09206; 329 aa) as an example. The task is to find
a significant similarity between AH6.8 and a protein of known function in
another organism.
(Please note the Resources section after the Procedures which has links and
information about how to use the tools necessary to complete this part)
Obtaining
the sequence. Go to SwissProt and retrieve the amino
acid sequence for
Q09206. Be sure to read the annotations of the sequence before you do
any further searches.
Is there any experimental evidence that this protein is a GPCR?
Initial BLAST: To quickly
find any proteins similar to AH6.8, you can run a BLASTP search against
the nr database using the WWW BLAST at NCBI. How many hits do you find with a significant
E-value (<0.01)? How many among them are non-worm genes? What do you think of the
significance of these non-worm hits? What conclusions can you draw from this initial
BLAST result?
Sequence gathering: The
first step for further analysis is to more carefully define a
nonredundant set of sequences that belong to the same family as AH6.8.
You could have used the collection of significant hits returned from
your BLASTP search. To be more careful, you want to use the Wormpep
database, an
authoritative nonredundant source of nematode predicted protein
sequences. Blast AH6.8 against the Wormpep. How many hits are
significant (this time you want to be more careful, so you use the
E-value cutoff of 10-6)?
As a crude protection against erroneous computational gene
predictions, you want to exclude sequences that are too long or too
short. How many sequences are shorter than 200 aa or longer than 500
aa? Removing these sequences, you can proceed to the next step with a
clean set of sequences.
Multiple sequence alignment. The next step is to produce a multiple alignment. You will
use ClustalW a popular program for multiple alignment. You can take a quick look at
the alignment in the graphical display of ClustalX (part of the
ClustalW package), just to make sure the result makes sense. Save your
result as worm.aln and proceed to the next step.
Profile searches.
Construct a profile HMM of the
multiple alignment, and to search it against
the sequence database. You will use the HMMER program.
The hmmbuild
command builds a profile "worm.hmm" from the alignment, taking a few
seconds. The hmmcalibrate
command automatically estimates some parameters needed for calculating
accurate E-values in database searches, taking several minutes. The hmmsearch command searches Swissprot
with using the profile, which can take several
hours. The output is a ranked list of hits, giving E-values. Go through
your hmmsearch result, can you find any non-worm GPCRs (with significant
E-values)? The E-value in hmmsearch is different from that in the blast
search: usually an E-value of 0.05 is already a marginal but significant
result.
You can also run all the above programs in the www
version of HMMER. But this is not encouraged because the server can only
process 1-2 external jobs at one time.
PSI-BLAST. As a
comparison, you also run PSI-BLAST to
search AH6.8 against nr. Do you find non-worm GPCR genes? Do
you find any non-GPCR
genes in your result; what are they? Can you explain why PSI-BLAST
includes genes of different functions? Do you think this an artifact
of PSI-BLAST or do
these non-GPCR genes suggest an alternative function of AH6.8?
Resources
HMMER: Sean Eddy's HMMER website http://hmmer.wustl.edu/. If you
don't have an account on bioinf, You can download to your local
machine and run it.