E-value and P-value are indicators of the statistical significance of hits
returned from a database search. Describe how they are defined (do not use any
If a traditional dynamic programming algorithm appears to be the best way to find an
optimal local alignment, why is a heuristic (BLAST) used instead to search
sequence databases for similar sequences?
Which of the following guarantees that the alignment found by BLAST involves two
proteins of similar function, and why?
high alignment score (S)
high percent identity between the sequences
A researcher was given a newly discovered full-length coding cDNA sequence from
human that encodes a MAP kinase that is important for a signaling pathway that
she is interested in. She now wants to see if a homologous protein(s) exists
anywhere in the mouse genome, which is the model organism she is studying.
Of the many versions of BLAST, which program should she use?
If time and computational power were not issues, would your answer change?
Scoring matrices assume that the comparison of amino acids in all positions of a
protein is equally important. In reality, some positions in the sequence are
more important than others. Explain the basic differences between PSI-BLAST
and BLASTP (protein-protein blast). What kind of challenges drove the
development of PSI-BLAST?
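To make the idea of position-dependent scoring concrete, here is a minimal sketch of a position-specific scoring matrix built from a small ungapped alignment. It is a simplification under stated assumptions: a uniform background frequency, a fixed pseudocount, and no sequence weighting. PSI-BLAST itself uses more elaborate weighting and pseudocount schemes, and the function names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes all sequences have equal length and contain no gap characters.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        # Start every residue at the pseudocount so unseen residues score finitely.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score(pssm, seq):
    """Sum the column-specific scores of seq against the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    pssm = build_pssm(["GKS", "GKT", "GRS"])
    print(score(pssm, "GKS"), score(pssm, "AAA"))
```

Unlike a fixed substitution matrix, the same residue can receive a different score in each column, so conserved positions contribute more to the total.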
The E-value (E) gives a measure of the number of false results (false positives)
you would expect to see if you used a given alignment score threshold.
How would you expect the E-value E to change if we double the length of our query?
How would you expect the E-value E to change if we cut the size of our database in half?
In general, would you expect the E-value E to change if we use a different scoring matrix?
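For experimenting with the E-value questions above, the following is a minimal sketch of the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the alignment score, together with the corresponding P-value P = 1 - exp(-E). The values of K and lambda below are illustrative placeholders only (real values depend on the scoring system in use), and the function names are hypothetical.

```python
import math

# Illustrative parameters (assumed values, not taken from any specific matrix).
K = 0.13
LAMBDA = 0.32


def expected_hits(score, query_len, db_len, k=K, lam=LAMBDA):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of seeing at least one chance alignment, given E."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    e = expected_hits(score=80, query_len=300, db_len=1_000_000)
    print(f"E = {e:.3g}, P = {p_value(e):.3g}")
```

Varying `query_len`, `db_len`, and the two parameters is a quick way to check your reasoning before writing up an answer.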
The BLAT algorithm for aligning sequences to a full genome is
fast, even faster than BLAST. What important difference in approach between
BLAT and BLAST helps account for this speedup?
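As background for this question, here is a minimal sketch of seed lookup against a precomputed index of non-overlapping k-mers from the genome, which is the indexing strategy BLAT is generally described as using. The toy sequences, the value of k, and the function names `index_genome` and `seed_hits` are assumptions for illustration; the real tool adds further steps such as clustering and extending hits.

```python
from collections import defaultdict


def index_genome(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def seed_hits(query, index, k):
    """Look up every overlapping k-mer of the query in the genome index."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, g_pos))
    return hits


if __name__ == "__main__":
    index = index_genome("ACGTACGTTTGACCA", 3)  # built once, reused per query
    print(seed_hits("GTTTGAC", index, 3))
```

The index is built once and kept in memory, so each new query costs only a pass over the query itself.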
BLAST is a heuristic tool that emulates local alignment. Why
isn't there a Basic Global Alignment Search Tool (BGAST) emulating global
alignment to complement BLAST? How useful would this tool be? What are the
difficulties in implementing this algorithm?
The following graph represents the score of the BLAST alignment
as it is extended from an initial high-scoring pair [at (a)] in a single
direction to position (i). The final alignment will start at position (a) and
end at which position?
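The extension described above can be sketched as follows, assuming the commonly described X-drop rule: extension stops once the running score falls more than a fixed amount below the best score seen so far, and the alignment is trimmed back to the position of that best score. The per-column scores and the `x_drop` value are made up for illustration, and `extend_right` is a hypothetical name.

```python
def extend_right(scores, x_drop):
    """Extend an ungapped alignment one column at a time to the right.

    scores: per-column scores to the right of the seed (seed score excluded).
    Returns (best_offset, best_score), where best_offset is the number of
    columns kept after trimming back to the running maximum.
    """
    running = 0
    best_score = 0
    best_offset = 0
    for offset, column_score in enumerate(scores, start=1):
        running += column_score
        if running > best_score:
            best_score = running
            best_offset = offset
        elif best_score - running > x_drop:
            break  # fell more than x_drop below the best score so far
    return best_offset, best_score


if __name__ == "__main__":
    print(extend_right([2, 3, -1, 4, -3, -3, -2, 5], x_drop=6))
```

Comparing where the loop stops with where the returned offset lies is a useful check on your reading of the graph.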
The BLAST algorithm was designed for a good trade-off between sensitivity and
running time. (Here we define sensitivity as our ability to find all the
high-scoring alignments in the database.) What algorithmic changes could you
suggest (not parameter changes like word size and score thresholds) to make the
BLAST algorithm more sensitive? What changes would you suggest to dramatically
speed up the algorithm?
When aligning large
sequences of DNA, a repeat mask is usually used to mask out repetitive
regions of DNA. Why is this? What would happen if you didn't use a repeat mask?
The protein Cmpp16 was extensively studied in the Xoconostle-Cazares et al. paper
above. Cmpp16 was determined to be a paralog of viral movement proteins (MP).
Taking the sequence for Cmpp16, use any tools you can (in particular, those
learned about so far in the course) to either support or refute the claim the
authors make that Cmpp16 is a viral MP protein.
Part A: MJ1477: the Missing aaRS?
Aminoacyl-tRNA synthetases (aaRSs) are important proteins
that catalyze the reaction that attaches amino acids to their corresponding
tRNAs. The tRNAs are then used by the ribosome to extend peptide sequences
during translation. aaRSs are for the most part present in all cellular
organisms since they are essential for protein synthesis. Methanococcus
jannaschii, an archaeon whose genome has
recently been sequenced, does not appear to have an aaRS for cysteine (CysRS).
et al (2001) devised a novel computational method to search for the missing
CysRS and concluded that the MJ1477 protein was responsible for the CysRS
activity. Here we will attempt to validate their results. (NOTE: Please use
the default parameters unless otherwise specified.)
Go to Swiss-Prot and search for MJ1477.
Get the amino acid sequence in FASTA format from the website (you don't
need to turn in the sequence with this assignment).
Go to NCBI and access the
PSI-BLAST page. Enter the amino acid sequence as the query, use nr as
the database, and set the inclusion threshold to 0.005. Run the query.
On the next page you will need to click Format! to view your results.
Take a quick look at your
results. Run the PSI-BLAST iterations until your results converge.
Examine your results. Notice
that most of the genes are predicted or hypothetical. Find the first
result that has a known function (HINT: its E-value is greater than
1e-35). Based on the score, alignment, and E-value for this protein,
would you consider these proteins homologous? If so, does the function of
that protein provide any hint of evidence that MJ1477 might be the missing
CysRS protein? Do you agree with their conclusion? Compare your
conclusion with the results from Ruan
Part C: Unknown Sequences
A Biology Professor was browsing through the data repository
for his lab and found a folder of sequences that lacked identification. He
noticed the data was put there by one of his bioinformatics students that he
had earlier asked to study sequence alignment statistics. Curious what the
sequences are, he has asked you to figure out where the sequences came from.
Get seq1.fa and seq2.fa from
the class website, which are just 2 of over a thousand sequences in the folder.
BLAST both sequences against
the nr database.
Based on the results you
received, what are you going to report to the Biology Professor? Where do
you think the sequences came from?
Try BLASTing seq3.fa and
seq4.fa, which also came from the same set of sequences. Do these results
support your claim?