Practical Problem Set #5: Database Search

  1. Both E-value and P-value are indicators of the statistical significance of hits returned from a database search. Describe how they are defined (do not use any formulas).
  2. Since the traditional dynamic programming algorithm appears to be the best way to find an optimal local alignment, why is a heuristic (BLAST) used instead to search sequence databases for similar sequences?
  3. Which of the following guarantees that the alignment found by BLAST involves two proteins of similar function and why?
  4. A researcher was given a newly discovered full-length coding cDNA sequence from human that encodes a MAP kinase important for a signaling pathway she is interested in. She now wants to see whether one or more homologous proteins exist anywhere in the mouse genome, since mouse is the model organism she is studying.
  5. Scoring matrices assume that the comparison of amino acids in all positions of a protein is equally important. In reality, some positions in the sequence are more important than others. Explain the basic differences between PSI-BLAST and BLASTP (protein-protein BLAST). What kind of challenges drove the development of PSI-BLAST?
  6. The E-value (E) gives a measure of the number of false results (false positives) you would expect to see if you used a given alignment score threshold.
  7. The BLAT algorithm for aligning sequences to a full genome is fast, even faster than BLAST. What important difference in approach between BLAT and BLAST helps account for this speedup?
  8. BLAST is a heuristic tool that emulates local alignment. Why isn't there a Basic Global Alignment Search Tool (BGAST) emulating global alignment to complement BLAST? How useful would this tool be? What are the difficulties in implementing this algorithm?
  9. The following graph represents the score of the BLAST alignment as it is extended from an initial high-scoring pair [at (a)] in a single direction to position (i). The final alignment will start at position (a) and end at which position?

  10. In general, the BLAST algorithm was designed for a good trade-off between sensitivity and running time. (Here we define sensitivity as the ability to find all the high-scoring alignments in the database.) What algorithmic changes could you suggest (not parameter changes like word size and score thresholds) to make the BLAST algorithm more sensitive? What changes would you suggest to dramatically speed up the algorithm?
  11. When aligning large sequences of DNA, a repeat mask is usually used to mask out repetitive regions of DNA. Why is this? What would happen if you didn't use repeat masks?
  12. The protein Cmpp16 was extensively studied in the Xoconostle-Cazares et al. paper listed under Suggested Reading. Cmpp16 was determined to be a paralog of viral movement proteins (MPs). Taking the sequence for Cmpp16, use any tools you can (in particular, those learned about so far in the course) to either support or refute the authors' claim that Cmpp16 is a paralog of viral MPs.
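
As background for question 9, the way an ungapped extension can be cut off may be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BLAST implementation: the function name extend_right, the per-column scores, and the drop-off value are all hypothetical, and real BLAST takes its column scores from a substitution matrix and also performs gapped extension.

```python
def extend_right(scores, x_drop):
    """Extend an alignment one column at a time with a drop-off rule.

    scores: per-column scores to the right of the seed (hypothetical input).
    x_drop: stop once the running score falls more than this below the best seen.
    Returns (best_score, best_length): the extension trimmed back to its maximum.
    """
    running = 0      # score of the extension up to the current column
    best_score = 0   # highest running score seen so far
    best_len = 0     # number of columns at which that maximum was reached
    for i, s in enumerate(scores, start=1):
        running += s
        if running > best_score:
            best_score = running
            best_len = i
        elif best_score - running > x_drop:
            break    # fallen too far below the best: give up extending
    return best_score, best_len


# Hypothetical column scores, chosen only to show the effect of the cutoff.
column_scores = [5, -2, 4, -3, -3, 6, -4, -4, -4, 14]
strict = extend_right(column_scores, x_drop=8)
lenient = extend_right(column_scores, x_drop=100)
print(strict, lenient)
```

With the strict drop-off the extension stops before the last column and reports the earlier maximum; with the lenient one it continues and finds a higher score. This is the kind of sensitivity versus speed trade-off that question 10 asks about.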


Part A: MJ1477: the Missing aaRS?

Aminoacyl-tRNA synthetases (aaRSs) are important proteins that catalyze the reaction that attaches amino acids to their corresponding tRNAs. The tRNAs are then used by the ribosome to extend peptide sequences during translation. Because they are essential for protein synthesis, aaRSs are present in nearly all cellular organisms. Methanococcus jannaschii, an archaeon whose genome has recently been sequenced, does not appear to have an aaRS for cysteine (CysRS). Fabrega et al. (2001) devised a novel computational method to search for the missing CysRS and concluded that the MJ1477 protein was responsible for the CysRS activity. Here we will attempt to validate their results. (NOTE: Please use the default parameters unless otherwise specified.)

  1. Go to Swiss Prot and search for MJ1477. Get the amino acid sequence in FASTA format from the website (you don't need to turn in the sequence with this assignment).
  2. Go to NCBI and access the PSI-BLAST page. Enter the amino acid sequence as the query, use nr as the database, and set the inclusion threshold to 0.005. Run the query. On the next page you will need to click Format! to view your results.
  3. Take a quick look at your results. Run the PSI-BLAST iterations until your results converge.
  4. Examine your results. Notice that most of the genes are predicted or hypothetical. Find the first result that has a known function (HINT: its E-value is greater than 1e-35). Based on the score, alignment, and E-value for this protein, would you consider these proteins homologous? If so, does the function of that protein provide any evidence that MJ1477 might be the missing CysRS protein? Do you agree with the conclusion of Fabrega et al.? Compare your conclusion with the results from Ruan et al. (2004).
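
When weighing the score and E-value in step 4, it can help to recall how the two are related. The sketch below uses the standard relations E = m * n * 2^(-S') for a bit score S' and P = 1 - exp(-E). The function names and the example numbers are hypothetical, and the sketch ignores the effective-length corrections that BLAST applies, so its values will not match the web output exactly.

```python
import math


def evalue_from_bits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S') with raw lengths m and n
    (assumption: no edge-effect or effective-length correction).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate for very small E.
    return -math.expm1(-e)


# Hypothetical numbers: a 250-residue query against 1e9 database residues.
e = evalue_from_bits(bit_score=50, query_len=250, db_len=1_000_000_000)
p = pvalue_from_evalue(e)
print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two values are nearly identical, which is why BLAST reports only the E-value. Each additional bit of score halves E.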

Part C: Unknown Sequences

A Biology Professor was browsing through the data repository for his lab and found a folder of sequences that lacked identification. He noticed the data was put there by one of his bioinformatics students that he had earlier asked to study sequence alignment statistics. Curious what the sequences are, he has asked you to figure out where the sequences came from.

  1. Get seq1.fa and seq2.fa from the class website, which are just 2 of over a thousand sequences in the folder.
  2. BLAST both sequences against the nr database.
  3. Based on the results you received, what are you going to report to the Biology Professor? Where do you think the sequences came from?
  4. Try BLASTing seq3.fa and seq4.fa, which also came from the same set of sequences. Do these results support your claim?
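
With over a thousand sequences in the folder, it may be convenient to summarize hits by script rather than by eye. The following is a minimal sketch that assumes the searches were run with a local BLAST+ installation and saved in tabular form (-outfmt 6) instead of through the web page used in this assignment. The sample rows, subject names, and the 1e-3 cutoff are made up for illustration and are not results for seq1.fa to seq4.fa.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bit_score: float


def parse_tabular(text):
    """Parse BLAST+ tabular output (default 12 columns of -outfmt 6)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        # Default column order: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        hits.append(Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11])))
    return hits


def significant(hits, max_evalue=1e-3):
    """Keep hits at or below an E-value cutoff (the cutoff is an assumption)."""
    return [h for h in hits if h.evalue <= max_evalue]


# Made-up rows, for illustration only.
SAMPLE = (
    "queryA\tsubjX\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370\n"
    "queryA\tsubjY\t31.0\t45\t28\t1\t5\t49\t80\t124\t2.4\t28.1\n"
    "queryB\tsubjZ\t29.0\t38\t27\t0\t10\t47\t3\t40\t7.9\t26.2\n"
)

hits = parse_tabular(SAMPLE)
for h in significant(hits):
    print(h.query, h.subject, h.evalue, h.bit_score)
```

Queries that leave no hits under the cutoff are as informative as those that do, so it is worth listing them as well.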

Suggested Reading

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Bray N, Dubchak I, Pachter L. AVID: A Global Alignment Program. Genome Research, Vol. 13, Issue 1, 97-102, January 2003.

Fabrega, et al. An aminoacyl tRNA synthetase whose sequence fits into neither of the two known classes. Nature. 2001 May 3;411(6833):110-4.

Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000 May;16(5):227-31.

Iyer LM, Aravind L, Bork P, Hofmann K, Mushegian AR, Zhulin IB, Koonin EV. Quod erat demonstrandum? The mystery of experimental validation of apparently erroneous computational analyses of protein sequences. Genome Biol. 2001;2(12):RESEARCH0051. Epub 2001 Nov 13.

Karlin, S. & Altschul, S.F. (1993) "Applications and statistics for multiple high-scoring segments in molecular sequences." Proc. Natl. Acad. Sci. USA 90:5873-5877.

Korf, Yandell, and Bedell. BLAST. O'Reilly 2003.

Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64.

Pevsner. (2003) Bioinformatics and Functional Genomics. Wiley-Liss. Ch 4-5.

Ruan B, et al. Cysteinyl-tRNA(Cys) formation in Methanocaldococcus jannaschii: the mechanism is still unknown. J Bacteriol. 2004 Jan;186(1):8-14.

Xoconostle-Cazares B, Xiang Y, Ruiz-Medrano R, Wang HL, Monzer J, Yoo BC, McFarland KC, Franceschi VR, Lucas WJ. Plant paralog to viral movement protein that potentiates transport of mRNA into the phloem. Science. 1999 Jan 1;283(5398):94-8.