Practical Problem Set #4: Gene Prediction

Comprehension Questions

  1. Most statistical gene prediction programs require a set of parameters, estimated based on a training set of DNA sequences with genes clearly marked. What are the two major experimental methods used to reliably find a gene?
  2. A single nucleotide substitution at which position in a codon would most likely have the greatest impact on the function of the encoded protein: the first, the second, or the third? Why?
  3. Which of the following point mutations would most likely have the greatest impact on the function of the encoded protein: a single nucleotide substitution (e.g. A mutates to G) or a single nucleotide deletion (e.g. A is deleted from the sequence)? Why?
  4. Find the intron(s) in the "world's shortest intron-containing gene". In addition, spell out the amino acid sequence it encodes.


  5. Although the genetic code is universal, organisms usually have their own preference for codon usage. For example, the web site gives statistics on the codon usage of Escherichia coli. Your colleague has an EST fragment from E. coli with the following sequence: AAGUCAUUAUUUUCG.

    Assuming this is the coding strand, can you help her to identify the most likely translation frame?

  6. The identification of exon-intron junctions is a major challenge for gene prediction algorithms. Conventionally, position weight matrices (PWMs) or profiles have been used for this task. One bioinformatics graduate student, knowing that the branch point 20-50 nucleotides upstream of the 3' splice site is a more biologically important signal than the 3' splice site itself, wants to build a PWM for the 60 nucleotides upstream of the 3' splice site. The student wishes to capture the signal around the branch point, but finds nothing. Can you explain why?
  7. There are two general strategies for performing gene prediction: similarity-based approaches and statistics-based approaches. Explain which genes are likely to be missed by the statistics-based approach and which by the similarity-based approach.
  8. Sequence homology or similarity information is used in both the similarity-based and the comparative genomics approaches to gene prediction. What is the difference between these two approaches?
  9. What is a pseudogene? Why do gene prediction algorithms have difficulty distinguishing pseudogenes from true genes?
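
For questions 4 and 5 it can help to see all three forward reading frames of a short sequence side by side. The following is a minimal sketch, assuming the standard genetic code and a sequence given 5' to 3' on the coding strand; the names CODON_TABLE and translate_forward_frames are hypothetical helpers introduced here for illustration. It does not use any codon usage statistics, so choosing the most likely frame is still left to the reader.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_forward_frames(seq):
    """Translate the three forward frames of a DNA or RNA sequence.

    Returns a dict mapping the frame offset (0, 1, 2) to a protein string.
    Stop codons are shown as '*'; codons with unknown bases become 'X'.
    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = seq.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for offset in range(3):
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames[offset] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


if __name__ == "__main__":
    for offset, protein in translate_forward_frames("AAGUCAUUAUUUUCG").items():
        print(offset, protein)
```

Each frame's codons can then be looked up in a codon usage table for the organism of interest: a frame that is built mostly from rarely used codons is a less plausible coding frame.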


Genscan is one of the best gene-finding algorithms. The UCSC Genome Browser is a convenient graphical visualization tool for genome annotations. In this practical exercise you are given three sequences from the human genome (genomic_dna1.fa, genomic_dna2.fa, genomic_dna3.fa). Your task is to use the gene prediction program Genscan to find potential genes in these genomic sequences, and to validate your findings against the annotations using the UCSC Genome Browser.

  1. Run Genscan. Go to the Genscan web site and submit your sequence. Be sure to check "Print predicted coding sequences (-cds)" so that the output includes the predicted CDS, then wait for the result.
  2. Display the genomic regions in the UCSC Genome Browser. To display the exon structure of a predicted CDS, use the BLAT program by Jim Kent, which quickly searches the human genome for a query sequence and returns the genomic regions with high similarity to it. Go to the UCSC Genome Browser home page, select BLAT and the human genome, paste your predicted CDS, and submit. You will be taken to the browser display window, where your BLAT hit is shown together with several other annotation tracks. Take some time to explore the browser, as it integrates a lot of information. Click on your sequence in the BLAT search track to see the base-by-base display. Make sure you can recognize the signals (start codon, stop codon, splicing signals, etc.).
  3. Compare your predictions against the annotation for known genes and against predictions by other gene-finding algorithms. Start with the known genes track. If the known genes track is not shown, you can display it by using the drop-down controls under the browser window. Analyze the prediction results for each sequence.
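
Before pasting a predicted CDS into BLAT, a quick sanity check of the signals mentioned in step 2 can be done locally. The sketch below is one possible way to do this; check_cds and is_canonical_intron are hypothetical helper names, and it assumes the predicted CDS is a plain nucleotide string that includes its stop codon, which may not hold for partial gene predictions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Report basic coding-sequence signals for a predicted CDS string."""
    seq = cds.upper().replace("U", "T")
    in_frame = len(seq) % 3 == 0
    # Split into complete codons only; leftover bases are ignored here.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return {
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": in_frame and bool(codons) and codons[-1] in STOP_CODONS,
        "length_multiple_of_3": in_frame,
        # Stop codons among all complete codons except the last one.
        "internal_stops": sum(c in STOP_CODONS for c in codons[:-1]),
    }


def is_canonical_intron(intron):
    """True if an intron sequence follows the canonical GT...AG rule."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    print(check_cds("ATGAAATAA"))
    print(is_canonical_intron("GTAAGTTTTCAG"))
```

A prediction that fails one of these checks is not necessarily wrong (Genscan can report partial genes), but it is worth a closer look in the base-by-base display.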

Algorithmic Questions

  1. Smith and Waterman modified the global alignment algorithm by allowing "free rides" from the start to anywhere in the middle of the alignment grid, yielding the local alignment algorithm. When aligning two genomic sequences containing orthologous genes from two organisms, you may want to allow free rides from one node to any node downstream of it (rightward or downward) in the alignment grid, in order to jump over introns. Can you design an efficient dynamic programming algorithm that allows any number of free rides? What about allowing at most k free rides?
  2. Finding a Family of Genes. Suppose the genome of an organism has just been sequenced. Usually, a general-purpose gene prediction algorithm would be used for de novo annotation of this new genome sequence. This time, however, your task is to design a computational strategy to find all genes of a particular family in this genome. You can take advantage of all known gene sequences of the family from other organisms. Read the articles by Manning et al., 2002 and Claudel-Renard et al., 2003 and discuss how you would design your algorithm.
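
As a starting point for question 1, the following sketch shows the baseline local alignment recurrence that the question asks you to extend. It is only the standard Smith-Waterman scoring scheme with a linear gap penalty, not a solution to the free-ride problem; the function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    The max(0, ...) term is the "free ride" from the start node to any
    node of the grid; taking the best score over all cells corresponds to
    a free ride from any node to the end.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the grid
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

This version runs in O(nm) time and keeps only two rows of the grid in memory. A useful question to ask when extending it is what extra quantity each cell would need to remember so that a free ride into that cell can be evaluated without rescanning all upstream cells.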

Suggested Reading

Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.

Claudel-Renard C, Chevalet C, Faraut T, Kahn D. (2003) Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res, 31(22):6633-9

E. Rivas and S.R. Eddy. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2:8.

Kent, W.J. (2002) BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664.

Korf, I., P. Flicek, D. Duan, and M.R. Brent. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140-148.

Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S. (2002) The protein kinase complement of the human genome. Science, 298(5600):1912-34

Pevsner. (2003) Bioinformatics and Functional Genomics. Wiley-Liss. 551-562.

S. Rogic, A. K. Mackworth and B. F. F. Ouellette. (2001) Evaluation of gene finding programs. Genome Research, 11: 817-832.

Zhang MQ. (2002) Computational Prediction of Eukaryotic Protein Coding Genes. Nat. Rev. Genet. 3:698-709