Practical Problem Set #3: Sequence Alignment

Dot-Matrices, Global, and Local Alignments

  1. Dot matrix plots provide a quick way to visualize the similarities between two sequences. The following plots were made with a Java applet; a quick tutorial on dot plots is available on the applet's website. Two parameters commonly adjusted to improve the readability of the plots are the window length and the number of allowable mismatches per window.

  2. Why is using percent identity alone not the best approach for assessing the quality of a local alignment?
  3. Study the two alignments below:

    Alignment 1

               : :: :  : :   ::: :::   ::


    Alignment 2

               :   ::       ::::::::         ::
    Seq2       A---TG-------CGGCTAATGGCAATATACG
  4. You are shown the two alignments below. One is an alignment of two DNA sequences with an identity of 36%. The other alignment is of two amino acid sequences with an identity of 28%. Which of the two alignments represents greater biological similarity between sequences?

    DNA Alignment

           : ::  :   :  ::        ::


    Amino Acid Alignment

           :  :::            :    ::
  5. Fill out the dynamic programming tables below using these scoring parameters: match +1, mismatch -1, insertion/deletion -1. In each case, show the path corresponding to the optimal alignment, then write out the optimal alignment and compute its score.
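
One way to check a hand-filled table is to compute it programmatically. The following is a minimal Python sketch of the global-alignment (Needleman-Wunsch) recurrence using the scoring parameters from problem 5. The function name global_align and the example sequences are illustrative assumptions, not part of the problem set, and the traceback reports only one optimal path even when several exist.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global-alignment table and trace back one optimal alignment.

    Hypothetical helper for checking hand-filled tables; the defaults
    mirror the parameters in problem 5 (match +1, mismatch -1, indel -1).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        # Substitution score for a[i-1] paired with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: pair two residues
                score[i - 1][j] + gap,            # up: gap in b
                score[i][j - 1] + gap,            # left: gap in a
            )

    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


# Illustrative sequences (not the ones from the problem set).
table, top, bottom = global_align("ACGT", "AGT")
for row in table:
    print(row)
print(top)               # ACGT
print(bottom)            # A-GT
print(table[-1][-1])     # 2
```

The bottom-right cell holds the optimal global score. For a local alignment, the usual change is to clamp every cell at zero and start the traceback from the highest-scoring cell instead.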

Discovering Similarities between Oncogenes

Russell Doolittle pioneered the application of sequence analysis algorithms in the late '70s and early '80s. Doolittle queried an early database of biological sequences to identify genes with similar functions. In the exercise below we will follow the steps Doolittle took to discover functional roles for the v-mos oncogene of the Moloney Murine Sarcoma Virus. Not long after the v-mos gene was sequenced at the Salk Institute, a group studying the v-src oncogene of the Rous Sarcoma Virus published their findings along with the sequence in 1980. The group made an early attempt to find similarities between the sequences of the two genes, but found none.


  1. The Biology Workbench is a suite of web-based programs for sequence-based analysis. You will need to create a new account; after creating it, log on. (NOTE: Unless otherwise specified, use the default parameters for each algorithm.)
  2. Using the Nucleotide Tools menu, upload the two sequences found in the files vmos.fasta and src.fasta on the class website. (vmos.fasta contains the v-mos oncogene from the Moloney Murine Sarcoma Virus; src.fasta contains the v-src gene from the Rous Sarcoma Virus.)
  3. Using the ALIGN tool for global alignment and then the LALIGN tool for local alignment, align the sequences and comment on the results. Write down the percent identity and the alignment score for each algorithm.
  4. Based on your alignments above, would you consider the sequences homologous? (HINT: What percent identity would you expect for random sequences?)
  5. Next we will translate the nucleotide sequences into amino acid sequences, one at a time, using the SIXFRAME tool. Why does the SIXFRAME tool give you six possible amino acid translations? Which should you choose and why? For each gene, select the frame that you think is most promising and import it using the button at the bottom of the page.
  6. Next align the two amino acid sequences (one from each original gene) using the ALIGN and LALIGN programs in the Protein Tools section. If you are unsure whether you chose the correct frame in step 5, try different frames until you get the best alignment score possible. Compare this alignment with the nucleotide alignment. Would you consider it better evidence of homology than the nucleotide alignment? Why or why not?
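
The translation in step 5 can be sketched in a few lines of Python. This is a minimal sketch, not the SIXFRAME implementation: it assumes the standard genetic code and an uppercase-able DNA alphabet (A, C, G, T), and the names six_frames, translate, and longest_open_stretch are illustrative. The "longest stretch without a stop codon" heuristic at the end is only one common way to shortlist a frame.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three offsets on each strand: frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of residues between stop codons ('*')."""
    return max(protein.split("*"), key=len)


# Illustrative toy sequence (not taken from vmos.fasta or src.fasta).
frames = six_frames("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein, len(longest_open_stretch(protein)))
```

A codon is three bases long, so a sequence can be read starting at any of three offsets, and the coding strand is not known in advance; that gives three frames on each of the two strands.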

Global vs. Local Alignment

You have been given the amino acid sequence of an unknown mouse gene. You have decided that the gene may have some similarities with two human genes, DAPK1 and CDH1. Their definitions from LocusLink are given below.


DAPK1 - Death-associated protein kinase 1 is a positive mediator of gamma-interferon induced programmed cell death. DAPK1 encodes a structurally unique 160-kD calmodulin dependent serine-threonine kinase that carries 8 ankyrin repeats and 2 putative P-loop consensus sites. It is a tumor suppressor candidate.

CDH1 - This gene is a classical cadherin from the cadherin superfamily. The encoded protein is a calcium dependent cell-cell adhesion glycoprotein comprised of five extracellular cadherin repeats, a transmembrane region and a highly conserved cytoplasmic tail. Mutations in this gene are correlated with gastric, breast, colorectal, thyroid and ovarian cancer. Loss of function is thought to contribute to progression in cancer by increasing proliferation, invasion, and/or metastasis. The ectodomain of this protein mediates bacterial adhesion to mammalian cells and the cytoplasmic domain is required for internalization. Identified transcript variants arise from mutation at consensus splice sites.


We will compare the unknown gene to each of the two given genes to see if we can learn more about the function of the unknown gene.

  1. Log on to the Biology Workbench (see above).
  2. Upload the following amino acid sequence files from the class website: mouse-unknown.fasta, human-DAPK.fasta, human-CDH1.fasta.
  3. Find the global alignments (Protein Tools => ALIGN) of the unknown sequence with DAPK and of the unknown sequence with CDH1. Which of the sequences (DAPK or CDH1) is more similar (higher alignment score) to the unknown?
  4. Next perform local alignments of the unknown sequence with DAPK and of the unknown sequence with CDH1, using the LALIGN tool. Which two genes give the best local alignment? Which tool (global or local alignment) do you think is more useful for inferring function?
  5. Pfam is a database of protein families and alignment information. Go to the Pfam website and click on the "Protein Search" tab at the top of the screen. From the best local alignment you generated in step 4, copy the portion of one protein's sequence that is covered by the alignment into the window on the Pfam protein search page and click on "Search Pfam". Describe the result you get. Is the result surprising given the definitions of the genes above?
  6. A Java applet for displaying dot matrix plots is available online. Enter the sequences for each of the proteins. Experiment with the parameters and plot each sequence against each of the other sequences. Can you see evidence of your alignments in the plots?
  7. Try plotting DAPK against itself. What can you infer about the sequence of DAPK from the plot? What can you say about repeats within this protein (how many, size)? Look up DAPK on the Pfam website and give the name of the domain.
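
The dot matrix idea used in steps 6 and 7 can also be sketched directly in Python. This is a minimal sketch, not the applet's implementation: it places a dot at (i, j) when the windows starting at those positions differ in at most a given number of positions. The names dot_matrix and render, the parameter names window and max_mismatch, and the toy sequence are all illustrative assumptions.

```python
def dot_matrix(seq_a, seq_b, window=3, max_mismatch=0):
    """Return the set of (i, j) window start positions that count as a match.

    A dot is placed when seq_a[i:i+window] and seq_b[j:j+window] differ
    in at most max_mismatch positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            mismatches = sum(
                x != y
                for x, y in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if mismatches <= max_mismatch:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join(
            "*" if (i, j) in dots else "." for j in range(len(seq_b))
        )
        lines.append(residue + " " + row)
    return "\n".join(lines)


# Toy sequence with an internal repeat, plotted against itself.
toy = "ABCABC"
print(render(toy, toy, dot_matrix(toy, toy, window=3, max_mismatch=0)))
```

In a self-comparison the main diagonal is always filled; runs of dots parallel to it indicate internal repeats, and their distance from the main diagonal gives the spacing between repeat copies. Raising window or lowering max_mismatch removes background noise at the cost of missing weaker similarities.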