Practical Problem Set #1: Working With Databases

Part A: Green Fluorescent Protein

One of the best-studied proteins in molecular biology is the green fluorescent protein, or GFP. For this problem, you will visit popular online biological database websites and gather information on GFP.

Genbank:

  1. Genbank is a database of nucleotide sequences. It can be accessed at the NCBI (National Center for Biotechnology Information) website at http://www.ncbi.nlm.nih.gov/.  In the search pull-down menu at the top, make sure "nucleotide" is selected.  In the search text box at the top of the screen, type "GFP" and hit the Go button.
  2. This search will bring up over 1000 results.  To narrow the search, click on "Limits" just below the box where you typed "GFP".  Limit the search to "gene name" (in the dropdown box) and click the "Go" button again.  You will now have approximately 50 results.  Go to the end of the list (you will have to click "next" one time; the "next" link appears to the right).
  3. The last two entries, M62653 and M62654, are from a seminal 1992 paper.  Click on M62653 (the last entry), look over the Genbank record, and answer the following questions:
    1. How long is the nucleotide sequence?
    2. How many "guanines" appear in the gene's DNA sequence?  Is there a bias towards any particular nucleotide?
    3. What is the Latin name of the organism whose DNA was sequenced for this GFP?
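
Counting nucleotides by hand is error-prone, so a short script can help check your answers to the length and guanine questions. The following is a minimal Python sketch, not a required method. The function name nucleotide_composition and the toy input are made up for illustration; the toy string only imitates the layout of a Genbank sequence block (position numbers followed by bases) and is not the M62653 sequence.

```python
from collections import Counter

def nucleotide_composition(sequence):
    """Return (length, counts, fractions) for a DNA sequence string."""
    # Keep letters only, so position numbers and spaces pasted from a
    # Genbank record are ignored.
    bases = [ch for ch in sequence.lower() if ch.isalpha()]
    counts = Counter(bases)
    length = len(bases)
    # Fraction of the sequence made up by each base (empty if no bases).
    fractions = {b: n / length for b, n in counts.items()} if length else {}
    return length, counts, fractions

# Toy input (hypothetical, for illustration only).
toy = "1 atggcgtaca 11 ggt"
length, counts, fractions = nucleotide_composition(toy)
print(length, counts["g"], fractions)
```

If every base were equally likely, each fraction would be close to 0.25; a fraction well above or below that suggests a bias.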

Swissprot:

  1. Swissprot is a database of amino acid sequences that can be accessed at http://us.expasy.org/sprot/.  At the Swissprot homepage, type GFP and click the Search button.  The last link in the Swissprot section (not the trembl section) should be GFP_AEQVI (P42212).
  2. Examine the web page for this protein, and answer the following:
    1. How many references are cited?
    2. This Swissprot record has links to other databases.  Pfam (Protein Families) is a database of multiple alignments. Pfam accession numbers begin with the letters PF, followed by five digits (e.g. PF12345).  What is the Pfam accession number for GFP_AEQVI? (NOTE: An accession number is simply a tag that you can use to refer to a particular item in a database.  Many of the databases you will use will have accession numbers.  There is no standard formatting for accession numbers across databases.)
  3. The Swissprot database is available via ftp.  To see the data in its textual format (i.e. what you get when you ftp), scroll down to the bottom of the GFP_AEQVI web page, and click the link that says "View entry in raw text format (no links)."  Answer the following questions:
    1. The first two letters on each line identify what kind of line it is (e.g. ID = Identifier, DT = date, etc.)  Find the line that has the Latin name for the species.   What two letters appear at the beginning of the line?
    2. What two symbols, which appear on a line by themselves at the bottom of the file, indicate the end of the record for GFP_AEQVI?  (In the ftp file that you can download, these symbols are the "record separators".)
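
Because every line of the raw text format starts with a two-letter code, the file is easy to process with a script. The following is a minimal Python sketch that groups the lines of one record by their code. The function name group_by_line_code and the toy record are made up for illustration; the toy record uses only the ID and DT codes mentioned above and is not the GFP_AEQVI entry.

```python
from collections import defaultdict

def group_by_line_code(record_text):
    """Map each two-letter line code to the list of line contents carrying it."""
    groups = defaultdict(list)
    for line in record_text.splitlines():
        code = line[:2]
        # Only lines that begin with two uppercase letters are grouped;
        # anything else (blank lines, indented data, symbols) is skipped.
        if len(code) == 2 and code.isalpha() and code.isupper():
            groups[code].append(line[2:].strip())
    return dict(groups)

# Toy record (hypothetical, for illustration only).
toy_record = "ID   TOY_EXAMPLE\nDT   01-JAN-2000\nDT   02-FEB-2001\n"
for code, contents in group_by_line_code(toy_record).items():
    print(code, len(contents), contents)
```

Pasting the raw text of a real entry into toy_record lists every line code that the entry uses, which makes it easier to find the line you are looking for.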

Protein Data Bank:

  1. The PDB (Protein Data Bank) is a database of protein structures at http://www.rcsb.org/pdb. From that page, click the “SearchLite” link.  On the resulting page, type “GFP” into the text box and click the “Search” button. Look at the first result (1EMB) and click the “Explore” link to the right. Then click on the “Download/Display File” link (on the left). Then, click on the link to display the structure file in PDB file format complete with coordinates as HTML.
  2. In this file the majority of lines are “ATOM” lines.  Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1908).  Answer the following questions:
    1. What kind of atom is #16? (3rd column)
    2. What kind of amino acid is atom #16 in? (4th column)
    3. What are the (x,y,z) coordinates of atom #16?
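
ATOM lines in the PDB file format are fixed-width, so a program normally reads each field by character position instead of splitting on spaces. The following is a minimal Python sketch of that idea. The function name parse_atom_line is made up for illustration, and the sample line is built from hypothetical values (it is not taken from 1EMB). The character positions follow the published PDB format: atom number in characters 7-11, atom name in 13-16, residue name in 18-20, and x, y, z in 31-38, 39-46 and 47-54.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width PDB ATOM line."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM line")
    return {
        "serial": int(line[6:11]),        # atom number
        "atom": line[12:16].strip(),      # atom name
        "residue": line[17:20].strip(),   # amino acid (three-letter code)
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

# Build a sample line with hypothetical values; the format widths
# guarantee that each field lands in its proper character positions.
sample = (f"{'ATOM':<6}{16:>5} {' CA ':<4}{' '}{'GLY':>3} {'A'}{4:>4}{' '}   "
          f"{1.5:8.3f}{-2.25:8.3f}{30.0:8.3f}")
print(parse_atom_line(sample))
```

Reading by position matters because neighboring fields in a real file can run together with no space between them, in which case splitting on whitespace gives the wrong columns.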

ENSEMBL:

  1. ENSEMBL is a web-based genomic resource available at http://www.ensembl.org/.  This address takes you to the ENSEMBL home page.  How many species are available on this website?
  2. Search for anything with “zinc finger” (a structural motif in proteins).  The results are grouped in several ways; you are looking for the GENE index.  Find the first mouse GENE (mouse is Mus musculus) and follow its link to that gene's page.  Record the following:
    1. The ensembl mouse gene id for this first link.
    2. The genomic location for this mouse gene.
    3. The cDNA transcript for this mouse gene, found by following the link to view the gene in its genomic location and looking over the basepair view.  Move the pointer slowly.
    4. The ensembl human gene id for the homologous protein (go back to the mouse gene-specific page and look down to the homology section).
    5. The genomic location for this human gene.
    6. How many cDNA transcripts are given? Record the one for ENSESTT00000205818.
    7. What is the Hamming distance between the first ten nucleotides of the mouse and human cDNA transcripts?
    8. Human and mouse genomes can be partitioned into a large number of synteny blocks, with each human synteny block corresponding to a mouse synteny block. What mouse synteny block is the mouse gene located on? What human synteny block does the human homolog belong on? What is the correspondence between these two synteny blocks?
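
The Hamming distance between two sequences of equal length is the number of positions at which they differ. The following is a minimal Python sketch for checking a hand count. The function name hamming_distance and the two ten-letter strings are made up for illustration; they are not the transcripts from this exercise.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Toy sequences (hypothetical, for illustration only).
first = "ATGGCGTACA"
second = "ATGACGTTCA"
print(hamming_distance(first[:10], second[:10]))  # prints 2
```

The two toy sequences differ at positions 4 and 8, so the distance is 2.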

Part B: Searching A Nucleotide Sequence Database

  1. Go to the following web page: http://nh-brin.unh.edu/Bioinformatics/Tutorials/DinoDNA/
  2. Copy the DNA sequence marked JurassicPark DinoDNA from the book Jurassic Park.  (Read the text to learn the story behind this particular DNA).
  3. Go to the NCBI BLAST home page at http://www.ncbi.nlm.nih.gov/BLAST/.  Follow the link that says Nucleotide-nucleotide BLAST [blastn].
  4. Paste the DinoDNA DNA sequence into the text box and hit the Blast! button.
    1. What is the gi number of the first result?
    2. What is the length of the match of the first result?
    3. What is the e-value of the first result?
    4. Is the DNA sequence in Jurassic Park fictional (i.e. made up / random) or “borrowed” (i.e. copied from real DNA)?