One of the most well studied proteins in molecular biology
is the green fluorescent protein, or GFP.
For this problem, you will visit popular online biological database websites
and gather information on GFP.
Genbank is a database of nucleotide sequences. It can be accessed at the
NCBI website (National Center for Biotechnology
Information) at http://www.ncbi.nlm.nih.gov/.
In the search pull-down menu at
the top, make sure "nucleotide" is selected. In the search text
box at the top of the screen, type "GFP" and click the "Go" button. This
search will bring up over 1000 results. To narrow the search, click on "Limits" just
below the box where you typed "GFP".
Limit the search to "gene name" (in the dropdown box) and
click the "Go" button again. You will now have approximately 50 results. Go to the end of the list (you
will have to click "next" one time; the "next" link
appears to the right).
The last two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653 (the last entry),
look over the Genbank record, and answer the following questions:
How long is the nucleotide sequence?
How many guanines appear in the gene's DNA sequence? Is there a bias towards any particular nucleotide?
What is the Latin name of the organism whose DNA was sequenced for this GFP?
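
The length and base-count questions above can also be checked with a few
lines of code once the sequence has been copied out of the Genbank record.
The following is a minimal sketch, not part of the assignment: the function
name nucleotide_counts is hypothetical, and the toy string is made up (it is
not the M62653 sequence). It assumes the sequence may still contain the
position numbers and spaces that Genbank prints alongside the bases.

```python
from collections import Counter


def nucleotide_counts(sequence: str) -> Counter:
    """Count each base, ignoring digits, spaces, and line breaks."""
    # Genbank prints bases in lowercase with position numbers and spaces,
    # so keep only alphabetic characters and normalize the case.
    bases = [ch for ch in sequence.lower() if ch.isalpha()]
    return Counter(bases)


# Made-up toy input in Genbank-like layout (NOT the real GFP sequence).
toy = "        1 acgt acgg\n"
counts = nucleotide_counts(toy)
print("length:", sum(counts.values()))
print("guanines:", counts["g"])
```

Comparing the four counts against one quarter of the total length gives a
quick impression of whether any base is over-represented.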
Swissprot is a database of amino acid sequences that can
be accessed at http://us.expasy.org/sprot/. At the Swissprot homepage, type
GFP and click the Search button. The last link in the Swissprot section (not the trembl section)
should be GFP_AEQVI (P42212).
Examine the web page for this protein, and answer the following:
How many references are cited?
The Swissprot record has links to other databases. Pfam (Protein Families) is a database of multiple
alignments. Pfam accession numbers begin with the letters PF, followed by
five digits (e.g. PF12345).
What is the Pfam accession number for GFP_AEQVI? (NOTE: An
accession number is simply a tag that you can use to refer to a
particular item in a database.
Many of the databases you will use will have accession
numbers. There is no
standard formatting for accession numbers across databases.)
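
Because each database has its own accession format, a format check has to be
written per database. The sketch below covers only the Pfam pattern described
above (the letters PF followed by five digits). The function name
is_pfam_accession is hypothetical, and PF12345 is the placeholder from the
text, not a real answer.

```python
import re

# "PF" followed by exactly five digits, as described in the text.
PFAM_ACCESSION = re.compile(r"PF\d{5}")


def is_pfam_accession(text: str) -> bool:
    """Return True if the whole string looks like a Pfam accession number."""
    return PFAM_ACCESSION.fullmatch(text) is not None


print(is_pfam_accession("PF12345"))  # placeholder accession from the text
print(is_pfam_accession("P42212"))   # a Swissprot accession, different format
```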
The Swissprot database is available via ftp. To see the data in its textual format (i.e. what you
get when you ftp), scroll down to the bottom of the GFP_AEQVI web page,
and click the link that says "View entry in raw text format (no
links)." Answer the following questions.
The first two letters on each line identify what kind of line it is (e.g. ID
= Identifier, DT = date, etc.)
Find the line that has the Latin name for the species.
What two letters appear at the
beginning of the line?
What two symbols, which appear on a line by themselves at the bottom of the
file, indicate the end of the record for GFP_AEQVI? (In the ftp file that you can
download, these symbols are the "record separators".)
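
Since every line of the raw text format begins with a two-letter code, the
file is easy to process line by line. The following is a minimal sketch that
tallies how often each code occurs. The function name line_code_counts is
hypothetical, and the sample text is made up; it uses only the ID and DT
codes mentioned above and is not the GFP_AEQVI record. Lines that begin with
a space are skipped, on the assumption that they continue the previous block
rather than carry a code of their own.

```python
from collections import Counter


def line_code_counts(flat_text: str) -> Counter:
    """Tally the two-character code at the start of each line."""
    codes = Counter()
    for line in flat_text.splitlines():
        # Skip blank lines and indented continuation lines.
        if len(line) >= 2 and not line.startswith(" "):
            codes[line[:2]] += 1
    return codes


# Made-up sample in the same layout (NOT the real record).
sample_text = (
    "ID   EXAMPLE_ENTRY\n"
    "DT   01-JAN-2000\n"
    "DT   02-FEB-2001\n"
)
print(line_code_counts(sample_text))
```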
Protein Data Bank:
The PDB (Protein Data Bank) is a database of protein
structures at http://www.rcsb.org/pdb.
From that page, click the “SearchLite” link. On the resulting page, type “GFP” into the text box and
click the “Search” button. Look at the first result (1EMB) and click the
“Explore” link to the right. Then click on the “Download/Display File”
link (on the left). Then, click on the link to display the structure file
in PDB file format complete with coordinates as HTML.
In this file, the majority of lines are “ATOM” lines. Scroll down until you see those
lines and note how the atoms are numbered (in this case, 1 to 1908). Answer the following questions:
What kind of atom is #16? (3rd column)
What kind of amino acid is atom #16 in? (4th column)
What are the (x,y,z) coordinates of atom #16?
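
The ATOM lines described above are fixed-width records, so the fields can be
read by character position as well as by eye. The following is a minimal
sketch: the names Atom and parse_atom_line are hypothetical, and the sample
record is made up (it is not taken from 1EMB). The column positions follow
the PDB file format: serial number in columns 7-11, atom name in 13-16,
residue name in 18-20, and x, y, z in 31-38, 39-46, and 47-54.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Read the fixed-width fields of one PDB ATOM record."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    # Python slices are 0-based and end-exclusive, so PDB columns 7-11
    # become line[6:11], and so on.
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Made-up record laid out in the fixed columns (NOT an atom from 1EMB).
sample = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    7, " CA ", "GLY", "A", 2, 1.5, -2.25, 10.0
)
atom = parse_atom_line(sample)
print(atom.serial, atom.name, atom.residue, (atom.x, atom.y, atom.z))
```

Splitting on whitespace usually gives the same "3rd column" and "4th column"
view used in the questions, but fixed positions stay correct even when
adjacent fields run together without a space.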
ENSEMBL is a web-based genomic resource available at http://www.ensembl.org/. The first page is the ENSEMBL
home page. How many species
are available on this website?
Search for anything with “zinc finger” (a structural motif in proteins). Find the first mouse GENE, and
browse to that page. Please
note that mouse is Mus musculus,
and that the results are grouped in several ways. You are looking for GENE
index. Follow the first link.
Record the following:
The ensembl mouse gene id for this first link.
The genomic location for this mouse gene.
The cDNA transcript for this mouse gene - found by following the link to view
the gene in its genomic location and looking over the basepair view. Move the pointer slowly.
The ensembl human gene id for the homologous protein (go back to the mouse gene
specific page and look down to homology).
The genomic location for this human gene.
How many cDNA transcripts are given? Record for ENSESTT00000205818.
What is the Hamming distance between the first ten nucleotides?
Human and mouse genomes can be partitioned into a large number of synteny
blocks, with each human synteny block corresponding to a mouse synteny
block. What mouse synteny block is the mouse gene located on? What human
synteny block does the human homolog belong to? What is the
correspondence between these two synteny blocks?
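
For the Hamming distance question above: the Hamming distance between two
strings of equal length is the number of positions at which they differ. The
following is a minimal sketch with hypothetical function names and made-up
toy sequences (they are not the mouse or human transcripts).

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_first_n(seq_a: str, seq_b: str, n: int = 10) -> int:
    """Hamming distance between the first n characters of two sequences."""
    # Normalize case so that 'a' and 'A' are treated as the same base.
    return hamming_distance(seq_a[:n].upper(), seq_b[:n].upper())


# Made-up toy sequences (NOT the real transcripts).
print(hamming_first_n("ATGGCGTACGTT", "ATGGAGTTCGAA"))
```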