One of the most well-studied proteins in molecular biology is the green fluorescent protein, or GFP.
For this problem, you will visit popular online biological database websites and gather information on GFP.
Genbank:
Genbank is a database of nucleotide sequences. It can be accessed at the
NCBI website (National Center for Biotechnology
Information) at http://www.ncbi.nlm.nih.gov/.
In the search pull-down menu at the top, make sure "nucleotide" is selected. In the search text box at the top of the screen, type "GFP" and hit the Go button.
This search will bring up over 1000 results. To narrow the search, click on "Limits" just below the box where you typed "GFP".
Limit the search to "gene name" (in the dropdown box) and click the "Go" button again. You will now have approximately 50 results. Go to the end of the list (you will have to click "next" one time; the "next" link appears to the right).
The last two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653 (the last entry),
look over the Genbank record, and answer the following questions:
How long is the nucleotide sequence?
How many "guanines" appear in the gene's DNA sequence? Is there a bias towards any
particular nucleotide?
What is the Latin name of the organism whose DNA was sequenced for this GFP?
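The length and base-composition questions above can be checked with a few lines of code once the sequence has been copied out of the Genbank record. What follows is only a minimal sketch: the function name base_composition and the short example sequence are made up for illustration and are not taken from M62653. It assumes the sequence is pasted in the usual Genbank layout, with position numbers and spaces mixed in among the bases.

```python
from collections import Counter

def base_composition(seq):
    """Return (length, counts) for a DNA sequence, skipping digits and whitespace."""
    # Keep only nucleotide letters; position numbers and spaces are dropped.
    bases = [c for c in seq.lower() if c in "acgtn"]
    return len(bases), Counter(bases)

# Hypothetical fragment for illustration only, NOT the M62653 sequence.
example = """
        1 acgtacgtaa ggccttaacg
       21 ttgacc
"""
length, counts = base_composition(example)
print(length, counts["g"])
for base in "acgt":
    # Fraction of each base; a strong skew would suggest a compositional bias.
    print(base, round(counts[base] / length, 3))
```

Comparing the four fractions against 0.25 is one simple way to judge whether any nucleotide is over-represented.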
Swissprot:
Swissprot is a database of amino acid sequences that can
be accessed at http://us.expasy.org/sprot/. At the Swissprot homepage, type
GFP and click the Search button. The last link in the Swissprot section (not the TrEMBL section)
should be GFP_AEQVI (P42212).
Examine the web page for this protein, and answer the following:
How many references are cited?
This Swissprot record has links to other databases. Pfam (Protein Families) is a database of multiple alignments. Pfam accession numbers begin with the letters PF, followed by five digits (e.g. PF12345).
What is the Pfam accession number for GFP_AEQVI? (NOTE: An accession number is simply a tag that you can use to refer to a particular item in a database. Many of the databases you will use will have accession numbers. There is no standard formatting for accession numbers across databases.)
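The Pfam format described above (the letters PF followed by five digits) is simple enough to check with a regular expression. The sketch below is illustrative only: the function name is_pfam_accession is hypothetical, and PF12345 is the placeholder from the text, not the accession for GFP_AEQVI.

```python
import re

# Pattern taken from the description above: "PF" followed by exactly five digits.
PFAM_ACCESSION = re.compile(r"PF[0-9]{5}")

def is_pfam_accession(text):
    """Return True if text has the shape of a Pfam accession such as PF12345."""
    return PFAM_ACCESSION.fullmatch(text) is not None

print(is_pfam_accession("PF12345"))  # placeholder from the text
print(is_pfam_accession("P42212"))   # a Swissprot accession, which has a different format
```

Because formats differ between databases, a check like this only tells you which database a tag could belong to, not whether the entry exists.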
The Swissprot database is available via ftp. To see the data in its textual format (i.e. what you get when you ftp), scroll down to the bottom of the GFP_AEQVI web page, and click the link that says "View entry in raw text format (no links)." Answer the following questions:
The first two letters on each line identify what kind of line it is (e.g. ID = Identifier, DT = date, etc.)
Find the line that has the Latin name for the species. What two letters appear at the beginning of the line?
What two symbols, which appear on a line by themselves at the bottom of the file, indicate the end of the record for GFP_AEQVI? (In the ftp file that you can download, these symbols are the "record separators".)
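Since every line of the raw text entry starts with a two-character code, a short script can summarize which codes occur in a record and how often. The following is a sketch under stated assumptions: the function name line_code_counts and the toy entry are invented for illustration, and the toy entry uses only the two codes named in the text (ID and DT), not the real GFP_AEQVI record.

```python
from collections import Counter

def line_code_counts(raw_entry):
    """Count how many lines start with each two-character line code."""
    counts = Counter()
    for line in raw_entry.splitlines():
        if line.strip():            # skip blank lines
            counts[line[:2]] += 1   # the first two characters identify the line type
    return counts

# Hypothetical toy entry, NOT the real GFP_AEQVI record.
toy_entry = """ID   TOY_ENTRY
DT   01-JAN-2000
DT   02-JAN-2000
"""
counts = line_code_counts(toy_entry)
for code, n in sorted(counts.items()):
    print(repr(code), n)
```

Running the same function on the text saved from the raw-format page lists every line code in the record, which makes the species line and the end-of-record marker easy to spot.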
Protein Data Bank:
The PDB (Protein Data Bank) is a database of protein
structures at http://www.rcsb.org/pdb.
From that page, click the "SearchLite" link. On the resulting page, type "GFP" into the text box and
click the "Search" button. Look at the first result (1EMB) and click the
"Explore" link to the right. Then click on the "Download/Display File"
link (on the left). Then, click on the link to display the structure file
in PDB file format complete with coordinates as HTML.
In this file the majority of lines are "ATOM" lines. Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1908). Answer the following questions:
What kind of atom is #16 (3rd column)?
What kind of amino acid is atom #16 in (4th column)?
What are the (x,y,z) coordinates of atom #16?
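The ATOM questions above can also be answered programmatically. PDB files use a fixed-column layout, so the sketch below slices each field by character position rather than splitting on whitespace, which is safer when adjacent fields run together. The function name parse_atom_line is hypothetical, and the sample line is constructed with made-up values; it is not an actual line from 1EMB.

```python
def parse_atom_line(line):
    """Extract a few fields from one ATOM line using fixed PDB column positions."""
    return {
        "serial": int(line[6:11]),        # atom number, columns 7-11
        "name": line[12:16].strip(),      # atom name, columns 13-16
        "residue": line[17:20].strip(),   # residue name, columns 18-20
        "xyz": (float(line[30:38]),       # x, columns 31-38
                float(line[38:46]),       # y, columns 39-46
                float(line[46:54])),      # z, columns 47-54
    }

# Hypothetical ATOM line built to the column layout; values are NOT from 1EMB.
sample = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:8.3f}{:8.3f}{:8.3f}".format(
    "ATOM", 1, " N", "", "ALA", "A", 1, "", 1.0, 2.0, 3.0)
atom = parse_atom_line(sample)
print(atom["serial"], atom["name"], atom["residue"], atom["xyz"])
```

To look up a particular atom in a downloaded file, one could apply this function to every line that begins with ATOM and pick the entry whose serial matches.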
ENSEMBL:
ENSEMBL is a web-based genomic resource available at http://www.ensembl.org/. Start at the ENSEMBL
home page. How many species are available on this website?
Search for anything with "zinc finger" (a structural motif in proteins). Find the first mouse GENE, and browse to that page. Please note that mouse is Mus musculus, and that the results are grouped in several ways. You are looking for the GENE index. Follow the first link.
Record the following:
The ensembl mouse gene id for this first link.
The genomic location for this mouse gene.
The cDNA transcript for this mouse gene, found by following the link to view the gene in its genomic location and looking over the basepair view. Move the pointer slowly.
The ensembl human gene id for the homologous protein (go back to the mouse gene specific page and look down to homology).
The genomic location for this human gene.
How many cDNA transcripts are given? Record for ENSESTT00000205818.
What is the Hamming distance between the first ten nucleotides?
Human and mouse genomes can be partitioned into a large number of synteny blocks, with each human synteny block corresponding to a mouse synteny block. What mouse synteny block is the mouse gene located on? What human synteny block does the human homolog belong to? What is the correspondence between these two synteny blocks?
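For the Hamming distance question above, the distance between two equal-length strings is the number of positions at which they differ. The sketch below assumes the comparison is between the first ten nucleotides of the mouse and human transcripts; the two ten-letter strings are hypothetical and are not taken from any ENSEMBL record.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical ten-nucleotide prefixes, for illustration only.
mouse_prefix = "ATGGCCTTAG"
human_prefix = "ATGGCGTTAA"
distance = hamming_distance(mouse_prefix, human_prefix)
print(distance)
```

With full transcripts in hand, slicing each one with [:10] before calling the function restricts the comparison to the first ten nucleotides.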