Practical Problem Set #2: Motif Finding

Motifs and Profiles

Answer briefly using your own words. No group work allowed.

  1. Give a precise definition of "motif".
  2. Table 1 (below) contains a set of 10 patterns representing a motif of length l=8.
    1. Construct an alignment of all instances of the motif shown.
    2. Do you think multiple alignment tools like Clustal would be able to find this motif?
    3. Construct a profile.
    4. Construct a consensus sequence.
    5. For the alignment of patterns from item 1, compute the consensus score of your profile from item 3.
    6. For the alignment of patterns from item 1, compute the entropy score of your profile.
    7. Compute the total distance from your consensus string (item 4) to the patterns in Table 1.
    8. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to find the consensus sequence with up to 1 mismatch in a genome of length 10^6?
    9. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to find the consensus sequence with up to k mismatches in a genome of length 10^6?
    10. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to see your consensus sequence in a text of length 1,000,000?
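The quantities asked for above can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: the patterns are hypothetical toy data rather than the patterns of Table 1, all function names are invented here, and it assumes the common definitions (consensus score as the sum of per-column maximum counts, entropy score as the sum of per-column Shannon entropies, total distance as the sum of Hamming distances). The expected-count function assumes a single strand and independent, uniformly distributed nucleotides.

```python
import math
from collections import Counter
from math import comb

# Hypothetical toy motif instances (NOT the patterns from Table 1).
patterns = ["ATGCAACT", "ATGCAACT", "ATGCTACT", "ACGCAACT", "ATGCAAGT"]

def profile(seqs):
    """Count matrix: one Counter of nucleotide counts per alignment column."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(prof):
    """Most frequent nucleotide per column (ties broken alphabetically)."""
    return "".join(min(col, key=lambda n: (-col[n], n)) for col in prof)

def consensus_score(prof):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(col.values()) for col in prof)

def entropy_score(prof):
    """Sum over columns of the Shannon entropy (in bits) of the column."""
    total = 0.0
    for col in prof:
        n = sum(col.values())
        total -= sum((c / n) * math.log2(c / n) for c in col.values())
    return total

def total_distance(word, seqs):
    """Sum of Hamming distances from word to every pattern."""
    return sum(sum(a != b for a, b in zip(word, s)) for s in seqs)

def expected_hits(l, k, n):
    """Expected occurrences of a fixed l-mer with up to k mismatches in a
    uniform random text of length n (one strand, n - l + 1 start positions)."""
    p = sum(comb(l, i) * 3**i for i in range(k + 1)) / 4**l
    return (n - l + 1) * p

prof = profile(patterns)
print(consensus(prof), consensus_score(prof), total_distance(consensus(prof), patterns))
print(round(entropy_score(prof), 3), round(expected_hits(8, 1, 10**6), 1))
```

For the toy patterns the consensus is ATGCAACT with consensus score 37 and total distance 3; the two values sum to 40, the number of cells in the 5 x 8 alignment, which is a useful sanity check on your own answers.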











Motif Finding Tools

A biologist at your university has found 15 target genes that she thinks are co-regulated. She gives you 15 upstream regions of length 50 base pairs in FASTA format, file DNASample50.txt, and asks you to identify the motif, and if possible the potential regulating protein. She tells you the sequences are from Homo sapiens, and by intuition feels the motif is of length 8. She wants you to suggest only the best possible candidate motif.

Part A: Instructions

Attach ALL output files with results. Record all your parameters and collect all output files. For each program, make a decision regarding the one motif that you think is best.

  1. Run Consensus (use the advanced version).
  2. Run MITRA.
  3. Run Gibbs Sampler with all options available. Do not invoke the recursive sampler or provide a background model.
  4. Run MEME.

Consider all motifs generated, select the best motif and perform the following:

  1. Using an alignment of the binding sites identified by the motif finders, generate a representative sequence logo.
  2. Determine a potential DNA binding protein using TRANSFAC, a database of eukaryotic DNA binding proteins. Identify the potential regulating protein by either generating a small number of plausible patterns from your sequence logo or using the binding sites. Do not search using the full length sequences.
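To see what a sequence logo encodes before generating one with a tool, the letter heights can be computed by hand. The sketch below assumes the basic definition from Schneider and Stephens (1990), where a column's information content is 2 bits minus its Shannon entropy and each letter's height is its frequency times that value; the small-sample correction is deliberately omitted. The binding sites are hypothetical placeholders, not output from any of the motif finders, and `logo_heights` is a name invented for this example.

```python
import math
from collections import Counter

# Hypothetical aligned binding sites (placeholders, not real tool output).
sites = ["ATGCAACT", "ATGCAACT", "ATGCTACT", "ACGCAACT", "ATGCAAGT"]

def logo_heights(aligned):
    """Per-position letter heights in bits: frequency * (2 - column entropy).
    The small-sample correction term is omitted in this sketch."""
    heights = []
    for col in zip(*aligned):
        n = len(col)
        freqs = {base: count / n for base, count in Counter(col).items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = 2.0 - entropy  # information content of the column in bits
        heights.append({base: f * info for base, f in freqs.items()})
    return heights

for pos, col in enumerate(logo_heights(sites), start=1):
    print(pos, {base: round(h, 2) for base, h in sorted(col.items())})
```

A fully conserved column reaches the maximum of 2 bits, while a column with uniform frequencies contributes nothing, which is why weakly conserved positions nearly vanish in a logo.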

After you have run all the programs, your biologist friend confesses that she is not sure whether her intuition about the motif length was correct. Re-run all the tools above without specifying the motif length. Do you get the same results?

Part B: Questions

  1. Did all tools generate the same motif?
  2. Which tool would you run if the length of the motif is unknown?
  3. If you increased the queue size for Consensus, what would you expect to happen?
  4. What would you expect to happen if you increased the number of iterations for the Gibbs Sampler?
  5. TRANSFAC contains several tools that can search a set of sequences using known profiles. Imagine you are studying a set of sequences that are regulated by an unknown protein that is very similar to a protein in TRANSFAC. What might happen in this case if you were only to search using TRANSFAC profiles?
  6. In this case you were given a very narrow upstream region to search. Often, you are instead asked to search upstream regions many base pairs in length. Using only Consensus, search the files DNASample300.txt, DNASample1000.txt, and DNASample3000.txt with sequences of length 300, 1000, and 3000 base pairs respectively. At what point is Consensus no longer able to identify your regulatory motif?
  7. Perform the same experiment with MITRA. Where does MITRA break?
  8. Did you search the reverse strand for motif occurrences? Why?
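For the question about the reverse strand, it can help to check by hand whether a candidate motif occurs on either strand of an upstream region. The following sketch is an illustration only: the function names and the example sequence are invented here, and the scan is a plain Hamming-distance window search rather than the procedure used by any of the tools above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_both_strands(seq, motif, max_mismatches=0):
    """Return (position, strand) pairs for every window of seq that matches
    the motif ("+") or its reverse complement ("-") within max_mismatches.
    Positions are 0-based offsets on the forward strand."""
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if hamming(window, motif) <= max_mismatches:
            hits.append((i, "+"))
        if hamming(window, rc) <= max_mismatches:
            hits.append((i, "-"))
    return hits

# Hypothetical upstream fragment containing one site on each strand.
print(find_both_strands("CCATGCAACTGGAGTTGCATCC", "ATGCAACT"))
```

Note that scanning both strands doubles the number of candidate windows, so the expected number of chance matches doubles as well.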

Part C: Experimental Verification of Found Motifs

Describe a biological experiment to validate your hypothesis. How many hours would you estimate the experiment requires?

Recent Developments in Motif Finding

Part A: Chromatin Immunoprecipitation

A popular experimental technique to confirm motif binding and determine protein-DNA interaction is chromatin immunoprecipitation (ChIP). A high-throughput variant of ChIP, ChIP on chip, was developed by Iyer et al (2001) and Ren et al (2000), and is reviewed by Nal et al (2001).

  1. Briefly describe the methodology for ChIP.
  2. Describe, in general, the modifications of ChIP on chip to the standard ChIP protocol.
  3. How could you modify ChIP on chip to detect binding of protein complexes, as opposed to a single protein?
  4. Reformulate the Motif Finding Problem for ChIP on chip experiments. What are the differences?

Part B: The assumption of independence

A simplifying assumption for motif finders is that nucleotide positions are independent. Several groups have developed approaches that do not require this independence. Refer to Barash et al (2003) and Keich and Pevzner (2002) for computational approaches to handle dependencies.

  1. Consider the consensus and profile representations for a motif. How could you modify them to account for dependence?
  2. Give a rigorous formulation of the Motif Finding Problem that waives the independence assumption. What is the objective function you want to optimize?
  3. Imagine you are given the crystal structure of a DNA binding protein bound to its binding site. Would you expect the structure to provide information regarding dependencies?
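One simple way to probe whether two motif positions are dependent is to measure the mutual information between the corresponding alignment columns. The sketch below is only an illustrative diagnostic under that assumption; it is not the model of Barash et al (2003) or of any other cited paper, the function name is invented here, and the two-column site lists are toy data.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (bits) between alignment columns i and j (0-based).
    Zero means the observed columns are statistically independent."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)
    count_j = Counter(s[j] for s in sites)
    count_ij = Counter((s[i], s[j]) for s in sites)
    return sum(
        (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
        for (a, b), c in count_ij.items()
    )

# Toy data: in the first set the second base is determined by the first.
dependent = ["AT", "AT", "CG", "CG"]
independent = ["AT", "AG", "CT", "CG"]
print(mutual_information(dependent, 0, 1))    # 1.0
print(mutual_information(independent, 0, 1))  # 0.0
```

With only a handful of binding sites such estimates are noisy, so a nonzero value on real data should be read as a hint of dependence rather than as evidence for it.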

Suggested Reading

Bailey T. and Elkan C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36.

Barash Y., Elidan G., Friedman N., and Kaplan T. (2003) Modeling Dependencies in Protein-DNA Binding Sites. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, 28-37.

Eskin E. and Pevzner P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18, S354-63.

Hertz G.Z. and Stormo G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563-77.

Iyer V.R., Horak C.E., Scafe C.S., Botstein D., Snyder M., and Brown P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533-8.

Keich U. and Pevzner P.A. (2002) Finding motifs in the twilight zone. Bioinformatics, 18, 1374-81.

Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., and Wootton J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-14. (available via JSTOR)

Mandel-Gutfreund Y., Baron A., and Margalit H. (2001) A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac Symp Biocomput., 139-50.

Nal B., Mohr E., and Ferrier P. (2001) Location analysis of DNA-bound proteins at the whole-genome level: untangling transcriptional regulatory networks. Bioessays, 23, 473-6.

Orlando V. (2000) Mapping chromosomal proteins in vivo by formaldehyde-crosslinked-chromatin immunoprecipitation. Trends Biochem Sci., 25, 99-104.

Ren B., Robert F., Wyrick J.J., Aparicio O., Jennings E.G., Simon I., Zeitlinger J., Schreiber J., Hannett N., Kanin E., Volkert T.L., Wilson C.J., Bell S.P., and Young R.A. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306-9.

Schneider T.D. and Stephens R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18, 6097-100.

Matys V., Fricke E., Geffers R., Gossling E., Haubrock M., Hehl R., Hornischer K., Karas D., Kel A.E., Kel-Margoulis O.V., Kloos D.U., Land S., Lewicki-Potapov B., Michael H., Munch R., Reuter I., Roter S., Saxel H., Scheer M., Thiele S., and Wingender E. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31, 374-8.

Stormo G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 15-23.