Practical Problem Set #6: DNA Arrays

  1. There are several popular DNA array platforms in use. A major difference between the platforms is how the genes are probed. In the case of cDNA arrays, full-length cDNA is spotted on the DNA array for each gene to be queried. Oligonucleotide arrays are composed of k-mer nucleotide probes for each gene (k is typically 25-40). An advantage of oligonucleotide probes is their enhanced sensitivity, or the ability to detect weakly expressed transcripts. However, the length of the gene probe directly corresponds to how specific the hybridization of the probe is. A variation of oligonucleotide arrays is the perfect match/mismatch array, introduced by Affymetrix. A set of paired oligonucleotide probes, typically 25-mers, is designed for each gene. Each pair contains a perfect match probe, which carries the canonical sequence of the gene, and a mismatch probe, which carries the same sequence with a deliberate mutation at the 13th (middle) position of the probe. The mismatch probe measures the degree of cross hybridization, that is, how much of the detected signal is due to nonspecific binding (noise).

    Compare Affymetrix arrays with cDNA arrays with regard to the given criteria:

  2. A key issue when comparing DNA arrays is normalization, or the process by which expression levels are made comparable. A common approach to normalization is global normalization. In this approach, the averages of the expression distributions (expression levels for all genes within a DNA array) are set to be equal across DNA arrays. This follows from the assumption that while individual genes can be differentially expressed, the total amount of transcription is roughly the same across samples.
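As a rough illustration of the global normalization described above, the following minimal sketch rescales each array so that all arrays share the same mean intensity. The array names, the intensity values, and the choice of the common target (the mean of the per-array means) are assumptions made for this example, not part of any particular platform's procedure.

```python
from statistics import mean


def global_normalize(arrays):
    """Rescale each array so that every array has the same mean intensity.

    `arrays` maps a hypothetical array name to a list of raw intensities.
    The common target is the mean of the per-array means (an assumption;
    any shared constant would serve the same purpose).
    """
    array_means = {name: mean(values) for name, values in arrays.items()}
    target = mean(list(array_means.values()))
    return {
        name: [v * target / array_means[name] for v in values]
        for name, values in arrays.items()
    }


# Made-up intensities for two arrays measuring the same three genes.
raw = {
    "array_A": [100.0, 200.0, 300.0],
    "array_B": [200.0, 400.0, 600.0],
}
normalized = global_normalize(raw)
for name, values in normalized.items():
    print(name, values, "mean =", mean(values))
```

After rescaling, both arrays have the same average, so any remaining per-gene difference is interpreted as differential expression. That interpretation is only as good as the assumption that total transcription is comparable between the samples.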
  3. You are comparing two separate single-sample hybridization DNA array experiments, one from liver tissue and one from whole blood. Assume the major cell type in liver is the hepatocyte and the major cell type in whole blood is the red blood cell. Would you expect global normalization to perform adequately in this case? Why or why not? Hint: Consider the organelles in both cell types.
  4. Affymetrix implements a secondary normalization by utilizing the perfect match/mismatch model. In this model, the mismatch probe contains a deliberate mismatch in the 13th position to measure the degree of cross hybridization, and its signal is "subtracted" from that of the perfect match probe. However, Naef et al. (2003) have shown that for genes expressed at low levels, perfect match alone is a good indicator. For saturated genes, mismatch alone is still a valid indicator. Given their findings, what would you suggest as an alternative to Affymetrix's normalization procedure?
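To make the "subtraction" in the perfect match/mismatch model concrete, here is a simplified sketch that averages the PM minus MM differences over the probe pairs of one gene. It is an illustration of the idea only, not Affymetrix's actual algorithm; the function name and the probe intensities are hypothetical.

```python
from statistics import mean


def average_difference(pm_signals, mm_signals):
    """Average of (perfect match - mismatch) over the probe pairs of one gene.

    Simplified illustration only: the mismatch signal is treated as an
    estimate of cross hybridization and removed from the perfect match signal.
    """
    if len(pm_signals) != len(mm_signals):
        raise ValueError("each perfect match probe needs a paired mismatch probe")
    return mean(pm - mm for pm, mm in zip(pm_signals, mm_signals))


# Hypothetical intensities for three probe pairs of a single gene.
pm = [1200.0, 950.0, 1100.0]
mm = [300.0, 250.0, 400.0]
signal = average_difference(pm, mm)
print(signal)
```

Note that when a mismatch probe reads higher than its perfect match partner, the difference becomes negative, which is one reason to reconsider plain subtraction.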
  5. A researcher gives you a list of common housekeeping genes to help you with normalization. Housekeeping genes are a set of genes that are ubiquitously expressed in a relatively stable manner. Instead of normalizing to the average of all genes on the array, this form of normalization uses the average of the housekeeping genes only.

    For your experiments, you are asked to compare the differences in gene intensities due to global versus housekeeping normalization. As a simple test, you notice that when you use global normalization, the housekeeping genes are significantly higher (but not saturated) in one array than in the other.
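Housekeeping normalization can be sketched in the same spirit as global normalization, with the scaling factor taken from the housekeeping genes alone. In the minimal example below, GAPDH and ACTB stand in for the researcher's list, `GENE_X` is a hypothetical gene of interest, and all intensities and the target value of 1.0 are made up for illustration.

```python
from statistics import mean


def housekeeping_normalize(intensities, housekeeping_genes, target=1.0):
    """Scale one array so that its housekeeping genes average to `target`.

    `intensities` maps gene name to raw intensity for a single array.
    `target` is an arbitrary shared constant (an assumption of this sketch).
    """
    hk_mean = mean(intensities[gene] for gene in housekeeping_genes)
    return {gene: value * target / hk_mean for gene, value in intensities.items()}


housekeeping = ["GAPDH", "ACTB"]

# Made-up raw intensities; array_2 is uniformly twice as bright as array_1.
array_1 = {"GAPDH": 400.0, "ACTB": 600.0, "GENE_X": 250.0}
array_2 = {"GAPDH": 800.0, "ACTB": 1200.0, "GENE_X": 500.0}

norm_1 = housekeeping_normalize(array_1, housekeeping)
norm_2 = housekeeping_normalize(array_2, housekeeping)
print(norm_1["GENE_X"], norm_2["GENE_X"])
```

In this toy case the two arrays differ only by an overall brightness factor, so both normalizations would agree. The two methods diverge when the housekeeping genes and the array-wide average shift by different amounts between arrays.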

  6. A biologist designed a series of two-sample hybridization DNA array experiments consistent with a reference design. See Figure 6.17(c) from Pevsner. Halfway through his experiments, he realizes he inadvertently hybridized twice as much sample as pooled reference.
  7. A researcher gives you a list of 1,000 significantly expressed genes from a DNA array with 12,000 genes. He used the t-test with a p-value cutoff of 0.05. He realizes that about 600 false positives are expected at this cutoff (12,000 × 0.05). How many of the genes identified as significantly expressed do you think have a biologically significant role? What if the p-value cutoff had been 0.000001 (0.012 expected false positives)? HINT: Think about why a biologist clusters genes rather than directly studying gene lists.
  8. When examining the clusters of genes that result from clustering gene expression values, researchers will usually use one of the following assumptions to help them analyze their data. State which of the following assumptions you agree with and why.
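The false positive counts quoted in this question follow from multiplying the number of genes tested by the p-value cutoff. The short sketch below reproduces that arithmetic; it assumes, as a worst case, that every gene on the array is truly unchanged, and the function name is chosen for this example only.

```python
def expected_false_positives(n_genes, p_cutoff):
    """Expected number of genes passing the cutoff by chance alone.

    Assumes every tested gene is truly unchanged, which is a worst-case
    simplification used only for this back-of-the-envelope estimate.
    """
    return n_genes * p_cutoff


n_genes = 12000      # genes on the array
list_size = 1000     # genes reported as significant

loose = expected_false_positives(n_genes, 0.05)
strict = expected_false_positives(n_genes, 0.000001)

print("expected false positives at 0.05:", round(loose, 3))
print("expected false positives at 0.000001:", round(strict, 3))
print("expected false share of the reported list at 0.05:", round(loose / list_size, 3))
```

Under this assumption, a large share of the 1,000 reported genes could be chance findings at the 0.05 cutoff, while the stricter cutoff makes chance findings very unlikely.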
    1. Genes in the cluster are co-regulated.
    2. Genes in the cluster are involved in the same biological function.
    3. Genes in the cluster are bound by the same transcription factor(s).
    4. Genes can be a member of only one cluster.
  9. A researcher is studying cancer of the thyroid. There are two hospitals that have run DNA arrays on the same chip (Affymetrix U133A) for several samples of normal thyroid tissue and cancerous thyroid tissue. The researcher ran SAM on the data from one of the hospitals comparing the normal and cancerous tissues and identified several differentially expressed genes. However, when they pooled both datasets together, they got the following SAM plot:

    They ask you why SAM failed to find any differentially expressed genes. What should you tell them?

  10. Suggested Reading


    Naef et al. (2003) A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics 19(2): 178-184.

    Pevsner. (2003) Bioinformatics and Functional Genomics. Wiley-Liss. 551-562.

    Tusher, Tibshirani and Chu. (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98: 5116-5121.