DNA melting: Identifying the unknown sample

20.309: Biological Instrumentation and Measurement

Overview

In the DNA lab, you had four samples. Each sample had a true melting temperature $ M_j $ (where $ j $ is an integer from 1 to 4). The instructors told you that the fourth sample was identical to one of the other three samples. Therefore, the unknown sample 4 should have exactly the same melting temperature as sample 1, 2, or 3. Your job was to figure out which one matched the unknown.

Procedure

Most groups measured each sample type in triplicate. (Some special students did something a little bit different.) This resulted in 12 observations, $ O_{j,i} $, where $ j $ is the sample type and $ i $ is the replicate number (an integer from 1 to 3). The majority of lab groups calculated the average melting temperature for each sample type, $ \bar{O}_j $, and guessed that sample 4 matched whichever of the known samples had the closest average melting temperature.
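
In code, the closest-average procedure might look like the short MATLAB sketch below. The matrix temps and every number in it are invented for illustration; rows are replicates $ i $ and columns are sample types $ j $.

 % Naive identification: average each sample type, then pick the known
 % sample whose average melting temperature is closest to sample 4's.
 temps = [65.2 71.8 78.1 64.9;
          64.7 72.3 77.6 65.5;
          65.9 71.1 78.4 64.6];   % rows: replicates; columns: samples 1-4
 meanTemps = mean(temps);         % one average per sample type
 [~, guess] = min(abs(meanTemps(1:3) - meanTemps(4)));
 fprintf('Sample 4 most closely matches sample %d.\n', guess);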

Seems reasonable.

Except ... The observations included measurement error: $ O_{j,i}=M_j+E_{j,i} $, where $ E_{j,i} $ represents an error term. The presence of measurement error leads to the possibility that an unfortunate confluence of error terms could have caused you to misidentify the unknown sample. It’s not hard to imagine what factors tend to increase the likelihood of such an unfortunate fluke: the true means are close together, or the error terms are large.

Uncertainty model

To get a handle on the possibility that your results were total crap due to bad luck alone (not incompetence), it is necessary to formulate a mathematical model for the distribution of the error terms. How about this? The error terms are normally distributed with mean $ \mu=0 $ and standard deviation $ \sigma $. (Note that the error distribution among all of the sample types is the same.) This is a simple and reasonable model. It goes without saying that the model is not perfectly correct.
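
Written out as a distributional assumption, the model says that for every sample type $ j $ and replicate $ i $,

$ E_{j,i}\sim\mathcal{N}(0,\sigma^2), $

with all of the error terms independent of one another.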

Within the confines of this model, it is possible to estimate the chance that your result was a fluke.

Which unknown?

There are 6 possible pairwise hypotheses to test:

  1. $ M_4\stackrel{?}{=}M_1 $
  2. $ M_4\stackrel{?}{=}M_2 $
  3. $ M_4\stackrel{?}{=}M_3 $
  4. $ M_1\stackrel{?}{=}M_2 $
  5. $ M_1\stackrel{?}{=}M_3 $
  6. $ M_2\stackrel{?}{=}M_3 $

These are called null hypotheses. Each null hypothesis asserts that two of the samples have the same true mean melting temperature; rejecting a null hypothesis means concluding that the two means differ.

Each null hypothesis can be either accepted or rejected, so there are $ 2^6=64 $ possible outcomes to testing all 6. In order to uniquely identify sample 4, exactly two of hypotheses 1-3 must be rejected. In other words, sample 4 must be similar to one of the other samples and dissimilar to the two others. It should also be the case that the putative match for sample 4 is dissimilar to the two other possible matches, so two more null hypotheses must be rejected for a unique identification. For example, if sample 4 is the same as sample 1, a unique identification also requires that sample 1 is distinct from sample 2 and that sample 1 is distinct from sample 3. Whether samples 2 and 3 are distinct from each other does not affect the identification, so the outcome of that one remaining null hypothesis can go either way.

In summary, exactly 1 of the null hypotheses must be accepted, 4 must be rejected, and 1 doesn't matter. 6 of the 64 possible outcomes are consistent with a unique identity for sample 4. It is not possible to draw a conclusion from the other 58 outcomes. The following table summarizes the possibilities (T = accepted, F = rejected, X = either):

H1  H2  H3  H4  H5  H6  Conclusion
T   F   F   F   F   X   U=1
F   T   F   F   X   F   U=2
F   F   T   X   F   F   U=3
all others              none
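
The decision rule in the table is mechanical enough to write as code. Below is a MATLAB sketch of our own devising (the function name identifyUnknown and the reject-matrix representation are assumptions, not part of any library). R is a symmetric 4×4 logical matrix in which R(a,b) is true when the null hypothesis $ M_a=M_b $ was rejected.

 function id = identifyUnknown(R)
 % identifyUnknown  Apply the table above to the outcomes of the 6 tests.
 %   Returns j in 1..3 if sample 4 is uniquely identified as sample j,
 %   or 0 if no conclusion can be drawn.
     id = 0;
     for j = 1:3
         others = setdiff(1:3, j);
         % Sample 4 must match sample j, differ from the other two
         % samples, and j must differ from the other two known samples.
         if ~R(4,j) && all(R(4,others)) && all(R(j,others))
             id = j;
         end
     end
 end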

You could argue that only hypotheses 1-3 are relevant. Maybe. But imagine what the defense counsel would say if it turned out that hypotheses 1 and 2 were rejected but 4-6 were not. You would testify that statistics showed the murderer must be suspect number 3. But the defense would argue, "How can you say it's number 3? You can't even tell suspect 3 from 1 or 2?" It would be unconvincing to present a result that implicated suspect 3 unless hypotheses 1, 2, 5, and 6 were also rejected. Rejecting hypothesis number 4 is extra credit.

Evaluating the hypotheses

Student’s t-test offers a method for assigning a numerical degree of confidence to each null hypothesis. Essentially, the test considers the entire universe of possible outcomes of your experimental study. Imagine that you repeated the study an infinite number of times. (This may not be hard for you to imagine.) Repeating the study ad infinitum would elicit all possible outcomes of $ E_{j,i} $. Assuming the null hypothesis is true, the t-test divides these outcomes into two realms: those that are more favorable to the null hypothesis than the result you got (the observed means lie closer together), and those that are less favorable (the observed means lie farther apart).

The t-test can be summarized by a p-value, which is equal to the fraction of possible outcomes that are less favorable to the null hypothesis than the result you got. A low p-value means that there are relatively few possible results less favorable to the null hypothesis than the one you got and many results that are more favorable, so it is probably reasonable to reject the null hypothesis. Rejecting the null hypothesis means that the means are likely not the same.
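
For a single pairwise comparison, the ttest2 function in MATLAB (Statistics and Machine Learning Toolbox) computes the p-value directly. A minimal sketch, with invented temperature data:

 % Two-sample t-test of the null hypothesis M_4 = M_1.
 sample1 = [65.2 64.7 65.9];          % made-up melting temperatures
 sample4 = [64.9 65.5 64.6];
 [h, p] = ttest2(sample1, sample4);   % h = 1 means reject at the default 5% level
 fprintf('p = %.3f, reject = %d\n', p, h);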

In most circumstances, the experimenter chooses a significance level such as 10%, 5%, or 1% in advance of examining the data. Another way to think of this: if you chose a significance level of 5% and repeated the study 100 times on samples whose true means were identical, you would expect to incorrectly reject the null hypothesis because of bad luck on about 5 occasions.
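
You can check this interpretation by simulation. In the sketch below (all parameter values are arbitrary choices), both samples are drawn from the same distribution, so the null hypothesis is true; the test should nonetheless reject it in about a fraction $ \alpha $ of the trials.

 % When the null hypothesis is true, a test at significance level alpha
 % rejects it in about a fraction alpha of repeated studies.
 nTrials = 10000; alpha = 0.05; sigma = 1;
 rejections = 0;
 for k = 1:nTrials
     x = 65 + sigma * randn(1, 3);   % both samples drawn from the
     y = 65 + sigma * randn(1, 3);   % same normal distribution
     rejections = rejections + ttest2(x, y, 'Alpha', alpha);
 end
 fprintf('False rejection rate: %.3f (expect about %.2f)\n', ...
     rejections / nTrials, alpha);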

Multiple comparisons

A problem comes up when you use multiple t-tests to compare means. For example, if you chose a significance level of 5% for each individual t-test, there would be roughly a 26% overall chance ($ 1-0.95^6\approx 0.26 $, assuming independent tests) that at least one of the 6 null hypotheses was rejected just due to chance. The multcompare function in MATLAB implements a correction to the t-test procedure to ensure that the family-wise error rate (FWER) is less than the significance level. In other words, the chance of any of the six hypotheses being incorrectly rejected due to bad luck is less than the FWER. The optional 'Alpha' argument to multcompare lets you set the FWER.
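
Here is a minimal sketch of the whole procedure, assuming the same invented temps matrix as above. In recent versions of MATLAB, the output of multcompare has six columns, with the corrected p-value in the last one.

 % One-way ANOVA followed by pairwise comparisons with the
 % family-wise error rate held at 5%.
 temps = [65.2 71.8 78.1 64.9;
          64.7 72.3 77.6 65.5;
          65.9 71.1 78.4 64.6];        % invented data, as before
 groups = repelem(1:4, 3);             % sample type of each observation
 [p, ~, stats] = anova1(temps(:), groups(:), 'off');
 c = multcompare(stats, 'Alpha', 0.05, 'Display', 'off');
 % Each row of c compares one pair of sample types:
 % [group1 group2 lowerCI estimate upperCI p-value]. A confidence
 % interval that excludes zero rejects that pair's null hypothesis.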

If you used multcompare in your analysis, a good measure of confidence is the FWER that you chose. Since not all six hypotheses are required to uniquely identify the unknown, the FWER is slightly conservative. So what. You required more things to be true than you strictly needed to, but you would likely have gained very little by removing the unnecessary hypothesis from consideration. It is even more likely that the error terms did not perfectly satisfy the assumptions of the test, so your calculation is at best an approximation of the probability of this type of error.

Chance of making a mistake — choosing the significance level

The significance level sets a bound on the chance of accidentally rejecting a null hypothesis that ought to have been accepted (a type I error). Alert readers will have mused, "wait a minute ... the problem can also occur the other way around: a null hypothesis might be erroneously accepted." (That is a type II error.)

The plot below shows a simulation of using the multiple comparison test to identify an unknown sample over a range of measurement error magnitudes and significance levels. There are 4 simulated sample types with true means 1, 3, 5, and 1. (The unknown is sample 1.) The upper plots show datasets for one hundred simulated studies where each sample type was measured 3 times in the presence of various values of noise, $ \sigma $. The lower plots show significance level on the horizontal axis. For each significance level, three quantities are plotted: fraction of correct identifications (blue), incorrect identifications (red), and outcomes for which no conclusion can be drawn (gold). Making the significance threshold smaller reduces the number of erroneous identifications. But there is also a cost: a lower significance threshold increases the number of cases where no conclusion can be reached.

Unsurprisingly, less noise in the data produces better results.

[Figure: DnaMultipleComparisonSimulation.png. Top: simulated datasets for several noise levels $ \sigma $. Bottom: fraction of correct (blue), incorrect (red), and inconclusive (gold) identifications versus significance level.]
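
A sketch of a simulation along these lines, reusing the identifyUnknown function from earlier (this is our own reconstruction, not the instructors' code; all parameter values are arbitrary, and reading the p-value from the last column of the multcompare output assumes a recent MATLAB version):

 % Simulate many studies and tally correct, incorrect, and inconclusive
 % identifications. True means are 1, 3, 5, and 1: the right answer is 1.
 trueMeans = [1 3 5 1];
 sigma = 0.5; alpha = 0.05;            % vary these to reproduce the panels
 nStudies = 100; nReps = 3;
 tally = zeros(1, 3);                  % [correct incorrect none]
 for s = 1:nStudies
     temps = repmat(trueMeans, nReps, 1) + sigma * randn(nReps, 4);
     groups = repelem(1:4, nReps);
     [~, ~, stats] = anova1(temps(:), groups(:), 'off');
     c = multcompare(stats, 'Alpha', alpha, 'Display', 'off');
     R = false(4);                     % R(a,b): hypothesis M_a = M_b rejected
     for k = 1:size(c, 1)
         R(c(k,1), c(k,2)) = c(k,6) < alpha;
         R(c(k,2), c(k,1)) = R(c(k,1), c(k,2));
     end
     id = identifyUnknown(R);
     if id == 1
         tally(1) = tally(1) + 1;      % correct identification
     elseif id > 0
         tally(2) = tally(2) + 1;      % incorrect identification
     else
         tally(3) = tally(3) + 1;      % no conclusion
     end
 end
 fprintf('correct %d, incorrect %d, none %d\n', tally);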

The significance level can be chosen as appropriate for the circumstance. Falsely implicating a murder suspect has a very high cost, so it is probably wise to choose a low significance threshold and bear the increased risk that you might not be able to draw a conclusion at all.

What significance level to report

In the best of all cases, you would choose the significance level before you started the experiment and report that — something like p<0.01.

You can probably think of ways the simple error model this analysis relies on might be deficient. For example, there is no attempt to include any kind of systematic error. If there were significant systematic error sources in your experiment, your estimate of the likelihood of an unlucky accident may be far from the truth. Because most real experiments do not perfectly satisfy the assumptions of the test, it is usually ridiculous to report an extremely small p-value. (That doesn't stop people from doing it, though.)

Meaning of a t-test p-value

A few people argued that the p-value of a single hypothesis test is a good measure of confidence. In the example above, where the unknown was sample 3, the confidence measure would be the p-value associated with hypothesis 3. Consider the two situations summarized below. In both cases, the significance level was chosen to be 5%.

  1. hypothesis 2 has a p-value of 4.9% (so it is rejected) and hypothesis 3 has a p-value of 5.1% (not rejected)
  2. hypothesis 2 has a p-value of 4.9% (rejected) and hypothesis 3 has a p-value of 90% (not rejected)

The p-value characterizes how likely it is that a particular result would have arisen by accident if the null hypothesis were true. It does not compare the relative likelihood of one hypothesis to another. Clearly, there is a much smaller chance that the sample was erroneously identified in the second case above, even though the p-values for hypothesis 2 are identical. (Possibly the observed means of samples 4 and 3 lie much closer together in the second case.)