DNA Melting: Processing DNA Melting Data

From Course Wiki
Jump to: navigation, search
20.309: Biological Instrumentation and Measurement

ImageBar 774.jpg


Fantastic. You have some simulated data or real data. It's time to estimate $ \scriptstyle \Delta S^{\circ} $, and $ \scriptstyle \Delta H^{\circ} $. Nonlinear regression is one approach to extracting these values. There are several approaches to estimating the melting temperature using both the raw data and the extracted model parameters.

Most students use Matlab to analyze their raw data. Several scholars in the past elected to use Python, instead. If you choose this path, be aware that the wiki pages for Python data analysis are significantly less developed than the corresponding Matlab cribs. Nonlinear fitting in Python has been challenging. If you decide to use Python, consult and add your thoughts to this page: DNA melting data analysis in Python.

In outline, the data analysis steps are:

  • Write a function that models DNA melting
  • Load data file produced by LabVIEW VI or simulation
  • Provide initial guesses for model parameters and run the regression
  • Normalize results and convert to correct units
  • Estimate the melting temperature
  • Create plots

The model function is essentially the inverse of simulating DNA melting. The arguments to the function should include a vector of all the model parameters and explanatory variables. Make the signature of the function should match the requirements of the curve fitter you plan to use, for example : lsqcurvefit in Matlab.

Organization of data analysis script

Since you will be analyzing several experimental runs in a similar way, it makes sense to save the particulars of each experimental run (such as the name of the file that the data is stored in and the KCl concentration) in an array of data structures. Matlab implements a data type called a cellular array. Cellular arrays use pointy braces {} instead of parenthesis (). If foo is a cellular array, then foo{1} is the first element.

The following code demonstrates how to create the cellular array of data structures. The code fills in the first element of a cellular array (DnaMatchSampleInfo):

DnaMatchSampleInfo = {}  % null cellular array

DnaMatchSampleInfo{1}.filename = 'DNA Melting Data\20bp 100mM.txt';
DnaMatchSampleInfo{1}.SampleName = '20 bp 100mM';
DnaMatchSampleInfo{1}.KclConcentration = 100;
DnaMatchSampleInfo{1}.DnaConcentration = 30;
DnaMatchSampleInfo{1}.FilterNormalizedCutoffFrequency = 0.1;
DnaMatchSampleInfo{1}.FilterOrder = 25;
DnaMatchSampleInfo{1}.NormalizationMin = 0.01;
DnaMatchSampleInfo{1}.NormalizationMax = 0.99;
DnaMatchSampleInfo{1}.TrimStart = 0;
DnaMatchSampleInfo{1}.TrimEnd = 0;

... fill in more entries here ...

After initializing DnaMatchSampleInfo, each element will contain a data structure with the parameters of a particular experimental run. You can create additional cellular arrays for different sets of experiments. For example, DnaLengthSampleInfo might contain parameters for the 20, 30, and 40 base pair samples.

Using the information in the cellular arrays, it is easy to write a for loop that will process all of the data for a particular set of experiments. The function (called DnaMelt might look something like this:

% Function to process DNA melting data from LabVIEW VI
% sampleInfo is a cellular array of data structures
% output is a cellular array containing original data and computed values
function out = DnaMelt(sampleInfo)

% initialize variables
out = {};

for ii=1:length(sampleInfo)

    ... process the data ...


... process/plot specific results about whole data set...

Loading a data file

Use the load command to read in a data file.

    rawData = load(sampleInfo{ii}.filename);

The first column contains RTD voltage samples and the second contains the corresponding fluorescence voltages.

Converting voltages to temperature and relative fluorescence

These are straightforward mathematical manipluations on the dataset using the properties of the RTD and your normalized/corrected/adjusted fluorescence signal with the equation for dsDNA fraction.

Designing an FIR low pass filter

The matlab commands fir1 and freqz are useful for designing a low pass filter. Use fir1 to generate the filter kernel and freqz to plot its frequency response. Increasing the order of the filter will make the transition between pass and stop bands sharper.

Try plotting a few different filter designs:

freqz(fir1(30, 0.15))
freqz(fir1(300, 0.15))

Applying the FIR filter

Use the conv command to apply the filter. But be careful how you treat the samples near the beginning and end of the signal. conv returns a vector of length m + n - 1, where m and n are the lenghts of the two operands. conv will assume that all values outside of the defined signal are zero. This will distort your signal near the beginning and end. You can handle this by pre-padding your data with made-up data (usually the initial and final values) or just chopping off the extra samples.

The following functions may be of use:

% usage: ConvolveAndClip(kernel, data)
% convolves <kernel> with <data> and trims the ends of the result to length
% length(data) - length(kernel)
function out=ConvolveAndClip(kernel, data)

temp = conv(kernel, data);
trim = length(kernel);
out = temp(trim:end-trim);
% usage: PadAndConvolve(kernel, data)
% Pads <data> with initial and final values, convolves <data> with 
% <kernel>, and trims the result to the same length as <data>
function out=PadAndConvolve(kernel, data)

frontPaddingLength = floor(length(kernel)/2);
rearPaddingLength = ceil(length(kernel)/2);
frontPadding = data(1) * ones(frontPaddingLength, 1);
rearPadding = data(end) * ones(rearPaddingLength, 1);

s = size(data);

if(s(1) > s(2))
    paddedData = [frontPadding;data;rearPadding];
    paddedData = [frontPadding' data rearPadding'];

out = ConvolveAndClip(kernel, paddedData);

Ensuring that F(Temperature) is a function

Because there is noise in the temperature readout, your raw data is not guaranteed to be a function. You will run into all sorts of trouble taking the finite difference later on if your data is not a function. This can be solved by either sorting the data by temperature (using Matlab's sort command) or binning the samples by temperature (by iterating through the data with a for loop.)

Each approach has merits and disadvantages. Sorting by temperature can result in some very small ΔT values, which tend to be very noisy. Binning has the advantage of resulting in a uniformly sampled dataset — provided that there is at least one sample in each bin.

Fitting to Model

This topic will be discussed at length in an evening tutorial, and in person in the lab. We start with the basic technique outline in the simulations, and add additional parameters to account for fluorophore quenching with temperature, fluorophore bleaching, and temperature lab between the sample and the heating block. This year we are expanding on a multi-parameter non-linear regression technique that we first developed with the previous class.