Nuts and Bolts: Finding minima around a peak in R

Minima.merSo finding minima is apparently a really complex problem. But all I need is a consistent, non-subjective way to determine cut-offs around a peak. This is what I came up with, perhaps useful to you?

With the help of a genomics savy co-author, I’m doing a bit of de novo genome assembly from Illumina whole genome shotgun sequence. But this method could work with any data set up in a similar way. The data I was working with was a k-mer count table from jellyfish. K-mers are strings of length k in your DNA sequence data. Counting the number of occurrences of every k-mer can be important for assembling a genome… for reasons that I am still trying to wrap my brain around. Assembly is complicated! But I digress.

Anywho, say you have a data table, such as that produced by jellyfish, that looks something like this:

> dat <- read.table("data.from.jellyfish.hist", head=T)
> head(dat)
  frequency multiplicity
1    513363            1
2      2024           10
3        25          100
4        21          101
5        18          102
6        22          103

And when you plot it out, it looks something like this:

> plot(dat$multiplicity, dat$frequency, type=”hist”)

Minima.mer

I want to know at what x (aka ‘muliplicity’) I should subset this data, to get the bulk of the peak, but lose the wobbly tail. So I wrote a script, using the zoo and ggplot2 packages, based on this post to identify some minima. After a line of code (full commented code at this gist) , out pops some coordinates for a start minima and a stop minima, and a pretty little graph.

> Minima.mer(dat, extra=TRUE)
   frequency multiplicity
36      1770           13
[1] "Start minima"
           bin freq
10 (12.5,13.4] 1770
[1] "Stop minima"
           bin freq
57 (59.1,60.1] 1125

Minima.mer2

But it needs some work. You still have to look at the plot to make sure the two minima reported fall on either side of the peak you are interested in. You can adjust by changing the size of the ‘window’ over which the minima is determined, and/or the number of bins the data is broken into, by changing ‘breaks’. The cut-offs allow you to subset the data around a peak of interest, if you have multiple peaks in your data. ‘Extra’ plots a graph, for a visual check, as well the coordinates for both the starting and stopping minima.

Advertisements

2 thoughts on “Nuts and Bolts: Finding minima around a peak in R

    • Oh yeah? Not sure what the output of the latest jellyfish looks like, but any output you get that looks like the example data table in the post should work with the R code, whether it’s from new or old jellyfish or some other source. What does the latest jellyfish output look like?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s