With the help of a genomics savy co-author, I’m doing a bit of de novo genome assembly from Illumina whole genome shotgun sequence. But this method could work with any data set up in a similar way. The data I was working with was a k-mer count table from jellyfish. K-mers are strings of length k in your DNA sequence data. Counting the number of occurrences of every k-mer can be important for assembling a genome… for reasons that I am still trying to wrap my brain around. Assembly is complicated! But I digress.
Anywho, say you have a data table, such as that produced by jellyfish, that looks something like this:
> dat <- read.table("data.from.jellyfish.hist", head=T) > head(dat) frequency multiplicity 1 513363 1 2 2024 10 3 25 100 4 21 101 5 18 102 6 22 103
And when you plot it out, it looks something like this:
> plot(dat$multiplicity, dat$frequency, type=”hist”)
I want to know at what x (aka ‘muliplicity’) I should subset this data, to get the bulk of the peak, but lose the wobbly tail. So I wrote a script, using the zoo and ggplot2 packages, based on this post to identify some minima. After a line of code (full commented code at this gist) , out pops some coordinates for a start minima and a stop minima, and a pretty little graph.
> Minima.mer(dat, extra=TRUE) frequency multiplicity 36 1770 13  "Start minima" bin freq 10 (12.5,13.4] 1770  "Stop minima" bin freq 57 (59.1,60.1] 1125
But it needs some work. You still have to look at the plot to make sure the two minima reported fall on either side of the peak you are interested in. You can adjust by changing the size of the ‘window’ over which the minima is determined, and/or the number of bins the data is broken into, by changing ‘breaks’. The cut-offs allow you to subset the data around a peak of interest, if you have multiple peaks in your data. ‘Extra’ plots a graph, for a visual check, as well the coordinates for both the starting and stopping minima.