So you want to do ChIP-seq, huh?

Bioinformaticians think about algorithms, they discuss analyses and normalization, but at the end of the day, they spend the overwhelming majority of their time preparing the data. And yet, this essential step of any analysis has so far received little attention from the community. We recently tackled this question for ChIP-seq data and came up with a couple of ideas that we put in a discretizer that we called Zerone.

Zerone was born from some old Hidden Markov routines from my post-doc, and from my frustration with ENCODE ChIP-seq data. A few years back, we discovered serious issues with about 20% of the ENCODE ChIP-seq profiles in K562, ranging from lack of reproducibility between replicates to total absence of signal. But what could we do? Just throw those profiles away because we did not like them? Surely there must be a more scientific way to approach this question.

We set out to identify a signature of low-quality data with a machine learning approach. This seemed difficult because there are so many ways for things to go wrong. Good ChIP-seq data is also very variable, so it is not obvious what to look at. But this is precisely what machine learning is good for. We manually flagged ChIP-seq profiles as “good” or “bad”, and then collected a bunch of random features from the discretized profiles. We fed all this to a Support Vector Machine, changing the feature set by trial and error until we got an algorithm that could recognize the profiles that humans would flag as “bad”.

The key features say something about the size and the height of the peaks, their number, the signal-to-noise ratio and the correlation between replicates. Quite obvious in retrospect. But it is good to see that a simple model is already very effective at spotting problems.
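To make this a bit more concrete, here is a minimal sketch of how features of this kind could be computed from a discretized profile and the read counts of two replicates. It is not Zerone's code, and the exact features Zerone uses are not listed here: the function names, the feature definitions and the toy numbers below are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def profile_features(calls, rep1, rep2):
    """Summarize a discretized profile (hypothetical feature set).

    calls: 0/1 call per window (1 = enriched)
    rep1, rep2: read counts per window for two replicates
    """
    # A "peak" here is a run of consecutive enriched windows.
    runs = [len(list(group)) for state, group in groupby(calls) if state == 1]
    pooled = [a + b for a, b in zip(rep1, rep2)]
    inside = [c for c, s in zip(pooled, calls) if s == 1]
    outside = [c for c, s in zip(pooled, calls) if s == 0]
    background = mean(outside) if outside else 0.0
    return {
        "n_peaks": len(runs),
        "mean_peak_size": mean(runs) if runs else 0.0,
        "mean_peak_height": mean(inside) if inside else 0.0,
        "signal_to_noise": mean(inside) / background if inside and background > 0 else 0.0,
        "replicate_correlation": pearson(rep1, rep2),
    }


# Toy data: 8 windows, two enriched regions, two well-correlated replicates.
calls = [0, 0, 1, 1, 0, 0, 1, 0]
rep1 = [1, 0, 9, 7, 1, 0, 8, 1]
rep2 = [0, 1, 8, 9, 0, 1, 7, 1]
features = profile_features(calls, rep1, rep2)
```

A vector like this, computed for each manually flagged profile, is the kind of input one would hand to a classifier such as a Support Vector Machine.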

Our credo is that data is either good enough to be used, or not good enough, in which case it should be thrown away altogether. Zerone produces a single discretized profile from all the data sources. And instead of saying which parts of the ChIP-seq profiles are nice and which are not, it does an all-or-none quality control. When a profile is bad, it is all bad, which means that you will have to throw away at least one of the replicates. At least this is what we are trying to achieve with this approach.

Otherwise, Zerone is also a great discretizer in itself. It is relatively fast, memory frugal and it does not annoy you with a lot of parameters. One thing it is not good at, though, is finding exact peaks. If you are looking for transcription factor binding sites or other fine-scale features, Zerone is not what you need. Zerone is a discretizer and not a peak-finder. This means that it returns a profile of windows where the signal is either enriched or not, which is very useful for stacking tens to hundreds of different ChIP-seq profiles.
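Here is a rough sketch of what "a profile of windows" means and why it stacks so easily. Careful: Zerone makes its calls with a Hidden Markov Model, whereas the sketch below uses a plain count threshold as a stand-in, just to show the shape of the output. The window size, the threshold and all function names are arbitrary assumptions, not Zerone's.

```python
def window_counts(read_starts, chrom_length, window=300):
    """Count reads per fixed-size window (window size is an arbitrary choice)."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window] += 1
    return counts


def discretize(counts, threshold):
    """Stand-in for the real model: call a window enriched if it has enough reads."""
    return [1 if c >= threshold else 0 for c in counts]


def stack(profiles):
    """For each window, count how many profiles call it enriched."""
    return [sum(column) for column in zip(*profiles)]


# Two toy experiments on a 1200 bp "chromosome" (four windows of 300 bp).
reads_a = [10, 20, 310, 320, 330, 340, 900]
reads_b = [305, 315, 325, 950, 960, 970]

profile_a = discretize(window_counts(reads_a, 1200), threshold=3)
profile_b = discretize(window_counts(reads_b, 1200), threshold=3)
stacked = stack([profile_a, profile_b])
```

Because every profile is a 0/1 vector over the same windows, stacking is a simple column-wise operation, however many experiments you have.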

More than software, Zerone is an idea: that data preparation can be improved. We hope that this idea will spread its wings and fly to other areas of bioinformatics.
