Barcode clustering with Starcode

Our team has recently published an article in Bioinformatics describing Starcode, our software to cluster short sequences.

The first years of the lab have been focused on setting up the TRIP (Thousands of Reporters Integrated in Parallel) technology to study position effects. In a nutshell, we integrate reporters with the Sleeping Beauty transposon in our favourite genome, but before this, we barcode each insertion with a random sequence of 20 nucleotides. The barcode allows us to track RNA expression, DNA repair, protein binding etc. on each inserted reporter.

Our typical experiment is to sequence RT-PCR products that contain all the barcodes expressed by a cell population. The abundance of each barcode tells us how much transcript is produced by each reporter. The only snag is that sequencing is not perfect, so many sequenced barcodes will contain mistakes. We need to find those mistakes and correct them to get an accurate tally of the counts. Since we do this all the time, we decided to write a proper piece of software, with a name and all, and publish it as such.

Under the assumption that sequencing errors are rare, we expect that barcodes with errors are less frequent than barcodes without errors, and that barcodes will have few errors in any case. So for this problem, the best option is to perform error correction by sequence clustering, i.e. we consider that a barcode has a mistake when its sequence is very close to that of a more abundant one. The first part of the process is to identify similar sequences, the second is to merge them into clusters that correspond to “real” barcodes. The first problem is more demanding than the second, so we spent a lot of effort on solving it cleanly. More specifically, we wanted an all-pairs search algorithm, i.e. an exact algorithm that finds every pair of sequences within a given similarity threshold (as opposed to a heuristic, which may miss some).
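To make the two steps concrete, here is a minimal Python sketch of the idea. It is not Starcode's algorithm: the all-pairs search below is plain brute force (it compares every pair, so it is quadratic and only usable on toy inputs), the use of Levenshtein distance as the similarity measure is an assumption, and the merging rule (fold each barcode into its most abundant strictly-more-abundant neighbour) is just one simple possibility. The function names and the example barcodes are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def all_pairs(barcodes, max_dist):
    """Step 1: every pair of distinct barcodes within max_dist (brute force)."""
    return [(a, b) for a, b in combinations(barcodes, 2)
            if levenshtein(a, b) <= max_dist]


def merge(counts, pairs):
    """Step 2: fold each barcode's reads into a more abundant neighbour."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    merged = dict(counts)
    # Visit rare barcodes first so that counts flow towards abundant ones.
    for seq in sorted(counts, key=lambda s: (counts[s], s)):
        better = [n for n in neighbours.get(seq, []) if counts[n] > counts[seq]]
        if better:
            parent = max(better, key=lambda n: (counts[n], n))
            merged[parent] += merged.pop(seq)
    return merged


# Toy data: two "real" barcodes, each with one erroneous read.
reads = (["ACGTACGT"] * 10 + ["ACGTACGA"] +
         ["TTTTCCCC"] * 5 + ["TTTTCCCG"])
counts = Counter(reads)
pairs = all_pairs(list(counts), max_dist=1)
clusters = merge(counts, pairs)
print(clusters)
```

On this toy input the two single-read variants are absorbed by their abundant neighbours, leaving two clusters with 11 and 6 reads. The hard part, which this sketch sidesteps entirely, is doing step 1 exactly without comparing every pair.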

Starcode is open source: you can download, fork or comment on the code here, and you may have noticed that there is also a link to it on the right banner. Starcode is not a silver bullet for every clustering problem, but for us it does the job efficiently. We are willing to share and to improve it. If you have a feature request, just get in touch with us and we will be happy to discuss how to make it work for your problem. Don't hesitate to try out the code and tell us what you think about it; your feedback is more than welcome.

Getting Starcode published was a little tedious. The review process took about 8 months, mostly spent waiting for reviewer 3, who in the end did not review the revised version. In any event, the reviews were constructive and relevant, and they helped us improve the manuscript and the software. In the process, we also discovered that Slidesort, the main competing software, had a bug. We wrote to the authors to suggest that they fix it... but apparently they have other things to do. No comment.
