The Lab Notes

The main theme of our research is to understand how gene regulation and genome organization tie in with each other. The Lab Notes are the latest headlines from the lab, featuring a collection of random thoughts and useful code snippets.

Overfitting as a hobby

As a team building exercise, we decided to conceive a scientific study, write the paper and submit it in 24 hours. The result is an article on protein folding rates published in PLoS ONE.

Even though we hit the deadline, the first version was more a draft than a mature manuscript. Understandably, the text was heavily modified during the revision process. In the end, it is an interesting study and the whole experience shows how efficient teamwork can be.

We reviewed the literature on protein folding rates and tested all the models we could with new data. The prediction quality ranged from catastrophic to disastrous, so we tried to understand the reasons. It did not take long to discover that the models were overfitted, either through overtraining or through the use of too small a data set.

How to avoid the embarrassment of predicting folding rates faster than light? It is actually harder than it sounds. We argue that learning curves are the safest method to control the fit quality. That said, it is always difficult to know whether a sample is representative of the population, so many times the proof is in the pudding: you have...
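To make the idea of a learning curve concrete, here is a minimal sketch in Python. It is a toy, not the analysis from the paper: the data are simulated from a straight line with noise, and the names `fit_line`, `mse` and `learning_curve` are made up for the illustration. The point is only to show training and validation errors being tracked as the training set grows.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(xs, ys, sizes, n_val):
    """Return (size, training error, validation error) for each size.

    The last n_val points are held out and never used for fitting.
    """
    val_x, val_y = xs[-n_val:], ys[-n_val:]
    curve = []
    for n in sizes:
        model = fit_line(xs[:n], ys[:n])
        curve.append((n, mse(model, xs[:n], ys[:n]),
                      mse(model, val_x, val_y)))
    return curve

# Simulated data (an assumption for the demo): y = 2x + 1 plus unit noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(300)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

curve = learning_curve(xs, ys, sizes=[3, 10, 30, 100, 200], n_val=100)
for n, train_err, val_err in curve:
    print(n, round(train_err, 3), round(val_err, 3))
```

With very few training points the training error is typically much lower than the validation error, which is the signature of overfitting. The gap should close as the training set grows.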

Barcode clustering with Starcode

Our team has recently published an article in Bioinformatics describing Starcode, our software to cluster short sequences.

The first years of the lab have been focused on setting up the TRIP (Thousands of Reporters Integrated in Parallel) technology to study position effects. In a nutshell, we integrate reporters with the Sleeping Beauty transposon in our favourite genome, but before this, we barcode each insertion with a random sequence of 20 nucleotides. The barcode allows us to track RNA expression, DNA repair, protein binding etc. on each inserted reporter.

Our typical experiment is to sequence RT-PCR products that contain all the barcodes expressed by a cell population. The abundance of each barcode tells us how much transcript is produced by each reporter. The only snag is that sequencing is not perfect, so many sequenced barcodes will have mistakes. We need to discover those mistakes and revert them to get an accurate tally of the counts. Since we do this all the time, we decided to make a proper piece of software, with a name and all, and publish it as such.

Under the assumption that sequencing errors are rare, we expect that barcodes with errors are less frequent than barcodes...
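The intuition can be sketched in a few lines of Python. This is not the Starcode algorithm, only a naive illustration of the assumption stated above: it compares all pairs, uses the Hamming distance for simplicity, and the function name and the `max_dist` and `min_ratio` parameters are made up for the example.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def merge_rare_into_frequent(reads, max_dist=1, min_ratio=5):
    """Fold each rare barcode into a much more abundant near-identical one."""
    counts = Counter(reads)
    # Visit barcodes from the most to the least abundant.
    ordered = sorted(counts, key=lambda b: (-counts[b], b))
    merged = {}
    for barcode in ordered:
        for parent in merged:  # parents are at least as abundant
            if (len(parent) == len(barcode)
                    and hamming(parent, barcode) <= max_dist
                    and counts[parent] >= min_ratio * counts[barcode]):
                # Likely a sequencing error: give the reads to the parent.
                merged[parent] += counts[barcode]
                break
        else:
            # No abundant neighbour found: keep as a genuine barcode.
            merged[barcode] = counts[barcode]
    return merged

reads = (["ACGTACGTAC"] * 10 + ["ACGTACGTAA"]
         + ["TTTTGGGGCC"] * 6 + ["TTTTGGGGCA"])
print(merge_rare_into_frequent(reads))
# {'ACGTACGTAC': 11, 'TTTTGGGGCC': 7}
```

The quadratic scan is fine for a toy example, but it would be far too slow for the millions of reads of a real experiment.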

On bad statistical practice

I have just published in GigaScience a short note entitled The signed Kolmogorov-Smirnov test: why it should not be used. I had discussed this issue previously on The Grand Locus, and I have refined the arguments through the publication process.

The “signed Kolmogorov-Smirnov test” is a non-standard statistical practice on the rise in the field of genomics. This is unfortunate, as there are several reasons why this test should be avoided. First and most importantly, it does not test whether two samples have the same mean. Second, it is less powerful than the standard t-test.

Why is it used at all then? It is tempting to speculate that the reason may have something to do with p-hacking, which is the practice of changing the statistical test until you find one that gives the p-value you expect. This of course has to be discouraged, and one way to do this is to highlight poor statistical practice. So my aim with this paper was to give a peer-reviewed reference that can be cited in order to argue against the use of this practice.

The editor of GigaScience and the reviewers have been...

A simple life

Since the summer, a strange creature has been growing in our laboratory. Trichoplax adhaerens is the simplest known multi-cellular animal (metazoan) on the planet. At most 1-2 millimetres in size, it consists of two layers of cells and not much else. It lives on rocks in coastal waters, and is quite abundant in the Mediterranean Sea.

The pictures above are 20x magnifications of the same animal taken at 1-minute intervals. When searching for food, the animals change shape continuously. These pictures were taken immediately on arrival of the animals from the laboratory of Bernd Schierwater in Hamburg. Since then, Trichoplax has been growing happily in our laboratory.

Most relevant for us, Trichoplax adhaerens has few cell types (4-5), no muscle, no neuron, and a small genome with few genes (in the range of 11,000). Trichoplax adhaerens is the simplest system capable of maintaining a cell identity. Many unicellular organisms can differentiate (like the Plasmodium malaria parasites for instance), but maintaining a cell identity is a different challenge. In unicellular organisms, differentiation is invariably triggered by the environment. This makes differentiation a sensing problem. The more accurate the sensors the better.

The purpose of maintaining a cell identity...

How to convert PubMed references to BibTeX

Time and again, you need to write a paper in LaTeX with a lot of citations from PubMed. And we all know that PubMed does not support the BibTeX format. Fortunately, you do not have to go through the pain of a manual conversion. If you are familiar with basic scripting, this can be done fairly easily with the following steps.

Write your TeX document with PubMed citation numbers. Each article on PubMed has a pmid, which consists of digits only and is the key of the PubMed record of this article. For instance, to cite the PubMed article 22999052 in your TeX document, you would write \cite{pmid22999052}.


Extract the pmids from the tex document. For instance, if your TeX document is called document.tex, at the Linux command line you can do this with

grep -o "pmid[0-9]*" document.tex | sort -u | sed 's/pmid//'
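If you prefer to stay in Python, the same extraction can be sketched with the `re` module. The function name `extract_pmids` is made up for the example, and the sketch assumes that citation keys have the form `pmid` followed by digits, as in the grep pattern above.

```python
import re

def extract_pmids(tex_source):
    """Return the sorted, de-duplicated pmids cited as 'pmid<digits>'."""
    return sorted(set(re.findall(r"pmid([0-9]+)", tex_source)))

# Hypothetical snippet of a TeX document, for illustration only.
tex = (r"As shown before \cite{pmid22999052,pmid21813512}, "
       r"and again \cite{pmid22999052}.")

pmids = extract_pmids(tex)
print(",".join(pmids))
# 21813512,22999052
```

Joining the result with commas gives the list directly in the form needed for the next step.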

Use the eFetch API to get the PubMed records in XML format. Assuming that you now have a comma-separated list of pmids somewhere ready (in the example below, the pmids are 22999052,21813512), paste the following in your browser text box (or open this link in a new...
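For scripted use, the request URL can also be assembled in Python. This is a sketch under assumptions: the link in the text above is cut off, so the endpoint and the `db`, `id` and `retmode` parameters below follow my understanding of the NCBI E-utilities documentation rather than the original link. The function name `efetch_url` is made up.

```python
from urllib.parse import urlencode

# Assumed eFetch endpoint (NCBI E-utilities).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmids):
    """Build an eFetch URL requesting PubMed records in XML format."""
    query = urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"},
        safe=",",  # keep the commas of the pmid list readable
    )
    return EFETCH + "?" + query

url = efetch_url(["22999052", "21813512"])
print(url)

# To download the records (requires network access):
# from urllib.request import urlopen
# xml_text = urlopen(url).read().decode("utf-8")
```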

Good scientific discussions

The culture of meetings varies a lot between research teams. Most labs have a team meeting and a journal club, with a wide variation in frequency, duration and topics between labs. As the principal investigator, you want good scientific discussion in the team, but this comes at a cost that we often underestimate (I found Jason Fried's TED talk “Why work does not happen at work” very instructive about this).

Our lab itself is an ongoing experiment, and through trial and error we have learned a few things about scientific discussions that are worth sharing. Our first attempt to promote communication was a series of 5-minute micro-meetings between two people. During this time, they were supposed to explain what they would do during the day, and why. The meetings were held twice a week, with a rotation schedule. Even though everybody liked the idea, it turned out to be unsustainable because synchronization between the two people did not happen naturally. When one got a 5-minute break, the other would be in the middle of an experiment, and so on. After a few skipped meetings, the momentum was lost and they quickly died out.

Our second attempt...

The role of SWI/SNF in HIV-1 chromatin remodeling

If you type the keywords “SWI/SNF chromatin remodeling” and “HIV-1” in PubMed, fewer than 20 research articles appear on your screen. Actually, the topic of nucleosome remodeling of HIV-1 provirus is less than 10 years old. The more we investigate HIV-1, the more we know that the connection between host chromatin and HIV-1 pathogenesis cannot be ignored.

The integration of HIV-1 provirus into the cellular genome is an essential mechanism for the establishment of stable infection. How the HIV-1 provirus further manipulates chromosomal features after this step to continue its life cycle is therefore considered important. SWI/SNF is one of the main actors involved in the alteration of DNA accessibility within repressive nucleosomes. In fact, back in 1996, the SWI/SNF regulator was found in the RNA polymerase II holoenzyme and was reported to be involved in chromatin remodeling [1]. Later, it was realized that the SWI/SNF complex found in both eukaryotes and prokaryotes is actually a group of proteins that associate to remodel the nucleosome state (active or repressive). SWI/SNF contains either Brahma (BRM) or the closely related BRG1 as its catalytic subunit and shares...

3D animations with R

Every now and then I need to make a rotating animation of a 3D plot. The R package rgl turns out to have everything you need, but it is a little difficult to get to grips with. Below is an example that will walk you through the steps to make this animation.

First things first, you must make sure that rgl is installed. On Ubuntu, you may also have to install additional libraries. And by the end, you will need to use imagemagick, so at the shell command line you can issue

sudo apt-get install libglu1-mesa-dev
sudo apt-get install imagemagick

You can now start R. Since the package is on CRAN, you can install it as usual with install.packages("rgl").


For the purposes of this example, we create a random 3D cloud that consists of two Gaussian spheres next to each other.

# Distribute 1000 points at random among two spheres.
x <- matrix(rnorm(3000, mean=rep(c(0,2), each=1500)),
    ncol=3, byrow=TRUE)
cols <- rep(c("dodgerblue4", "dodgerblue2"), each=500)

Now we plot the cloud in 3D and save the frames to different .png files. After the call to plot3d you can resize the window...

HIV Therapies – From “Hit Hard, Hit Early” to “Shock and Kill”

More than 30 years after the outbreak of AIDS in the United States, HIV/AIDS is still one of the top ten causes of death worldwide [1]. One of the difficulties in curing HIV/AIDS is the lack of an effective HIV vaccine, although numerous laboratories are searching for one.

In 1995, David Ho first promoted a “hit hard, hit early” approach to eliminate HIV-1 in the early phase of the infection [2]. Although this approach was later abandoned because of the high risk of side effects and the high cost of the treatment, it remains a milestone in the history of HIV/AIDS treatment. Nowadays, the standard approach for HIV/AIDS treatment is antiretroviral therapy (ART), which combines at least three antiretroviral (ARV) drugs to maximally suppress the virus and stop the progression of the disease. Typical combinations include 2 nucleoside Reverse Transcriptase Inhibitors (NRTIs) + 1 Protease Inhibitor (PI) or 2 NRTIs + 1 non-nucleoside Reverse Transcriptase Inhibitor (NNRTI) [3]. Clinical studies showed that ART markedly decreases the mortality of AIDS patients.

It was gradually realized that one of the main reasons...

How to gunzip on the fly with Python

For a long time I wondered how R was able to recognize gzipped files and decompress them on the fly. This is neat because the large data files that we manipulate in bio-informatics are better kept compressed on the disk and decompressed upon loading them in memory.

Most binary file formats start with a magic number, indicating which file type it is. A properly gzipped file starts with 1F8B. You need to read the first two bytes, and once you figure out whether the file is compressed, you either read the file as usual, or read it with the functions of the gzip package.

Here I wrote a small module for this purpose. After importing the class gzopen, you can use it to seamlessly open gzipped files.

# -*- coding:utf-8 -*-

import gzip

class gzopen(object):
    """Generic opener that decompresses gzipped files
    if needed. Encapsulates an open file or a GzipFile.
    Use the same way you would use 'open()'.
    """
    def __init__(self, fname):
        # Open in binary mode so that the magic number is read as bytes.
        f = open(fname, 'rb')
        # Read magic number (the first 2 bytes) and rewind.
        magic_number = f.read(2)
        f.seek(0)
        # Encapsulated 'self.f' is a file or a GzipFile.
        if magic_number == b'\x1f\x8b':
            self.f = gzip...
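
Since the listing above is cut off, here is a minimal self-contained sketch of the same idea written as two plain functions. It is not the original module: the names `is_gzipped` and `open_maybe_gzipped` are made up for the example, and it targets Python 3. The demo at the bottom writes two small temporary files to show that both are read the same way.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzipped file

def is_gzipped(fname):
    """Return True if the file starts with the gzip magic number."""
    with open(fname, "rb") as f:
        return f.read(2) == GZIP_MAGIC

def open_maybe_gzipped(fname, mode="rt"):
    """Open a file for reading, decompressing on the fly if needed."""
    if is_gzipped(fname):
        return gzip.open(fname, mode)
    return open(fname, mode)

# Demo: write the same content plain and gzipped, then read both back.
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "plain.txt")
packed = os.path.join(tmpdir, "packed.txt.gz")
with open(plain, "w") as f:
    f.write("ACGT\n")
with gzip.open(packed, "wt") as f:
    f.write("ACGT\n")

for path in (plain, packed):
    with open_maybe_gzipped(path) as f:
        print(is_gzipped(path), f.read().strip())
```

The check relies on the content of the file rather than on its extension, so a gzipped file without the .gz suffix is still handled correctly.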