The Lab Notes

The main theme of our research is to understand how gene regulation and genome organization tie in with each other. The Lab Notes are the latest headlines from the lab, featuring a collection of random thoughts and useful code snippets.

Overfitting as a hobby

As a team-building exercise, we decided to conceive a scientific study, write the paper and submit it, all within 24 hours. The result is an article on protein folding rates published in PLoS ONE.

Even though we hit the deadline, the first version was more of a draft than a mature manuscript. Understandably, the text was heavily modified during the revision process. In the end, it is an interesting study, and the whole experience shows how efficient teamwork can be.

We reviewed the literature on protein folding rates and tested all the models we could against new data. The prediction quality ranged from catastrophic to disastrous, so we tried to understand the reasons. It did not take long to discover that the models were overfitted, either because they were overtrained or because they were fitted on too small a data set.
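To make the small data set problem concrete, here is a minimal sketch on synthetic data. It is not the folding rate data or the models from the paper: the straight-line relation, the noise level, the polynomial degree and the function name train_test_errors are all assumptions chosen for illustration. A model with too many parameters is fitted on training sets of increasing size, and its error is measured both on the training points and on held-out points.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def train_test_errors(sizes, degree=6, n_test=500, noise=0.5, seed=0):
    """Return (n_train, train_mse, test_mse) for each training size.

    Hypothetical helper: fits a polynomial of the given degree to
    synthetic data whose true relation is linear.
    """
    rng = np.random.default_rng(seed)

    def sample(n):
        # Synthetic data (an assumption): a straight line plus Gaussian noise.
        x = rng.uniform(-1.0, 1.0, n)
        y = 2.0 * x + rng.normal(0.0, noise, n)
        return x, y

    # Held-out points, never used for fitting.
    x_test, y_test = sample(n_test)

    results = []
    for n in sizes:
        x, y = sample(n)
        # Least-squares fit with degree + 1 free parameters.
        coefs = P.polyfit(x, y, degree)
        train_mse = float(np.mean((P.polyval(x, coefs) - y) ** 2))
        test_mse = float(np.mean((P.polyval(x_test, coefs) - y_test) ** 2))
        results.append((n, train_mse, test_mse))
    return results


if __name__ == "__main__":
    for n, train_mse, test_mse in train_test_errors([8, 40, 2000]):
        print(f"n={n:5d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With only a handful of points, the training error is typically far below the held-out error, which is the signature of overfitting. As the training set grows, the two errors should approach each other and settle near the noise level.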

How can one avoid the embarrassment of predicting folding rates faster than light? It is actually harder than it sounds. We argue that learning curves are the safest way to check the fit quality. That said, it is always difficult to know whether a sample is representative of the population, so the proof is often in the pudding: you have...