Overfitting as a hobby

As a team-building exercise, we decided to conceive a scientific study, write the paper, and submit it within 24 hours. The result is an article on protein folding rates, published in PLoS ONE.

Even though we hit the deadline, the first version was more of a draft than a mature manuscript. Understandably, the text was heavily revised during the review process. In the end, it is an interesting study, and the whole experience shows how efficient teamwork can be.

We reviewed the literature on protein folding rates and tested every model we could against new data. The prediction quality ranged from catastrophic to disastrous, so we tried to understand why. It did not take long to discover that the models were overfitted, either by overtraining or by training on too small a data set.
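
To make the second failure mode concrete, here is a minimal sketch, not from the paper and run on synthetic data: a polynomial with as many coefficients as there are training points interpolates the noise perfectly and then collapses on fresh data.

```python
# Illustration only (synthetic data, not the folding-rate set):
# a model with too many parameters fit to too few points scores
# almost perfectly on its own data and falls apart on new data.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def noisy_line(n):
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.1, n)

x_train, y_train = noisy_line(8)    # tiny training set
x_test, y_test = noisy_line(100)    # fresh data

# Degree-7 polynomial through 8 points: it interpolates the noise.
coefs = P.polyfit(x_train, y_train, deg=7)

def rmse(x, y):
    return np.sqrt(np.mean((P.polyval(x, coefs) - y) ** 2))

print(f"train RMSE: {rmse(x_train, y_train):.4f}")  # near zero
print(f"test  RMSE: {rmse(x_test, y_test):.4f}")    # much worse
```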

How do you avoid the embarrassment of predicting folding rates faster than light? It is harder than it sounds. We argue that learning curves are the safest way to control the quality of a fit. That said, it is always difficult to know whether a sample is representative of the population, so often the proof of the pudding is in the eating: you have to wait until new data arrive.
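
Below is a hypothetical sketch of the learning-curve check, assuming scikit-learn and synthetic stand-in data rather than the folding-rate set: train on growing subsets of the data, and a gap between training and validation error that never closes is the signature of overfitting.

```python
# Sketch of a learning-curve diagnostic (assumes scikit-learn;
# the data are synthetic stand-ins, not the set used in the paper).
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

sizes, train_scores, val_scores = learning_curve(
    SVR(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    # Converging train/validation errors suggest the fit is honest;
    # a persistent gap means the model memorizes its training set.
    print(f"n={n:3d}  train MSE={-tr:.3f}  val MSE={-va:.3f}")
```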

The text is more about machine learning than about protein folding. We took special care to present the matter as clearly as possible, so that it can serve as a real-world case study of overfitting for teaching purposes.

