08 November 2016

Three whales of the crisis

Did someone call for a DREAM?

Maria Kondratova, "Biomolecule"

In October 2016, a group of Russian bioinformaticians won a stage of the prestigious ENCODE-DREAM scientific competition, timed to coincide with a workshop on the application of data analysis and machine learning methods in biology, held as part of the ISCB-RECOMB international conference on regulatory and systems genomics. The algorithm proposed by the Russian team for predicting the binding sites of proteins that regulate gene expression was recognized as the best. However, the victory of the bioinformatics team led by Ivan Kulakovsky is more than just a "success story" (although that alone is worth a lot); it is a story of how a fundamentally new model of organizing science is taking shape and beginning to work before our eyes.


It is believed that transcription factors (TFs) regulate gene activity by binding DNA sites in open (accessible) regions of chromatin. In the ENCODE-DREAM Challenge, participants were asked to predict TF binding sites across the complete genome of a given cell type, based on chromatin accessibility and the DNA sequence. Image from mappingignorance.org.

The explosive growth in the number of laboratories and research projects in biology and medicine in recent decades has exposed some fundamental problems in the organizational principles of modern science, which seriously slow down its development. In biology, this crisis rests firmly on three problematic "whales".

The first whale. Big data – big problems

In "traditional" science, the situation was simple: whoever obtained the data analyzed them (at least at the level of the laboratory and, as a rule, even at the level of an individual scientist). The situation changed when high-throughput methods became popular (for example, DNA microarrays [1] and "next generation" DNA sequencing [2]), where a single experiment measures the expression (activity) of not one but all genes in a sample, in dozens or hundreds of samples at once. Big data has come to biological science. Processing these gigantic numerical arrays requires proficiency in complex statistical methods and special mathematical training.

The logical consequence of this situation was the further deepening of the specialization of scientists: along with the "wet" – experimental – biological laboratories (wet lab), a lot of "dry" bioinformatic groups (dry lab) appeared, engaged exclusively in data analysis.

A large number of articles on "Biomolecule" are devoted to "dry" (computational) biology. Here are some of them: "The computational future of biology", "I would go to bioinformatics, let them teach me!", "Bioinformatics: Big Databases versus Big R" and "The Research Group of Philip Haitovich, or how biologists work with large amounts of data" [3–6]. – Ed.

Ideally, such specialization should have greatly increased the efficiency of scientific work. In practice, however, things are not so rosy. The new forms of specialization run up against the traditional format for releasing scientific data (the publication), which implies the leadership of one group and painful priority issues. Who is the author of the result: the group that set up a complex and expensive experiment, or the analysts who found an interesting dependence in the data and thereby turned "scientific information" into "scientific knowledge"? As a rule, the two sides (bioinformaticians and experimental biologists) answer this question differently. As a result, we see publications with interesting but carelessly analyzed data on the one hand (from experimenters who do not want to "share"), and interesting statistical methods that lack experimental verification on the other (from bioinformaticians who have not found "their" experimenter, one ready to invest in confirming or refuting their theories). This state of affairs is hardly optimal.

The second whale. The crisis of cooperation

The more specialization in science grows (as you know, a real specialist knows "everything about nothing"), the more obvious becomes the need for a counter, "synthetic" movement: generalization, scientific cooperation, communication between scientists from different fields. This is understood even at the state level: in Europe, applying for any large grant requires combining the efforts of several laboratories. However, to be honest, such a semi-compulsory crossing of a "hedgehog with a grass snake" for the sake of obtaining funding is rarely truly effective or groundbreaking. As a rule, it deserves no more than the restrained verdict "better than nothing."

Although the official stance of fundamental science worldwide is free scientific inquiry, in real academic life the "dog in the manger" attitude ("This is my topic ...") is, unfortunately, also very common. A lack of common interests, together with strong mutual jealousy and suspicion (due to the same priority issue), greatly complicates effective scientific communication.

The third whale. And who are the judges?

A working bioinformatic method saves a great deal on experiments and sets new directions for scientific research. However, as the number of mathematical approaches to a particular biological problem grows, a natural question arises: which of them describes reality better, and under what conditions?

The position of bioinformatics as a "service science" working with someone else's data undermines the verifiability of its methods. The success of a publication (determined by the rating of the journal and the number of citations) is 90% the success of the original data and only 10% the success of the analytical method. Bioinformaticians who collaborate with successful experimental laboratories thus have a significant advantage over their colleagues, practically independent of the quality of the approaches and algorithms they use. Methods published in Nature or Cell are not necessarily the best methods. But with such obvious inequality in priority access to data, how can we determine the best? The issue of verifiability of bioinformatic methods is the third problematic "whale" of modern biological science.

The exit is the same as the entrance – strategies for overcoming the crisis

An alternative to semi-compulsory "grant" cooperation is a system of scientific networks, actively promoted around the world by enthusiasts, the most famous of whom is the scientist and innovator Stephen Friend. The non-profit organization Sage Bionetworks, which he founded, promotes "open science" and develops the Synapse cloud platform, on which scientists from different countries can join forces to solve a problem of interest to them (usually related to big data analysis).

In parallel, more and more international non-profit consortia are being organized in biology, bringing together both experimenters and bioinformatics specialists interested in a specific broad topic. Some are clearly focused on the study of specific diseases (for example, the TCGA and ICGC consortia are engaged in systematic research of various types of cancer). Others are more focused on the study of fundamental issues (such as ENCODE and FANTOM, which study the regulation of gene activity in various cell types).

Another approach aimed at solving both the problem of overcoming scientific isolationism and the problem of verifiability of bioinformatic methods is the system of scientific competitions. Such a competition lies, one might say, at the very origins of molecular biology, when, having practically the same data, Watson and Crick competed with Pauling in predicting the structure of DNA.

The first mass competition of theorists (as far as I know) was CASP, an open and independent competition organized every two years by scientists engaged in structural biology. For it, several laboratories involved in the crystallization and determination of the three-dimensional structure of proteins "hold back" their data for several months, giving structural biologists the opportunity to compete in predicting an unpublished structure [7]. The winner is the team of theorists whose model shows the best agreement with the experimental data.

A similar approach – testing predictive methods with closed (until the end of the competition) experimental data – has also passed into a new generation of scientific competitions, the largest of which is the DREAM Challenge [8].

Since its launch in 2006, more than a dozen computational problems from various fields of molecular and theoretical biology have been posed and successfully solved under the banner of scientific crowdsourcing within the DREAM (Dialogue for Reverse Engineering Assessments and Methods) initiative. For several years now, the Synapse cloud ecosystem has served as the platform for organizing the competitions.

The time frame set by the competition helps avoid the "laxity" that is, alas, common in informal cooperation, and the "carrot" that attracts scientists is the publication of the winners' results in a prestigious scientific journal and an invitation to speak at a conference. For young scientists, these competitions are thus a real chance to get into "big science" in one leap. But more importantly, the DREAM Challenge forms a new bioinformatics "table of ranks", more objective and independent than the traditional "publication rating", and gradually turns the scientific community towards open collective brainstorming on current scientific problems, as opposed to the traditional "proprietary" approach. Before our eyes, a virtually independent scientific brand is being formed, and perhaps in a few years "tested and confirmed – DREAM Challenge" will mean no less than "published in Nature".

In the ENCODE-DREAM Challenge, participants are invited to predict the genome-wide "map" of DNA binding by regulatory proteins (transcription factors) in a given cell type (Fig. 1). They may use computational methods of sequence analysis (which describe the characteristic DNA patterns bound by proteins) and experimental data on chromatin accessibility (DNase-Seq) and gene expression (RNA-Seq).

Figure 1. A local DNA region: the "footprint" of a transcription factor in open chromatin, protected from the action of nuclease.

Experimental binding maps obtained by chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) are used both to train machine learning methods (on cell types other than the "target" one) and to verify the predictions of the various teams [9].
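To make the "sequence analysis plus chromatin accessibility" idea more tangible, here is a minimal sketch of how a characteristic DNA pattern can be scored with a position weight matrix and how hits can be restricted to open chromatin. Everything in it is an assumption made for illustration: the four-letter motif, the uniform background, the threshold and the `scan` function are invented, and real motif models and DNase-Seq peak sets are far larger.

```python
import math

# Hypothetical 4-position motif: per-position base probabilities (made up).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform background base frequency (an assumption)


def pwm_score(window):
    """Log-odds score of one window: motif model versus background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, window))


def scan(sequence, open_regions, threshold=4.0):
    """Return (start, score) for motif hits lying entirely inside open chromatin.

    open_regions: list of (start, end) half-open intervals, e.g. DNase-Seq peaks.
    """
    width = len(MOTIF)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = pwm_score(sequence[start:start + width])
        in_open = any(s <= start and start + width <= e for s, e in open_regions)
        if score >= threshold and in_open:
            hits.append((start, round(score, 2)))
    return hits


sequence = "TTAGCTGGAGCTAA"  # contains the pattern AGCT at positions 2 and 8
hits = scan(sequence, open_regions=[(0, 8)])
print(hits)  # only the occurrence inside the open region is reported
```

The second occurrence of the pattern has an identical sequence score but lies in "closed" chromatin, so it is not reported. This is the intuition behind combining the two kinds of data.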

DREAM Challenge 2016

In 2016, several parallel competitions are being held within the framework of DREAM. The topic of the joint project of the ENCODE and DREAM consortia is the prediction of DNA sites bound by transcription factors, based on experimental information about chromatin accessibility and computational analysis of the genomic sequence.

As you know, all cells of a multicellular organism contain the same DNA (genome), but they have different properties: in the human body, for example, there are several hundred specialized cell types. Different sets of genes work (are expressed) in different cells, due to this, such a variety of body tissues is achieved. And although scientists have long deciphered the genetic code [10], we still have a poor understanding of exactly how genes turn on and off.

It is known that an important role in this is played by transcription factors: proteins that sit on DNA at special points (binding sites) and recruit RNA polymerase there, which reads from the gene the mRNA needed for the synthesis of a certain protein.

Another, more general mechanism regulating gene activity in the cell is the packaging of chromatin (a complex of DNA and its main packaging proteins, the histones). It is believed that genes in the tightly packed fraction of chromatin are turned off and inaccessible to transcription factors, while in "open" chromatin the opposite is true.

Experimental determination of the sites where particular proteins bind DNA is an expensive and complex procedure that must be repeated for each regulatory protein (and there are up to one and a half thousand of them in humans) separately in each cell type. Determining chromatin accessibility is a much simpler operation, and it is enough to perform it once for each cell type. It would therefore be very tempting to create a computational method capable of reconstructing the genome-wide binding profile of transcription factors from chromatin accessibility and the known genomic sequence.

It is impossible to derive from first principles an exact formula that would tell whether a protein sits on a given stretch of DNA or not: too many factors influence this process. However, we can try to identify patterns in the known experimental data that describe the property we are interested in. Methods for detecting dependencies in data are what the modern world calls machine learning. A typical machine learning task is to learn a pattern from one experiment in order to then predict the result of another. But there is one subtlety here: most modern molecular biological data have been obtained for cancer cell lines [11] (they are easier to work with in the laboratory), while normal cells are of the greatest interest to researchers. How can cancer data be effectively extrapolated to normal tissues? That was exactly the challenge of this year's competition.

For ENCODE-DREAM, the "machine learning" methods were worked out by the participants on known experimental data on the binding sites of transcription factors in cancer cell lines, and then tested by the organizers on unpublished data on the binding of transcription factors in normal liver cells.

Better fewer, but better! Secrets of the success of the autosome.ru team

It is believed that machine learning algorithms work better the more data they have seen at the training stage. The innovation of the autosome.ru team, which ultimately secured them first place, was to purposefully limit the data set and use information from only similar cell lines to train the algorithm.
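The article does not say how the team judged cell lines to be "similar", so the following is only a sketch of one plausible way to do it: compare the sets of accessible genomic bins with a Jaccard index and keep the closest training lines. The cell line names, the bin sets and the `pick_similar` helper are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of open-chromatin bin indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_similar(target_bins, candidates, top_k=2):
    """Rank candidate cell lines by similarity to the target and keep the top_k."""
    ranked = sorted(
        candidates,
        key=lambda name: jaccard(target_bins, candidates[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: genomic bins that are accessible in each (hypothetical) cell line.
target = {1, 2, 3, 4, 5, 6}
training_lines = {
    "line_A": {1, 2, 3, 4, 5, 9},    # shares 5 bins with the target
    "line_B": {7, 8, 9, 10},         # shares none
    "line_C": {2, 3, 4, 6, 11, 12},  # shares 4 bins with the target
}
selected = pick_similar(target, training_lines)
print(selected)  # ['line_A', 'line_C']
```

Only the selected lines would then be used for training. The dissimilar `line_B` is dropped even though that means learning from less data.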

The second success factor was the use of the powerful machine learning method XGBoost, which works as an ensemble of decision trees [12].
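What "an ensemble of decision trees" means can be shown with a deliberately tiny sketch. This is not XGBoost and not the team's code: it is plain least-squares gradient boosting with one-split trees ("stumps") written from scratch, and the two features and eight data points are invented for illustration.

```python
def fit_stump(X, residuals):
    """Pick the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must leave points on both sides
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]  # (feature index, threshold, left value, right value)


def predict(model, row):
    """Base value plus the shrunken sum of all stump outputs."""
    total = sum(lv if row[j] <= t else rv for j, t, lv, rv in model["stumps"])
    return model["base"] + model["rate"] * total


def sse(model, X, y):
    """Sum of squared errors on the given data."""
    return sum((yi - predict(model, row)) ** 2 for row, yi in zip(X, y))


def boost(X, y, rounds=20, rate=0.5):
    """Each round fits a new stump to what the current ensemble still gets wrong."""
    model = {"base": sum(y) / len(y), "rate": rate, "stumps": []}
    for _ in range(rounds):
        residuals = [yi - predict(model, row) for row, yi in zip(X, y)]
        model["stumps"].append(fit_stump(X, residuals))
    return model


# Toy data: (motif score, chromatin accessibility) -> bound (1) or not (0).
X = [(0.9, 0.8), (0.8, 0.9), (0.9, 0.1), (0.1, 0.9),
     (0.2, 0.2), (0.7, 0.7), (0.3, 0.8), (0.8, 0.3)]
y = [1, 1, 0, 0, 0, 1, 0, 0]

mean_only = boost(X, y, rounds=0)
one_tree = boost(X, y, rounds=1)
ensemble = boost(X, y, rounds=20)
print(sse(mean_only, X, y), sse(one_tree, X, y), sse(ensemble, X, y))
```

No single stump can express "bound only when both the motif score and the accessibility are high", but the training error keeps falling as more stumps are added. XGBoost follows the same additive principle while using deeper trees, regularization and second-order gradient information.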

And finally, the third factor was the painstaking work on the design and selection of features (as they are called in the language of machine learning), which help to distinguish bound parts of the genome from unbound ones. Some features only very weakly confirm or refute the hypothesis that the factor binds to DNA, but machine learning algorithms are able to take even weak patterns into account and use them to improve the model. Some features improve the quality of prediction by tens of percent, others by tenths of a percent, but when the "evidence" from dozens of such features can be combined, the total effect is significant.
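The arithmetic behind "many weak features add up" can be illustrated with a back-of-the-envelope calculation. It rests on a strong simplifying assumption that the team's boosted trees do not need, namely that the features are independent, so that their log-likelihood ratios simply add. The likelihood ratio of 1.1 per feature and the prior of 0.5 are invented numbers.

```python
import math


def posterior(prior, likelihood_ratios):
    """Combine independent pieces of evidence by adding their log-odds."""
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))


# One weak feature: barely moves a 50/50 prior.
one = posterior(0.5, [1.1])
# Thirty equally weak, independent features: together they are decisive.
many = posterior(0.5, [1.1] * 30)
print(round(one, 3), round(many, 3))
```

A single feature that makes binding only 1.1 times more likely shifts the probability from 0.5 to about 0.52, while thirty such features push it above 0.9.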

Together, these approaches brought not just a "technical victory" but a convincing superiority over rivals. The final "generalized rating" of the autosome.ru team (Fig. 2), based on various measures of prediction quality for 12 transcription factors, turned out to be almost twice as high as that of the closest competitor, the J-Team, which took second place. In the near future, the results of this first stage of the DREAM Challenge will be reported at the conference.


Figure 2. Know your heroes! From left to right, the members of the (mostly) Russian team autosome.ru: Ilya Vorontsov, postgraduate student at the N.I. Vavilov Institute of General Genetics of the Russian Academy of Sciences (Moscow, Russia); Andrey Lando, Master's student at the Moscow Institute of Physics and Technology (Dolgoprudny, Russia); Grigory Sapunov, co-founder of Intento; Cheburashka, mascot; the team leader Ivan Kulakovsky, leading researcher at the V.A. Engelhardt Institute of Molecular Biology of the Russian Academy of Sciences (Moscow, Russia); Valentina Boeva, head of a laboratory at the Cochin Institute (Paris, France), a graduate of Moscow State University who defended her PhD thesis in Russia; Irina Eliseeva, researcher at the Institute of Protein Research of the Russian Academy of Sciences (Pushchino, Russia); Vsevolod Makeev, Corresponding Member of the Russian Academy of Sciences, head of a laboratory at the Vavilov Institute of General Genetics of the Russian Academy of Sciences (Moscow, Russia).

Until recently, bioinformaticians, fully aware of their dependent position (on experimenters, the producers of data), sadly joked that, despite the dizzying successes of their science, a Nobel Prize for data analysis was not to be expected in the near future.

Now bioinformaticians have to look for a new subject for jokes: in 2013, Martin Karplus, Michael Levitt and Arieh Warshel received the Nobel Prize for the development of methods for modeling large and complex chemical systems and reactions; see "'Virtual' Nobel Prize in Chemistry (2013)" [13]. – Ed.

However, competitions like the DREAM Challenge are changing the usual (and unfair) state of affairs, and perhaps we will yet see the author of some stunning algorithm at a reception with the Swedish king. But even if not, by turning science away from the selfish "struggle for priority" and towards a collective search for truth based on cooperation (at the final, crowdsourcing, stage, all teams participating in the competition will exchange data and algorithms and discuss winning strategies together), such competitions act as a kind of "prototype" of the science of the future: more open, free and efficient than the science of today, ready for new challenges and new dreams.

The informal autosome.ru team unites researchers from several institutes (primarily IMB RAS and IOGen RAS) working in the field of regulatory genomics and bioinformatics. In different years and in varying composition, its members have developed many computational methods and databases for the analysis of gene regulation in eukaryotes (examples of their work can be found on their website). Among past successful projects, the work of the Russian group led by Vsevolod Makeev in the international FANTOM5 consortium is worth noting. Participation in DREAM is one of many joint projects, and it drew actively on previous developments.

The author thanks the members of the autosome.ru team for their help in working on the article and Andrey Zinoviev (Curie Institute, Paris) for an interesting discussion about the ways data science is developing.


  1. Biomolecule: "The most important methods of molecular biology and genetic engineering";
  2. Biomolecule: "454-sequencing (high-throughput DNA pyrosequencing)";
  3. Biomolecule: "Computational future of biology";
  4. Biomolecule: "I would go to bioinformatics, let them teach me!";
  5. Biomolecule: "Bioinformatics: Big Databases versus Big P";
  6. Biomolecule: "The Philip Haitovich Research Group, or how biologists work with large amounts of data";
  7. Biomolecule: "The triumph of computer methods: prediction of the structure of proteins";
  8. Bender E. (2016). Challenges: Crowdsourced solutions. Nature. 533, S62–S64;
  9. Biomolecule: "The new CETCh-seq method can catch many results in one label";
  10. Biomolecule: "At the origins of the genetic code: kindred spirits";
  11. Biomolecule: "The Immortal Cells of Henrietta Lacks";
  12. Chen T. and Guestrin C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794;
  13. Biomolecule: "'Virtual' Nobel Prize in Chemistry (2013)".

Portal "Eternal youth" http://vechnayamolodost.ru 08.11.2016
