06 July 2015

EMC and Academic University create software for bioinformatic calculations

The genome as a collider

But along with the logic came the need to store and process huge amounts of data: genomic data is big data of the same kind as data from social networks or the Large Hadron Collider. To handle it, EMC, together with the Academic University, is creating and testing a platform for bioinformatic computing.

A lot of data

One of the main data providers is systems biology.

This is a science that studies how an organism works (any organism, in general) based on data about its genome, that is, the structure of its DNA. Information about all the proteins that can exist in the organism is encoded in long DNA chains: in effect, DNA "knows" how the organism is built and how it can react to particular environmental conditions.

Working with information "extracted" from DNA is difficult for many reasons, but the main one is its sheer volume. Such a volume of data is very hard to store and process, since the human genome consists of 3.1 billion nucleotides. Translated into the language of data, the genome of each person "weighs" 0.5 TB in compressed form, and three times more in the expanded form needed to work with it. Moreover, the sequence of genes by itself says nothing about their purpose: to find that out, you need to compare the genomes of a large number of people and identify the regions that occur if and only if a person has, for example, a certain disease. Only then can the disease be linked, with some degree of probability, to a particular part of the genome. Establishing the correspondence precisely takes additional studies, which require storing and processing data on a very large number of genomes simultaneously.
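The comparison described above can be sketched in a few lines. This is only a toy illustration under simplifying assumptions: the variant names, the person identifiers and the `exclusive_variants` helper are hypothetical, and real studies rely on statistical association over very large cohorts rather than an exact match.

```python
from typing import Dict, List, Set


def exclusive_variants(carriers: Dict[str, Set[str]], cases: Set[str]) -> List[str]:
    """Return variants carried by every person with the disease and by nobody else."""
    return sorted(v for v, people in carriers.items() if people == cases)


# Hypothetical toy data: variant -> set of people who carry it
carriers = {
    "var_A": {"p1", "p2", "p3"},  # also found in a healthy person (p3)
    "var_B": {"p1", "p2"},        # found in exactly the people with the disease
    "var_C": {"p1", "p2", "p4"},  # also found in a healthy person (p4)
}
cases = {"p1", "p2"}  # people with the disease

print(exclusive_variants(carriers, cases))  # ['var_B']
```

Even in this toy form the whole cohort has to be available at once, which is why the storage requirement grows with every genome added.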

As part of the Human Genome Project (the project to decipher the complete human genome) in 1984-2003, 20 university centers in the USA, Great Britain, Japan, France, Germany and China read and processed 3.2 billion nucleotide pairs, spending about $2.7 billion on it. Today the United Kingdom is running the 100,000 Genomes Project, and the United States and China are preparing projects to decode a million genomes. This amount of data is needed to collect high-quality statistics on the link between each specific disease and a specific DNA site. Technically, the situation resembles the search for rare elementary particles (for example, the Higgs boson) at mega-accelerators like the Large Hadron Collider: out of millions of particle collisions, only one or two produce the desired object, but to find it you need to analyze and process the entire data array.

In genome science, the amount of data that needs to be stored and analyzed reaches tens and hundreds of thousands of terabytes. This is big data: data that the human brain alone cannot process. Tasks of this class, in terms of both hardware and software, cannot be solved by people with only a classical biological education.
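A back-of-the-envelope calculation shows where such figures come from. The sketch below uses the article's own numbers (0.5 TB per compressed genome, three times more when expanded) and assumes that "three times more" means three times the compressed size; the function name is illustrative.

```python
COMPRESSED_TB_PER_GENOME = 0.5  # figure quoted in the article
EXPANSION_FACTOR = 3            # assumption: expanded form = 3 x compressed size


def storage_tb(genomes: int, expanded: bool = False) -> float:
    """Estimate the storage in terabytes needed for a number of genomes."""
    factor = EXPANSION_FACTOR if expanded else 1
    return genomes * COMPRESSED_TB_PER_GENOME * factor


for n in (100_000, 1_000_000):
    print(f"{n:>9,} genomes: {storage_tb(n):>11,.0f} TB compressed, "
          f"{storage_tb(n, expanded=True):>11,.0f} TB expanded")
```

Under these assumptions, a 100,000-genome project needs 50,000 TB in compressed form and a million-genome project needs 500,000 TB, which matches the "tens and hundreds of thousands of terabytes" mentioned above.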

Amid the genetics boom, people from mathematics, physics and IT came to biology. Specialists who combine an understanding of biological problems with a strong mathematical background are called bioinformaticians and are trained at several specialized faculties (the first of which opened at Moscow State University).

"Omics"

What does this big data include, where does it come from, and why is it difficult to analyze?


The most famous biological molecule is, of course, DNA, and the most famous science in this field is genomics, which deals with sequencing, that is, determining the sequence of nucleotides in DNA. Genome sequencing has become an almost routine procedure: determining the sequence of nucleotides (without interpretation) for one person costs only $900 today, and the price keeps falling.

But do not rush to sequence the DNA from your saliva (the standard source of genetic information) right now: decoding alone is not enough to say anything definite about your health from a clinical point of view. Part of the data is also redundant, because only a small part of the genome is "interesting", that is, useful in practice. Knowing the sequence of nucleotides in DNA is far from enough to understand how an organism functions. Most DNA is non-coding: it carries no information about proteins, but it contains many sequences that regulate the work of the genome. In addition, DNA data cannot tell us which part of the genome is "active", that is, producing proteins, at a given moment, and which part is "asleep".

To understand what really happens in a cell, scientists study the structure, function and quantity of the proteins produced in it. The field of knowledge that studies proteins is called proteomics.

However, even a combination of these data is not enough: some important proteins are synthesized only under the influence of certain factors (for example, stress). Such proteins live for a very short time and then break down, so they cannot be registered. But information about them remains in special RNA molecules, which copy a particular section of DNA so that a protein can then be synthesized from it. These RNAs are studied by transcriptomics.
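The "copying" of a DNA section into RNA can be illustrated with a deliberately simplified sketch. It assumes the input is the coding strand of a gene, in which case the messenger RNA carries the same sequence with uracil (U) in place of thymine (T); splicing and other processing steps are ignored, and the function name is illustrative.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T is replaced by U)."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected a DNA sequence containing only A, C, G, T")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTT"))  # AUGGCCUUU
```

Transcriptomics works in the opposite direction: it reads the RNA molecules present in a cell and infers from them which sections of DNA were active.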

The English names of all these sciences end in -omics: genomics, transcriptomics, proteomics, metabolomics (the science of metabolites), lipidomics (the science of fats and other lipids), and so on. Together they are therefore called omics technologies. To understand how the body works, what causes genetically determined diseases, and how the body reacts to stimuli, infections and other environmental influences, the data of the omics technologies must be analyzed together.

These data are very voluminous, so the 0.5 TB of the genome grows several times over. In addition, the data come in different formats that are traditionally processed by different programs. All this requires not only supercomputer capacity for data storage and processing, but also a specialized environment that would "translate" information from the different omics data sets, integrate it and allow analysis on one screen.

Bioinformatic "cranberry"

In Russia, the creation of such an environment (working name: Cranberry) was taken on by EMC, traditionally known as a supplier of data storage systems.

Physically, the supercomputer capacity (1,500 virtual machines) is located in St. Petersburg, on Vasilievsky Island, in the building of a former tobacco factory. The building proved very suitable because it was built to withstand heavy machinery: each floor tile can bear up to two tons. Now the factory houses a powerful IT infrastructure: supercomputers are not much lighter than industrial machines.

Several research groups are testing Cranberry on their own tasks at once. Among them is the laboratory of the world-famous bioinformatician Pavel Pevzner, created with funding from the first wave of mega-grants at the St. Petersburg Academic University. Other users are the Dobzhansky Center for Genome Bioinformatics at St. Petersburg State University and Parseq Lab, a private company bringing bioinformatic data into clinical practice for medical diagnostics.

"We work with open data or with the data of our colleagues and collaborators. We create genome and RNA assembly systems and test them on our own servers and on the platform developed by the EMC Skolkovo Research and Development Center. It is a cloud, but a specialized one, modified for our needs, and this makes it much more efficient than the universal cloud platforms available on the market," says Alla Lapidus, deputy head of the Laboratory of Algorithmic Biology of the SPbAU RAS.

The laboratory headed by Pavel Pevzner was established only in 2011, with mega-grant funds, but it is already well known in the world of bioinformatics: its SPAdes genome assembly software package is used by more than 1,500 laboratories around the world, including the Craig Venter Institute, an advanced center for synthetic biology where the world's first synthetic bacterial cell was created.
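To give a sense of what genome assembly involves, here is a toy sketch of the de Bruijn graph idea, the family of methods to which SPAdes belongs. It is not SPAdes code: the function names and the three short reads are made up for illustration, and real assemblers also have to cope with sequencing errors, repeats and uneven coverage.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in at least one read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(nexts) for node, nexts in graph.items()}


def walk(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads taken from the sequence ATGGCGTGCA
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCA
```

The three overlapping fragments are stitched back into the original sequence; with billions of reads, the same task calls for the kind of computing platform described here.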

SPAdes and its "younger brother" rnaSPAdes, developed by the same group for the analysis of transcriptomic data, are deployed on the EMC environment and make it possible to analyze genomic, transcriptomic and proteomic data efficiently and simultaneously, in particular to improve the genetic analysis of cancer cells and identify the causes of the disease. Application in clinical practice is not far off: better genome analysis will reveal more marker mutations (such as the one Angelina Jolie carries), which indicate a very high risk of developing a certain disease, and will allow their carriers to take preventive measures to avoid the disease and prolong their healthy life.

Portal "Eternal youth" http://vechnayamolodost.ru