04 December 2019

How Big Data is Changing Biology

Mikhail Gelfand on genes and evolution

Pavel Lebedev, Rusbase

Biology, like many other sciences, is advancing today thanks to big data. Why many biological questions can be answered only by computation, how to predict the future "life" of a dead cell, and which questions a neural network will not solve: Mikhail Gelfand, Doctor of Biology and Candidate of Physical and Mathematical Sciences, explained all of this. The well-known Russian bioinformatician spoke at the latest Yandex technology conference, YaTalks.


From philosophy to Big Data

How did biology work before? At first it was a branch of philosophy. When Aristotle said that a fly has eight legs, he derived it from first principles. No one counted the legs of flies, and the fact that a fly has six legs was discovered many centuries later. This is more of an urban legend, but it captures the essence.

Then observational biology began, from which classical botany and zoology emerged. Scientists realized quite late that it was possible to do experiments on living beings. For example, put a mouse under a glass hood, pump out the oxygen and see what happens. Experiments appeared, and people began to look through a microscope and see cells. People began to observe not the phenomena themselves but their consequences. The transition from direct observation to indirect experiment anticipated bioinformatics. Today biology is divided into in vivo and in silico: experimental and computational.

Bioinformatics is largely not a science, but a set of skills that a good biologist uses in different situations. A huge number of biological questions can be answered only by performing calculations on a computer.

I have a story about this. There was a student who, by the time she graduated from Mekhmat (the Faculty of Mechanics and Mathematics), had several bioinformatics papers alongside her main thesis. Then she said she wanted to study experimental biology and enrolled in graduate school at UCLA in Los Angeles. That was a while ago. There she had to find a protein that a certain bacterium injects into our cells so that the bacterium can then live inside them.

She tried for a long time to find this protein but could not. Then her new supervisor remembered that she had worked in bioinformatics and could do calculations. She spent a week at the computer and found four candidate proteins that could perform this function. Several months of work were saved. That is how bioinformatics, including the kind I do, worked until roughly the 2000s.

In the 2000s there was a turning point: experimental technologies became so efficient and so cheap that a great deal of data appeared. Biology followed the same path as such venerable sciences as astrophysics or high-energy physics.

In bioinformatics, first, big data has appeared, and second, we can now study not just individual proteins and genes but the work of the cell as a whole.

I deceived you, like any normal lecturer, when I said that bioinformatics is not a science.

Behind bioinformatics there is a fundamental science, molecular evolution. In fact, all the empirical techniques we use have a deep evolutionary basis. To understand how something works now, you need to understand how it arose. In some fields this has advanced further, as in classical molecular biology; in others less so, for example in structural biology.

Big data in structural biology are available for distant organisms: humans, fruit flies, yeast. The evolutionary distances between them are too large. But studies are starting to appear that look, for example, at how the brains of primates work, not of all of them but of several dozen, and there you can do evolutionary analysis. Such work is only just beginning.

Transcription factors

I will follow one rather narrow line. What I will try to tell has a single plot, but thematically it is far from all of bioinformatics. I will talk about transcription factors, that is, about proteins that bind to DNA molecules.

This was a challenge for me, because there is a wonderful book by Alexander Markov and Elena Naimark, "The prospect of selection". It says that one must not frighten the reader with the term "transcription factor binding site". I recommend the book, but I will still talk about the analysis of transcription.

We have proteins. On the one hand, a protein is just a string over an alphabet of 20 letters, the amino acids. From a molecular point of view, it is a sequence folded into a structure. Whether a protein binds to DNA can depend on whether it has bound some small molecule. We have a region that encodes a protein, and we need to turn the gene on and off depending on external conditions. There is an enzyme, RNA polymerase, that copies the gene. There is a sequence called an operator, and a repressor protein binds to it. If the repressor is bound to the DNA, RNA polymerase cannot work.

To understand how a gene is regulated, you need to know which operators lie in the sequence upstream of the gene; this is one of the tasks of molecular biology. We can, for example, search for them experimentally.
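The computational counterpart of that search can be sketched in a few lines. This is a minimal illustration, not the method used in the work described in the talk: it assumes a handful of already known binding sites (the sequences below are invented), builds a position weight matrix from them against a uniform background, and scores every window of a hypothetical upstream region.

```python
import math

# Hypothetical known sites for one transcription factor (invented for illustration).
KNOWN_SITES = ["TTGACA", "TTGACT", "TTGATA", "TTTACA"]
ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Log-odds weight of each base at each position, against a uniform 0.25 background."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in ALPHABET
        })
    return pwm

def scan(sequence, pwm):
    """Score every window; return (start, window, score) tuples, best score first."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, window, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

pwm = build_pwm(KNOWN_SITES)
upstream = "GCGCATTGACAGGCTAACG"  # hypothetical region upstream of a gene
best = scan(upstream, pwm)[0]
print(best[0], best[1], round(best[2], 2))
```

In practice the top-scoring windows are only candidates; they still have to be checked against other genomes or by experiment.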

It turned out that once big data appeared, we could describe how the whole cell works. It often happens that we know nothing experimentally but still want to find a motif. Then we can look not at one genome but at many at once. Here it is worth remembering Wald's principle.

This story is from the Second World War. Abraham Wald was a mathematician, an Austrian Jew by birth, who in the late 1930s understood what was coming and moved from Austria to the United States. There he worked in a strategic bureau in Manhattan, an office that solved mathematical problems for the military. The story is instructive, and I will explain why, in a context even broader than bioinformatics.

Allied planes that flew to bomb Germany suffered heavy losses from anti-aircraft guns. It became clear that the armor had to be strengthened. The problem is that if you armor a bomber all over, it turns from a bomber into a tank and cannot fly, simply because of the weight.

The customers set the following task: count the holes on the returned aircraft and report which areas of the aircraft have the most holes; those areas would then be covered with armor. Is it clear to everyone why this is a bad specification? Because it is survivorship bias. Holes are counted on the planes that returned, not on those that were shot down. Wald understood this and solved the opposite problem: he worked out where there were fewer holes than expected.
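The effect is easy to reproduce in a toy simulation. The sketch below is only an illustration under invented assumptions: hits land uniformly on four sections of the plane, and each section has a made-up probability that a single hit there brings the plane down. None of the numbers are historical data.

```python
import random

# All numbers are invented assumptions for illustration, not historical data.
SECTIONS = ["engine", "cockpit", "fuselage", "wings"]
# Probability that a single hit in this section brings the plane down.
LETHALITY = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "wings": 0.05}

def simulate(num_planes=20000, hits_per_plane=3, seed=0):
    """Return hole counts per section for all planes and for returned planes only."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    survivor_hits = {s: 0 for s in SECTIONS}
    for _ in range(num_planes):
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if every hit fails to be lethal.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits

all_hits, survivor_hits = simulate()
total_all = sum(all_hits.values())
total_survivors = sum(survivor_hits.values())
for s in SECTIONS:
    print(f"{s:9s} all planes: {all_hits[s] / total_all:.2f}  "
          f"returned planes: {survivor_hits[s] / total_survivors:.2f}")
```

Under these assumptions every section receives about a quarter of all hits, yet on the returned planes the engine and cockpit show noticeably fewer holes than that. The sections with a deficit of holes are the ones that need armor.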

The moral of this story is that you should not simply listen to the customer: you must solve not the task they state but the one they really need solved. And no neural network will do that for you.

Mutations

What happens to a genome sequence? The same thing. There is a random stream of mutations caused by copying errors. If a mutation turns out to be neutral, evolution does not notice it and the fitness of the organism stays the same. And there are harmful mutations that reduce viability: for example, the regulation of a gene gets worse, or the gene sequence is damaged and it encodes the wrong protein.

Important positions will therefore be conserved: mutations in functionally essential positions will be retained much less often than in neutral ones. Indeed, the positions where transcription factors bind are conserved and form conserved islands in an alignment. DNA is a linear molecule, packed more loosely in some places and more compactly in others. Where it is packed tightly, the genes are silent; where it is packed loosely, the genes are active.
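Returning to the conserved islands: the idea can be sketched as follows. This is a minimal illustration that assumes the upstream regions from several genomes are already aligned without gaps; the four sequences are invented, and the threshold and minimum island length are arbitrary parameters chosen for the example.

```python
from collections import Counter

# Hypothetical aligned upstream regions from four related genomes (invented).
ALIGNMENT = [
    "ACGTTGACAATCGA",
    "TCATTGACAGTCCA",
    "GCCTTGACATTAGC",
    "ATGTTGACACGCTA",
]

def column_conservation(alignment):
    """Fraction of sequences that share the most common letter in each column."""
    n = len(alignment)
    return [Counter(column).most_common(1)[0][1] / n for column in zip(*alignment)]

def conserved_islands(scores, threshold=1.0, min_length=4):
    """Return half-open (start, end) runs where every column reaches the threshold."""
    islands, start = [], None
    for i, score in enumerate(scores + [0.0]):  # the sentinel closes a trailing run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                islands.append((start, i))
            start = None
    return islands

scores = column_conservation(ALIGNMENT)
islands = conserved_islands(scores)
print(islands)
for start, end in islands:
    print(ALIGNMENT[0][start:end])
```

Here the neutral flanks have drifted apart, while the six columns in the middle are identical in all four genomes, which is what a candidate binding site looks like in an alignment.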

There is a great mystery here, particularly for us multicellular organisms: why is the genome the same in every cell, while cells and tissues differ? Because different genes are active in different cell types; gene activity is regulated by transcription factors and, in addition, by DNA packing.

It is a complicated interplay, and it is still not very clear what is primary and what is secondary. The observation is that in intensively working regions of the genome the DNA is looser.

How do we find out where it is loose and where it is not? The following experiment is done. We take a protein that cuts DNA. First we isolate the DNA, but so gently that the packing does not change: we destroy the cells and cell membranes and add the protein. The protein cuts the DNA. The trick is that it is easy for this protein to cut loose regions and hard to cut tightly packed ones, because there it is difficult for the molecule to reach any particular point in the DNA. The DNA is isolated from a large population of cells.

Loose regions will yield many short fragments, and compact regions will yield fewer, longer fragments. Both the binding of proteins and the fragmentation of the genome are biologically determined things. Where the DNA was loose there are many cuts, and where it was tightly packed there are few.
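Turning cut positions into a map of loose and compact regions is then a counting exercise. The sketch below is an assumption-laden illustration, not a real analysis pipeline: the cut coordinates are invented, the genome is treated as a single short linear sequence, and the window size and cut threshold are arbitrary.

```python
from collections import Counter

def cut_density(cut_positions, genome_length, window=100):
    """Count cuts in each fixed-size window along a linear genome."""
    counts = Counter(position // window for position in cut_positions)
    num_windows = (genome_length + window - 1) // window
    return [counts.get(i, 0) for i in range(num_windows)]

def classify(density, min_cuts=5):
    """Label a window 'loose' if it collected at least min_cuts cuts."""
    return ["loose" if count >= min_cuts else "compact" for count in density]

# Hypothetical cut coordinates pooled over a large population of cells (invented).
cuts = [12, 30, 41, 55, 67, 80, 93, 150,
        310, 320, 333, 347, 359, 371, 388, 395, 460]
density = cut_density(cuts, genome_length=500)
labels = classify(density)
print(density)
print(labels)
```

Pooling over many cells matters: in a single cell each position is either cut or not, and only the population gives a frequency that reflects how accessible the region is.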

"Astrology for people with a good memory"

Colleagues have learned to predict not only how genes work now, at the moment of the experiment, but also how the cell would have worked some time later if we had not killed it. The trick is that in the image each cell becomes a vector: its current state and the state projected into the future. We can look at these vector fields. The flow of cell changes goes from left to right. The picture shows how precursors turn into different cell types. To draw and interpret such pictures, you need good mathematics.

If you know which genes regulate other genes, you can draw the result as a network. You can see how hierarchies and small elements of the network are arranged. This is, so to speak, the zoological part of the science. And you can also do research in evolutionary biology.
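A regulatory network of this kind can be held in a simple map from regulators to their targets. The sketch below is a hypothetical toy: the gene names are invented, and the two questions it asks (which regulators sit at the top of the hierarchy, and where a regulator controls a target both directly and through an intermediate regulator) are only examples of the "hierarchies and small elements" mentioned above.

```python
# Hypothetical map: regulator -> genes it regulates (all names are invented).
NETWORK = {
    "regA": {"regB", "geneX", "geneY"},
    "regB": {"geneX", "geneZ"},
    "regC": {"geneZ"},
}

def top_level_regulators(network):
    """Regulators that are not themselves regulated by anything in the network."""
    regulated = set().union(*network.values())
    return sorted(r for r in network if r not in regulated)

def feed_forward_loops(network):
    """Triples (a, b, c) where a regulates b, and both a and b regulate c."""
    loops = []
    for a, targets_a in network.items():
        for b in targets_a:
            for c in network.get(b, set()) & targets_a:
                loops.append((a, b, c))
    return sorted(loops)

tops = top_level_regulators(NETWORK)
loops = feed_forward_loops(NETWORK)
print(tops)
print(loops)
```

For real networks with thousands of genes the same questions are usually asked with a graph library, but the data structure stays the same.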

We studied the evolution of a regulatory system and made a number of specific predictions along the way. We had a big table listing which gene is where and how it is regulated. We published a paper, and then I gave a seminar in Hesse, in Germany. As is customary, after the seminar those who want to talk to the speaker sign up, and we talk.

I walk into yet another laboratory. There is a serious-looking Frau Professor, our paper is on her desk, and the table in it is marked with ticks. I realize that somewhere a graduate student is sitting and checking the predictions line by line. She noticed me looking at the table, smiled and said, "So far everything is fine." In fact, this is a risky science, a kind of astrology, but an easily verifiable one: "astrology for people with a good memory."

Portal "Eternal youth" http://vechnayamolodost.ru

