20 May 2010

Molecular Biology + Computers = Bioinformatics

Bioinformatics: Molecular biology between a test tube and a computer

We publish the transcript of a lecture by Mikhail Gelfand, Doctor of Biological Sciences, Deputy Director of the A.A. Kharkevich Institute for Information Transmission Problems of the Russian Academy of Sciences and Professor at the Faculty of Bioengineering and Bioinformatics of Moscow State University, delivered on April 1, 2010 at the Polytechnic Museum as part of the project "Public Lectures of <url>".

Good afternoon. I think that before we start it would be right to observe a moment of silence in memory of our fellow citizens, residents of our city, who died on Monday.

Well, thank you. I will talk about bioinformatics (1).

Konstantin Viktorovich Severinov, who talked about biology last time, began with an utterly enchanting slide: an e-mail he had received from a candidate of military sciences, a retired colonel, demanding that the teaching of evolution be banned immediately. I received it too, so there is little point in repeating it, though the tradition of starting with some crazy nonsense is apparently a sound one. In addition, on the ground floor of this building there is an announcement of lectures by a certain Dr. Chudinov, so attempts to clear this building of such nonsense are failing as well. So it all seems to be quite in the spirit of the times…

Boris Dolgin. Over time, I hope it will disappear by itself…

Mikhail Gelfand. Yes, it will be replaced by lectures by "<url>". So, if you type the phrase "academy of bioinformatics" into Google, this is what will come up (2).

"Academy of Bioinformatics"

Previously, this page was one of the first results for the word "bioinformatics". That is no longer the case, and now you have to look for it, but it still comes up for "academy of bioinformatics".

So, this is not what we are going to discuss (4).

And we will talk about normal meaningful molecular biology, which people have learned to do not only in living beings, which is called in vivo, not only in test tubes, which is called in vitro, but also in a computer, for which they came up with the name in silico. In reality, there are still some experimental data under all this, but the computer has become an important means of processing them.

If we do another Google search – for the phrase "the genome has been decoded" – then much more interesting things will come out there (5).

the genome has been decoded!

There will be about 600,000 English-language pages, and among them you will find the "human genome", and not just the human genome but its three-dimensional structure (I may come back to this later for a minute), sorghum, a Pseudomonas bacterium, again the three-dimensional human genome, a cancer genome (the disease, not the crayfish, which is the same word in Russian; this one is human too), a dog.

In Russian, corn comes up, then again a brain tumor, then, of course, the "genome of a Russian person", followed by the genome of a pig, the genome of a bacterium that damages teeth, and the genome of a Neanderthal.

Unfortunately, this is not entirely true.

Another slide of the same kind (6) shows the number of complete bacterial genomes available. I stopped at 2007 and did not have time to collect newer figures; the number grows exponentially, like this.

622 complete bacterial genomes (more than a thousand in 2010)

The untruth here is that when people say "the genome has been decoded", they mean that the sequence of nucleotides that make up the genome has been determined.

And this is not decoding. If anyone remembers this frame (7): the Gestapo chief is holding a note that was taken from the unfortunate Professor Pleischner. Imagine for a moment that it had truly been deciphered. Stirlitz, also known as Isaev, would never have left that room, and Müller, accordingly, would have received another order.

Is the genome decoded?
Intercepting an encrypted message
does not yet mean understanding it

And so what is colloquially called "genome decoding" is actually not so much decoding as intercepting an encrypted message.

Instead of the DNA molecule that floated in the test tube, you now have a sequence of nucleotides that make up this molecule and which are now recorded in the computer. But sometimes we understand the meaning of this intercepted message, and more often we don't.

To understand what we are talking about: slide (8) shows one tenth of a percent, that is, one per mille, of the genome of Escherichia coli. This is a standard laboratory organism, apparently the most studied living being on Earth. In general, to give a sense of the scale of the disaster, the genome of a bacterium is several million nucleotides long and holds from several hundred to several thousand genes, and most of the genome encodes proteins.

0.1% of the E. coli genome

A slide of the same size (9) holds a fraction of the human genome that is three orders of magnitude smaller.

The human genome is about 3 billion nucleotides and about 20 thousand genes, in fact not many more than in a large bacterium. And most of the genome does not encode proteins but has all sorts of other functions. I am not going to say anything about that because, firstly, little is known about it and, secondly, it would be too specialized.

0.0001% of the human genome

And when people say that the genome of some creature has been deciphered, they mean that you can draw this: make, so to speak, wallpaper out of letters.

Well, great: why not do experimental biology now and study it all at leisure, since the data are available? The problem is that we lack the capacity for this. This picture (10) illustrates the catastrophe: years are on the horizontal axis, and the amount of data is on the vertical axis.

Data volume growth

Let me draw your attention to the fact that this is a logarithmic scale, that is, neighboring divisions differ by an order of magnitude.

Red shows the number of articles published in a given year in the PubMed database, the basic bibliographic database for biomedicine, including molecular biology. If you look closely, you can see that this is a slightly inclined straight line in logarithmic coordinates, that is, exponential growth, but a very slow one. Blue shows the number of fragments of different genomes in GenBank, another database: the standard repository where all sequenced DNA fragments are stored (sequencing, that is, determining a sequence, is what is incorrectly called "decoding"). Green is the volume of GenBank in nucleotides (the elementary units of DNA; for us, simply symbols). If we assume, for the sake of argument, that one article describes one experiment done with one gene, which is true to a first approximation, and that one fragment contains one gene, which is also true to a first approximation, then it is clear that around 1995 a catastrophe occurred: one line crossed the other, and now more genes are known than we are, in principle, able to study, even if we do nothing else.

And there was a hope that something useful could be done without studying the genes experimentally one by one, by looking at the whole set and using various kinds of computational reasoning. That's what I'm going to talk about.

A few more words about what I won't tell you about (11).

Not just texts

There are other types of data, also massive.

In general, an amazing thing has happened to biology in recent years – it has become a science rich in data, like astrophysics and high-energy physics. There are more concrete facts than we are able to analyze one by one. So, there is data generated by other types of experiments, we can talk about how intensively genes work, i.e., we can, say, measure the concentration of proteins in a cell. You can massively study protein-protein interactions or protein-DNA interactions – these may be some kind of structural complexes, they may be some kind of regulatory interactions. You can study the structure of the genome…

There is a problem with this data. When we talk about the genome, we are dealing with a completely discrete and well-defined object. Of course, the genomes in different cells of the same organism differ slightly because of random changes, but not by much, and to a first approximation this can be neglected. So it makes sense to talk about the genome of a particular person. We can also talk about the genome of humans as a species, while understanding that the genomes of two individual people are, of course, different. When we talk about, say, the level of gene activity or protein concentrations, then, firstly, the data are quite noisy, because the experiments are not that good, and secondly, we need to understand that we are constantly averaging over a very large number of individual differences: protein concentrations in different cells are not at all identical, even within the same tissue. In all such data we are dealing with some average: over tissues, over the cell cycle, and so on. And the picture still turns out to be very beautiful. This is another trouble of this science: there are very beautiful pictures, and in drawing them the substance is sometimes lost.

This slide (12) shows the development cycle of the malaria parasite Plasmodium, which lasts approximately two days.

Expression (level of work) of genes

The horizontal axis is time, and the vertical axis is different genes.

The color shows the level of work of a gene: green is below average, and red is above average. Roughly speaking, it is the concentration of the protein that the gene encodes. If the genes are arranged in the right order, a remarkable cyclicity becomes visible, which depends on the developmental stage of the malaria parasite. And then it turns out that if the genes are combined into functional groups, that is, groups of genes that naturally work together, this cyclicity is even more pronounced. This is actually a very good, very useful activity: for the first time, we have the opportunity to describe the work of the cell as a whole, and not just of some small pieces of it. Here is the same kind of picture (13), only about flowers: this is flower development in thale cress (Arabidopsis thaliana). Genes are again along the vertical axis, different conditions are along the horizontal axis, and if everything is ordered properly, rectangles form: these are groups of genes that work together in the same organs of the flower.

Flower development in thale cress:
two-way clustering, by genes and by conditions

The last of the beautiful pictures (14) shows protein-protein interactions: the individual points are proteins, and a line connecting two of them means that these proteins physically interact in the cell.

And this is how genes regulate each other's work. Here the dots are genes (and, at the same time, the proteins they encode; to a first approximation, for our purposes, this is the same thing), and an arrow means that one gene regulates the work of another. The arrows are of different colors because the regulation can be of different kinds. In this way you can look at how the whole cell is arranged.

Protein-protein (structural, signaling, etc.)
and protein-DNA (regulatory) interactions in yeast

I have already said that the three-dimensional structure of the genome has been deciphered: now you can not only write out the genome as a sequence of symbols, but also tell which parts of this molecule are physically close to each other.

Of course, this is also averaged over many cells at once. This result dates literally from December, and it is interesting to look at: rather curious results are already being obtained.

And I won't talk about all this anymore, although bioinformatics plays one of the central roles in such works. And I will talk about how to decode genomes now in the right sense (15).

Tasks

For example, we want to find out where the genes are in this long DNA sequence.

I have already said that about 90% of a bacterial genome consists of protein-coding regions, but the problem is that we do not know in advance which parts of the bacterial genome encode proteins and which are busy with something else. In addition, we want to carry out functional annotation: to say what genes and proteins do, that is, what the function is of each protein encoded in the genome. We want to know about regulation, that is, how it all works: when and under what conditions these genes are turned on, in which tissues, under which external conditions. And, ultimately, the global goal is to say something not about individual genes and individual proteins, but about genomes and organisms as a whole. In fact, quite a lot can often be done right now. For very many bacteria whose genomic sequences have been determined, sequencing is the only experiment that has ever been done with that bacterium. It turns out that just by looking at the genome sequence we can describe the basic metabolism of a bacterium fairly confidently. That is, we can say what it can use as nutrients, on which substrates it can grow, what it cannot do without, and what it can.

Now I will try to tell two stories. One is quite well-known, and all the main ideas have already been implemented there, but this example shows the basic principles of bioinformatic work.

This is the science of gene identification. We have DNA sequences in the form I have shown. Here is this slide (8), it was "honest", here all the letters were the same size, this is such a pure genome, as it came out after sequencing.

0.1% of the E. coli genome

And this slide (9) was actually not very "honest", because some letters were lowercase and some were uppercase, and the uppercase letters are the regions that encode proteins.

These data do not come straight out of the sequencing machine. Marking up the genome into sections that encode proteins (that is, genes) and sections that do not is one of the traditional tasks of bioinformatics. It was first posed in 1981-1982 by several people at once, and I am now going to try to talk about it.

0.0001% of the human genome

I will show this kind of picture (16) many times, so I will now try to explain what is drawn here.

The horizontal axis is a coordinate along the genome, that is, just the number of the nucleotide in the sequence, and each arrow means that the corresponding section encodes a protein. We want to get such a markup: to determine the beginnings and ends of genes. A separate question is how to find out what the function of these genes is, I will talk about this in the second story, but for now we just want to get arrows. What do we have for this?

Identification of genes

Firstly, there is the table of the genetic code (17), which was compiled in the early 1960s by the classics of molecular biology.

We know which triples of nucleotides (codons) correspond to which amino acids. This is what is drawn here: the triple CCC encodes proline, CCA also encodes proline, CTG encodes leucine, and so on. The table lists all the nucleotide triples and the 20 amino acids they encode in the standard genetic code.

Table of the genetic code

This is a very convenient thing, because if we happen to know the sequence of a protein, we can find the gene encoding it by simply recoding the nucleotide sequence into a protein sequence, formally, using the table.

Where everything matches (18), that is where our protein is encoded. This is actually not as artificial a situation as it seems, because when the protein composition of a cell is determined by mass spectrometry, this is exactly what is done. However, what is determined is not the sequences themselves but the masses of protein fragments. These masses are then compared with the masses of all the fragments that could be encoded in the genome. If there is a match somewhere, then that protein is present.
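The "formal recoding" described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the standard genetic code, an uppercase DNA string containing only A, C, G and T, and exact matching. The function names (`translate`, `reverse_complement`, `find_protein`) and the toy sequences are made up for this example.

```python
# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))  # "*" marks a stop codon


def translate(dna):
    """Translate DNA codon by codon, starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the sequence of the opposite DNA strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_protein(genome, protein):
    """Return (strand, offset) pairs where `protein` is encoded exactly.

    Offsets are nucleotide positions on the strand that was searched
    (for "-" they refer to the reverse-complemented sequence).
    """
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            translated = translate(seq[frame:])
            pos = translated.find(protein)
            while pos != -1:
                hits.append((strand, frame + 3 * pos))
                pos = translated.find(protein, pos + 1)
    return hits


toy_genome = "GG" + "ATGCCCCTGTGG" + "TAA"  # toy data, not a real genome
print(find_protein(toy_genome, "MPLW"))
```

All six reading frames are scanned because, in general, we do not know in advance on which strand and in which frame the gene lies.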

Search for genes if a protein is known: simple

The second task looks much more realistic.

We do not have exactly the protein that is encoded, but a protein related to it, that is, close in sequence. Then we do exactly the same thing, only now we hope not for exact matches but for approximate ones. Here (19) exactly matching positions are shown in green and non-matching ones in yellow, but it is absolutely impossible to see such a level of similarity by chance, so we can assume that a protein related to the given one is encoded here. Note, just in case, that gaps may form where an amino acid in one protein does not correspond to anything in the other. This is because during the evolution of proteins there are not only amino acid substitutions but also insertions and deletions.

...or a related protein: also simple

For what follows, here is what is essential: in protein-coding regions of the genome sequence, such insertions and deletions will always be multiples of three.

This is because a triple corresponds to one amino acid. If you suddenly make a nucleotide insertion of length one or two, you lose the reading frame: all the triples become different, and no reasonable protein is encoded anymore.
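A small sketch makes the effect of a frame shift visible. It assumes the standard genetic code; the gene below is a made-up toy sequence, and `translate` is a helper name chosen for this example.

```python
# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def translate(dna):
    """Translate DNA codon by codon from position 0 ("*" is a stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


gene = "ATGGCTAAAGGTCTGGAATGGTAA"             # toy gene, 8 codons
plus_three = gene[:6] + "CCC" + gene[6:]      # insertion of a whole codon
plus_one = gene[:6] + "C" + gene[6:]          # insertion of one nucleotide

print(translate(gene))        # original protein
print(translate(plus_three))  # one extra amino acid, the rest unchanged
print(translate(plus_one))    # everything after the insertion is different
```

With a three-nucleotide insertion the protein gains one amino acid and the tail is untouched; with a one-nucleotide insertion every downstream codon is read differently.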

Nontrivial computational problems already arise here. There is the nucleotide sequence of a new genome, and we compare it with all the already known proteins to see whether there is a relative among them, and map the corresponding gene. But the amount of data is growing exponentially. According to Moore's law, the power of computers also grows exponentially, but the exponent for computer performance is smaller than the exponent for the volume of genomic data. Therefore we have to keep inventing better and better algorithms, because otherwise any algorithm, generally speaking, begins to choke at some point. And that is where interesting mathematics and computer science come in.

And if we don't have related proteins, if the gene encodes a completely new protein? This also happens. Here you can use the structure of the genetic code. In the universal genetic code there are three stop codons (20), they do not encode any amino acid, but are a sign of the end of the gene.

Genetic code: stop codons

It is clear that they cannot appear inside the protein-coding region (in the correct frame).

Thus, we can simply consider the possible segments between in-frame stop codons. Genes can lie only inside such "open reading frames". This is a good technique, and it greatly reduces the number of possibilities, but it does not solve the problem completely, because we get quite a lot of overlapping open frames and do not yet have a way to choose which of them is correct. The picture (21) shows that, on average, each position is covered by one and a half to two open frames. That is, there are one and a half to two times more potential genes than real ones, which is not good.
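The search for open reading frames can be sketched as follows. This is a simplified illustration: only the forward strand is scanned (a real scan would repeat it on the reverse complement), segments that run off the end of the sequence without a closing stop are skipped, and the `min_codons` parameter and the toy sequence are assumptions of this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three stops of the standard code


def open_reading_frames(dna, min_codons=1):
    """Return (start, end) half-open intervals between in-frame stop codons.

    `end` is the position of the closing stop codon, which is not included.
    """
    orfs = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i))
                start = i + 3  # the next segment begins after the stop
    return sorted(orfs)


toy = "TAAATGAAACCCTAGGG"  # toy sequence, not real data
print(open_reading_frames(toy))
print(open_reading_frames(toy, min_codons=2))
```

Even in this tiny example two open frames overlap, which is the ambiguity the picture illustrates; a length threshold removes some of the spurious ones but not all.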

Open reading frames

The second consideration that we can use is that the genetic code encodes proteins, and proteins are not random sequences of amino acids, but in some sense biologically meaningful.

For example, different amino acids in proteins occur with significantly different frequencies, and these frequencies are more or less universal for all living beings: for example, tryptophan is rare everywhere, and lysine and leucine are frequent everywhere. Thus, triples that correspond to frequent amino acids will often occur in protein-coding regions. And there is no such pattern in non-coding regions, where different triples occur in the first approximation with the same frequency. And we can measure the non-randomness of the distribution of triples in certain areas.

In addition to the fact that the triples correspond to different amino acids, frequent and rare, there are also genome-specific features. In the table of the genetic code there are many synonyms (22) – codons that encode the same amino acid.

Genetic code: synonyms

And it turns out that the frequencies of synonymous codons are not the same: of the 6 codons that encode leucine, the frequencies of the most frequent and the rarest differ, I believe, by one and a half orders of magnitude in E. coli.

This can also be used as a statistical property in recognition (23).

Codon usage (codon usage statistics)

The result is something like this picture (24) (it is from a very old article, but not much has changed since then): the horizontal axis shows coordinates along the genome, and the vertical axis shows the values of a function that measures how similar a sequence is to some standard in terms of codon frequencies (in fact it is a little more complicated, but that does not matter).

The lines drawn here in some places are open reading frames. Given some fragment length, we can, for example, slide a window along the sequence, as on a slide rule, and calculate the value of our statistical function at each position of the window. If the function is a reasonable one, there will be plateaus over the protein-coding regions and dips over the non-coding regions. To a first approximation, this is how it turns out.
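One possible form of such a function is sketched below. The lecture does not specify the exact statistic, so this is an assumption: codon frequencies are learned from a set of known genes, each codon gets a log-odds score against a uniform background (1/64), and the score is averaged over a sliding window in one fixed reading frame. The names, the pseudocount and the toy training data are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_log_odds(training_genes, pseudocount=1.0):
    """Log-odds of each codon in known genes versus a uniform background."""
    counts = Counter()
    for gene in training_genes:
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: math.log((counts[c] + pseudocount) / total * len(ALL_CODONS))
            for c in ALL_CODONS}


def window_scores(dna, log_odds, frame=0, window_codons=30):
    """Average codon score in each window of `window_codons` codons."""
    per_codon = [log_odds.get(dna[i:i + 3], 0.0)
                 for i in range(frame, len(dna) - 2, 3)]
    return [sum(per_codon[k:k + window_codons]) / window_codons
            for k in range(len(per_codon) - window_codons + 1)]


# Toy training set: pretend that known genes strongly prefer CTG.
scores_table = codon_log_odds(["CTGCTGCTGAAA"])
profile = window_scores("CTGCTGCTG" + "GTCGTCGTC", scores_table, window_codons=3)
print(profile)  # high at the "coding-like" start, low at the end
```

In practice the profile would be computed in all six frames, and the window has to be long enough for the statistics to be meaningful.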

Statistical features

The more difficult problem is this: we can't make our window size too small, because otherwise the statistics won't work, there will be too much noise.

And because of this, we cannot accurately map the start of the gene. A gene can have many potential start codons, and it will be unclear which one to choose as the real start. This does not happen with stop codons, because when you see a stop codon, it means that the gene has ended and there is nowhere further to go. But the start codon ATG also encodes the amino acid methionine and may well occur in the middle of a gene. In bacteria, two more codons can serve as starts, and in the middle of a gene they encode the amino acids leucine and valine (25). Looking at the graph, we cannot choose among several potential start codons: the gene has already been mapped to a first approximation, but we cannot determine where it begins.

Genetic code: start codons

It turns out that here, too, there is something to hold on to, because before the start of the gene there is a sequence that the ribosome recognizes as a sign of that very start.

The ribosome is a cellular structure, or, as Severinov taught us to say, a nanomachine, that is responsible for protein synthesis (before him, Mikhail Valentinovich Kovalchuk also applied this term to the ribosome; unlike Severinov, however, he used it thoughtlessly, whereas Severinov understood what he meant). So, the ribosome binds to this site to start translation. Here are a few sequences (26): ATG is the start of the gene, its first codon; the gene itself continues further, I have not shown it; and before it there is a sequence that is recognized by the ribosome.

Starts of Bacillus subtilis genes

This is one of the traditional and earliest tasks of bioinformatics: the search for such functional motifs.

Here the examples are well chosen and, generally speaking, this motif can be seen by eye. So, somewhere in each fragment in this picture there is a section that the ribosome recognizes as the start of the gene, and the exercise is to spot this signal by eye.

A remark from the audience. AGG, probably…

Mikhail Gelfand. So, are there any other options? I'm going to wait another half a second now…

A remark from the audience. A lot of A's?

A remark from the audience. AGG has been said, a lot of A's has been said…

A remark from the audience. GAG?

Mikhail Gelfand. GAG has been said. Well, let me show you the answer (27). I specially selected several sequences with deviations, to hint that reality is in fact much nastier than what is drawn here. Nevertheless, we can come up with some formal recognition rule that recognizes such sites not only when they are perfect (AGGAGG, and everything is fine) but also in weaker variants. And we will take the start codon that has such a site in front of it. And we will be happy.
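A very simple "formal recognition rule" of this kind might look as follows. It is only a sketch under stated assumptions: the site is scored by the number of positions matching the perfect AGGAGG mentioned above, the site is looked for within a fixed number of nucleotides upstream of each candidate start (`upstream_len` is a hypothetical parameter), and the candidate starts are ATG, GTG and TTG. Real rules usually use position weight matrices and also take the spacing to the start codon into account.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
RBS_CONSENSUS = "AGGAGG"


def best_rbs_match(upstream):
    """Largest number of positions matching the consensus in any 6-nt window."""
    best = 0
    for i in range(len(upstream) - len(RBS_CONSENSUS) + 1):
        window = upstream[i:i + len(RBS_CONSENSUS)]
        best = max(best, sum(a == b for a, b in zip(window, RBS_CONSENSUS)))
    return best


def rank_starts(dna, orf_start, orf_end, upstream_len=20):
    """Rank in-frame candidate starts in [orf_start, orf_end) by upstream site.

    Returns (score, position) pairs, best score first.
    """
    ranked = []
    for pos in range(orf_start, orf_end - 2, 3):
        if dna[pos:pos + 3] in START_CODONS:
            upstream = dna[max(0, pos - upstream_len):pos]
            ranked.append((best_rbs_match(upstream), pos))
    return sorted(ranked, key=lambda item: (-item[0], item[1]))


# Toy open reading frame with two in-frame ATG codons; only the second one
# has a perfect site in front of it.
toy = "CCCCCCCCC" + "ATG" + "CCAGGAGGCCCC" + "ATG" + "AAACCC" + "TAA"
print(rank_starts(toy, 0, 33))
print(best_rbs_match("TTAGGTGGTT"))  # a weaker variant still scores well
```

Counting matches is the crudest possible scoring, but it already tolerates the "deviations" shown on the slide.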

Ribosome binding site

But there may be a situation when we know nothing: the protein has no known relatives, and the bacterium is new, so we do not know what its ribosome binding site looks like.

In fact, the site is also species-specific: in some bacteria it is not visible at all, or at any rate it has not been seen yet.

At the same time, there is an achievement of recent years: genomes began to arrive en masse, and for very many taxonomic groups of bacteria we now have many genomes from the group and can compare them. There is a remarkable observation that protein-coding regions evolve much more slowly than the regions between genes. It is clear why: there is a constant stream of random mutations, simply because of errors in genome replication during division, but if a mutation happens in a protein-coding region, it is very likely to spoil something in the encoded protein. If it happens in an intergenic region, where there are fewer functionally important positions, then mutations, although they occur with the same frequency, become fixed more often. Here is an alignment (28) of the genomes of six different bacteria: in the middle is the same E. coli, these are three Salmonella, and this is the plague bacillus.

Comparison of genes in related genomes

When we build an alignment, we arrange the sequences so as to maximize the similarity between them.

The assumption that is implicitly made here is that we are reconstructing the evolutionary history of the aligned regions.
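For two sequences, "arranging them so as to maximize similarity" can be sketched with the classical dynamic programming approach (global alignment in the style of Needleman and Wunsch). The scoring values below (match +1, mismatch -1, gap -1) are arbitrary assumptions for the illustration; the multiple alignments on the slides are built with more elaborate methods.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = align("ATGAAACCC", "ATGCCC")
print(best)
print(row_a)
print(row_b)
```

Several alignments may share the best score, so the traceback returns just one of them.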

Green shows how the starts of these genes were annotated in GenBank. For this one, Klebsiella, the causative agent of one of the atypical pneumonias, nothing had been annotated at all, because it was an incomplete genome that nobody had looked at yet. Now you can see that the right start is undoubtedly this one, because after it comes a highly conserved region, while before it everything falls apart completely.

I said that important positions change more slowly than unimportant ones. We have another kind of unimportant position. I showed the table of the genetic code, and in it synonymous codons often differ in the third position. That is, if two codons encode the same amino acid, then typically the first two nucleotides match and the third varies. Here is a fragment of the same alignment (29): an asterisk means that all the nucleotides in the column match, and the absence of an asterisk means that a substitution has happened somewhere. We see that within the protein-coding region, substitutions occur mainly in the third positions of codons.
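The asterisk-counting described above is easy to sketch. The assumptions here are that the aligned sequences have equal length, contain no gaps, and start exactly at the first position of a codon; the three toy sequences are invented and differ only by synonymous substitutions.

```python
def variable_columns_by_codon_position(aligned):
    """Count alignment columns with a substitution, per codon position.

    Returns [count at 1st position, count at 2nd, count at 3rd].
    """
    counts = [0, 0, 0]
    for index, column in enumerate(zip(*aligned)):
        if len(set(column)) > 1:  # no asterisk would be drawn under this column
            counts[index % 3] += 1
    return counts


toy_alignment = [
    "ATGCTGAAA",  # Met Leu Lys
    "ATGCTCAAG",  # Met Leu Lys
    "ATGTTAAAA",  # Met Leu Lys
]
print(variable_columns_by_codon_position(toy_alignment))
```

A strong excess of changes at the third position is a hint that the region is protein-coding and that the chosen frame is the right one.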

Frequency of nucleotide substitutions in protein-coding regions

Now you know everything you need to do the next exercise.

Here is another gene in E. coli, three Salmonella, and the plague bacillus: where does the gene (30) begin?

rbsD in enterobacteria

Remarks from the audience.

Mikhail Gelfand. One option was here. Any more? Accepted… Why not here? Because after this place there is an insertion of one nucleotide. In a protein-coding region, as we said, this cannot happen. So the right way to do it is really this (31).
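The argument used in this exercise can be turned into a mechanical check: scan an alignment for gaps whose length is not a multiple of three. The sketch below assumes gaps are written as "-" and uses made-up toy rows; the function name is hypothetical.

```python
import re


def frame_breaking_gaps(alignment):
    """Return (row, start, length) for every gap run not divisible by three."""
    suspicious = []
    for row, sequence in enumerate(alignment):
        for run in re.finditer(r"-+", sequence):
            length = run.end() - run.start()
            if length % 3 != 0:
                suspicious.append((row, run.start(), length))
    return suspicious


toy_rows = [
    "ATGAAA---CCC",  # a three-nucleotide gap: compatible with a coding region
    "ATGAAAGGGCCC",
    "ATG-AAGGGCCC",  # a one-nucleotide gap: the frame would be lost here
]
print(frame_breaking_gaps(toy_rows))
```

A candidate start that lies before such a gap is unlikely to be the real one, since the region between them can hardly be coding.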

rbsD in Enterobacteria: the answer

This illustrates the dangers that come with working with a single genome.

Above (32) there is also a suitable starting codon, and in front of it there is a sequence resembling a ribosome binding site – here four letters out of six match, and here four out of six match. Until we had five related genomes, computationally we could not choose between these possibilities, and when there are many genomes, we can.

The existing annotation (was) incorrect

Why were there many identical letters in the non-coding region, too?

Because in the intergenic regions (this will also be essential later) there are regulatory sites on which the work of genes depends; they are also functionally important and therefore also conserved. So we did see conservation, but caused by a different functional load: not by the fact that it is a protein-coding region, but by the fact that it is a regulatory sequence.

I have told this at some length, with a lot of detail, because, firstly, this is a really well-developed area and, secondly, because a moral follows from it (33). The moral is this: it is useful to combine many heterogeneous considerations, even though each of them may be quite weak. It is dangerous to rely on one thing, because usually there is nothing that we can do very well. The second consideration is that it is good to analyze a large number of genomes simultaneously, preferably genomes at different evolutionary distances from each other: very close genomes are useful for some tasks, and more distant ones for others. This is the conclusion that I hope follows from this section.

Moral

Now I'm going to try to tell you about a new result. Well, not that it is completely new, it is an ongoing study, but the final result was published only last year.

This is an example of how with the help of bioinformatics, with the help of comparative genomic analysis, it is possible to do something completely new, it is possible to tell biologists in a language they understand things that they did not know about before and that are of interest to them. It will be a story about transporters.

Transporters are proteins that sit in the cell membrane and play the role of gates: they let various substances in and out. Accordingly, it is by means of transporters that the cell feeds, pumping something nutritious inside, and by means of transporters that it throws out waste products, and so on. I will talk about importers, the transporters that carry things inside.

Whenever you want to drag something into the cell, you drag it against the concentration gradient: there is already more of this substance in the cell than in the external environment. So you cannot just make holes in the membrane, simple pores. If you just had pores, substances would flow in the opposite direction, from the place where the concentration is high to the place where it is low. And the cell needs the opposite. To drag something against the gradient, it needs to spend energy.

And the cell spends energy in two main ways, there are also others. Accordingly, there are two main classes of transporters (34).

Transporters

The first method is implemented by the so-called ATP-dependent transporters.

These are transporters that break down one molecule of ATP (adenosine triphosphate) per act of transport, releasing energy. ATP is, in general, the main energy store of the cell. An ATP-dependent transporter consists of three types of subunits: firstly, proteins that sit in the membrane and form a channel; secondly, an ATPase, a protein that splits ATP into ADP and a phosphate group, releasing energy; and thirdly, an external substrate-binding protein, which catches molecules of the substance that has to be dragged inside. And for one act of ATP breakdown (to a first approximation) you let one molecule of your substance into the cell from the outside.

The second method is that of the so-called secondary transporters. First, you create a difference in the concentrations of, for example, hydrogen ions (that is, simply protons) inside and outside the cell. Then, when you drag something inside against its gradient, you simultaneously let a hydrogen ion move down its own gradient, reducing that concentration difference. There is an exchange, and this is what is called secondary transport.

These are two completely different machines. The only thing they have in common is that in both there is a protein that sits in the membrane.

Transporters are a gold mine for bioinformatics, because they are difficult to study experimentally. Biologists know quite a lot about enzymes but much less about transporters, because they are much harder to work with. On the other hand, it is easy to identify transporters simply from their sequence. Firstly, they form large families of similar proteins, and sometimes they can be identified simply by their similarity to already known transporters. Secondly, even if it is something new, it is a protein that passes through the membrane several times, and its transmembrane segments have a rather characteristic amino acid composition, so they are easy to identify.
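The lecture does not say which method is used to spot transmembrane segments. One classical way to exploit their "characteristic amino acid composition" is a hydropathy profile, sketched here as an assumption: the Kyte-Doolittle hydropathy scale is averaged over a window of 19 residues, and windows above a threshold of 1.6 are flagged (both values are conventional choices, not taken from the lecture). The toy protein is invented.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy in each window of `window` residues."""
    values = [HYDROPATHY.get(residue, 0.0) for residue in protein]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def has_candidate_tm_segment(protein, window=19, threshold=1.6):
    """True if some window is hydrophobic enough to suggest a membrane span."""
    return any(score > threshold for score in hydropathy_profile(protein, window))


soluble_part = "DEKRNQ" * 5                   # toy hydrophilic stretch
membrane_part = "L" * 10 + "I" * 5 + "V" * 5  # toy hydrophobic stretch
print(has_candidate_tm_segment(soluble_part + membrane_part + soluble_part))
print(has_candidate_tm_segment(soluble_part * 2))
```

Counting how many separate hydrophobic stretches a protein has gives a rough idea of how many times it crosses the membrane.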

What is difficult is predicting the specificity of transporters. You have found a transporter, you know its transmembrane segments and which transporters it resembles, but from the sequence alone you can never guarantee which substrate it imports.

I'll try to explain it now. The picture (35) shows a phylogenetic tree of proteins. At the ends of the twigs are different transporters. The length of the branches, including the internal ones, is the level of similarity in sequence. We believe that the level of similarity reflects the degree of kinship. So it's just, in a sense, a family tree of these proteins. And the colors mean different substrates: nickel transporters, cobalt, dipeptide transporter, dipeptide transporter again, nickel again, cobalt again… Transporters with the same function tend to be similar to each other, but if I erased all the colors in this picture and left only experimentally determined specificities, then there would be no way to say anything about new transporters just by looking at the level of similarity.

Substrate-binding proteins, a family of "nickel and oligopeptide" transporters

The second example is a tree of transporters of different vitamins.

Red are transporters of NAD (nicotinamide adenine dinucleotide), pink are transporters of riboflavin (vitamin B2), blue are transporters of thiamine (vitamin B1), and green is a deoxynucleotide transporter. And again we have the same mosaic along the tree, relatives tend to cluster, but there is no good rule.

Family of vitamin transporters

This story began more than 10 years ago, when we studied the pathway of riboflavin synthesis (37).

The metabolic pathway of riboflavin synthesis (vitamin B2)

This substance is part of many enzymes as a cofactor, a small molecule that binds to the reaction center of the enzyme and participates in catalysis.

Our goal was to study the regulation of genes encoding enzymes from this pathway, and the prediction of specificity turned out to be a by-product. There are precursors, of which there are plenty in the cell, and then there is a chain of reactions that leads to riboflavin. We saw that a highly conserved sequence occurs upstream of the genes of the riboflavin pathway. Here (38) the letters are not visible, but the colors, I hope, are: red marks absolutely conserved positions, and there are many of them. And the bacteria are very different. Generally speaking, this does not happen; it is an exotic situation, and there is a separate story about why it happened.
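What the red columns on such a slide mean can be shown in a few lines: given sequences that are already aligned, list the positions where every sequence has the same letter. This is only a sketch; the alignment below is invented for illustration and is not the RFN element, and the function name is hypothetical.

```python
def conserved_columns(alignment):
    """Return indices of columns where all aligned sequences share one letter.

    `alignment` is a list of equal-length strings; '-' marks a gap and a
    column containing only gaps is not reported as conserved.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {row[i] for row in alignment}
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Invented toy alignment of upstream regions from three "genomes".
toy_alignment = [
    "ACGUAGGC",
    "ACGAAGCC",
    "ACGUUGGC",
]
print(conserved_columns(toy_alignment))
```

When most columns come out conserved across distant bacteria, as in the lecture, that is the unusual signal worth following up.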

Conserved sequence upstream of riboflavin pathway genes from very different bacteria

Here is the person (39) who saw all this, Lyosha Vitreschak; he was my graduate student at that time.

He saw that these sequences could be folded into such a structure.

Conserved secondary structure of the RFN element

When a new genome arrives and you see such a thing in it, you can identify it very easily and unmistakably.

Here is the regulation scheme (40); it is not so important now. Later this came to be called an RNA switch, and these are quite popular objects.

RFN: Regulation mechanism

What matters to me is this: when we began to look for such structures, we saw that they consistently occur upstream of the genes of the riboflavin pathway. In the picture (41) the genes are multicolored arrows, and the structure is a black arrow.

And so we have five genes of the riboflavin pathway, and in front of them – once per genome usually – there is such a structure. Which is very reasonable, if you believe that it really regulates the synthesis of riboflavin, as it later turned out. And on the left in the picture is the taxonomy of the bacteria in whose genomes we looked at it. And in one group – the gram–positive ones - such a structure was found before another gene, about which nothing was known.

...and before one more gene (ypaA)

Then it was natural to ask what this gene does, which is what we actually did.

Here is what we understood about it (42).

YpaA/RibU: riboflavin transporter

First, we saw that it encodes a protein with five potential transmembrane segments, which means that it is most likely a transporter.

Then we saw that it is regulated in the same way as the riboflavin synthesis genes, because we saw the same site. Why should a transporter be regulated like that? For example, it could transport riboflavin itself. If the bacterium lacks riboflavin, it switches on everything it has: first, biosynthesis (it tries to make riboflavin itself), and second, a transporter, in case something can be pumped in from the environment. But it could also be a transporter of some riboflavin precursor, importing something useful from the middle of the metabolic pathway and saving some of the synthesis reactions.

When we looked carefully, it turned out that there are two bacteria that have this potential transporter, it is regulated by riboflavin (according to the previous reasoning), and there is no riboflavin pathway at all. Thus it must be a riboflavin transporter, because a precursor transporter would be useless: the bacterium has no enzymes that could convert this precursor into the final product. That is the beauty of working with complete genomes: if you do not see something, then it really is not there, and there is no possibility that it is hiding in the unsequenced part. So, streptococcus and enterococcus do not have a riboflavin pathway and cannot make riboflavin themselves, but they have an unexplained transporter that is regulated by riboflavin. Therefore it must be a riboflavin transporter; there are simply no other logical possibilities left. We predicted this in 1999, experimental articles were published in 2000 and 2006, and it turned out to be true.

Then another, very similar story, also about a vitamin, this time biotin. We studied the regulation of the biotin pathway (43) (red circles are potential regulatory sites in DNA, arrows are genes), and again we saw an orphan transporter, and again it was regulated in the same way as the biosynthesis genes. Thus, it has something to do with biotin. And since there are genomes that have no biotin pathway at all but do have this transporter, it is a biotin transporter.

BioY Biotin Transporter

We also carefully checked that these bacteria really need biotin, that is, that they have biotin-dependent enzymes in which biotin is a cofactor.

Therefore, this is a biotin transporter, just like in the "previous episode". But there was something else (I am now starting to hang guns on the wall, which will fire later). Next to this biotin transporter there were two genes hanging around, which occurred in some genomes and not in others. What they did was unclear; they were similar to components of ATP-dependent transporters, and in particular there was an ATP-binding protein (such proteins are easy to recognize, you cannot confuse them with anything). But since it was all very chaotic, and there were genomes with no such proteins, we wrote in small print at the end of the article that there was something like that, and did not interpret it in any way. So it was left for future generations.

There was a similar story with vitamin B1 (thiamine); these are generally rather monotonous stories. The pathway has two branches (44), and a number of transporters were predicted here (they are framed in the picture), but the considerations are the same.

Metabolic reconstruction of the thiamine biosynthesis pathway (vitamin B1)

Why do we think it is a thiamine transporter (45)?

Because it has several predicted transmembrane segments, it is regulated in the same way as the thiamine pathway genes, and it occurs in genomes that have no thiamine pathway but still need thiamine. Therefore it is a thiamine transporter; there is no way around it.

This is a more beautiful case (46), because it is a transporter of an intermediate product. The story is the same: it is regulated together with the thiamine genes, but it does not occur in genomes that have no thiamine pathway, that is, it does not replace the entire pathway. Therefore, it is not a transporter of the final product. But in the genomes where such a transporter occurs, one of the genes of the initial stage of the pathway may be missing. So it is clear that this transporter replaces that reaction, and therefore it is a transporter of an intermediate product.
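The reasoning used for riboflavin, biotin and thiamine follows one decision rule over complete genomes, which can be written down explicitly. The sketch below is an editorial illustration, not the group's actual software: the `Genome` record, the gene names and the returned labels are all hypothetical, and real analyses weigh many more considerations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genome:
    name: str
    pathway_genes: frozenset  # biosynthesis genes found in the complete genome
    has_candidate: bool       # candidate transporter gene is present
    co_regulated: bool        # candidate carries the same regulatory site as the pathway


def classify_candidate(genomes, early_genes, late_genes):
    """Apply the presence/absence logic from the lecture to one candidate."""
    carriers = [g for g in genomes if g.has_candidate]
    if not carriers or not all(g.co_regulated for g in carriers):
        return "no prediction"
    pathway = early_genes | late_genes
    # Candidate present where the whole pathway is absent: a precursor
    # would be useless there, so it must import the end product.
    if any(not (g.pathway_genes & pathway) for g in carriers):
        return "end-product transporter"
    # Candidate never replaces the whole pathway, but an early step may be
    # missing while the late steps are intact: it imports an intermediate.
    late_intact = all(late_genes <= g.pathway_genes for g in carriers)
    early_gap = any(early_genes - g.pathway_genes for g in carriers)
    if late_intact and early_gap:
        return "intermediate transporter"
    return "no prediction"


EARLY = frozenset({"early_step"})
LATE = frozenset({"late_step_1", "late_step_2"})

riboflavin_like = [
    Genome("genome_A", EARLY | LATE, True, True),
    Genome("genome_B", frozenset(), True, True),  # no pathway at all
]
print(classify_candidate(riboflavin_like, EARLY, LATE))
```

The rule depends on the genomes being complete: an absent gene is treated as truly absent, which is exactly the point made about streptococcus and enterococcus.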

I showed these slides to get you used to this logic and to show how different small considerations can work at the same time. And then there was this story.

We studied another vitamin, cobalamin (B12), in the same way: metabolism and regulation. We wrote an article in which, among other things, we predicted a number of cobalt transporters (the cobalt ion is part of cobalamin). And we received a letter from colleagues at Humboldt University in Berlin, who wrote to us very politely that our article was absolutely wonderful, but since we were apparently not biochemists, we did not understand the simple fact that cobalt and nickel are very similar, so any cobalt transporter is also a nickel transporter and any nickel transporter is also a cobalt transporter, because the cell cannot tell them apart. And they, as biochemists, had been studying this for a long time and successfully. Therefore, they wrote, we should be careful with our conclusions, because we know bioinformatics but not biochemistry.

We replied just as politely that we, frankly, could not care less about their biochemistry. You can make a protein do anything, but we understand that the cell uses these transporters precisely as cobalt transporters, because they are regulated by the absence of cobalamin (why should a nickel transporter be regulated by the absence of cobalamin?), and these genes are located in the same places in the genome as the genes for cobalamin synthesis, where nickel transporters have no business being.

We carried on this pleasant correspondence for some time, and then Dima Rodionov (48), who was the main author of this work, won a small European grant and said that he wanted to learn experimental biology and, since there were Germans ready to talk to us, he would go and work in their laboratory. And he went to Thomas Eitinger (48), who had written us all these letters, to do this project: to look systematically at cobalt and nickel transporters.

Dmitry Rodionov -> Thomas Eitinger

What are the considerations here? (47) The first is colocalization: genes that do the same thing like to be together in the bacterial genome.

As a single observation it is weak, but when you observe it systematically, you can believe it. Accordingly, nickel transporters sit together with the genes of nickel-dependent enzymes, and cobalt transporters sit together with the genes of cobalamin synthesis. The second is regulation: cobalt transporters are regulated by a cobalamin RNA switch, an easily recognized structure that responds to a lack of cobalamin (we studied it at the very beginning), and nickel transporters are regulated by a nickel repressor. That is a different motif, but you can also see that they are regulated by a lack of nickel, and therefore they are nickel transporters.
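The point that one colocalization is weak but a systematic pattern is convincing can be made concrete by tallying the two kinds of evidence over many genomes. This is a minimal sketch under stated assumptions: the hint labels, the `min_margin` cutoff and the function name are invented for illustration and do not come from the original work.

```python
from collections import Counter

# Hypothetical evidence labels; the names are illustrative only.
COBALT_HINTS = {"cobalamin_synthesis_gene", "cobalamin_rna_switch"}
NICKEL_HINTS = {"nickel_dependent_enzyme_gene", "nickel_repressor_site"}


def predict_substrate(loci, min_margin=2):
    """Tally colocalization and regulation hints across genomes.

    `loci` is a list of sets; each set holds the hints observed next to
    one copy of the transporter in one genome. A call is made only when
    one substrate leads by at least `min_margin` votes.
    """
    votes = Counter()
    for hints in loci:
        votes["cobalt"] += len(hints & COBALT_HINTS)
        votes["nickel"] += len(hints & NICKEL_HINTS)
    if abs(votes["cobalt"] - votes["nickel"]) < min_margin:
        return "ambiguous", dict(votes)
    winner = "cobalt" if votes["cobalt"] > votes["nickel"] else "nickel"
    return winner, dict(votes)


# A single observation is not enough to make a call...
print(predict_substrate([{"cobalamin_synthesis_gene"}]))
# ...but the same pattern repeated across genomes is.
print(predict_substrate([
    {"cobalamin_synthesis_gene", "cobalamin_rna_switch"},
    {"cobalamin_rna_switch"},
    {"nickel_dependent_enzyme_gene"},
]))
```

Counting votes equally is a simplification; in practice the regulatory signal and gene neighborhood would be weighted by how reliable each is.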

Co and Ni

And so Dima went to Thomas, but the experimental work somehow did not take off, and on the computer he saw this (49): the inside of the cell is shown, and in the membrane there are five different families of nickel and cobalt transporters (some of them are pure cobalt, some are both cobalt and nickel, and some are pure nickel).

The secondary transporters are at the bottom, and the ATP-dependent ones are at the top, because they have an ATPase.

Five families of transporters

For one family there was a wonderful picture (50), a very good evolutionary tree.

Unlike what I showed earlier, this family split neatly into a nickel branch and a cobalt branch, and nothing was mixed up with anything.

A new family of Co and Ni transporters

The genomic loci were also arranged well (51): in the cobalamin locus there were transporter genes, cobalamin synthesis genes and a regulatory element, and in the nickel locus there was likewise a nickel-dependent enzyme, a nickel regulator and our transporter.

Everything is absolutely wonderful, just a picture from a textbook, I tell it to students.

The structure of loci

Then they did an experiment (52), since the laboratory is experimental, and it turned out that, indeed, the transporters predicted as cobalt transporters import cobalt and practically no nickel, and those predicted as nickel transporters import nickel but not cobalt…

Thomas specially drew a magnified inset to show that the cobalt transporter still imports a little nickel; otherwise, as a biochemist, he would have felt embarrassed.

Verification: Ion transport test

Everything is very good.

But the problem was the following: there was too much of everything (53). The picture on the right shows how the genes are arranged in the genome, and on the left how the proteins are arranged in the membrane. Here we have an ATPase: great, we know that we should have an ATPase. We have a transmembrane protein. The ATPases and transmembrane proteins of cobalt and nickel transporters are similar. And after that it is not very clear. Here, it seems, are substrate-binding proteins; they have one transmembrane segment and an external domain, everything is as it should be. But for some reason there is another transmembrane protein, it is there everywhere, and it is clearly superfluous: the traditional scheme does not need it.

Structure: too many components

Further, when we looked at the ATPase and the transmembrane protein, it turned out that they were very similar to those same biotin proteins that we had seen before and could not say anything about.

This is an old picture (54), actually two, from two different articles. Circled in red are the cases where our predicted biotin transporter had these additional proteins, about which we knew nothing. Circled in green are the cases where it exists completely alone, without additions, and therefore does not need them.

BioY Biotin Transporter

Then Dima persuaded Thomas to do an experiment that at first glance was completely meaningless: take a normally working cobalt transporter and kill its ATPase.

And the argument was this: look, the biotin transporter can work without ATPase (in some genomes), and the systems are very similar, so this one will probably also be able to work without ATPase. And they did it. And it turned out that, indeed, if you take a complex that looks like a normal ATP-dependent transporter, well, except with some additional appendage, and kill, generally speaking, a vital component from it, what remains still works (55). It works worse, but it works. But if you kill a protein that was kind of superfluous there, then everything breaks down. And this was actually the first example of such a transporter that connects both ATP-dependent and ATP-independent transport, there were no other such examples.

For transport, the MN component is sufficient (the first example of such an ABC transporter)

Then, of course, they immediately did the same experiment with the biotin transporter (56). Dima and Thomas are the authors here again, and I am no longer among them…

In general, unlike Severinov, who last time talked about classic works, I am talking about works that will become classics in fifty years; since I am not an author here, it is easy for me to say so. It was the same story. The biotin transporter works alone, and if it has the additional ATPase component, it works more intensively. The kinetics are simply different.

BioY is also sufficient (even in genomes containing BioMN); BioMNY has better kinetics

And then it turned out that there are a lot of such potential transporters.

This is drawn in a picture from the website of Thomas's laboratory (57) (he has now made it practically his main subject of study): there is a core complex that is the same for all of them, a transmembrane protein and an ATPase, and there are additional components that, moreover, can work on their own, as in the case of biotin. And now that we knew what to look for, we saw such genes next to a very large number of other transporters that we have already met (riboflavin, thiamine, hydroxymethylpyrimidine), the ones I talked about at the beginning of this story.

The tip of the iceberg?

Then Dima defended his thesis and went as a postdoc to Andrey Osterman in San Diego (58).

Andrey is an absolutely wonderful person; he is a true, dyed-in-the-wool biochemist who has completely converted to the new faith. That is, he continues to do biochemistry very successfully, he is interested in discovering new enzymes, and this is his main occupation, but he has realized that preliminary computer analysis is a very powerful way to detect new enzymatic activities. He learned how to do it, but he does not look at regulation, he only looks at where the genes are located, and Dima went to him precisely to do regulation, which contributes a lot to these tasks. And then, when Dima found himself among biochemists and began to talk to them, he and Andrey discovered that a lot of people in different laboratories were actually studying such transporters without knowing that they were all studying the same thing, because each studied them separately. And a wonderful article was published, "A new class of modular transporters"; it has authors from 5 different laboratories (4 experimental ones and ours).

It turned out to be a whole universe. This is how these transporters are arranged: there are systems like the ones I have shown, with a transmembrane protein, an ATPase and an additional component. The transmembrane protein and the ATPase are similar in all of them, and the additional component is different in each, so it is the one that determines specificity. And here they are (59): biotin, cobalt, nickel, which we have already studied; thiamine, which we predicted and other people tested experimentally; some cobalamin precursors, it is not known which; the amino acid methionine; queuosine, which is a modified nucleotide; and so on.

Then an even more amazing thing turned up, thanks to which we hope to get into heaven sooner or later: this thing can work like a screwdriver with interchangeable bits. Since the ATPase and the transmembrane component do not determine specificity, they can, generally speaking, be universal, encoded in a completely different place in the genome, and work with a large number of different components that determine specificity. And here we see a whole biochemistry textbook (60): biotin again; riboflavin, the same riboflavin transporter that I started with; folate; again thiamine precursors and thiamine itself, only other variants of the transporters; and much more. They are all regulated in different ways…

Since the laboratories were experimental, they checked it (61). It turned out that the riboflavin transporter carries riboflavin and the predicted thiamine transporter carries thiamine; ATP is mandatory for them, and they do not work without it. Folate turned out to be like biotin: the transporter works well in the presence of the ATPase, but if its ATPase is broken, what remains works as a secondary transporter.

Experimental confirmations

Well, here is how it looks (62): a cell is drawn here.

There are systems that work as a whole: an ATPase, a transmembrane component and a component that determines specificity. There are also those that work as secondary transporters with different specificities, and in addition there is a universal "charger", which consists of an ATPase and a transmembrane component. In combination with a specific transporter, such a complex increases its efficiency.

I have told this story in detail because it is an example where completely unexpected biology was, each time, first predicted and then tested experimentally.

Universal "energy complex" + components that determine specificity

That was applied bioinformatics, which is what biologists keep us for.

In fact, it is very interesting to do non-applied bioinformatics. I will not go into detail, but will just name the areas (63). Molecular evolution: the origin of genes, the taxonomy of organisms, horizontal transfers, i.e. how genes from one organism can get into another. It is interesting to see how selection works at the molecular level. For example, there is a very popular field: identifying genes that evolved rapidly on the lineage leading to humans. This is based on the idea that it was selection acting on these genes that made us human. Of course, this area is still quite speculative, but there are some rather amusing results. For example, it turned out that the gene in which mutations lead to hereditary speech disorders really did evolve very quickly on the lineage leading to humans.

"Non-applied" bioinformatics

It is interesting to look at the cell as a whole; this is a fashionable field now called systems biology.

It hasn't really taken shape yet, but you can build different models there, try to describe something. This science is gaining popularity and, apparently, will gradually become quite intelligible.

And besides, it is interesting to think about big problems, i.e. not to dig around, so to speak, in each protein family, but to understand in general how it all came about (64).

"Big tasks"

Well, the biggest question, where all this came from, is apparently not a question of biology but a question of chemistry, but one can try, and this is interesting, to reconstruct the properties of the last common ancestor of all living organisms.

For example, it is clear that his genetic code was the same as ours, because everyone's genetic code is the same. He most likely had an RNA genome, that is, the main carrier molecule of genetic information he had was not DNA, but RNA. This follows from various reasons, in particular due to the fact that ribosomes are all the same, and other cellular machines that work with RNA are all the same, and those that work with DNA - they are already different in bacteria and in us. Therefore, we can think that our common ancestor with bacteria had an RNA–based genome, and DNA is a later invention.

One can speculate – and this is partially done – about the origin of eukaryotes (these are organisms whose cells have a nucleus, for example, you and me). Apparently, this is some kind of chimera, because mitochondria are actually bacteria that have learned to live inside another cell. They degraded greatly after that, they gave a significant part of their bacterial genes to the main nuclear genome, but, nevertheless, it is clear that these are obvious bacteria, and it is even known from which taxonomic group: our mitochondria are the closest relatives of rickettsias.

Zhenya (Evgeny Viktorovich) Kunin is trying to build some deeper models based on this kind of considerations. Nothing is visible in this picture (65) – this is correct, because, most likely, everything is wrong there. Nevertheless, until about the middle of the story (if you follow the big events), you can hope to descend by simply comparing the genomes of currently existing organisms. Then, apparently, hand-waving and biochemistry will begin.

These are the people I mentioned (66). Dima Rodionov, who was engaged in metabolism and the search for transporters, Lesha Vitreschak, who came up with RNA switches - they were extremely important for determining specificity, besides the fact that this in itself is a remarkable discovery. Andrey Alexandrovich Mironov wrote the programs with which we did all this, and besides, he is simply the central person in this company. And these are our experimental colleagues - Thomas Eitinger and Andrey Osterman.

The last slide (67) is by one of the younger Bruegels; the picture is called "The Battle with Fallen Angels". In my opinion this title is completely wrong, because these creatures undoubtedly symbolize sequenced genomes, and they are very diverse, as you can see, while these very few noble people in white are the bioinformaticians who are trying to study all these genomes, and they clearly do not have enough strength. Thank you.

Discussion of the lecture

Mikhail Gelfand (photo by Natasha Chetverikova)

Boris Dolgin: What is needed for there to be enough strength? Let me explain the question. Do we need more people to go into this field, do we need machines, or what do we need?

Mikhail Gelfand: Well, first of all, people need to go into this field, and they actually do; it is quite popular. Biologists are learning to do some of the routine things themselves. Bioinformatics is actually not a very complicated science; there are a lot of small considerations, and it is not very difficult to learn to apply them in the right order. In good biological groups people simply know how to do it themselves. The story of RNA switches is very telling. They were discovered simultaneously (by us a little earlier) from the bioinformatic end and from the experimental one. We stopped working on this because we had no experiment, and the experimenters we tried to work with let us down. And the experimental group at Yale very quickly learned to do about the same bioinformatics, only a little less detailed, and they are doing perfectly well and in fact do very beautiful work. That is, on the one hand, there should be more people who try to work at the cutting edge and develop methods, and on the other hand, there should be in situ bioinformatics in strong biological groups.

Boris Dolgin: Is it necessary to change biological education in some way so that people can perceive it?..

Mikhail Gelfand: Yes, to some extent this is happening. Even at Moscow University, which, in general, is a fairly conservative place, there, in addition to the fact that there is our faculty of bioengineering and bioinformatics, we also conduct special courses in bioinformatics at the Faculty of Biology at molecular departments. American universities have bioinformatics programs almost everywhere. There are decent textbooks. There are no good ones at all, but there are decent ones.

Question from the audience: I would like to hear your opinion, did life originate in the earthly broth or did it fly from space?

Mikhail Gelfand: By Occam's razor, there is no reason to think that life flew in from space. There are more or less plausible scenarios of how life arose not in the broth but, to stay within your metaphor, on the walls of the pan, i.e. in compartments that formed in clay minerals. The Kunin article I mentioned is about this. The hypothesis of panspermia, first, does not solve anything, and second, is unverifiable. There are no arguments for it in sight.

Boris Dolgin: I will still clarify. I was confused by the broth. Is this metaphor still used by biologists or is it left in the Soviet past?

Mikhail Gelfand: One should not think that everything that people close to the Soviet government were doing is compromised by this very fact – science itself does not really depend on it. Broth is not broth, but some substances, relatively simple, that had to combine to form relatively complex molecules - they had to exist. Calling it a highly diluted broth or something else is already a matter of taste.

Question from the audience: A question that has remained since the days of "Bilingua". Why is DNA more efficient than RNA, and why did evolution stop at the double helix and not go down the path of tripling and so on?

Mikhail Gelfand: Triple, quadruple DNA and so on practically do not exist, simply for physical reasons. Triple DNA does occur, but it imposes extremely strict restrictions on the sequence. Not every DNA sequence can be turned into a triple helix, whereas any sequence can form a double one. DNA is better than RNA because it is significantly more resistant to damage. In particular, having two strands makes it possible, when there is an error, to correct one strand using the information in the other.

Question from the audience: When they say that the DNA of a certain chromosome of a higher organism has been deciphered, they mean that this DNA can be completely untangled, isolated as a linear one-dimensional object in order to number all the nucleotides. Is there any certainty that this DNA can be unraveled in principle?

Mikhail Gelfand: Well, technically this is not how it is done. Technically, you first cut it, then determine the sequence of fragments, and then build them into a linear molecule by overlapping. To untangle a collapsed linear object – well, I don't know that anyone has tried to do this, but on the other hand, the cell copes with this, because when it reproduces DNA in the act of replication, then it solves all these problems related to entanglement and everything else in this way. And about how DNA is arranged in a cell – namely, chromosomes, including the entire structure of chromatin, how it is wound on proteins and so on – this is now being very actively studied, this is just the part about which I have said almost nothing, only mentioned. In particular, now there are some ideas about how it is arranged spatially, which sections of chromosomes (not necessarily the same chromosome) in a cell are close on average – this is averaging over many cells, cells are not identical in this sense.
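The "cut, sequence the fragments, then join them by overlaps" procedure described in this answer can be sketched as a toy greedy assembler. This is only an illustration under strong assumptions (error-free fragments, a single strand, no repeats, no fragment contained inside another); real assemblers are far more elaborate, and the function names and the sequence are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; return the remaining pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Invented fragments of an invented 15-letter "chromosome".
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
print(greedy_assemble(reads))
```

The result is a linear string regardless of how the molecule was folded in the cell, which is why the physical untangling asked about in the question is not needed for sequencing.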

Boris Dolgin: If I understood the question and answer correctly, it means that trying to consider the sequence as linear impoverishes the meaning, just as trying to consider poems as a single stanza, ignoring the rhythm and so on, means impoverishing the understanding of the verse, so here…

Mikhail Gelfand: Well, I would ignore the metaphor with the verse, with your permission, as not explaining anything. There are two aspects here. The first aspect is that to really consider DNA only as a text is undoubtedly poorer, because we know that DNA is more complicated, there is still a bunch of everything, including spatial organization, chemical modifications, and so on. It's not that they are gradually moving away from this – they will always work with the genome anyway. But this is gradually enriched by ideas about other aspects.

Boris Dolgin: To what extent do algorithmic mathematicians go into this field?

Mikhail Gelfand: Yes, of course. I said that there are quite non-trivial mathematical problems there, and in our group about half of the people have a biological education and half a mathematical one.

Boris Dolgin: Is it clear how to attract mathematicians, is there a way for them?

Mikhail Gelfand: I am a mathematician by education. There is a way: I have passed the candidate's exam in molecular biology five times.

Question from the audience: Tell me, the Germans seem to be very practical people, and, most likely, does all of the above have any practical significance?

Mikhail Gelfand: Germans are not only practical people, but also romantic…

Boris Dolgin: And meticulous.

Mikhail Gelfand: Yes, and pedantic. In principle, bioinformatics has the same applied value as biology in general, it's just a part of biology that takes advantage of new opportunities. Biology is of practical importance, but this is a topic for a separate lecture, most likely, I should not read it. In general, this is demagogic, of course, there will be an answer – as well as a question…

Transporters specifically... Resistance to cancer drugs is quite often determined by transporters: the cell learns to pump out the anti-cancer drugs that get into it, so by studying transporters... well, you can finish that yourself. Understanding how a cell lives is, apparently, practically useful in itself. The riboflavin transporter, this whole riboflavin story: its experimental confirmation was done at a company that actually produces riboflavin. If you want to force a cell to make riboflavin in bulk, you have to prevent it from stopping. Normally, when a cell has a lot of riboflavin, it senses this with the help of that same regulatory structure and shuts down the riboflavin genes, so the enzymes of the pathway are no longer produced. Now you want the cell to have a lot of riboflavin and still keep on making it. The first thing you do in such a situation is break its regulation, the negative feedback "a lot of riboflavin, stop making it". If you are able to predict regulatory regions in this way, then you are able to influence them. The second is the same story with transporters. Generally speaking, if a cell is able to take riboflavin from the environment, it will never make it itself; taking it from the environment is much more economical. Accordingly, if you want to make a riboflavin-producing strain, you kill its transporters. The experimental work of 2006, in which the riboflavin transporter was tested, came from that company.

Question from the audience: I would still like to hear what the limitations of bioinformatics are, because it seems that you can't do much on your own just by looking at the data. The prediction about the transporter is a good example, but can you do something more interesting, for instance understand the interactions of genes: not the function of an individual gene or group, but more complex processes? Thanks.

Mikhail Gelfand: We can discover biological facts that biologists were previously unaware of, facts that nobody even suspected could exist. A new class of transporters may not seem very interesting to you, but it seems quite interesting to me. The RNA structure that I mentioned but hardly described is actually the first example of a regulatory structure that can directly bind small molecules. This was predicted and later turned out to be true, and it relates to quite fundamental things. If you imagine the RNA world, when there were no proteins yet but RNA already existed, there were ribozymes, RNA molecules that can work as enzymes. This is the first example of a natural RNA structure that directly interacts with small molecules. We are able to understand how regulatory systems are rebuilt in the course of evolution. This is again too specialized a topic for a popular lecture, but we can talk about the regulation of whole groups of genes at once and try, in some sense, to reconstruct regulatory interactions between entire groups of genes. We are able to predict the metabolism of bacteria quite well (actually, I started with this), again by looking at the sequences. Let's do it differently: you give me an example of some fact that seemed interesting to you, and I will tell you whether we can do it or not.

Remark from the audience: Well, all right. Let's say, correlations between certain genes, so as to say what their functions are, and specifically for some arbitrary group on which no experiments have been done at all.

Mikhail Gelfand: There were no experiments on these transporters until we started doing them. And that is very much a group of genes.

Remark from the audience: As I understand it, this is some kind of very localized group.

Mikhail Gelfand: You misunderstood.

Question from the audience: I would like to ask about the tools of your research. I mean the programs you use. Of course, you are not running through the sequences with your eyes…

Mikhail Gelfand: Quite a lot is done by eye. But there are also programs. For example, when we want to find homologues, i.e. related proteins, there are standard Internet services that do this. In general, quite a lot of such tools are implemented as web servers: you simply submit your sequence and get some kind of answer. For example, people have made such a tool for predicting transmembrane segments. For comparative analysis of regulation we have our own programs, written by Andrei Alexandrovich Mironov, and right now we are trying to make them more publicly available. In reality, life seems to be arranged in such a way that a group first writes a program for itself, for a specific task that needs to be done. If the program turns out to be reasonable, there is social pressure to make it publicly available. It is assumed that if you have published something, you provide it to everyone on request, and for programs this means that you need to write documentation. For the authors of successful articles this turns out to be a big burden. So instead of sending out your program, it is now much more efficient to implement it as a web server. And the procedures that are in demand, the ones people really need, are all implemented as web servers.

Question from the audience: There was a question about which problems can and should be solved. Is determining the secondary and tertiary structure of a protein from its primary sequence one of the tasks of bioinformatics?

Mikhail Gelfand: Yes, this is a traditional problem. It can be approached by physical methods or by statistical, bioinformatic methods. The question is what counts as "solving" it. If you need an absolutely reliable prediction in all cases, then no, it doesn't work that way. But in many situations useful predictions are obtained. Secondary structure is predicted pretty well, although there has been no noticeable progress there lately. Spatial (tertiary) structure is predicted worse when it has to be done de novo, that is, when your protein has no relatives. But with very high probability you will have a relative with an already known structure, and then the task is to fit your sequence intelligently onto that known structure. This is done pretty well. There is a wonderful event, a World Cup of protein structure prediction. It works like this: groups that have determined a protein structure experimentally announce, in the interval between solving the structure and publishing it in a journal, that they have it, and then anyone can predict the structure for that sequence (the sequence is known, but the structure has not yet been published; it has only been announced that a structure for this sequence is about to appear). Moreover, as in athletics, there are different categories: proteins with relatives, proteins without relatives, prediction of secondary structure, prediction of spatial structure, prediction of only the path of the main chain (that is, the main features of the structure, such as the type of fold), or prediction of the positions of all side groups. Then once a year a special committee organizes a conference and hands out medals. They used to do the same for gene prediction, but it has stopped now; that problem is done. For a long time no one believed that it was possible to predict spatial structure entirely de novo, and now there is significant progress even in this area.

Question from the audience: I recently read in Chemistry and Life about a discovery that a gene can encode not one protein but several. Can you tell me something about it?

Mikhail Gelfand: I do not know what exactly was meant in Chemistry and Life, but there are plenty of such stories. For example, all eukaryotes have splicing: first comes transcription, and then, before translation, the non-coding sections are cut out and what is left is joined together, and only after that do you get a messenger RNA. Splicing can occur in different ways, and as a result, generally speaking, you get different proteins. And in viruses it often happens that one fragment of DNA encodes several proteins at once in different reading frames (here the question is the formal definition of a gene, which keeps shifting a little all the time). You shift by one nucleotide, you get a completely different sequence, and it also turns out to be meaningful. Viruses economize heavily on genome size; there is fierce selection for efficiency and for speed of replication. Because of this, viruses very often have such overlapping reading frames, and it is already a matter of taste whether you want to call this thing one gene or two.
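
The point about reading frames ("you shift by one nucleotide, you get a completely different sequence") can be shown in a few lines. The sketch below translates the same DNA fragment in three forward frames using the standard genetic code. The DNA string is an arbitrary illustrative fragment, not a real viral gene, and the function name `translate` is introduced here only for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate dna starting at offset `frame` (0, 1 or 2); '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


dna = "ATGGCCATTGTAATGGGCCGC"  # illustrative fragment only
for frame in range(3):
    print(frame, translate(dna, frame))
# 0 MAIVMGR
# 1 WPL*WA
# 2 GHCNGP
```

The three frames give three unrelated amino acid strings from one stretch of DNA. In a random fragment like this one, an alternative frame quickly runs into a stop codon; in an overlapping viral gene both frames would stay open.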

Question from the audience: You used the term "related protein". By this, did you mean that they have a similar sequence, or that they really have a kinship relationship (i.e. one protein generates another, for example)?

Mikhail Gelfand: Thank you, this is actually a wonderful question, a very apt one. Formally, I mean that they are similar to each other, because, strictly speaking, I can't check anything else: I don't have a time machine to find out whether they were once the same protein. At the same time, I make the assumption that these proteins, being similar to each other, really do come from a common ancestor. Humans and chimpanzees come from a common ancestor, so similar proteins of humans and chimpanzees come from a protein that was present in that ancestor. Why do we think this assumption is correct? Firstly, because it is simply logically natural, and secondly, because, generally speaking, we can check it. If we take several families of similar proteins from different organisms and build a phylogenetic tree for each family, then to a first approximation these trees will be the same. This means that in the nodes of the tree we are reconstructing the very ancestral proteins that the ancestral organisms had. And the fact that the genealogy turns out to be arranged in the same way for different proteins is, apparently, most economically explained by saying that it simply reflects the genealogy of the species themselves. I can tell a story in this regard. People are engaged in an activity called "molecular paleontology" (they have only just started doing it; I know of two or three papers). If we believe that similar proteins are relatives and had a common ancestor, then let's take this ancestor and reconstruct its sequence by looking at the modern ones. Then we can synthesize it, which is not very complicated genetic engineering, and simply study its properties.
Two examples of such work: in one, the visual pigment of a dinosaur was reconstructed simply to measure its absorption spectrum; in the other, one of the translation factors of the common ancestor of all bacteria was reconstructed and its temperature optimum was measured (the basic hypothesis here is that in each organism the temperature optimum of its proteins is the temperature at which the organism predominantly lives). It turned out to be 60-70 degrees: a thermophile, but not a hyperthermophile.
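
As a rough illustration of what "reconstructing the ancestor's sequence by looking at modern ones" can mean, here is a minimal sketch of the bottom-up pass of Fitch parsimony, one simple textbook method. It is an assumption chosen for brevity: the lecture does not say which method the cited studies used, and they may well rely on different, probabilistic approaches. The tree, the species names and the sequences are invented for the example.

```python
def root_candidates(tree, column):
    """Fitch bottom-up pass for one alignment column.

    `tree` is a species name (leaf) or a pair of subtrees;
    `column` maps species name -> residue at this position.
    """
    if isinstance(tree, str):
        return {column[tree]}
    left, right = (root_candidates(child, column) for child in tree)
    # Keep shared residues if any; otherwise keep all candidates.
    return (left & right) or (left | right)


def reconstruct_root(tree, alignment):
    """Return the inferred root sequence; 'X' marks ambiguous positions."""
    length = len(next(iter(alignment.values())))
    result = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        candidates = root_candidates(tree, column)
        result.append(min(candidates) if len(candidates) == 1 else "X")
    return "".join(result)


# Hypothetical tree and hypothetical aligned protein fragments.
tree = (("human", "chimp"), ("mouse", "rat"))
alignment = {
    "human": "MKVLA",
    "chimp": "MKVLA",
    "mouse": "MRVLS",
    "rat":   "MKVIS",
}
print(reconstruct_root(tree, alignment))  # MKVLX
```

The last position stays ambiguous because the two halves of the tree disagree and nothing breaks the tie. A real reconstruction would need more species or an explicit model of substitution to resolve it.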

Question from the audience: If there are nucleotide sequences that do not encode proteins, what are they needed for?

Mikhail Gelfand: They encode a lot of other things. Firstly, they encode structural and regulatory RNAs, and secondly, they contain regulatory sites. Remember, I showed an example where there was a conserved site that was not inside a gene, and I said that this site regulates the work of the gene. So that is one side: there are many functions besides coding for proteins, for example telling the cell when to turn a gene on and when to turn it off. Almost all regulatory sites are located upstream of genes and in the intergenic regions. In bacteria it works like this: their intergenic regions are short, about half of them, according to our estimate, are under strict selection, and their function is apparently regulatory. You and I may have "parasites" in the genome: repeats, sequences that have learned to copy themselves and insert themselves into random places. The genome of a bacterium must replicate quickly, because the cell must divide quickly; if it divides too slowly, a rapidly dividing neighboring clone will crowd it out. Eukaryotes (especially multicellular ones) are not under such selection for genome compactness, so the genome can grow quite considerably, and it is unclear how one could come up with special mechanisms for cleaning the genome of "parasitic" DNA fragments (other than natural selection). Besides, it is very dangerous, because as soon as you start cutting up the genome, you have a very powerful source of all sorts of errors. The canonical example: there is a single class of human cells in which real genome rearrangements take place. These are lymphocytes, cells of the immune system; when they mature, the genome is rearranged. And all lymphomas and leukemias are connected with this rearrangement having gone wrong, so that an immortal clone was formed, hence the blood cancer and all the joys that come with it.
That is, you cannot come up with a mechanism for effectively clearing the genome of more or less uncontrollably multiplying elements that would not be more harmful than the elements themselves. Roughly speaking, the police turn out to be worse than the bandits.

Question from the audience: Is work being done on creating a digital model of a bacterium or of some part of it, a virtual description of it and of some kind of evolution?

Mikhail Gelfand: Yes, such work is under way, and of various kinds. There are straightforward models of metabolism: people solve linear programming problems or build systems of differential equations and try to describe metabolism. They are now able to build relatively decent stationary models of a whole bacterium that have reasonable predictive power. Decent kinetic models are being built for individual systems. That is one side. The other approach is when you try to model fairly general properties: you imagine a bacterium as some abstract being, a "black box" with some inputs and outputs, and then you set up a population of such bacteria that can slightly modify their inputs and outputs and whatever is inside the black box, and you can observe a kind of artificial evolution. There is work of this kind too. I like it a little less, but there are amusing examples among it as well.

Question from the audience: And the second question: biological nanomachines – how well have they been studied, are the mechanisms known, and if so, are there any successful attempts to artificially create any nanomachines?

Mikhail Gelfand: They are quite well studied; this is precisely the material of traditional molecular biology. The spatial structure of the ribosome is known, literally the position of every atom, and in different states: at the start of translation and at different stages of translation itself. For many enzymes these structures are also well known. People have learned to build DNA-binding proteins that recognize specific sequences, that is, they bind not just anywhere but at a specific place, and when they bind to DNA, they cut it at that place.

Mikhail Potanin: You talked about various methods of sequence analysis. Was there an attempt to combine these techniques into an expert system? In particular, there are groups at your Institute of Information Transmission Problems that deal with expert systems…

Gelfand: Thank you, I'm aware. I do not know of any good examples of automated expert systems in the field of genome analysis. It seems to me that, compared to the doctors with whom our creators of expert systems work, we simply do not have enough experience. A good expert system is built by extracting experience and ways of thinking from the expert's head and trying to reproduce them. Medical expert systems are roughly like this. We do not have enough experience to build something reasonable on it. What really exists, and is very useful, are auxiliary programs and packages that let you carry out various routine operations in a sensible way, with a user-friendly interface and so on. They are not expert systems but rather assistants for the expert. We are trying to write programs of this kind, Andrei Alexandrovich Mironov is doing it very successfully, and other groups have done this kind of thing too. It turns out to be really useful. The pictures with arrows that I showed really do simplify the work when you need to look at many genomes at once. But it makes no sense to automate the hand-waving I told you about; that doesn't work. When I teach, I teach students certain tricks, and the order in which to apply them correctly is something that comes with practice...

Sergey: How sure are you that the bacteria had one common ancestor? Or could they have come from some different points after all?

Gelfand: If they came from different points, then in some amazing way they all arrived at the same place. They have the same ribosomes, the same transcription apparatus, the same aminoacyl-tRNA synthetases, the same RNA polymerases, the same replication apparatus... In fact, they all have the same basic mechanisms, not to mention the genetic code. This is actually the strongest argument in favor of monophyly, that is, that there was one common ancestor, because it is difficult to assume that exactly the same code could arise several times. Another matter is that the common ancestor of all living beings was not necessarily a cell; it could also have been a puddle in which molecules floated (more precisely, as I said, a pore in a clay mineral). It is suspected that the membrane was invented independently twice. This idea, right or wrong, is at least food for thought. But there is no reason to think that the translation apparatus was invented more than once.

Portal "Eternal youth" http://vechnayamolodost.ru 20.05.2010
