Biology and neural networks
The use of neural networks in modern biology
Vyacheslav Golovanov, Habr
The article was co-authored with Anastasia Novosadskaya, a specialist in molecular biology and in applying neural networks to that field, and Vladislav Svetlakov, a neural network specialist.
Science changes every year, in both its methods of analysis and its research directions. At the very beginning of biology's development, the focus was on the macro level; over time the level "descended" to the molecular one. And as the level dropped, the flow of incoming data grew: an avalanche literally fell on scientists. Naturally, the important new information must be extracted from this entire stream. A researcher cannot do this alone, without technology, and even the technology sometimes struggles. And then Big Data comes to biology.
In the distant, distant past, our ancestors were interested in biology at the macro level: the study of all living things that can be seen around us with the naked eye, such as birds, insects and plants. It was the time of descriptive biology. People described literally everything they saw: this leaf is green, it has veins, this one is oval, that one is palmate...
Then, over time, people acquired new knowledge; it accumulated layer upon layer, and technological progress did not stand still. Researchers gradually moved away from the macro level toward ever "smaller" levels that can no longer be studied with good eyesight alone. Koch, Pasteur, Vinogradsky and many others brought knowledge of microorganisms into the world, and the study of biology moved to the cellular level. People became increasingly interested in the sources of disease, in fermentation, in possible uses of microscopic organisms, and in the properties of the cells of their own bodies. Then Watson, Crick and Franklin discovered the structure of DNA, which became a sensation and a real breakthrough for science: a kind of leap from the cellular level to the molecular one. Biology today is nothing like it was at the very beginning, or even like it was 30-40 years ago. Equipment has become far more sophisticated, research tools and methods have improved, and conducting a truly high-quality experiment often requires competence in several fields at once. Moreover, along with the "miniaturization" of biology, a single study now produces ever more data: decoded gene sequences and their functions, estimates of expression levels, and much more. There is so much of it that a researcher cannot carry out the analysis alone, without the help of technology. And so the IT term "Big Data" comes to biology.
There is still no specific threshold marking the transition from "just data" to "big data". One can only note that the data are heterogeneous: some may be incorrect, some incomplete, and much of it is simply duplicated. Such data arrays need to be processed efficiently and quickly, where by processing we mean obtaining new information, new knowledge. One of the simplest and most vivid examples of Big Data is the sequenced (decoded) human genome. For comparison: Telegram has about 700 million users, while the human genome contains about 3 billion nucleotide pairs.
The accumulation of such large data sets confronts modern biology with the task of processing digital information efficiently, with maximum automation and optimization, thereby reducing the influence of the human factor. Moreover, unlike in the last century, new biological information now arrives each year not as a smooth accumulation but as a rapid influx: a kind of avalanche that there is simply no time to transform into new knowledge. This is where the artificial intelligence (AI) methods developed over the past twenty to thirty years come in, above all machine learning and one of its branches, artificial neural networks.
However, no one has forgotten the mathematical methods of analysis, that is, mathematical and statistical analysis using programming languages. In biology, R and Python are popular: they are fairly easy to learn, and a great many libraries for statistical analysis and visualization have been written for them. It is worth understanding the following: for some experiments (yes, even with a large data array) mathematical methods are more suitable and cost-effective, while for others machine learning methods, such as computer vision, are more convenient and faster.
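To make this concrete, here is a minimal sketch of the kind of scripted statistical analysis the paragraph refers to, using only Python's standard library. The expression values are invented for illustration; a real pipeline would typically reach for R or Python libraries such as pandas and SciPy.

```python
# A toy comparison of gene-expression levels in two groups, using only
# the Python standard library. All numbers are made up for illustration.
from statistics import mean, stdev

# Hypothetical expression levels of one gene in control vs. treated samples
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.0, 6.5, 5.9, 6.3]

def summarize(name, values):
    """Format the mean and standard deviation of a sample."""
    return f"{name}: mean={mean(values):.2f}, sd={stdev(values):.2f}"

print(summarize("control", control))
print(summarize("treated", treated))
# A real analysis would continue with a significance test,
# e.g. scipy.stats.ttest_ind, or its R equivalent t.test().
```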
In particular, machine learning pays off when it is impossible to assess all the correlations because there are too many of them, when the process is too laborious, or when the data do not lend themselves to a normal approximation. A person simply cannot always notice and evaluate the smallest relationships and details of an experiment; or perhaps they can, but only through an enormous amount of trial and error, which is inconvenient and unprofitable. Obviously, in some situations there is no principled way to choose a method, and everything comes down to the preferences and knowledge of the researchers.
Nor has anyone abandoned the analysis of information using rules set manually by researchers. For example, no program independently concluded that nucleotides are complementary (the mutual correspondence of bases in DNA/RNA); that rule was set by a human. With computing resources it then became possible to calculate the results of transcription or replication, simulate the outcome of protein translation, and so on; in other words, to obtain information from the analysis of molecular data that can then be processed in even more detail.
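The complementarity rule mentioned above is easy to express as hand-written code. A toy sketch (not a bioinformatics library) might look like this:

```python
# Nucleotide complementarity set explicitly by the programmer,
# not learned from data: the rule-based approach described in the text.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(dna: str) -> str:
    """Return the reverse-complement strand of a DNA sequence."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def transcribe(dna: str) -> str:
    """Simulate transcription of the coding strand into mRNA (T -> U)."""
    return dna.replace("T", "U")

print(complement("ATGC"))   # -> GCAT
print(transcribe("ATGC"))   # -> AUGC
```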
Naturally, such rules can grow into whole trees, and they can form the basis of classification (as in decision trees, for example). As practice shows, however, rules invented by a human are limited by that human's imagination. Deep Learning approaches have no such disadvantage, which is why they have found such wide application in almost every sphere of our life, including experimental biology.
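A hand-written tree of rules can be sketched in a few lines. The leaf-shape thresholds below are purely illustrative, and real decision trees are usually learned from data (e.g. with scikit-learn's DecisionTreeClassifier); the point is only how quickly manual rules become nested and hit the limits of the author's imagination:

```python
# A toy hand-authored decision tree for a hypothetical leaf classification.
# Every branch is a rule a human thought of; nothing here is learned.
def classify_leaf(length_cm: float, has_lobes: bool) -> str:
    if has_lobes:
        return "maple-like" if length_cm > 8 else "small lobed leaf"
    else:
        return "willow-like" if length_cm > 8 else "small entire leaf"

print(classify_leaf(10, True))    # -> maple-like
print(classify_leaf(3, False))    # -> small entire leaf
```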
In biology, neural networks are used quite widely, alongside computer statistical analysis and rule-based analysis: for in-depth analysis of images and physiological patterns in digital phenotyping (determining the features of external characteristics), for studying genomes and transcriptomes (the results of transcribing the genome into RNA), for studying the structure of proteins, lipids and other organic molecules of the cell, for predicting the behavior of molecules in targeted drug delivery and the effectiveness of targeted therapy, and much more.
For example, the Flora incognita mobile application is built on a neural network and identifies a plant species from a photo. In effect, it is like a Yandex or Google image search, only more precise. The application also carries additional information on each identified species. Convenient and fast; it has helped more than one generation of biology students through their botany practice.
Or AlphaFold2, which predicts the three-dimensional structure (fold) of a protein. Predicting how a protein folds into a specific conformation is important for explaining and understanding its function, its modes of interaction, and much more.
Another interesting example is a neural network developed by scientists at New York University to analyze the structure of an organism's DNA. It was able to isolate the key genome sequences responsible for the life cycle and appearance (phenotype). Thanks to such networks, it is becoming much easier to improve agricultural plants and animals and, as a result, the food and other industries.
The beauty of neural networks
An artificial neural network is a mathematical model, together with its software or hardware implementation, created on the model of biological neural networks. In a biological network, each neuron is functionally connected to other similar neurons and transmits and receives various discrete signals. An artificial neural network only simulates the work of its natural counterpart: it analyzes and, in some cases, memorizes information. But in no case do we equate biological and artificial neural networks: a biological network has a far more complex architecture (as, in principle, almost all biological structures are more complex than computational ones), and the connections between its neurons are themselves intricate, owing to the many neurotransmitters and to electrical synapses (gap junctions). In addition, a biological neural network works on the principle of signal summation.
The key feature that lets neural networks reproduce almost any dependencies and correlations (which, as noted above, are among the main properties of biological data) is their multilayer structure. A single neuron cannot recover complex dependencies because of the limitations of its function, but combining such neurons into groups called layers expands the predictive power many times over. Incidentally, the human retina (the structure responsible for perceiving visual information and converting it into nerve impulses) is also multilayered: ten microscopic layers that, working together, allow a person, for example, to read posts like this one on Habr.
To create a high-quality neural network, an activation function is needed. Without it, a neuron is comparable to linear regression, which copes well with creating linear separating planes or capturing linear dependencies. With an activation function added, the neuron can create a separating surface that is nonlinear. Combining such nonlinear surfaces makes it possible to build virtually any separating hypersurface, and hence to recover the complex, nonlinear dependencies that experimental and predictive biology absolutely requires.
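The classic illustration of this point is the XOR function: no single linear neuron can compute it, because XOR is not linearly separable, but a two-layer network with a nonlinear (step) activation handles it easily. The weights below are chosen by hand for clarity rather than learned:

```python
# XOR via a two-layer network with a step activation.
# A single linear neuron cannot represent XOR; two hidden
# neurons plus one output neuron can.
def step(x: float) -> int:
    """Nonlinear activation: fires (1) only when the input is positive."""
    return 1 if x > 0 else 0

def xor_net(x1: int, x2: int) -> int:
    h_or  = step(x1 + x2 - 0.5)      # hidden neuron computing OR
    h_and = step(x1 + x2 - 1.5)      # hidden neuron computing AND
    return step(h_or - h_and - 0.5)  # output: OR and not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

Removing `step` (i.e. making every neuron linear) collapses the whole network into one linear function, which is exactly why the activation function matters.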
Scheme of separating planes and classifications
As mentioned earlier, biological data have a complex internal structure, sheer volume, and intricate dependencies, and neural networks are suited to exactly such tasks. The researcher must provide a sufficient amount of input (training) data (rarely a problem in biological research), after which the neural network, in the course of learning, finds the very kernel that allows it to make correct predictions. But for this, higher-level "modules" of neural networks are needed, for example convolutions; and some biological data make more sense to analyze with recurrent neural networks or autoencoders.
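As a small illustration of the "convolution" module mentioned above, here is a one-dimensional convolution in pure Python. Real pipelines use optimized implementations from frameworks such as PyTorch or TensorFlow, applied to images or sequences; the principle is the same sliding weighted sum:

```python
# A valid (no-padding) 1D convolution: slide the kernel along the
# signal and take a weighted sum at each position.
def conv1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# An edge-detecting kernel highlights abrupt changes in the signal,
# just as image convolutions highlight edges and textures.
signal = [0, 0, 0, 1, 1, 1]
print(conv1d(signal, [-1, 1]))  # -> [0, 0, 1, 0, 0]
```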
Successful application of artificial neural networks requires annotated data sets, that is, data for which the target parameters are known. For example, to analyze the physiological state of a plant from a photograph, one needs a data set containing both the photographs and an assessment of the plant's viability for each of them. To recognize objects such as plants or people, the neural network must be given the locations of those objects in the photos of the training sample, and for image segmentation, an outline of the region where the object of interest lies. There are public datasets on which you can successfully train your own neural networks; one such set is PlantVillage.
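What such annotations can look like in practice is sketched below for the three tasks just mentioned: classification, detection, and segmentation. The field names and file names here are illustrative, not taken from any specific dataset format such as PlantVillage:

```python
# An illustrative shape for annotated data; real datasets use
# formats like CSV tables, COCO JSON, or mask images on disk.
annotations = [
    # classification: image plus a class label
    {"image": "leaf_001.jpg", "label": "healthy"},
    # detection: image plus bounding boxes (x, y, width, height) per object
    {"image": "field_002.jpg",
     "boxes": [{"label": "plant", "bbox": (40, 60, 128, 200)}]},
    # segmentation: image plus a per-pixel mask stored as a separate file
    {"image": "leaf_003.jpg", "mask": "leaf_003_mask.png", "label": "diseased"},
]

# Collect the distinct top-level class labels present in the annotations.
labels = {a["label"] for a in annotations if "label" in a}
print(sorted(labels))  # -> ['diseased', 'healthy']
```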
An interesting fact: a classic test image for trying out computer vision approaches is a photo of Lena Söderberg, a Swedish model, cropped from the cover of a 1972 issue of Playboy magazine.
You can use such public sets in conjunction with your own annotated images, either by mixing them into your training data or through transfer learning. This approach relies on a feature of modern neural network architectures: the early layers of a network extract low-level features, while the transformations in the deeper layers produce increasingly abstract representations that are fed into the final classifier. For this reason it is quite effective to first train a neural network on a single, maximally general data set, and then fine-tune it, most often with the feature-extraction layers frozen, on "narrower" data. This makes it possible to train a network effectively even with a small amount of initial data. And this capacity for learning, in which complex dependencies are identified and generalized, is a real advantage of neural networks.
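The division of labor in fine-tuning can be caricatured in a few lines of pure Python: a "pretrained" feature extractor stays frozen while a small classifier head is retrained with classic perceptron updates. In real frameworks the freezing is done differently (e.g. by setting `requires_grad=False` on layers in PyTorch); this is only a toy sketch:

```python
def frozen_features(x):
    # Pretend these fixed transformations were learned earlier on a large,
    # generic dataset; during fine-tuning they are never updated ("frozen").
    return [x[0] + x[1], x[0] - x[1]]

def train_head(data, labels, epochs=20, lr=1):
    # Classic perceptron updates, applied only to the small classifier head.
    w, b = [0, 0], 0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = frozen_features(x)
            pred = 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            b += lr * err
    return w, b

def predict(x, w, b):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# A tiny "narrow" task: the label is 1 only when both inputs are 1.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
w, b = train_head(data, labels)
print([predict(x, w, b) for x in data])  # -> [0, 0, 0, 1]
```

Because the head sees informative frozen features, it needs only a handful of labeled examples, which is precisely the advantage transfer learning offers when narrow data are scarce.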
Once the neural network is trained, it begins to help the scientist in research: it is enough simply to feed it the input data. And why would biologists bother? At the very least, to reduce the cost of experiments. But there is something more important: correlations, and the often large amplitude and spread of the data. It is impossible to capture all the subtleties of external and internal influences on molecules with a single mathematical formula. Even after normalization, predicting, say, further therapy for a group of patients is very difficult, to say nothing of the batch effect. High-quality artificial neural networks help reduce risks and prevent consequences when a scientist has built the wrong logical chain and put forward the wrong hypothesis.
Classification scheme of plant diseases by photos
Another important advantage of artificial neural networks, and of computational analysis methods in general, is speed. The performance of modern computers allows analysis to be carried out many times faster than a human can, not only preserving the original level of accuracy but often improving it. Artificial neural networks cope successfully with a wide variety of tasks. For example, published results show their effectiveness in classifying plant diseases: in the study illustrated above, 38 categories of disease were classified from photographs. Moreover, a network can determine not only the disease or condition of an object but also its species (recall Flora incognita).
In conclusion, I would like to repeat once more that without disciplines such as machine learning, computational biology and bioinformatics, the progress we now see in applied science would be impossible. One should never forget how strong the influence of the human factor is. We live in an amazing time when automated systems can handle work that used to be done by humans, in some cases surpassing them many times over. Since this trend has not bypassed the scientific community, we have the opportunity not only to increase the volume of data collected, but also to take its analysis to a fundamentally new, deeper level.
Portal "Eternal youth" http://vechnayamolodost.ru