02 March 2021

"Transformer"

MSU physicists have developed a new model for accelerated drug development

Researchers at the Faculty of Physics of Moscow State University have created a new model for accelerated drug development. Given only the amino acid sequence of a target protein as input, the model generates candidate compounds intended to bind to that protein, and about 90% of the generated compounds are chemically valid. It can significantly speed up and simplify the process of drug development. The work was published in the prestigious journal Scientific Reports (Daria Grechishnikova, "Transformer neural network for protein-specific de novo drug generation as a machine translation problem").

Drug development is a very expensive and lengthy process. It takes 10-13 years on average, and its cost reaches several billion dollars. Development is divided into several stages. One of the most important is the search for a new molecule capable of affecting the target protein. This is an extremely difficult task, since the number of all chemically possible molecules is huge and, according to various estimates, ranges from 10^23 to 10^60. To date, only about 10^8 molecules have been synthesized. Computational methods are almost always used to search for new structures.

There are two main types of computational methods. The first is based on the three-dimensional structure of the protein. If the configuration of the binding site is known, the structure of the molecule can be optimized directly for it. The second type relies on information about ligands already known to bind to a given target protein. It is possible to establish a relationship between the physicochemical properties of a compound and its activity against a protein, and to use this knowledge to create new structures. Unfortunately, most existing methods in computational chemistry tend to generate molecules that are difficult to synthesize. In addition, many methods are based on manually encoded rules that greatly limit the number of molecules available to the algorithm. In short, the search for structures remains a difficult task. The use of machine learning methods to generate new molecules is currently being actively investigated.

"We used a deep neural network called the Transformer. This architecture was invented by researchers from Google Brain in 2017 for natural language processing. The Transformer consists of an encoder and a decoder. The encoder maps the input sequence of characters into a vector representation. The decoder then uses this representation to generate the output sequence character by character. One of the most important features of the Transformer is its self-attention layers. Self-attention is an attention mechanism that establishes connections between different parts of the same sequence and builds a representation of the sequence based on this information. In our task, we treat amino acids and the individual symbols of the string representation of the molecule (SMILES) as words," said Daria Grechishnikova, a researcher at the Department of Biophysics of the Faculty of Physics of Moscow State University.
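To make the idea of treating amino acids and SMILES symbols as "words" concrete, here is a minimal sketch of character-level tokenization. The function names, the special tokens and the example protein fragment are illustrative assumptions and are not taken from the published work.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def tokenize_protein(sequence):
    """Split a protein sequence into one token per amino acid residue."""
    tokens = list(sequence.upper())
    unknown = sorted(set(tokens) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unexpected residue codes: {unknown}")
    return tokens

def tokenize_smiles(smiles):
    """Split a SMILES string into individual characters (character-level tokens)."""
    return list(smiles)

def build_vocab(token_lists, specials=("<pad>", "<start>", "<end>")):
    """Map every distinct token to an integer id; special tokens come first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

# Illustrative inputs: a made-up protein fragment and the SMILES string of aspirin.
protein_tokens = tokenize_protein("MKTAYIAK")
ligand_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")

vocab = build_vocab([protein_tokens, ligand_tokens])
protein_ids = [vocab[t] for t in protein_tokens]
```

Note that splitting SMILES character by character breaks two-letter element symbols such as "Cl" into two tokens; this sketch accepts that for simplicity.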

A self-attention layer requires a constant number of sequential operations to establish connections between any two elements of the sequence, which allows it to cope with long sequences. This mechanism is well suited to the task of translating a protein sequence into a string representation of a ligand for two reasons. Firstly, the amino acid sequences of proteins can be quite long, tens of times longer than the string representation of molecules. Secondly, functionally significant elements of the protein structure can be formed by amino acid residues located far from each other in the sequence. It is therefore important that the model captures dependencies between distant elements well.
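The following is a simplified sketch of scaled dot-product self-attention, written in plain Python for readability. It assumes queries, keys and values are all equal to the input vectors; a real Transformer layer additionally uses learned projection matrices, several attention heads and positional encodings, all of which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    weights = []
    for q in vectors:
        # Every position is compared with every other position directly,
        # regardless of how far apart they are in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

# Toy sequence: the first and last elements are identical, the middle two differ.
sequence = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(sequence)
```

In this toy sequence the first element gives as much weight to the identical last element as to itself, and more than to its immediate neighbour, which illustrates how attention links distant positions in a single step. In practice the loops are replaced by matrix multiplications that run in parallel.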

"For the first time, we presented protein-specific drug design as a translation problem between the 'language' of amino acids and the string representation of the molecular structure (SMILES). A protein is considered as a 'context' for the generation of a molecule that binds to it. This formulation of the problem allowed us to adapt one of the most successful architectures in the field of machine translation to the task of generating molecules. It turned out that the amino acid sequence of the protein is sufficient to generate molecules that bind to a given protein," Daria Grechishnikova continued.
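The translation framing can be sketched as a decoding loop that produces a SMILES string one character at a time, conditioned on the protein sequence. Everything below is hypothetical: `toy_next_token_scores` is a stand-in that merely replays a fixed string so the loop can run, and greedy decoding is used for brevity; the article does not say which decoding strategy the author used.

```python
START, END = "<start>", "<end>"

def toy_next_token_scores(protein_tokens, generated):
    """Hypothetical stand-in for a trained decoder: returns a score per candidate token.

    It ignores the protein and replays the SMILES string of ethanol ("CCO");
    a real model would compute scores from protein_tokens and generated.
    """
    target = list("CCO") + [END]
    step = len(generated) - 1  # generated always begins with START
    wanted = target[min(step, len(target) - 1)]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in set(target)}

def greedy_decode(protein_tokens, score_fn, max_len=100):
    """Generate a SMILES string character by character, taking the top-scoring token each step."""
    generated = [START]
    for _ in range(max_len):
        scores = score_fn(protein_tokens, generated)
        best = max(scores, key=scores.get)
        if best == END:
            break
        generated.append(best)
    return "".join(generated[1:])

# The protein fragment is a made-up example, not a real drug target.
smiles = greedy_decode(list("MKTAYIAK"), toy_next_token_scores)
```

The point of the sketch is the shape of the loop: the protein sequence is the source "sentence", and the molecule is emitted token by token until an end marker is produced or the length limit is reached.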

The developed model can significantly speed up and simplify the process of drug development. It makes it possible to quickly and efficiently create molecules that can interact with a specific protein. Previously published models require data on known molecules that bind to the protein or information about its three-dimensional structure. For new target proteins, however, additional methods must be used to obtain such information. "For example, for new proteins such as the viral proteins of SARS-CoV-2, which causes the infectious disease COVID-19, there were no data on binding affinity to any compounds. In such a case, approaches that rely on additional training of the model on protein-binding molecules most likely cannot be applied. Approaches based on protein structure may also be inapplicable, since for some proteins it is difficult or even impossible to determine the three-dimensional structure. The proposed model requires only knowledge of the amino acid sequence of the protein, which greatly simplifies the task of searching for molecules," added Daria Grechishnikova.

Portal "Eternal youth" http://vechnayamolodost.ru

