27 March 2008

The triumph of computer methods: prediction of the structure of proteins

Anton Chugunov, "Biomolecule"

Knowledge of the spatial organization of protein molecules is the key not only to understanding their functions and mechanism of operation, but also the basis for the development of effective and safe medicines.

At the same time, it is not always possible or advisable to determine the structure of proteins in a direct experiment due to the complexity, high cost and limited possibilities of experimental techniques. However, sometimes it is possible to overcome these difficulties by approaching the problem "from the other end": the structure of biomacromolecules can be predicted using theoretical approaches based on physical or empirical approximations. This article provides a theoretical justification for the possibility of predicting the structure of proteins and briefly discusses the main approaches to this task.

Why is it necessary to know the structure of proteins?

Proteins, the universal biopolymers from which life is built, perform the entire spectrum of biological functions: from structural to catalytic. (Their role for life in general is recognized even by the classics of Marxist-Leninist philosophy.) Of course, many other molecules are also irreplaceable: the "primacy" in storing and transmitting information belongs to nucleic acids, and lipids, the main components of the biomembranes of living cells, take over a fair share of the structural and formative functions. Ribonucleic acids, in addition to the structural and catalytic functions that have already become familiar to them, are attributed more and more new "roles", reinforcing the hypothesis of an "RNA world" that may have existed at the dawn of the epoch of the origin of life on Earth. Despite all this, it is proteins that play the maximum role in the living world (at least as we know it now), and the importance of studying them is not limited only to fundamental science: today both medicine and industry are consumers of knowledge about the functions and structure of proteins.

Understanding the mechanisms of functioning of living systems, and hence the ability to influence them, for example, with the help of drugs [1], requires knowledge of the structure of protein molecules and a deep understanding of their functions. Thanks to the work of Christian Anfinsen [2] – Nobel laureate in chemistry in 1972 "for his work on ribonuclease, in particular, for establishing the connection between the amino acid sequence and the conformation of a biologically active molecule" – we know that "the information needed [for protein folding] is contained in the linear sequence of amino acids of the peptide chain, and that no additional genetic information, beyond that contained in DNA, is required" [2]. However, the physicochemical aspects of this complex process, known as protein folding, are still understood only approximately.

In addition to scientists, protein structure is also of interest to specialists of a more practical profile. Pharmacists and doctors, for example, are interested in developing and bringing to market new generations of medicines. Nowadays, however, one can no longer count on chance success: it is necessary to have a good understanding of the molecular mechanism of action of the prospective drug – aimed, most likely, at interacting with some protein (a receptor or an enzyme) in the human body. Designing a new drug with regard to the atomic structure of the "target" molecules on which this drug will act is a knowledge-intensive and complex process called drug design [1].

Proteins are also used in various industries – for example, chemical and food, and in the future, energy, and others. The development of new biotechnological enzymes capable of serving for the benefit of society, in addition to knowing the structure of proteins and understanding the mechanisms of their work, also requires the ability to design new functions in proteins that previously performed some other work [3]. Here, however, the ability to solve the inverse problem is required – not to determine the structure of an existing protein, but to create a protein whose structure (and therefore properties) will be set in advance – but solving this problem requires similar knowledge and skills!

What is the difficulty?

Compared to 30-40 years ago, when knowledge about the structure of biological molecules was still extremely limited, and determining the amino acid sequence of insulin or the spatial structure of myoglobin was a genuine scientific breakthrough, the flow of biological information is now growing at a rapid pace year by year. The completion of one genomic project after another [4] has effectively freed researchers from the routine of "classical" sequencing of protein molecules – the sequences of all proteins are derived from the read genomes of many organisms and deposited in annotated databases accessible via the Internet. Thus, the number of sequences in the Swiss-Prot database (version 55.1 of March 18, 2008), curated and annotated manually by specialists (!), is ≈360,000, and the number of records in the TrEMBL database (version 38.1), annotated automatically on the basis of available genomic information, is approaching 5.5 million.

It has become possible to obtain such a fantastic number of sequences thanks to modern high-throughput genome sequencing technologies [5], which make reading the entire (well, almost the entire) DNA of a new species (or even an individual!) merely a matter of time. The situation is different with the determination of the spatial structure of protein molecules: the tools for solving this problem – X-ray diffraction analysis (XRD) and nuclear magnetic resonance (NMR) spectroscopy – have not yet reached the degree of maturity needed to obtain the structure of any protein of interest to researchers with limited expenditure of time and materials.

The difficulty lies in obtaining the necessary amounts of protein, preparing a sample suitable for X-ray diffraction studies or an isotope-labeled sample for NMR measurements, and analyzing the data. Each stage of this task often requires a unique approach and therefore cannot be fully automated. It is especially difficult to characterize the structure of proteins forming large molecular complexes and of integral proteins of biological membranes (which make up to a third of all proteins in most organisms). Therefore, even taking into account the fact that protein structures are solved not only by scientific teams on their own initiative but also by the international consortium PSI (Protein Structure Initiative), whose task is the broadest possible structural characterization of the entire protein diversity of the living world, the number of proteins with known structure remains relatively small. As of March 25, 2008, the number of structures in the Protein Data Bank (PDB) is slightly less than 50,000, but if repeated experiments on the same proteins under different conditions, as well as structures of artificially modified and closely related proteins, are excluded from this set, the number shrinks to less than 10,000, amounting to ≈1-2% of the total number of practically important proteins.

The way out of this situation can be provided by methods of theoretical prediction of the spatial structure, the decisive advantage of which is the relatively high speed and low complexity of obtaining models of the structure of proteins. The flip side of this advantage is the "quality" of the models – the accuracy of prediction, which is not always sufficient for practically important tasks (for example, studying the interaction of a receptor with ligands). However, as already mentioned, in conditions of limited availability of structural data on the object of interest to researchers, the molecular model turns out to be a reasonable substitute – especially considering the fact that more or less realistic models can be built for >50% of all proteins with an unknown structure.

Of course, when working with theoretically predicted protein models, one must be critical of the results obtained and be prepared to verify them using independent methods – which, incidentally, applies to most scientific fields in which the work has not yet turned into pure technology.

Next, we will consider the basic theoretical prerequisites that make the prediction of the three-dimensional structure of protein molecules possible and, in general, the main techniques used today in this field.

Folding: Is it possible to predict the structure of a protein on a computer?

Folding – the folding of proteins (and other biomacromolecules) from an extended conformation into the "native" form – is a physicochemical process through which proteins in their natural "habitat" (solution, cytoplasm or membrane) acquire the spatial fold and functions unique to them [6]. Folding is ranked among the greatest unsolved scientific problems of our time, since this process is still far from fully understood [7].

From a thermodynamic point of view, protein self-folding is the transition of the protein molecule into its statistically most probable conformation (which can practically be equated with the conformation of lowest potential energy). The kinetics of folding is associated with the so-called Levinthal paradox [8]: if a protein molecule of even just 100 amino acid residues were to "try out" all possible conformations before folding into its native form, this process would require a time exceeding the lifetime of the Universe. Yet it is known from practice that the maximum folding time is limited to minutes, the typical time is on the order of milliseconds, and the shortest time on record – for a three-stranded β-sheet – is only 140 ns [9]!
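The arithmetic behind the paradox is easy to reproduce. The numbers below (three backbone conformations per residue, 10^13 conformations sampled per second) are illustrative assumptions commonly used in back-of-envelope versions of the argument, not values taken from the text:

```python
# Back-of-envelope illustration of the Levinthal paradox.
# Assumed (illustrative) numbers: ~3 backbone conformations per residue,
# ~10**13 conformations sampled per second.
n_residues = 100
conformations = 3 ** n_residues              # ~5e47 possible conformations
rate = 1e13                                  # conformations tested per second
seconds = conformations / rate
years = seconds / (365.25 * 24 * 3600)
age_of_universe = 1.4e10                     # years, approximate
print(f"Exhaustive search: ~{years:.1e} years")
print(f"That is ~{years / age_of_universe:.1e} times the age of the Universe")
```

Even with these generous sampling assumptions, the exhaustive search comes out many orders of magnitude longer than the age of the Universe, while real proteins fold in milliseconds – which is exactly the point of the paradox.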

Of course, Levinthal's paradox is only apparent. Its resolution lies in the fact that the molecule never actually visits the vast majority of theoretically possible conformations. The cooperative effects of folding – the simultaneous formation of "nuclei" of secondary structure, which are energetically stable and no longer change in the course of further folding – lead to the protein molecule finding the "shortest path" on an imaginary potential-energy hypersurface to the point corresponding to the native conformation. At the same time, the native conformation is separated by a noticeable "energy gap" from the overwhelming number of unfolded forms, and its immediate "neighborhood" (a very "narrow" one, however) determines the natural conformational mobility of the molecule.

The limited understanding of folding mechanisms is also due to the fact that it is difficult to observe experimentally: this is a fairly fast dynamic process that needs to be "examined" at the level of individual molecules! And although they are already studying folding (or rather, unfolding) on individual molecules [10], this has not yet led to a fundamentally new level of understanding of the folding mechanism – and such an understanding could give an effective algorithm for theoretical modeling of this process.

Biological molecules are most often modeled using the empirical force fields approach [11], which, unlike the "absolutely correct" quantum chemical approach (see box), allows the energetic characteristics and dynamic properties of biomacromolecules to be calculated within a reasonable time. However, such a radical speed-up does not come for free: although many computer experiments in empirical force fields give realistic results, some cooperative interactions important for folding – such as the hydrophobic effect or the influence of solvent molecules – do not reduce to pairwise interactions between individual atoms and cannot be correctly taken into account in this approach.


Quantum chemistry in the calculation of the properties of protein molecules

As is well known, the Schrödinger equation – the "flesh and blood" of quantum physics and chemistry – is the most accurate way available today to describe the structure and dynamics of molecules. However, an exact (analytical) solution can be obtained only for extremely simple systems – for example, the hydrogen atom. In all more complex cases, one resorts to numerical solutions of approximations of this equation – the so-called semi-empirical methods of quantum chemistry.

In protein modeling, the most these semi-empirical methods are usually used for is optimizing the geometry and charge state of the residues of a protein's reaction center, because larger systems become "unaffordable" for these extremely complex and resource-intensive approaches.

Methods of empirical force fields (such as molecular dynamics [11]) have nothing to do with quantum chemistry and treat the atoms of the simulated molecules (in particular, proteins) as classical elastic particles bound by a system of pairwise interactions. The parameters of these interactions (very simple ones, it should be noted) are called the force field and determine the behavior of the system during the simulation.

Electronic effects such as atom polarizability, electron transfer, formation and breaking of chemical bonds, as well as cooperative hydrophobic interactions cannot be modeled in this approach.


There are two main obstacles to simply running a molecular dynamics (MD) simulation of a protein in the appropriate environment "in silico" and watching it fold, obtaining the desired structure at the end of the process. Firstly, characteristic folding times are still at the level of milliseconds, while the maximum attainable simulation time at the current stage of computer technology rarely exceeds one microsecond. But even if we imagine that computer power were unlimited, doubts remain about the ability of modern energy functions to cope with folding – the accuracy of these functions, which govern the evolution of the molecule inside the computer, may be insufficient to steer folding in the right direction. In addition, an algorithm simulating mobility can trap a molecule indefinitely in a local energy minimum, something that never happens in real folding. (However, there are some successes in modeling folding with molecular dynamics: small proteins – like the 36-amino-acid fragment of villin – can be folded in MD simulations about a microsecond long, running the calculations on a supercomputer or in a distributed computing network [12].)

So, using the molecular dynamics method to model the folding process itself is currently impractical. However, it is possible to predict the result of folding – that is, the three-dimensional structure of the protein. Theoretical approaches serving this purpose fall into two large groups: "ab initio" (or "de novo") folding techniques, which do not explicitly use data on the structure of other proteins, and comparative modeling (or homology-based modeling). Below, both groups are considered in more detail, with greater emphasis on the latter as the one that takes the phenomenon of protein evolution into account.

Folding "from the first principles"

It should be noted at once that the term "ab initio" folding, often used to denote methods of computer prediction of protein structure without using structural data on other proteins, has no relation to the "ab initio" of quantum chemistry. The quantum chemical term "ab initio" (Latin: "from first principles") denotes the calculation of the properties of molecules by solving the Schrödinger equation (more precisely, one of its approximations), whereas in the field of protein structure modeling the same term means only that the prediction does not explicitly use information about the structures of other proteins. All calculations, however, are usually performed in empirical force fields describing pairwise interactions in a classical particle system representing the protein molecule. These force fields themselves implicitly include data on the structure of molecules (not necessarily proteins) – such as partial charges and atomic masses, as well as bond lengths and valence angles – and have nothing to do with quantum mechanical methods. It will therefore be preferable to use the term "de novo" folding (Latin: anew, from the beginning) in what follows.

The most "physically correct" approaches from this group consist mainly in MD calculations for modeling the folding process and result (see three paragraphs above), however, these methods, due to their enormous computational complexity and inaccuracy of potential energy functions, achieve success only for some very small proteins. In other cases – also, however, related to small proteins (no more than 150 amino acid residues) – additional approximations are resorted to in order to reduce the computational complexity of the calculation.

To increase computational efficiency, de novo approaches often use simplified representations of the protein: individual amino acid residues are not represented in as much detail as in "all-atom" approaches – the entire side chain is modeled by only one or two centers ("pseudo-atoms"). For example, the side chain of tryptophan contains 16 atoms, while in the simplified form it may be reduced to two or three pseudo-atoms (and to just one for smaller residues).

De novo folding is carried out in a special force field (also simplified compared, for example, with those used in MD), evaluating a huge number of candidate folds of the chain by their potential energy. The appearance of a conformation significantly (with a "gap") lower in potential energy than the others can serve as a sign that the search is over – just as the native conformation is separated from the unfolded intermediate states by a noticeable margin.

Of course, in addition to the correct potential energy function, it is necessary to overcome the "combinatorial explosion" created by the Levinthal paradox. Obviously, it is impossible to sort through all the conformations in order to choose the lowest in energy, and due to a poor understanding of the mechanisms of protein folding, it is not yet possible to repeat the "shortest path" that leads to the native structure on a computer.

In order to come at least somewhat closer to the natural folding mechanism, researchers try to identify structurally conserved fragments in the sequence of the protein being modeled (analogous to those that fold first in nature and remain unchanged thereafter) and, as it were, "assemble a mosaic" from these fragments. Although still extremely resource-intensive (an astronomical number of variants must be tried!), this procedure significantly reduces the calculation time, and encouraging results have already been obtained for small proteins (Fig. 1).

Figure 1. De novo folding: prediction of the spatial structure of small proteins [14]. The Rosetta program generates an ensemble of models obtained after "assembling" structurally conserved fragments of the molecule in a specialized force field. Short (4-10 amino acid residues) fragments of the sequence of the protein being modeled act as "nuclei" of the structure of the future model (moreover, they differ and "overlap" between models), and the conformations of these fragments are "assigned" using conformations of homologous fragments from proteins of already known structure. (In this sense, "de novo" is not modeling "from scratch" in the full sense of the word, but "borrowing" local structural fragments of such short length is not, in this case, considered using the structure of homologous proteins as a whole.)
The figure above shows the superimposed experimental structure of the Hox-B1 protein (in red) and the corresponding low-energy structure predicted by the Rosetta program (in blue). An almost perfect coincidence of the conformations of aromatic residues in the central region of the protein is visible. Below is the dependence of the energies of the models in the computed ensemble on their root-mean-square deviation (RMSD) from the native structure. (RMSD is used in molecular modeling as a measure of the spatial proximity of two models: a low RMSD (<1-2 Å) indicates that two structures are close.) Blue marks models generated from the native structure as a "control" (which, naturally, turned out to be very close to it in RMSD); black marks models created in the process of prediction. The red arrow marks the model whose structure is shown above.
In this case, although there is no clear dependence of RMSD on energy, the lowest-energy conformation turned out to be very close to the native structure; however, another model with an energy very close to optimal already has an RMSD of 4 Å from the native structure (which is quite a lot). This fact illustrates the limited reliability of such predictions in practical applications – in real problems, when the predicted structure is genuinely unknown, there is nothing to compare the models' RMSD against, and one has to be guided by energy values alone.
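The similarity measure used in the figure – the root-mean-square deviation between corresponding atoms of two structures – can be sketched in a few lines. This is a minimal illustration that assumes the two structures are already optimally superimposed; a real comparison first performs the rigid-body alignment step (e.g. the Kabsch algorithm), which is omitted here:

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD (in the same units as the input, e.g. Å) between two
    equally long lists of (x, y, z) atom coordinates.
    Assumes the structures are already superimposed; the rigid-body
    alignment step (e.g. the Kabsch algorithm) is omitted."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Toy example: three Cα atoms of a "native" structure and a "model".
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model  = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 1.0)]
print(f"RMSD = {rmsd(native, model):.2f}")
```

In practice RMSD is usually computed over Cα or backbone atoms only, precisely so that values like the <1-2 Å threshold mentioned above are comparable between proteins.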

One of the research teams actively engaged in de novo protein structure prediction is the laboratory of David Baker at the University of Washington (Baker is also a professor at the Howard Hughes Medical Institute). The Rosetta program developed there has repeatedly performed well in predicting the structure of small proteins (Fig. 1) – ~100-150 amino acid residues [13] – as well as in designing enzymes with new functions [3].

A similar approach is used in the TASSER program [15], where short structural fragments are "assembled" in a specialized force field, and the result (a model presumably close to the native one) is selected from the ensemble of predictions by identifying the densest structural cluster – which, according to researchers, is a "nest" of physically realistic models.

The methods mentioned are very demanding of computational resources – predicting the structure of a 112-residue protein with the Rosetta method [13] required a supercomputer and the distributed Rosetta@Home network of about 70,000 personal computers. (Of course, all this power went into more than a single structure – the study covered more than one protein.) This resource intensity underscores once again that the understanding of folding mechanisms is not at its best: a way to move toward the native structure without sifting through a mass of unrealistic variants has not yet been found. And the potential energy functions often err: behind each successful prediction that becomes the occasion for a publication in one of the leading journals [13-17] stand a great many unsuccessful attempts!..

But there is also a use for predictions of not very high accuracy: after all, the algorithms mentioned can not only predict a structure "from scratch" but also optimize a model when an experimental structure requiring refinement – for example, an NMR model or cryo-electron microscopy data – is given as the starting point. In addition, predicting the structures of all the proteins of an organism in a row can help identify proteins with an unknown type of fold, so that experimenters can concentrate on them and "solve" the structure of yet another structural family.

So, de novo folding techniques for small proteins have already reached a certain maturity [17], and the ability to create "from scratch" a protein with a fold not found in nature [18] further underscores the potential of this field – after all, not every sequence is even capable of folding!

However, for longer proteins, the success of de novo approaches is still more than modest, and it is no longer possible to predict the structure of such proteins without using additional information and empirical approaches. And then Nature itself comes to the rescue – after all, proteins are not independent of each other, and there are "kinship" relations between them! Prediction of protein structure using these relationships is called comparative modeling, or homology-based modeling.

Comparative modeling

The "universe" of proteins is large (as already mentioned, more than five million proteins have been identified in the genomes of many organisms to date), but it is not unlimited. Many proteins share typical motifs of spatial organization – that is, they belong to particular families, forming "related" groups. All proteins with known structure fall into ≈3,500 structural families forming ≈1,000 types of spatial fold (according to SCOP – Structural Classification of Proteins).

The "kinship" between proteins (usually measured by the degree of identity of their amino acid sequences) is not accidental: one of the most common hypotheses of protein evolution explains the "kinship relationship" by gene duplication that occurred sometime during the evolution of the organism and led to the appearance of a protein with a new function [19]. And, although the "new" protein acquires a different function, and its sequence gradually evolves and changes, its spatial structure remains quite conservative until some point [20]!

It has been empirically established that if the sequences of two proteins are more than 30% identical to each other, the proteins are almost certainly "relatives", and the degree of evolutionary divergence is not yet so great that their structures have lost their commonality. These observations form the basis of a structure-prediction technique called homology-based modeling.
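The "degree of identity" mentioned here is computed over a pairwise alignment of the two sequences. A minimal sketch, assuming the alignment is already built ('-' marks a gap; a real pipeline would first construct the alignment, e.g. with a Needleman-Wunsch implementation):

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over a gapped pairwise alignment.
    Both strings must have the same (aligned) length; '-' marks a gap.
    Positions where either sequence has a gap are excluded."""
    assert len(seq_a) == len(seq_b)
    aligned = [(a, b) for a, b in zip(seq_a, seq_b)
               if a != '-' and b != '-']
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)

# Toy alignment of two short (hypothetical) fragments:
ident = percent_identity("MKT-AYIAKQR", "MKSEAYLAKQ-")
print(f"{ident:.0f}% identity")
```

Note that conventions differ in practice (identity may be normalized by alignment length, by the shorter sequence, or by aligned columns, as here), which is one reason the 30% "barrier" should be treated as a rule of thumb rather than a sharp cutoff.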

Modeling based on homology

At the moment, homology modeling makes it possible to establish the structure of more than half of the proteins whose structure is still unknown. If targets for experimental structure determination are chosen so that each protein has at least one structural homologue (with sequence identity >30%), it turns out that only about 16,000 structures need to be solved [21] for the "degree of coverage" to exceed 90%, including membrane proteins. Homology modeling will then help establish the structures of most of the remaining proteins.

The homology modeling process [22, 23] includes several steps (Fig. 2), the main ones being the search for a structural template and the construction of the amino acid alignment. The decisive factor determining the quality of the resulting models is the degree of homology (or identity) between the sequences of the modeled protein and the template. High identity means that the evolutionary divergence of the two proteins from their common "ancestor" occurred recently enough that they have not yet lost their structural commonality.


Figure 2. Homology modeling scheme using the example of the human MT 1 melatonin receptor [24]:

  • Identification of a structural template – a protein with a known spatial structure, homologous to the modeled one (sequence identity >30%). The search is performed using FASTA or PSI-BLAST servers (or their analogues) in the PDB protein structure database (a single repository of structural data for biomacromolecules);
  • Construction of the template-model amino acid sequence alignment. The pairwise alignment serves as an "instruction" to the programs that perform the modeling. Multiple alignment is useful for identifying residues conserved across the entire family (marked with an asterisk) or in individual subfamilies of proteins (the top three sequences are melatonin receptors). Multiple alignments and sequence profiles make it possible to detect weaker homologies than "ordinary" pairwise alignment. Alignment is carried out using the CLUSTALW server (or its analogues);
  • The construction of the model consists mainly in "stretching" the sequence of the simulated protein (melatonin receptor MT 1) onto the "backbone" of the template (visual rhodopsin) according to alignment. The "loop" sections (which have no homology with the template) are completed independently, the position of the side chains is optimized using methods of empirical force fields. In the first transmembrane segment of the superimposed structures of the model and template, side chains of residues "highlighted" on alignment are shown.
    Modeling is carried out using the Modeller program (and similar ones) or the Swiss-Model server (and similar ones). The online databases ModBase and Swiss-Model Repository contain automatically constructed models for all proteins from the Swiss-Prot database for which it is possible to find a structural template;
  • Quality assessment, optimization and use of the model. The most difficult stage of homology modeling is optimizing the model taking into account all available biological information on the protein being modeled. In general, modeling a structure by homology with a protein that performs a different function cannot automatically yield a model suitable for practically important tasks. Careful optimization is required to turn the "blank" (which is what the "zero-approximation" model really is) into a working tool – a task that depends more on the intuition and experience of the researcher than on specific computer techniques.

In the course of the worldwide "competition" in protein structure prediction – CASP (Critical Assessment of Techniques for Protein Structure Prediction), held biennially for about 15 years now – it has emerged that 30% identity can be considered an empirical "barrier". That is, if two protein sequences are more than 30% identical, their structures will most likely be similar and the quality of the final models satisfactory (Fig. 3). If the homology is lower, the accumulated structural differences are most likely already too large for accurate modeling, or – worse – there is no real homology between the two proteins at all, and the observed level of sequence identity is merely coincidental.

Figure 3. Quality and scope of suitability of computer models of proteins based on various degrees of homology [16, 22]. The higher the identity of the sequences of the modeled protein and template, the more high–quality models are obtained, and the scope of their suitability expands to applications sensitive to the exact location of atoms – such as the explanation of the catalytic mechanism, the docking of ligands and the development of new drugs.
The vertical axis shows the target-template sequence identity over the alignment. Techniques capable of detecting the corresponding level of homology are indicated to the left of the vertical arrows. The right-hand side lists possible applications of the models; the "roles" available to models built on low homology are, of course, also open to the more "high-quality" structures. To the left of the scale, the typical accuracy of the models is indicated (the root-mean-square deviation from the "native" structure and the proportion of model residues meeting this quality are given). The left part of the figure shows superimposed crystallographic structures of several nuclear receptors at various identities to the progesterone receptor (shown in red at the top and in each superposition): 54% – glucocorticoid receptor (green), 24% – estrogen receptor α (purple), 15% – triiodothyronine receptor (blue) [22]. The comparison shows that, although structural similarity is undoubtedly higher the higher the sequence identity, there is a conserved structural motif within this receptor family that persists even in proteins with low sequence homology.

Low homology (<30% identity) often can no longer be reliably detected by pairwise sequence alignment: too many accumulated substitutions "mask" the sequence of a protein that may still retain a certain structural similarity to some known "template" protein. In such cases, sequence profile search techniques are often used: the sequence database is queried not with a single sequence but with a profile built from a multiple alignment, a kind of meta-sequence encoding the evolutionary variability of the protein [25]. With this technique it is sometimes possible to identify a structural template suitable for modeling even when sequence identity with it is only 10–15%. If no structural homologue can be found either with "traditional" homology search approaches or with profiles, the only way to obtain a prediction is the de novo methods already mentioned above.
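The profile idea can be sketched in miniature (an assumed toy version, not from the article: real profile methods such as PSI-BLAST or HMMER use log-odds scores against background frequencies, while this sketch uses raw per-column frequencies):

```python
# Toy "meta-sequence": per-column residue frequencies of a multiple
# alignment, used to score how well a query fits the family.
from collections import Counter

def build_profile(msa: list[str]) -> list[dict]:
    """Per-column residue frequencies of a gapless multiple alignment."""
    profile = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile

def score(profile: list[dict], query: str) -> float:
    """Sum of the query residues' frequencies at each profile position."""
    return sum(pos.get(res, 0.0) for pos, res in zip(profile, query))

msa = ["MKTA", "MKSA", "MRTA"]   # toy multiple alignment of one family
prof = build_profile(msa)
# A query matching the family consensus scores higher than an unrelated one:
s_close = score(prof, "MKTA")
s_far = score(prof, "GGGG")
```

Because the profile captures which positions tolerate substitutions and which do not, it detects family membership that a single-sequence comparison would miss.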

The predicted protein structures find quite diverse applications (Fig. 3) and are useful at various stages of the drug development process [1] (Fig. 4).

Figure 4. Application of theoretical protein models in the development of new drugs [22]. The growing amount of structural information accelerates not only the identification and optimization of lead compounds, but also earlier stages, such as the selection of a target for pharmacological action and verification of its involvement in the processes under study (target validation).

Limitations of comparative modeling

In some cases the fundamental premise of homology modeling, "similar sequences fold into similar structures", is violated. Proteins whose sequences are almost identical and differ by only a few substitutions can sometimes adopt different conformations. Some proteins swap domains upon di- or oligomerization, so that the structure of a monomer within the oligomer and that of an isolated monomer are completely different. Behind these phenomena are very subtle effects accompanying protein folding, whereby small changes in the sequence or in the molecular environment stabilize different conformations of the protein. Alas, predicting such events is so far completely beyond the reach of comparative modeling and of all other theoretical methods of structure prediction.

In general, as the analysis of "blind" structure predictions shows, in the vast majority of cases the model built by homology turns out to be no closer to the native structure than the template it was based on [26], at least as far as the fold of the protein backbone is concerned. This is hardly surprising: the template structure cannot contain the distinctive features of the modeled protein, and the optimization methods employed tend to move the model away from the native structure rather than toward it, again because of the imperfection of modern empirical force fields, which are unable to reproduce the subtle conformational phenomena occurring "near" the native structure. Attempts are being made to overcome this flaw by allowing optimization of the mutual arrangement of segments of the model's backbone to proceed only along "evolutionarily permitted directions" extracted from the family of structures of related proteins [27], but this approach has not yet come into wide use.

Competitive spirit (is there any progress in structure modeling?)

In 1993, a "competition" was held for the first time among members of the scientific community engaged in modeling the spatial structure of proteins: the Critical Assessment of Techniques for Protein Structure Prediction (CASP).

The purpose of this competition, held every two years since then, is to record progress in this high-tech field. So that participants are not tempted to fabricate results, proteins with genuinely unknown structures are brought to the "start": the experimentalists studying them either have not yet completed work on the structures or have promised on their word of honor not to disclose the results until the end of the "race". When all models from all participants have been received and the "correct answers" posted online, the winner is determined and a special issue of the journal Proteins is published describing the participants' achievements [26]. And, by the way, it was from the results of several CASP rounds that the empirical 30% "milestone" was established: below this level of sequence identity begins the "twilight zone", in which finding a template for modeling is all the harder because there can be no certainty that one exists at all.

A similar test, the Critical Assessment of Fully Automated Structure Prediction (CAFASP), is conducted for servers offering fully automatic modeling via the Internet. The "competition" among servers makes it possible to exclude the human factor and compare the technologies themselves.

And, would you believe it, the advantage is still firmly held by humans, whose success depends largely on intuition and informal experience rather than on the technologies and programs they use. Among the servers, another pattern stands out: the so-called meta-predictors, robots that do not model protein structures themselves but collect results from other servers on the Internet and combine those predictions into their own, produce results that are on average more accurate than those of "lone" servers. The mechanisms of both this electronic "intuition" and the expertise of human scientists have yet to be formalized; doing so may bring us one more step closer to understanding protein folding and to the ability to predict structures correctly.
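The meta-predictor idea can be reduced to a per-position majority vote, sketched here with hypothetical data (not from the article): three imaginary servers return secondary-structure strings for the same sequence, and the consensus takes the most common state at each residue.

```python
# Toy meta-predictor: column-wise majority vote over several servers'
# secondary-structure predictions (H = helix, E = strand, C = coil).
from collections import Counter

def consensus(predictions: list[str]) -> str:
    """Most common state per column across equal-length prediction strings."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*predictions))

server_outputs = ["HHHHCCEE",   # hypothetical server 1
                  "HHHCCCEE",   # hypothetical server 2
                  "HHHHCCCE"]   # hypothetical server 3
combined = consensus(server_outputs)
```

Errors made by individual servers at different positions tend to be outvoted, which is one intuition for why consensus predictions are on average more accurate than any single server.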

Proteomic modeling

Although the accuracy of fully automatic modeling, as a rule, leaves much to be desired (both in absolute terms and in comparison with models obtained "by hand"), progress in the development of high-throughput prediction methods is inevitable. First, such methods gather all the accumulated experience into a single technological platform that can be used, including via the Internet, by researchers who do not themselves practice molecular modeling. Second, the "robots" are indefatigable, which allows them to build models of a huge number of proteins, for example, all the proteins identified in the genome of a single organism, something that would hardly be feasible for humans (unless one counts the illegal exploitation of Asian students and graduate students).

There already exist Internet resources containing computer models of a huge number of proteins obtained automatically by such large-scale "genomic-proteomic" modeling, among them the already mentioned ModBase and Swiss-Model Repository. While these databases contain models based mainly on homology with structures from the PDB, similar initiatives using de novo "predictors" (the Rosetta and TASSER programs mentioned above) also model poorly studied proteins that have neither structural homologues nor a well-defined cellular function. Besides the modeling itself, de novo predictions can further assist structural genomics projects by pointing to proteins with previously unseen folds, which are therefore priority "candidates" for experimental study (as part of the strategy of structural genomics projects).

The purpose of such large-scale modeling is consonant with the goals of the global structural genomics project, aimed at obtaining the three-dimensional structures of all known proteins, whether by direct experiment or by computer calculation. The strategy for choosing priority targets for experimental study is designed to "supply" structural templates for almost all known proteins, because, even despite the enormous efforts of structural biologists, the structures of the overwhelming majority of proteins will be modeled rather than determined experimentally.

Unhealthy skepticism

In conclusion, a small fly in the ointment should be added to the rosy prospects of using computer models in practically important scientific tasks. Peter Moore, one of the leading experts on ribosome structure, in his essay in the November 2007 issue of Structure, titled after the once-popular Gershwin song "Let's Call the Whole Thing Off" [28], expresses skepticism about the Protein Structure Initiative (PSI) funded by the US National Institutes of Health (NIH). Moore believes that the chosen strategy, determining the structures of as many proteins as possible with an emphasis on new structural motifs, even when the functions of the corresponding proteins are still unknown, is inherently flawed. In his view, the program's rather large budget would be better spent supporting individual scientists studying the structures of proteins whose practical significance is already obvious today, rather than expecting that the needed structures, when required, can be obtained from theoretical calculations.

He argues that if a laboratory's scientific plans depend on the structure of a protein, then relying on a computer model (which is most likely all one will have, because even if the goals of the PSI are met, most proteins will still be modeled rather than studied experimentally) would be very rash. "If your laboratory starts a study based on knowledge of a protein's structure, and all you have is a computer model obtained from its sequence, wouldn't it be better to set out to obtain an experimental structure of this protein before starting the work? I think you would be simply crazy not to," Moore writes. "It doesn't matter whether the starting model of your protein was obtained by comparative modeling or not: you will still need to refine it to arrive at the exact positions of all the residues, and you will have to do so with empirical force fields. But these approaches are based on pairwise interactions of atoms, which is simply not true! In the condensed phase, the polarization of atoms significantly affects the behavior of the system, and there is no way for you to take this into account... Only the most accurate atomic models, in which the positions of individual atoms are determined to 0.5 Å or better, deserve to be called 'structures', and only they can be useful for high-precision scientific tasks based on knowledge of protein structure." Peter Moore believes there is no point in determining as many protein structures as possible just because they are organized in ways not yet described, since in a real study, to verify the accuracy of a molecular model, the structure of the protein of interest will have to be determined anyway.
"But if you are going to do that [determine the structure], why bother modeling it by homology with structures obtained by the PSI? That is why I doubt the 'paradise' promised by the PSI organizers will ever come. Preved, Sisyphus!" exclaims Moore, hinting at the futility of this international program's efforts. (Translation mine. – A. Ch.)

In any case, computer predictions are already of some use; whether they will ever become a reliable substitute for experimental methods, the future will show.

Literature

1. Biomolecule: "Drug design: how new medicines are created in the modern world";


2. Nobel laureates. Christian Anfinsen. Electronic library "Science and Technology";
3. Biomolecule: "Designer enzymes in the service of society";
4. Biomolecule: "Human genome: how it was and how it will be";
5. Biomolecule: "454-sequencing (high-performance DNA pyrosequencing)";
6. Finkelstein A.V., Ptitsyn O.B. Protein Physics: A Course of Lectures with Color and Stereoscopic Illustrations and Problems with Solutions. Moscow: Universitet, 2005 (see the course of lectures on protein physics on the website of the Institute of Protein Research in Pushchino);
7. Dill K.A., Ozkan S.B., Weikl T.R., Chodera J.D., Voelz V.A. (2007). The protein folding problem: when will it be solved? Curr. Opin. Struct. Biol. 17, 342-346 (online);
8. Levinthal C. (1968). Are there pathways for protein folding. J. Chim. Phys. 65, 44-45 (pdf, 8 KB);
9. Xu Y., Purkayastha P., Gai F. (2006). Nanosecond folding dynamics of a three-stranded beta-sheet. J. Am. Chem. Soc. 128, 15836-15842 (online);
10. Biomolecule: "Folding 'in person'";
11. Biomolecule: "Molecular dynamics of biomolecules. Part I. The history of half a century ago";
12. Zagrovic B., Snow C.D., Shirts M.R., Pande V.S. (2002). Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J. Mol. Biol. 323, 927-937 (online);
13. Biomolecule: "New advances in predicting the spatial structure of proteins";
14. Bradley P., Misura K.M.S., Baker D. (2005). Toward High-Resolution de Novo Structure Prediction for Small Proteins. Science 309, 1868-1871 (online);
15. Zhang T., Skolnick J. (2004). Automated structure prediction of weakly homologous proteins on a genomic scale. Proc. Natl. Acad. Sci. U.S.A. 101, 7594-7599 (online);
16. Baker D., Šali A. (2001). Protein Structure Prediction and Structural Genomics. Science 294, 93-96 (online);
17. Schueler-Furman O., Wang C., Bradley P., Misura K., Baker D. (2005). Progress in Modeling of Protein Structures and Interactions. Science 310, 638-642 (online);
18. Kuhlman B., Dantas G., Ireton G.C., Varani G., Stoddard B.L., Baker D. (2003). Design of a Novel Globular Protein Fold with Atomic-Level Accuracy. Science 302, 1364-1367 (online);
19. Biomolecule: "Where did the vision come from";
20. Lesk A.M., Chothia C. (1986). The response of protein structures to amino-acid sequence changes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 317, 345–356;
21. Vitkup D., Melamud E., Moult J., Sander C. (2001). Completeness in structural genomics. Nat. Struct. Biol. 8, 559-566 (online);
22. Hillisch A., Pineda L.F., Hilgenfeld R. (2004). Utility of homology models in the drug discovery process. Drug Discov. Today 15, 659-669 (online);
23. Ginalski K. (2006). Comparative modeling for protein structure prediction. Curr. Opin. Struct. Biol. 16, 172-177 (online);
24. Chugunov A.O., Chavatte P., Farce A., Efremov R.G. (2006). Differences in binding sites of two melatonin receptors help to explain their selectivity to some melatonin analogs: a molecular modeling study. J. Biomol. Struct. & Dynamics 24, 91-108 (online);
25. Dunbrack R.L. Jr. (2006). Sequence comparison and protein structure prediction. Curr. Opin. Struct. Biol. 16, 374-384 (online);
26. Tress M., Ezkurdia I., Graña O., López G., Valencia A. (2005). Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins 61 Suppl. 7, 27-45 (online);
27. Qian B., Ortiz A.R., Baker D. (2004). Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc. Natl. Acad. Sci. U.S.A. 101, 15346-15351 (online);
28. Moore P. (2007). Let’s call the whole thing off: Some thoughts on the Protein structure initiative. Structure 15, 1350-1352 (online).

Portal "Eternal youth" www.vechnayamolodost.ru
27.03.2008
