20 February 2008

454-sequencing (high-performance DNA pyrosequencing)

Pavel Natalyin, "Biomolecule".

Speed is one of the main advantages of the new sequencing method. Is that why the method's name echoes the legendary 1970 Chevrolet Chevelle SS 454 with its 360-horsepower engine? Collage based on an image by dotsara @ Flickr.

A new generation of DNA sequence decoding technologies, which allows the reading of genetic texts with unprecedented speed and productivity, has found wide application in biomedical research and has become a prerequisite for impressive scientific achievements.

Table of contents

  • The classical approach to decoding DNA sequences

  • The principle of high-performance DNA pyrosequencing
  • In the service of all progressive humanity
  • The prospects
  • Literature


Dictionary

A DNA microarray is a small surface on which fragments of single-stranded synthetic DNA with known sequences are deposited at high density in a defined order.

These fragments act as probes to which complementary DNA chains from the test sample, usually labeled with a fluorescent dye, hybridize (form double-stranded molecules). The more DNA molecules with a given sequence are present in the sample, the more of them will bind the complementary probe, and the stronger the signal will be at the spot of the microarray where that probe was "planted". After hybridization, the surface of the microarray is scanned, so that each DNA sequence ends up associated with a signal level proportional to the number of DNA molecules with that sequence in the mixture. DNA microarray technology finds a wide variety of applications in modern biology and medicine for the analysis of complex DNA mixtures, for example, the totality of all transcripts (messenger RNAs) in a cell.

PCR, the polymerase chain reaction, is an enzymatic in vitro DNA replication reaction catalyzed by a thermostable DNA polymerase. Developed in 1983 by Kary Mullis, the reaction is now used in all areas of modern molecular biology. It consists of repeated cycles during which the temperature of the reaction mixture is changed stepwise to control the reaction stages. First, the strands of the double-stranded DNA template are separated at high temperature (94 °C; the denaturation stage). The temperature then drops to 55-65 °C, and each single-stranded fragment hybridizes with a complementary oligonucleotide primer (the annealing stage). The temperature then rises to 72 °C (the optimum of the thermostable DNA polymerase), and the enzyme extends the primer to the end, creating a full double-stranded copy of the template molecule (the synthesis stage). Since at the end of each cycle two DNA molecules have formed from each one (both strands are copied), the number of molecules grows exponentially, hence the name "chain reaction". With the help of PCR it is possible to obtain a usable amount of material from a single template DNA molecule.
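The exponential arithmetic behind the name "chain reaction" is easy to sketch (an idealized model assuming perfect doubling every cycle; real reactions eventually plateau as reagents run out):

```python
def pcr_copies(n_cycles: int, start: int = 1) -> int:
    """Number of double-stranded DNA molecules after n ideal PCR cycles.

    Each cycle copies both strands of every molecule, doubling the count.
    """
    return start * 2 ** n_cycles

# A single template molecule after 30 ideal cycles:
print(pcr_copies(30))  # 1073741824, i.e. about a billion copies
```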

Microfluidics is an interdisciplinary field of research that emerged in the early 1980s at the intersection of physics, chemistry, biology and microengineering. It studies the behavior of micro- and nanoliter volumes of liquid confined to submillimeter dimensions. Under such conditions liquids show a number of interesting properties: factors such as surface tension, energy dissipation and hydraulic resistance begin to dominate the system. Turbulent flow practically disappears (only laminar flow remains), so mixing two liquids is difficult and occurs mainly by diffusion. Applied microfluidics deals with the design of various devices, from inkjet printers to high-performance liquid chromatographs and "microchip laboratories" ("labs on a chip").

A charge-coupled device (CCD sensor) converts electromagnetic radiation of the optical range into an electrical signal. It has high sensitivity, resolution and speed, and is used as a photodetector in video cameras, digital cameras, scanners, etc. Structurally it is a matrix (multi-element) device comprising one or more rows of photosensitive micro-elements.

Frederick Sanger (b. April 18, 1918), OM, CH, CBE, FRS, is an English biochemist and luminary of molecular biology, twice winner of the Nobel Prize in Chemistry: for determining the amino acid sequence of insulin (1958) and for developing a DNA sequencing method (1980). By all accounts, an unusually modest and charming man.


The amazing successes of modern biology have largely been driven by the rapid progress of biological instrumentation. Automation of routine procedures, miniaturization, and the integration of separate modules into multifunctional systems have led to a rapid increase in the productivity of a single biological experiment and, more broadly, have raised research to a qualitatively new level. The active borrowing of design solutions from other fields of technology has greatly facilitated and accelerated this process. Thus, inkjet-printer technology was used to build the machines that "printed" the first DNA microarrays. More generally, nanotechnologies originally developed for the electronics industry, combined with the achievements of microfluidics, are used to produce instruments for biomedical research and already today make it possible to create "microchip laboratories" ("labs on a chip").

Perhaps the most striking example of a breakthrough in biology that would have been impossible without the appropriate technology is the decoding of the genomes of an ever-expanding group of organisms. Today perhaps only deaf pensioners have not heard of the Human Genome project [1]. (In the fall of 2005 I happened to visit the Sanger Institute in Cambridge, where a significant part of the human genome was decoded. There is a laboratory there the size of a basketball gym, packed with automatic sequencers continuously working with 384-well microtiter plates... I wonder if they have switched to 1536-well plates yet?) Automation of the sequencing process made it possible to read 3,253,037,807 base pairs of human DNA. And it allowed scientists to go even further.

Go to http://www.ensembl.org and you will see that new species with partially or completely decoded genomes appear there almost every month. It is impossible to imagine modern biology (not only molecular biology and biochemistry, but also systematics, evolutionary theory, anthropology, and, after all, medicine!) without megabytes of read DNA sequences, the flesh and blood of bioinformatics, the most dynamically developing field of biological science.

It is not just computers that are getting better and cheaper every day. Prices for genome decoding are also in free fall. The first draft of the human genome, completed in 2001, cost about $300 million (and the final version, together with the technologies that made it possible, about $3 billion). The decoding of the third primate genome, that of the rhesus macaque Macaca mulatta, a draft of which was obtained in February of last year, cost only $22 million [2]. It is expected that in the near future at least one biotech company will "finish reading" a mammalian genome (i.e., a large, complex genome) for only $100,000. That is a 3,000-fold reduction in cost in just 6 years! And this is not the limit: technologies will likely soon be available that reduce the price of a genome sequence to a few thousand dollars.

Scientific laboratories and biotech companies are actively competing to be the first to deliver the "$1,000 genome". The result of this fierce competition is rapid technological development and falling costs of DNA sequence decoding. The group that first reads a human genome for $1,000 will receive instant recognition and reward: in September 2003 the Craig Venter Science Foundation promised $500,000 for such an achievement. Later, to attract as many researchers as possible to the problem, the Venter Foundation joined forces with the X Prize Foundation, and on October 4, 2006, they announced a $10 million prize. (This is the second prize from the X Prize Foundation; the first went to Mojave Aerospace Ventures for developing a prototype of the first private spacecraft.) The prize will go to the group that manages to decode 100 human genomes in 10 days at a cost of no more than $10,000 per genome. The competition itself began even earlier, after the US National Institutes of Health in 2004 launched a research support program (with $70 million in grants) aimed at reducing the cost of decoding large genomes first to $100,000 and then to $1,000.

The classical approach to decoding DNA sequences

Figure 1. The centers that carry out large-scale mammalian genome decoding projects by the Sanger method resemble factories in their size and number of staff.

Photo: Nature Methods.

The most common method of DNA sequencing today is the "chain termination method", or "dideoxy method", developed in the 1970s by Frederick Sanger.

Cheapness, accuracy, and the comparative simplicity of automation make this method a kind of "gold standard" among all existing methods for determining the sequence of DNA nucleotide residues. This is how the entire human genome was decoded, and it is the Sanger method that remains routine in everyday laboratory practice (Fig. 1).

First, the DNA fragments whose sequence is to be determined are repeatedly copied (amplified), then cut into short pieces, which serve as templates for the synthesis of fully complementary DNA chains. The synthesis broadly resembles the copying of DNA in a living cell. The peculiarity of the method is the use of chemically modified varieties of the four deoxyribonucleotides that make up DNA chains. Each variety is labeled with a fluorescent marker molecule, a "dye" in lab jargon. (Previously, the radioactive phosphorus isotope 32P was used for labeling instead of fluorescent markers, which made the whole procedure not particularly good for one's health.)

A short fragment of DNA called a primer initiates DNA synthesis at a specific point on the template chain. A special enzyme, DNA polymerase, synthesizes the complementary chain. The modified nucleotide varieties, present in the reaction mixture in much smaller quantities than conventional nucleotides, terminate synthesis whenever one of them ends up at the end of a growing DNA chain. (The point is that the modified nucleotides lack the 3'-hydroxyl group to which the next nucleotide must be attached for the chain to continue.) The result is a mixture containing a complete set of newly synthesized DNA fragments, each of which begins at the same place but ends at every possible position along the template chain.
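The resulting nested set of fragments can be illustrated with a toy sketch (purely illustrative: the primer is omitted, and in the real reaction termination at each position is probabilistic rather than exhaustive):

```python
def termination_products(strand: str) -> list[tuple[str, str]]:
    """All possible chain-terminated fragments of a synthesized strand.

    Each product is a prefix of the full complementary strand, labeled
    by the fluorescent dye of the modified nucleotide at its end.
    """
    return [(strand[:i], strand[i - 1]) for i in range(1, len(strand) + 1)]

for fragment, dye in termination_products("GATC"):
    print(f"{fragment:<4}  dye: {dye}")
# G     dye: G
# GA    dye: A
# GAT   dye: T
# GATC  dye: C
```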

Modern automated sequencers separate these fragments by passing the mixture through extremely thin gel-filled capillaries. The shorter the fragment, the faster it moves through the gel along the capillary under the action of an electric field. (DNA fragments are essentially ions moving in an electric field from "minus" to "plus".) The process, called capillary electrophoresis, is so effective that each fragment emerging from the capillary is exactly one nucleotide longer than the one before it. As a fragment emerges, it is illuminated by a laser, which makes the labeled nucleotide at its end glow. The computer determines the variety of these nucleotides and registers the order of their appearance, assembling the "letters" (nucleotides) into "text" (a DNA sequence). When a whole genome is decoded, billions of short "texts" are generated in this way and fed into a special program running on supercomputers. The program finds places where the "texts" overlap and, placing them in the right order, builds the complete genome sequence.
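The overlap-and-order idea can be sketched with a toy greedy assembler (a deliberately naive illustration; real assembly programs handle sequencing errors, repeats, and billions of reads with far more sophisticated algorithms):

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = reads[:]
    while len(reads) > 1:
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(reads)
                      for j, b in enumerate(reads) if i != j)
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads[0]

# Three overlapping "texts" are stitched into one sequence:
print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
# ATTAGACCTGCCGGAA
```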

Most new technological developments aim at miniaturization, multiplexing (here, running many low-throughput units in parallel to increase overall throughput), and automation of the sequencing process. They can all be divided into two classes. The first combines methods of "sequencing by synthesis", in which bases are identified as they are incorporated into a growing DNA chain. The second class includes technologies for decoding the base sequence of a single DNA molecule. Some of these are quite exotic, such as reading DNA nucleotide residues electronically or optically as the molecule "squeezes" through a nanopore. A long list of improvements to capillary electrophoresis systems, combined with increasing automation and better software, has reduced the cost of sequencing 13-fold since the first automatic sequencers appeared in the past decade.

But all this pales against the possibilities of the new method of sequencing by synthesis, a sophisticated version of pyrosequencing developed and implemented by 454 Life Sciences.

And in this case, as S. Dovlatov wrote, life overtakes the dream.

The principle of high-performance DNA pyrosequencing

The technology developed by 454 Life Sciences is called pyrophosphate sequencing, or pyrosequencing.

The idea of pyrosequencing itself, it must be said, is not new: it originated in the early 1990s, but the method published then failed to displace the traditional dideoxy Sanger method. The developers at 454 Life Sciences, however, supplemented it with the capabilities of modern nanotechnology, and, as fans of dialectical materialism would say, quantity turned into quality. It would therefore be more accurate to call the method "DNA pyrosequencing in densely packed picoliter reactors". The entire genome, all of its DNA molecules, is randomly fragmented into pieces of 300-500 base pairs. The complementary chains of each fragment are then separated, and the same "adapter" oligonucleotide is ligated to each chain, allowing individual chains to stick to plastic beads. (The sequence of this oligonucleotide also makes it possible to recognize the DNA template later in the sequencing process.) The mixture of fragments separated into complementary chains is diluted in such a way that each bead receives only one (!) individual chain. Each bead is enclosed in an oil-surrounded droplet containing a mixture for the polymerase chain reaction (PCR), which takes place separately in each droplet of the emulsion (so-called emulsion PCR, ePCR). This leads to "clonal amplification" of the DNA chains; put simply, not one but about 10 million copies ("clones") of a unique DNA template end up held on the surface of the bead. Next, the emulsion is broken, the double-stranded DNA fragments formed during PCR are separated again, and the beads carrying single-stranded copies of the DNA template are placed in the wells of a specially designed slide. Each well of such a slide forms a separate picoliter "reactor" in which the sequencing reaction will take place.
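Why does strong dilution ensure "only one (!) chain" per bead? If templates land on beads at random, the count per bead follows a Poisson distribution, and the mean occupancy (lam below; the concrete values are illustrative, not taken from the 454 protocol) sets the trade-off between wasted empty beads and useless mixed beads:

```python
import math

def poisson(k: int, lam: float) -> float:
    """Probability of exactly k templates on a bead at mean occupancy lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

for lam in (1.0, 0.3, 0.1):
    single = poisson(1, lam)                 # exactly one template
    loaded = 1 - poisson(0, lam)             # at least one template
    mixed = loaded - single                  # two or more templates
    print(f"lam={lam}: P(exactly 1)={single:.3f}, "
          f"P(mixed | loaded)={mixed / loaded:.3f}")
```

The stronger the dilution, the larger the share of empty beads, but almost every loaded bead then carries exactly one template, which is what clonal amplification requires.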

The slide is a slice of a block obtained by several rounds of stretching and fusing optical fibers. With each iteration, the diameter of the individual fibers decreases while the fibers fuse into hexagonally packed bundles. Each fiber has a core 44 microns in diameter surrounded by a 2-3 micron layer of cladding. The cores are then etched away, leaving wells ≈55 microns deep with ≈50 microns between the centers of neighboring wells. The volume of such a "reactor" is 75 picoliters; the packing density on the slide surface is 480 wells per square millimeter. Each slide carries about 1.6 million wells, each of which contains one (!) bead with a DNA template. The slide is placed in the flow chamber in such a way that a channel 300 microns high is created above the wells, through which the necessary reagents reach them.
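The quoted geometry can be sanity-checked with back-of-the-envelope arithmetic (treating the wells as ideal cylinders in an ideal hexagonal lattice, which the etched slide only approximates, so the estimates land near, not exactly on, the quoted 75 pL and 480 wells/mm²):

```python
import math

radius_um = 44 / 2   # well radius, from the 44-micron core diameter
depth_um = 55        # etched well depth
pitch_um = 50        # center-to-center distance between wells

# Cylinder volume; 1 cubic micron = 1e-3 picoliters.
volume_pl = math.pi * radius_um ** 2 * depth_um * 1e-3

# Hexagonal lattice: one well per unit cell of area (sqrt(3)/2) * pitch^2.
wells_per_mm2 = 1e6 / (math.sqrt(3) / 2 * pitch_um ** 2)

print(f"well volume ~{volume_pl:.0f} pL")          # ~84 pL (quoted: 75 pL)
print(f"density ~{wells_per_mm2:.0f} wells/mm^2")  # ~462 (quoted: 480)
```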

The reagents delivered to the flow chamber flow in a layer perpendicular to the axis of the wells. This configuration allows reactions to be carried out simultaneously on template-carrying beads inside the individual wells. Addition and removal of reagents and reaction products occur by convective and diffusive transport. The characteristic diffusion time between the flow and the wells is about 10 seconds and depends on the height of the flow chamber and the depth of the wells. The depth of the wells is carefully chosen based on the following considerations:

  • the wells should be deep enough that the beads carrying the DNA template do not jump out of them under the influence of convection;
  • they should be deep enough to exclude diffusion of reaction products from wells where nucleotide incorporation took place into wells where it did not (see below);
  • the wells should be as small as possible, to allow rapid diffusion of nucleotides into the well and rapid washing out of leftover nucleotides and reaction products at the end of each cycle, which in turn is necessary to ensure high sequencing throughput and to reduce reagent costs.

In addition to the beads with the DNA template, smaller beads are poured into each well, each carrying on its surface the (immobilized) enzymes necessary for pyrophosphate sequencing. Nucleotides (one kind at a time) and the other reagents required for the sequencing reaction are fed sequentially into the flow chamber holding the slide. Every time a nucleotide is incorporated into a growing DNA chain in one of the wells, a pyrophosphate molecule is released there. Under the action of the enzyme ATP-sulfurylase, this pyrophosphate is converted into adenosine triphosphate (ATP), which is required by another enzyme, the luciferase of the firefly Photinus pyralis. Luciferase oxidizes luciferin to oxyluciferin, and this reaction is accompanied by chemiluminescence, in simple terms, a small flash of light. The bottom of the slide is in optical contact with a fiber-optic light guide connected to a charge-coupled device (CCD sensor). This makes it possible to register the photons emitted from the bottom of each individual well in which the known nucleotide was incorporated. The general scheme of pyrosequencing is given in Fig. 2.


Figure 2. Pyrosequencing scheme. A — DNA is fragmented and "adapter" oligonucleotides are ligated to the fragments; the resulting double-stranded DNA molecules are separated into two complementary chains. B — Single-stranded DNA molecules are attached to beads under conditions that favor the capture of only one molecule per bead. Individual beads are enclosed in drops of the reaction mixture surrounded by oil. The number of molecules on a bead increases millions of times as a result of the emulsion polymerase chain reaction (ePCR). C — The emulsion is broken, and the chains of the DNA fragments formed by ePCR are separated. Beads bearing millions of single-stranded copies of the original DNA fragment on their surface are placed in the wells of the fiber-optic slide, one per well. D — Smaller beads carrying on their surface the enzymes necessary for pyrosequencing are added to each well. E — Micrograph of the emulsion, showing "empty" droplets and droplets containing beads with the DNA template. The thick arrow points to a 100-micron drop, the thin one to a 28-micron bead. F — Micrograph of a fragment of the fiber-optic slide obtained with a scanning electron microscope. The cladding of the optical fibers and empty wells are visible.

By linking the flashes registered from each well with the kind of nucleotide present in the flow chamber at a given moment, the computer tracks the growth of DNA chains in hundreds of thousands of wells simultaneously.

The time required for the enzymatic reaction producing a detectable "flash" is about 0.02-1.5 seconds. The reaction rate is thus determined by the rate of mass transfer, which leaves room for improvement by accelerating reagent delivery. After each nucleotide has passed through the flow chamber, the chamber is washed with a solution containing the enzyme apyrase, so that before the next nucleotide is "launched" into the chamber, any nucleotides remaining from the previous round are removed from all the wells.
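One well's flow cycle can be sketched as a small simulation (illustrative only: the TACG flow order is an assumption, and real signals are noisy analogue values rather than clean integers):

```python
FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed

def flowgram(strand: str, n_flows: int) -> list[int]:
    """Signal per flow for an idealized well.

    strand is the sequence being synthesized; whenever the flowed
    nucleotide matches the next template position(s), all matching
    bases are incorporated at once, and the flash is proportional
    to their number.
    """
    signals, pos = [], 0
    for i in range(n_flows):
        base = FLOW_ORDER[i % 4]
        run = 0
        while pos < len(strand) and strand[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
    return signals

# A TT homopolymer gives a double-strength flash on the first T flow:
print(flowgram("TTG", 4))  # [2, 0, 0, 1]
```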

The incorporation of a particular nucleotide is detected via the release of inorganic pyrophosphate and the subsequent light emission. To identify the wells containing beads with a DNA template, one can read the "key sequence" of the adapter oligonucleotide ligated to the beginning of each DNA template. The background level is subtracted from the recorded signal, after which the signal is normalized and corrected. The intensity of the normalized signal for a given well during the flow of a given nucleotide is proportional to the number of incorporated nucleotides, if any. The linearity of this dependence is preserved for homopolymers at least eight nucleotides long. In such sequencing by synthesis, a small fraction of the DNA templates on each bead lose synchrony, i.e., they run ahead of or begin to lag behind the other templates. The cause is primarily nucleotides remaining in the well or incomplete chain elongation. Correcting these shifts is essential, since loss of synchrony creates a cumulative effect that greatly reduces read quality as read length grows. Based on a detailed model of the underlying physical processes, the staff of 454 developed a special algorithm for estimating and correcting the carry-forward and incomplete extension occurring in individual wells.
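The proportionality between signal and homopolymer length is what makes base calling possible: after background subtraction and normalization, rounding each flow's signal to the nearest integer recovers the number of bases to append. A minimal sketch (assuming a TACG flow order and ignoring the carry-forward and incomplete-extension corrections):

```python
FLOW_ORDER = "TACG"  # assumed cyclic flow order

def call_bases(signals: list[float]) -> str:
    """Turn normalized per-flow signals into a base sequence.

    Each signal is treated as a noisy estimate of the homopolymer
    length for the nucleotide flowed in that cycle.
    """
    seq = []
    for i, s in enumerate(signals):
        seq.append(FLOW_ORDER[i % 4] * round(s))
    return "".join(seq)

# Noisy signals from four flows over a short template:
print(call_bases([1.93, 0.07, 0.11, 1.04]))  # TTG
```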

Before the final sequence of the read DNA is composed and "recorded", high-quality reads must be selected from the entire data array for further work and low-quality ones discarded. The selection is based on the observation that low-quality reads contain a large proportion of signals that do not allow one to distinguish the cycles during which a nucleotide was incorporated from cycles without incorporation. Such ambiguous signals cause errors in recording the sequence of individual reads. To increase the number of usable reads, 454 developed a special metric that estimates ab initio the probability that the nucleotide at each specific position of an individual read has been determined correctly.

The high accuracy of sequence decoding is achieved by the system performing numerous reads of the same fragment, from which a single generalized (so-called consensus) sequence is built. Individual reads of the same DNA section are aligned to one another based on the signal intensities recorded as each nucleotide flows through the chamber, not on the base sequences of the reads. The corresponding signals are then averaged, and only then is the resulting sequence recorded. This approach significantly improves the quality of sequence decoding and makes it possible to evaluate that quality.
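Averaging in signal space before calling any bases can be sketched as follows (an illustrative model with an assumed TACG flow order; the real pipeline also aligns the reads and weights their signals):

```python
FLOW_ORDER = "TACG"  # assumed cyclic flow order

def consensus(flowgrams: list[list[float]]) -> str:
    """Average aligned flow signals across reads, then call bases once."""
    avg = [sum(col) / len(col) for col in zip(*flowgrams)]
    return "".join(FLOW_ORDER[i % 4] * round(s) for i, s in enumerate(avg))

# Three noisy reads of the same fragment, aligned flow-by-flow:
reads = [
    [1.8, 0.1, 0.0, 1.1],
    [2.2, 0.0, 0.1, 0.9],
    [2.1, 0.2, 0.0, 1.0],
]
print(consensus(reads))  # TTG
```

Averaging before calling suppresses the noise of any single read, which is why the consensus is more reliable than the reads it is built from.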

In 2005, scientists from 454 Life Sciences used their technology to decode the ~580,000-nucleotide genome of the bacterium Mycoplasma genitalium with 99.4% accuracy, as well as the 2,100,000-nucleotide genome of Streptococcus pneumoniae. Michael Egholm, the company's vice president responsible for molecular biology, reported at a conference in Florida in early 2006 that four more microbial genomes had since been sequenced at the company, each with better than 99.99% accuracy. "In six months, we have significantly improved the quality of the data we obtain," Egholm said.

The article in which the new method was first presented and tested [3] reports that the entire Mycoplasma genitalium genome was read in a single run! First, the whole genome was fragmented and turned into a library of DNA pieces, as described above (4 hours of work for one person). After the emulsion polymerase chain reaction (ePCR) and the loading of the resulting template-carrying beads onto a 60 mm² slide (which took one employee 6 hours), the process ended with a 4-hour automatic run of the instrument consisting of 42 cycles. Assembly of the reads (each about 108 base pairs long) yielded 25 separate continuous fragments, so-called contigs, with an average length of 22.4 thousand base pairs. These fragments covered about 96.54% of the entire mycoplasma genome. About 3% of the genome consisted of unresolvable repeats; excluding these, 99.5% of the unique genome sequence was read in a single run.
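The reported figures are easy to check against each other (using the ≈580 kb actual size of the M. genitalium genome; the percentages are necessarily approximate):

```python
n_contigs = 25
mean_contig_bp = 22_400
genome_bp = 580_000          # approximate M. genitalium genome size
repeat_fraction = 0.03       # share of the genome in unresolvable repeats

covered = n_contigs * mean_contig_bp              # 560,000 bp in contigs
print(f"genome covered: {covered / genome_bp:.1%}")   # ~96.6%, vs quoted 96.54%
unique_bp = genome_bp * (1 - repeat_fraction)
print(f"unique genome covered: {covered / unique_bp:.1%}")  # ~99.5%
```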

In the service of all progressive humanity

Although the first version of the instrument from 454 Life Sciences could easily replace more than 50 Applied Biosystems 3730xl capillary sequencers at one sixth the price, the reaction of the scientific community was surprisingly cool.

Instead of adopting the new technology and tapping its potential, many scientists accustomed to the Sanger method started talking about problems such as decoding accuracy, the length of individual reads, and infrastructure costs... And some simply balked at the need to work with the large volumes of data the new technology produces.

Most critics, however, failed to notice that many of the obstacles facing the next-generation sequencing method had also blocked the Sanger method at first. Back then, read length was only 25 base pairs, and it reached 80 only after the appearance of Fred Sanger's terminating dideoxynucleotides. The "sequencing by synthesis" technology based on pyrophosphate release initially allowed reads of no more than 100 nucleotides. After 16 months on the biotech market, this figure had improved to 250 base pairs. Recent developments allow reads of more than 400 base pairs, bringing the new method closer to the Sanger method with its ≈750 nucleotides.

Another important factor, besides the length of individual reads, is the number of reads produced in one "run" of the sequencer, normalized by the cost of that run. Here the competitors of 454 Life Sciences do well: their systems produce ten times more reads, paying for it with read lengths of only 35 nucleotides or less. Three new-generation commercial DNA sequencing systems are on the market today:

  • the Roche (454) GS FLX Genome Analyzer (Fig. 3), distributed by Roche Applied Science (454 Life Sciences was acquired by the giant Roche Diagnostics in March 2007 for $154.9 million but continues to operate as an independent division);
  • the Illumina Solexa 1G sequencer and
  • the most recent SOLiD system from Applied Biosystems.

Figure 3. At the top is the system for high-performance DNA pyrosequencing, the Roche (454) Genome Sequencer FLX (2007). Image from the 454 Life Sciences website.
Below is a diagram of the sequencer. The instrument consists of three main blocks: A — a system of micro-pumps for reagent delivery; B — a flow chamber containing the fiber-optic slide with its reactor wells; C — a fiber-optic system with CCD sensors for signal registration. The device also includes a built-in computer with the software needed to control the entire process.

Other DNA decoding systems, expected to appear on the market within 1-2 years, belong to the "third generation" and are based on the analysis of single molecules. They are being developed by VisiGen and Helicos.

And although reading a bacterial genome in a single run was an impressive achievement, at first it was not clear which biological tasks, inaccessible to the good old Sanger method, could be solved by adopting the new pyrosequencing method.

And indeed, the first projects involving the Roche 454 GS20 instrument consisted only of "re-reading" already decoded bacterial genomes and reinforcing ongoing large "Sanger projects" with additional data. Meanwhile, research in metagenomics, in addition to handling huge data arrays, sometimes larger than the human genome, suffered from biases introduced at the library construction and fragment cloning stages. In this respect the 454 technology, combining ePCR and pyrosequencing, has an undeniable advantage over the Sanger method. Emulsion PCR makes it possible to amplify single DNA molecules without bias, enclosing each in its own emulsion droplet and eliminating competition from other DNA templates for a limited pool of DNA polymerase. Pyrosequencing, in turn, reads these templates in parallel, with a light signal at the output that can be registered by a computer. The first such studies, published in 2006, showed the extraordinary flexibility of the new-generation method in studies of the microbial diversity of deep-mine subsurface ecosystems [4], deep-sea marine ecosystems [5], and marine viral "communities" ("viromes") in several oceans [6].

Figure 4. A new-generation DNA sequencing instrument can produce as much data per day as several hundred Sanger capillary sequencers, yet it is operated by a single person.

An interesting study combining metagenomic analysis and "DNA paleontology" was conducted at the end of 2005. A single run of the Roche (454) GS20 instrument was enough to analyze 13 million base pairs of the genome sequence of a 28,000-year-old mammoth [7].

This work paved the way for a technically more difficult project to decipher the Neanderthal genome [8, 9]. The difficulty is that the amount of ancient DNA extracted from Neanderthal samples is only 5% of the amount obtained from "fresh" material; consequently, one must sequence 20 times more than is needed for the genome of a modern human. In addition, DNA degradation in samples preserved at moderate temperatures, combined with the errors inherent in the new pyrosequencing method, often exceeds the level of difference established between the genomes of Neanderthals and modern humans. It is therefore much easier to assert that a resulting sequence is genuinely ancient, and not stray modern DNA, in the case of the mammoth: modern elephants, unlike humans, are not often found in laboratories. To obtain a true sequence of an ancient mammalian genome, one must read each section of the genome many times over and verify the origin of the read sections. All this will become possible only after a significant reduction in the cost of projects of this kind.

Together with a breakthrough in the sequencing of complex DNA mixtures, such projects will make it possible to study any ecosystem on the planet at the level of DNA sequences. This will open access to the flora and fauna of 100 thousand years ago: opportunities exceeding the wildest expectations of the very recent past.

At the cellular level, next-generation sequencing (hereafter meaning not only pyrosequencing but also the other new sequencing-by-synthesis methods) for the first time allows scientists to identify mutations across the entire genome of any organism. In this way, alleles responsible for antibiotic resistance in Mycobacterium tuberculosis were found [10], and all mutations were identified in the 9-million-base-pair genome of a bacterial strain that had evolved over 1,000 generations [11]. These early efforts demonstrated not only the ability of the new technology to detect mutations, and even errors in published scientific articles [11], but also the difficulties of using it, such as errors in reading homopolymer sequences during pyrosequencing (454), or the rapid decline in read quality toward the 3'-end of the sequence in the short-read systems (Solexa and SOLiD from Applied Biosystems).

Previously, to overcome these difficulties, pyrosequencing data were supplemented with information obtained by the classical Sanger method [12]. But since the cost of the Sanger component of such experiments remains forbiddingly high, many laboratories today rely only on next-generation methods, usually combining relatively long pyrosequencing reads with the short but cheap (and hence numerous) reads produced by the Solexa and SOLiD systems. Combining different platforms allows an independent assessment of the quality of their work, as well as a sanity check of the reference sequences stored in public databases.

Obtaining large numbers of DNA sequences from closely related organisms drives the development of an approach called resequencing, in which sequence data are handled differently than when assembling a freshly sequenced genome. In resequencing, assembly is guided by a reference sequence already at hand and therefore requires substantially lower coverage (8-12-fold) than de novo genome assembly (25-70-fold). This approach was applied in the work deciphering 10 mammalian mitochondrial genomes [13], which made possible population-genetic research based not on short stretches of sequence but on complete mitochondrial genomes. Numerous projects to decode microbial genomes are currently under way, not only to expand the list of available genomes, but also to enable future comparative studies relating the genotype and phenotype of an organism at the genomic level.
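The coverage figures above translate directly into sequencing effort via the simple relation coverage = (number of reads x read length) / genome size. A back-of-the-envelope sketch, using the 9 Mb bacterial genome mentioned earlier and an assumed read length of 250 bases (an illustrative figure for 454 reads of that era, not a number from the article):

```python
# Rough estimate of sequencing effort implied by a target coverage,
# using coverage = N * L / G (N reads of length L, genome of size G).

def reads_needed(genome_size, read_length, coverage):
    """Number of reads N such that N * read_length / genome_size = coverage."""
    return int(coverage * genome_size / read_length)

GENOME = 9_000_000   # a bacterial genome of ~9 Mb, as in the text
READ_LEN = 250       # assumed typical 454 read length (illustrative)

for label, cov in [("resequencing, 8-fold", 8), ("de novo, 25-fold", 25)]:
    print(label, reads_needed(GENOME, READ_LEN, cov))
# resequencing, 8-fold 288000
# de novo, 25-fold 900000
```

The roughly threefold difference in read count is why resequencing against a reference is so much cheaper than de novo assembly.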

Work on organisms not slated for genomic sequencing can also go far, thanks to the ability of the new methods to sequence transcripts directly (more precisely, cDNA, the DNA copies of messenger RNAs) in a cell. Studying transcripts by direct sequencing has several advantages over hybridization on DNA microarrays. Chief among them is that sequencing requires no a priori knowledge of the organism's genomic sequence: a transcript sequence can be compared directly with the reference sequence of a closely related species from a database using standard bioinformatics algorithms. Knowledge of transcript sequences can fundamentally change the study of organisms whose genomes are not in line for decoding today, and in some cases never will be. The first works in this field showed that it is possible to compare the sequences (cDNA and genomic, respectively) of two species as far apart as the legume Medicago truncatula and the reference plant Arabidopsis thaliana [14]. Many previously undescribed transcripts of maize, Zea mays, have also been found [15].
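A minimal sketch of the kind of "standard bioinformatics algorithm" the text alludes to is Smith-Waterman local alignment, which scores how well a cDNA read matches a stretch of a related species' reference; the scoring parameters below are arbitrary illustrative choices, and real pipelines use heavily optimized implementations:

```python
# Smith-Waterman local alignment (score only): finds the best-scoring
# local match between a read and a reference, tolerating mismatches
# and gaps. Illustrative scoring: +2 match, -1 mismatch, -2 gap.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, floored at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

# The shared 6-mer "GACGTA" scores 6 * 2 = 12 despite differing flanks.
print(smith_waterman("TTGACGTA", "CCGACGTATT"))  # 12
```

Because the score is local, a conserved coding region is found even when the surrounding sequence has diverged, which is exactly what makes cross-species comparison of cDNA against a related reference workable.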

Direct analysis of transcripts will also help circumvent the problem posed by organisms with excessively large genomes. Despite successful projects decoding viral, bacterial and large mammalian genomes, the Sanger method left the task of decoding the genomes of polyploid plants to its successors. These giant genomes, often belonging to economically important plants (the wheat genome, for example, is 16 billion base pairs), rendered all previous decoding attempts fruitless. However, the prospect of cheaply sequencing the expressed parts of a genome (i.e., its transcripts) gives hope for a successful study of plant genomes at least at the functional level [15].

And finally, the new sequencing methods have practical applications in medicine. In cancer genetics, for example, specific cancer alleles can be tracked in tissues through high-throughput genomic DNA sequencing in cases where the Sanger method fails [16]. Here the great advantage of the new methods is the repeated reading of the same sequence.

The prospects

Despite the fact that the new DNA sequencing methods have already stimulated a great many studies that would have been impossible in the recent past, the scientists and engineers developing these technologies, as well as the companies bringing them to market, still have much to do to improve them.

First of all, the cost must come down. A price reduction of one or two orders of magnitude is needed to realize the hopes of personal genomics, whose goal is the resequencing of individual genomes at a price not exceeding $1,000. Beyond that, a reduction in the error rate would also be warmly welcomed, not only for the next-generation methods but also for the Sanger method, which will continue to contribute for the foreseeable future. It is possible that artificially modified, specialized DNA polymerases will appear, reporting the DNA sequence in the form of an emitted light signal. As the cost of the technology falls, the amount of accumulated information will grow like an avalanche, which may create a bottleneck in research. Part of the effort in developing new sequencing technologies should therefore be directed to the bioinformatics front.

With more than a hundred publications in less than two years, the next generation of sequencing methods has convincingly demonstrated its immense potential to everyone working in biology and related sciences, at the very moment when many believed the post-genomic era had arrived [1]. Moreover, the new technologies have returned genomic research to individual laboratories and small academic consortia, as evidenced by the fact that most of the articles using them did not come from large genomic centers.

Looking back from the near future, one can only wonder why the new technologies were not warmly welcomed at first by the scientific community and, more importantly, by the funding agencies. We can only hope that the lesson will be learned and that the third generation of DNA-sequence decoding devices will have a happier fate.

Literature

  • biomolecule — "Human genome: how it was and how it will be";
  • biomolecule — "The time of monkey research: the Rhesus macaque genome has been decoded";
  • Margulies M., Egholm M., Altman W.E. et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380;
  • Edwards R.A., Rodriguez-Brito B., Wegley L., Haynes M., Breitbart M., Peterson D.M., Saar M.O., Alexander S., Alexander E.C. Jr, Rohwer F. (2006). Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7, 57;
  • Sogin M.L., Morrison H.G., Huber J.A., Mark Welch D., Huse S.M., Neal P.R., Arrieta J.M., Herndl G.J. (2006). Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc. Natl. Acad. Sci. U.S.A. 103, 12115–12120;
  • Angly F.E., Felts B., Breitbart M., Salamon P., Edwards R.A., Carlson C., Chan A.M., Haynes M., Kelley S., Liu H., Mahaffy J.M., Mueller J.E., Nulton J., Olson R., Parsons R., Rayhawk S., Suttle C.A., Rohwer F. (2006). The marine viromes of four oceanic regions. PLoS Biol. 4, e368;
  • Poinar H.N., Schwarz C., Qi J., Shapiro B., Macphee R.D., Buigues B., Tikhonov A., Huson D.H., Tomsho L.P., Auch A., Rampp M., Miller W., Schuster S.C. (2006). Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311, 392–394;
  • Green R.E., Krause J., Ptak S.E., Briggs A.W., Ronan M.T., Simons J.F., Du L., Egholm M., Rothberg J.M., Paunovic M., Pääbo S. (2006). Analysis of one million base pairs of Neanderthal DNA. Nature 444, 330–336;
  • Noonan J.P., Coop G., Kudaravalli S., Smith D., Krause J., Alessi J., Chen F., Platt D., Pääbo S., Pritchard J.K., Rubin E.M. (2006). Sequencing and analysis of Neanderthal genomic DNA. Science 314, 1113–1118;
  • Andries K., Verhasselt P., Guillemont J., Göhlmann H.W., Neefs J.M., Winkler H., Van Gestel J., Timmerman P., Zhu M., Lee E., Williams P., de Chaffoy D., Huitric E., Hoffner S., Cambau E., Truffot-Pernot C., Lounis N., Jarlier V. (2005). A diarylquinoline drug active on the ATP synthase of Mycobacterium tuberculosis. Science 307, 223–227;
  • Velicer G.J., Raddatz G., Keller H., Deiss S., Lanz C., Dinkelacker I., Schuster S.C. (2006). Comprehensive mutation identification in an evolved bacterial cooperator and its cheating ancestor. Proc. Natl. Acad. Sci. U.S.A. 103, 8107–8112;
  • Goldberg S.M., Johnson J., Busam D., Feldblyum T., Ferriera S., Friedman R., Halpern A., Khouri H., Kravitz S.A., Lauro F.M., Li K., Rogers Y.H., Strausberg R., Sutton G., Tallon L., Thomas T., Venter E., Frazier M., Venter J.C. (2006). A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc. Natl. Acad. Sci. U.S.A. 103, 11240–11245;
  • Gilbert M.T., Tomsho L.P., Rendulic S., Packard M., Drautz D.I., Sher A., Tikhonov A., Dalén L., Kuznetsova T., Kosintsev P., Campos P.F., Higham T., Collins M.J., Wilson A.S., Shidlovskiy F., Buigues B., Ericson P.G., Germonpré M., Götherström A., Iacumin P., Nikolaev V., Nowak-Kemp M., Willerslev E., Knight J.R., Irzyk G.P., Perbost C.S., Fredrikson K.M., Harkins T.T., Sheridan S., Miller W., Schuster S.C. (2007). Whole-genome shotgun sequencing of mitochondria from ancient hair shafts. Science 317, 1927–1930;
  • Cheung F., Haas B.J., Goldberg S.M., May G.D., Xiao Y., Town C.D. (2006). Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7, 272;
  • Ohtsu K., Smith M.B., Emrich S.J., Borsuk L.A., Zhou R., Chen T., Zhang X., Timmermans M.C., Beck J., Buckner B., Janick-Buckner D., Nettleton D., Scanlon M.J., Schnable P.S. (2007). Global gene expression analysis of the shoot apical meristem of maize (Zea mays L.). Plant J. 52, 391–404;
  • Thomas R.K., Baker A.C., Debiasi R.M., et al. (2007). High-throughput oncogene mutation profiling in human cancer. Nat Genet. 39, 347–351.

Portal "Eternal youth" www.vechnayamolodost.ru
19.02.2008
