26 June 2020

The human genome: twenty years later

What actually happened when the human genome was "decoded" – and is happening now

Daria Spasskaya, N+1

For links, see the original article.

Twenty years ago, US President Bill Clinton and British Prime Minister Tony Blair announced that the Human Genome Project and Celera Genomics Corporation had completed the "initial sequencing of the human genome." That moment is often said to mark biology's entry into the "post-genomic era". Yet the date can be called "the day the human genome was decoded" only with reservations: at that press conference, scientists presented just the first "draft" of the DNA sequence of all human chromosomes, with many gaps, some of which have still not been filled. Both papers describing the draft were published in 2001, and the "final" version appeared three years later. With that, the Human Genome Project was completed, but the decoding of the human genome was not. Work on interpreting and extending the data continues to this day. N+1 tells the story of what is perhaps the most important scientific project of the 21st century and of what it has done, and continues to do, to the world.

Genetic Telescope

The Human Genome Project is sometimes called the most successful international scientific collaboration in history. At the start, however, the scientific community was far from unanimously optimistic: the preparations were accompanied by public debate and scathing articles whose authors argued that reading the sequence of human DNA was impossible and that taxpayers' money should be spent on something more useful.

Although the technical possibility of sequencing a genome had been demonstrated back in the 1970s, when the first viral genome was deciphered, no one immediately thought of applying it to humans. According to legend, the idea took shape thanks to the biologist Robert Sinsheimer of the University of California, Santa Cruz. His astronomer colleagues were working on what was then the largest ground-based telescope, and Sinsheimer was looking for a project of similar scale in biology.

In 1985, he gathered several leading geneticists to discuss a project to sequence the human genome. The group concluded that the idea was tempting but not feasible. At that time, even the E. coli genome, "only" about five million nucleotide pairs long, had not been deciphered, and the longest stretch that the Sanger method could read in one run was several hundred nucleotides.

20-yrs1.png 

20-yrs2.jpg

A cabinet holding a printed fragment of the human genome, on display at the Wellcome Collection museum in London. The full printout takes up hundreds of volumes of about a thousand pages each. Russ London / Wikimedia Commons, Adam Nieman / flickr / CC BY-SA 2.0

Among the participants was Walter Gilbert, who ten years earlier, almost simultaneously with Frederick Sanger, had proposed his own method of DNA sequencing (known as the Maxam-Gilbert method, or chemical degradation of DNA). He was fired up by the idea of creating a genome institute and drew in James Watson, co-discoverer of the structure of DNA, and Charles DeLisi, who headed the Office of Health and Environmental Research at the US Department of Energy. DeLisi saw the genome project as a logical continuation of research into the effects of radiation on humans. By 1986, they were already estimating the cost of decoding the human genome sequence.

Skeptical colleagues estimated that the project would take decades of routine work if DNA were "read" by small scientific teams, which in their view was the only way to do the work well. The amount of work ahead seemed enormous: Sydney Brenner, one of the pillars of molecular biology, joked that DNA sequencing should be assigned to convicts, with the size of the chromosome proportional to the severity of the crime. Watson and DeLisi, however, decided to rely on large automated centers and international cooperation. The final plan for the American part of the project was budgeted at 15 years and three billion dollars.

This figure seems large, but the Apollo space program, carried out twenty years earlier, cost Americans roughly ten times more (not adjusted for inflation). And the Human Genome Project promised results no less significant than a flight into space: at the very least, an understanding of the nature of 4,000 hereditary diseases and advances in medical genetics and related technologies.

Despite the criticism and the price tag, the project's proponents managed to win over both the Department of Energy and the US National Institutes of Health (NIH), and in 1990 the project started. The expert panel strongly recommended studying, in addition to the human genome, the genomes of model organisms (E. coli, yeast, roundworms and mice), so that, if the project succeeded, there would be something to compare human genes with.

"The credit for launching the project, of course, belongs to Watson. And it was originally conceived as an international one. In many countries, funding was allocated for this within the framework of national projects," says Yuri Lebedev, head of the Laboratory of Comparative and Functional Genomics of the IBH RAS and a member of the International Organization for the Study of the Human Genome (HUGO), who participated in the creation of a map of the 19th chromosome within the framework of the project. – People from institutes in the USA, England, France, Germany, Sweden, Russia – even those countries that were not included in the co–authors of the article in the end - went to each other and worked on the task together. Of course, America would not have done anything alone."

The authors of the 2001 article were members of the International Human Genome Sequencing Consortium from 20 scientific groups in the USA, Great Britain, Germany, France, Japan and China.

20-yrs3.jpg

A fragment of the physical map of chromosome 19, which was read at Lawrence Livermore National Laboratory with the participation of IBCh RAS.

A few years after the start, the international consortium's solo marathon turned into a race. Craig Venter, who initially headed a laboratory at the NIH, developed a new approach to studying genomes based on "expressed sequence tags", which significantly sped up the search for genes via their transcripts. Armed with this technology and backed by venture investors, he left the NIH and founded The Institute for Genomic Research.

In 1998, Venter joined forces with a manufacturer of automated sequencers to found a company called Celera Genomics and announced that he, too, would decode the human genome. Starting eight years after the Human Genome Project, Venter intended to complete the task in just three years, whereas the international consortium did not expect to finish for another seven. His company planned to profit handsomely by patenting genes associated with hereditary diseases (in 2000, however, Clinton declared that the genome sequence belonged in the public domain and could not be patented, so in a sense the businessman's efforts were in vain).

The appearance of a competitor spurred on the Human Genome Project, and the goal was eventually reached two years ahead of schedule. The federal project came to terms with Celera, and the results of both efforts were announced at the same press conference on June 26, 2000. Jim Watson, the founder of the Human Genome Project, and Tony White, the head of PE Corporation, which financed Venter, were present in the hall, and their faces made it plain that the war had ended in an uneasy peace. The Venter group's paper was published in Science a day after the Human Genome Project's paper appeared in Nature.

20-yrs4.jpg

Covers of the issues of Science and Nature in which the Celera Genomics and HGP papers were published.

Background and consequences

In the 1980s, geneticists already had tools for studying the size of chromosomes and the location of genes on them, mainly enzymatic cleavage of DNA with restriction enzymes, separation of the fragments in a gel and hybridization with a radioactively labeled probe. A closer look at DNA became possible thanks to the efficient sequencing method invented by the Englishman Frederick Sanger, who had earlier devised a way to read the amino acid sequence of proteins.

Sanger's method of determining the DNA sequence, in turn, was made possible by the discovery of DNA polymerase, the enzyme that duplicates DNA in the cell by building a complementary strand on a single-stranded template.

Unlike the purely chemical Maxam-Gilbert method (cleavage of DNA at sites where particular nucleotides have been chemically modified), this method relies on enzymatic synthesis of a second strand on the template strand that needs to be read, and is therefore more efficient. The complementary strand is synthesized from standard nucleotides (A, T, G, C), but the reaction mixture also contains a radioactively labeled dideoxynucleotide, and whenever it is incorporated the strand stops growing (the same method is still used for routine sequencing, but with fluorescent rather than radioactive labels). Separating the resulting fragments of different lengths, each ending in a known "letter", in a gel makes it possible to reconstruct the entire nucleotide sequence.
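The logic of reading a sequence from chain-terminated fragments can be illustrated with a small Python sketch. It is a toy model, not the actual laboratory protocol: the function names are invented for this illustration, and it assumes an idealized experiment in which a terminated fragment of every possible length is present and each "lane" of the gel holds the fragments ending in one particular base.

```python
# Toy model of Sanger (chain-termination) sequencing.
# All names here are hypothetical and chosen for illustration only.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template):
    """Return the strand a polymerase would build on the template.

    The new strand is complementary to the template and runs in the
    opposite direction, hence the reversal.
    """
    return "".join(COMPLEMENT[base] for base in reversed(template))


def termination_lanes(strand):
    """For each base, list the lengths of fragments ending in that base.

    This mimics four reactions, each with one kind of dideoxynucleotide:
    synthesis stops wherever that terminator was incorporated.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Reconstruct the sequence by reading bands from shortest to longest."""
    band_to_base = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(band_to_base[length] for length in sorted(band_to_base))


if __name__ == "__main__":
    template = "GATTACA"
    new_strand = synthesized_strand(template)
    lanes = termination_lanes(new_strand)
    print(lanes)
    print(read_gel(lanes))
```

Sorting the bands by length plays the role of the gel: shorter fragments travel further, so reading the lanes from the shortest band to the longest spells out the synthesized strand letter by letter.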

20-yrs5.jpg

A fragment of a decoded sequence as seen in a gel.

"In order to get an idea of the genome sequence, it was important not only sequencing," Yuri Lebedev clarifies. – It was necessary to make physical maps of chromosomes with a sequence of genes, structural and regulatory sites. This was done by cloning into genetic vectors, yeast, bacterial and phage, overlapping pieces of the genome of tens and hundreds of thousands of pairs in size, and placing on them known genetic markers by which these pieces could be compared. It should be understood that by that time a number of human genes had already been cloned with cDNA (DNA corresponding to the matrix RNA after cutting out non–coding sections - approx. N + 1) and sequenced, so that we could use certain sequences to arrange the "columns", and in parallel we were looking for new markers. It took a significant part of the time. Venter acted cunningly – he used ready-made physical cards, and only put sequents on them, and, of course, it took him much less time."

It should be noted that all the data obtained along the way, including chromosome maps showing the location of genes, were made openly available. This greatly simplified the task for Craig Venter, who used them to map sequences obtained by a modified "shotgun method".

"For sequencing, each of the large fragments was split into a mixture of overlapping fragments of shorter length, the small fragments were recloned into a phage vector (M13) and the entire mixture of small fragments was sequenced by Sanger on automatic sequencers," Lebedev explains.

This approach of breaking DNA into short fragments is what is known as the "shotgun method". The first long DNA sequence read this way, in 1981, was the genome of the cauliflower mosaic virus. Venter realized that the short pieces of the genome do not have to be cloned into vectors but can be read directly from both ends (for this, known sequences need to be attached to their ends). Thanks to this improvement, his team quickly read 70 million fragments and, using the ready-made physical maps, put them together in three years. It cost them a few hundred million dollars, far less than the Human Genome Project.
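How overlapping short reads are put back together can be shown with a minimal Python sketch. This is a deliberately simplified greedy assembler, assuming error-free reads with no repeats; it is not the algorithm used by Celera or by the consortium, whose assemblers also had to handle sequencing errors, repeats and paired-end information. The function names and the `min_len` parameter are invented for this illustration.

```python
# Toy greedy assembly of shotgun reads (illustration only).
# Assumes error-free reads and a sequence without long repeats.


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are fully contained in another read.
    contigs = [
        r for r in contigs if not any(r != other and r in other for other in contigs)
    ]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
    return contigs


if __name__ == "__main__":
    # Three overlapping reads taken from the made-up sequence ATGCGTACGTTAGC.
    reads = ["TTAGC", "ATGCGTAC", "TACGTTA"]
    print(greedy_assemble(reads))
```

When reads do not overlap, the loop stops and several separate contigs remain. That is where physical maps with known markers come in: they show how the contigs are ordered along the chromosome.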

20-yrs6.jpg

20-yrs7.jpg

The sequencing workflows used by the Human Genome Project and by Celera Genomics (Jennifer Commins et al. / Biological Procedures Online, 2009).

By the time the project was launched in 1990, only a few short viral genomes and plasmids (auxiliary circular DNA molecules found in bacteria), no more than tens of thousands of nucleotide pairs long, had been decoded. The Human Genome Project set out to read a genome several orders of magnitude larger: three billion pairs, which is how many "letters" a single set of human chromosomes (23 chromosomes) contains. By most estimates at the time, this "chronicle" was expected to contain about 100,000 genes.

It is not surprising that many leading geneticists considered the task unsolvable. But as the project went on, advances in technology made the scientists' work easier. Among the technical achievements was the automated capillary sequencer, in which fragments are separated in thin tubes rather than in a gel. Besides handling more samples, such instruments switched to automatic signal detection once fluorescently labeled nucleotides became available. Computer technology advanced as well, from networks that gave scientists access to data from anywhere to programs for comparing and processing sequences.

The accumulation of sequences gave impetus to a whole new science, bioinformatics, which deals with the assembly, processing and analysis of genomes using mathematical methods.

"High–performance sequencing (NGS) appeared precisely as a result of sequencing of the human genome - before that there was simply no need to read so many sequences. Moreover, I am sure that this project spurred the development of both computer technology and big data analysis – there were huge amounts of data that had to be analyzed somehow," Lebedev comments on the results of the project. Mikhail Gelfand, bioinformatician, deputy director of the Institute of Information Transmission Problems of the Russian Academy of Sciences, speaks in the same spirit: "Now the performance of sequencers is growing faster than the performance of processors and memory. Data grows faster than processing capabilities."

First results and further development

So by 2000, scientists had an idea of the sequence of human DNA in euchromatin, the regions that are actively transcribed, that is, read by RNA polymerase.

According to scientists' estimates, euchromatin makes up about 95 percent of the genome. The rest of the DNA is hidden in tightly packed protein complexes and is "silent" most of the time. Besides the human genome, as the experts had recommended in 1990, the genomes of "599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant" had been sequenced by 2001, and by the official end of the project the list had grown to include the genomes of the mouse and the rat, model animals without which no major medical study is conceivable.

The outcome of the project is, of course, not limited to the raw sequence of letters. Once the genome had been decoded, the estimated number of human genes had to be cut from 100,000 to about 30,000, only twice as many as in a fly or a worm, as the authors of the historic Nature paper noted.

20-yrs8.png

How estimates of the number of genes in the human genome changed from 1964 to 2009 (Mihaela Pertea and Steven L Salzberg / Genome Biology, 2010).

Scientists also learned that the human genome contains a great many repeats and mobile elements, the vast majority of which are no longer active. In addition, the human genome is highly variable: geneticists estimated the number of single-nucleotide polymorphisms in it (sites where different people may carry different nucleotides) at 1.5 million. This became clear partly because the project used DNA from many volunteers rather than from a single person.

"There are a lot of things about which we did not suspect that they actually happen. Here you lived somewhere on the shore and thought that you lived on a small island. Then somehow we climbed a mountain, the fog cleared, and you saw that it was actually a whole continent," Mikhail Gelfand, whose laboratory participated in the assembly and analysis of the human genome, describes the scientific results of the project.

However, the publication of the first genome paper was only the beginning of genomic research. Gelfand gives examples: "The Human Genome Project was followed, for example, by the ENCODE project, in which people deliberately studied function. The aim was not just to write out a sequence of letters but to understand why tissues differ and why genes work differently in different tissues. Or how malignant transformation happens and how genes start working differently; how the work of a gene is organized and how it changes during early development, when many different tissues arise from a single cell type; how DNA is packed in cells and what that affects. There are many technologies that tell us about function, but they are all closely tied to genome sequencing: you sequence something, map it onto the genome and then draw functional conclusions. In effect, this was the beginning of the science called systems biology, in which you try to understand not how genes work one by one but how the cell works as a whole, and in great detail. That would be fundamentally impossible without the genome. Our level of understanding of how the cell is organized has changed radically. We are no longer blindly feeling the elephant from different sides; we are now looking at the whole elephant, and seeing right through it."

The "standard", or reference, human genome is still being refined. "The end point was very much a convention. We agreed to treat the moment when [Clinton and Blair] made [their statement] as that point. At that moment the genome was not finished, and people went on tidying it up for many years," says Gelfand. "Wonderful papers are coming out now showing that if you take a great many genomes of different people, there will be whole pieces that are absent from the classical genome. That is, we differ not only in point mutations and substitutions but also in whole large pieces of the genome that some people have and others do not. Last year a paper came out in which a few percent was added to the universal human genome simply by sequencing a lot of Africans."

Genome for Medicine

In the twenty years since the draft genome was assembled, sequencing and sequence-analysis technologies have advanced so far that determining the sequence of the coding regions of your genome (the exome) now costs not three billion dollars but only a few hundred.

20-yrs9.png

The change in the cost of sequencing the human genome after September 2001.

Research databases continue to grow, thanks, for example, to the 1000 Genomes Project, which was designed to assess the genetic diversity of the planet's inhabitants. National DNA banks are being set up: the Icelandic company deCODE genetics, for example, holds genetic information on two-thirds of Iceland's population. Such data are also used to develop personalized medicine, that is, individual therapy based on the patient's genetic data.

Genotyping, that is, determining the single-nucleotide polymorphisms of a particular person, has largely become routine: the UK Biobank database holds genome-wide genotyping data for 500,000 people. In addition to genetic data, participants' records contain information on health indicators, habits, family medical history and so on. Such data sets allow researchers to run genome-wide association studies (GWAS), which can reveal, for example, a genetic predisposition to a particular disease.
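At its simplest, an association between a genetic variant and a disease comes down to comparing how often carriers and non-carriers fall ill. The Python sketch below illustrates this for a single variant with invented counts; real genome-wide studies test millions of variants and correct for multiple comparisons, population structure and other confounders. The function names and all numbers are hypothetical.

```python
# Minimal single-variant association sketch with made-up counts.
# Real GWAS pipelines are far more involved; this only shows the arithmetic.


def relative_risk(sick_carriers, all_carriers, sick_noncarriers, all_noncarriers):
    """How many times more often carriers develop the disease."""
    risk_carriers = sick_carriers / all_carriers
    risk_noncarriers = sick_noncarriers / all_noncarriers
    return risk_carriers / risk_noncarriers


def odds_ratio(sick_carriers, all_carriers, sick_noncarriers, all_noncarriers):
    """Odds of disease among carriers divided by odds among non-carriers."""
    healthy_carriers = all_carriers - sick_carriers
    healthy_noncarriers = all_noncarriers - sick_noncarriers
    return (sick_carriers * healthy_noncarriers) / (sick_noncarriers * healthy_carriers)


if __name__ == "__main__":
    # Hypothetical cohort: 50 of 1000 carriers and 10 of 1000 non-carriers are ill.
    print(relative_risk(50, 1000, 10, 1000))
    print(odds_ratio(50, 1000, 10, 1000))
```

With these invented numbers, carriers fall ill about five times more often than non-carriers. For a rare disease the odds ratio is close to the relative risk, which is why the two are often quoted interchangeably.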

"Genomic studies can show that carriers of such a variant of the gene have the disease, for example, five times more often than carriers of another variant. This knowledge can help to adjust the lifestyle so as to minimize the probability. But the calculation of the risks of developing the disease is completely based on statistics, there is still room to develop the mathematical apparatus," Lebedev says. – As for predicting abilities for sports and music based on the genome of a child, this is, of course, a thing from the realm of fiction. However, genome or exome sequencing can help to give birth to healthy children if the parents are carriers of some harmful mutations."

In recent years, DNA sequencing and genotyping have been actively used in oncology. Besides guiding the choice of therapy according to the mutations present in a tumor, oncogenomics helps explain how tumors arise and metastasize. "A lot of things in medical genetics, oncology and immunology are also tied to genomes. People have now started looking at the genomes of individual cells, which makes it possible, for example, to follow the development of cancer. A tumor is heterogeneous; it contains many different clones, and this matters a great deal for treatment. It has clones that are resistant to a given treatment and clones that are sensitive to it, clones that can metastasize and clones that are not malignant enough to do so. And the first studies are now appearing in which people simply reconstruct the genealogy of individual cells in a tumor," explains Gelfand.

"With the help of modern postgenomic technologies, you can see how many cells in the human body are infected with SARS-CoV-2," Lebedev gives a topical example, whose laboratory is currently studying subpopulations of human immune cells. – And you can understand, these are cells killed, dying or struggling, how the immune response develops. Using both sequencing and other methods, for example, multiparametric fluorescent hybridization (FISH), it is possible to calculate in a patient's blood sample how many cytotoxic lymphocytes there are, how many helper cells, how many antibody-producing ones, how many of them have already passed into memory cells."

20 years later

When Francis Collins, who replaced Watson as head of the American Human Genome program in 1993, was asked in 2000 about the prospects for genomic research, he predicted that by 2020 genetic testing would be widespread, gene therapy for hereditary diseases would be available, and germline gene therapy would have proven its safety. As time has shown, he was almost right: both gene therapy and embryo editing already exist, but their widespread use is held back by other concerns, namely safety, efficacy, ethics and expediency (why edit an embryo if you can select a healthy one at the IVF stage?).

"In America, everyone is afraid that genomic information will lead to discrimination of individual employees by insurance companies. That is, if you have a tendency to develop the disease five times higher, then pay five times more. But this is a problem of insurance, not genetics," Lebedev points out. – I don't think that legislative regulation now somehow restricts the development of genetic technologies in relation to humans. That's where it comes to genetically modified plants, yes, laws interfere with the development of the industry."

"Yes, happiness has not come yet. But if happiness had not been promised, then no one would have given money, – Mikhail Gelfand shares his thoughts on the prospects of genomics. It pays off, just slower and not as noticeable. In biological things, the path from understanding to use is longer. In fact, medicine has advanced a lot, we just don't notice it."

Yuri Lebedev takes a slightly different view: "Already today, a neighboring institute selects therapy for childhood leukemia based on patients' genetic information. And the treatment regimen for patients with ankylosing spondylitis (an autoimmune disease) is adjusted based on UK Biobank data. This is personalized therapy. I didn't expect it to become available so soon."

Portal "Eternal youth" http://vechnayamolodost.ru

