01 April 2022

Almost complete sequencing

The complete sequence of the human genome has been published

Polina Loseva, N+1

Molecular biologists have finished assembling the human DNA sequence, and a special issue of the journal Science is dedicated to the result. In the previous version of the genome, which appeared in 2001, about 8 percent of the sequence remained undeciphered, mainly non-coding regions and the central and terminal regions of chromosomes. Six articles at once are devoted to the results of the project. The complete version of the genome makes it possible to identify people's individual genetic characteristics more accurately and may become a new standard in genetics, even though it still lacks an entire chromosome.

In 2000, the Human Genome Project and Craig Venter's company Celera Genomics announced that they had finished sequencing human DNA (we covered this in more detail in the text "The human Genome: twenty years later"). In 2001 they published their draft assemblies a day apart (first the Human Genome Project, then Venter's team), and by 2003 they had pooled their efforts and results to produce a single finished version. It became the first standard, or reference genome, against which everyone who deciphered new human genomes or searched for genetic causes of diseases checked their results. However, the work of reading human DNA did not end there.

The authors of the first version of the human genome did not hide that it was far from complete. For example, 341 gaps remained in it. In addition, the researchers focused on euchromatin, the fraction of DNA that is usually loosely packed in the cell and from which information can be read. As a result, the first version of the genome left out many sections of heterochromatin, the tightly packed fraction of DNA. Heterochromatin consists mainly of sequences that do not encode proteins but perform various technical and structural (and often poorly understood) functions, so they too can affect the life and work of the cell.

In the first version of the genome, it was also not entirely clear which genes and non-coding regions were responsible for what; the ENCODE project, for example, is working to find this out. Finally, the reference genome did not fully capture human genetic diversity, even though it was assembled from random pieces of DNA from several dozen people. Other projects, such as "Thousand Genomes", have set out to fill these gaps.

Since then, the genome has been refined repeatedly, and several updated references have appeared. The latest, GRCh38.p13, was published in 2019. But it too contained many blank spots: areas where the letter N stood in place of nucleotides, or where surrogate sequences had been substituted. For about another 150 fragments, it was not known exactly where they belong or in what order. In total, these inaccuracies affected about 8 percent of the human genome, which is comparable in size to an entire chromosome.
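To make the idea of blank spots concrete, here is a minimal sketch of how runs of the letter N can be located in a sequence string. The toy sequence and the function names are illustrative assumptions; this is not the method used by the assembly's authors.

```python
import re


def find_gaps(sequence):
    """Return (start, length) for every run of N in the sequence."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"N+", sequence.upper())]


def gap_fraction(sequence):
    """Return the share of positions that are unknown (N)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical toy sequence; real chromosomes are millions of letters long.
toy_sequence = "ACGTNNNNACGTACGTNNGT"

for start, length in find_gaps(toy_sequence):
    print(f"gap at position {start}, {length} letters long")
print(f"unknown share: {gap_fraction(toy_sequence):.0%}")
```

The same scan applied to a real reference file would need a FASTA parser and would not detect surrogate sequences, which look like ordinary nucleotides.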

The Telomere-to-Telomere Consortium (T2T Consortium; a telomere is the end section of a chromosome) set out to deal with the missing parts of the genome. It brought together scientists from 54 institutes and laboratories in different countries (including Russia), and the result of their work was the first truly complete genome assembly, which they described in six articles in the journal Science.

The first article presents the new assembly: the authors describe the methods they used and summarize their work. The new genome was named CHM13, after the cell line that served as the DNA donor. This line comes from a hydatidiform mole, an unusual human tumor that arises when a fertilized egg loses its maternal chromosomes for some reason (in essence, this is a kind of parthenogenesis; read more about it in the text "Half of yourself"). A hydatidiform mole is convenient because its genome often consists of a duplicated chromosome set brought in by the sperm. This means that both copies of each chromosome should be almost identical (apart from point mutations and accidental breaks), so during sequencing there is no need to work out which copy a particular region belongs to.

The CHM13 assembly differs from its predecessors in sequencing technology. Previous genome versions were assembled from many short reads: the DNA was first broken into small fragments, each fragment was read separately, and the reads were then overlapped to reconstruct the sequence. But this method works poorly for heterochromatin, which contains many repeated regions whose location and copy number are easy to get wrong (for example, some ribosomal RNA genes in humans can have 300-400 copies). Therefore, the T2T Consortium used long-read sequencing: they broke the DNA into long fragments and read each one in its entirety.
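The difficulty with repeats can be illustrated with a toy model. In the sketch below, two made-up sequences differ only in the number of copies of a repeated unit. Reads shorter than the repeat yield exactly the same set of fragments for both, while reads long enough to span the repeat tell them apart. The sequences, read lengths and function name are assumptions chosen for illustration. The model also ignores read depth and sequencing errors, both of which matter in practice.

```python
def all_reads(genome, read_length):
    """Return the set of every possible read of the given length."""
    return {genome[i:i + read_length]
            for i in range(len(genome) - read_length + 1)}


REPEAT_UNIT = "ACGT"                  # hypothetical repeated unit
LEFT, RIGHT = "TTGCA", "GGATC"        # hypothetical unique flanks

genome_a = LEFT + REPEAT_UNIT * 3 + RIGHT   # three copies of the unit
genome_b = LEFT + REPEAT_UNIT * 5 + RIGHT   # five copies of the unit

SHORT, LONG = 6, 22                   # toy read lengths in letters

short_a, short_b = all_reads(genome_a, SHORT), all_reads(genome_b, SHORT)
long_a, long_b = all_reads(genome_a, LONG), all_reads(genome_b, LONG)

# Short reads never reach from one flank to the other, so the copy
# number leaves no trace in the set of fragments.
print("short reads identical:", short_a == short_b)

# Long reads span the whole repeat together with its flanks.
print("long reads identical:", long_a == long_b)
```

Real assemblers work with read counts and overlap graphs rather than plain sets, but the ambiguity shown here is the reason long reads help in repeat-rich regions.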

As a result, CHM13 comprises 3,054,815,472 base pairs of nuclear DNA and 16,569 base pairs of mitochondrial DNA. Of these, 182 million base pairs are brand new: they were absent from the previous, 2019 genome assembly. The authors note that this genome has no gaps and no sequences without an assigned place: it is truly complete.
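As a quick check on these figures, the short calculation below adds up the assembly size and estimates what share of the nuclear genome the new sequence makes up. It assumes the "182 million" in the text is a rounded value, so the percentage is approximate.

```python
NUCLEAR_BP = 3_054_815_472   # nuclear DNA, from the text
MITO_BP = 16_569             # mitochondrial DNA, from the text
NEW_BP = 182_000_000         # "182 million", assumed to be rounded

total_bp = NUCLEAR_BP + MITO_BP
new_share = NEW_BP / NUCLEAR_BP

print(f"total assembly size: {total_bp:,} base pairs")
print(f"new sequence: about {new_share:.1%} of the nuclear genome")
```

The share comes out somewhat below the roughly 8 percent cited for the flaws of the previous assembly. The two figures need not match, since that estimate also counted surrogate sequences and fragments of unknown location.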

The vast majority of the new regions are non-coding DNA, mainly centromeric (that is, from the middle of the chromosomes, the place where they are joined in a characteristic cross shape during meiosis). Nevertheless, the researchers managed to find new genes, 1,956 in total. About a hundred of them, according to their estimates, encode proteins (the rest may encode certain types of RNA or not function at all).

The remaining five articles in the issue are devoted to individual in-depth studies within the project. One of them, for example, covers centromeres: their diversity, structure and evolution. Another deals with repeats in the genome: the authors searched among them for retrotransposons (mobile genetic elements that can move around the genome or insert new copies of themselves into it), including active ones. The third is devoted to segmental duplications, long sections present in a small number of copies, which probably played a role in the evolution of primates. The fourth is a map of the methylation of the newly sequenced regions.

Finally, one more article is devoted to practical applications of the new genome. Its authors tested how well the CHM13 assembly works as a reference for comparing individual people's genomes and looking for sequence variants. To do this, they took the database of the Thousand Genomes project and, comparing its sequences with CHM13, found more than a million genetic variants that a comparison with the GRCh38 assembly had not revealed. On that basis, the consortium members proposed making CHM13 the new standard for genetic and genomic research.

But the deciphering of the human genome will not end here either. CHM13 has its own shortcomings: for example, this assembly has no Y chromosome. The reason is that the cells of a hydatidiform mole, the tumor from which the CHM13 line derives, carry two identical copies of each chromosome, and the YY genotype is not viable. Therefore, this chromosome will have to be assembled separately.

In addition, CHM13 is not a composite genome built from the cells of different people, as previous assemblies were, but the genome of a single cell line. Therefore, the Consortium will have to assemble other genomes as well, so that the standard reflects not only the complete DNA sequence but also its different variants.

Portal "Eternal youth" http://vechnayamolodost.ru

