Population genetics for dummies
Everything you wanted to ask about population genetics
Nadezhda Markina, PCR.news
We are learning to understand scientific news about where the ancestors of the French or the Finns came from and which African peoples African Americans descend from. What is the difference between genotyping and genome-wide research? Does "genome-wide" mean that the whole genome is sequenced, or not? If a genome can reveal where a person comes from, why do scientists repeat in every interview that there are no "genetic markers of nationality"?
What do population geneticists study?
Population geneticists study the gene pools of populations: both modern gene pools and the history of their formation.
They constantly emphasize that they are not studying ethnic groups, because ethnicity is a social category, not a biological one. Belonging to one ethnic group or another is determined by the person himself: literally, by who he thinks and feels he is. Russian population geneticists follow this rule too: if a person considers himself a Tatar, he is a Tatar; if he considers himself a Chuvash, he is a Chuvash; if he considers himself a Koryak, he is a Koryak; if he considers himself a Russian, he is a Russian.
But then how can scientists get objective information about the gene pool?
When population geneticists collect biological samples in an indigenous population, they strictly adhere to certain criteria.
A person can be included in the sample if:
— Not only he himself but also his ancestors for three generations (both parents and all four grandparents) identify themselves with this people.
— His ancestors lived in this place for three generations. By the way, when studying indigenous peoples, geneticists do not work in cities, only in rural areas. The population of cities, which is a conglomerate of immigrants, is not suitable for this.
— He is not related to other people already included in the survey.
And what is a population from the point of view of population genetics?
It is a group of people living in a certain territory and meeting two conditions.
First, the group has existed for many generations, not just one. Second, members of the group marry within it more than half of the time, rather than marrying outsiders. It sometimes happens that marriages with members of other peoples begin to prevail in a group of indigenous people, and then the group ceases to be a population from a genetic point of view.
Populations can be composite, nested inside one another like nesting dolls. A population can be the inhabitants of an isolated village (although this is rare now), of a particular region, a people as a whole, a continent, or the whole world.
What is a haplotype and a haplogroup?
A haplotype, by the textbook definition, is a set of alleles on one (any) chromosome that are inherited together.
A Y-chromosome haplotype is a particular combination of variants at those parts of the Y chromosome that differ between people. The haplotype is inherited through the paternal line and is passed from father to son unchanged (strictly speaking, only the part of the Y chromosome that does not exchange sections with the X chromosome during meiosis is passed on unchanged). The mitochondrial haplotype is a combination of individually variable sites in mtDNA, which is inherited through the maternal line, because at fertilization the zygote receives all of its cytoplasm, mitochondria included, from the egg.
A haplogroup is a group of haplotypes that have a common ancestor. This ancestor once had a mutation that was inherited by the descendants. If a mutation affects a single nucleotide, then a single nucleotide polymorphism (SNP) occurs. Each such mutation transmitted to descendants leads to the appearance of a new branch on the haplogroup tree. Such substitutions occur rarely: one in 22 generations, that is, on average once in 550 years. (In fact, mutations occur more often, but we are talking only about those substitutions that were lucky enough to gain a foothold in the population.)
Since haplogroups are inherited, it is convenient to use them to study past migrations: they mark, as it were, the population groups that moved from one place to another and settled new territories. Knowing a person's Y-chromosomal haplogroup, it is possible to trace the path his paternal-line ancestors took over thousands and tens of thousands of years, since the time when people of the modern species came out of Africa. The mtDNA haplogroup makes it possible to trace the path of a person's ancestors on the maternal line.
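The arithmetic behind these two figures can be checked with a small sketch. It simply takes the numbers quoted above as given; the function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the text (treated here as given, not independently derived).
GENERATIONS_PER_SUBSTITUTION = 22
YEARS_PER_SUBSTITUTION = 550


def implied_generation_length():
    """Years per generation implied by the two figures above."""
    return YEARS_PER_SUBSTITUTION / GENERATIONS_PER_SUBSTITUTION


def expected_substitutions(years):
    """Average number of established substitutions on one lineage over `years`."""
    return years / YEARS_PER_SUBSTITUTION


print(implied_generation_length())   # 25.0 years per generation
print(expected_substitutions(5500))  # 10.0 substitutions in 5,500 years
```

In other words, the quoted rate assumes a generation of about 25 years, and a lineage followed for 5,500 years would be expected to pick up roughly ten new branch-defining substitutions.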
Haplotypes within each haplogroup are determined by a different type of polymorphism — by short tandem repeats, or STR markers. These are areas consisting of repeating sequences, and different haplotypes differ in the number of such repeats. Mutations that change the number of repeats of the same fragment occur much more often than nucleotide substitutions. That is why they are usually used "on a small scale" — to characterize an individual haplotype and to establish kinship ties between individuals.
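As a rough illustration of how STR haplotypes are compared, the sketch below stores a haplotype as a mapping from marker name to repeat count and sums the differences in repeat number. The marker names are real Y-STR loci, but the repeat counts are invented for the example, and the distance measure is one simple choice among several, not a description of any particular laboratory's method.

```python
def str_distance(haplotype_a, haplotype_b):
    """Sum of absolute differences in repeat counts over the shared STR markers."""
    shared_markers = haplotype_a.keys() & haplotype_b.keys()
    return sum(abs(haplotype_a[m] - haplotype_b[m]) for m in shared_markers)


# Invented repeat counts, for illustration only.
person_1 = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
person_2 = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
person_3 = {"DYS19": 15, "DYS390": 22, "DYS391": 11}

print(str_distance(person_1, person_2))  # 0: identical haplotypes
print(str_distance(person_1, person_3))  # 3: differences of 1 + 2 + 0 repeats
```

The smaller the distance, the more recently two men are likely to share a paternal-line ancestor, which is why STR haplotypes are suited to questions of kinship.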
So if an article is about peoples and populations, it will most likely deal with haplogroups and sub-haplogroups. And if it mentions families and blood relatives, it will deal with haplotypes.
So what about peoples and countries? We identify haplogroups in different territories, look at them and what do we see?
In different populations, particular Y-chromosome and mtDNA haplogroups occur at different frequencies.
In some indigenous peoples, one haplogroup or another can dominate, reaching a high frequency. This happens more often with Y-chromosomal haplogroups, because most peoples historically lived by the principle of patrilocality: men stayed in place, and wives could be taken from elsewhere. But this does not mean that other peoples and other regions lack that haplogroup.
For example, haplogroup R1a is most frequently found in Eastern Europe — in Russians, Ukrainians, Belarusians, Poles, Slovenes, Slovaks. But it is completely wrong to label it "Slavic" — it is also common among the Baltic peoples. And another area of its high frequencies is Central and South Asia; R1a dominates among the Kyrgyz, Tajiks, and many peoples of India and Pakistan.
The zone of maximum frequency of haplogroup R1b is Western Europe, although it is also found throughout Eurasia. The range of haplogroup N covers the entire northern half of Eurasia, from the Far East, Northern China and Japan through Siberia and the Urals to Eastern Europe. Different branches of it are common among different peoples: for example, the N3 branch marks the peoples of the Uralic language family and helps to trace the spread of these languages. Haplogroup C is widespread in Eastern and Central Asia, as well as in North America and Australia. Haplogroup Q is common among some peoples of Siberia, and it is also considered the "calling card" of American Indians.
On the other hand, each population has a spectrum of haplogroups, which, in the words of population geneticists, constitutes its "Y-chromosome portrait" or "mitochondrial portrait".
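A "portrait" of this kind is simply a table of haplogroup frequencies in a sample. A minimal sketch, with an invented sample of ten men and a hypothetical function name:

```python
from collections import Counter


def haplogroup_portrait(haplogroups):
    """Return the share of each haplogroup in a list of per-person haplogroup labels."""
    counts = Counter(haplogroups)
    total = len(haplogroups)
    return {haplogroup: n / total for haplogroup, n in counts.items()}


# Invented sample, for illustration only.
sample = ["R1a"] * 5 + ["N"] * 3 + ["R1b"] * 2
portrait = haplogroup_portrait(sample)

for haplogroup, share in sorted(portrait.items()):
    print(haplogroup, share)
```

The resulting shares add up to one, which is what makes it natural to draw them as sectors of a pie chart.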
It is important that the alleles by which haplogroups are determined are selectively neutral, which means that natural selection does not act on them. Haplogroup frequencies can change over time under the influence of genetic drift: random fluctuations that usually occur in small populations. The main thing that shapes the "portraits" is the history of the group: migrations, encounters with other peoples, and the splitting of a formerly unified group.
The frequency spectrum of haplogroups in a population is most often shown as pie charts like those in the figure below, which presents the "Y-chromosome portraits" of different ethnogeographic groups of Tatars.
Balanovskaya et al. Tatars of Eurasia: the uniqueness of the gene pools of the Crimean, Volga and Siberian Tatars // Bulletin of the Moscow University. Series XXIII, Anthropology. No. 2, 2016.
How are haplogroups designated?
Letters of the Latin alphabet, assigned in order of discovery, denote clusters of haplogroups, and within a cluster each haplogroup has an alphanumeric designation: for example, R1a, R1b, R2 ... Such a system makes it possible to place newly discovered twigs on the haplogroup tree.
And what are the other letters and numbers in brackets?
They are the designations of the mutations that serve as markers of the haplogroup.
For example, the marker of R1a, probably the best-known Y-chromosomal haplogroup, is the M420 mutation, so it is designated R1a (M420). It was this mutation that separated R1a from the parent haplogroup R1, and this happened 22-25 thousand years ago, presumably in Asia. But R1a also carries all the mutations that occurred earlier in its history. Later, about 17 thousand years ago, R1a (M420) split into the branches R1a1 (M459) and R1a2 (YP4141); these branches formed sub-branches, for example R1a1a (M198), which arose 5,800 years ago, and so on.
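How drift works can be seen in a toy simulation. The sketch below assumes the simplest textbook model (each man in the next generation inherits his Y chromosome from a randomly chosen man of the previous one, with a constant number of men); the function name and parameters are hypothetical.

```python
import random


def simulate_drift(start_frequency, n_men, generations, seed=0):
    """Follow the frequency of one neutral haplogroup under pure random drift."""
    rng = random.Random(seed)
    frequency = start_frequency
    history = [frequency]
    for _ in range(generations):
        # Each son draws his father at random; no selection is involved.
        carriers = sum(1 for _ in range(n_men) if rng.random() < frequency)
        frequency = carriers / n_men
        history.append(frequency)
    return history


small_village = simulate_drift(0.5, n_men=20, generations=50)
large_region = simulate_drift(0.5, n_men=20000, generations=50)
print(small_village[-1], large_region[-1])
```

Running it with different seeds typically shows the frequency in the small group wandering far from the starting value, sometimes all the way to 0 or 1, while in the large group it barely moves.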
Today, not only genotyping is used, but also complete sequencing of the Y chromosome. This makes it possible to detect smaller and smaller branches on the Y-chromosome phylogenetic tree and to study the phylogeography even more precisely — to find out in which populations which branches occur.
Is it possible to trace the ancestral history of one person by his haplogroup?
When we determine a haplogroup, we thereby determine the ancestral history.
For example, if a genetic test reveals the Y-chromosomal haplogroup R1b1a2 in a person, we can say that 4,000-8,000 years ago his male-line ancestors lived in Europe or Western Asia (where haplogroup R1b formed). They had arrived there about 18,000 years ago from Southwest Asia (where haplogroup R1 originated). Their ancestors lived about 27,000 years ago in Central Asia (where the ancestral haplogroup K arose), having come there from the Middle East about 40,000 years ago (where haplogroup F arose). And the carriers of the original haplogroup CT came out of Africa 65,000-70,000 years ago.
Of course, if we look at the mitochondrial haplogroup obtained from female ancestors, we will see a different story. And to find out the contribution of all the other ancestors, you will have to study non-sex chromosomes (autosomes). The study of the autosomal gene pool gives a much more complete picture of the genetic history of the population.
What is genome-wide analysis? Is this genome sequencing?
Maybe, but not necessarily.
Sequencing a human genome is still expensive, so population studies much more often use microchips (also called biochips or panels) that analyze hundreds of thousands or several million single-nucleotide substitutions, or SNP markers. Russian population geneticists often use the term "genome-wide analysis", a calque of the English "genome-wide". The Russian term is not very euphonious, but it emphasizes the difference between genotyping individual SNPs (even though they are spread across the whole genome) and a complete reading of the nucleotide sequence, or sequencing. It has already been suggested, though, that low-coverage sequencing on next-generation sequencing (NGS) instruments is approaching microchips in cost and can replace them.
How do population geneticists analyze and interpret genome-wide data?
There are different methods for this.
Let's talk about the main ones.
Principal component analysis (PCA) method
This is the most traditional method of analysis and appears in almost every population-genetic paper. It is used in a wide variety of scientific fields. Mathematically, it comes down to decomposing the data matrix into a set of vectors, the principal components.
To explain it in plain terms, without going into the mathematics: out of the many factors that affect the value of a trait, the method picks the components that contribute most to its variability. The first and second components are usually the most important. They are plotted along the coordinate axes, and the objects under study, in this case individual genomes, are placed in that coordinate space. On this graph, scientists put not only the samples they are studying but also genomic data from previously studied populations. The principal component plot clearly shows how populations group in genetic space: which of them are genetically close and which are distant. If a study deals with ancient genomes, data on other previously published ancient genomes and on modern populations are shown on the graph together with them.
For example, the authors of a recent article in Current Biology investigated the Basque gene pool against the background of the surrounding peoples. In Figure 2A, where the first (PC1) and second (PC2) principal components are plotted along the axes, the Basques are shown as green circles. They lie at the very edge of the genomic diversity of the populations of Europe, the Middle East and North Africa. Closest to them are the so-called peri-Basques, who live next to the Basques in Spain and France but speak Spanish or French rather than the Basque language (Euskara). The Spanish and French peri-Basques occupy an intermediate position between the Basques, the Spaniards and the French; the inhabitants of Sardinia are quite close to them genetically. North Africa, although geographically close to the Basques, turned out to be genetically distant.
And a similar graph shows ancient people from the western part of Russia: asterisks mark three hunter-gatherers (from 10,800 to 4,250 BC) and 26 representatives of the Bronze Age Fatyanovo culture (2,900-2,050 BC) from western Russia, as well as one representative of the Corded Ware culture from Estonia (2,850-2,500 BC). The three hunter-gatherers group with previously studied European hunter-gatherers (blue), while the Fatyanovo individuals and the person from Estonia group with representatives of the Corded Ware culture from different European countries (red). Farmers, both European and Anatolian (green), are far from them.
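For readers who want to see the idea in action, here is a minimal sketch of finding the first principal component in pure Python. It assumes genotypes coded as 0, 1 or 2 copies of an allele at each SNP and uses power iteration, a simple textbook technique; real studies use dedicated software and hundreds of thousands of markers. The data and function name are invented for illustration.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component of `rows`."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    def project(vector):
        return [sum(c[j] * vector[j] for j in range(m)) for c in centered]

    # Arbitrary non-zero starting vector (an assumption of this sketch).
    direction = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = project(direction)
        # Multiply by X^T X, then renormalize to unit length.
        updated = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in updated) ** 0.5
        if norm == 0:
            break
        direction = [x / norm for x in updated]
    return direction, project(direction)


# Four invented individuals from two invented groups, typed at four SNPs.
genotypes = [
    [0, 0, 2, 2],  # group A
    [0, 1, 2, 2],  # group A
    [2, 2, 0, 0],  # group B
    [2, 2, 0, 1],  # group B
]
pc1_direction, pc1_scores = first_principal_component(genotypes)
print(pc1_scores)
```

The two individuals of group A receive scores of one sign and the two of group B scores of the opposite sign, which is exactly the kind of grouping that shows up as separate clouds of points on a PCA plot.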
Saag et al., Genetic ancestry changes in Stone to Bronze Age transition in the East European plain. Science Advances, 2021.
ADMIXTURE (literally "admixture") is a computer program that simulates the mixed genomic composition of individuals based on their genotypes and allows you to make assumptions about the origin of the population.
The researcher sets the value k, the number of hypothetical ancestral populations. When these populations are built into the model they have no names; they are purely notional. Say k=3. The program models the contribution of each of these three putative ancestral populations to the genomes of the studied population, as well as to previously studied genomes of other populations. The number k can be arbitrarily large, but an optimal value has to be chosen. Using certain bioinformatic criteria, the researcher picks the k that is most informative and at the same time fits the real data best. The fit is judged by the size of the error, which should be minimal.
The program presents the results of its calculations as a palisade of multicolored bars. Each bar is an individual genome, and a group of bars represents the genomes of one population. Each color is a specific genetic component, that is, the contribution of one ancestral population or another to the genome.
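The model behind such bar plots can be sketched in a few lines. The sketch assumes the standard admixture model, in which an individual's expected allele frequency at each SNP is a mixture of the ancestral populations' frequencies weighted by his ancestry shares, and each genotype is a count of 0, 1 or 2 alleles. It only scores given ancestry shares; it does not reproduce the optimization that the ADMIXTURE program itself performs. All numbers and names are invented for illustration.

```python
import math


def mixed_allele_frequency(shares, snp_frequencies):
    """Allele frequency expected for an individual with the given ancestry shares."""
    return sum(q * p for q, p in zip(shares, snp_frequencies))


def log_likelihood(genotype, shares, frequencies):
    """Log-likelihood of 0/1/2 genotypes (constant binomial terms omitted)."""
    total = 0.0
    for count, snp_frequencies in zip(genotype, frequencies):
        f = mixed_allele_frequency(shares, snp_frequencies)
        total += count * math.log(f) + (2 - count) * math.log(1 - f)
    return total


# k = 2 notional ancestral populations, four SNPs (invented frequencies).
frequencies = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
genotype = [2, 2, 0, 0]  # looks like a typical member of the first population

mostly_first = log_likelihood(genotype, [0.9, 0.1], frequencies)
mostly_second = log_likelihood(genotype, [0.1, 0.9], frequencies)
print(mostly_first > mostly_second)  # True
```

The ancestry shares that explain the genotypes best are the ones drawn as the colored segments of that individual's bar.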
The question arises, how can we give hypothetical ancestral populations a biological meaning, call them something, if their number is arbitrary and they are initially conditional? This becomes clear when comparing different populations: if a component of a certain color clearly dominates in one of them, then the component can be given the name of this population.
For example, the ADMIXTURE analysis graph below shows populations of Slavic and Baltic peoples in the context of the surrounding peoples of Europe and the whole of Eurasia.
Kushniarevich et al. Genetic Heritage of the Balto-Slavic Speaking Populations: A Synthesis of Autosomal, Mitochondrial and Y-Chromosomal Data // PLOS ONE, 2015.
On the graph shown, ADMIXTURE k=5. It can be seen that in the genomes of the Balto-Slavic populations, in the lower part of the figure, almost the entire spectrum of ancestral components is represented by two shades of blue (labeled k3 and k2). Looking at Europe as a whole, one can see that k3 makes a large contribution to all European populations and decreases from the northeast to the south. This ancestral component is at its maximum in Baltic populations, prevails in Eastern Slavs (80-95%) and decreases in Southern Slavs (55-70%). By contrast, k2 is more typical of populations of the Mediterranean and Caucasus regions and decreases towards the north of Europe. Thus k3 can conditionally be called the North-Central European component, and k2 the Southern European component.
In addition, the Slavs have another, lemon-yellow component (k5), although it is appreciably represented only among the Eastern Slavs, and among them most of all in northern Russians. Comparison with other populations of Eurasia shows that this component can be called Siberian. The dark green component (k4) is present in a small proportion in the Southern Slavs; judging by the populations where it is best represented, it can be called South Asian. Finally, the dark yellow component (k6), which is practically absent from both the Slavs and the Baltic peoples, is the East Asian genetic trace.
How can this be interpreted in terms of the origin of populations? First of all, the genetic similarity of most Western and Eastern Slavs is obvious, while the Southern Slavs differ from them more. In addition, it is clear that the Eastern Slavs have a genetic trace from the east, but by origin it is associated more with migrations from Siberia than from Central Asia. And the dark green trace of South Asian populations is also common in the Middle East and the Mediterranean. It is therefore not surprising that it occurs, albeit at low frequency, among the Southern Slavs and other peoples of the Balkan peninsula.
Another method often used by population geneticists is the search for fragments of the genome of common origin in pairs of individuals from two different populations. It is called IBD analysis (identical by descent). Different people, representatives of different populations, inherited these fragments from the same common ancestor. Fragments of common origin are similar to mtDNA and Y-chromosome haplotypes, but differ from them in that over time they are broken up by recombination — the exchange of sites between the paternal and maternal chromosomes.
If the common fragments are short, strongly broken by recombinations, it means that the common ancestor of these people lived a long time ago. Conversely, the longer they are, the fewer generations ago the common ancestor lived. It is by the number of long IBD fragments in the genomes of representatives of two different populations that it can be concluded that these populations diverged in their history relatively recently.
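The link between fragment length and time can be put into numbers with a common rule of thumb, which is an assumption added here and not taken from the studies discussed: two people who share an ancestor g generations back are separated by 2g meioses, so a fragment inherited from that ancestor has an expected length of about 100/(2g) centimorgans. The function names are hypothetical.

```python
def expected_ibd_length_cm(generations):
    """Expected length (cM) of a fragment from an ancestor `generations` back."""
    if generations < 1:
        raise ValueError("generations must be at least 1")
    return 100.0 / (2 * generations)


def approximate_generations(length_cm):
    """Rough number of generations to the common ancestor for a fragment length."""
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return 100.0 / (2 * length_cm)


print(expected_ibd_length_cm(5))   # 10.0 cM for an ancestor 5 generations back
print(expected_ibd_length_cm(25))  # 2.0 cM for an ancestor 25 generations back
```

Individual fragments vary widely around this average, so in practice conclusions are drawn from the distribution of many fragments rather than from a single one.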
In addition, the f3, f4 and D statistics are used for genetic comparison of populations. All of them are based on the analysis of allele frequencies in populations and use genome-wide data.
What is genogeography?
Population geneticists often use mapping methods to visualize their data.
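To give a feel for these statistics, here is a sketch of the simplest form of f3. It assumes the usual definition, the average over SNPs of (c - a)(c - b), where a, b and c are allele frequencies in populations A, B and C, and it leaves out the sample-size corrections that real software applies. A clearly negative value suggests that C is a mixture of groups related to A and B. The frequencies are invented.

```python
def f3(target, source_1, source_2):
    """Uncorrected f3(target; source_1, source_2) from per-SNP allele frequencies."""
    products = [(c - a) * (c - b) for c, a, b in zip(target, source_1, source_2)]
    return sum(products) / len(products)


# Invented allele frequencies at three SNPs.
pop_a = [0.1, 0.9, 0.2]
pop_b = [0.9, 0.1, 0.8]
mixed = [0.5, 0.5, 0.5]       # halfway between A and B at every SNP
relative_of_a = [0.0, 1.0, 0.1]  # drifted away from A, not towards B

print(f3(mixed, pop_a, pop_b))          # negative
print(f3(relative_of_a, pop_a, pop_b))  # positive
```

The mixed population lies between its two sources at every SNP, so each product is negative; a population that is simply related to one of them gives a positive value.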
With the use of these methods, in fact, population genetics passes into genogeography, the founder of which is considered to be the Russian geneticist Alexander Sergeyevich Serebrovsky.
Mapping methods make it possible to transfer different kinds of genetic data onto a map. Frequency distribution maps of Y-chromosome or mtDNA haplogroups are built especially often; this, for example, is what the map of haplogroup R1b (L10) in Europe looks like. High frequencies of the haplogroup correspond to purple and brown-red shades on the map, and low frequencies to green shades. It can be seen that the area of maximum R1b frequency is Western Europe; there is also a spot of high frequency in the Urals and, for some reason, in North Africa (we will not go into the reasons here).
Map from O.P. Balanovsky's monograph "The Gene Pool of Europe" (2015).
It is also possible to map not haplogroups but particular alleles of interest to the researcher that are responsible for certain phenotypic traits. For example, this is the distribution map of the HERC2 rs1129038 T allele, which controls eye and hair pigmentation, in populations of Northern Eurasia. The HERC2 protein, among its other functions, supports the production of dark pigment, and nucleotide substitutions affecting its activity lead to a lack of pigment. Accordingly, the more common the HERC2 rs1129038 T allele is in a particular region, the more often the population there has light eyes and hair. (Of course, we must remember that pigmentation depends not only on this gene.)
Balanovskaya et al., Genogeographic atlas of DNA markers that control the color of human eyes and hair. Genetics, 2021.
Maps can also be built from calculated genetic distances between populations. As a rule, Nei's standard genetic distance is used, which is based on a comparison of allele frequencies. The map shows the genetic distances from one chosen population to all the others. This gives a clear idea of how similar it is to the surrounding populations and how it differs from them. For example, here is a map of genetic distances from the Finns. On it, green colors show the areas of minimum genetic distance, that is, the populations genetically closest to the Finns, and red-brown colors show the areas of maximum genetic distance, that is, the populations genetically far from the Finns.
Map from O.P. Balanovsky's monograph "The Gene Pool of Europe" (2015).
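Nei's standard distance is easy to compute once allele frequencies are known. The sketch below assumes the classical formula D = -ln(Jxy / sqrt(Jx * Jy)), where Jx and Jy are the average sums of squared allele frequencies within each population and Jxy is the average sum of products of frequencies between them; the data and function name are invented for illustration.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance from per-locus lists of allele frequencies."""
    loci = len(pop_x)
    j_x = sum(sum(p * p for p in locus) for locus in pop_x) / loci
    j_y = sum(sum(p * p for p in locus) for locus in pop_y) / loci
    j_xy = sum(
        sum(p * q for p, q in zip(locus_x, locus_y))
        for locus_x, locus_y in zip(pop_x, pop_y)
    ) / loci
    if j_xy == 0:
        return math.inf  # no alleles in common at any locus
    return -math.log(j_xy / math.sqrt(j_x * j_y))


# One locus with two alleles; frequencies are invented.
population_1 = [[0.5, 0.5]]
population_2 = [[0.5, 0.5]]
population_3 = [[1.0, 0.0]]

print(nei_standard_distance(population_1, population_2))  # zero: identical frequencies
print(nei_standard_distance(population_1, population_3))  # about 0.347
```

On a distance map, each point is colored by a value of this kind calculated between the reference population and the population living at that point.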
Portal "Eternal youth" http://vechnayamolodost.ru