# Population Structure

This is a PLOS Topic Page draft

Authors

Lana S. Martin
AFFILIATION: Computer Science, University of California, Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095
0000-0003-2311-7191

Eleazar Eskin
AFFILIATION: Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095
0000-0003-1149-4758

## Abstract

Population structure (or population stratification) is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry, especially in the context of association studies. Population structure arises when there is relatedness among individuals in a study cohort. Unless accounted for in the methodology, population structure can create false positive associations when conducting a genome-wide association study (GWAS).

## Causes

Figure 1. A phylogenetic tree demonstrating the relationships between 38 inbred mouse strains using 140,000 Mouse HapMap SNPs. As shown in the tree, the strains cluster in two groups: classical inbred strains and wild-derived strains. The body weight phenotypes of the strains, obtained from the Mouse Phenome Database, are shown. Here, classical inbred strains have much higher body weight than do wild-derived strains. Many SNPs separate the two groups because of the long branch length. One such SNP is shown in the figure. Clearly the SNP is highly correlated with body weight. All of the SNPs that separate these two groups will have the same correlation. When we consider both the tree and the SNP, we can infer that the population structure, and not an effect of the SNP on body weight, may be driving this correlation.

The basic cause of population structure is nonrandom mating between groups, often due to their physical separation (e.g., for populations of African and European descent) followed by genetic drift of allele frequencies in each group. In some contemporary populations there has been recent admixture between individuals from different populations, leading to populations in which ancestry is variable (as in African Americans). Over tens of generations, random mating can eliminate this type of structure. In some parts of the globe (e.g., in Europe), population structure is best modeled by isolation-by-distance, in which allele frequencies tend to vary smoothly with location.

## Population structure and association studies

Population structure can be a problem for association studies, which are often a type of case-control study. Here, an association signal could be found due to the underlying structure of the population and not due to a disease-associated locus.

Geneticists link genetic traits with disease risk and development using a genome-wide association study. Association studies discover these genetic factors by correlating an individual’s genetic variation with a disease status or disease-related trait. At the genome-wide scale, association studies typically focus on statistical relationships between single-nucleotide polymorphisms (SNPs) and disease traits. SNPs are the most common genetic variants underlying susceptibility to disease, and associated SNPs are considered to mark the region of a human genome that influences disease risk. A GWAS identifies a SNP as a significant, and therefore associated, variant when the specific genome sequence at the SNP is correlated with a disease trait or disease status.

For example, a GWAS may find that individuals with a specific sequence (or allele) at a SNP have higher blood pressure on average than individuals with a different sequence at the SNP. If a SNP has a significant correlation with a trait or disease status, the association study suggests that presence of the particular variant may increase an individual’s risk for disease.

One challenge to producing accurate genome-wide association study results is that of relatedness, termed “population structure,” within a study cohort. Population structure can produce many false positive associations in genome-wide association study results. In other words, population structure may cause association study methods to incorrectly identify genetic variants as being associated with a disease.

By analogy, one might imagine a scenario in which certain small beads are made out of a certain type of unique foam, and that children tend to choke on these beads; one might wrongly conclude that the foam material causes choking when in fact it is the small size of the beads. Also, the real disease-causing locus might not be found in the study if the locus is less prevalent in the population from which the case subjects are chosen. For this reason, it was common in the 1990s to use family-based data, where the effect of population stratification can easily be controlled for using methods such as the transmission disequilibrium test (TDT). But if the structure is known or a putative structure is found, there are a number of possible ways to leverage this structure in the association studies and compensate for population bias. Most contemporary genome-wide association studies consider the problem of population stratification to be manageable,[1] and the logistic advantages of using unrelated cases and controls make these studies preferable to family-based association studies.

Since the 2010s, new approaches to genome-wide association studies have used mixed models to mitigate the biasing effects of population structure and relatedness. However, developing methods that are capable of effectively testing for association while correcting for population structure is a computational and statistical challenge.

The two most widely used approaches to this problem are genomic control, which is a relatively nonparametric method for controlling the inflation of test statistics,[2] and structured association methods,[3] which use genetic information to estimate and control for population structure. Currently, the most widely used structured association method is Eigenstrat, developed by Alkes Price and colleagues.[4]

## Hypothesis testing of genetic variants

Figure 2. Significance testing in association studies. The null distribution is the standard normal distribution and the expected distribution of the association statistics under the assumption that the effect size is 0. Each variant’s association statistic is computed, and its significance is evaluated using the null distribution. If the statistic falls in the significance region of the distribution, the variant is declared associated. In this example, S1 is not significant, while S2 and S3 are significant.

In order to evaluate if the association between a SNP and a phenotype is statistically significant, the collected genomic data can be used to test two hypotheses.

The null hypothesis assumes a model where the SNP does not affect the phenotype (see Figure 1b). In this hypothesis, the phenotypes (${\displaystyle y}$) are only affected by the population mean (${\displaystyle \mu }$) and the environment (${\displaystyle e}$). Unless data indicate otherwise, we assume that the null hypothesis is true and the SNP does not influence the phenotype (i.e., does not affect the individual’s disease risk).

An alternative hypothesis provides a model of the SNP being significantly associated with the phenotype (see Figure 1c). In this case, the phenotypes (${\displaystyle y}$) are affected not only by the population mean (${\displaystyle \mu }$) and environment (${\displaystyle e}$), but they are also affected by the genotype (${\displaystyle x}$). In other words, presence of the SNP suggests an individual is likely to have the trait or disease risk. Here, the quantitative measurement of strength that the genotype has on the phenotype is referred to as the effect size (${\displaystyle \beta }$). If the effect size (${\displaystyle \beta }$) is equal to 0, we consider the two models equivalent. The SNP is determined to be significantly associated with the phenotype when the data fits the alternative hypothesis beyond a specific threshold.

The null and the alternative hypotheses are mathematically expressed in order to perform a single-SNP test. The genotype of the jth individual at the kth variant is denoted ${\displaystyle g_{jk}}$, where the genotype is in the set {0,1,2}, which is the number of copies of the kth variant that the jth individual has on their two chromosomes. Here, a “0” denotes the genotype that does not contain the variant in either chromosome, while a “1” or “2” denotes the presence of the variant in one or two of the chromosomes, respectively. In order to simplify the equations for association studies, the genotypes are standardized by subtracting the population mean and dividing by the standard deviation. The frequency of a variant in the population is denoted as ${\displaystyle p_{k}}$, which is the average genotype frequency in the population. The standardized genotypes can be expressed as

${\displaystyle X_{jk}\in {\bigg \{}{\frac {-p_{k}}{\sqrt {p_{k}(1-p_{k})}}},{\frac {1-p_{k}}{\sqrt {p_{k}(1-p_{k})}}},{\frac {2-p_{k}}{\sqrt {p_{k}(1-p_{k})}}}{\bigg \}}}$

Once the standardized genotypes are calculated, a typical single-SNP test can be used to identify variants associated with traits. A standard regression technique estimates the relationship among variables, including a dependent variable (${\displaystyle y}$), any independent variables (${\displaystyle x}$), and unknown parameters (${\displaystyle \beta }$). Using regression, these simple linear models can correlate the genetic variation with the trait, allowing a test of whether the data best fits the null or alternative hypothesis.

## Modeling the relationship between genotypes and phenotypes

Fig 3. Standard genetic association study applied to human blood pressure data. (a) The left SNP appears to be more strongly associated with blood pressure than the right SNP. (b) We test two hypotheses against each other to evaluate whether the association between a SNP and a phenotype is statistically significant. By default, a null hypothesis assumes that the SNP does not affect the phenotype. (c) If the data fits the alternative hypothesis beyond a certain threshold, the SNP is described as significantly associated with the phenotype.

The equation

${\displaystyle y_{j}=\mu +\beta _{k}X_{jk}+e_{j}}$

models the phenotype for a single individual ${\displaystyle j}$ in the study. Here, the effect of the variant on the phenotype is ${\displaystyle \beta _{k}}$, the model mean is ${\displaystyle \mu }$, and the contribution of the environment on the phenotype is ${\displaystyle e_{j}}$. The environment’s effect on a phenotype for an individual ${\displaystyle j}$ (${\displaystyle e_{j}}$) is assumed to be normally distributed with variance ${\displaystyle \sigma _{e}^{2}}$, denoted as ${\displaystyle e_{j}\sim N(0,\sigma _{e}^{2})}$.

The equation above describes the relationship between the genotype and phenotype of just one individual. Vector notation can be used to represent all of the individuals in the dataset and produce the model

${\displaystyle y=\mu 1+\beta _{k}X_{k}+e}$

, with the phenotypes of all of the individuals in the dataset denoted as column vector ${\displaystyle y}$, a column vector containing the genotypes for the ${\displaystyle k}$th variant in the population denoted as ${\displaystyle X_{k}}$, and a vector containing the environmental contributions denoted as ${\displaystyle e}$. Here, 1 is a column vector of 1s. The random vector ${\displaystyle e}$ is drawn from the distribution ${\displaystyle e\sim N(0,\sigma _{e}^{2}I)}$. Each element of ${\displaystyle e}$ is independent of the others; hence, the variance-covariance matrix is a diagonal matrix (${\displaystyle \sigma _{e}^{2}I}$).
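The single-SNP test described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `standardize` and `single_snp_test` are hypothetical, genotypes are standardized with the sample mean and standard deviation, the SNP is assumed to be polymorphic in the sample, and significance is evaluated against a standard normal null distribution.

```python
import math
import numpy as np

def standardize(g):
    """Center a 0/1/2 genotype vector and scale it to unit variance."""
    g = np.asarray(g, dtype=float)
    return (g - g.mean()) / g.std()  # assumes the SNP varies in the sample

def single_snp_test(g, y):
    """Fit y = mu + beta * x + e by least squares and test beta = 0."""
    x = standardize(g)
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu_hat = y.mean()                      # x has mean 0, so the intercept is the mean
    beta_hat = x @ (y - mu_hat) / (x @ x)  # least-squares effect size estimate
    resid = y - mu_hat - beta_hat * x
    sigma2_hat = resid @ resid / (n - 2)   # estimate of the environmental variance
    stat = beta_hat / math.sqrt(sigma2_hat / (x @ x))
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided, normal approximation
    return beta_hat, stat, p_value

# Hypothetical toy data: the trait increases with the number of variant copies.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
beta_hat, stat, p_value = single_snp_test(genotypes, phenotypes)
```

A large absolute value of `stat` corresponds to a statistic that falls in the significance region of the null distribution.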

### True genetic model

Figure 4. Pairwise similarity between strains gives some insight on the similarity of the unmodeled factor. In this toy example, we consider 10 SNPs where the even numbered SNPs are the causal SNPs with an effect on the trait. (a) Since B6 and C3H share alleles at 9 out of 10 SNPs, these strains have a similar value for the unmodeled factor. (b) When we consider other strains, the unmodeled factors may be larger. For example, B6 and CAST, which share few SNPs, will have different values for their unmodeled factor.

A mathematical model can be used to help explain the way population structure can bias the results of a genome-wide association study.

Assuming the collection of SNPs is independent and identically distributed (iid), the single-SNP test will indicate if a SNP is responsible for the differences observed in an individual’s trait or phenotype expression values. However, this simple linear model is an unrealistic model for identifying variants associated with traits in today’s large genomic datasets that contain a high degree of relatedness. In real populations, the true effect of a single SNP is influenced by multiple variants that are affecting the trait. A ‘hypothetical’ true genetic model takes into account the effect of all SNPs on the trait.

Here, the vector notation

${\displaystyle y=\mu 1+\sum _{i=1}^{M}\beta _{i}X_{i}+e}$

models the phenotypes of all the individuals in the dataset denoted as column vector ${\displaystyle y}$. Again, the effect of the ${\displaystyle i}$th variant on the phenotype is ${\displaystyle \beta _{i}}$, the mean is ${\displaystyle \mu }$, and the contribution of the environment on the phenotype is denoted by ${\displaystyle e}$. Here, the number of variants is ${\displaystyle M}$.

The true genetic model takes into account the true effect of all SNPs, including the effect of the SNP being tested for association with a trait. When testing SNP ${\displaystyle k}$, the actual data is generated from

${\displaystyle y=\mu 1+\beta _{k}X_{k}+\sum _{i\neq k}\beta _{i}X_{i}+e}$

In applying the simple linear model to data, we observe a mismatch between the model used for testing and the assumed underlying generative model. Here, any term that is missing in the testing model when compared to the generative model is called an unmodeled factor. The unmodeled factor is exactly ${\displaystyle \sum _{i\neq k}\beta _{i}X_{i}}$.

In this case, the unmodeled factor is the effect of variants in a genome other than the variant being tested. This factor can significantly affect the results of an association study. If the individuals in the study are related to each other, the unmodeled factor may produce a high rate of false positive associations.
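The following simulation is a rough sketch of how an unmodeled factor can inflate false positives. All numbers are hypothetical: two subpopulations of 100 individuals, 20 causal SNPs whose variant is rare in one group and common in the other, and 500 tested SNPs that have no effect on the trait but whose allele frequencies differ between the groups. Each non-causal SNP is tested with the simple linear model, using a normal approximation for the p-value.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_causal, n_tested = 100, 20, 500

def sample_genotypes(freqs):
    """Draw 0/1/2 genotypes for one subpopulation, one column per SNP."""
    return rng.binomial(2, freqs, size=(n_per_group, len(freqs))).astype(float)

# Causal SNPs: the variant is rare in group A and common in group B.
causal = np.vstack([sample_genotypes(np.full(n_causal, 0.2)),
                    sample_genotypes(np.full(n_causal, 0.8))])
# Tested SNPs: no effect on the trait, but frequencies differ between groups.
tested = np.vstack([sample_genotypes(rng.uniform(0.1, 0.9, n_tested)),
                    sample_genotypes(rng.uniform(0.1, 0.9, n_tested))])

# The phenotype is generated from the causal SNPs only; when a non-causal SNP
# is tested, this sum plays the role of the unmodeled factor.
y = causal @ np.full(n_causal, 0.5) + rng.normal(size=2 * n_per_group)

# Single-SNP test for every non-causal SNP via its correlation with the trait.
n = len(y)
yc = (y - y.mean()) / y.std()
keep = tested.std(axis=0) > 0          # drop SNPs that happen to be monomorphic
xc = tested[:, keep]
xc = (xc - xc.mean(axis=0)) / xc.std(axis=0)
r = xc.T @ yc / n
stat = r * np.sqrt((n - 2) / (1 - r**2))
p_values = np.array([math.erfc(abs(s) / math.sqrt(2)) for s in stat])

false_positive_rate = float(np.mean(p_values < 0.05))
print(f"fraction of non-causal SNPs with p < 0.05: {false_positive_rate:.2f}")
```

If the individuals were unrelated, roughly 5% of the non-causal SNPs would be expected to pass the 0.05 threshold; in this structured sample the fraction is far higher.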

In an association study, relatedness among individuals is referred to as population structure. Over the past few years, many methods have been developed to mitigate the effect of population structure in association studies. One of the most commonly used approaches today, mixed models, was first popularized in mouse studies and is now the standard approach for analyzing human GWAS.

### An example of population structure confounding from mouse genetics

Figure 5. Image NullAlternativeHypothesesSNPPhenotype

The importance of controlling for population structure is evident in genetic mapping of inbred mouse strains. Mouse strains pose particular problems that mixed models were developed to solve, and the basic ideas behind mixed models can be clearly demonstrated with mouse genetics.

Today’s classical inbred laboratory mouse strains descend from a relatively small number of genetic founders (mostly fancy mice originally kept as pets) and are characterized by several population bottlenecks [5] [6].

A second group of laboratory strains are referred to as “wild-derived” strains. These strains descend from mice that were captured in the wild and then inbred, and that were never kept as pets. Wild-derived strains do not share the population history of classical laboratory strains. A simple way to visualize the relationship between multiple ancestral groups and traits in the mouse genome is with a phylogenetic tree that can be computed from the genetic information (Figure 3). This tree visualizes the genetic relationships between 32 classical inbred strains and 6 wild-derived strains, using genetic variant information at 140,000 SNPs for each strain.

In this tree, there are two groups of mice whose members are close to each other in the phylogeny. These groups are separated by a long branch (denoted with a dotted line). This branch represents the many genetic differences between the groups of mice. When comparing measurements for body weight and liver weight taken from each of the two groups, the body weights of the classical strains are much larger than the body weights of the wild-derived strains (Figure 4). The differences in population genetics are produced by different selective pressures on the two groups, including environmental fitness (wild-derived) and human selection (laboratory).

A linear model can be applied to the 140,000 SNPs from this dataset to identify which genetic variants are associated with body weight. In general, association study results indicate very few significant associations between particular SNPs and a trait. One common way to visualize the results of an association study is with a Manhattan plot. In a Manhattan plot, the mouse genome is plotted against the x-axis, and the measure of significance of correlation between the genome and trait is plotted against the y-axis. Each red spike represents a SNP at a particular genomic position, and the height of the spike represents the strength of the association. The green horizontal line represents the significance threshold. Any SNP that crosses this line is considered a significant association.

Typically, a Manhattan plot will indicate that a number of SNPs affect the phenotypes. The plot often shows signals that cross the threshold at a few locations in the genome, but most of the SNPs will not be associated with the phenotype.

Another way to visualize the results of an association study is with a cumulative p-value distribution plot (b) and a quantile-quantile (Q-Q) plot (c). These plots are graphical techniques for determining whether multiple datasets come from populations with a common distribution. Here, the cumulative p-value distribution plot shows the quantiles of the p-values, which assess the probable significance of association between a genotype and trait; the Q-Q plot shows the distribution of the same data log-transformed.

Since the majority of SNPs will not be associated, most of the statistics will be coming from the null distribution. Thus, most of the p-values will be uniformly distributed between 0 and 1. Typically, only a small fraction of the SNPs have signals stronger than expected at the tail of the distribution. This results in a cumulative p-value distribution that is close to the diagonal line (Figure 5b) and a Q-Q plot that follows the line for the beginning of the curve (as shown in Figure 5c). As shown in Figure 5, we would expect that the median p-value would be close to 0.5.
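The expectation that null p-values are uniform can be checked numerically. The sketch below is illustrative only: it draws hypothetical association statistics from the standard normal null distribution, and, as a crude stand-in for inflation, a second set with three times the standard deviation. It then reports the median p-value and the largest gap between the cumulative p-value distribution and the diagonal.

```python
import math
import numpy as np

def two_sided_p(stats):
    """Two-sided p-values for statistics that are standard normal under the null."""
    return np.array([math.erfc(abs(s) / math.sqrt(2)) for s in stats])

def summarize(p_values):
    """Median p-value and the largest gap between the empirical CDF and the diagonal."""
    p_sorted = np.sort(p_values)
    ecdf = np.arange(1, len(p_sorted) + 1) / len(p_sorted)
    return float(np.median(p_sorted)), float(np.max(np.abs(ecdf - p_sorted)))

rng = np.random.default_rng(1)
null_stats = rng.normal(size=10_000)            # no SNP is associated
inflated_stats = 3.0 * rng.normal(size=10_000)  # hypothetical inflated statistics

null_median, null_gap = summarize(two_sided_p(null_stats))
infl_median, infl_gap = summarize(two_sided_p(inflated_stats))
print(f"null:     median p = {null_median:.2f}, max gap = {null_gap:.3f}")
print(f"inflated: median p = {infl_median:.2f}, max gap = {infl_gap:.3f}")
```

A median p-value well below 0.5, or a cumulative distribution that departs from the diagonal across its whole range, is a sign that many statistics do not come from the null distribution.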

However, when standard linear models are applied to the inbred mouse dataset, many locations in the genome have strong association signals (Figure 6a). The cumulative p-value distribution and the Q-Q plots are shown in Figure 6b and 6c. In fact, nearly 50% of the SNPs are significantly associated with the phenotype. There are far more significant associations (red line) than expected associations (yellow line).

### Why false positives are observed in mouse genetic studies

The excess of strong associations observed can be explained by examining the data for one of the red peaks from the Manhattan plot (Figure 6a) in Figure 7a. Here, the big circles are body weight values, and the small circles are genome-wide SNPs; the black small circles are reference alleles and the white small circles are alternate alleles. Based on the distribution of body weight values and SNPs, it appears that green SNPs correspond to mice with small body weight, while pink SNPs correspond to mice with heavy body weight. Clearly there is a very strong correlation between the SNP and the trait of body weight; it is no surprise that this analysis produces a very significant p-value.

However, laying the phylogenetic tree over the pattern of SNPs and body weight values (Figure 7b) shows that the separation of the population into classical and wild-derived strains is strongly correlated with body weight. Here, the SNP differentiates these two groups. The length of each branch in the tree corresponds to the amount of genetic differences between the two groups separated by the branch. The long branch length between the classical and wild strains indicates that many SNPs are dominant in one group and each has a strong signal. This correlation between strains and SNPs causes the large number of observed associations.

Clearly there are genetic differences between these two groups that affect body weight, but not every genetic difference between the two groups affects body weight. However, the simple linear model will associate every SNP that separates these two groups with body weight. Thus, most of the associations observed are for SNPs that are not actually affecting body weight. These associations are referred to as spurious associations.

Another way to understand the effect of population structure on association is through graphical models. Consider SNPs and traits in Figure 8a. Typically, an association test is performed on a SNP. Observation of an association gives evidence that the SNP affects the trait. On the other hand, if an association is not observed, this suggests that either the SNP does not affect the trait, or that the effect is too small for the study to detect. However, if genetic differences between groups are present (Figure 8b), shared histories will produce many SNPs directly correlated with population structure (straight dark line). In addition, phenotypes, such as body weight, are also highly correlated with the population structure (straight dark line). This will induce correlation between many SNPs and the phenotype (dotted line) including, but not limited to, the SNPs that are actually responsible for variation in the trait.

This phenomenon of association due to relatedness follows directly from the true genetic model described above. Here, the genetic history shared between mouse strains is the unmodeled factor ${\displaystyle \sum _{i\neq k}\beta _{i}X_{i}}$. Since the shared genetic history is missing from the testing model, we consider population structure the unmodeled factor.

## Correcting using mixed model methods

Figure 6. The mixed model includes a term u which attempts to model the unmodeled factors in the true model. The term uses information from the kinship matrix that accounts for the dependency among SNPs correlated with phenotypes due to population structure.

Population structure can bias the results of GWAS by producing a significant number of false associations. The mouse model example shows that studies must correct for population structure in order to accurately identify specific genetic variants involved in disease risk.

Several challenges presently limit the usefulness of genome-wide association studies for implicating genetic variants. First, unmodeled factors are not known and cannot be accounted for in computational methods that match genotypes with phenotypes. Second, we do not know the exact ways that unmodeled factors interact with population structure to bias output. Finally, many studies ignore dependency among these unmodeled factors.

The principle underlying mixed models is the incorporation of this “model” of unmodeled factors into the association test. Mixed models incorporate the unknown factors into the model of association using what is called a random effect or a variance component. This model is called a mixed model, because it combines a random effect with the effect sizes of the SNPs that are being tested (referred to as fixed effects) to model population structure.

### Correcting for population structure in mouse association studies

Using the mouse example, consider two different strains, B6 and C3H. These two strains are both classical inbred mice derived from domesticated mice and have similar genomes. Figure 9a shows a toy example considering the genomes of the two strains. Here, the genomes are very similar; nine out of ten SNPs are shared between B6 and C3H. In this example, assume that the even numbered SNPs are causal variants that affect the phenotype. For those variants, their corresponding effect size (${\displaystyle \beta _{i}}$) will be non-zero. The actual effect sizes and the resulting value for the unmodeled factor are unknown. However, because they share the same allele at nearly all of these SNPs, the two strains will have a similar value for the unmodeled factor.

Next, consider two very different strains pairwise (Figure 9b): the classic inbred mouse strain B6 and the wild mouse strain CAST. In this case, the strains have different alleles present at many SNPs. If any of these SNPs affect the trait, the value of the unmodeled factor will differ by the effect size. Thus, the two strains are expected to have different values for the unmodeled factor.

The amount of pairwise sharing of alleles between strains can be used to capture the similarity between the values of the unmodeled factor among strains. In order to do this, a matrix is created that contains all SNPs shared between the paired genomes (Figure 10). This matrix “models” the values of the unmodeled factors among the individuals in a study, and it shows which pairs have similar sharing of alleles and which pairs have dissimilar values.
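A pairwise allele-sharing matrix of this kind can be sketched as follows. The alleles below are invented for illustration (they are not real strain genotypes); they are chosen so that B6 and C3H share alleles at nine of ten SNPs, while B6 and CAST share few.

```python
import numpy as np

# Hypothetical alleles (0 = reference, 1 = alternate) at 10 SNPs for three strains.
alleles = {
    "B6":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "C3H":  [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "CAST": [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
}
names = list(alleles)
a = np.array([alleles[s] for s in names])

# sharing[i, j] = fraction of SNPs at which strains i and j carry the same allele.
sharing = (a[:, None, :] == a[None, :, :]).mean(axis=2)

for i, row_name in enumerate(names):
    print(row_name, np.round(sharing[i], 2))
```

Pairs with a high sharing fraction are expected to have similar values of the unmodeled factor; pairs with a low fraction are not.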

When using a mixed model to identify causal variation, one key step is to establish these fixed parameters and random effect components. A linear mixed model (LMM) uses the information from the matrix to account for the unmodeled factor. The LMM extends the simple linear model

${\displaystyle y=\mu 1+\beta _{k}X_{k}+e}$

to include a term that captures the unmodeled factors. The term ${\displaystyle u}$ in

${\displaystyle y=\mu 1+\beta _{k}X_{k}+u+e}$

is a random vector that depends on the amount of shared genome in terms of pairwise differences. Here, we assume that ${\displaystyle u\sim N(0,\sigma _{g}^{2}K)}$, where ${\displaystyle K}$ is the kinship matrix. Each entry of ${\displaystyle K}$ estimates the pairwise similarity between the genomes of the individuals in the study, which follows the intuition of Figures 9 and 10.

In practice, ${\displaystyle K}$ can be computed from the genotypes where each entry in the kinship matrix is just the product of the standardized genotypes for the two individuals divided by the number of variants. Thus, the kinship entry computing the relatedness between individuals ${\displaystyle i}$ and ${\displaystyle j}$ is

${\displaystyle K_{ij}={\frac {\sum _{k=1}^{M}X_{ik}X_{jk}}{M}}}$

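We can compute the kinship matrix for all pairs at once using the equation ${\displaystyle K=XX^{T}/M}$, where ${\displaystyle X}$ is the matrix of standardized genotypes with one row per individual and one column per variant.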
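As a minimal sketch, the kinship computation is a single matrix product. The function name `kinship` and the toy genotypes are hypothetical; each variant is standardized with its sample mean and standard deviation, which assumes every variant is polymorphic in the sample.

```python
import numpy as np

def kinship(genotypes):
    """Kinship matrix from an (individuals x variants) matrix of 0/1/2 genotypes."""
    g = np.asarray(genotypes, dtype=float)
    x = (g - g.mean(axis=0)) / g.std(axis=0)  # standardize each variant (column)
    return x @ x.T / g.shape[1]               # divide by the number of variants M

# Hypothetical inbred strains (homozygous, so genotypes are 0 or 2) at 6 variants.
toy = [
    [0, 0, 0, 0, 0, 0],  # strain A
    [0, 0, 0, 0, 0, 2],  # strain B, nearly identical to A
    [2, 2, 2, 2, 2, 0],  # strain C, very different from A
]
K = kinship(toy)
print(np.round(K, 2))
```

In this toy example the entry for the similar pair (A, B) is larger than the entry for the dissimilar pair (A, C), matching the intuition that kinship tracks pairwise allele sharing.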

The mixed model assumes that the phenotype follows the mixed model equation above. How well this assumption holds in practice is an active area of research, leading to many variations of mixed models, including techniques for computing kinship matrices.

The standard estimation equations above cannot be used to estimate the values of the parameters in the mixed model. Due to the random effect ${\displaystyle u}$, the phenotypes of the individuals are no longer independent of each other, which was an assumption of the previous methods.

However, if we know the values of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$, we can then apply the following “mixed model trick.” We note that the phenotypes will follow the distribution

${\displaystyle y\sim N(\mu 1+\beta _{k}X_{k},V)}$

, where ${\displaystyle V=\sigma _{g}^{2}K+\sigma _{e}^{2}I}$ and ${\displaystyle I}$ is the identity matrix. If we transform the phenotypes and genotypes by multiplying them by ${\displaystyle V^{-{\frac {1}{2}}}}$, we get

${\displaystyle V^{-{\frac {1}{2}}}y\sim N(V^{-{\frac {1}{2}}}1\mu +\beta _{k}V^{-{\frac {1}{2}}}X_{k},I)}$

In the transformed data, the individuals are now independent of each other, and we can apply the estimation equations presented above to estimate the values for ${\displaystyle \beta }$ and the association statistics.
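The transformation can be illustrated with a short sketch. It assumes the two variance components are already known, simulates hypothetical genotypes and phenotypes, and computes the inverse square root of the covariance matrix through an eigendecomposition. The names `inv_sqrt` and `mixed_model_fit` are illustrative, not taken from any particular package.

```python
import numpy as np

def inv_sqrt(v):
    """Symmetric inverse square root of a positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(v)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def mixed_model_fit(y, x_k, kin, sigma_g2, sigma_e2):
    """Estimate (mu, beta_k) by ordinary least squares on the transformed data."""
    n = len(y)
    v = sigma_g2 * kin + sigma_e2 * np.eye(n)
    w = inv_sqrt(v)
    design = np.column_stack([np.ones(n), x_k])  # columns: vector of 1s, tested SNP
    coef, *_ = np.linalg.lstsq(w @ design, w @ y, rcond=None)
    return coef

# Hypothetical data: 30 individuals, 200 variants, known variance components.
rng = np.random.default_rng(2)
n, m = 30, 200
geno = rng.binomial(2, 0.4, size=(n, m)).astype(float)
x = (geno - geno.mean(axis=0)) / geno.std(axis=0)
kin = x @ x.T / m
u = np.sqrt(0.5 / m) * (x @ rng.normal(size=m))  # random effect with covariance 0.5 * K
y = 1.0 + 0.8 * x[:, 0] + u + rng.normal(size=n)

mu_hat, beta_hat = mixed_model_fit(y, x[:, 0], kin, sigma_g2=0.5, sigma_e2=1.0)
```

Least squares on the transformed data gives the same estimates as generalized least squares on the original data, which is why the earlier estimation equations can be reused.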

In this case, we assume that the ${\displaystyle \beta _{i}}$ values are drawn from a normal distribution with mean zero and variance ${\displaystyle \sigma _{e}^{2}}$.

Estimating the values of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$ is a difficult computational problem referred to as estimating the variance components. These parameters are estimated by maximizing the log likelihood

${\displaystyle l(y,X_{k},\beta _{k},\sigma _{g},\sigma _{e},n)=-{\frac {1}{2}}[n\log(2\pi )+\log |V|+(y-X_{k}\beta _{k})^{T}V^{-1}(y-X_{k}\beta _{k})]}$

, where ${\displaystyle V=\sigma _{g}^{2}K+\sigma _{e}^{2}I}$.

This equation is computationally difficult to maximize, because evaluating the likelihood requires computing the inverse of the matrix (${\displaystyle V^{-1}}$), which in turn depends on the values of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$. Optimization methods that maximize this likelihood apply algorithms that update the current estimates of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$ until they converge to high values of the log likelihood function. Each step of an optimization algorithm is referred to as an iteration. In each iteration, the optimization algorithm must evaluate the log likelihood for the current values of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$ and must compute this matrix inverse. A straightforward way to compute a matrix inverse involves a complexity of approximately ${\displaystyle O(n^{3})}$. Unfortunately, this results in a very inefficient algorithm, which long prevented mixed models from being widely utilized in association studies, despite their long history in genetics.

Efficient Mixed Model Association (EMMA) [7] and similar efficient algorithms [8] [9] [10] address this problem by making the estimation of these parameters efficient. Since we first presented EMMA, many other groups have developed similar efficient algorithms. The key idea behind EMMA is that we apply spectral decomposition to the kinship matrix, leading to a much faster optimization algorithm. The spectral decomposition only needs to be computed once and requires a complexity of ${\displaystyle O(n^{3})}$. Specifically, if we write ${\displaystyle K=UDU^{T}}$ where ${\displaystyle U}$ is a matrix of eigenvectors and ${\displaystyle D}$ is a diagonal matrix of eigenvalues, then we can represent ${\displaystyle V}$ using matrix algebra properties as follows:

${\displaystyle V=\sigma _{g}^{2}K+\sigma _{e}^{2}I=\sigma _{g}^{2}UDU^{T}+\sigma _{e}^{2}UIU^{T}=U(\sigma _{g}^{2}D+\sigma _{e}^{2}I)U^{T}}$

We can then compute the quantity ${\displaystyle z=U^{T}(y-X_{k}\beta _{k})}$ for each SNP ${\displaystyle k}$ which has complexity ${\displaystyle O(n^{2})}$. The log likelihood of the data can then be computed using

${\displaystyle l(y,X_{k},\beta _{k},\sigma _{g},\sigma _{e},n)=-{\frac {1}{2}}[n\log(2\pi )+\sum _{i=1}^{n}\log(\sigma _{g}^{2}D_{ii}+\sigma _{e}^{2})+z^{T}(\sigma _{g}^{2}D+\sigma _{e}^{2}I)^{-1}z]}$

, which can be computed in complexity ${\displaystyle O(n)}$ since the matrix inside the likelihood is now diagonal. The inverse can be computed by simply taking the reciprocal of the elements along the diagonal. This procedure results in a very efficient algorithm that is useful for today’s large-scale human genomic datasets.
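The saving from the spectral decomposition can be checked numerically. The sketch below is not the EMMA implementation; it only compares a direct evaluation of the log likelihood, which works with the full covariance matrix at every call, against the diagonal form, using the identity that the log determinant of the covariance matrix equals the sum of the logarithms of its eigenvalues. The data are simulated, the function names are hypothetical, and `resid` stands in for the phenotype minus the fitted SNP effect.

```python
import numpy as np

def loglik_direct(resid, kin, sigma_g2, sigma_e2):
    """Log likelihood using the full matrix V (cubic cost on every call)."""
    n = len(resid)
    v = sigma_g2 * kin + sigma_e2 * np.eye(n)
    _, logdet = np.linalg.slogdet(v)
    quad = resid @ np.linalg.solve(v, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def loglik_spectral(z, eigvals, sigma_g2, sigma_e2):
    """Same value from the eigenvalues of K and the rotated residual z (linear cost)."""
    diag = sigma_g2 * eigvals + sigma_e2  # diagonal of the rotated covariance
    return -0.5 * (len(z) * np.log(2 * np.pi) + np.sum(np.log(diag)) + np.sum(z**2 / diag))

# Hypothetical data: 40 individuals, 300 variants.
rng = np.random.default_rng(3)
n, m = 40, 300
geno = rng.binomial(2, 0.3, size=(n, m)).astype(float)
x = (geno - geno.mean(axis=0)) / geno.std(axis=0)
kin = x @ x.T / m
resid = rng.normal(size=n)

eigvals, eigvecs = np.linalg.eigh(kin)  # spectral decomposition, computed once
z = eigvecs.T @ resid                   # rotation of the residual

# Evaluate both forms for several candidate variance components.
candidates = [(0.2, 1.0), (1.0, 0.5), (2.0, 0.1)]
max_gap = max(
    abs(loglik_direct(resid, kin, g2, e2) - loglik_spectral(z, eigvals, g2, e2))
    for g2, e2 in candidates
)
print(f"largest difference between the two evaluations: {max_gap:.2e}")
```

Because the decomposition is reused, an optimizer can try many values of the variance components while paying only the cheap diagonal evaluation at each iteration.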

We applied EMMA to the same mouse association data analyzed using a standard linear model (see Figure 6). With these computational improvements, we almost completely removed the inflation of false positives while obtaining a nearly uniform p-value distribution for most SNPs (Figure 11). Here, the strongest peak, which is not significant, falls into a region of the genome on chromosome 8, which is known to be associated with body weight. Regions of the genome that correlate with variation in a phenotype are referred to as Quantitative Trait Loci (QTL).

Next, we applied EMMA to other phenotypes from the same mouse strain datasets, including a liver weight phenotype. Here, we see that the inflation of false positives is reduced and a strong signal at chr2 is more pronounced after the correction (Figure 12). EMMA correctly identifies a locus for liver weight that falls into the QTL Lvrq1 (liver weight), which was previously identified using a traditional mouse mapping approach [11].

### Correcting for population structure in human association studies

Figure 7. Different degrees of relatedness in the sample. (a) All of the individuals in a genetic study are somehow related through a large pedigree or family tree. Different parts of the tree induce different types of relatedness. (b) Cryptic relatedness refers to relatively recent genetic relationships. (c) Relatedness due to ancestry refers to relatedness caused by ancestors originating from the same region. The boxes in (b) and (c) represent the level of the pedigree that causes that type of relatedness in each case, respectively.

While mixed models were starting to be used in mouse studies, the problem of relatedness was becoming apparent in human studies, where it complicated the analysis of GWAS. At that time, there was no single approach to handle relatedness. Instead, different types of relatedness were explicitly modeled, and association study methods were adapted to those scenarios. There is an entire class of methods designed to handle relatedness when there are closely related individuals in the genetic study and the genetic relationships are known. These include methods for multigenerational families, twins, and siblings [12] [13].

A complication in human association studies arises when the relationships are unknown. One of the most common types of relatedness among individuals in human studies is due to ancestry. Ancestry refers to the population that an individual descended from. Many individuals are admixed, which means they are descended from ancestors in different populations. If an association study contains individuals from different populations or with differing degrees of admixture, the individuals will have different degrees of relatedness among them. In other words, individuals with the same ancestry are slightly more related to each other than individuals with different ancestries.

It is well documented that these ancestry differences can induce false positive associations [14]. Association studies that analyzed individuals with differences in ancestry typically utilized an approach to predict the ancestry for each individual and then incorporated this information as a covariate in the model [15]. An alternate approach was to estimate principal components over the genotype data, which could be interpreted as a proxy for ancestry and included in the model as covariates [4]. In the human genetics literature, ancestry differences are sometimes referred to as population structure. In this review, we use the term ancestry differences separately from the term population structure; we use the latter to describe the general phenomenon of relatedness in a sample.

A second type of relatedness is cryptic relatedness [16]. Since GWAS are applied to extremely large samples, there are often individuals included in the study who happen to be related, but this relatedness is unknown to both the individuals and the investigators. Typically, cryptic relatedness is handled by computing the genetic similarity between each pair of individuals and screening the association study for related individuals.
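The screening step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular study: the similarity measure (the average product of standardized allele counts), the greedy rule that keeps the first member of each related pair, the cutoff of 0.5, and the names `genetic_similarity` and `filter_related` are all choices made for the example, and the genotypes are made up.

```python
from statistics import mean, pstdev


def genetic_similarity(genotypes):
    """Pairwise similarity from allele counts; genotypes[i][k] is individual i at SNP k."""
    n = len(genotypes)
    standardized = [[] for _ in range(n)]
    for k in range(len(genotypes[0])):
        column = [row[k] for row in genotypes]
        sd = pstdev(column)
        if sd == 0:
            continue  # a SNP with no variation carries no information about similarity
        mu = mean(column)
        for i in range(n):
            standardized[i].append((column[i] - mu) / sd)
    used = len(standardized[0])
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [
        [sum(a * b for a, b in zip(standardized[i], standardized[j])) / used for j in range(n)]
        for i in range(n)
    ]


def filter_related(similarity, threshold):
    """Greedily keep individuals so that no retained pair has similarity above threshold."""
    kept = []
    for i in range(len(similarity)):
        if all(similarity[i][j] <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy genotypes: individuals 0 and 1 are duplicates, so one of them is dropped.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
    [1, 0, 1, 1],
]
similarity = genetic_similarity(genotypes)
kept = filter_related(similarity, threshold=0.5)  # 0.5 is an arbitrary illustrative cutoff
print(kept)
```

With only four SNPs the similarity values themselves are not meaningful; real screens use genome-wide data and a cutoff chosen for the degree of relatedness to be excluded.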

A general purpose approach to correct for population structure, or any type of confounding in association studies, is genomic control [17] [18]. Genomic control allows us to measure the extent to which population structure (or other confounders) is affecting the association statistics. By examining the cumulative p-value distribution plot, we consider the deviation of the actual plot from what is expected at the median. Since we expect the vast majority of variants not to be associated with the trait, we expect the median observed p-value to be close to 0.5. Typically, population structure induces a more significant observed median p-value.

Genomic control computes a correction factor referred to as ${\displaystyle \lambda }$, which is used to scale all of the observed association statistics so that the corrected median p-value is 0.5. The ${\displaystyle \lambda }$ is on the ${\displaystyle \chi ^{2}}$ scale: the median observed p-value is converted to a ${\displaystyle \chi ^{2}}$ value, and ${\displaystyle \lambda }$ is the ratio of that value to the ${\displaystyle \chi ^{2}}$ value corresponding to a p-value of 0.5, which is 0.455 for one degree of freedom. The observed association p-values are converted to ${\displaystyle \chi ^{2}}$ statistics, divided by ${\displaystyle \lambda }$, and then converted back to p-values.
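The conversion between p-values and ${\displaystyle \chi ^{2}}$ statistics can be made concrete with a short sketch. It assumes one-degree-of-freedom association tests, uses only the Python standard library, and the function names and the example p-values are hypothetical. The expected median statistic it computes is approximately 0.455.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def p_to_chi2(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    quantile = _STD_NORMAL.inv_cdf(p / 2.0)  # negative (or zero) normal quantile
    return quantile * quantile


def chi2_to_p(x):
    """Upper-tail probability of a 1-df chi-square statistic x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))


def genomic_control(p_values):
    """Return (lambda, corrected p-values) for a list of observed p-values."""
    chi2 = [p_to_chi2(p) for p in p_values]
    expected_median = p_to_chi2(0.5)  # about 0.455
    lam = median(chi2) / expected_median
    corrected = [chi2_to_p(x / lam) for x in chi2]
    return lam, corrected


# Made-up p-values that are skewed toward significance.
observed = [0.001, 0.02, 0.1, 0.3, 0.6]
lam, corrected = genomic_control(observed)
print(round(lam, 2), [round(p, 3) for p in corrected])
```

Because every statistic is divided by the same factor, the ranking of the variants is unchanged; only the calibration of the p-values moves.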

We can also use the value of the ${\displaystyle \lambda }$ as a measure of the extent of the effect of confounding on the association statistics. Genomic control ${\displaystyle \lambda }$ values are widely utilized to compare different correction approaches. A ${\displaystyle \lambda }$ of 1.0 shows that there is no inflation. A value greater than 1.0 is evidence that the association statistics are inflated. Typically, the 95% confidence interval of the ${\displaystyle \lambda }$ in GWAS is about ±0.02. Thus, any ${\displaystyle \lambda }$ of 1.03 or higher suggests that there is some inflation. We note that more recent exploration of polygenicity, or the number of causal variants for a trait, suggests that there are many more causal variants than originally expected. In this case, the ${\displaystyle \lambda }$ values should actually be higher than 1.0 [19].

In the literature, ancestry differences and cryptic relatedness are referred to as distinct phenomena. In fact, they can be thought of as different degrees of relatedness in the sample. Consider Figure 13a, which shows a potential pedigree relating all of the individuals in an association study sample. Ancestry differences can be thought of as relatedness near the top of the tree (Figure 13b), and cryptic relatedness can be thought of as relatedness in a more recent portion of the tree (Figure 13c).

Mixed models can handle nearly arbitrary genetic relationships between individuals and are a natural approach for human association studies. Mixed models are ideal because they can be applied without explicit identification of the ancestry and relatedness within the sample. They also enable the analysis of datasets with particularly complex genetic relationships, such as isolate populations where the population is descended from a small number of founder individuals [20]. For isolate populations, the previous methods were not able to fully account for population structure.

Mixed models were first used in human studies with the Northern Finnish Birth Cohort [21], where mixed models were applied to 331,475 SNPs in 5,326 individuals who were phenotyped for 10 traits [8]. These traits include C-reactive protein (CRP), triglyceride (TG), insulin plasma levels (INS), diastolic blood pressure (DBP), body mass index (BMI), glucose (GLU), high-density lipoprotein (HDL), systolic blood pressure (SBP), and low-density lipoprotein (LDL). Individuals within this cohort have some ancestry differences due to their origin from different parts of Finland, and they share some genetic relationships.

Table 1 shows the results of applying mixed models to these traits. Each entry in the table shows the ${\displaystyle \lambda }$ value for the analysis of that phenotype. The first column shows the results of the uncorrected analysis. We can see that there are very large ${\displaystyle \lambda }$ factors, particularly for height. In fact, the associations with height were not reported in the original Sabatti et al. (2009) manuscript because the high ${\displaystyle \lambda }$ value suggested that some of the observed associations may be false positives. The second column shows the ${\displaystyle \lambda }$ factors after eliminating cryptically related individuals. Here, we compute the pairwise relationships between individuals and filter out one of any pair that was closely related. This approach filtered out 611 individuals.

The third column shows the ${\displaystyle \lambda }$ factors after using 100 principal components as covariates. This was done to show the limit of the principal component approach in correcting for population structure. Each additional component decreases the ${\displaystyle \lambda }$, but 100 components is far more than is typically utilized in any type of association study. The last column shows the ${\displaystyle \lambda }$ for mixed models. Each of these ${\displaystyle \lambda }$ values is within the 95% confidence interval (around 1.0), suggesting that mixed models can correct for all of the population structure in the sample, including cryptic relatedness and ancestry differences. As shown in Table 1, only mixed models adequately correct for population structure in this sample.

Mixed models have become important in human GWAS analysis, because the estimates of ${\displaystyle \sigma _{g}^{2}}$ and ${\displaystyle \sigma _{e}^{2}}$ can be used to estimate the heritability of the trait. Recent results suggest that common variants explain a larger proportion of the variance of complex traits than previously thought [22] [23] [24].
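The heritability estimate mentioned above can be written out under one common convention. This review does not state the formula, so the following is an assumption: the kinship matrix ${\displaystyle K}$ is scaled so that its diagonal entries average one, and the estimate is the fraction of phenotypic variance attributed to the genetic component:

${\displaystyle {\hat {h}}^{2}={\frac {{\hat {\sigma }}_{g}^{2}}{{\hat {\sigma }}_{g}^{2}+{\hat {\sigma }}_{e}^{2}}}}$

For example, hypothetical estimates of ${\displaystyle {\hat {\sigma }}_{g}^{2}=0.6}$ and ${\displaystyle {\hat {\sigma }}_{e}^{2}=0.4}$ would give ${\displaystyle {\hat {h}}^{2}=0.6}$.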

## Discussion and recent developments

Over the past decade, association studies have identified thousands of variants implicated in dozens of common human diseases. The traditional approach to association studies assumes that individuals are unrelated to each other. However, in practice, individuals in genetic studies are related to each other in complex ways. In this review, we demonstrate how these relationships cause false positives in association studies and how mixed models can correct for these confounding genetic relationships.

This review covers only the basic principles of mixed models and population structure. Since the original EMMA paper in 2008 [7], mixed models have become an active research area. Many groups have published papers exploring various aspects of mixed models and their application to complex genomic problems.

Many approaches have been developed to improve the efficiency of mixed models, including the methods Fast-LMM [25] and GEMMA [26]. More recently, a method called BOLT-LMM [27] was developed for scaling analyses to handle cohorts in the hundreds of thousands of individuals.

Another direction of method development has been extending mixed models to handle case-control studies. These approaches typically assume a liability threshold model in which there is an underlying continuous phenotype: if the phenotype is above a threshold, the individual has the disease, and if it is below the threshold, the individual does not [28]. These types of studies are also complicated by selection bias, because the cases are oversampled from the population. At present, such mixed model extensions to case-control studies result in challenging computational problems [29] [30].
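A liability threshold model can be illustrated with a small simulation. The sketch below is not taken from the cited methods; it assumes a standard normal liability split into independent genetic and environmental parts, and the function names, parameter values, and random seed are hypothetical. It also ignores the oversampling of cases, so it describes a random population sample rather than an ascertained study.

```python
import random
from statistics import NormalDist


def liability_threshold(prevalence):
    """Threshold t such that a standard normal liability exceeds t with probability prevalence."""
    return NormalDist().inv_cdf(1.0 - prevalence)


def simulate_case_status(n, heritability, prevalence, seed=0):
    """Draw n individuals and report True for those whose liability exceeds the threshold."""
    rng = random.Random(seed)
    threshold = liability_threshold(prevalence)
    statuses = []
    for _ in range(n):
        genetic = rng.gauss(0.0, heritability ** 0.5)
        environment = rng.gauss(0.0, (1.0 - heritability) ** 0.5)
        liability = genetic + environment  # total variance is 1 by construction
        statuses.append(liability > threshold)
    return statuses


statuses = simulate_case_status(n=20000, heritability=0.5, prevalence=0.1)
print(sum(statuses) / len(statuses))  # fraction of simulated cases, near the prevalence
```

Only the binary status is observed in a real study, which is one reason fitting a mixed model on the liability scale is computationally harder than fitting one to a continuous phenotype.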

Some mixed model methods were developed in response to a particular bias inherent in standard approaches. For example, a bias is induced when the SNP being tested is also used in computing the kinship matrix [31]. This bias motivated the idea that, when applying mixed models, the kinship matrix should not contain the SNP being tested. As a result, the Leave One Chromosome Out (LOCO) approach constructs a different kinship matrix for testing each chromosome and leaves out the SNPs on the chromosome being tested [32].
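The bookkeeping behind LOCO can be sketched briefly. The example below only selects which SNPs would contribute to each kinship matrix; it does not compute the matrices, and the function name `loco_kinship_snp_sets` and the SNP identifiers are hypothetical.

```python
from collections import defaultdict


def loco_kinship_snp_sets(snp_chromosome):
    """Map each chromosome to the SNPs allowed in the kinship matrix used to test it.

    snp_chromosome: dict mapping a SNP identifier to its chromosome label.
    """
    by_chromosome = defaultdict(list)
    for snp, chromosome in snp_chromosome.items():
        by_chromosome[chromosome].append(snp)
    return {
        tested: [
            snp
            for chromosome, snps in by_chromosome.items()
            if chromosome != tested  # leave out every SNP on the tested chromosome
            for snp in snps
        ]
        for tested in by_chromosome
    }


# Made-up SNP identifiers spread over three chromosomes.
snps = {"rs1": "1", "rs2": "1", "rs3": "2", "rs4": "3"}
kinship_sets = loco_kinship_snp_sets(snps)
print(kinship_sets["1"])
```

Each chromosome therefore needs its own kinship matrix and its own decomposition, which is the computational price of removing the tested SNP from the kinship estimate.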

### Selecting SNPs for kinship

The SNPs used to build the kinship matrix are not necessarily the causal SNPs. Rather, they should capture genetic relatedness and associated environmental factors, and tag the causal SNPs. Related limitations of mixed models are discussed in [33].

Our example, in Figure 9, brings to the surface a key issue related to the application of mixed models in genetic studies. In our example, all of the variants are used to build the kinship matrix, yet only a subset of them are the actual causal variants affecting the trait. The model also makes assumptions about the magnitude of the contribution of each SNP to the trait.

This approach is also motivated by the observation that many complex traits are highly polygenic, suggesting that there are hundreds (if not thousands) of loci that influence some traits [34]. Some traits, such as height, are known to be highly polygenic. In this case, it is not clear what the actual value of ${\displaystyle \lambda }$ should be for a polygenic trait as it is expected to have a contribution from both confounding effects as well as polygenicity. More recently, a method called LD score regression has been developed that attempts to differentiate between these two components [35].

Mixed models are also utilized in genetic studies beyond just the correction for population structure as described in this review. Their uses include:

- accounting for genetic relatedness (as described in this review using the example of classical and wild-derived strains);
- accounting for environmental or other non-genetic factors correlated with genetic background (not currently discussed in this review); see, for example, skill with chopsticks (phenotype) affected by chopstick exposure (environmental exposure), described in [36];
- increasing power by accounting for true genetic effects other than the one being tested (polygenicity).

From their origins in non-human organisms to powering large-scale human genome-wide association studies today, mixed models play an important role in the analysis of genetic data, particularly in correcting for population structure. Improving and extending mixed model approaches is now an active area of research.

## Software

Efficient Mixed Model Association (EMMA) is an efficient algorithm for estimating the variance parameters of the mixed model [37]. EMMA is a statistical test for association mapping in model organisms that corrects for confounding from population structure and genetic relatedness. It takes advantage of the specific nature of the optimization problem in applying mixed models for association mapping, which substantially increases the computational speed and the reliability of the results [37].

Many approaches have been developed to improve the efficiency of mixed models, including the methods Fast-LMM [25] and GEMMA [26]. More recently, a method called BOLT-LMM [27] was developed for scaling analyses to handle cohorts in the hundreds of thousands of individuals.

# References

1. ^ Teo, Y.Y., 2008. Common statistical issues in genome-wide association studies: a review on power, data quality control, genotype calling and population structure. Current opinion in lipidology, 19(2), pp.133-143.
2. ^ Bacanu, S.A., Devlin, B. and Roeder, K., 2000. The power of genomic control. The American Journal of Human Genetics, 66(6), pp.1933-1944.
3. ^ Price, A.L., Zaitlen, N.A., Reich, D. and Patterson, N., 2010. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7), p.459.
4. ^ a b Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A. and Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 38(8), p.904.
5. ^ Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B. and Pethiyagoda, C.L., 2007. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature, 448(7157), pp.1050-1053.
6. ^ Yang, H., Bell, T.A., Churchill, G.A. and de Villena, F.P.M., 2007. On the subspecific origin of the laboratory mouse. Nature genetics, 39(9), pp.1100-1107.
7. ^ a b Kang, H.M., Zaitlen, N.A., Wade, C.M., Kirby, A., Heckerman, D., Daly, M.J. and Eskin, E., 2008. Efficient control of population structure in model organism association mapping. Genetics, 178(3), pp.1709-1723.
8. ^ a b Kang, H.M., Sul, J.H., Service, S.K., Zaitlen, N.A., Kong, S.Y., Freimer, N.B., Sabatti, C. and Eskin, E., 2010. Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4), pp.348-354.
9. ^ Lippert, C., Listgarten, J., Liu, Y., Kadie, C.M., Davidson, R.I. and Heckerman, D., 2011. FaST linear mixed models for genome-wide association studies. Nature methods, 8(10), pp.833-835.
10. ^ Zhou, X. and Stephens, M., 2012. Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7), pp.821-824.
11. ^ Rocha, J.L., Eisen, E.J., Van Vleck, L.D. and Pomp, D., 2004. A large-sample QTL study in mice: I. Growth. Mammalian Genome, 15(2), pp.83-99.
12. ^ Freimer, N. and Sabatti, C., 2004. The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. Nature genetics, 36(10), pp.1045-1051.
13. ^ Van Dongen, J., Slagboom, P.E., Draisma, H.H., Martin, N.G. and Boomsma, D.I., 2012. The continuing value of twin studies in the omics era. Nature Reviews Genetics, 13(9), pp.640-653.
14. ^ Helgason, A., Yngvadóttir, B., Hrafnkelsson, B., Gulcher, J. and Stefánsson, K., 2005. An Icelandic example of the impact of population structure on association studies. Nature genetics, 37(1), pp.90-95.
15. ^ Pritchard, J.K., Stephens, M., Rosenberg, N.A. and Donnelly, P., 2000. Association mapping in structured populations. The American Journal of Human Genetics, 67(1), pp.170-181.
16. ^ Voight, B.F. and Pritchard, J.K., 2005. Confounding from cryptic relatedness in case-control association studies. PLoS Genet, 1(3), p.e32.
17. ^ Devlin, B. and Roeder, K., 1999. Genomic control for association studies. Biometrics, 55(4), pp.997-1004.
18. ^ Bacanu, S.A., Devlin, B. and Roeder, K., 2002. Association studies for quantitative traits in structured populations. Genetic epidemiology, 22(1), pp.78-93.
19. ^ Yang, J., Weedon, M.N., Purcell, S., Lettre, G., Estrada, K., Willer, C.J., Smith, A.V., Ingelsson, E., O'Connell, J.R., Mangino, M. and Mägi, R., 2011. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics, 19(7), pp.807-812.
20. ^ Kenny, E.E., Kim, M., Gusev, A., Lowe, J.K., Salit, J., Smith, J.G., Kovvali, S., Kang, H.M., Newton-Cheh, C., Daly, M.J. and Stoffel, M., 2010. Increased power of mixed models facilitates association mapping of 10 loci for metabolic traits in an isolated population. Human molecular genetics, p.ddq510.
21. ^ Sabatti, C., Service, S.K., Hartikainen, A.L., Pouta, A., Ripatti, S., Brodsky, J., Jones, C.G., Zaitlen, N.A., Varilo, T., Kaakinen, M. and Sovio, U., 2009. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics, 41(1), pp.35-46.
22. ^ Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O'Donovan, M.C., Sullivan, P.F., Sklar, P., Ruderfer, D.M., McQuillin, A., Morris, D.W. and O’Dushlaine, C.T., 2009. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256), pp.748-752.
23. ^ Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W. and Goddard, M.E., 2010. Common SNPs explain a large proportion of the heritability for human height. Nature genetics, 42(7), pp.565-569.
24. ^ Eskin, E., 2015. Discovering genes involved in disease and the mystery of missing heritability. Communications of the ACM, 58(10), pp.80-87.
25. ^ a b Lippert, C., Listgarten, J., Liu, Y., Kadie, C.M., Davidson, R.I. and Heckerman, D., 2011. FaST linear mixed models for genome-wide association studies. Nature methods, 8(10), pp.833-835.
26. ^ a b Zhou, X. and Stephens, M., 2012. Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7), pp.821-824.
27. ^ a b Loh, P.R., Tucker, G., Bulik-Sullivan, B.K., Vilhjalmsson, B.J., Finucane, H.K., Salem, R.M., Chasman, D.I., Ridker, P.M., Neale, B.M., Berger, B. and Patterson, N., 2015. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3), pp.284-290.
28. ^ Zaitlen, N., Lindström, S., Pasaniuc, B., Cornelis, M., Genovese, G., Pollack, S., Barton, A., Bickeböller, H., Bowden, D.W., Eyre, S. and Freedman, B.I., 2012. Informed conditioning on clinical covariates increases power in case-control association studies. PLoS Genet, 8(11), p.e1003032.
29. ^ Hayeck, T.J., Zaitlen, N.A., Loh, P.R., Vilhjalmsson, B., Pollack, S., Gusev, A., Yang, J., Chen, G.B., Goddard, M.E., Visscher, P.M. and Patterson, N., 2015. Mixed model with correction for case-control ascertainment increases association power. The American Journal of Human Genetics, 96(5), pp.720-730.
30. ^ Weissbrod, O., Lippert, C., Geiger, D. and Heckerman, D., 2015. Accurate liability estimation improves power in ascertained case-control studies. Nature methods, 12(4), pp.332-334.
31. ^ Listgarten, J., Lippert, C., Kadie, C.M., Davidson, R.I., Eskin, E. and Heckerman, D., 2012. Improved linear mixed models for genome-wide association studies. Nature methods, 9(6), pp.525-526.
32. ^ a b Yang, J., Zaitlen, N.A., Goddard, M.E., Visscher, P.M. and Price, A.L., 2014. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics, 46(2), pp.100-106.
33. ^ Mathieson, I. and McVean, G., 2012. Differential confounding of rare and common variants in spatially structured populations. Nature genetics, 44(3), pp.243-246.
34. ^ Yang, J., Weedon, M.N., Purcell, S., Lettre, G., Estrada, K., Willer, C.J., Smith, A.V., Ingelsson, E., O'Connell, J.R., Mangino, M. and Mägi, R., 2011. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics, 19(7), pp.807-812.
35. ^ Bulik-Sullivan, B.K., Loh, P.R., Finucane, H.K., Ripke, S., Yang, J., Patterson, N., Daly, M.J., Price, A.L., Neale, B.M. and Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2015. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3), pp.291-295.
36. ^ Vilhjálmsson, B.J. and Nordborg, M., 2013. The nature of confounding in genome-wide association studies. Nature Reviews Genetics, 14(1), p.1.
37. ^ a b Kang, H.M., Zaitlen, N.A., Wade, C.M., Kirby, A., Heckerman, D., Daly, M.J. and Eskin, E., 2008. Efficient control of population structure in model organism association mapping. Genetics, 178(3), pp.1709-1723.
38. ^ Zaitlen, N., Lindström, S., Pasaniuc, B., Cornelis, M., Genovese, G., Pollack, S., Barton, A., Bickeböller, H., Bowden, D.W., Eyre, S. and Freedman, B.I., 2012. Informed conditioning on clinical covariates increases power in case-control association studies. PLoS Genet, 8(11), p.e1003032.
39. ^ Hayeck, T.J., Zaitlen, N.A., Loh, P.R., Vilhjalmsson, B., Pollack, S., Gusev, A., Yang, J., Chen, G.B., Goddard, M.E., Visscher, P.M. and Patterson, N., 2015. Mixed model with correction for case-control ascertainment increases association power. The American Journal of Human Genetics, 96(5), pp.720-730.
40. ^ Weissbrod, O., Lippert, C., Geiger, D. and Heckerman, D., 2015. Accurate liability estimation improves power in ascertained case-control studies. Nature methods, 12(4), pp.332-334.
41. ^ Listgarten, J., Lippert, C., Kadie, C.M., Davidson, R.I., Eskin, E. and Heckerman, D., 2012. Improved linear mixed models for genome-wide association studies. Nature methods, 9(6), pp.525-526.