1. Introduction
Genetic association studies aim to detect the association between one or more genetic polymorphisms and a trait. Once an allele of a gene is over-presented in a case population relative to the control, it may be established that such an allele of the gene is associated with the studied disease. In a population-based case-control design, the association can be demonstrated, if it exists, by comparing allele frequencies at the marker locus in random samples of unrelated patients and controls. However, the association between a disease and a genetic marker can arise from confounding by underlying stratification within the population [1] [2]. Population stratification can occur in case-control or other population-based designs. It’s important to make comparisons between cases and controls, as far as possible, within homogeneous subpopulations. Family-based designs have been proposed to counteract confounding due to population stratification.
The best-known of the family-based design is the case-parent triad design. The case-parent triad design and transmission disequilibrium test (TDT) proposed by Spielman (1993) suggested to collect case-parent trios [3]. Alleles or haplotypes transmitted to affected offspring are compared with untransmitted alleles, providing a control sample that is inherently matched to the case sample with regard to population structure. Various extensions of the TDT and computer programs have been developed to deal with a variety of different pedigree structures with nuclear families [4]-[8]. The methods have been widely applied in medical research, even in Genome-wide association studies (GWAS) [9]-[11]. However, current simulations for the statistical power of TDT-type methods are limited to some special cases. For example, the special cases in which there is no recombination between the marker and disease susceptibility locus or there are only case-parents trios. For population-based case-control design, Su et al. (2011) used a resampling approach to simulate a single or multiple nearby disease SNPs on the same chromosome [12]. In this paper, we introduced a simple simulation algorithm to generate general nuclear families rather than trios to simulate multiple tightly linked markers.
2. Method
We consider a triad to contain both parents and an affected child first. Suppose a disease locus with mutant disease allele D and normal allele d, with allele frequencies,
,
. For multiple tightly linked markers, we consider the haplotypes, rather than alleles for each marker. For example, for two SNP (single nucleotide polymorphism) markers with alleles 1 and 2, there will be four haplotypes 11, 12, 21, 22. The haplotype frequencies for the markers are denoted by
for m haplotypes. Linkage disequilibrium (LD) coefficients between disease locus and markers defined in Sham (1995) [4]
,
where
is the frequency of disease-marker haplotype si. All
implies linkage equilibrium. The LD coefficients satisfy
and
The penetrance
is denoted by
for genotype DD, Dd, dd. And then the prevalence
.
If there is no recombination between the disease locus and markers, i.e. recombination rate
, there are 4 disease-marker haplotypes in one case-parents trio, denoted by
,
, where (
), (
) and (
) are haplotype pairs for father, mother and child, denoted by gf, gm, gc respectively. According to Morris et al. (1997) [5], under Hardy-Weinberg equilibrium, the probability of the corresponding haplotype pairs for the trio
Note that
,
, and
.
The parents have disease allele s with probability qs. And the parent has disease-marker haplotype si with probability
, after having disease allele s. The term
is the probability that the parents transmitted disease alleles
(s from father, u from mother) to the affected child when they have alleles
. Therefore, when the parents have genotype a/b and c/d
, the probabilities that they transmitted alleles (a, c), (a, d), (b, c), or (b, d) to the affected child are proportional to
,
,
,
.
If there is no recombination, consider the unaffected-parents trio. Similarly, we can get
The term
is the probability that the parents transmit disease alleles
(s from father, u from mother) to the unaffected child when they have alleles
. Therefore, when the parents have genotype a/b and c/d
, the probabilities that they transmitted alleles (a, c), (a, d), (b, c), or (b, d) to an unaffected child are proportional to
,
,
,
.
However, if recombination is considered, there are 4 situations for the affected or unaffected child. Let (
) and (
) be haplotype pairs for father and mother, where the disease alleles s and u are transmitted to the affected child. Let
be the recombination rate, it’s easy to see that
. Then, the disease-marker haplotype pair for the child will be (
), (
), (
), or (
) with probability
,
,
, or
, respectively.
3. Simulating Nuclear Family Data
After specifying the parameters, including the number of nuclear families N, disease allele frequencies
, penetrance
, marker allele frequencies
(
,
), recombination rate θ between disease locus and markers, LD between disease locus and markers,
,
,
.
We simulate the nuclear families using the following procedure:
Step 1. Generate the paternal and maternal genotypes a/b and c/d
) with probability
and
.
Step 2. Transmit disease alleles (a, c), (a, d), (b, c), or (b, d) to the affected child with probabilities proportional to
,
,
,
, and to unaffected child with probabilities proportional to
,
,
,
. The alleles transmitted to affected or unaffected children are denoted by (s, u). The alleles untransmitted to affected or unaffected children are denoted by (t, v).
Step 3. Combing with the generated disease alleles s, t, u and v from step 2, generate marker haplotypes i, j, k and l (
with probability
,
,
and
respectively. Assign marker haplotype pairs (i, j) for the father, (k, l) for the mother.
Step 4. Generate haplotype pair (si, uk), (sj, uk), (si, ul), or (sj, ul) for each child with probability
,
,
,
, respectively. Assign the corresponding marker haplotype pair (i, k), (j, k), (i, l), or (j, l) to the affected or unaffected child.
Step 5. Output: the genotypes of the markers for the family members, based on the haplotype pairs from step 3 and step 4.
In step 5, we need the haplotype information to construct the genotypes of the markers. For example, if there are three biallelic SNP markers with allele 1 and 2, then there might be 8 haplotype 111, 112, 121, 122, 211, 212, 221, 222, numbered with 1, 2, …, 8. Therefore, when the haplotype pairs 2/5 generated from step 3 which represents 112/211, that means the genotypes are 1/2, 1/1, 2/1 for the three markers respectively.
In addition, we can use the simulator to generate a population with markers satisfying Hardy-Weinberg equilibrium. If we specify all which LD coefficients
(
;
), and recombination rate
, then
,
and hence
, i.e., the genotypes of parents at the markers follow HW equilibrium. In this case, the parents from the simulators can be regarded as a sample from a population with HW equilibrium at the markers.
4. Results
To evaluate our algorithm, two extended TDT methods, based on conditional likelihood and full likelihood, respectively, are selected to demonstrate how our simulator will preserve the test statistic distribution under the null hypothesis. Based on conditional likelihood, Clayton (1999) [6] developed a score test with the program TRANSMIT. Based on full likelihood, we presented a likelihood ratio test to compare haplotype frequencies in transmitted and non-transmitted groups and derive relative risks [8]. It can be regarded as a generalized haplotype-relative-risk method, denoted by GHRR here. Both of them can deal with multiple tightly linked markers and uncertain haplotype phases for multi-locus genotypes.
In our simulations, one disease locus with alleles D and d, and two tightly linked SNP markers genotype data for 100 case-parent trios are generated. The four marker haplotypes are assumed to be equally frequent
. For a general heredity mode, we specify
,
,
,
. Under the simulated condition, 500 replicated samples were generated to assess the distribution of the statistics under the null hypothesis in which
(
;
). For each replicated sample, TRANSMIT and GHRR were applied respectively to obtain the values of the test statistic.
When the null hypothesis is true, both of the test statistics are chi-square distributed with degree of freedom 3. The quantile-quantile (QQ) plots for test statistics from 500 replicated samples under the null hypothesis are shown in Figure 1. The plots show that the scatters of the observed quantiles and the expected quantiles of distribution
are tightly close to the line y = x. That means that the empirical distributions of the test statistics coincide with the expected distribution under the null hypothesis.
Figure 1. QQ plots for test statistics from TRANSMIT (left) and GHRR (right).
5. Conclusion
The TDT test for case-parents design measures the over-transmission of an allele in the affected from the heterozygous parents, thus avoiding problems of population stratification. TDT has been extended to various nuclear family designs. We introduced a simple simulation algorithm based on a transition model to generate general nuclear families with multiple tightly linked markers. To deal with multiple markers, we generate haplotypes first and output the genotypes for each marker at the last step. The simulations show that the empirical distributions of the test statistics coincide with the expected distribution under the null hypothesis.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 82373433), Guangdong Basic and Applied Basic Research Foundation (2020B1515310007), Guangdong Province Key Laboratory of Computational Science, Sun Yat-sen University (2020B1212060032).