doi: 10.1111/ahg.12103

Kullback–Leibler Distance Methods for Detecting Disease Association with Rare Variants from Sequencing Data

Asuman S. Turkmen(1,2), Zhifei Yan(1), Yue-Qing Hu(3,*) and Shili Lin(1,*)

(1) Statistics Department, The Ohio State University, Columbus, OH, USA
(2) The Ohio State University, Newark, OH, USA
(3) State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China

Summary

Because next generation sequencing technology can rapidly genotype most genetic variations in the genome, there is considerable interest in investigating the effects of rare variants on complex diseases. In this paper, we propose four Kullback–Leibler distance-based tests (KLTs) for detecting genotypic differences between cases and controls. There are several features that set the proposed tests apart from existing ones. First, by explicitly considering and comparing the distributions of genotypes, the existence of variants with opposite directional effects does not compromise the power of KLTs. Second, it is not necessary to set a threshold for rare variants, as the KL definition makes it reasonable to consider rare and common variants together without worrying about the contribution from one type overshadowing the other. Third, KLTs are robust to null variants thanks to a built-in noise fighting mechanism. Finally, correlation among variants is taken into account implicitly, so the KLTs work well regardless of the underlying LD structure. Through extensive simulations, we demonstrate the good performance of KLTs compared to the sum of squared score test (SSU) and the optimal sequence kernel association test (SKAT-O). Moreover, an application to the Dallas Heart Study data illustrates the feasibility and performance of KLTs in a realistic setting.

Keywords: Effects of rare variants, next generation sequencing data, robustness to noise, robustness to directional effects, Dallas Heart Study

Introduction

Genome-wide association studies (GWASs) have successfully identified hundreds of common variants (CVs) associated with complex traits. However, these identified CVs explain only a small proportion of heritable variability (Maher, 2008). It is believed that rare variants (RVs) play an important role in complex traits (Bodmer & Bonilla, 2008; Gorlov et al., 2008; Schork et al., 2009). Recent advances in next generation



Corresponding author: Shili Lin, Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210-1247. Tel: (614) 292-7404; Fax: (614) 292-2096; E-mail: [email protected]. Yue-Qing Hu, State Key Laboratory of Genetic Engineering, Institute of Biostatistics School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, China. Tel: 86 21 5566 5561; Fax: 86 21 5566 4388; E-mail: [email protected]

© 2015 John Wiley & Sons Ltd/University College London

sequencing technologies have made it feasible to sequence exomes or whole genomes. The availability of sequencing data offers an exciting opportunity to conduct studies to discover complex trait-RV associations. Owing to the extremely low frequency of a RV, single variant-based analysis and many existing tests developed for common variants may not be suitable. Recently published methods show that power for detecting rare risk variants can be greatly enhanced by combining information across variants in a target region when multiple variants influence phenotype (Morgenthaler & Thilly, 2007; Li & Leal, 2008; Morris & Zeggini, 2010). Since these tests seek to assess the overall genetic burden due to rare variants, they are called burden tests. These tests have high power when a large proportion of the rare variants in a region are truly causal and influence the phenotype in the same direction with equal effect sizes. However, such a pooling strategy has its own limitation. If the RVs to be pooled are associated with the trait in different directions,

Annals of Human Genetics (2015) 79,199–208

199

A. S. Turkmen et al.

i.e., some are associated positively while others negatively, the strategy of pooling may weaken or diminish the signal from the associated RVs. Furthermore, if many of the RVs to be pooled are non-causal, i.e., they are not associated with the trait, pooling will inevitably introduce noise into the collapsed genotype and thus reduce statistical power. To address these issues, variance component tests such as the C-alpha test (Neale et al., 2011) and the sequence kernel association test (SKAT) (Wu et al., 2011) were developed to evaluate the association of a genomic region with a trait. Variance component-based approaches aggregate genetic information across the region using a kernel function and employ a computationally efficient variance component test to assess association. Pan (2009) proposed the sum of squared score test (SSU), which can also be regarded as a variance component test. Since SKAT with flat weights is equivalent to the kernel machine regression test, which is related to the SSU, it can be concluded that SKAT under flat weights and the SSU are equivalent. All these tests can be considered non-burden tests, and they are more robust to the inclusion of causal variants with disparate or even opposite effects on the phenotype. However, the non-burden tests can be less powerful than the burden tests if a large proportion of rare variants are associated with the phenotype in the same direction. Because the underlying genetic function of a region is usually unknown, choosing an ideal statistical test (burden or non-burden) in advance is impossible. To develop a powerful test that is also robust to the directions of the effects of rare variants, Lee et al. (2012) proposed an optimal test combining SKAT and the burden tests. This optimal test, referred to as SKAT-O, has been shown to outperform both the burden tests and SKAT in a wide range of scenarios (Lee et al., 2012).
Nevertheless, SKAT-O may be under-powered when there is a large proportion of null variants, the norm in most studies. In this paper, we propose several novel methods based on Kullback–Leibler (KL) distance (Kullback & Leibler, 1951) for testing the effects of rare variants on a disease. The main idea of these methods is the direct comparison of some distributions capturing the essence of the data for cases and controls. The distribution can either be frequencies for each site or frequencies for multi-site genotypes. These methods are not sensitive to different effect sizes or directions, can accommodate both rare and common causal variants, and some possess noise fighting features. An extensive simulation study was carried out to evaluate the proposed methods and to compare them with SSU and SKAT-O, two methods that have been shown to perform well in some simulation studies. An application of the methods to the Dallas Heart Study was also performed. The results from both the simulation and the real application are encouraging since they compare favorably to existing methods.


Materials and Methods

Kullback–Leibler Test 1—KLT1 and aKLT1

Let us consider a data set with N individuals, among which N_A are affected cases and N_U are unaffected controls. We assume that there are M sites of single nucleotide variants within the candidate region, where both rare and common variants may be present. The multi-site genotype for each individual is denoted as a vector G = (g_1, g_2, ..., g_M), with g_m being the number of minor alleles observed at the m-th site, i.e., g_m has value 2 if the site is homozygous for the minor allele, 1 if the site is heterozygous, and 0 if the site is homozygous for the major allele, m = 1, ..., M. We first propose a Kullback–Leibler test, or KLT1, which compares the distributional differences of variant frequencies site by site between cases and controls. Specifically, let X_i = (X_{i1}, ..., X_{iM}), i = 1, ..., N_A, and Y_j = (Y_{j1}, ..., Y_{jM}), j = 1, ..., N_U, denote the multi-site genotype vectors for the i-th individual among the cases and the j-th individual among the controls, respectively, where X_{im} and Y_{jm} are coded as g_m as described above. We use a_m and b_m to denote the number of minor alleles (rare variants) at site m among cases and controls, respectively, that is, for m = 1, ..., M,
$$a_m = \sum_{i=1}^{N_A} X_{im} \quad \text{and} \quad b_m = \sum_{j=1}^{N_U} Y_{jm}.$$

Then we define the adjusted, or normalized, variant frequency at site m among all sites in the cases and controls, respectively, as follows, for m = 1, ..., M:
$$f_m = \frac{a_m + 1}{\sum_{v=1}^{M} (a_v + 1)} \quad \text{and} \quad g_m = \frac{b_m + 1}{\sum_{v=1}^{M} (b_v + 1)},$$
where the constant 1 is added to the counts to prevent the denominator from being 0. Further, {f_m} ≡ {f_m, m = 1, ..., M} is treated as a distribution of “frequencies” for cases over the M sites being considered. Similarly, {g_m} ≡ {g_m, m = 1, ..., M} is the distribution of “frequencies” for controls. Then {f_m} and {g_m} are two discrete distributions defined on the same domain. The KL distance between them is defined as in Kullback & Leibler (1951) and is taken to be the KLT1 test statistic:
$$\mathrm{KLT}_1 = H(\{f_m\}, \{g_m\}) = \frac{1}{2}\left[\sum_{m=1}^{M} f_m \log\frac{f_m}{g_m} + \sum_{m=1}^{M} g_m \log\frac{g_m}{f_m}\right]. \qquad (1)$$
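As a concrete illustration, the KLT1 statistic of equation (1) takes only a few lines to compute. The following is a minimal sketch in plain Python; the function name `klt1` and the list-of-lists genotype representation are our own illustrative choices, not from the paper:

```python
from math import log

def klt1(X, Y):
    """Symmetrized KL distance between the per-site variant "frequency"
    distributions of cases and controls, as in equation (1).

    X, Y: lists of genotype vectors; entry m of each vector is the
    minor-allele count (0, 1, or 2) at site m."""
    M = len(X[0])
    a = [sum(x[m] for x in X) for m in range(M)]   # a_m: minor-allele counts, cases
    b = [sum(y[m] for y in Y) for m in range(M)]   # b_m: minor-allele counts, controls
    f_den = sum(v + 1 for v in a)                  # add-one smoothing keeps every
    g_den = sum(v + 1 for v in b)                  # "frequency" strictly positive
    f = [(v + 1) / f_den for v in a]
    g = [(v + 1) / g_den for v in b]
    return 0.5 * sum(fm * log(fm / gm) + gm * log(gm / fm)
                     for fm, gm in zip(f, g))
```

Identical case and control distributions give a statistic of exactly 0, and any difference gives a strictly positive value.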

One can see that the KL-distance is a non-negative measure of the difference between two distributions. For two distributions that are identical, the KL distance is 0. A larger observed KL value signifies a larger distance between the two distributions. As such, an observed KLT1 value is anticipated to be much smaller under the null hypothesis compared to that


Kullback–Leibler Distance-Based Tests

under the alternative hypothesis. That is, a large observed KLT1 is indicative of association, with the threshold of significance determined by a permutation test in which the case-control status of individuals is permuted. Since the KLT1 statistic gives each site the same weight, if there are a large number of neutral (null) variants in the target region, the neutral variant sites will add noise to the signal and thus reduce the power. In order to improve the power of KLT1 in the presence of a large number of neutral variants, we propose the following two-step procedure, which begins with a data-adaptive screening step to distill the variants. That is, in the first, screening, step, we define a set of potential disease-causing sites (S_C) following Li et al. (2010); the remaining sites are collected in the set S_N. Specifically, let
$$f_{A,m} = \frac{a_m + 1}{2(N_A + 1)} \quad \text{and} \quad f_{U,m} = \frac{b_m + 1}{2(N_U + 1)}, \quad m = 1, \ldots, M.$$
For each site m, if
$$|f_{A,m} - f_{U,m}| > \mu \sqrt{\frac{f_{U,m}(1 - f_{U,m})}{2N_U}}, \qquad (2)$$
then m ∈ S_C; otherwise m ∈ S_N, where μ is a constant taken to be 1.28 following Li et al. (2010). By using such a scheme, we note that the larger the variant frequency at a specific site among controls is, the larger the frequency difference between cases and controls must be for this site to be classified into S_C, as the right-hand side of equation (2) is an increasing function of f_{U,m} (≤ 1/2), maximized at 1/2. After determining the set S_C, in the second step, we first define a “combined variant” by collapsing the genotypes across the variants in S_N. Specifically, we define
$$a_N = \sum_{i=1}^{N_A} I\Big(\sum_{m \in S_N} X_{im} > 0\Big) \quad \text{and} \quad b_N = \sum_{j=1}^{N_U} I\Big(\sum_{m \in S_N} Y_{jm} > 0\Big),$$
where I is the usual indicator function taking the value of 1 or 0. Then we define our adaptive normalized frequency at each variant in S_C and at the “combined variant” as follows. For cases,
$$f_m = \frac{a_m + 1}{\sum_{v \in S_C}(a_v + 1) + a_N + 1}, \; m \in S_C; \qquad f_N = \frac{a_N + 1}{\sum_{v \in S_C}(a_v + 1) + a_N + 1}.$$
For controls,
$$g_m = \frac{b_m + 1}{\sum_{v \in S_C}(b_v + 1) + b_N + 1}, \; m \in S_C; \qquad g_N = \frac{b_N + 1}{\sum_{v \in S_C}(b_v + 1) + b_N + 1}.$$


Then our adaptive KL test, or aKLT1, is defined as
$$\mathrm{aKLT}_1 = H(\{f_N, f_m, m \in S_C\}, \{g_N, g_m, m \in S_C\}).$$
Several variations of aKLT1 may also be considered, including deleting the combined variant altogether or splitting the combined variant into two variants according to the sign of the difference in the case-control frequency. We will elaborate on this in the Discussion section.
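The two-step aKLT1 procedure can be sketched in the same style. This is a minimal illustration of our reading of the definitions above; the function names `screen_sites` and `aklt1` are our own:

```python
from math import log, sqrt

MU = 1.28  # screening constant mu, following Li et al. (2010)

def screen_sites(a, b, n_cases, n_controls, mu=MU):
    """Step 1 of aKLT1: split sites into the candidate set S_C and the
    noise set S_N via the frequency-difference criterion of equation (2).
    a, b: per-site minor-allele counts among cases and controls."""
    S_C, S_N = [], []
    for m, (am, bm) in enumerate(zip(a, b)):
        fA = (am + 1) / (2 * (n_cases + 1))
        fU = (bm + 1) / (2 * (n_controls + 1))
        if abs(fA - fU) > mu * sqrt(fU * (1 - fU) / (2 * n_controls)):
            S_C.append(m)
        else:
            S_N.append(m)
    return S_C, S_N

def aklt1(X, Y):
    """Step 2: KLT1-type distance over the screened sites plus one
    "combined variant" collapsing all sites in S_N."""
    nA, nU, M = len(X), len(Y), len(X[0])
    a = [sum(x[m] for x in X) for m in range(M)]
    b = [sum(y[m] for y in Y) for m in range(M)]
    S_C, S_N = screen_sites(a, b, nA, nU)
    # a_N, b_N: numbers of cases/controls carrying any S_N variant
    aN = sum(1 for x in X if any(x[m] for m in S_N))
    bN = sum(1 for y in Y if any(y[m] for m in S_N))
    fa = [a[m] + 1 for m in S_C] + [aN + 1]
    fb = [b[m] + 1 for m in S_C] + [bN + 1]
    f = [v / sum(fa) for v in fa]
    g = [v / sum(fb) for v in fb]
    return 0.5 * sum(fm * log(fm / gm) + gm * log(gm / fm)
                     for fm, gm in zip(f, g))
```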

Kullback–Leibler Test 2—KLT2 and aKLT2

As defined, both KLT1 and aKLT1 compare the frequencies of single variants between cases and controls site by site. Even with the effort to reduce the noise of null variants, as in aKLT1, there may still be reduced power for detecting association when the proportion of null variants is extremely large. In an effort to further reduce the adverse effect of noise in such a situation, we propose another version of the KL test that considers the joint pattern of the genotype distribution across sites. Recall that the multi-site genotype is denoted by a vector G = (g_1, g_2, ..., g_M). Suppose we have, among all individuals in the sample, K multi-site genotype patterns, denoted as G_1, ..., G_K. Note that, theoretically, K can be a very large number, but due to the small minor allele frequencies at a majority of the sites, it is small enough to manage in practice. We assume that there are N_{A,k} cases and N_{U,k} controls with genotype pattern G_k, k = 1, ..., K. We then define the adjusted frequencies of the multi-site genotype G_k among cases and controls as p_k and q_k, respectively. That is, for k = 1, ..., K,
$$p_k = \frac{N_{A,k} + 1}{N_A + K} \quad \text{and} \quad q_k = \frac{N_{U,k} + 1}{N_U + K}.$$
The KLT2 statistic is defined as
$$\mathrm{KLT}_2 = H(\{p_k, k = 1, \ldots, K\}, \{q_k, k = 1, \ldots, K\}).$$

The observed KLT2 is anticipated to be much smaller under the null hypothesis of no association compared to that under the alternative hypothesis. That is, a large observed KLT2 value is indicative of association. To further improve the power of KLT2 in the presence of a large number of null variants, we follow the two-step protocol prescribed for obtaining aKLT1. Using the same notation as before, in the first step, we discern the set of variants SC and collapse the genotypes in the remaining set of variants SN to obtain a “combined variant.” The joint genotype patterns across the ||SC || + 1 variants and the corresponding frequencies for cases and controls are obtained as before, where ||SC || denotes the cardinality of SC . The second step is the computation of the KLT2 statistic based on these two adaptive


distributions of multi-site frequencies. The resulting statistic is termed aKLT2.
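The pattern-based computation can be sketched with a Counter over multi-site genotype patterns. The function name `klt2` is illustrative; the adaptive aKLT2 would first apply the screening and collapsing step described above:

```python
from math import log
from collections import Counter

def klt2(X, Y):
    """KLT2: symmetrized KL distance between the distributions of
    multi-site genotype patterns among cases X and controls Y."""
    cA = Counter(tuple(g) for g in X)
    cU = Counter(tuple(g) for g in Y)
    patterns = sorted(set(cA) | set(cU))   # the K observed patterns G_1..G_K
    K, nA, nU = len(patterns), len(X), len(Y)
    p = [(cA[pat] + 1) / (nA + K) for pat in patterns]  # add-one smoothing;
    q = [(cU[pat] + 1) / (nU + K) for pat in patterns]  # p and q each sum to 1
    return 0.5 * sum(pk * log(pk / qk) + qk * log(qk / pk)
                     for pk, qk in zip(p, q))
```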

Significance Test

The p-values of the proposed tests are obtained by a permutation test in which the case-control status of individuals is permuted. Without loss of generality, suppose the test statistic is T (which can be KLT1, aKLT1, KLT2, or aKLT2). We first randomly shuffle the disease status of all individuals in the sample, then apply the test statistic to the permuted data to obtain the corresponding statistic T^{(b)}. We repeat the process B times, that is, b = 1, ..., B. The p-value for the statistic T is estimated as
$$p = \frac{\sum_{b=1}^{B} I(T^{(b)} \ge T)}{B}. \qquad (3)$$
For our real data analysis of the Dallas Heart Study, the p-values reported in the next section were calculated in this fashion. On the other hand, for our simulation study, we simulate the data under a model and then repeat the above procedure to obtain a p-value p^{(r)}. This process is then repeated R times, that is, r = 1, 2, ..., R. Then, for a fixed significance level α, we compute
$$\sum_{r=1}^{R} I(p^{(r)} \le \alpha)/R.$$
If the underlying null model is true, that is, none of the variants in the genomic region being tested is associated with the disease, then this quantity is taken as the empirical type I error rate; otherwise, it is reported as the power.
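The permutation scheme of equation (3) is generic in the test statistic. A minimal sketch follows; the `stat` argument can be any function mapping (cases, controls) to a real number, and the name `permutation_pvalue` and the fixed seed are our own illustrative choices:

```python
import random

def permutation_pvalue(stat, cases, controls, B=200, seed=1):
    """Permutation p-value of equation (3): shuffle case-control labels
    B times and count permuted statistics at least as large as the
    observed one.  `stat` is any function of (cases, controls)."""
    observed = stat(cases, controls)
    pooled = list(cases) + list(controls)
    nA = len(cases)
    rng = random.Random(seed)           # fixed seed for reproducibility
    hits = 0
    for _ in range(B):
        rng.shuffle(pooled)             # permute the disease status
        if stat(pooled[:nA], pooled[nA:]) >= observed:
            hits += 1
    return hits / B
```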

Results

To evaluate the proposed methods and to compare their performances with those of representative existing methods, we first carry out a simulation study, where the “answers” are known, under a variety of scenarios of causal mechanisms and minor allele frequencies (MAFs) to gauge the methods thoroughly. After the methods are carefully evaluated, we apply them to detect the association of three genes with the condition of high triglycerides using data from the Dallas Heart Study. We begin with the simulation study.

Simulation Study—Data Generation and Results

Simulation 1

To base our simulations on realistic exome sequence data characteristics, we generated data under various causal mechanisms


and MAFs for the causal and non-causal variants, as summarized in Table 1 for our first set of simulations. In all scenarios, the proportion of non-causal variants (rare and common combined; also referred to as the contamination rate) is either 67% (scenarios 3 and 4) or 75% (the other scenarios). Our settings also consider situations in which the causal variants are all rare (scenarios 6, 8–10) or all common (scenarios 5 and 7). For scenarios 9 and 10, all variants investigated are rare, with the only distinction being that the MAFs of the non-causal variants are in the same range as (scenario 9) or larger than (scenario 10) those of the causal ones. More specifically, we simulated M variants (24 for scenarios 3 and 4, and 32 for the rest of the scenarios) with a sample size of 500 cases and 500 controls. Each variant has an MAF uniformly distributed in the intervals shown in Table 1. Following Basu and Pan (2011), we first generated a latent vector Z = (Z_1, ..., Z_M) from a multivariate normal distribution with mean 0_M = (0, ..., 0)', variances 1_M = (1, ..., 1)', and the following covariance structure: if variants i and j are both causal or both non-causal, then the correlation is Corr(Z_i, Z_j) = ρ^{|i−j|}; otherwise the correlation is zero. We used ρ = 0, 0.5, and 0.9 in the study, corresponding to no LD, mild LD, and strong LD, respectively. We note that there is very limited understanding, and even controversial views, at the moment, of LD structures among rare variants and between rare and common variants. Thus, this simulation scheme, although frequently used, may not realistically reflect the true underlying LD structure. Each Z_i is then mapped to a value between 0 and 1 through the inverse (probability integral) transformation and then dichotomized to 0 (major allele) or 1 (minor allele) depending on the corresponding MAF of the variant. For each individual, simulation of two such Z vectors leads to a vector of genotype data denoted as G = (g_1, ..., g_M).
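Under the stated assumptions, the data-generating mechanism can be sketched as follows. For brevity, this sketch correlates all sites in a single AR(1) block with Corr(Z_i, Z_j) = ρ^|i−j| (the study uses separate blocks for causal and non-causal variants) and applies the logistic disease model of equation (4) given in the following text; all function names are our own illustrative choices:

```python
import random
from math import erf, sqrt, log, exp

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def simulate_haplotype(mafs, rho, rng):
    """One haplotype: AR(1) latent normals with Corr(Z_i, Z_j) = rho^|i-j|,
    each dichotomized to a minor allele iff Phi(Z_m) < MAF_m."""
    z = rng.gauss(0.0, 1.0)
    hap = []
    for m, maf in enumerate(mafs):
        if m > 0:  # AR(1) recursion reproduces the rho^|i-j| correlation
            z = rho * z + sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        hap.append(1 if phi(z) < maf else 0)
    return hap

def simulate_subject(mafs, betas, rho, beta0=-log(4), rng=random):
    """Genotype = sum of two independent haplotypes; disease status
    drawn from the logistic model of equation (4)."""
    g = [h1 + h2 for h1, h2 in zip(simulate_haplotype(mafs, rho, rng),
                                   simulate_haplotype(mafs, rho, rng))]
    eta = beta0 + sum(gj * bj for gj, bj in zip(g, betas))
    d = 1 if rng.random() < 1.0 / (1.0 + exp(-eta)) else 0
    return g, d
```

Under the null (all betas zero), the default beta0 = −log(4) gives a disease probability of 1/(1+4) = 0.2, matching the 20% background prevalence used in the study.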
The disease status D of each subject was generated using the logistic regression model:
$$\mathrm{logit}\, P(D = 1) = \beta_0 + \sum_{j=1}^{M} g_j \beta_j. \qquad (4)$$

In the above model, we used β_0 = −log(4), corresponding to a 20% phenocopy rate, or background disease prevalence. To evaluate the type I error, we set β_1 = ... = β_M = 0. On the other hand, to evaluate power, we employed various coefficients for both common and rare causal variants to incorporate different effect sizes and directions, as described in the following. For scenarios 5–10, the coefficients for rare causal (RC) variants, if they exist, are set to be β = (log(3), log(1/3), log(2), log(2), log(2), log(1/2), log(1/2), log(1/2)),



Table 1  Summary of the four types of variants^a in the 10 scenarios utilized in the simulation study. For each variant, the MAF value was generated uniformly from the range given.

          RC                      RNC                     CC                     CNC
Scenario  MAF range   # of SNVs^b MAF range   # of SNVs   MAF range  # of SNVs   MAF range  # of SNVs
1         0.005–0.01  6           0.005–0.01  8           0.1–0.3    2           0.2–0.5    16
2         0.005–0.01  6           0.01–0.05   16          0.1–0.3    2           0.1–0.3    8
3         0.005–0.01  6           0.01–0.05   8           0.1–0.3    2           0.2–0.5    8
4         0.005–0.01  6           0.005–0.01  8           0.1–0.3    2           0.1–0.3    8
5         NA          0           0.005–0.01  8           0.1–0.3    8           0.2–0.5    16
6         0.005–0.01  8           0.01–0.05   16          NA         0           0.1–0.3    8
7         NA          0           0.005–0.01  8           0.1–0.3    8           0.1–0.3    16
8         0.005–0.01  8           0.005–0.01  16          NA         0           0.1–0.3    8
9         0.005–0.01  8           0.005–0.01  24          NA         0           NA         0
10        0.005–0.01  8           0.01–0.05   24          NA         0           NA         0

^a RC = rare causal; RNC = rare non-causal; CC = common causal; CNC = common non-causal.
^b SNV denotes single nucleotide variant.

while the coefficients for common causal (CC) variants, if they exist, are β = (log(3/2), log(2/3), log(23/20), log(23/20), log(23/20), log(20/23), log(20/23), log(20/23)), portraying smaller effect sizes for common variants. For scenarios 1–4, the effect sizes are β = (log(3), log(1/3), log(2), log(2), log(1/2), log(1/2)) for RC and β = (log(3/2), log(2/3)) for CC. We simulated data under each of the 10 scenarios 1000 times, that is, R = 1000, or 1000 independent replicates. We applied KLT1, aKLT1, SSU, and SKAT-O to each of the simulated data sets. For SKAT-O, we used the default weights set in the R package. For SSU, we used the code provided by the authors (Basu & Pan, 2011). KLT2 and aKLT2 are specifically proposed, and are more appropriate, for situations with a high contamination rate, the focus of our second simulation. Although these tests are still valid here, they are not as powerful as KLT1 and aKLT1 under the settings of this simulation, as we elaborate in the Discussion section. To compute the p-values and powers, we set B = 200 and α = 0.05. The power results under these 10 scenarios for the three LD structures are illustrated in Figure 1. As we can see from the figure, KLT1 is the best, or among the best, for all scenarios and LD structures. Its adaptive version, aKLT1, is not too far behind. On the other hand, SSU is among


the best for scenarios 5, 7, and 9, but is the worst, by a large margin, for the other scenarios. The results are consistent with the observations in Basu & Pan (2011) and Turkmen & Lin (2012), where it was reported that SSU may suffer from power loss when the non-causal variants have higher frequencies than the causal ones (e.g., scenarios 1–3). The results for SKAT-O are more or less the opposite of those for SSU: SKAT-O has very low power under scenarios 5 and 7, where SSU performs well. This is not surprising given that the only causal variants under these two scenarios are all common and SSU was initially proposed for detecting common variants. On the other hand, SKAT-O has good power under scenarios 1, 4, 8, and 9, where SSU underperformed. These observations hold irrespective of the amount of LD. In addition, we simulated under the null model for the same three LD structures, and we report the type I errors for the mild LD (ρ = 0.5) case in Table 2. We also provide type I errors under the other correlation structures (i.e., ρ = 0 and 0.9) in the Supplementary Materials. Type I errors for the various scenarios and LD levels are given in Tables S1 and S2 and Figure S1. As we can see from the tables, the type I error rates are all around the nominal significance level of 0.05, affirming that all procedures have properly controlled, and similar, type I error rates, thus making the discussion of their relative power meaningful.

Simulation 2

Whereas simulation 1 focuses on evaluating the performance of the methods over a wide range of scenarios, this simulation is designed to investigate, in depth, the effect of noise on the relative power of the tests, especially KLT2 and aKLT2. We


[Figure 1 here: Power comparisons among four tests (KLT1, aKLT1, SSU, SKAT-O) under 10 scenarios. (A) ρ = 0; (B) ρ = 0.5; (C) ρ = 0.9.]

Table 2  Type I error rates for four tests and 10 scenarios with ρ = 0.5.

          Scenario number
Method    1      2      3      4      5      6      7      8      9      10
KLT1      0.036  0.056  0.046  0.048  0.034  0.064  0.054  0.050  0.048  0.036
aKLT1     0.056  0.048  0.056  0.036  0.038  0.068  0.068  0.054  0.056  0.038
SSU       0.062  0.056  0.052  0.034  0.040  0.060  0.054  0.056  0.042  0.034
SKAT-O    0.050  0.052  0.048  0.034  0.028  0.062  0.042  0.032  0.052  0.050

focus on considering a popular setting in the literature (Basu & Pan, 2011) in which all causal and non-causal variants are rare, i.e., scenarios 9 and 10, and we set ρ = 0.9. We also consider variations of these two scenarios by increasing the number of non-causal variants, leading to several higher contamination rates. As mentioned earlier, the main difference between these two scenarios is that the MAF range is the same for causal and non-causal variants in scenario 9, while the MAF of non-causal variants is larger in scenario 10. The number of causal variants remains at eight, whereas the number of non-causal variants is set to 24 (scenarios 9 and 10), 32, 45, and 72, corresponding to 75%, 80%, 85%, and 90% contamination rates, respectively, among the variants studied. For scenario 9, since KLT2 and aKLT2 start to outperform the


rest of the tests at the 90% contamination rate, we also evaluated the power for 91% to 94% contamination rates in 1% increments for a zoom-in evaluation. Note that the number of causal variants remains at eight, while we increase the number of non-causal variants to achieve the varying contamination rates. The power of the tests is shown in Figure 2. When the contamination rates are between 75% and 90% (Fig. 2A for scenario 9 and its variations, and Fig. 2C for scenario 10 and its variations), aKLT1, being adapted for noise fighting, outperforms KLT1, which was the winner in the first set of simulations. For KLT2 and aKLT2, the power is relatively flat over the range of contamination rates considered, as expected since they were specifically designed to be insensitive



[Figure 2 here: Power comparisons among six tests (KLT1, aKLT1, KLT2, aKLT2, SSU, SKAT-O). (A) Scenario 9 with contamination rate being 75, 80, 85, and 90%; (B) scenario 9 with higher contamination rates: 91–94% with increments of 1%; (C) scenario 10 with contamination rate being 75, 80, 85, and 90%. In all three comparisons, the data were generated with ρ = 0.9.]

to noise by considering multi-site patterns. As the power of the other methods drops with increasing contamination rate, KLT2 and aKLT2 start to be competitive. For the zoom-in evaluation presented in Figure 2(B), KLT2 and aKLT2 are consistent winners, as expected. Also consistent with the earlier results is that aKLT1 continues to outperform KLT1 at the higher contamination rates. Neither SSU nor SKAT-O is competitive, especially when the contamination rate gets higher. Table 3 reports the type I error rates for scenario 9 under all contamination rates considered. Type I error rates under scenario 10 and figure illustrations of the type I errors are also provided in the Supplementary Materials, Table S3 and Figure S2. As with the investigation in the first set of simulations, the empirical type I error rates are all around the nominal level of 0.05, indicating the appropriateness of the testing procedures.

Application to Data from the Dallas Heart Study

We applied all four methods proposed in this paper, KLT1, aKLT1, KLT2, aKLT2, together with SSU and SKAT-O, to


the sequence data from the Dallas Heart Study (DHS) (Victor et al., 2004; Romeo et al., 2009). DHS was designed as a single-site, multiethnic, population-based study. We focus on testing for association between serum triglyceride (TG) levels and three genes (ANGPTL3, ANGPTL4, and ANGPTL5) that were sequenced in 3476 subjects, including 1832 African Americans, 1043 European Americans, and 601 Hispanics. The amount of TG (or blood fats) in the blood is an important barometer of metabolic health; high levels are associated with coronary heart disease, diabetes, and fatty liver disease. It has been shown in the literature that the genes ANGPTL3, ANGPTL4, and ANGPTL5 are associated with lower plasma levels of TG (Zeng et al., 2003; Romeo et al., 2009; Price et al., 2010; Chen et al., 2012; Mattijssen & Kersten, 2012). Other than ANGPTL3, it has been noted in previous studies that multiple variants with opposite effects (deleterious and protective) exist in ANGPTL4 (Lee et al., 2012) and ANGPTL5 (Romeo et al., 2009; Chen et al., 2012). The data set contains sequence information on 95 variants in these three genes, including missense substitutions, nonsense substitutions, frameshift insertions, and deletions. Among these variants, only one has an MAF greater than 0.05, implying that most of the variants are rare. To


Table 3  Type I error rates for scenario 9 under different contamination rates with ρ = 0.9.

          Contamination rate (%)
Method    75     80     85     90     91     92     93     94
KLT1      0.046  0.066  0.056  0.054  0.054  0.050  0.048  0.032
aKLT1     0.046  0.058  0.054  0.052  0.056  0.050  0.048  0.032
KLT2      0.038  0.064  0.036  0.050  0.036  0.050  0.050  0.042
aKLT2     0.050  0.058  0.064  0.042  0.052  0.036  0.052  0.032
SSU       0.020  0.058  0.036  0.044  0.060  0.036  0.040  0.020
SKAT-O    0.028  0.068  0.044  0.052  0.056  0.046  0.040  0.040

process the data for our analysis, we first removed data from 315 subjects who were being treated with statins or whose statin treatment status was unknown, as in Epstein et al. (2012). To make it possible to replicate the results of Epstein et al. (2012), we consider a dichotomized trait obtained by calling individuals with the top 20% of TG values cases and those with the bottom 20% controls, leading to the TG1 trait with 628 cases and 621 controls. Note that the numbers for the top and bottom 20% are not identical due to ties. By deleting variants that have no sequence variation (all homozygous for the common allele) in all case and control samples, we are left with 21 variants in ANGPTL3, 14 in ANGPTL4, and 15 in ANGPTL5. As an alternative, we characterize case-control samples by dichotomizing TG levels using medically accepted ranges (http://www.nlm.nih.gov/medlineplus/ency/article/003493.htm). Specifically, an individual is classified as a case (high level) if his/her TG level is above 200 mg/dL, whereas a control (normal level) has a TG level below 150 mg/dL. This discretization of the TG values, denoted as trait TG2, results in 373 cases and 2431 controls. Similarly, we deleted variants that have no sequence variation in all case and control samples combined, which resulted in 31 variants in ANGPTL3, 27 in ANGPTL4, and 25 in ANGPTL5.
The results for the clinically oriented dichotomous trait, TG2, are presented in the bottom segment of Table 4. We observe that KLT1 is the only test that leads to a significant conclusion at the 1% level for all three of the genes that are reported to be associated with TG levels in the literature (Romeo et al., 2009).


Discussion

In this paper, we propose four tests to evaluate the associations of rare and common variants with complex diseases. The proposed tests are based on the idea of measuring, via the Kullback–Leibler distance, the differences between distributions that capture significant features of cases and controls. Although we only consider testing one genomic region (or one gene) at a time in the examples presented in this paper, the proposed KL-based approaches, like burden tests, can aggregate variants across multiple genomic regions (or multiple genes), and are thus also applicable to pathway-level analysis. There are several desirable features that set the proposed tests apart from existing ones. First, by explicitly considering and comparing the distributions of genotypes, the existence of variants with opposite directional effects does not compromise the power of KLTs. Second, it is not necessary to set a threshold for rare variants, as the KL definition makes it reasonable to consider rare and common variants together without worrying about the contribution from one type overshadowing the other. Third, KLTs are robust to null variants due to their built-in noise fighting mechanism. Finally, correlation among variants is taken into account implicitly, so the KLTs work well regardless of the underlying LD structure. Through an extensive simulation study in which we consider a total of 10 scenarios with various combinations of types of variants (rare or common), contamination rates (from 67% to 94%), and minor allele frequencies, we evaluated the performances of the tests thoroughly. Compared to the results from SSU and SKAT-O, two methods that were shown to work well in various studies, the performance of the KL-based tests is encouraging.
In particular, KLT1 appears to perform well overall, being either the best or among the best in almost all settings and scenarios. On the other hand, KLT2 and aKLT2 are specifically designed for noise fighting, and our simulation study indeed confirms their superior performances under

© 2015 John Wiley & Sons Ltd/University College London


Table 4 Results from the analysis of the Dallas Heart Study sequence data* (entries are P-values).

Trait  Gene     KLT1      aKLT1     KLT2      aKLT2     SSU       SKAT-O
TG1    ANGPTL3  2.00E-07  2.00E-06  2.00E-08  1.00E-08  8.36E-08  1.08E-07
TG1    ANGPTL4  1.00E-07  1.00E-07  1.50E-06  3.10E-06  0.0005    3.76E-07
TG1    ANGPTL5  0.0676    0.0577    0.0080    0.0050    0.0701    0.0237
TG2    ANGPTL3  1.00E-06  2.00E-06  4.41E-04  7.54E-04  2.03E-05  1.09E-05
TG2    ANGPTL4  0.0048    0.0079    0.0312    0.0321    0.0050    0.0006
TG2    ANGPTL5  0.0078    0.0166    0.2532    0.1824    0.0184    0.0195

* The TG1 trait calls an individual a case (control) if the person is in the top (bottom) 20% of serum triglyceride levels, whereas the TG2 trait calls an individual a case (control) if the person's serum triglyceride level is above 200 mg/dL (below 150 mg/dL).
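The two trait definitions above amount to simple dichotomization rules; they might be coded as follows (our illustration, with hypothetical function names; the percentile cutoff handling for TG1 is one of several reasonable choices):

```python
def tg2_status(tg):
    # TG2: case if serum triglycerides > 200 mg/dL, control if
    # < 150 mg/dL; values in between are excluded (None).
    if tg > 200:
        return "case"
    if tg < 150:
        return "control"
    return None

def tg1_status(tg_values):
    # TG1: case if in the top 20% of the sample, control if in the
    # bottom 20%; the middle 60% is excluded (None).
    ranked = sorted(tg_values)
    lo = ranked[int(0.2 * len(ranked))]   # ~20th percentile
    hi = ranked[int(0.8 * len(ranked))]   # ~80th percentile
    return ["case" if v >= hi else "control" if v <= lo else None
            for v in tg_values]
```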

such scenarios. As such, KLT2 and aKLT2 would be most relevant for situations in which it is suspected, a priori, that most of the variants considered do not confer any effect on the trait. To illustrate how KLT2 (aKLT2) would perform if the proportion of null variants is relatively small, we applied them to the data from Simulation 1. The results are presented in the Supplementary Materials, Figure S3, which indeed shows the low power of these tests under scenarios with a moderate contamination rate.

Permutation tests to determine statistical significance are among the most commonly used statistical tools in modern genomic research. However, it is well known that many permutation-based procedures suffer from high computational costs. To explore the computational aspect of the KL-based methods, we conducted a small simulation study. The results, discussed in Supplementary Materials Section S1 and Table S4, show that the computational times for all methods are quite reasonable (less than one second for most of the scenarios considered). In general, KLT2 takes slightly longer than KLT1, whereas both are a little faster than their respective alternative versions (i.e., aKLT1 and aKLT2).

We also explored whether the KL-based methods may be influenced by gene length, assuming that the number of variants increases as the gene length increases. Specifically, we considered how the power of these tests may change with a varying number of variants while the proportion of causal and non-causal variants is kept constant. The analysis described in Supplementary Materials Section S2 and Figure S4 shows that the KL-based tests are quite robust: the power does not change substantially as the number of variants increases, and in the meantime, the type I errors are well controlled regardless of the number of variants.

Although good properties are achieved with the proposed KLTs, the KL-based methodology could be further improved in several ways.
For instance, instead of collapsing all variants in SN, we may consider not including them at all. Alternatively, one may create two, instead of just one, "combined" variants by collapsing only those with the same sign


of f_A,m − f_U,m. One may also consider not including the combined variant, or only using those with small MAF (e.g., MAF ≤ 0.05) to form a combined variant.

For identifying candidate causal and non-causal variants, a different method may also be considered. For example, there are many well-studied regularization methods for feature selection and extraction, such as LASSO and Elastic Net, as adopted in Turkmen & Lin (2012). Since these variable selection approaches shrink the coefficients of some variants to exactly 0, one may simply retain the variants with non-zero coefficients in the analysis to reduce noise. Such methods can achieve greater objectivity as there is no need to set a threshold for filtering.

We note that the KL-based tests proposed in this paper cannot take population stratification into account directly. However, if information on subpopulations is available, one may consider applying the tests to each subpopulation separately. Alternatively, a weighted sum of the test statistics across the subpopulations may also be considered to derive an overall test statistic. However, further investigation is needed to explore these ideas.

Finally, it would be ideal if (a)KLT1 and (a)KLT2 could be combined to arrive at an even more powerful and robust test. Although (a)KLT1 performs well overall, it is outperformed by (a)KLT2 when most of the variants being considered are null. A weighted average in the spirit of SKAT-O would seem a reasonable solution, but a computationally efficient procedure for assessing significance would be necessary for practical utility. Given the potential benefit, research on how to combine the two statistics is warranted.
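The sign-based collapsing possibility raised above, grouping variants by the sign of f_A,m − f_U,m (estimated here by case minus control minor allele frequencies), could be sketched as follows. This is our own hypothetical illustration, not the paper's code:

```python
def collapse_by_sign(case_geno, control_geno):
    # Split variants by the sign of f_A,m - f_U,m (case minus control
    # minor allele frequency) and collapse each group into a single
    # combined variant recording carrier status.
    # case_geno / control_geno: per-subject lists of minor-allele
    # counts (0/1/2), one entry per variant.
    n_variants = len(case_geno[0])

    def maf(geno, m):
        return sum(row[m] for row in geno) / (2.0 * len(geno))

    diffs = [maf(case_geno, m) - maf(control_geno, m)
             for m in range(n_variants)]
    pos = [m for m, d in enumerate(diffs) if d > 0]
    neg = [m for m, d in enumerate(diffs) if d < 0]

    def carrier(geno, idx):
        # 1 if the subject carries a minor allele at any variant in idx
        return [int(any(row[m] > 0 for m in idx)) for row in geno]

    return {"pos": (carrier(case_geno, pos), carrier(control_geno, pos)),
            "neg": (carrier(case_geno, neg), carrier(control_geno, neg))}
```

Each of the two combined variants could then be entered into the KL-based tests in place of the individual variants it absorbs.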

Acknowledgments

We thank the editor and anonymous reviewers for their constructive and insightful comments that greatly improved the presentation of the manuscript. This work was supported in part by the Shanghai Pujiang Program (12PJ1400400), the National Natural Science Foundation of China (11171075), the National Basic Research Program of China (2012CB316505), the Scientific Research Foundation of Fudan University, and the United States National Science Foundation (DMS-1208968).

References

Basu, S. & Pan, W. (2011) Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 35, 606–619.
Bodmer, W. & Bonilla, C. (2008) Common and rare variants in multi-factorial susceptibility to common diseases. Nat Genet 40, 695–701.
Chen, G., Yuan, A., Zhou, Y., Bentley, A. R., Zhou, J., Chen, W., Shriner, D., Adeyemo, A., & Rotimi, C. N. (2012) Simultaneous analysis of common and rare variants in complex traits: application to SNPs (SCARVAsnp). Bioinform Biol Insights 6, 177–185.
Epstein, M. P., Duncan, R., Jiang, Y., Conneely, K. N., Allen, A. S., & Satten, G. A. (2012) A permutation procedure to correct for confounders in case-control studies. Am J Hum Genet 91, 215–223.
Gorlov, I. P., Gorlova, O. Y., Sunyaev, S. R., Spitz, M. R., & Amos, C. I. (2008) Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet 82, 100–112.
Kullback, S. & Leibler, R. A. (1951) On information and sufficiency. Ann Math Statist 22, 79–86.
Lee, S., Wu, M. C., & Lin, X. (2012) Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775.
Li, B. & Leal, S. M. (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83, 311–321.
Li, Y., Byrnes, A. E., & Li, M. (2010) To identify associations with rare variants, just WHaIT: weighted haplotype and imputation-based tests. Am J Hum Genet 87, 728–735.
Maher, B. (2008) Personal genomes: the case of the missing heritability. Nature 456, 18–21.
Mattijssen, F. & Kersten, S. (2012) Regulation of triglyceride metabolism by angiopoietin-like proteins. BBA Mol Cell Biol Lipids 1821, 782–789.
Morgenthaler, S. & Thilly, W. G. (2007) A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res 615, 28–56.
Morris, A. P. & Zeggini, E. (2010) An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34, 188–193.
Neale, B. M., Rivas, M. A., Voight, B. F., Altshuler, D., Devlin, B., Orho-Melander, M., Kathiresan, S., Purcell, S. M., Roeder, K., & Daly, M. J. (2011) Testing for an unusual distribution of rare variants. PLoS Genet 7, e1001322.
Pan, W. (2009) Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol 33, 497–507.
Price, A. L., Kryukov, G. V., de Bakker, P. I. W., Purcell, S. M., Staples, J., Wei, L., & Sunyaev, S. R. (2010) Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86, 832–838.
Romeo, S., Yin, W., Kozlitina, J., Pennacchio, L. A., Boerwinkle, E., Hobbs, H. H., & Cohen, J. C. (2009) Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J Clin Invest 119, 70–79.
Schork, N. J., Murray, S. S., Frazer, K. A., & Topol, E. J. (2009) Common versus rare allele hypotheses for complex disease. Curr Opin Genet Dev 19, 212–219.
Turkmen, A. S. & Lin, S. (2012) An optimum projection and noise reduction approach for detecting rare and common variants associated with complex diseases. Hum Hered 74, 51–60.
Victor, R. G., Haley, R. W., Willett, D. L., Peshock, R. M., Vaeth, P. C., Leonard, D., Basit, M., Cooper, R. S., Iannacchione, V. G., Visscher, W. A., Staab, J. M., Hobbs, H. H., & Dallas Heart Study Investigators (2004) The Dallas Heart Study: a population-based probability sample for the multidisciplinary study of ethnic differences in cardiovascular health. Am J Cardiol 93, 1473–1480.
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89, 82–93.
Zeng, L., Dai, J., Ying, K., Zhao, E., Jin, W., Ye, Y., Dai, J., Xu, J., Xie, Y., & Mao, Y. (2003) Identification of a novel human angiopoietin-like gene expressed mainly in heart. J Hum Genet 48, 159–162.

Supporting Information

The supplementary materials include additional information on type I error (Tables S1–S3, Figs. S1 and S2), power comparison (Fig. S3), computational time (Table S4), and the effect of gene length (Fig. S4).

Table S1: Type I error rates for four tests and 10 scenarios with ρ = 0.
Table S2: Type I error rates for four tests and 10 scenarios with ρ = 0.9.
Table S3: Type I error rates for scenario 10 under different contamination rates with ρ = 0.9.
Table S4: Computation time (in seconds) for KL-based methods for two different sample sizes (N) and three different numbers of variants.
Figure S1: Power comparisons among four tests under 10 scenarios. (a) ρ = 0; (b) ρ = 0.5; (c) ρ = 0.9.
Figure S2: Power comparisons among six tests. (a) scenario 9 with contamination rates of 75, 80, 85 and 90%; (b) scenario 9 with higher contamination rates: 91–94% with an increment of 1%; (c) scenario 10 with contamination rates of 75, 80, 85 and 90%. In all three comparisons, the data were generated with ρ = 0.9.

Received: 3 June 2014
Accepted: 7 December 2014

