Genetic Epidemiology

RESEARCH ARTICLE

Pathway-Based Association Study of Multiple Candidate Genes and Multiple Traits Using Structural Equation Models Hela Romdhani,1 Heungsun Hwang,2 Gilles Paradis,1,3,4 Marie-Helene Roy-Gagnon,5 and Aurelie Labbe1,6,7 ∗ 1

Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada; 2 Department of Psychology, McGill University, Montreal, Quebec, Canada; 3 Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada; 4 Institut ´ National de Sante´ Publique du Quebec, Montreal, Quebec, Canada; 5 Department of Epidemiology and Community Medicine, University of Ottawa, 6 Ottawa, Ontario, Canada; Douglas Mental Health University Institute, McGill University, Montreal, Quebec, Canada; 7 Department of Psychiatry, McGill University, Montreal, Quebec, Canada

Received 6 June 2014; Revised 5 November 2014; accepted revised manuscript 5 November 2014. Published online 30 December 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/gepi.21872

ABSTRACT: There is increasing interest in the joint analysis of multiple genetic variants from multiple genes and multiple correlated quantitative traits in association studies. The classical approach involves testing univariate associations between genotypes and phenotypes and correcting for multiple testing that results in loss of power to detect associations. In this paper, we propose modeling complex relationships between genetic variants in candidate genes and measured correlated traits using structural equation models (SEM), taking advantage of prior knowledge on clinical and genetic pathways. We adopt generalized structured component analysis (GSCA) as an approach to SEM and develop a single association test between multiple genetic variants in a gene and a set of correlated traits, taking into account all available data from other genes and other traits. The performance of this test is investigated by simulations. We apply the proposed method to the Quebec Child and Adolescent Health and Social Survey (1999) data to investigate genetic associations with cardiovascular disease-related traits. Genet Epidemiol 39:101–113, 2015. © 2014 Wiley Periodicals, Inc.

KEY WORDS: generalized structured component analysis; prior biological knowledge; clinical pathway; genetic pathway; permutation test

Introduction There is strong evidence that genetic mechanisms account for a large part of the etiology of many complex disorders, such as cardiovascular diseases (CVD), cancers, schizophrenia, autism, and others. Although genome scans have identified a number of candidate regions of interest for several complex diseases, most of these successful studies have explained only a small proportion of the disease heritability through the identified loci. Statistical approaches that take into account the complexity of disease etiology are therefore of interest to understand the genetics of complex diseases. Classical genome-wide association studies (GWAS) test for the association between one single-nucleotide polymorphism (SNP) and one trait at a time and correct for multiple testing. This is known as the univariate approach. Recently, some methods have been developed to simultaneously analyze multiple correlated traits in population-based genomewide association analysis. MultiPhen [O’Reilly et al., 2012] performs an ordinal regression of a SNP on multiple phenotypes; PCHAT [Klei et al., 2012] adapts the Principal Component of Heritability approach [Ott and Rabinowitz, ∗

Correspondence to: Aurelie Labbe, 1020 Pine Avenue West, Montreal, QC H3A

1A2, Canada. E-mail: [email protected]

1966] to the population design; [Schifano et al., 2013] propose a scaled marginal model for testing and estimating the common effect of a SNP on multiple secondary phenotypes in case–control studies; TATES [van der Sluis et al., 2013] combines multiple P-values to obtain one overall trait-based P-value. Many gene-based tests of association have also been proposed. In [Huang et al., 2011], the authors propose a gene-wide significance (GWiS) test that uses greedy Bayesian model selection to identify the independent effects within a gene. They also review and compare their method to many other gene-based methods all of them implemented in a software FAST [Chanda et al., 2013]. All these methods are either performed one SNP at a time (methods designed for multiple phenotypes) or performed on a single trait (gene-based tests). Furthermore, most of them rely on the normality assumption of the measured traits, which is not appropriate for analyzing quantitative data with different scales or from different sources. Thus, such approaches are of limited use if the ultimate goal is the understanding of the paths leading from genetic variants to phenotypic variation. For many diseases, a substantial amount of information has already been gathered on biological pathways involved in disease etiology. We propose modeling complex relationships between genetic variants in multiple candidate genes  C 2014 WILEY PERIODICALS, INC.

and measured correlated traits by using the prior knowledge on clinical/molecular pathways (for traits) and genetic pathways (for genotypes). Indeed, analyzing genomic data through functional pathways offers the potential of greater power for discovery of connections to biological mechanisms. In fact, small but coordinated changes in sets of functionally related measures (i.e., pathways) can have significant overall effects. Therefore, extending the focus of the analysis from genes to pathways, thus incorporating more complexity into the analysis, can address at least in part the challenges presented by pleiotropy and heterogeneity. Note that the term pathway has been used in very broad contexts in the literature. In this paper, pathways are defined, for genomic data, as any latent functional structure within which data are correlated (e.g., genes or regions). For traits, we use the word pathway to represent a latent biological process relating traits to each other. Over several years, dozens of different methods have been proposed to summarize the significance of a biological pathway from a collection of SNPs and to adjust for multiple testing at the pathway level. Some of the related issues have been discussed and reviewed in [Wang et al., 2010] and [Khatri et al., 2012]. Although several methods are available in the literature for pathway analysis using prior information, none of these methods can fully address pleiotropy and heterogeneity of genetic effects while taking into account the correlation between the phenotypes and between the genetic variants. Furthermore, current methods rely heavily on strong distributional assumptions. We believe that structural equation models (SEM) provide the ideal solution to this complex problem. Over the past several decades, SEM , also called path models with latent variables, have been employed for the specification and testing of complex, path-analytic relationships between observed variables and underlying theoretical constructs, often called latent variables. SEM have become a remarkably popular method in social, natural, and health sciences for many reasons including the analytic flexibility and generality imparted by the procedure. In our context, genotyped polymorphisms and measured traits (or phenotypes) are observed variables and genetic pathways (genes or genomic regions) and clinical (or molecular) pathways are latent variables. A structural equation model consists of two submodels: measurement and structural models. The former specifies the relationships between observed and latent variables, whereas the latter expresses relationships between latent variables. Traditionally, two statistical approaches have been proposed for SEM: covariance structure analysis (CSA) [Bock and Bargmann, 1966; J¨oreskog, 1970] and partial least squares path modeling (PLSPM) [Esposito Vinzi et al., 2010; Lohm¨oller, 1989]. While CSA is, typically, based on the normality assumption and often results in improper solutions (e.g., negative variance estimates or correlations greater than one in absolute value), PLSPM does not rely on a stringent distributional assumption and avoids improper solutions because latent variables are treated as weighted composites of observed variables. Nonetheless, PLSPM estimation does not involve a global optimization criterion that is consistently optimized to estimate parameters. As a result, PLSPM

102

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

cannot provide measures of overall model fit that is not ideal for model validation. Generalized structured component analysis (GSCA) was recently introduced as another approach to SEM in the psychometric literature [Hwang and Takane, 2004]. As in PLSPM, GSCA treats latent variables as weighted composites of observed variables. Moreover GSCA employs a global leastsquares optimization criterion for parameter estimation and thus permits the calculation of measures of overall model fit. Consequently, GSCA becomes appealing to researchers and practitioners for the following reasons: (i) GSCA does not require the multivariate normality assumption of observed variables; (ii) GSCA is free from improper solutions that tend to occur with high frequency in practice; (iii) it enables the provision of overall model-fit measures for model comparison and (iv) GSCA has been extended to accommodate more advanced analyses that may be of great interest to a wide cadre of researchers, such as regularized estimation [Hwang, 2009], analyses of higher-order latent variables [Hwang and Takane, 2004] and multiple-group comparison [Hwang and Takane, 2004]. Despite its significant technical and empirical advantages, GSCA has mainly been used in psychology and other social sciences and remains novel to researchers in genetic studies. In this paper, GSCA is adopted for the first time as a statistical framework for the joint analysis of multiple correlated traits and genomic measures. Note that the proposed approach is set within a hypothesis-driven context and is not a pathway exploratory analysis. However, tuning such models to meet the specific needs of clinicians and researchers requires the development of a specific statistical test for the association between genetic and clinical pathways. The development of such a test constitutes the main methodological contribution of this paper, as well as the development of an R package (ASGSCA, available in Bioconductor), specifically designed for the analysis of genomic data. The performance of the proposed method is assessed by simulations under various scenarios. It is then applied on the Quebec Child and Adolescent Health and Social Survey (QCAHS, 1999) data to model and test associations between candidate genes and CVD-related traits, taking into account the prior biological knowledge.

Methods GSCA Framework Consider the case where J candidate SNPs (X 1 , . . . , X J ) and K phenotypes (X J +1 ,. . . , X J +K ) are observed. Let I = J + K denote the total number of observed variables. Suppose that the J SNPs are mapped to G different genes or regions (γ1 , . . . , γG ) and the K phenotypes are involved in T different clinical pathways (γG +1 , . . . , γG +T ). The G genes and T clinical pathways are unobserved latent variables. Figure 1A illustrates an example with J = 5 observed SNP genotypes X 1 , . . . , X 5 and K = 3 observed phenotypes X 6 , X 7 , X 8 . The two latent variables γ1 and γ2 represent a gene and a clinical pathway, respectively.

Figure 1. Examples of path models. Square boxes are used to represent observed variables (here SNPs and traits) and circles are used to represent latent variables (here genes and clinical pathways).

GSCA defines latent variables as weighted composites of observed variables  w i X i ,  = 1, . . . , L (1) γ = i∈S 

where S  denotes the set of indices of the observed variables mapped to the th latent structure, w i denotes the weight associated with the observed variable X i in the definition of the latent variable  and L = G + T is the total number of latent variables. Equation (1) is called the measurement model. The effect of a gene is thus modeled as a weighted additive effect of SNPs, and similarly a clinical pathway is defined as a weighted sum of the underlying phenotypes. In the example of Figure 1A, the set S 1 = {1, . . . , 5} includes the indices of the SNPs mapped to the gene γ1 ; and S 2 = {6, 7, 8} corresponds to the indices of the phenotypes involved in the clinical pathway γ2 . The relationships between latent variables are established given prior biological knowledge such that γ =

L 

b  γ +  ,  = 1, . . . , L

(2)

 =1, =

where  represents the error term and b  represents the path coefficient linking γ to γ . Equation (2) is called the structural model. In Figure 1A, b12 is the path coefficient linking the gene variable to the clinical pathway while the path coefficient b21 linking the clinical pathway to the gene variable is set to 0 that corresponds to no directed arrow from the clinical pathway to the gene in the graph. To be concise, no error terms for latent variables are displayed in this figure as is the case for all the other figures representing SEM in

this paper. Figure 1B provides an example of a more complex network, where SNPs are mapped to two genes, and traits are mapped to two clinical pathways. Note that a trait can map to several pathways and genes can be associated with more than one clinical pathway. Also, genes can be replaced by more general functional regions of the genome. In the framework above, GSCA provides a tool for the joint analysis of all genotypes and phenotypes and allows for estimation of the model parameters using a global least squares optimization criterion. To minimize this criterion, an alternating least-squares (ALS) algorithm was developed [Hwang and Takane, 2004]. The ALS algorithm alternates two main steps until convergence. In the first step, the weight coefficients w i , i = 1, . . . , I,  = 1, . . . , L are fixed, and the path coefficients b  , ,  = 1, . . . , L ,  =  are updated in the least-squares sense. In the second step, the weights w i are updated in the least-squares sense for fixed path coefficients b   . Path models help distinguish between direct and indirect relationships between variables. Direct effects go directly from one variable to another. Indirect effects are those characterizing a relationship between two variables mediated by one or more latent variables. The effect of a direct path from a gene to a clinical pathway is represented by the corresponding path coefficient. This coefficient quantifies the joint effect of all SNPs mapped to the gene on all multiple phenotypes mapped to the clinical pathway. The indirect effect of a SNP on a clinical pathway is the product of the coefficients along the path relating them. In the example illustrated in Figure 1A, b12 quantifies the joint effect of X 1 , . . . , X 5 on the phenotypes X 6 , . . . , X 8 together and the indirect effect of SNP X 1 on clinical pathway γ2 is given by w 11 × b12 . Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

103

In the literature, few attempts have been made to model gene, environment and traits relationship using SEM [Nock et al., 2007; Nock and Zhang, 2010], but the proposed approaches can be applied only to a single trait. In addition, the two papers above, which are technically based on CSA, assume that SNPs are “consequences” of genes (in SEM terminology, SNPs are reflective indicators); whereas our approach considers SNPs to be determinants of genes (i.e., SNPs are formative indicators). Although no clear theoretical evidence seem to support either of these specifications, we believe that it is more sensible to assume that SNPs are formative indicators. To make GSCA useful for clinicians and researchers, we need to develop a specific statistical test for the association between genetic and clinical pathways. Within the GSCA framework, a bootstrap method has been used to estimate the standard errors of the model parameters’ estimators. However, inference on these parameters based on the bootstrap standard errors estimates requires assumptions on the distributions of the used estimators. Whereas basing inference on bootstrap confidence intervals does not require such a distributional assumption, it may be misleading when a parameter estimate changes its sign from one bootstrap sample to another, although the meaning of the estimate remains the same. This will likely lead the bootstrap confidence intervals of the path coefficient to include zero, indicating that this coefficient is not significant, no matter how large it is. The sign change may occur for various reasons, for example, when sample size is small and/or the number of observed variables is large, or when multicollinearity is present in observed variables and/or latent variables. To safeguard against such a potential issue of the bootstrap method, two-sided permutation tests can be considered for testing the statistical significance of the parameters. These tests are not affected by the potential sign changes of parameter estimates because they use the absolute values of the model parameters as test statistics. Moreover, they do not rely on any distributional assumption. A permutation test procedure for the gene effect on a clinical pathway, or more generally the joint effect of multiple SNPs on multiple phenotypes, is described in the following section.

Hypotheses Test Procedure SEMs allow one to represent and measure different directed effects between observed and latent variables as well as between latent variables. Consider the model described in equations (1) and (2). Based on the GSCA estimation method, suppose one wishes to test for the null hypothesis  H0 , : b  = 0 of no effect of gene γ on clinical pathway γ . We propose performing a permutation test as follows: SNP genotypes mapping to the gene tested are permuted by randomly reassigning the subject’s ID to genotypes such that any association between the gene and the phenotypes is destroyed. Note that it is important to assign the same ID permutation to all the SNPs tested in order to not disrupt the linkage disequilibrium (LD) pattern between them.

104

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

The permutation distribution is then obtained by calculating, for each permuted dataset, the estimate of the path coefficient bˆ  . This is an approximation of the distribution of the test statistic bˆ   under the null hypothesis. The permutation P-value is estimated by the proportion of permuted datasets for which the path coefficient bˆ   is larger, in absolute value, than its absolute value computed on the original dataset. It is worth noting that no multiple testing correction is needed for these hypotheses tests since a single model is estimated from all the available data. The performance of this GSCA-based test is investigated by simulations in the Results section.

Results Simulation Study We conducted a simulation study to assess the performance of the proposed test procedure. The nine scenarios considered (Fig. 2) involve one or two genes and one or two clinical pathways. In each scenario, one or two SNPs are mapped to each gene and two or three correlated traits are related to each clinical pathway. Scenarios are ordered with respect to the complexity of the underlying association structure, i.e., in each new scenario a new latent variable or a new association is added compared to the previous scenario. First, genotypes were simulated using HAPGEN2 [Su et al., 2011]. Based on a reference panel of known haplotypes and an estimate of the fine-scale recombination rate across the region, HAPGEN2 simulates datasets over large regions. The simulated data has the same linkage disequilibrium patterns as the reference data. The August 2009 release of phased data from the 1000 Genomes Project containing reference haplotype data, legend files and recombination rates were downloaded from the IMPUTE website https://mathgen.stats.ox.ac.uk/impute/ impute_v1.html. The genotypes were then coded according to the additive model, i.e., a SNP’s genotype value is equal to the number of minor alleles. In this study, we chose to simulate specific SNPs of interest from the QCAHS study discussed in the next section. The list of simulated SNPs in each scenario is given in Table 1. The traits were simulated in a way to obtain the desired association structure described in each scenario. Note that traits were not simulated according to the GSCA models (1) and (2). Rather, we generated them following a multivariate linear regression model (and thus according to a multivariate normal distribution) allowing us to control for the associations between SNPs and traits by fixing the values of the heritability coefficients (see Appendix A). The correlation r between traits within a clinical pathway was set to 0.3 or 0.6 in scenarios (a) to (c) and 0.3 in scenarios (d) to (i). Finally, for each scenario, we simulated 1,000 replications and 1,000 subjects. Simulated heritability coefficients for each scenario are presented in Table 2. Scenario (a) involves one SNP X 1 from gene γ1 and two phenotypes X 2 and X 3 involved in a clinical pathway γ2 (Fig. 2A). In this scenario, we have a single SNP in the gene so w 11 = 1. To test the effect of the gene on the clinical pathway, i.e., on the two traits jointly, we used the GSCA-based

Figure 2. Scenarios considered in the simulation study.

Table 1.

Genes and SNPs simulated in each scenario

Gene

SNP

rs code

Chromosome

MAF

LD

Scenarios

eNOS

T786C Glu298Asp

rs2070744 rs1799983

7

0.375 0.401

0.345

(a)–(i) (b)–(i)

b2-AR

Gly16Arg Gln27Glu

rs1042713 rs1042714

5

0.235 0.380

0.527

(c)–(i) (c)–(i)

permutation test as well as the MultiPhen method [O’Reilly et al., 2012] and compared their performances. Figure B1 in Appendix B shows the empirical power functions for the two methods GSCA and MultiPhen in scenario (a). We can see that the two tests have roughly equal powers

with a slight uniform advantage for the proposed GSCAbased procedure. Both tests are powerful enough to detect small effects corresponding to small heritabilities. Unsurprisingly, the results also show that a stronger correlation between the traits reduces the power of the test. In scenario (b), we investigate the case where two traits are associated to the same gene with different strength. The first part of Table A1 reports the probability of rejecting the null hypothesis for different values of h 3,1 , h 4,1, and correlation r. If the null hypothesis of no association is true, i.e., the corresponding heritability is zero, this probability must be close to the level of the test. Under the alternative hypothesis, it is an estimate of the power. As one can see, the test respects the Type I level of 5%. Furthermore, we note that including a Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

105

Table 2.

Heritability coefficients controlled in each scenario and the simulated values

Scenario

Heritability coefficients

(a)

The heritability coefficients of the SNP on the two traits are equal.

(b)

h 3,1 = heritability of trait X 3 with respect to the SNP selected from gene γ1 . h 4,1 = heritability of trait X 4 with respect to the SNP selected from gene γ1 .

(c)

h .,1 = heritability of traits X 5 and X 6 with respect to the SNP selected from gene γ1 . h .,2 = heritability of traits X 5 and X 6 with respect to the SNP selected from gene γ2 .

(d)

h .,1 = heritability of traits X 5 and X 6 with respect to the SNP selected from gene γ1 . h .,2 = heritability of traits X 7 and X 8 with respect to the SNP selected from gene γ2 .

(e)

h .,1 = heritability of traits X 5 , X 6 , and X 7 with respect to the SNP selected from gene γ1 . h .,2 = heritability of traits X 7 and X 8 with respect to the SNP selected from gene γ2 .

(f)

h .,1 = heritability of traits X 5 and X 6 with respect to the SNPs selected from gene γ1 . h .,2 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNP selected from gene γ2 .

(g)

h .,1 = heritability of traits X 5 , X 6 , and X 7 with respect to the SNPs selected from gene γ1 . h .,2 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNP selected from gene γ2 .

(h)

h .,1 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNPs selected from gene γ1 . h .,2 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNPs selected from gene γ2 .

(i)

h .,1 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNPs selected from gene γ1 . h .,2 = heritability of traits X 5 , X 6 , X 7 , and X 8 with respect to the SNPs selected from gene γ2 .

nonassociated trait in the analysis slightly reduces the power of the test. For instance, for h 3,1 = h 4,1 = 0.01 the power is equal to 0.896 while for h 3,1 = 0 and h 4,1 = 0.01 the obtained power is 0.756. Nonetheless, the power of the test remains satisfactory. The power also decreases for stronger correlation between the phenotypes except when one of the traits is not associated with the gene. Scenario (c) illustrates the case where two genes are associated with different degrees to the same clinical pathway. Results presented in the second part of Table A1 show that the proposed test is able to distinguish genes with null and moderate effects in a scenario involving more than one gene. Again, the level of the test is respected and it is powerful enough to detect small heritabilities. As one can notice, adding a nonassociated gene to the model slightly decreases the power of detecting the other (associated) gene. Furthermore, the test on a gene is more powerful when the other gene has a larger heritability coefficient. For instance, for h .,1 = 0.7%, gene γ1 is detected as associated to the clinical pathway with probabilities of almost 80%, 83%, and 84% when h .,2 takes the values 0.7%, 1%, and 2%, respectively. Finally, the power also decreases for stronger correlation between the phenotypes. In scenario (d), we compare the performance of our approach in two situations: a correctly specified model and a misspecified model. In the latter, we specify additional associations between Gene 1 (γ1 ) and clinical pathway 4 (γ4 ) and between Gene 2 (γ2 ) and clinical pathway 1 (γ3 ), as indicated by the dashed lines on Figure 2D. When the model is wrongly specified, in addition to the significance of b13 and b14 , we test for the association between Gene 1 and clinical pathway 2 and between Gene 2 and clinical pathway 1 that are, in reality, null. In the third part of Table A1, we report the results for the two situations. For this scenario, when the model is correctly specified, a nonassociated gene does not affect the power of detecting the other associated gene. This is due to the fact that the true model contains two independent submodels, each one involving one gene and one clinical pathway. In this case, the

106

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

Simulated values [0, 0.04]

{0, 0.007, 0.01, 0.3}

GSCA estimates of the path and weight coefficients are the same when the two submodels are estimated together (as in our scenario) or separately. In the misspecified model case, our test recognizes the wrongly added associations and rejects them while keeping the same power for detecting the true associations as if the model was correctly specified. Finally, in scenarios (e)–(i), we wanted to assess the effect of the complexity of the model on the performance of the test. Starting from scenario (d), we progressively added new associations between genes and pathways or a connection between phenotype X 7 and clinical pathway γ3 . The results are reported in Table A2. Comparing the results of scenarios (e), (g), and (i) with those of (d), (f), and (h), respectively, we see that the additional connection between X 7 and γ3 increases the power for detecting the gene(s) associated with γ3 . Then, the more phenotypes involved in the clinical pathway are considered in the model, the more powerful our test is. Furthermore, the comparison of the results obtained for scenarios (d), (f), and (h) shows that an additional connection between a gene, say gene γ2 , and a clinical pathway, say γ3 , increases the power to detect the other genes associated with pathway γ3 , but decreases the power of detecting the association between gene γ2 and pathway γ4 . For example, when we move from scenario (d) to scenario (f), for h .,1 = h .,2 = 1%, the probability of detecting the association between γ1 and γ3 increases from 0.885 to 0.940, while the probability of detecting the association between γ2 and γ4 decreases from 0.907 to 0.880. Another set of simulations was conducted to compare the GSCA-based procedure to another approach consisting of adjusting a multivariate linear model in which the phenotypes are the responses and the genetic variants are the independent variables, then using Pillai’s test to test a particular subset of regression parameters. The model, the test and the simulation results for scenarios (g) and (i) are detailed in Appendix C. Note that the multivariate linear model is based on the normality assumption for the traits that is not the case

for GSCA. The results in Tables D1 and D2 in Appendix C show that the GSCA-based test is more powerful than Pillai’s test except in a few cases in scenario (g) (corresponding to h 1 = 0.007 and h 2 = 0.007, 0.01) where the maximum relative loss of power is 2%. However the relative gain in power (GSCA-based test relatively to Pillai’s test) can be as high as 26.77% in scenario (g) (for h1 = 0.007, h2 = 0.03, and r = 0.6). In scenario (i), GSCA-based test always performs better and the relative gain in power can be as high as 20.18% (for h1 = 0.007, h2 = 0.03, and r = 0.6). Comparing the results obtained for scenarios (g) and (i) (and also other scenarios for which the results are not reported here), we can conclude that the more complex the model studied is, the more relative power GSCA-based test gains with respect to Pillai’s test. On the other hand, although both tests lose power when the correlation between traits of the same clinical pathway increases, the gain in power using GSCA-based test relatively to Pillai’s test also increases. Finally, it is worth noting that the traits where simulated according to a multivariate normal distribution. For nonnormally distributed traits the method based on the multivariate linear model would lose much power that is not the case for the GSCA-based test. In fact, GSCA and the permutation test proposed in this paper do not rely on any distributional assumption.

variables (25 genes and three clinical pathways). Figure A1 shows a diagram illustrating the path model we fitted to the data, along with the weight and path coefficients estimated using GSCA. In Table A4 we report the significant associations detected using the GSCA-based test. We recall that no multiple testing correction is needed here as one single model is fitted to the data. To compare our results with the classical approach, we also tested for univariate associations between the available genotypes and phenotypes and used a Bonferroni correction for the 41 × 8 performed tests. The associations detected using this approach can be found in Table A5. The GSCA-based approach allowed us to discover many genetic associations to CVD-related phenotypes that were not discovered by the univariate approach. This latter detected only few variants from genes CETP, APOE, HL, and PCSK9 as significantly associated with one or more of three traits involved in Energy and/or Lipid metabolisms. By applying the GSCA-based test, we detected 5, 2, and 1 gene(s) significantly associated with Lipid metabolism, Energy metabolism, and Blood pressure control, respectively, which is twice the number of associations detected with the univariate approach. However, the association between gene HL and the Lipid metabolism was not detected as significant with the GSCAbased approach, which gave a P-value of 0.08.

Analysis of the Quebec Child and Adolescent Health and Social Survey (QCAHS) Data The QCAHS was designed to characterize cardiovascular disease (CVD) risk factors in a representative sample of Quebec youth and targeted all youth aged 9, 13, and 16, attending public or private schools in Quebec, in 1999. Detailed descriptions of the QCAHS design and methods can be found in [Paradis et al., 2003]. DNA is available on 1, 707 French Canadian participants of the QCAHS study (860 boys and 847 girls). Among the traits measured in QCAHS, 8 are of interest in our study. A list of these traits can be found in Table A3. We use the z-score transformations of these traits to standardize for age and sex. These traits are grouped into three pathways: lipid metabolism, energy metabolism, and blood pressure control. Available genotypic data on 35 variants within 25 genes are listed in Table D3 in Appendix D along with biological pathways within which they fall. Among the considered genetic variants, 33 are SNPs that were coded according to the additive model. The other two are polymorphisms with more than two alleles. The first, with three alleles, belongs to the gene APOE. It is coded using five indicator variables APOE1-APOE5. The second is a variant from gene PCSK9, also with three alleles but only four different genotypes are observed in our dataset. It was coded using three indicator variables PCSK9Leu1-PCSK9Leu3. Our new approach provides a tool to investigate complex relationships between genetic variants in these candidate genes and the CVD-related traits, taking advantage of prior knowledge on CVD-related pathways available in the literature. Our model involves in total 49 observed variables (41 genotype variables and eight phenotypes) and 28 latent

Discussion In this paper, we developed a statistical framework for the joint analysis of multiple correlated traits and multiple genomic measures from one or more candidate functional regions in genetic studies. Our approach is based on an adaptation of GSCA to genetic association studies. This allowed us, within a single model, to use prior biological information about clinical and genomic pathways, to model pleiotropic effects and genetic heterogeneity and to account for the correlation between measures within pathways. We also constructed a hypothesis test procedure for the association between a genetic region and multiple correlated traits represented by a clinical pathway. GSCA is an approach to structural equation modeling based on the minimization of a global least squares criterion. The solution to this optimization problem is calculated subject to var (γ ) = 1,  = 1, . . . , L . Because of this scale standardization, however, the weight coefficients suffer from lack of interpretability. The scales of the coefficient estimates depend on the number of observed variables involved in the same latent variable and are, consequently, only comparable within the same pathway. In the QCAHS data example, the weight associated with the diastolic blood pressure (DBP) trait in the definition of the blood pressure (BP) control latent variable and the weight associated with the low-density-lipoprotein (LDL) trait in the definition of the Energy metabolism are equal to 0.815 and 0.604, respectively. This does not necessarily mean that DBP contributes to BP control more than LDL does to Energy metabolism. Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

107

The association test proposed in this paper relies on a permutation procedure. It was stated in section Methods that, in order to keep the LD patterns between the variants of the tested gene unchanged, the same ID permutation sets are assigned to all of them. But this procedure disrupts LD patterns between these SNPs and those from the other genes and, consequently, may not generate the correct null distribution. This would introduce some bias to the P-value estimates. Nevertheless, correlation between genotypes belonging to different genetic regions is generally very weak (otherwise, they should be grouped in one region), so the bias is negligible. GSCA has been extended to accommodate more advanced analyses that may be of great interest to a wide range of researchers. For instance, a ridge-type of regularization has been incorporated to the GSCA estimation [Hwang, 2009] to address multicollinearity problems. In our framework, this may occur due to high correlations among measured traits or variants mapped to the same gene. Additional simulations, not presented in this paper, showed that using the regularized version instead of the original GSCA does not substantially improve the power of the gene-clinical pathway association test. Furthermore, the analysis of the QCAHS data using the regularized GSCA led to exactly the same results. A possible explanation is that the correlation between the SNPs considered in the study and the correlation between the traits studied are not large enough to affect the performance of the standard GSCA. GSCA also allows for multiple-group comparison by fitting the same model to more than one group of individuals simultaneously [Hwang and Takane, 2004]. If one has data on individuals belonging to different populations or under two or more environmental exposures, it is possible to estimate the model parameters for different groups simultaneously. This makes it feasible to examine hypotheses on parameters associated with different groups. However, additional research is needed to develop a test procedure to compare such association structures in different groups, especially as the generation of the null distribution under this setting is not straightforward. In the analysis of the QCAHS data, we used z-scores for the considered traits to adjust for age and sex. A possible alternative is to use raw data and include covariates in the model, by adding one latent variable for each of them. If for example correlated covariates of environmental risk factors are available, they can be grouped in one latent structure that could be called Exposure. The covariates-associated latent variables are then related to the model’s clinical pathways as another source of variation, in addition to the genetic factor. It is also possible to specify some sort of gene–gene interaction by adding some path coefficients between the interacting genes. However, it is important to note that this strategy does not model a usual statistical interaction, defined as the product of two variables. The results of the QCAHS data analysis showed that, by considering complex relationships involving many traits and genotypes, we detect a greater number of associations. But it also showed that we may lose some associations as was the case for gene HL. Furthermore, gene Adiponectin (ADIPO),

108

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

known to be associated with lipid traits, was only detected to be associated with Energy metabolism, which shares three traits with Lipid metabolism. Then, a model selection strategy needs to be developed in order to draw conclusions about the relationship between all the variables. This model selection procedure can be constructed based on the overall fit measures provided by GSCA [Hwang and Takane, 2004; Hwang et al., 2007]. These fit measures allow the evaluation of the adequacy of the model and make it possible to compare different models. Such a model selection procedure could also be needed when clinicians do not have a clear idea about the disease pathway, and thus, cannot provide a path model to be fitted to the data. In this case, it is possible to select candidate genes by performing univariate tests in a genome-wide level, as a first stage. Another option maybe to use a recent extension of GSCA [Hwang and Takane, 2014, Chapter 9] that combines the LASSO selection method with GSCA . This new GSCA framework estimates weights or paths coefficients of noncontributing SNPs or genes to be exactly zero. Implementation of this method in the context of genetic studies is subject to future research. Even though this paper focused on the use of GSCA to model relationships between genetic variants and correlated clinical traits, the proposed method can be applied to data from different sources. Indeed, an application of our approach can also include other genomic data, such as gene expression and DNA methylation, and also other traits, such as molecular markers or brain imaging data. Integrating data from different sources in the same model may imply that latent variables are nested within each other. In this case, genotype variables are correlated within the “Genotype” latent variable, methylation data are correlated within the “Methylation” latent variable and these two latent components are nested within a latent gene or region. These types of models, called higher order components models, can be handled by GSCA exactly in the same way as ordinary models [Hwang and Takane, 2004]. The extension of the permutation test procedure to this framework is straightforward. Finally, note that our method was not initially designed to handle rare variants genotype data. However, it would probably work if genes include some (but not only) rare variants. This option is beyond the scope of this paper, and no simulation study was conducted to assess the performance of our test in the presence of rare variants. Note however that the proposed approach does not require any assumption on the observed variables (traits and SNPs), since GSCA does not rely on these distributions. This remains true even if the model includes rare variants genotypes. GSCA estimation procedure and the GSCA-based association test have been implemented in an R-package, ASGSCA (Association Study using GSCA), available on Bioconductor.

Appendix A In this appendix, we present the model we used to simulate a set of correlated traits associated to an already simulated set of SNPs, in order to obtain the desired association structure corresponding to each scenario in Figure 2. Note that our

Figure A1. Path model for the QCAHS data. Weights and path coefficients are included.

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

109

Table A1.

Results for scenarios (b), (c), and (d)

Scenarios (b)

Scenarios (c) H01,2

h 3,1 0 0 0 0 0.007 0.007 0.007 0.01 0.01 0.03

Scenarios (d)

r = 0.3

r = 0.6

Correctly specified

Misspecified

h 4,1

r = 0.3

r = 0.6

h .,1

h .,2

H01,3

H02,3

H01,3

H02,3

h .,1

h .,2

H01,3

H02,4

H01,3

H02,4

H01,4

H02,3

0 0.007 0.01 0.03 0.007 0.01 0.03 0.01 0.03 0.03

0.050 0.606 0.756 0.999 0.758 0.851 0.997 0.896 1 1

0.061 0.768 0.910 1 0.668 0.763 0.998 0.809 0.998 1

0 0 0 0 0.007 0.007 0.007 0.01 0.01 0.03

0 0.007 0.01 0.03 0.007 0.01 0.03 0.01 0.03 0.03

0.048 0.045 0.048 0.040 0.796 0.829 0.837 0.943 0.946 1

0.051 0.748 0.911 1 0.794 0.937 1 0.940 1 1

0.049 0.059 0.045 0.045 0.697 0.704 0.790 0.876 0.911 1

0.054 0.657 0.849 1 0.732 0.876 1 0.873 1 1

0 0 0 0 0.007 0.007 0.007 0.01 0.01 0.03

0 0.007 0.01 0.03 0.007 0.01 0.03 0.01 0.03 0.03

0.054 0.054 0.047 0.051 0.771 0.776 0.763 0.885 0.916 1

0.048 0.749 0.923 1 0.763 0.905 1 0.907 1 1

0.053 0.051 0.046 0.045 0.765 0.775 0.766 0.884 0.909 1

0.048 0.752 0.923 1 0.756 0.903 1 0.909 1 1

0.061 0.053 0.054 0.063 0.039 0.021 0.024 0.036 0.022 0.012

0.051 0.031 0.021 0.013 0.036 0.028 0.012 0.026 0.023 0.026



Probability of rejecting the null hypothesis H0, of no effect of gene γ on clinical pathway γ . In scenario (b), h 3,1 = and h 4,1 = denote, respectively, the heritabilities of traits X 3 and X 4 with respect to the SNP selected from gene 1. In scenarios (c) and (d), h .,1 denotes the heritability of traits X 5 and X 6 with respect to the SNP selected from gene 1, while h .,2 denotes the heritability of traits X 5 and X 6 in scenario (c) and the traits X 7 and X 8 in scenario(d), with respect to the SNP selected from gene 2. The correlation between each two traits from the same clinical pathway is denoted r.

Table A2.

Results for scenarios (e)–(i) Scenario (e)

h .,1 0 0 0 0 0.007 0.007 0.007 0.01 0.01 0.03

Scenario (f)

Scenario (g)

Scenario (h)

Scenario (i)

h .,2

H01,3

H02,4

H01,3

H02,3

H02,4

H01,3

H02,3

H02,4

H01,3

H01,4

H02,3

H02,4

H01,3

H01,4

H02,3

H02,4

0 0.007 0.01 0.03 0.007 0.01 0.03 0.01 0.03 0.03

0.046 0.055 0.055 0.045 0.796 0.797 0.790 0.931 0.924 1

0.047 0.744 0.890 1 0.760 0.902 1 0.909 1 1

0.060 0.059 0.059 0.052 0.793 0.820 0.859 0.940 0.949 1

0.046 0.754 0.882 1 0.786 0.924 1 0.923 1 1

0.050 0.715 0.883 1 0.735 0.888 1 0.880 1 1

0.053 0.048 0.038 0.059 0.848 0.868 0.925 0.960 0.978 1

0.045 0.781 0.929 1 0.845 0.950 1 0.964 1 1

0.056 0.720 0.915 1 0.771 0.876 1 0.895 1 1

0.061 0.048 0.045 0.040 0.779 0.790 0.828 0.938 0.931 1

0.038 0.053 0.051 0.043 0.764 0.771 0.826 0.910 0.932 1

0.053 0.736 0.878 1 0.766 0.902 1 0.914 1 1

0.049 0.716 0.897 1 0.785 0.908 1 0.929 1 1

0.040 0.047 0.055 0.048 0.868 0.868 0.906 0.961 0.990 1

0.049 0.056 0.050 0.053 0.796 0.802 0.813 0.923 0.957 1

0.053 0.820 0.932 1 0.848 0.947 1 0.974 1 1

0.042 0.762 0.890 1 0.796 0.906 1 0.924 1 1



Probability of rejecting the null hypothesis H0, of no effect of gene γ on clinical pathway γ . The coefficient h .,1 denotes the heritability of the traits mapped to the clinical pathway(s) related to gene γ1 (clinical pathway γ3 in scenarios (e)–(g) and both pathways γ3 and γ4 in scenarios (h) and (i)) with respect to the SNPs selected from gene γ1 . The coefficient h .,2 denotes the heritability of the traits mapped to the clinical pathways related to gene γ2 (clinical pathway γ4 in scenario (e) and both pathways γ3 and γ4 in scenarios (f)–(i)). The correlation between each two traits from the same clinical pathway is equal to 0.3.

Table A3.

Traits of interest available in QCAHS

Lipid pathway only

Both Lipid and Energy pathways

Energy pathway only

High-density lipoprotein (HDL)

Low-densitylipoprotein (LDL) Apolipoprotein B (APOB) Triglycerides (TG)

Fasting glucose

Fasting insulin

Blood pressure (BP) control pathway Systolic blood pressure (SBP) Diastolic blood pressure (DBP)

model is a generalization of the one used in [O’Reilly et al., 2012] to generate a simple scenario of one single SNP and two traits, which corresponds to our scenario (a). A set of SNPs from G different genes are simulated using HAPGEN2 for N individuals. We then select one SNP from each gene. Let X 1 , . . . , X G denote the selected SNPs and p 1 , . . . , p G their respective minor allele frequencies. For each individual, K continuous traits are simulated following a multivariate normal distribution by controlling the heritability of each trait on each of the G SNPs as well as controlling for the correlation between traits. The traits are simulated for each subject i = 1, . . . , N following the model below:

110

Genetic Epidemiology, Vol. 39, No. 2, 101–113, 2015

Table A4. approach

Results for QCAHS data analysis with the GSCA-based

Clinical pathways Gene

Lipid

CETP ApoE LPL PON2 PCSK9 Adiponectin TNFa AGT

Energy

Blood pressure

X X X X X X X X

A significant association (at level 5%) between a gene and a clinical pathway is indicated by “X.”

Y1i = α11 X 1i + α12 X 2i + . . . + α1G X G i + β11 E 1i , Y2i = α21 X 1i + α22 X 2i + . . . + α2G X G i + β21 E 1i + β22 E 2i , .. . YK i = αK 1 X 1i + αK 2 X 2i + . . . + αK G X G i + βK 1 E 1i + βK 2 E 2i +

. . . + βK ,K –1 E K –1,i + i ,

Table A5. Associations detected by the univariate test at level of 5% with a Bonferroni correction Lipid only

Lipid and energy

Gene

Variant

HDL

LDL

CETP APOE

TaqIB APOE1 APOE2 APOE5 C514T R46L

X

X X

HL PCSK9

the equations (3) and (4) conditional on Var(Yki ) = 1, k = 1, . . . , K . The covariances between pairs of SNPs can be estimated on the simulated sample of SNPs. The solution is obtained by following these steps :

APOB

 X X

X X

A significant genotype-phenotype association is indicated by “X.”

where E ki ∼ N (0, 1) can be interpreted as a common factor affecting the subset of traits (Yk , . . . , YK ), for k = 1, . . . , K – 1. These variables make it possible to control for the dependance structure among the traits, and thus to construct the clinical pathways by controlling the different correlation coefficients between pairs of traits via the coefficients βkg (for k = 1, . . . , K and g = 1, . . . , G ). The variable  ∼ N (0, σ2 ) is a random effect. The heritability h k,g of phenotype k on the SNP selected from gene g is defined as the proportion of the variance of Yk explained by X g . From the model above, the variances of the traits are given by

1. β11 = 1 –

G 

h 1,g – 2

g =1

G  g

Pathway-based association study of multiple candidate genes and multiple traits using structural equation models.

There is increasing interest in the joint analysis of multiple genetic variants from multiple genes and multiple correlated quantitative traits in ass...
3MB Sizes 0 Downloads 7 Views