STATISTICAL CORNER
From the Weight Control and Diabetes Research Center (J.M.M.), Brown Medical School and The Miriam Hospital, Providence, RI; Georgia Prevention Institute (H.S., Y.D.) and Department of Pediatrics, Medical College of Georgia, Augusta, GA; Twin Research and Genetic Epidemiology Unit (H.S.), St. Thomas Hospital, London, UK; and the Department of Biological Psychology (E.D.G.), Vrije Universiteit, Amsterdam, The Netherlands.
Address correspondence and reprint requests to Jeanne M. McCaffery, Weight Control and Diabetes Research Center, 196 Richmond Street, Providence, RI. E-mail: Jeanne_McCaffery{at}brown.edu
Key Words: statistics, genetics, twin studies
Abbreviations: MZ = monozygotic; DZ = dizygotic; SES = socioeconomic status; SEM = structural equation modeling; SBP = systolic blood pressure; HWE = Hardy-Weinberg equilibrium; SNP = single nucleotide polymorphism; VNTR = varying number of tandem repeats; LD = linkage disequilibrium; TDT = transmission disequilibrium test.
INTRODUCTION
We aim to review two common research designs in genetics. The starting point of genetic research on any risk factor is the establishment of significant heritability. The twin study has been the workhorse of such heritability estimation, and we will start by reviewing its principles. Because most researchers in this field are expected to use candidate gene association approaches, the largest part of this paper will consider the statistical methods for this type of association. Throughout, we have based this paper on the valuable experience gained during two workshops for "starters in the field" at the American Psychosomatic Society (30) and the Society for Psychophysiological Research (31). Although we expect that some statistical approaches may be familiar to the readers of Psychosomatic Medicine, some genetic terminology may not be. A glossary of genetic terms is available at http://www.genome.gov/glossary.cfm.
Twin Studies
Perhaps one of the most robust clinical observations in psychiatry and cardiology is that disease tends to "run in the family". However, familial resemblance for a trait cannot automatically be attributed to genes. In family studies, the genetic relatedness is confounded with the shared environment of the family members. This includes potentially important sources of interindividual variance like culture, socioeconomic status (SES), neighborhood, school, sports club, peers, family diet, and parental rearing style and attitudes. A unique experiment of nature has provided the solution to separating genetic and shared environmental influences: the existence of monozygotic (MZ) and dizygotic (DZ) twins.
Because MZ twins reared together share part of their environment and 100% of their genes (32) except for some rare exceptions, any resemblance between them is attributed to these two sources of covariance. The extent to which MZ twins do not resemble each other is ascribed to so-called unique or nonshared environmental factors like differential jobs or lifestyle, accidents or other life events, and in childhood, differential treatment by the parents, and nonshared peers. Unique environment also includes measurement error. Resemblance between DZ twins reared together is ascribed to the sharing of both environment and genes. DZ twins share on average 50% of their segregating genes; any resemblance between them attributable to genetic influences will be less than for MZ pairs. The extent to which DZ twins do not resemble each other is due to unique environmental factors and nonshared genetic influences.
Based on molecular genetic theory, we can further divide the genetic variance into two separate parts: a) additive and b) dominant genetic variance. Genetic effects at a single locus are called additive when the effect of one parental allele is added to the effect of the other parental allele. Genetic effects are called dominant when they deviate from purely additive effects, e.g., when the two alleles of the locus interact. The total additive and dominance variance estimated in twin studies reflects the additive and dominant effects summed over all contributing loci. The total variance in any trait can arise from the four components identified above: a) unique environmental factors ("E"), b) shared or common environmental factors ("C"), c) additive ("A") genetic factors, and d) dominant ("D") genetic factors. For simplicity, we will first consider the case where there is no interaction or correlation among these four components. The value of a trait is then defined as $P = A + D + C + E$, where P is a quantitative trait; A and D are the effects of additive and dominant genetic factors; and C and E are the effects of common and unique environmental factors (with E also including the residual variance due to measurement error). The variance (V) in trait P then becomes $V_P = V_A + V_D + V_C + V_E$, and the MZ and DZ twin covariances become $\mathrm{Cov}(MZ) = V_A + V_D + V_C$ and $\mathrm{Cov}(DZ) = 0.50V_A + 0.25V_D + V_C$, respectively (33,34).
From the pattern of MZ and DZ twin correlations, we can obtain a first crude estimate of these variance components. However, we cannot estimate common environmental influences and dominant genetic influences at the same time. Therefore, we first test for evidence of dominance, which would yield MZ correlations that are much larger than twice the DZ correlation (e.g., $r_{MZ} = 0.42$, $r_{DZ} = 0.10$). If there is no evidence for dominance, the contribution of additive genetic influences to the total variance in a trait can be estimated as twice the difference between the MZ and DZ correlations ($V_A/V_P = 2(r_{MZ} - r_{DZ})$). For instance, typical MZ and DZ correlations for resting systolic blood pressure (SBP) are 0.52 and 0.26 (17); therefore, the percentage of SBP variance explained by the additive genetic influences is estimated at 52%. An estimate of the proportional contribution of the shared environmental influences to the total phenotypic variance is given by subtracting the MZ correlation from twice the DZ correlation ($V_C/V_P = 2r_{DZ} - r_{MZ}$). The proportional contribution of the unique environmental influences can be obtained by subtracting the MZ correlation from unit correlation ($V_E/V_P = 1 - r_{MZ}$). If, for instance, the MZ correlation for exercise behavior of adolescents is 0.8 and the DZ correlation is 0.6, estimates of the relative contribution of $V_A$, $V_C$, and $V_E$ to total variance are 40%, 40%, and 20%, respectively (35). If there is evidence for genetic dominance (i.e., the MZ correlation is larger than twice the DZ correlation), the estimate for the proportional contribution of additive genetic influences changes to $V_A/V_P = 4r_{DZ} - r_{MZ}$. An estimate of the proportional contribution of the dominant genetic influences is then obtained by subtracting four times the DZ correlation from twice the MZ correlation ($V_D/V_P = 2r_{MZ} - 4r_{DZ}$).
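The rules of thumb in this paragraph can be collected in a short Python sketch. The function name `falconer_estimates` and the returned dictionary keys are illustrative choices, not part of the original text, and the function applies the formulas as stated without clamping estimates that fall outside the 0 to 1 range.

```python
def falconer_estimates(r_mz, r_dz):
    """Crude variance proportions from MZ and DZ twin correlations.

    Hypothetical helper: applies the ACE formulas unless the MZ correlation
    exceeds twice the DZ correlation, in which case the ADE formulas are used.
    """
    e = 1.0 - r_mz  # unique environment (includes measurement error)
    if r_mz > 2.0 * r_dz:
        # Evidence for dominance: estimate A and D, assume C is zero
        a = 4.0 * r_dz - r_mz
        d = 2.0 * r_mz - 4.0 * r_dz
        return {"A": a, "D": d, "C": 0.0, "E": e}
    # No evidence for dominance: estimate A and C, assume D is zero
    a = 2.0 * (r_mz - r_dz)
    c = 2.0 * r_dz - r_mz
    return {"A": a, "D": 0.0, "C": c, "E": e}


# Exercise behavior example from the text: rMZ = 0.8, rDZ = 0.6
exercise = falconer_estimates(0.8, 0.6)
# Resting SBP example from the text: rMZ = 0.52, rDZ = 0.26
sbp = falconer_estimates(0.52, 0.26)
```

For the exercise example the sketch returns proportions of roughly 0.4, 0.4, and 0.2 for A, C, and E, and for resting SBP an additive genetic proportion of 0.52, in line with the values quoted in the text.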
These are rules of thumb only. They are based on a model that has no interaction terms (e.g., A x E = 0) and assumes that mating is random and that the genetic and environmental factors are uncorrelated in the population (e.g., Cov(A, C) = 0). If these assumptions do not hold, these intuitively simple rules may yield incorrect estimates. Interaction across multiple loci (gene-gene interaction or epistasis), for instance, will reduce the DZ correlation and inflate the estimate of genetic dominance. Interaction of genetic and unique environmental influences will inflate the contribution of the unique environment and underestimate genetic influences, whereas interaction of genetic and shared environmental factors will inflate the contribution of genetic influences (36,37). Incorrect estimates may also arise when genetic and environmental factors are correlated, for instance, because people actively seek environments that fit their temperament and skills, or because parents pass on their genes as well as a specific environment to their offspring (vertical cultural transmission). Finally, phenotypic assortment, which is nonrandom mate selection based on shared traits (e.g., education, religion, lifestyle choices), increases both MZ and DZ twin correlations, which leads to an inflated estimate of the contribution of shared environment.
The other major assumption of the classical twin study is the "Equal Environments Assumption" that MZ twin pairs experience the same degree of environmental similarity as DZ twin pairs. If this is not the case and MZ twin pairs are exposed to more similar environments than DZ pairs, then any excess similarity between MZ pairs compared with DZ pairs may result from environmental rather than genetic factors. Several empirical findings argue in favor of the validity of the equal environment assumption (38,39). For instance, heritability estimates obtained from twin-adoption studies (where the MZ twins are raised in entirely different families) closely resemble those from ordinary twin studies. Also, studies of parents with misclassified twins (the parents always thought the twin was MZ but they turned out to be DZ and vice versa) have not shown any consistent effect of perceived zygosity on twin similarity for a range of personality traits.
Structural equation modeling (SEM) of twin variance-covariance data has several advantages over merely comparing the MZ and DZ correlations (34,40,41). SEM allows the comparison of the fit of alternative models (e.g., ACE versus AE) with the observed data and provides confidence intervals around the estimates for $V_A$, $V_C$/$V_D$, and $V_E$. In SEM, the relationship between several latent unobserved and observed variables is summarized by a series of structural equations. In a genetic analysis, these equations relate the observed trait to latent genetic and environmental variables (i.e., the additive and dominant effects of genes and common and unique environmental influences). From these equations, it is possible to derive the variance-covariance matrix implied by the model through covariance algebra (42). The variances and covariances for the basic twin model can be represented by linear structural equations of the total phenotypic variance ($V_P$) of both MZ and DZ twins ($V_P = V_A + V_D + V_C + V_E$), the MZ covariance ($\mathrm{Cov}(MZ) = V_A + V_D + V_C$), and the DZ covariance ($\mathrm{Cov}(DZ) = 0.50V_A + 0.25V_D + V_C$). As stated earlier, since we have four unknowns and only three observations, at most only one of $V_C$ and $V_D$ can be estimated. This is not to say that $V_C$ and $V_D$ cannot both contribute to the phenotypic variance of a trait but rather that they cannot be estimated simultaneously with data from twins alone. Consequently, when the correlation between MZ twins is less than twice the DZ correlation, we estimate $V_C$ and assume that genetic dominance is absent; conversely, when the MZ correlation is more than twice the DZ correlation, we estimate $V_D$ and assume that $V_C$ is zero.
Structural equation models may be represented diagrammatically using path diagrams, which can be helpful in understanding complex multivariate designs. A first simple univariate example relevant to psychosomatic medicine is depicted in Figure 1. SBP is measured at rest in MZ and DZ twin pairs. Our model specifies one latent genetic factor, one latent shared environmental factor, and one latent unique environmental factor, all with a variance of 1. In the example, dominance is assumed not to influence SBP and all the genetic variance is assumed to be additive; this seems to be the case in reality as well (17). Path coefficients "a," "c," and "e" represent the factor loadings of SBP on the latent factors. As seen from biometrical theory, $a^2 = V_A$, $c^2 = V_C$, and $e^2 = V_E$ (43). In structural equation modeling, parameter estimates for these path coefficients are obtained by using a fitting function, which quantifies the difference between the observed variance-covariance matrix and the variance-covariance matrix implied by the model. These functions provide a measure of how likely the data are under the specified model for the causes of familial resemblance. They also provide the significance of each of the model parameters (e.g., $a^2$, $c^2$, and $e^2$). The relative contribution of the genetic factor to the total variance in resting SBP, also known as the heritability ($h^2$), now obtains as the ratio $a^2/(a^2 + c^2 + e^2)$.
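To make the link between path coefficients and the model-implied variance-covariance matrices concrete, the following Python sketch builds the implied 2 x 2 matrices for MZ and DZ pairs under an ACE model. It is only an illustration: the function name and the path values are hypothetical, and the fitting step that a structural equation modeling package would perform (searching for the a, c, and e that best reproduce the observed matrices) is not shown.

```python
import math


def ace_implied_covariances(a, c, e):
    """Model-implied twin covariance matrices for an ACE model (no dominance).

    a, c, e are path coefficients on latent factors with unit variance,
    so that a^2 = VA, c^2 = VC, and e^2 = VE.
    """
    va, vc, ve = a * a, c * c, e * e
    vp = va + vc + ve            # total phenotypic variance
    cov_mz = va + vc             # MZ pairs share all A and all C
    cov_dz = 0.5 * va + vc       # DZ pairs share half of A and all C
    mz = [[vp, cov_mz], [cov_mz, vp]]
    dz = [[vp, cov_dz], [cov_dz, vp]]
    h2 = va / vp                 # heritability
    return mz, dz, h2


# Illustrative path values chosen to mimic standardized SBP (not fitted estimates)
mz, dz, h2 = ace_implied_covariances(math.sqrt(0.52), 0.0, math.sqrt(0.48))
```

With these illustrative values the implied MZ and DZ covariances are about 0.52 and 0.26 on a unit-variance scale, matching the SBP correlations quoted earlier in the text.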
One major advantage of structural equation modeling is that it can easily be expanded to the multivariate case, enabling us to examine whether two traits are correlated through common genetic or through common environmental effects. A typical example in our field would be to detect the nature of the well-known tracking of SBP level across time, which in adulthood is about 0.55 over 5- to 10-year periods. This tracking may reflect the effects of an underlying genetic factor affecting SBP across time points, but it may also reflect the effects of chronic stress or other persistent unique environmental factors. Figure 2 depicts a bivariate twin model that can test this and various other hypotheses. In the example, we assume that only two sources of variance explain individual differences in SBP at the two time points: additive genetic and unique environmental factors; again this seems to be true in reality (23,44). If coefficient a22 can be set to zero without a significant loss of fit, only a single genetic factor influences SBP at both time points; i.e., there is no additional contribution of genetic factors at time 2 that is not already seen at time 1. If coefficient e21 can be set to zero, then the unique environmental factors causing variance in SBP at time points 1 and 2 are uncorrelated. If coefficient a21 is significant and e21 is not, this means that the tracking of SBP over time is caused entirely by underlying genetic factors. Such a structure was found across multiple time points in Dutch twin samples (23), whereas in Australian and American twins both genetic and environmental factors contributed to temporal stability of SBP (44,45).
Multivariate structural equation models of twin data can also be used to analyze the interaction between siblings, the genetic and environmental correlation between different traits, and the direction of causation between variables (34,43,46). It is also easy to extend the classical twin design by including other informative relationships in the analysis, including siblings (47,48), parents of twins (25,49,50), the offspring of MZ and DZ twins (51,52), and the spouses of twins (53-55). These designs can quantify the effects of phenotypic assortment and vertical cultural transmission, which the classical twin study cannot do. Finally, if important aspects of the environment are measured, the presence and extent of gene x environment interaction can be tested (36,37).
An example of a twin model incorporating gene-environment interaction is given in Figure 3, where we additionally control for possible gene-environment correlation. Regular exercise is known to be associated with lowered SBP (56). However, the extent of SBP reduction after an identical exercise program shows large differences between individuals; family studies (98,99) have suggested these differences to be partly heritable (57). This suggests that subjects with different genetic make-up can differ in their sensitivity to the beneficial effects of exercise. To account for this gene-exercise interaction, the path loadings on SBP in Figure 3 are weighted for the exercise status of the twins (coded "yes" = 1, "no" = 0). If a model with nonzero β weights for the genetic factors fits the observed data better than a model with zero β weights for the genetic factors, we have formal evidence of gene-exercise interaction. Some complexity is introduced to the model by allowing part of the association between exercise behavior and blood pressure to derive from genes that independently influence both traits (i.e., the latent genetic factor Ac). This phenomenon is known as "pleiotropy" and may play a role in many traits that can be considered "environmental" modifiers of risk factors (e.g., lifestyle, SES, chronic stress) but may themselves be heritable. When there is evidence of potential gene-environment correlation, i.e., when the "environmental factor" itself shows heritability, as is the case for exercise behavior (35), allowing for gene-environment correlation as in Figure 3 is prudent.
In short, twin studies provide a first necessary step in genetic research by establishing that genes contribute to the observed population variation in psychosomatic risk factors and by estimating the size of this genetic contribution relative to other factors that create resemblance within families. Twin studies do not identify the actual genes. This effort requires molecular genetic research on the actual genetic variation.
Molecular Genetics
The statement that DZ twins share, on average, 50% of their genetic material refers exclusively to the part of the genes in which people can differ. Any one person's deoxyribonucleic acid (DNA) is 99.9% the same as any other person's DNA. The 0.1% difference in the sequence of DNA among individuals is the source of all genetic variation. Variation in a single gene is responsible for some disorders, such as cystic fibrosis and sickle cell disease. Variation in multiple genes, environmental factors, gene by gene interactions, and gene by environment interactions are thought to account for complex traits, including most traits of interest in psychosomatic medicine.
A gene consists of two units of information, the alleles. One allele is inherited from the father and one from the mother. Together they constitute the genotype, which may be homozygous (same allele from both parents) or heterozygous (different allele from each of the parents). Under a simple Mendelian inheritance model with random mating, lack of selection according to genotype, and absence of mutation or migration, the frequencies of the genotypes in the population are perfectly predicted by the frequencies of the two alleles, which is referred to as Hardy-Weinberg equilibrium (HWE) (58). As an example, consider a gene with two alleles, denoted "short" (s) with frequency p and "long" (l) with frequency q. Let the least frequent, or minor, allele s take up 40% of all alleles in the population (p = 0.4, so q = 0.6). The three potential genotypes, ss, ls, and ll, then have expected frequencies of $p^2$ (0.16), $2pq$ (0.48), and $q^2$ (0.36), respectively. A χ² test for HWE compares these expected genotype frequencies with the observed genotype frequencies; a significant χ² test indicates that HWE does not hold. Many of the association analyses discussed below require HWE to hold.
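The comparison of observed and expected genotype counts can be sketched in a few lines of Python. The function name and the genotype counts below are hypothetical; the allele frequency is estimated from the sample itself, so the resulting statistic is usually referred to a chi-square distribution with 1 df (critical value 3.841 at the 0.05 level). The sketch assumes both alleles are present in the sample, since an expected count of zero would cause a division by zero.

```python
def hwe_chi_square(n_ss, n_ls, n_ll):
    """Chi-square statistic comparing observed genotype counts with HWE expectations."""
    n = n_ss + n_ls + n_ll
    p = (2 * n_ss + n_ls) / (2 * n)      # sample frequency of the s allele
    q = 1.0 - p                          # sample frequency of the l allele
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ss, n_ls, n_ll)
    chi2 = sum((o - x) ** 2 / x for o, x in zip(observed, expected))
    return p, expected, chi2


CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

# Hypothetical sample that matches the expected 16/48/36 split for p = 0.4
p_eq, exp_eq, chi2_eq = hwe_chi_square(16, 48, 36)
# Hypothetical sample with the same allele frequency but too few heterozygotes
p_dev, exp_dev, chi2_dev = hwe_chi_square(30, 20, 50)
hwe_rejected = chi2_dev > CRITICAL_1DF_05
```

The first sample reproduces the expected counts and yields a statistic of essentially zero; the second has the same allele frequency but a clear heterozygote deficit, so HWE would be rejected.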
Large-scale genetic variation includes loss or gain of chromosomes or breakage and rejoining of chromatids. This variation is abnormal and often leads to profound developmental problems. Smaller-scale genetic variation is at the level of a single allele and contributes to most of the normal variation in the population. Smaller-scale genetic variation can be classified into three groups: a) single nucleotide polymorphisms (SNPs), b) insertion/deletion polymorphisms, and c) varying number of tandem repeats (VNTR). Deletion occurs when one or more nucleotides are eliminated from a sequence, whereas insertion occurs when one or more nucleotides are inserted into the sequence. VNTRs (which include very short repeats or microsatellites) are short identical segments of DNA aligned head to tail in a repeating fashion. The number of repeated segments at a locus varies between individuals. An SNP is defined as a single base substitution. SNPs are the most abundant form of DNA variation in the human genome, with approximately 7 million common SNPs with a minor allele frequency of at least 5% across the entire human population (59-62).
Candidate Gene Association Studies
Candidate gene association studies test if a particular allele in a candidate gene and a trait co-occur above chance level, given the frequency of the allele and the distribution of the trait in the population (63). In these studies, selection of candidate genes a priori is required. The selection of genes may be based on the biological role of the gene in a causative pathway (physiological candidate) or a location close to a peak from a linkage, or genetic mapping, study (positional candidate). Ideally, the gene fits both criteria. In a direct association study, one or more putatively functional variants are genotyped and serve as the independent variable predicting a dependent variable, the trait of interest. It is presumed that the selected variant is causative in the trait of interest although, in practice, association may be attributable to linkage disequilibrium (LD) with another functional site nearby. Genetic variants should be prioritized by apparent functional significance or location within coding, promoter, or splice regions. These typically include SNPs, VNTRs, and insertion/deletion polymorphisms.
An example of a direct association study is examining the role of variants within α- and β-adrenergic receptor genes as predictors of blood pressure level. The adrenergic receptor genes are good biological candidates due to their location on the heart (β1), in the vasculature (α1, β2), or within the central nervous system (α2a), and their involvement in cardiovascular regulation. In addition, individual variants within the genes have been shown to be functional. For example, receptors with the C→G SNP at base pair (bp) 1165 within ADRB1 (β1-adrenoreceptor gene), resulting in an amino acid substitution from arginine to glycine at position 389, show increased adenylyl cyclase activity in the presence of an agonist. In a study of young adult twins, the genotypes at this SNP were examined in relation to blood pressure at rest and in response to a combined mental arithmetic and Stroop task (3). After statistically controlling for age, sex, and body mass index, participants carrying any G allele at bp 1165 in ADRB1 exhibited increased resting SBP (GG/GC = 115.52 ± 8.47 versus CC = 112.94 ± 10.14 mm Hg), DBP (GG/GC = 61.88 ± 6.32 versus CC = 59.64 ± 7.16 mm Hg), and a larger DBP response (GG/GC = 6.97 ± 6.94 versus CC = 4.85 ± 6.87 mm Hg) to mental challenge as compared with CC genotypes (CG and GG groups were combined due to the small sample size for GG homozygotes).
There are numerous online resources with information about candidate genes and variation in or near the genes of interest. The National Center for Biotechnology Information home page, available at http://www.ncbi.nlm.nih.gov/, includes resources such as Online Mendelian Inheritance in Man (OMIM), dbSNP, the Genome Database, and PubMed. Other excellent resources include Ensembl, available at http://www.ensembl.org/; the Genome Browser from the University of California, Santa Cruz, available at http://genome.ucsc.edu/; the International HapMap Project, available at http://www.hapmap.org/; the SNP Consortium (TSC), available at http://snp.cshl.org/; the SeattleSNPs variation discovery resource, available at http://pga.gs.washington.edu/; and SNPper, a Web-based application to automate the tasks of extracting SNPs from public databases, available at http://SNPper.chip.org/.
The statistical approach to association studies depends on the research design (63). Common research designs for association studies include cohort designs and case-control designs. Within these designs, special topics with statistical implications include treatment of race and ethnicity, gene x gene and gene x environment interaction, and use of haplotypes and power.
Cohort Studies: Quantitative Traits
Single diallelic polymorphisms, such as SNPs, may be analyzed using general linear modeling. For individual SNPs, genotype (e.g., GG, CG, CC for a G to C substitution) typically serves as the independent variable. In the absence of knowledge about whether alleles at a given site function in an additive, dominant, or recessive manner (as is the case for many of the polymorphisms of interest in psychosomatic medicine), the three possible genotypes should be treated as independent groups. This would translate to a between-subjects group factor with the number of levels (k) equal to the number of genotypes and k-1 degrees of freedom (df) (i.e., 2 df). Evidence for apparent dominance of one allele over another may be detected through posthoc group contrasts (e.g., GG = CG > CC). Covariates and additional predictors of the dependent variable may also be incorporated.
Within a regression framework, the most general model for genetic effects at a single locus includes a term for linear effects of a given allele and an additional parameter for the deviation from this linear effect, i.e., a dominance term (63). For the linear term, genotypes (e.g., GG, CG, and CC) are assumed to function in an additive manner and are coded as 0, 1, and 2, reflecting dose of the C allele. The associated β weight is the additive effect of the C allele. This linear model alone predicts that the mean of the heterozygotes (CG) will be located at the midpoint between the two types of homozygotes (GG, CC); however, in practice, this may or may not be the case. Deviation of the mean of the heterozygotes from the midpoint between the means of the homozygotes suggests that one allele is dominant over the other. To quantify this effect, an additional dominance term is necessary. Specifically, genotypes GG, CG, and CC may be coded 0, 1, and 0, with the associated β weight reflecting deviation of the heterozygotes from the midpoint of the two homozygous groups, as would be predicted by the linear term alone. The general regression framework for a diallelic locus is given by $P = \alpha + \beta_a A + \beta_d D + e$, where P is a quantitative trait; $\alpha$ is the baseline mean of P; A and D are dummy variables reflecting coding for linear and nonlinear effects of the underlying genotype at a single locus; and e is a residual error term assumed to be normally distributed.
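A small Python sketch may help to show what the additive and dominance terms capture. With A coded 0, 1, 2 and D coded 0, 1, 0 and no covariates, the model has three parameters for three genotype groups, so the least-squares fit reproduces the three group means and the coefficients can be read off directly. The function name and the SBP values are hypothetical and serve only as an illustration; a real analysis would use a regression routine that also handles covariates and standard errors.

```python
from statistics import mean


def additive_dominance_effects(trait_by_genotype):
    """Intercept, additive, and dominance effects from genotype group means.

    Assumes coding A = 0, 1, 2 and D = 0, 1, 0 for GG, CG, CC, so that
    mean(GG) = alpha, mean(CG) = alpha + beta_a + beta_d, mean(CC) = alpha + 2*beta_a.
    """
    m_gg = mean(trait_by_genotype["GG"])
    m_cg = mean(trait_by_genotype["CG"])
    m_cc = mean(trait_by_genotype["CC"])
    alpha = m_gg                              # baseline mean (reference genotype GG)
    beta_a = (m_cc - m_gg) / 2.0              # effect per copy of the C allele
    beta_d = m_cg - (m_gg + m_cc) / 2.0       # heterozygote deviation from midpoint
    return alpha, beta_a, beta_d


# Hypothetical SBP values (mm Hg): heterozygotes fall exactly at the midpoint
purely_additive = {"GG": [110, 112], "CG": [114, 116], "CC": [118, 120]}
# Hypothetical SBP values: heterozygotes resemble CC homozygotes (C dominant)
c_dominant = {"GG": [110, 112], "CG": [118, 120], "CC": [118, 120]}

alpha1, ba1, bd1 = additive_dominance_effects(purely_additive)
alpha2, ba2, bd2 = additive_dominance_effects(c_dominant)
```

In the first data set the dominance coefficient is zero; in the second it equals the additive coefficient, which is the pattern expected under complete dominance of the C allele.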
For polymorphisms with more than two alleles (e.g., microsatellites), genotypes may be treated individually, although there will be little power to examine the effects of the rarer alleles. Alternatively, alleles may be ranked according to function based on in vitro assays (64). If there are no functional data available and several rare genotypes, it may be necessary to limit analyses to the most common genotypes to preserve statistical power.
Case-Control Studies: Disease Traits
Case-control genetic association studies typically comprise a group of cases with a trait of interest and well-matched controls. Ideally, the cases and controls should represent "identical" subsamples from a single population differing only on the trait of interest (65). Statistical analyses compare allele frequencies or genotypes across cases and controls. In well-matched samples, differences in genotypes across cases and controls may be tested using χ² tests. Alternatively, the risk of having the disorder may be modeled using logistic regression with a 2 df test. Within this approach, the log odds of expressing the disease trait is modeled as a function of the additive effects of the dose of one of the alleles (e.g., 0, 1, or 2 copies of the C allele for genotypes GG, CG, and CC, respectively) and a dominance term representing deviance from this additive pattern (e.g., genotypes GG, CG, and CC coded as 0, 1, and 0). For the additive term, the log odds of disease expression for heterozygotes is midway between the log odds of the two homozygous groups. The dominance term quantifies the extent to which the log odds for heterozygotes differs from the additive prediction. The general logistic regression framework for a diallelic locus is given by $\ln\left(\frac{P}{1 - P}\right) = \alpha + \beta_a A + \beta_d D + e$, where P is the binary expression of a phenotype; $\alpha$ is the baseline log odds of P; A and D are dummy variables reflecting coding for linear and nonlinear effects of the underlying genotype at a single locus; and e is a residual error term assumed to be normally distributed. The base of the natural logarithm raised to the power of the additive β weight ($e^{\beta_a}$, or exp(βa)) reflects the change in odds of expression of the phenotype based on a unit increase in allele dose. The GG genotype becomes the reference group (0 alleles), and the effect of genotype is quantified by determining if there is a significant change in the probability of the expression of the phenotype for each additional C allele (CG = 1 additional allele and CC = 2 additional alleles). The base of the natural logarithm raised to the power of the dominance β weight reflects the deviation of heterozygotes from the midpoint of the log odds for the two homozygous groups. For a binary genotype (i.e., GG versus CG or CC), the base of the natural logarithm raised to the power of the additive β weight would be an odds ratio.
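The same reasoning can be illustrated for case-control counts with a Python sketch. When genotype is the only predictor, the additive-plus-dominance logistic model has as many parameters as genotype groups, so its fitted log odds equal the observed log odds in each group and the coefficients follow directly. The function name and the counts are hypothetical; covariates, standard errors, and the 2 df test itself would require a logistic regression routine and are not shown.

```python
import math


def logistic_genotype_effects(counts):
    """Additive and dominance effects on the log-odds scale.

    counts maps each genotype ("GG", "CG", "CC") to a (cases, controls) tuple.
    Assumes coding A = 0, 1, 2 and D = 0, 1, 0, with GG as the reference group.
    """
    log_odds = {g: math.log(cases / controls) for g, (cases, controls) in counts.items()}
    alpha = log_odds["GG"]                                    # log odds in reference group
    beta_a = (log_odds["CC"] - log_odds["GG"]) / 2.0          # per C allele
    beta_d = log_odds["CG"] - (log_odds["GG"] + log_odds["CC"]) / 2.0
    odds_multiplier = math.exp(beta_a)                        # change in odds per C allele
    return alpha, beta_a, beta_d, odds_multiplier


# Hypothetical counts in which the odds double with each additional C allele
counts = {"GG": (50, 100), "CG": (100, 100), "CC": (200, 100)}
alpha, beta_a, beta_d, odds_multiplier = logistic_genotype_effects(counts)
```

For these counts the odds are 0.5, 1, and 2 across GG, CG, and CC, so the odds multiplier per C allele is 2 and the dominance coefficient is zero.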
Treatment of Race and Ethnicity
In cohort-based and case-control analysis of unrelated individuals, spurious genetic association may result due to differences in allele frequencies and the trait of interest in subgroups within the larger population, often reflecting racial or ethnic groups (population stratification). The classic example of population stratification is a hypothetical association between chopstick use and any genetic marker that differs markedly between Asian and Caucasian populations in a larger population with substantial representation of both ethnicities, such as San Francisco, California (66). It has been argued that there have been relatively few documented instances of bias due to population stratification reported in the literature and that population-based studies are largely robust to this type of bias (67). However, recent empirical tests do find evidence of stratification effects, particularly among populations that have recently been mixed from two or more distinct parental populations (genetic admixture), including African Americans and Hispanic Americans (68).
Population stratification is essentially a problem of sample matching, occurring primarily when the genetic background of the cases differs from that of controls (67). Accordingly, it is possible that matching cases and controls on self-reported race in homogeneous populations (such as European Americans) will mitigate concerns about population stratification. Two methods are available to control for stratification using markers throughout the genome. In structure assessment (69-74), genetic markers, either anonymous markers or markers that differ substantially among ethnic groups, are used to predict membership in homogeneous subgroups within a stratified population. Once identified, genetic associations may be conducted within these subgroups to ensure a similar genetic background of cases and controls. A second method, genomic control (75-79), uses anonymous genetic markers to estimate the degree of inflation of the χ² statistic due to population stratification and yields a correction factor to account for these background genetic effects in genetic association studies. With the rapid reduction in genotyping costs and further development of these methods (69), it is likely that the threat of population stratification will be routinely controlled in cohort and case-control genetic association studies using these types of techniques.
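The idea behind genomic control can be sketched in Python as follows. This is a minimal illustration of one common formulation, assumed here rather than taken from the text: the inflation factor is estimated as the median chi-square statistic (1 df) across anonymous markers divided by the median of a chi-square distribution with 1 df (approximately 0.455), it is not allowed to fall below 1, and the candidate statistic is divided by it. The function name and all statistics are hypothetical.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 df
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(null_marker_chi2, candidate_chi2):
    """Estimate the inflation factor and correct a candidate chi-square statistic."""
    inflation = median(null_marker_chi2) / CHI2_1DF_MEDIAN
    inflation = max(1.0, inflation)          # do not deflate when there is no inflation
    return inflation, candidate_chi2 / inflation


# Hypothetical 1-df statistics from anonymous markers in a stratified sample
stratified = [0.2, 0.5, 0.9098, 1.4, 3.0]
lam, corrected = genomic_control(stratified, 10.0)

# Hypothetical statistics from a well-matched sample: no correction is applied
matched = [0.05, 0.3, 0.4, 0.9, 2.1]
lam_matched, corrected_matched = genomic_control(matched, 10.0)
```

In the stratified example the inflation factor is about 2, so a candidate statistic of 10 is reduced to about 5; in the matched example the factor is held at 1 and the statistic is unchanged.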
Another good method to ensure genetic matching is to conduct genetic studies within families. Tests using within-family controls to control for population stratification are collectively known as transmission disequilibrium tests (TDTs). The classic TDT requires information on trios, i.e., parents and an affected offspring. The principal idea is that the allele associated with disease will be transmitted more often to an affected offspring (80). The TDT compares the actual and expected probabilities of transmission of the allele (an offspring has an expected chance of 0.5 of receiving a specific allele from either the mother or the father). Overtransmission can only occur if the marker and disease locus are linked. However, power of the TDT is less than for an association test based on cases and controls because only heterozygote parents provide information about preferential allele transmission. After the introduction of the classic TDT by Spielman and colleagues (80), the TDT has undergone many developments and has, for example, been adapted for quantitative traits and nuclear families of any size (81-83) as well as for haplotypes (84).
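The classic TDT for a diallelic marker reduces to a simple comparison of counts, sketched below in Python. Among heterozygous parents of affected offspring, let b be the number of times the allele of interest was transmitted and c the number of times it was not; under the null hypothesis each has probability 0.5, and the statistic (b - c)²/(b + c) is referred to a chi-square distribution with 1 df. The function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """Classic TDT chi-square (1 df) from heterozygous-parent transmission counts."""
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

# Hypothetical data: 100 heterozygous parents, allele transmitted 60 times
chi2_over = tdt_statistic(60, 40)
overtransmitted = chi2_over > CRITICAL_1DF_05

# Hypothetical data: transmission exactly at the expected 50%
chi2_null = tdt_statistic(50, 50)
```

Only heterozygous parents enter the counts, which is why the text notes that the TDT has less power than a case-control test of the same size.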
Gene x Gene and Gene x Environment Interaction
From a genetics perspective, nearly all psychosomatic traits are considered "complex," meaning that the causal pathways are likely to involve multiple genes of small effect, environmental factors, and gene x gene and gene x environment interaction (85). Genetic interaction within a given locus is termed genetic dominance. Interaction between two loci is termed epistasis. However, a distinction between epistasis referring to a statistical interaction and that referring to a physical interaction of gene products is warranted, as the presence of statistical interaction does not necessarily imply an underlying biological interaction (86,87). Similarly, statistical gene-environment interactions should be interpreted with caution as the mathematical model may again have no obvious biological interpretation (88).
Modeling statistical gene x gene or gene x environment interaction may be accomplished by incorporating two genetic predictors or one genetic and one environmental predictor into linear or logistic regression in standard statistical packages and testing for their interaction (87,89). The choice of scale becomes important because factors that are additive with respect to an outcome in one scale may exhibit interaction if a transformed scale is used. For linear regression with two genetic predictors, the general regression model is given by:
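$$P = \mu + a_1 A_1 + d_1 D_1 + a_2 A_2 + d_2 D_2 + i_{aa} A_1 A_2 + i_{ad} A_1 D_2 + i_{da} D_1 A_2 + i_{dd} D_1 D_2 + e$$

where P is a quantitative trait; μ is the baseline mean of P; A1, A2, D1, and D2 are dummy variables coding for the additive and dominance effects of the underlying genotype for sites 1 and 2; the a and d terms are the corresponding main-effect regression coefficients and the i terms are the interaction coefficients; and e is a residual error term assumed to be normally distributed. Statistical epistasis implies that at least one of the interaction coefficients differs significantly from zero.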
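The predictors of this two-locus model can be built mechanically from genotype counts. The sketch below is one possible coding and is an assumption rather than the coding used in the cited work: the additive dummy is the allele count minus one (−1, 0, 1) and the dominance dummy is 1 for heterozygotes and 0 otherwise. Each row holds the intercept, the four main-effect dummies, and their four cross-locus products, ready to pass to any regression routine.

```python
def code_genotype(n_alleles):
    """Map a count of the reference allele (0, 1, 2) to (additive, dominance)."""
    if n_alleles not in (0, 1, 2):
        raise ValueError("genotype must be coded 0, 1, or 2")
    additive = n_alleles - 1               # -1, 0, 1
    dominance = 1 if n_alleles == 1 else 0  # heterozygote indicator
    return additive, dominance


def epistasis_design_row(g1, g2):
    """Design-matrix row for loci 1 and 2.

    Order: intercept, A1, D1, A2, D2, A1*A2, A1*D2, D1*A2, D1*D2.
    """
    a1, d1 = code_genotype(g1)
    a2, d2 = code_genotype(g2)
    return [1, a1, d1, a2, d2, a1 * a2, a1 * d2, d1 * a2, d1 * d2]


# A subject homozygous at locus 1 and heterozygous at locus 2.
print(epistasis_design_row(2, 1))
```

A joint test that the last four coefficients are zero then corresponds to the test for statistical epistasis described above.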
For gene x environment interaction, at least one genetic and one environmental predictor are included in the regression equation plus the interaction of the additive and dominance term with the environmental predictor. Statistical interaction implies that either of the interaction terms differs significantly from zero. The general regression framework for a gene x environment interaction for a continuous trait is given by:
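$$P = \mu + aA + dD + bE + i_{aE}\,AE + i_{dE}\,DE + e$$

where P is a quantitative trait; μ is the baseline mean of P; A and D are dummy variables coding for linear and nonlinear effects of the underlying genotype; E is a measured environmental factor; a, d, and b are the main-effect regression coefficients and the i terms are the interaction coefficients; and e is a residual error term assumed to be normally distributed. Assuming no genetic dominance or associated interactions, this equation reduces to:

$$P = \mu + aA + bE + i_{aE}\,AE + e$$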
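To make the reduced model concrete, suppose for illustration that both the genotype and the environmental factor are coded 0/1 (for example, carrier versus noncarrier and exposed versus unexposed; this binary coding is an assumption, not part of the original model). The interaction coefficient then equals a difference of differences in cell means. The hypothetical means below also illustrate the earlier point about scale: effects that interact on the raw scale can be purely additive after a log transformation.

```python
import math


def interaction_contrast(means):
    """Difference of differences in cell means.

    means[(a, e)] is the mean trait value for genotype code a and
    environment code e, each coded 0 or 1.
    """
    return (means[(1, 1)] - means[(1, 0)]) - (means[(0, 1)] - means[(0, 0)])


# Hypothetical cell means in which each factor doubles the trait value.
raw = {(0, 0): 10.0, (0, 1): 20.0, (1, 0): 20.0, (1, 1): 40.0}
logged = {cell: math.log(value) for cell, value in raw.items()}

print("raw-scale interaction:", interaction_contrast(raw))
print("log-scale interaction:", round(interaction_contrast(logged), 12))
```

On the raw scale the contrast is nonzero, whereas on the log scale it vanishes, so the presence of a statistical interaction here depends entirely on the chosen scale.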
Finally, if the genotype is correlated with the environmental risk factor (e.g., genetic susceptibility to aggression and parental maltreatment), the interpretation of the statistical interaction is not straightforward (90). In addition, observational studies can be associated with substantially less power than well-designed experiments to detect interaction effects (91), suggesting that controlled interventions may be a useful alternative to observational studies in detecting gene x environment interaction effects. An example would be to test whether certain candidate genes in the sympathetic nervous system (e.g., ADRB2, or the β2-adrenoreceptor gene) may explain part of the large individual variability in the beneficial effects of exercise on blood pressure.
Haplotype Analysis
The primary disadvantage of characterizing a single variant per gene is that there may be additional variants within the gene that are relevant to the trait of interest but are not captured by variation at a single marker. Hence, there has been increasing interest in using haplotypes, rather than single markers, as the unit of analysis in association studies (92). A haplotype refers to multiple SNPs along a short region of a chromosome (e.g., within a gene) that occur in a block pattern (Figure 4). There are three good reasons to perform haplotype analysis as part of candidate gene association studies: a) a haplotype might be in higher LD with the causal locus than any of the individual markers, b) interactions among the individual markers might form a functional haplotype, and c) haplotype analysis reduces the number of multiple tests of individual SNP analysis. A common problem of all statistical methods that use haplotype information is linkage phase ambiguity; i.e., it is unknown which alleles are located on the maternal chromosome and which are located on the paternal chromosome. As our genotyping analyses yield only the full genotypes, not the parental alleles separately, we do not know from which haplotype (maternal or paternal) the alleles originated. When multiple members in a family are genotyped, preferably including the parents, the two haplotypes constituting each genotype can often be determined from the Mendelian principles of gene segregation within a pedigree. Alternatively, statistical algorithms can be used to reconstruct haplotypes in unrelated individuals using the frequency and correlation of the SNPs in the population. The reliability of such algorithms seems to be good for multiple diallelic markers, such as SNPs (93,94), although there is some power loss for the association tests as a result of the haplotype phase uncertainty.
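Linkage phase ambiguity can be illustrated by enumerating every pair of haplotypes that is consistent with an unphased genotype. In the sketch below, which is purely illustrative, each SNP genotype is coded as the count (0, 1, or 2) of one allele and each haplotype as a tuple of 0/1 values; the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of allele counts (0, 1, or 2), one per SNP.
    """
    pairs = set()
    for h1 in product((0, 1), repeat=len(genotype)):
        # The second haplotype must carry whatever the first one does not.
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(allele in (0, 1) for allele in h2):
            pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Heterozygous at both SNPs: the phase cannot be read off the genotype.
print(compatible_haplotype_pairs((1, 1)))
# Heterozygous at only one SNP: the phase is unambiguous.
print(compatible_haplotype_pairs((1, 2)))
```

An individual who is heterozygous at k SNPs has 2^(k−1) possible haplotype pairs, which is why family data or statistical reconstruction is needed once more than one site is heterozygous.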
Because the SNPs in a haplotype are strongly associated (in LD) with each other, it is possible to test for the association of a haplotype with a trait or disease by genotyping only a few SNPs ("haplotype tagging" SNPs or simply "tagging" SNPs) within the haplotype. Tagging SNPs are first selected in a subset of the sample or in samples of the same ethnicity from freely available web resources, such as the HapMap. This reduces the cost of genotyping in the full sample, yet it ensures reasonably good coverage of common variation throughout the gene. The tagging SNPs are then examined for association with the trait of interest in the total sample, and the effects of unassayed SNPs are detected through their LD with the tagging SNPs (95). The International HapMap Project (available at http://www.hapmap.org) has characterized >4 million SNP markers on a genome-wide scale in three ethnic groups (Caucasians, Africans, and Asians), greatly facilitating the use of tagging SNPs in association studies.
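How well one SNP "tags" another is commonly summarized by the squared correlation r² between alleles at the two sites. The sketch below computes r² from a haplotype frequency and the two allele frequencies; the function name and the example frequencies are hypothetical, and the r² threshold used to declare a SNP adequately tagged (often around 0.8) is a convention rather than something specified in this review.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Squared allelic correlation between two diallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Alleles always co-occur on the same haplotype: one SNP fully tags the other.
print(ld_r_squared(0.30, 0.30, 0.30))
# Alleles occur together only as often as expected by chance: no tagging.
print(ld_r_squared(0.09, 0.30, 0.30))
```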
Power and Sample Size Considerations
Although many power calculations required for genetic association studies may be derived from texts well known to behavioral researchers (96,97), excellent on-line resources specific to power and sample size calculations for genetic association studies also exist, e.g., Quanto (98) available at http://hydra.usc.edu/gxe and the Genetic Power Calculator (99) available at http://statgen.iop.kcl.ac.uk/gpc/. In molecular genetic studies of quantitative traits, assuming a simple additive model, the effect size of a locus is a function of mean trait differences between homozygotes (e.g., the CC versus GG genotype) and allele frequency (100). Differing modes of inheritance (additive, dominant, and recessive) will also influence the effect size and have resulting effects on power and sample size.
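The dependence of effect size on homozygote means and allele frequency can be made explicit. Under a purely additive model and Hardy-Weinberg equilibrium (both assumptions of this sketch), the variance contributed by a diallelic locus is 2pqa², where p and q are the allele frequencies and a is half the difference between the homozygote means. The function name and the SBP values below are hypothetical.

```python
def additive_variance_explained(mean_hom1, mean_hom2, allele_freq, trait_sd):
    """Proportion of trait variance explained by an additive diallelic locus.

    mean_hom1, mean_hom2: trait means of the two homozygous genotypes.
    allele_freq: frequency of one of the two alleles.
    trait_sd: standard deviation of the trait in the population.
    Assumes Hardy-Weinberg equilibrium and no dominance.
    """
    a = (mean_hom2 - mean_hom1) / 2.0  # additive genotypic value
    p = allele_freq
    q = 1.0 - p
    locus_variance = 2.0 * p * q * a * a
    return locus_variance / trait_sd ** 2


# Hypothetical example: SBP means of 120 vs. 124 mm Hg, allele frequency 0.3,
# and a population SD of 10 mm Hg.
print(additive_variance_explained(120.0, 124.0, 0.3, 10.0))
```

The same mean difference explains less variance as the allele becomes rarer, which is one reason rare alleles require larger samples.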
As psychosomatic traits are likely to be influenced by multiple genes and interactions of small effect, the effect size for each is generally expected to be small. Sample sizes required to detect gene main effects and gene x environment interaction with sufficient statistical power in this context are relatively large. Although previous studies (98,99) have suggested that association can be detected even in modestly sized samples, standard power calculations show that up to 1000 participants are required to detect gene main effects and approximately 1500 to 2000 participants are required to detect gene x environment interaction with small to medium effect sizes. The required sample size will be even larger if one of the alleles is rare (e.g., less than 5% to 10%) or a large number of markers is typed and the statistical criterion, typically set at α = 0.05 for two-tailed tests, must be adjusted for multiple comparisons. One method to adjust for multiple comparisons is to use techniques that control the false discovery rate (FDR), i.e., the proportion of significant findings (or discoveries) that are false-positives (101,102).
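One widely used FDR-controlling technique is the Benjamini-Hochberg step-up procedure; whether it is the specific method intended by the cited references is an assumption here. The sketch below sorts the p-values, finds the largest rank k for which p(k) ≤ kq/m, and rejects the k smallest. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank meeting the criterion
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten SNPs.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(pvals, q=0.05)
bonferroni_hits = [p <= 0.05 / len(pvals) for p in pvals]
print("FDR discoveries:", sum(fdr_hits))
print("Bonferroni discoveries:", sum(bonferroni_hits))
```

With these values the FDR procedure retains two markers where a Bonferroni threshold retains only one, illustrating the gain in power at the cost of tolerating a controlled proportion of false discoveries.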
Integration
Although statistical analysis of genetic association may, in many cases, be conducted using well-known methods, the strength of the interpretation of results is grounded in the study design. Nonsignificant results may be attributable to Type II error (at 80% power, there remains a 20% chance that you will falsely accept the null hypothesis) or experimental biases like genotyping error or overmatching of controls (65). Many prior studies have lacked sufficient statistical power (at least 80%) to detect the small effects expected, particularly if gene-gene, gene-environment, or genetic heterogeneity (i.e., more than one genetic variant can produce the same outcome) effects are involved. In addition, negative results could be due to inadequate coverage of a gene, for example, in studies of single variants. Nonsignificant results may also be attributable to a true lack of etiological relationship (65). Given the importance of nonreplications in the literature, calls have been made for convenient formats to publish negative results (65,89,103).
A significant result may indicate that a causal relationship between genotype and trait has been identified. However, because there are several other potential explanations of significant results, this type of interpretation should be used with caution. A common cause of false-positive results is increased Type I error due to multiple statistical tests. This problem will only increase with the availability of high throughput methods, which can easily generate millions of genotypes. The optimal correction method for multiple comparisons depends on the number of markers and phenotypes studied. For example, correction is mandatory for whole genome studies that use large numbers of random markers but may not always be necessary for candidate gene studies in which the prior probability of a true discovery is likely to be higher (104) and in which gene-wide significance levels can be used (105). The issue is further complicated by the correlation between SNPs (LD) and between phenotypes, making it difficult to assess the number of independent tests. Recently Manly and colleagues (104) showed that correction techniques for multiple comparisons based on the original Bonferroni are generally too conservative. New procedures based on FDR effectively control the proportion of false discoveries without sacrificing the power to discover.
Population stratification is another source of false-positive results. Although it is less likely that spurious association due to population stratification will occur among seemingly homogeneous populations, such as European Americans, this type of bias remains a concern particularly among recently admixed populations, such as African Americans and Hispanic Americans, and populations of mixed racial and ethnic composition. It is also possible that the genetic marker showing significant association is not the causal variant per se but is co-inherited (in LD) with a causal variant. Many of these threats to the interpretation of the results may be mitigated with careful study design, such as appropriate correction for multiple comparisons, incorporation of genetic markers to characterize population substructure, and haplotyping to characterize variation throughout a candidate gene. Most importantly, to minimize the probability that an observed association is a false-positive, significant findings must be replicated in independent samples. Many of these issues may be novel for persons considering genetic research for the first time and consultation on study designs with geneticists, statistical geneticists, or genetic epidemiologists is always recommended.
Finally, although this review focused on methods for candidate gene association studies, it should be noted that for most complex traits, our knowledge of underlying causative pathways is likely incomplete. Limiting the search for contributing genetic variation to known candidate genes only will likely prevent the identification of potentially novel pathways that contribute to psychosomatic traits. Thus, the candidate gene approach should ideally capitalize on knowledge generated with genome-wide searches, using techniques such as linkage analysis (106) and genome-wide association (107).
| NOTES |
|---|
|
|
|---|
DOI:10.1097/PSY.0b013e31802f5dd4
| REFERENCES |
|---|
|
|
|---|
- and β-adrenoreceptor genes as predictors of cardiovascular function at rest and in response to mental challenge. J Hypertens 2002;20:1105–14.