| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
STATISTICAL CORNER |
From Arizona State University, Tempe, Arizona.
Address correspondence and reprint requests to Craig K. Enders, Arizona State University, Department of Psychology, P.O. Box 871104, Tempe, AZ 85287-1104. E-mail: cenders{at}asu.edu
| ABSTRACT |
|---|
|
|
|---|
Key Words: missing data full information maximum likelihood direct maximum likelihood maximum likelihood multiple imputation attrition
Abbreviations: DML = direct maximum likelihood; MI = multiple imputation; ML = maximum likelihood; LW = listwise deletion; AMI = arithmetic mean imputation; SRI = stochastic regression imputation; DA = data augmentation; QOL = quality of life; MAR = missing at random; MCAR = missing completely at random; MNAR = missing not at random; LOCF = last observation carried forward.
| INTRODUCTION |
|---|
|
|
|---|
DML and MI have received considerable attention in the methodologic literature during the last 20 years but have not yet enjoyed widespread use in psychology and medicine. These techniques are theoretically appealing because they require a weaker, and probably more realistic, assumption about the missing data. In practice, this means that DML and MI should produce unbiased and efficient parameter estimates in situations where traditional methods will not. Indeed, a growing body of empirical studies has demonstrated that DML and MI are generally superior to the traditional missing data techniques that are still prevalent in the applied literature (110). Although DML and MI are not without their own shortcomings, these methods are currently considered to be the methodologic "state of the art" (11) and are the focus of this manuscript.
In a recent issue of APAs Monitor on Psychology, Stephen West, editor of Psychological Methods, stated that, "Routine implementation of these new methods of addressing missing data [DML and MI] will be one of the major changes in research over the next decade" (12, p. 70). However, recent reviews of published studies suggest that authors still rely primarily on techniques that are frowned on in the methodologic literature (13,14). If the routine use of DML and MI represents a fundamental shift in research methodology in the coming decade, it is important that researchers gain exposure to these analytic methods because a substantial gap appears to exist between the analytic methods suggested in the literature and those used in published behavioral science journals. As such, the goal of this manuscript is to summarize recent methodologic advances related to missing data and to provide readers with an overview of analytic options for handling missing data.
The manuscript begins with a brief overview of Rubins (15) missing data mechanisms because this provides a basis for understanding the theoretical advantages of DML and MI. Next, a detailed overview of DML and MI is provided, as are brief descriptions of some popular traditional methods. In discussing DML and MI, special attention is given to "inclusive" analysis strategies (16) that incorporate auxiliary variables into the analysis; as defined below, auxiliary variables are extraneous to ones substantive hypotheses but may be potential causes or correlates of missingness. Finally, an example analysis is conducted so that researchers may gain some appreciation for the ease with which DML and MI can be used to handle missing data in their own research.
For the remainder of the manuscript, I rely on a hypothetical study of quality-of-life (QOL) measurements from a clinical trial of cancer patients. Specifically, suppose that it were of interest to collect four bimonthly assessments of QOL in a clinical trial for a new chemotherapy drug, such that patients were randomly assigned to either the treatment or control condition. In addition to the four QOL assessments, a number of auxiliary variables were collected at the onset of the study (e.g., health status, physical functioning, age, education). Furthermore, suppose that a stipulation for participation in the study required all patients to be present for the intake assessment, so complete data were available for the first QOL assessment, as well as for the set of auxiliary variables. The concepts presented in this manuscript will be explored in the context of this hypothetical research scenario.
Missing Data Theory
The use of DML and MI is supported by a growing number of empirical studies but also has a strong foundation in statistical theory. Rubins seminal article (15) outlines different "mechanisms" that lead to missing data. These missing data mechanisms may be viewed as probabilistic explanations for how missing values are related to other measured variables in a data set. From a slightly different perspective, Rubins mechanisms may be viewed as assumptions that dictate the performance of a given missing data technique. Although it may not be readily apparent, virtually every missing-data handling approach requires the analyst to adopt certain assumptions about the "cause" of the missing values (i.e., the relationship between missingness and the data). As discussed below, traditional methods such as listwise (LW) and pairwise deletion tend to require the most strict (and probably unrealistic) assumption, but even DML and MI are not without their own tenuous assumptions.
According to Rubin (15), data are missing at random (MAR) if the probability of a missing value on variable x is unrelated to the values of x itself. In the context of the previous example, this means that the probability of missing a QOL score is unrelated to QOL at that particular assessment. This terminology is somewhat unfortunate because MAR does not mean that the process governing missing values is random in the same sense as a coin toss. Rather, MAR means that missingness is systematically related to measured values in the data set. For example, suppose that it was known that elderly patients and patients with lower education have a higher propensity for missing QOL scores (17). This scenario would satisfy MAR as long as no residual relationship existed between QOL scores and missingness once age and education were controlled (i.e., the observed data are a random sample of cases belonging to the same age and education strata).
The MCAR mechanism is a special case of MAR, with the additional restriction that missingness is unrelated to the observed data. That is, the observed data are a random sample of observations from the hypothetically complete data set. Returning to the QOL example, suppose that the administration of questionnaires is inadvertently overlooked for a small group of participants. Similarly, suppose that a questionnaire battery could not be administered due of a scheduling conflict. These examples would qualify as MCAR as long as the assessment difficulties were unrelated to measured variables (e.g., administration problems were not systematically related to a patients treatment site, health status, etc.). Some authors have argued that MCAR is a strict assumption that is rarely met in practice (8,18). This is an important point because, with the exception of a few esoteric cases (19), traditional missing data approaches will only yield unbiased estimates when MCAR holds; DML and MI are unbiased and efficient under MCAR and MAR data.
The last of Rubins (15) missing data mechanisms is termed missing not at random (MNAR). Data are MNAR if the probability of missing data is systematically related to the values that are missing. Returning to the previous example, suppose that QOL scores are missing for patients whose health status had deteriorated to the point that they could no longer participate in the study (i.e., the missing QOL score is related to the patients poor QOL at that particular assessment). In this case, DML and MI would also produce biased parameter estimates. MNAR missing data methods have been proposed in the literature (20,21) but are prone to substantial bias when the missing data model is not properly specified. As such, I chose to focus on methods that assume MAR in this manuscript (DML and MI) because these techniques currently represent the "practical state of the art" (11, p. 173).
The MAR condition is sometimes referred to as ignorable missingness because unbiased parameter estimates can be obtained via DML or MI without the need to incorporate an explicit model that explains why the data are missing. A subtle but important point to remember is that missing data mechanisms are not inherent characteristics of a data set. Within the same data set, one analysis may satisfy the MAR condition, while another analysis could be described as MNAR. To illustrate, I previously raised the possibility that missing QOL scores were systematically related to age and education. For simplicity, assume that these were the only variables related to missingness. Suppose that the primary analytic interest was to regress final QOL scores on a treatment indicator (0 = control, 1 = treatment) and initial QOL status (i.e., an ANCOVA model). Even if the analyst used DML to conduct the regression analysis, this situation would be described as MNAR because the parameter estimates were not conditioned on the variables related to missingness (i.e., age and education were not included in the analysis model). As such, it is likely that the estimated treatment effect would contain at least some bias due to the missing data.
Fortunately, there are established techniques for incorporating auxiliary variables such as age and education into DML and MI analyses. An auxiliary variable can be defined as a variable that is extraneous to ones substantive hypotheses but may be 1) a potential cause or correlate of missingness, or 2) a correlate of the variable that contains missing values. The inclusion of auxiliary variables has been referred to as an "inclusive" analysis strategy (16) and has been recommended in order to make the MAR condition more plausible and to increase the efficiency of the resulting estimates (22,23). Although it may be difficult to identify useful auxiliary variables with any degree of certainty, it appears that researchers can cast a wide net when selecting a set of auxiliary variables. Previous simulation results suggested that the inclusion of a relatively large number of "junk" variables does not appear to negatively affect the resulting parameter estimates (16).
Traditional Missing-Data Techniques
Before outlining the two "modern" missing-data methods, a brief overview of some traditional missing-data techniques is warranted. The focus on DML and MI precludes an exhaustive review of these methods, but more comprehensive sources are available in the literature (24).
Traditional missing-data techniques can be roughly grouped into two categories: deletion methods and imputation methods. LW, or complete case analysis, removes all incomplete records from the data set before analysis. In contrast, pairwise deletion (PW) attempts to use as much of the observed data as possible and removes cases on an analysis-by-analysis basis (e.g., each element in a covariance matrix is estimated using cases with complete data on the variable pair). Deletion methods appear to be the predominant techniques used in published studies (13,14) but are particularly problematic because they require MCAR data and will introduce bias when this assumption does not hold.
A number of single imputation methods exist in the literature, and these techniques vary in the magnitude of bias they introduce. Arithmetic mean imputation (AMI) fills in the data set by replacing missing values with the arithmetic mean of the complete cases and is perhaps the worst technique that one might choose because estimates of variation and covariation can be severely negatively biased. Regression imputation offers some improvement and replaces missing values with predicted scores from a linear regression equation. However, imputed values fall directly on a regression surface, so the imputed data will still lack residual variation that is present in the hypothetically complete data. Stochastic regression imputation (SRI) ameliorates this problem by adding a randomly sampled residual term from a normal distribution to each imputed value and is perhaps the best of the traditional missing-data techniques; unlike other techniques discussed here, SRI has desirable statistical properties under the less restrictive MAR assumption. Finally, last observation carried forward (LOCF) is a technique specific to longitudinal designs and is used regularly in clinical trials (14). LOCF replaces missing values with the last complete observation for case i. Despite its popularity in clinical trials, recent empirical studies have cautioned against the use of this technique (25,26) and have demonstrated its bias, even under MCAR (26).
DML Estimation
In the complete data context, ML estimation has been the estimator of choice for a number of widely used statistical procedures (e.g., SEM, multilevel models). ML estimation for missing data is also available in a number of software packages. Both SPSS and SAS produce ML estimates of a covariance matrix and mean vector using the EM (expectation maximization) algorithm (using the MVA procedure, and PROC MI, respectively), and all commercial SEM programs (e.g., Mplus; 27) allow the user to estimate a range of linear model analyses using DML estimation for missing data.
Briefly, the basic goal of ML estimation is to identify the population parameter values that are most likely to have produced a particular sample of data. The discrepancy between each cases data and the estimated parameters is quantified by the likelihood. The likelihood is conceptually similar to a probability because it provides an indication of how likely a particular score is to occur from a normal distribution with a particular set of parameter values. For computational simplicity, the natural logarithm of the likelihood is used during estimation, and the log likelihood is subsequently used to gauge the fit of the sample data to a set of estimated parameter values. ML estimation typically requires the use of iterative algorithms that "audition" different values for the unknown parameters and ultimately converge on the single set of parameter values that maximize the log likelihood (i.e., parameters are found that produce the highest probability or "match" to the observed data). Accessible overviews of ML estimation are available in the literature for those who are interested in more details (28,29).
As noted above, the fit of a particular set of parameter values to the raw data is quantified by the log likelihood. In this missing data context, the log likelihood is the sum of N equations, each of which captures the discrepancy between the raw data and the parameter estimates for which case i has complete data. Assuming the data are multivariate normally distributed, the log-likelihood for case i is
|
|
where xi is a vector of raw data, µi is the vector of mean estimates,
i is the estimated covariance matrix, and Ki is a constant that depends on the number of observed values for case i (i.e., [ni/2] log[2
]). The important thing to note about Equation 1 is that the size and content of the arrays change as a function of the missing data pattern for case i.
To illustrate, consider a simple situation where QOL is measured on three different occasions and it is of interest to estimate the mean vector and covariance matrix for these three variables. The arrays for cases that are missing the second QOL assessment would be as follows:
|
|
and the contribution to the sample log-likelihood for these cases (i.e., Equation 1 would be
|
|
Notice that the rows and columns corresponding to the missing variable (i.e., qol2) have been removed, and the discrepancy between the data and the parameters is based only on the variables that are complete for case i. In a similar vein, the arrays for cases that are missing the last two QOL assessments would contain only a single element (i.e., qol1i, µ1, and
211). As usual, parameter estimates (e.g., µ and
) are sought that maximize the log likelihood, which in this case is the summation of Equation 1 over the N cases.
Several important points should be made regarding DML estimation. Unlike MI, the missing values are not imputed. Rather, model parameters and standard errors are estimated directly using all observed data. Although Equation 1 assumes that the users goal is to estimate the mean vector and covariance matrix for a set of variables, it may generally be of interest to estimate a model where µ and
are expressed as a function of the model parameters (e.g., a regression model, SEM). DML is well suited for this case, and SEM software packages can be used to estimate many common models using DML. The use of SEM software is also ideal because it allows the user to incorporate auxiliary variables (30).
As noted previously, DML is not without its own assumptions. A clear advantage of DML is that it requires the less strict MAR assumption. Unfortunately, it is not possible to verify that a particular analysis satisfies MAR, because doing so would require knowledge of the missing values; only MCAR can be empirically tested from the data (31). This means that MAR-based analyses must generally proceed by adopting an important, albeit untestable, assumption. However, the plausibility of MAR may be bolstered by auxiliary information collected in follow-up interviews (32). On a related point, software packages that implement DML estimation may allow the user to choose whether standard errors are based on the observed or expected information matrix (a matrix based on the second derivatives of the log likelihood function). Standard errors should be based on the observed information matrix whenever possible because these estimates are also valid under MAR (33).
Finally, the derivation of the DML log likelihood function in Equation 1 is based on the multivariate normal density. As is true in the complete data context, DML parameter estimates are relatively unaffected by normality violations, but standard errors are negatively biased (2). Readers with previous exposure to SEM may be familiar with so-called robust fit statistics and standard errors. Robust statistics have been developed for the missing data context (34) and are available in some commercial SEM software packages (EQS 6.0, Mplus 3.0). These standard errors also require the MCAR assumption but appear to perform reasonably well under MAR (2). The bootstrap resampling procedure can also be used to obtain corrected standard errors with nonnormal data (35,36).
Incorporating Auxiliary Variables
Graham (30) proposed two methods for including auxiliary variables into a DML analysis, and the "saturated correlates" approach is outlined here. To illustrate, reconsider the earlier scenario where it was of interest to regress final QOL scores on a treatment indicator and initial QOL status. Furthermore, assume that age, education, and QOL scores from the second and third assessments were selected as auxiliary variables. Again, it may not be possible to identify useful auxiliary variables with any degree of certainty, but the inclusion of extra "junk" variables does not appear to negatively affect estimation (16).
Graham (30) outlined three formal rules for incorporating auxiliary variables into an analysis. When specifying the analysis model, the user must include correlations between an auxiliary variable and 1) other auxiliary variables, 2) observed exogenous (i.e., independent) variables, and 3) the residuals of any observed endogenous (i.e., dependent) variable (e.g., a criterion variable from a regression analysis, a manifest indicator of a latent variable in an SEM analysis). It is important to note that the model specification should always omit correlations between the auxiliary variables and the latent variables or the disturbance terms associated with latent variables.
The top panel of Figure 1 depicts the complete-data multiple regression model using path diagram conventions common to the SEM framework (e.g., rectangular boxes denote observed variables, straight arrows represent regression coefficients, curved arrows are correlations). The bottom panel of Figure 1 shows the saturated correlates model (30) and includes four auxiliary variables (age, education, and QOL scores from the second and third assessments). Graham (30) showed that the saturated correlates model incorporates information from the auxiliary variables but does so in a way that does not alter model fit (there is a net gain of zero degrees of freedom) or compromise the interpretation of the substantive parameters. As seen in the figure, the auxiliary variables do not directly influence the outcome measure itself but are correlated with the criterion variables residual term. As such, the auxiliary variables are not partialed out when estimating the influence of the substantive predictors and will only influence the regression coefficients by way of removing nonresponse bias that is explained by the auxiliary variables.
|
MI
The goal of MI imputation phase is to create m (e.g., 5 or 10) copies of the sample data, each of which is imputed with different estimates of the missing values. In the subsequent analysis phase, the desired statistical model is fit to each of the m complete data sets, and the resulting parameter estimates and standard errors are combined into a single inference using straightforward arithmetic rules (37). Although the process of analyzing multiple data sets and combining parameter estimates sounds tedious, several popular software packages have implemented routines that automate this process (e.g., SAS, Mplus, HLM). A number of MI algorithms have been proposed in the literature (38), but Schafers (39) data augmentation (DA) procedure is arguably the most popular and is the approach described below.
Imputation Phase
Schafers (39) DA algorithm involves an iterative chain that repeatedly cycles between two steps. The DA process begins with an initial estimate of the covariance matrix, typically an ML estimate obtained via the EM algorithm. The imputation (I) step replaces missing values with the predicted scores from a series of multiple regression equations. Because the imputed values fall directly on a regression surface, a random perturbation is added to each value by randomly sampling a residual from a normal distribution. A new covariance matrix and mean vector are subsequently computed from the complete data using standard formulae.
In order to create imputed data sets with different estimates of the missing values, random perturbations are also added to the regression equations used to generate the imputed values. This occurs in the posterior (P) step, where new covariance matrix and mean vector elements are randomly sampled from a posterior distribution that is conditional on the filled-in data from the previous I step. The newly constructed estimates of
and µ are carried forward to the next I step, where a new set of imputed values is generated from regression equations that differ slightly from those used in the previous iteration. This two-step procedure is repeated a large number of times, and imputed data sets are saved at specified intervals in the sequence.
In creating a set of imputed data files, it is necessary that the number of DA iterations (the number of I and P steps) be much larger than the desired number of imputed data sets, m. Because the imputed values from any two adjacent I steps will be correlated with one another, a number of iterations must separate the imputed data sets that are used in the subsequent analysis phase (these are sometimes referred to as between-imputation iterations). For example, in the analysis presented below, 200 between-imputation iterations were specified, such that an imputed data set was saved after every 200th I step. As a result, the m imputed data sets are filled with different plausible estimates of the missing values, and the imputed values are independent from one data set to the next. For somewhat different reasons, it is also necessary to allow a number of DA cycles to lapse before saving the first imputed data set (these are referred to as burn-in iterations). Determining the number of burn-in iterations can be complex and involves assessing the convergence (i.e., stability) of the sampled parameters from the P step. This process is discussed at length by Schafer (39) and is aided by graphical displays (time series and autocorrelation function plots) produced by MI software packages.
Analysis Phase
The imputation process creates m complete data sets, so standard statistical software can be used for all subsequent analyses. The statistical model of interest is fit to each of the m data sets, and the resulting parameter values and standard errors are subsequently combined into a single inference using straightforward arithmetic (37). A single point estimate for a given parameter is simply the arithmetic average of that parameter over the m analyses, as follows
|
|
where i is the parameter estimate from the ith data set, and m is the number of imputations.
Combining standard errors is slightly more complicated and involves two components. The within-imputation variance for a given parameter is defined as the arithmetic average of the m squared standard errors (i.e., the average of the m parameter variance estimates) and is given by
|
|
where Ûi is the squared standard error from the ith data set. The between-imputation variance is simply the variance of the parameter estimate across the m imputations, or
|
|
Finally, the combined standard error is obtained as a function of the within- and between-imputation variance, as shown below.
|
|
Individual parameter estimates can be tested for significance using the familiar t ratio (i.e.,
/S.E.), and simultaneous tests of multiple parameters can also be constructed (40). The standard degrees of freedom for the t test are complex and can actually exceed the sample size. Barnard and Rubin (41) proposed an adjusted degrees of freedom calculation that is generally considered to be superior to the standard values given in classic MI references. MI software packages may differ in the degrees of freedom that are reported, so the user should pay attention to this nuance.
Some final points should be made regarding MI. Like DML, MI assumes a multivariate normal model. However, studies have suggested that MI performs reasonably well even when data are nonnormal (7), and Schafer (39) argued that the normal model can be used with nominal and ordinal variables. Normalizing transformations can also be used before imputation, and variables can subsequently be restored to their original metrics using reverse transformations. Finally, like DML, MI assumes a MAR mechanism and will produce unbiased estimates when data are either MCAR or MAR.
Incorporating Auxiliary Variables
An advantage of MI is that auxiliary variables can readily be incorporated into the imputation phase simply by specifying additional predictors in the imputation model. Because the imputed values are conditioned on any auxiliary variables in the imputation model, these extraneous variables need not be included in the subsequent analysis phase. Schafer (39, p. 143) discussed the selection of variables for the imputation model and recommended that the imputation process include variables that are 1) related to the variable being imputed and 2) potentially related to missingness on that variable. In addition, the imputation model should be at least as general as the analysis model. For example, if the researcher intends to examine an interaction effect in a subsequent analysis, this interaction should also be included in the imputation model, or the interaction effect may be biased toward zero; this point underscores the need for careful planning when choosing the variables that are to be included in the imputation model.
Example Analyses
A small set of artificial data (N = 300) was generated in order to illustrate the application of DML and MI. The data were generated to mimic the hypothetical QOL study described earlier in the manuscript. Briefly, four artificial QOL scores were generated such that QOL declined across the four assessments. In order to simulate a randomized trial, a binary treatment variable was created that was uncorrelated with QOL scores at the initial assessment but was systematically related to the decline in QOL over time (e.g., the "treatment" group experienced less rapid decline in QOL scores across the four assessments). The differential growth rates produced significant group differences at the final QOL assessment, such that the treatment group exhibited significantly higher QOL scores than the control group. Finally, two auxiliary variables, age and education, were generated to have modest correlations with QOL scores; age was negatively correlated with QOL, and education had a positive correlation (r
0.35 and 0.20, respectively). In order to mimic a randomized trial, the auxiliary variables were unrelated to treatment group membership, within sampling error.
Missing values were imposed on the final three QOL scores in an MAR fashion and were systematically related to age, education, initial QOL status, and treatment group membership. Specifically, the probability of missing data was highest for cases with high age values, low education scores, and low initial QOL values. Also, missingness was systematically related to treatment group membership such that the missing data rate for the control group was twice as high as that for the treatment group (40% of the QOL scores for the control were missing, whereas 20% of the scores for the treatment group were deleted). The deletion process was executed separately for the three QOL scores, resulting in a general pattern of missing data (e.g., some cases sporadically missed assessments, whereas others permanently "dropped out" at a particular wave).
To illustrate the application of DML and MI, a linear growth curve analysis was used to model the development of QOL scores across the four assessments. The growth curve model describes change for each case using two parameters, the intercept (i.e., estimated QOL scores at the final assessment) and slope (i.e., change in QOL scores between each assessment). It is often the case that the growth model intercept is defined as initial status, or the expected score at baseline. However, these data were constructed to mimic a randomized trial where the treatment groups do not differ at the baseline assessment. In this case, it may be of substantive interest to assess group differences at the end of the study. The choice of location for the intercept does not affect the fit of the model, nor does it alter the interpretation of the slope.
The intercept and slope values from the growth model were regressed on treatment group membership in order to assess 1) mean differences at the final assessment and 2) the differential rates of change for the two groups, respectively. The DML analyses were performed using Mplus 3.13, and MI was implemented using the SAS MI procedure. The syntax for these analyses is given in the Appendices, and the artificial data set is available on request. For comparison purposes, the analyses were also performed using LW, LOCF, and AMI.
DML Analysis
The DML analysis was performed using two auxiliary variables, age and education. Although missingness was, in part, related to two variables in the analysis model (treatment group membership and initial QOL scores), the analyst will typically be blind to the true cause of the missing values. As such, it is important to illustrate an inclusive analysis strategy that applies the saturated correlates approach (30). Again, it may not be possible to identify useful auxiliary variables with any degree of certainty, but a liberal approach may be justified when considering variables to include in the auxiliary portion of the model (16). The latent growth curve model that includes age and education as auxiliary variables is depicted in Figure 2, and the syntax for this model given in Appendix A. Note that all three of Grahams rules are applied in this case.
|
Selected results from the growth curve analysis are given in Table 1, and the complete-data results are also presented for comparison purposes. There are four key parameters of interest in this analysis: the intercept mean (i.e., final QOL mean for the control group), the slope mean (i.e., the rate of change for the control group), the regression of the intercept on the treatment indicator (i.e., the mean difference at the final QOL assessment), and the regression of the slopes on the treatment indicator (i.e., the difference in the rates of change for the two groups). The DML estimates were quite close to those from the complete-data analysis and differed by less than one half the value of each parameters standard error. The column labeled "Intercept on TX" gives the estimated mean difference at the final QOL assessment. As seen in the table, the estimated mean difference (4.02) was quite similar to that from the complete-data analysis (3.71).
|
In contrast, the estimates from the traditional methods were substantially biased. For example, the LW estimate of the control group mean at wave 4 (i.e., the intercept) was 45.21 versus 42.72 for the complete data. This bias is more than four times larger than the parameters standard error (0.58). Although other LW parameters appear to be more accurate, the cumulative bias resulted in a distorted picture of the growth processes for the two groups. To illustrate, the top panel of Figure 3 shows the mean growth trajectories for DML and LW (the complete data trajectories were virtually identical to those of DML and were omitted from the figure). As seen in the figure, LW suggested the presence of group differences at baseline, even though none existed. Although the LW growth curve slopes were relatively accurate, the mean difference at the final assessment was too small. In a similar vein, the bottom panel of Figure 3 shows the estimated growth trajectories for DML and LOCF. These results are even more problematic because LOCF clearly suggests that no treatment effect existed at all; the treatment group was substantially higher than the control at baseline, and the group differences persisted across the duration of the study.
|
MI Analysis
The SAS MI procedure was used to generate m = 10 imputed data sets, the SAS syntax for which is given in Appendix B. The imputation model included seven variables (the four QOL scores, age, education, and the treatment group indicator). Because the analysis models did not require interaction terms, no higher-order effects were used in the imputation model. Consistent with Schafers (39, p. 148) recommendations, imputed values were rounded to the nearest integer (using the ROUND keyword) because this was consistent with the original metric of the variables. A range for the imputed values was also specified using the MINIMUM and MAXIMUM keywords; if an imputed value falls outside this range, a new imputed value is generated.
Although the syntax in Appendix B does not reflect this, preliminary analyses were conducted in order to assess the convergence behavior of the DA algorithm and to determine the number of burn-in and between-imputation iterations. Graphic displays (time series and autocorrelation function plots) were used for this purpose and are discussed in detail by Schafer (39); examples of these plots are also given in an accessible paper by Schafer and Olsen (42).
A total of m = 10 imputed data sets were created for further analysis (specified using the NIMPUTE keyword). I opted for a very conservative approach and specified a total of 500 burn-in iterations (the first imputed data set was saved after 500 DA iterations had lapsed) and 200 between-imputation iterations (data sets were saved after every 200th iteration thereafter). These two options are specified using the NBITER and NITER keywords, respectively. The process of creating the 10 imputed data sets took just a few seconds of computer time. The 10 imputed data sets are stacked in a single file (the name of which is specified by the OUT keyword), and an index variable named _Imputation_ is written to this file than ranges between 1 and m.
To further illustrate, Table 2 gives the imputed values of qol4 (the final QOL assessment) for the first 10 cases; for brevity, the estimated values are shown for the first five imputed data sets. As seen in the table, cases with complete data (e.g., case 1) have identical values from one data set to the next. In contrast, cases with missing values have data values that vary across the m imputed data sets. As described previously, the imputed values are essentially predicted scores from a regression equation that have been augmented with a random residual term, and subsequently rounded to the nearest integer. The estimated values differ from one data set to the next because of the addition of the random residual, but also because the regression coefficients used to generate the predicted scores vary slightly across data sets.
|
The use of SAS is convenient because MIs can be generated and analyzed within the same program using the MIANALYZE procedure. For example, I could have performed the growth curve analysis using PROC MIXED and subsequently combined the parameter estimates and standard errors using PROC MIANALYZE. However, I chose to analyze the multiply imputed data sets using Mplus. Like other popular software packages (e.g., Lisrel, HLM), Mplus automates the process of analyzing and combining parameter estimates from multiply imputed data sets. The syntax for the analysis is given in Appendix C. When using Mplus, the TYPE = IMPUTATION subcommand is used to specify the input of multiply imputed data sets. Instead of specifying the location of a single data file, the FILE subcommand is used to specify a file containing the names of the imputed data sets (i.e., a text file with 10 rows, each row containing the name of one imputed data set, e.g., imp1.dat, imp2.dat, ..., imp10.dat). Mplus subsequently fits the specified model to each of the imputed data sets and combines the parameter estimates and standard errors.
Results from the growth curve analysis are given in Table 1. Consistent with DML, the MI growth model estimates were quite similar to those of the complete data; the two sets of estimates generally differed by less than 1 SE unit. The estimated mean difference at the final QOL assessment was 4.28 compared with 3.71 from the complete-data analysis.
Traditional Analyses
Table 1 also contains the results from LW, AMI, and LOCF. As seen in the tables, the traditional methods produced estimates that differed considerably from those of the complete-data analysis. For example, the traditional methods underestimated the mean difference from the growth model analysis and also underestimated the variance estimates of the intercepts and slopes. Although caution is warranted when generalizing the results from a single artificial data set, these analyses are certainly consistent with previous computer simulation studies that have demonstrated the superiority of DML and MI (110).
| DISCUSSION |
|---|
|
|
|---|
Some researchers appear dubious of DML and MI because they hold the misperception that these procedures are "making up data." As discussed previously, DML does not impute missing values at all but estimates parameters directly from the observed data. Although MI does impute missing values, it does so in a principled fashion and should be viewed as a computational tool that improves the accuracy of the resulting parameter estimates. It is important to remember that the goal of any statistical analysis is to estimate a set of population parameters, so "filling in" the missing values is not problematic, provided that doing so produces accurate estimates. One of the key points from Rubins (15) theoretical work (24) is that likelihood-based approaches (DML and MI) should yield the same answer as a complete data analysis, provided that the data are MAR. This point should alleviate concerns that the analyst is somehow "cheating" when using DML or MI.
After illustrating the use of DML and MI, the reader might be left wondering, "Which method should I use?" To better answer this question, a brief review of the pros and cons of each method is warranted. As noted previously, DML is widely available in all commercial SEM software packages, a benefit that may weigh heavily for users who have previous exposure to SEM. The inclusion of auxiliary variables is somewhat awkward but can be accomplished without much additional effort (e.g., see Appendix A). Perhaps the biggest drawback associated with DML is the availability of an estimation routine. Although the number and types of models that can be estimated using DML have grown substantially in recent years, there are many analyses that cannot be estimated using current SEM software and would require custom computer code.
The flexibility of MI is probably its most attractive feature. Software for creating multiply imputed data sets is readily available, and no special analysis software is required. For example, two SAS procedures, PROC MI and PROC MIANALYZE, were implemented in version 8, and Schafers NORM package (43) can be freely downloaded from www.stat.psu.edu/~jls/misoftwa.html. The ease with which auxiliary variables can be incorporated into the imputation model is also a benefit because auxiliary variables can be ignored in all subsequent analyses. The primary detriment associated with MI is its complexity. Compared with DML, MI can be more labor intensive. The availability of software routines for combining parameter estimates certainly reduces the burden on the analyst, but it could reasonably be argued that MI requires more knowledge on the part of the user.
In summary, DML and MI require identical assumptions (multivariate normality and MAR missingness) and will frequently produce very similar parameter estimates, particularly if the same auxiliary variables are used in both models. As such, the primary considerations appear to be 1) the availability of a DML routine for the analysis of interest, 2) the ease of incorporating auxiliary variables, and 3) the overall ease of use. In the end, choosing between these two methods is probably a matter of personal preference and convenience. Given the advantages associated with these techniques, behavioral medicine researchers should strongly consider the use of DML and MI in future studies.
| Appendix 1 |
|---|
|
|
|---|
data:
file is 'c:/qol.dat';
variable:
names are qol1 qol2 qol3 qol4 age ed tx;
usevariables are qol1 tx;
missing are all (99);
analysis:
type = missing h1;
model:
! basic growth model syntax;
i s | qol1@3 qol2@2 qol3@1 qol4@0;
[qol1 qol4@0];
[i s];
qol1 qol4;
i s;
i s on tx;
! auxiliary variables;
age ed;
age with ed;
age ed with tx;
age ed with qol1 qol4;
output:
sampstat standardized;
Appendix B. SAS multiple imputation syntax.
libname c 'c:/ ';
data c.qol;
infile 'c:/qol.dat';
input qol1 qol2 qol3 qol4 age ed tx;
/* CHANGE MISSING VALUE CODE FROM 9 TO. */
array x[7] qol1m qol2m qol3m qol4m age ed tx;
do i = 1 to 7;
if x [i] = 99 then x[i] = .;
end;
drop i;
run;
/* CREATE M = 10 IMPUTED DATA SETS */
/* NIMPUTE SPECIFIES THE NUMBER OF IMPUTED DATA SETS */
/* THE MINIMUM AND MAXIMUM OPTION PROVIDES RANGES FOR IMPUTED VALUES */
/* ROUND = 1 ROUNDS IMPUTED VALUES TO NEAREST INTEGER */
/* NBITER SPECIFIES THE NUMBER OF BURN IN ITERATIONS */
/* NITER SPECIFIES THE NUMBER OF BETWEEN- IMPUTATION CYCLES */
proc mi data = c.qol out = c.qolmi seed = 56789 nimpute = 10 minimum = 50 8 0 0 0 0 0 maximum = 99 30 1 100 100 100 round = 1 1 1 1 1 1 1;
var qol1 qol2 qol3 qol4 age ed tx;
mcmc nbiter = 500 niter = 200;
run;
Appendix C. Mplus syntax for a growth curve using multiply imputed data.
data:
file is 'c:/imputednames.dat';
type = imputation;
variable:
names are qol1 qol2 qol3 qol4 age ed tx;
usevariables are qol1 qol4 tx;
model:
! basic growth model syntax;
i s | qol1@3 qol2@2 qol3@1 qol4@0;
[qol1 qol4@0];
[i s];
qol1 qol4;
i s;
i s on tx;
| NOTES |
|---|
|
|
|---|
Received for publication November 7, 2005; revision received March 2, 2006.
DOI:10.1097/01.psy.0000221275.75056.d8
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
J. Huang and C. H. C. Hsu The Impact of Customer-to-Customer Interaction on Cruise Experience and Vacation Satisfaction Journal of Travel Research, February 1, 2010; 49(1): 79 - 92. [Abstract] [PDF] |
||||
![]() |
C. V. Igna, J. Julkunen, H. Vanhanen, P. Keskivaara, and M. Verkasalo Depressive Symptoms and Serum Lipid Fractions in Middle-Aged Men: Physiologic and Health Behavior Links Psychosom Med, November 1, 2008; 70(9): 960 - 966. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. M. Phillips, M. H. Antoni, S. C. Lechner, B. B. Blomberg, M. M. Llabre, E. Avisar, S. Gluck, R. DerHagopian, and C. S. Carver Stress Management Intervention Reduces Serum Cortisol and Increases Relaxation During Treatment for Nonmetastatic Breast Cancer Psychosom Med, November 1, 2008; 70(9): 1044 - 1049. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. J. Contrada, D. A. Boulifard, E. L. Idler, T. J. Krause, and E. W. Labouvie Course of Depressive Symptoms in Patients Undergoing Heart Surgery: Confirmatory Analysis of the Factor Pattern and Latent Mean Structure of the Center for Epidemiologic Studies Depression Scale Psychosom Med, November 1, 2006; 68(6): 922 - 930. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |