Psychosomatic Medicine
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF)
Right arrow An erratum has been published
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrowRequest Permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Enders, C. K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Enders, C. K.
Related Collections
Right arrow Statistical Corner
Psychosomatic Medicine 68:427-436 (2006)
© 2006 American Psychosomatic Society


STATISTICAL CORNER

A Primer on the Use of Modern Missing-Data Methods in Psychosomatic Medicine Research

Craig K. Enders, PhD

From Arizona State University, Tempe, Arizona.

Address correspondence and reprint requests to Craig K. Enders, Arizona State University, Department of Psychology, P.O. Box 871104, Tempe, AZ 85287-1104. E-mail: cenders{at}asu.edu


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 
This paper summarizes recent methodologic advances related to missing data and provides an overview of two "modern" analytic options, direct maximum likelihood (DML) estimation and multiple imputation (MI). The paper begins with an overview of missing data theory, as explicated by Rubin. Brief descriptions of traditional missing data techniques are given, and DML and MI are outlined in greater detail; special attention is given to an "inclusive" analytic strategy that incorporates auxiliary variables into the analytic model. The paper concludes with an illustrative analysis using an artificial quality of life data set. Computer code for all DML and MI analyses is provided, and the inclusion of auxiliary variables is illustrated.

Key Words: missing data • full information maximum likelihood • direct maximum likelihood • maximum likelihood • multiple imputation • attrition

Abbreviations: DML = direct maximum likelihood; MI = multiple imputation; ML = maximum likelihood; LW = listwise deletion; AMI = arithmetic mean imputation; SRI = stochastic regression imputation; DA = data augmentation; QOL = quality of life; MAR = missing at random; MCAR = missing completely at random; MNAR = missing not at random; LOCF = last observation carried forward.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 
Missing data are ubiquitous in psychosomatic medicine research and represent a considerable nuisance. Researchers have traditionally relied on methods that attempt to fix the data before analysis either by discarding incomplete records or by filling in (i.e., imputing) the missing values with seemingly suitable replacements. Unfortunately, most of these "traditional" analytic techniques tend to work well in a limited set of circumstances and are prone to estimation bias when a rather strict assumption about the missing data (referred to as missing completely at random, MCAR) does not hold. Even when traditional techniques are unbiased, they tend to be less powerful (i.e., efficient) than "modern" missing data methods, direct maximum likelihood (DML; also referred to as full information maximum likelihood [ML] or raw ML in the literature) and multiple imputation (MI).

DML and MI have received considerable attention in the methodologic literature during the last 20 years but have not yet enjoyed widespread use in psychology and medicine. These techniques are theoretically appealing because they require a weaker, and probably more realistic, assumption about the missing data. In practice, this means that DML and MI should produce unbiased and efficient parameter estimates in situations where traditional methods will not. Indeed, a growing body of empirical studies has demonstrated that DML and MI are generally superior to the traditional missing data techniques that are still prevalent in the applied literature (1–10). Although DML and MI are not without their own shortcomings, these methods are currently considered to be the methodologic "state of the art" (11) and are the focus of this manuscript.

In a recent issue of APA’s Monitor on Psychology, Stephen West, editor of Psychological Methods, stated that, "Routine implementation of these new methods of addressing missing data [DML and MI] will be one of the major changes in research over the next decade" (12, p. 70). However, recent reviews of published studies suggest that authors still rely primarily on techniques that are frowned on in the methodologic literature (13,14). If the routine use of DML and MI represents a fundamental shift in research methodology in the coming decade, it is important that researchers gain exposure to these analytic methods because a substantial gap appears to exist between the analytic methods suggested in the literature and those used in published behavioral science journals. As such, the goal of this manuscript is to summarize recent methodologic advances related to missing data and to provide readers with an overview of analytic options for handling missing data.

The manuscript begins with a brief overview of Rubin’s (15) missing data mechanisms because this provides a basis for understanding the theoretical advantages of DML and MI. Next, a detailed overview of DML and MI is provided, as are brief descriptions of some popular traditional methods. In discussing DML and MI, special attention is given to "inclusive" analysis strategies (16) that incorporate auxiliary variables into the analysis; as defined below, auxiliary variables are extraneous to one’s substantive hypotheses but may be potential causes or correlates of missingness. Finally, an example analysis is conducted so that researchers may gain some appreciation for the ease with which DML and MI can be used to handle missing data in their own research.

For the remainder of the manuscript, I rely on a hypothetical study of quality-of-life (QOL) measurements from a clinical trial of cancer patients. Specifically, suppose that it were of interest to collect four bimonthly assessments of QOL in a clinical trial for a new chemotherapy drug, such that patients were randomly assigned to either the treatment or control condition. In addition to the four QOL assessments, a number of auxiliary variables were collected at the onset of the study (e.g., health status, physical functioning, age, education). Furthermore, suppose that a stipulation for participation in the study required all patients to be present for the intake assessment, so complete data were available for the first QOL assessment, as well as for the set of auxiliary variables. The concepts presented in this manuscript will be explored in the context of this hypothetical research scenario.

Missing Data Theory
The use of DML and MI is supported by a growing number of empirical studies but also has a strong foundation in statistical theory. Rubin’s seminal article (15) outlines different "mechanisms" that lead to missing data. These missing data mechanisms may be viewed as probabilistic explanations for how missing values are related to other measured variables in a data set. From a slightly different perspective, Rubin’s mechanisms may be viewed as assumptions that dictate the performance of a given missing data technique. Although it may not be readily apparent, virtually every missing-data handling approach requires the analyst to adopt certain assumptions about the "cause" of the missing values (i.e., the relationship between missingness and the data). As discussed below, traditional methods such as listwise (LW) and pairwise deletion tend to require the most strict (and probably unrealistic) assumption, but even DML and MI are not without their own tenuous assumptions.

According to Rubin (15), data are missing at random (MAR) if the probability of a missing value on variable x is unrelated to the values of x itself. In the context of the previous example, this means that the probability of missing a QOL score is unrelated to QOL at that particular assessment. This terminology is somewhat unfortunate because MAR does not mean that the process governing missing values is random in the same sense as a coin toss. Rather, MAR means that missingness is systematically related to measured values in the data set. For example, suppose that it was known that elderly patients and patients with lower education have a higher propensity for missing QOL scores (17). This scenario would satisfy MAR as long as no residual relationship existed between QOL scores and missingness once age and education were controlled (i.e., the observed data are a random sample of cases belonging to the same age and education strata).

The MCAR mechanism is a special case of MAR, with the additional restriction that missingness is unrelated to the observed data. That is, the observed data are a random sample of observations from the hypothetically complete data set. Returning to the QOL example, suppose that the administration of questionnaires is inadvertently overlooked for a small group of participants. Similarly, suppose that a questionnaire battery could not be administered due of a scheduling conflict. These examples would qualify as MCAR as long as the assessment difficulties were unrelated to measured variables (e.g., administration problems were not systematically related to a patient’s treatment site, health status, etc.). Some authors have argued that MCAR is a strict assumption that is rarely met in practice (8,18). This is an important point because, with the exception of a few esoteric cases (19), traditional missing data approaches will only yield unbiased estimates when MCAR holds; DML and MI are unbiased and efficient under MCAR and MAR data.

The last of Rubin’s (15) missing data mechanisms is termed missing not at random (MNAR). Data are MNAR if the probability of missing data is systematically related to the values that are missing. Returning to the previous example, suppose that QOL scores are missing for patients whose health status had deteriorated to the point that they could no longer participate in the study (i.e., the missing QOL score is related to the patient’s poor QOL at that particular assessment). In this case, DML and MI would also produce biased parameter estimates. MNAR missing data methods have been proposed in the literature (20,21) but are prone to substantial bias when the missing data model is not properly specified. As such, I chose to focus on methods that assume MAR in this manuscript (DML and MI) because these techniques currently represent the "practical state of the art" (11, p. 173).

The MAR condition is sometimes referred to as ignorable missingness because unbiased parameter estimates can be obtained via DML or MI without the need to incorporate an explicit model that explains why the data are missing. A subtle but important point to remember is that missing data mechanisms are not inherent characteristics of a data set. Within the same data set, one analysis may satisfy the MAR condition, while another analysis could be described as MNAR. To illustrate, I previously raised the possibility that missing QOL scores were systematically related to age and education. For simplicity, assume that these were the only variables related to missingness. Suppose that the primary analytic interest was to regress final QOL scores on a treatment indicator (0 = control, 1 = treatment) and initial QOL status (i.e., an ANCOVA model). Even if the analyst used DML to conduct the regression analysis, this situation would be described as MNAR because the parameter estimates were not conditioned on the variables related to missingness (i.e., age and education were not included in the analysis model). As such, it is likely that the estimated treatment effect would contain at least some bias due to the missing data.

Fortunately, there are established techniques for incorporating auxiliary variables such as age and education into DML and MI analyses. An auxiliary variable can be defined as a variable that is extraneous to one’s substantive hypotheses but may be 1) a potential cause or correlate of missingness, or 2) a correlate of the variable that contains missing values. The inclusion of auxiliary variables has been referred to as an "inclusive" analysis strategy (16) and has been recommended in order to make the MAR condition more plausible and to increase the efficiency of the resulting estimates (22,23). Although it may be difficult to identify useful auxiliary variables with any degree of certainty, it appears that researchers can cast a wide net when selecting a set of auxiliary variables. Previous simulation results suggested that the inclusion of a relatively large number of "junk" variables does not appear to negatively affect the resulting parameter estimates (16).

Traditional Missing-Data Techniques
Before outlining the two "modern" missing-data methods, a brief overview of some traditional missing-data techniques is warranted. The focus on DML and MI precludes an exhaustive review of these methods, but more comprehensive sources are available in the literature (24).

Traditional missing-data techniques can be roughly grouped into two categories: deletion methods and imputation methods. LW, or complete case analysis, removes all incomplete records from the data set before analysis. In contrast, pairwise deletion (PW) attempts to use as much of the observed data as possible and removes cases on an analysis-by-analysis basis (e.g., each element in a covariance matrix is estimated using cases with complete data on the variable pair). Deletion methods appear to be the predominant techniques used in published studies (13,14) but are particularly problematic because they require MCAR data and will introduce bias when this assumption does not hold.

A number of single imputation methods exist in the literature, and these techniques vary in the magnitude of bias they introduce. Arithmetic mean imputation (AMI) fills in the data set by replacing missing values with the arithmetic mean of the complete cases and is perhaps the worst technique that one might choose because estimates of variation and covariation can be severely negatively biased. Regression imputation offers some improvement and replaces missing values with predicted scores from a linear regression equation. However, imputed values fall directly on a regression surface, so the imputed data will still lack residual variation that is present in the hypothetically complete data. Stochastic regression imputation (SRI) ameliorates this problem by adding a randomly sampled residual term from a normal distribution to each imputed value and is perhaps the best of the traditional missing-data techniques; unlike other techniques discussed here, SRI has desirable statistical properties under the less restrictive MAR assumption. Finally, last observation carried forward (LOCF) is a technique specific to longitudinal designs and is used regularly in clinical trials (14). LOCF replaces missing values with the last complete observation for case i. Despite its popularity in clinical trials, recent empirical studies have cautioned against the use of this technique (25,26) and have demonstrated its bias, even under MCAR (26).

DML Estimation
In the complete data context, ML estimation has been the estimator of choice for a number of widely used statistical procedures (e.g., SEM, multilevel models). ML estimation for missing data is also available in a number of software packages. Both SPSS and SAS produce ML estimates of a covariance matrix and mean vector using the EM (expectation maximization) algorithm (using the MVA procedure, and PROC MI, respectively), and all commercial SEM programs (e.g., Mplus; 27) allow the user to estimate a range of linear model analyses using DML estimation for missing data.

Briefly, the basic goal of ML estimation is to identify the population parameter values that are most likely to have produced a particular sample of data. The discrepancy between each case’s data and the estimated parameters is quantified by the likelihood. The likelihood is conceptually similar to a probability because it provides an indication of how likely a particular score is to occur from a normal distribution with a particular set of parameter values. For computational simplicity, the natural logarithm of the likelihood is used during estimation, and the log likelihood is subsequently used to gauge the fit of the sample data to a set of estimated parameter values. ML estimation typically requires the use of iterative algorithms that "audition" different values for the unknown parameters and ultimately converge on the single set of parameter values that maximize the log likelihood (i.e., parameters are found that produce the highest probability or "match" to the observed data). Accessible overviews of ML estimation are available in the literature for those who are interested in more details (28,29).

As noted above, the fit of a particular set of parameter values to the raw data is quantified by the log likelihood. In this missing data context, the log likelihood is the sum of N equations, each of which captures the discrepancy between the raw data and the parameter estimates for which case i has complete data. Assuming the data are multivariate normally distributed, the log-likelihood for case i is



Formula 1

where xi is a vector of raw data, µi is the vector of mean estimates, {Sigma}i is the estimated covariance matrix, and Ki is a constant that depends on the number of observed values for case i (i.e., –[ni/2] log[2{pi}]). The important thing to note about Equation 1 is that the size and content of the arrays change as a function of the missing data pattern for case i.

To illustrate, consider a simple situation where QOL is measured on three different occasions and it is of interest to estimate the mean vector and covariance matrix for these three variables. The arrays for cases that are missing the second QOL assessment would be as follows:



Formula 1

and the contribution to the sample log-likelihood for these cases (i.e., Equation 1 would be



Formula 2

Notice that the rows and columns corresponding to the missing variable (i.e., qol2) have been removed, and the discrepancy between the data and the parameters is based only on the variables that are complete for case i. In a similar vein, the arrays for cases that are missing the last two QOL assessments would contain only a single element (i.e., qol1i, µ1, and {sigma}211). As usual, parameter estimates (e.g., µ and {Sigma}) are sought that maximize the log likelihood, which in this case is the summation of Equation 1 over the N cases.

Several important points should be made regarding DML estimation. Unlike MI, the missing values are not imputed. Rather, model parameters and standard errors are estimated directly using all observed data. Although Equation 1 assumes that the user’s goal is to estimate the mean vector and covariance matrix for a set of variables, it may generally be of interest to estimate a model where µ and {Sigma} are expressed as a function of the model parameters (e.g., a regression model, SEM). DML is well suited for this case, and SEM software packages can be used to estimate many common models using DML. The use of SEM software is also ideal because it allows the user to incorporate auxiliary variables (30).

As noted previously, DML is not without its own assumptions. A clear advantage of DML is that it requires the less strict MAR assumption. Unfortunately, it is not possible to verify that a particular analysis satisfies MAR, because doing so would require knowledge of the missing values; only MCAR can be empirically tested from the data (31). This means that MAR-based analyses must generally proceed by adopting an important, albeit untestable, assumption. However, the plausibility of MAR may be bolstered by auxiliary information collected in follow-up interviews (32). On a related point, software packages that implement DML estimation may allow the user to choose whether standard errors are based on the observed or expected information matrix (a matrix based on the second derivatives of the log likelihood function). Standard errors should be based on the observed information matrix whenever possible because these estimates are also valid under MAR (33).

Finally, the derivation of the DML log likelihood function in Equation 1 is based on the multivariate normal density. As is true in the complete data context, DML parameter estimates are relatively unaffected by normality violations, but standard errors are negatively biased (2). Readers with previous exposure to SEM may be familiar with so-called robust fit statistics and standard errors. Robust statistics have been developed for the missing data context (34) and are available in some commercial SEM software packages (EQS 6.0, Mplus 3.0). These standard errors also require the MCAR assumption but appear to perform reasonably well under MAR (2). The bootstrap resampling procedure can also be used to obtain corrected standard errors with nonnormal data (35,36).

Incorporating Auxiliary Variables
Graham (30) proposed two methods for including auxiliary variables into a DML analysis, and the "saturated correlates" approach is outlined here. To illustrate, reconsider the earlier scenario where it was of interest to regress final QOL scores on a treatment indicator and initial QOL status. Furthermore, assume that age, education, and QOL scores from the second and third assessments were selected as auxiliary variables. Again, it may not be possible to identify useful auxiliary variables with any degree of certainty, but the inclusion of extra "junk" variables does not appear to negatively affect estimation (16).

Graham (30) outlined three formal rules for incorporating auxiliary variables into an analysis. When specifying the analysis model, the user must include correlations between an auxiliary variable and 1) other auxiliary variables, 2) observed exogenous (i.e., independent) variables, and 3) the residuals of any observed endogenous (i.e., dependent) variable (e.g., a criterion variable from a regression analysis, a manifest indicator of a latent variable in an SEM analysis). It is important to note that the model specification should always omit correlations between the auxiliary variables and the latent variables or the disturbance terms associated with latent variables.

The top panel of Figure 1 depicts the complete-data multiple regression model using path diagram conventions common to the SEM framework (e.g., rectangular boxes denote observed variables, straight arrows represent regression coefficients, curved arrows are correlations). The bottom panel of Figure 1 shows the saturated correlates model (30) and includes four auxiliary variables (age, education, and QOL scores from the second and third assessments). Graham (30) showed that the saturated correlates model incorporates information from the auxiliary variables but does so in a way that does not alter model fit (there is a net gain of zero degrees of freedom) or compromise the interpretation of the substantive parameters. As seen in the figure, the auxiliary variables do not directly influence the outcome measure itself but are correlated with the criterion variable’s residual term. As such, the auxiliary variables are not partialed out when estimating the influence of the substantive predictors and will only influence the regression coefficients by way of removing nonresponse bias that is explained by the auxiliary variables.


Figure 112
View larger version (23K):
[in this window]
[in a new window]
 
Figure 1. Path diagram representation of the hypothetical regression model. The top panel of the Figure represents the standard multiple regression model with two predictors. The bottom panel of the figure represents the same regression model but includes four auxiliary variables (qol2, qol3, age, ed). Note that the auxiliary variables are correlated with (a) each other, (b) the substantive predictors (qol1 and tx), and (c) the residual term.

 

MI
The goal of MI imputation phase is to create m (e.g., 5 or 10) copies of the sample data, each of which is imputed with different estimates of the missing values. In the subsequent analysis phase, the desired statistical model is fit to each of the m complete data sets, and the resulting parameter estimates and standard errors are combined into a single inference using straightforward arithmetic rules (37). Although the process of analyzing multiple data sets and combining parameter estimates sounds tedious, several popular software packages have implemented routines that automate this process (e.g., SAS, Mplus, HLM). A number of MI algorithms have been proposed in the literature (38), but Schafer’s (39) data augmentation (DA) procedure is arguably the most popular and is the approach described below.

Imputation Phase
Schafer’s (39) DA algorithm involves an iterative chain that repeatedly cycles between two steps. The DA process begins with an initial estimate of the covariance matrix, typically an ML estimate obtained via the EM algorithm. The imputation (I) step replaces missing values with the predicted scores from a series of multiple regression equations. Because the imputed values fall directly on a regression surface, a random perturbation is added to each value by randomly sampling a residual from a normal distribution. A new covariance matrix and mean vector are subsequently computed from the complete data using standard formulae.

In order to create imputed data sets with different estimates of the missing values, random perturbations are also added to the regression equations used to generate the imputed values. This occurs in the posterior (P) step, where new covariance matrix and mean vector elements are randomly sampled from a posterior distribution that is conditional on the filled-in data from the previous I step. The newly constructed estimates of {Sigma} and µ are carried forward to the next I step, where a new set of imputed values is generated from regression equations that differ slightly from those used in the previous iteration. This two-step procedure is repeated a large number of times, and imputed data sets are saved at specified intervals in the sequence.

In creating a set of imputed data files, it is necessary that the number of DA iterations (the number of I and P steps) be much larger than the desired number of imputed data sets, m. Because the imputed values from any two adjacent I steps will be correlated with one another, a number of iterations must separate the imputed data sets that are used in the subsequent analysis phase (these are sometimes referred to as between-imputation iterations). For example, in the analysis presented below, 200 between-imputation iterations were specified, such that an imputed data set was saved after every 200th I step. As a result, the m imputed data sets are filled with different plausible estimates of the missing values, and the imputed values are independent from one data set to the next. For somewhat different reasons, it is also necessary to allow a number of DA cycles to lapse before saving the first imputed data set (these are referred to as burn-in iterations). Determining the number of burn-in iterations can be complex and involves assessing the convergence (i.e., stability) of the sampled parameters from the P step. This process is discussed at length by Schafer (39) and is aided by graphical displays (time series and autocorrelation function plots) produced by MI software packages.

Analysis Phase
The imputation process creates m complete data sets, so standard statistical software can be used for all subsequent analyses. The statistical model of interest is fit to each of the m data sets, and the resulting parameter values and standard errors are subsequently combined into a single inference using straightforward arithmetic (37). A single point estimate for a given parameter is simply the arithmetic average of that parameter over the m analyses, as follows



Formula 3

where i is the parameter estimate from the ith data set, and m is the number of imputations.

Combining standard errors is slightly more complicated and involves two components. The within-imputation variance for a given parameter is defined as the arithmetic average of the m squared standard errors (i.e., the average of the m parameter variance estimates) and is given by



Formula 4

where Ûi is the squared standard error from the ith data set. The between-imputation variance is simply the variance of the parameter estimate across the m imputations, or



Formula 5

Finally, the combined standard error is obtained as a function of the within- and between-imputation variance, as shown below.



Formula 6

Individual parameter estimates can be tested for significance using the familiar t ratio (i.e., Formula /S.E.), and simultaneous tests of multiple parameters can also be constructed (40). The standard degrees of freedom for the t test are complex and can actually exceed the sample size. Barnard and Rubin (41) proposed an adjusted degrees of freedom calculation that is generally considered to be superior to the standard values given in classic MI references. MI software packages may differ in the degrees of freedom that are reported, so the user should pay attention to this nuance.

Some final points should be made regarding MI. Like DML, MI assumes a multivariate normal model. However, studies have suggested that MI performs reasonably well even when data are nonnormal (7), and Schafer (39) argued that the normal model can be used with nominal and ordinal variables. Normalizing transformations can also be used before imputation, and variables can subsequently be restored to their original metrics using reverse transformations. Finally, like DML, MI assumes a MAR mechanism and will produce unbiased estimates when data are either MCAR or MAR.

Incorporating Auxiliary Variables
An advantage of MI is that auxiliary variables can readily be incorporated into the imputation phase simply by specifying additional predictors in the imputation model. Because the imputed values are conditioned on any auxiliary variables in the imputation model, these extraneous variables need not be included in the subsequent analysis phase. Schafer (39, p. 143) discussed the selection of variables for the imputation model and recommended that the imputation process include variables that are 1) related to the variable being imputed and 2) potentially related to missingness on that variable. In addition, the imputation model should be at least as general as the analysis model. For example, if the researcher intends to examine an interaction effect in a subsequent analysis, this interaction should also be included in the imputation model, or the interaction effect may be biased toward zero; this point underscores the need for careful planning when choosing the variables that are to be included in the imputation model.

Example Analyses
A small set of artificial data (N = 300) was generated in order to illustrate the application of DML and MI. The data were generated to mimic the hypothetical QOL study described earlier in the manuscript. Briefly, four artificial QOL scores were generated such that QOL declined across the four assessments. In order to simulate a randomized trial, a binary treatment variable was created that was uncorrelated with QOL scores at the initial assessment but was systematically related to the decline in QOL over time (e.g., the "treatment" group experienced less rapid decline in QOL scores across the four assessments). The differential growth rates produced significant group differences at the final QOL assessment, such that the treatment group exhibited significantly higher QOL scores than the control group. Finally, two auxiliary variables, age and education, were generated to have modest correlations with QOL scores; age was negatively correlated with QOL, and education had a positive correlation (r {approx} –0.35 and 0.20, respectively). In order to mimic a randomized trial, the auxiliary variables were unrelated to treatment group membership, within sampling error.

Missing values were imposed on the final three QOL scores in an MAR fashion and were systematically related to age, education, initial QOL status, and treatment group membership. Specifically, the probability of missing data was highest for cases with high age values, low education scores, and low initial QOL values. Also, missingness was systematically related to treatment group membership such that the missing data rate for the control group was twice as high as that for the treatment group (40% of the QOL scores for the control were missing, whereas 20% of the scores for the treatment group were deleted). The deletion process was executed separately for the three QOL scores, resulting in a general pattern of missing data (e.g., some cases sporadically missed assessments, whereas others permanently "dropped out" at a particular wave).

To illustrate the application of DML and MI, a linear growth curve analysis was used to model the development of QOL scores across the four assessments. The growth curve model describes change for each case using two parameters, the intercept (i.e., estimated QOL scores at the final assessment) and slope (i.e., change in QOL scores between each assessment). It is often the case that the growth model intercept is defined as initial status, or the expected score at baseline. However, these data were constructed to mimic a randomized trial where the treatment groups do not differ at the baseline assessment. In this case, it may be of substantive interest to assess group differences at the end of the study. The choice of location for the intercept does not affect the fit of the model, nor does it alter the interpretation of the slope.

The intercept and slope values from the growth model were regressed on treatment group membership in order to assess 1) mean differences at the final assessment and 2) the differential rates of change for the two groups, respectively. The DML analyses were performed using Mplus 3.13, and MI was implemented using the SAS MI procedure. The syntax for these analyses is given in the Appendices, and the artificial data set is available on request. For comparison purposes, the analyses were also performed using LW, LOCF, and AMI.

DML Analysis
The DML analysis was performed using two auxiliary variables, age and education. Although missingness was, in part, related to two variables in the analysis model (treatment group membership and initial QOL scores), the analyst will typically be blind to the true cause of the missing values. As such, it is important to illustrate an inclusive analysis strategy that applies the saturated correlates approach (30). Again, it may not be possible to identify useful auxiliary variables with any degree of certainty, but a liberal approach may be justified when considering variables to include in the auxiliary portion of the model (16). The latent growth curve model that includes age and education as auxiliary variables is depicted in Figure 2, and the syntax for this model given in Appendix A. Note that all three of Graham’s rules are applied in this case.


Figure 212
View larger version (35K):
[in this window]
[in a new window]
 
Figure 2. Path diagram representation of the latent growth curve model. Note that the auxiliary variables are correlated with (a) each other, (b) the substantive predictor (tx), and (c) the residual terms of each repeated measure. Note also that the residual covariance between the growth factors was omitted from the figure but was estimated in the analysis.

 

Selected results from the growth curve analysis are given in Table 1, and the complete-data results are also presented for comparison purposes. There are four key parameters of interest in this analysis: the intercept mean (i.e., final QOL mean for the control group), the slope mean (i.e., the rate of change for the control group), the regression of the intercept on the treatment indicator (i.e., the mean difference at the final QOL assessment), and the regression of the slopes on the treatment indicator (i.e., the difference in the rates of change for the two groups). The DML estimates were quite close to those from the complete-data analysis and differed by less than one half the value of each parameter’s standard error. The column labeled "Intercept on TX" gives the estimated mean difference at the final QOL assessment. As seen in the table, the estimated mean difference (4.02) was quite similar to that from the complete-data analysis (3.71).


View this table:
[in this window]
[in a new window]
 
TABLE 1. Selected Estimates From the Growth Curve Analysis

 

In contrast, the estimates from the traditional methods were substantially biased. For example, the LW estimate of the control group mean at wave 4 (i.e., the intercept) was 45.21 versus 42.72 for the complete data. This bias is more than four times larger than the parameter’s standard error (0.58). Although other LW parameters appear to be more accurate, the cumulative bias resulted in a distorted picture of the growth processes for the two groups. To illustrate, the top panel of Figure 3 shows the mean growth trajectories for DML and LW (the complete data trajectories were virtually identical to those of DML and were omitted from the figure). As seen in the figure, LW suggested the presence of group differences at baseline, even though none existed. Although the LW growth curve slopes were relatively accurate, the mean difference at the final assessment was too small. In a similar vein, the bottom panel of Figure 3 shows the estimated growth trajectories for DML and LOCF. These results are even more problematic because LOCF clearly suggests that no treatment effect existed at all; the treatment group was substantially higher than the control at baseline, and the group differences persisted across the duration of the study.


Figure 312
View larger version (16K):
[in this window]
[in a new window]
 
Figure 3. Estimated growth trajectories from the example analyses. The top panel compares the growth trajectories for DML and LW, whereas the bottom panel gives the estimated growth curves for DML and LOCF. The DML growth curves were virtually identical to those of the complete data, whereas both LW and LOCF produced distorted estimates of the trajectories.

 

MI Analysis
The SAS MI procedure was used to generate m = 10 imputed data sets, the SAS syntax for which is given in Appendix B. The imputation model included seven variables (the four QOL scores, age, education, and the treatment group indicator). Because the analysis models did not require interaction terms, no higher-order effects were used in the imputation model. Consistent with Schafer’s (39, p. 148) recommendations, imputed values were rounded to the nearest integer (using the ROUND keyword) because this was consistent with the original metric of the variables. A range for the imputed values was also specified using the MINIMUM and MAXIMUM keywords; if an imputed value falls outside this range, a new imputed value is generated.

Although the syntax in Appendix B does not reflect this, preliminary analyses were conducted in order to assess the convergence behavior of the DA algorithm and to determine the number of burn-in and between-imputation iterations. Graphic displays (time series and autocorrelation function plots) were used for this purpose and are discussed in detail by Schafer (39); examples of these plots are also given in an accessible paper by Schafer and Olsen (42).

A total of m = 10 imputed data sets were created for further analysis (specified using the NIMPUTE keyword). I opted for a very conservative approach and specified a total of 500 burn-in iterations (the first imputed data set was saved after 500 DA iterations had lapsed) and 200 between-imputation iterations (data sets were saved after every 200th iteration thereafter). These two options are specified using the NBITER and NITER keywords, respectively. The process of creating the 10 imputed data sets took just a few seconds of computer time. The 10 imputed data sets are stacked in a single file (the name of which is specified by the OUT keyword), and an index variable named _Imputation_ is written to this file than ranges between 1 and m.

To further illustrate, Table 2 gives the imputed values of qol4 (the final QOL assessment) for the first 10 cases; for brevity, the estimated values are shown for the first five imputed data sets. As seen in the table, cases with complete data (e.g., case 1) have identical values from one data set to the next. In contrast, cases with missing values have data values that vary across the m imputed data sets. As described previously, the imputed values are essentially predicted scores from a regression equation that have been augmented with a random residual term, and subsequently rounded to the nearest integer. The estimated values differ from one data set to the next because of the addition of the random residual, but also because the regression coefficients used to generate the predicted scores vary slightly across data sets.


View this table:
[in this window]
[in a new window]
 
TABLE 2. Multiple Imputation Estimates of Missing Values for 10 Cases

 

The use of SAS is convenient because MIs can be generated and analyzed within the same program using the MIANALYZE procedure. For example, I could have performed the growth curve analysis using PROC MIXED and subsequently combined the parameter estimates and standard errors using PROC MIANALYZE. However, I chose to analyze the multiply imputed data sets using Mplus. Like other popular software packages (e.g., Lisrel, HLM), Mplus automates the process of analyzing and combining parameter estimates from multiply imputed data sets. The syntax for the analysis is given in Appendix C. When using Mplus, the TYPE = IMPUTATION subcommand is used to specify the input of multiply imputed data sets. Instead of specifying the location of a single data file, the FILE subcommand is used to specify a file containing the names of the imputed data sets (i.e., a text file with 10 rows, each row containing the name of one imputed data set, e.g., imp1.dat, imp2.dat, ..., imp10.dat). Mplus subsequently fits the specified model to each of the imputed data sets and combines the parameter estimates and standard errors.

Results from the growth curve analysis are given in Table 1. Consistent with DML, the MI growth model estimates were quite similar to those of the complete data; the two sets of estimates generally differed by less than 1 SE unit. The estimated mean difference at the final QOL assessment was 4.28 compared with 3.71 from the complete-data analysis.

Traditional Analyses
Table 1 also contains the results from LW, AMI, and LOCF. As seen in the tables, the traditional methods produced estimates that differed considerably from those of the complete-data analysis. For example, the traditional methods underestimated the mean difference from the growth model analysis and also underestimated the variance estimates of the intercepts and slopes. Although caution is warranted when generalizing the results from a single artificial data set, these analyses are certainly consistent with previous computer simulation studies that have demonstrated the superiority of DML and MI (1–10).


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 
The primary goal of this manuscript was to summarize recent methodologic advances related to missing-data handling and to familiarize psychosomatic medicine researchers with DML and MI. The use of DML and MI is supported by statistical theory (15) and by a number of recent empirical studies (1–10). DML and MI are advantageous because they require less strict assumptions about the missing-data mechanism relative to traditional missing data techniques such as LW and LOCF. Although DML and MI are not without their own potentially tenuous assumptions (i.e., MAR and multivariate normality), these techniques are currently considered to be the best procedures available.

Some researchers appear dubious of DML and MI because they hold the misperception that these procedures are "making up data." As discussed previously, DML does not impute missing values at all but estimates parameters directly from the observed data. Although MI does impute missing values, it does so in a principled fashion and should be viewed as a computational tool that improves the accuracy of the resulting parameter estimates. It is important to remember that the goal of any statistical analysis is to estimate a set of population parameters, so "filling in" the missing values is not problematic, provided that doing so produces accurate estimates. One of the key points from Rubin’s (15) theoretical work (24) is that likelihood-based approaches (DML and MI) should yield the same answer as a complete data analysis, provided that the data are MAR. This point should alleviate concerns that the analyst is somehow "cheating" when using DML or MI.

After illustrating the use of DML and MI, the reader might be left wondering, "Which method should I use?" To better answer this question, a brief review of the pros and cons of each method is warranted. As noted previously, DML is widely available in all commercial SEM software packages, a benefit that may weigh heavily for users who have previous exposure to SEM. The inclusion of auxiliary variables is somewhat awkward but can be accomplished without much additional effort (e.g., see Appendix A). Perhaps the biggest drawback associated with DML is the availability of an estimation routine. Although the number and types of models that can be estimated using DML have grown substantially in recent years, there are many analyses that cannot be estimated using current SEM software and would require custom computer code.

The flexibility of MI is probably its most attractive feature. Software for creating multiply imputed data sets is readily available, and no special analysis software is required. For example, two SAS procedures, PROC MI and PROC MIANALYZE, were implemented in version 8, and Schafer’s NORM package (43) can be freely downloaded from www.stat.psu.edu/~jls/misoftwa.html. The ease with which auxiliary variables can be incorporated into the imputation model is also a benefit because auxiliary variables can be ignored in all subsequent analyses. The primary detriment associated with MI is its complexity. Compared with DML, MI can be more labor intensive. The availability of software routines for combining parameter estimates certainly reduces the burden on the analyst, but it could reasonably be argued that MI requires more knowledge on the part of the user.

In summary, DML and MI require identical assumptions (multivariate normality and MAR missingness) and will frequently produce very similar parameter estimates, particularly if the same auxiliary variables are used in both models. As such, the primary considerations appear to be 1) the availability of a DML routine for the analysis of interest, 2) the ease of incorporating auxiliary variables, and 3) the overall ease of use. In the end, choosing between these two methods is probably a matter of personal preference and convenience. Given the advantages associated with these techniques, behavioral medicine researchers should strongly consider the use of DML and MI in future studies.


    Appendix 1
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 
Appendix A. Mplus syntax for a growth curve analysis with auxiliary variables.

data:

file is 'c:/qol.dat';

variable:

names are qol1 qol2 qol3 qol4 age ed tx;

usevariables are qol1 – tx;

missing are all (–99);

analysis:

type = missing h1;

model:

! basic growth model syntax;

i s | qol1@–3 qol2@–2 qol3@–1 qol4@0;

[qol1 – qol4@0];

[i s];

qol1 – qol4;

i s;

i s on tx;

! auxiliary variables;

age ed;

age with ed;

age ed with tx;

age ed with qol1 – qol4;

output:

sampstat standardized;

Appendix B. SAS multiple imputation syntax.

libname c 'c:/ ';

data c.qol;

infile 'c:/qol.dat';

input qol1 qol2 qol3 qol4 age ed tx;

/* CHANGE MISSING VALUE CODE FROM –9 TO. */

array x[7] qol1m qol2m qol3m qol4m age ed tx;

do i = 1 to 7;

if x [i] = –99 then x[i] = .;

end;

drop i;

run;

/* CREATE M = 10 IMPUTED DATA SETS */

/* NIMPUTE SPECIFIES THE NUMBER OF IMPUTED DATA SETS */

/* THE MINIMUM AND MAXIMUM OPTION PROVIDES RANGES FOR IMPUTED VALUES */

/* ROUND = 1 ROUNDS IMPUTED VALUES TO NEAREST INTEGER */

/* NBITER SPECIFIES THE NUMBER OF BURN IN ITERATIONS */

/* NITER SPECIFIES THE NUMBER OF BETWEEN- IMPUTATION CYCLES */

proc mi data = c.qol out = c.qolmi seed = 56789 nimpute = 10 minimum = 50 8 0 0 0 0 0 maximum = 99 30 1 100 100 100 round = 1 1 1 1 1 1 1;

var qol1 qol2 qol3 qol4 age ed tx;

mcmc nbiter = 500 niter = 200;

run;

Appendix C. Mplus syntax for a growth curve using multiply imputed data.

data:

file is 'c:/imputednames.dat';

type = imputation;

variable:

names are qol1 qol2 qol3 qol4 age ed tx;

usevariables are qol1 – qol4 tx;

model:

! basic growth model syntax;

i s | qol1@–3 qol2@–2 qol3@–1 qol4@0;

[qol1 – qol4@0];

[i s];

qol1 – qol4;

i s;

i s on tx;


    NOTES
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 

Received for publication November 7, 2005; revision received March 2, 2006.

DOI:10.1097/01.psy.0000221275.75056.d8


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 DISCUSSION
 Appendix 1
 NOTES
 REFERENCES
 

  1. Arbuckle JL. Full information estimation in the presence of incomplete data. In: Marcoulides GA, Schumacker RE, eds. Advanced Structural Equation Modeling. Mahwah, NJ: Lawrence Erlbaum Publishers; 1996:243–77.
  2. Enders CK. The impact of nonnormality on full information maximum likelihood estimation for structural equation models with missing data. Psychol Methods 2001;6:352–70.[Medline]
  3. Enders CK. Using the EM algorithm to estimate coefficient alpha for scales with item level missing data. Psychol Methods 2003;8:322–37.[CrossRef][Medline]
  4. Enders CK, Bandalos DL. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling 2001;8:430–57.[CrossRef]
  5. Gold MS, Bentler PM. Treatments of missing data: a Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization. Structural Equation Modeling 2000;7:319–55.[CrossRef]
  6. Graham JW, Hofer SM, MacKinnon DP. Maximizing the usefulness of data obtained with planned missing value patterns: an application of maximum likelihood procedures. Multivariate Behav Res 1996;31:197–218.[CrossRef]
  7. Graham JW, Schafer JL. On the performance of multiple imputation for multivariate data with small sample size. In: Hoyle RH, ed. Statistical Strategies for Small Sample Research. Thousand Oaks, CA: Sage; 1999:1–29.
  8. Muthén B, Kaplan D, Hollis M. On structural equation modeling with data that are not missing completely at random. Psychometrika 1987;52:431–62.[CrossRef]
  9. Kaplan D. The impact of BIB spiraling-induced missing data patterns on goodness-of-fit tests in factor analysis. J Educ Behav Stat 1995;20:69–82.
  10. Wothke W. Longitudinal and multi-group modeling with missing data. In: Little TD, Schnabel KU, Baumert J, eds. Modeling Longitudinal and Multiple Group Data: Practical Issues, Applied Approaches and Specific Examples. Mahwah, NJ: Erlbaum; 2000:219–40.
  11. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods 2002;7:147–77.[CrossRef][Medline]
  12. Azar B. Finding a solution for missing data. Monit Psychol 2002;33:70.
  13. Peugh JL, Enders CK. Missing data in educational research: a review of reporting practices and suggestions for improvement. Rev Educ Res 2004;74:525–56.
  14. Wood AM, White IR, Thompson SG. Are missing outcome data adequately handled? a review of published randomized controlled trials in major medical journals. Clin Trials 2004;1:368–76.[Abstract/Free Full Text]
  15. Rubin DB. Inference and missing data. Biometrika 1976;63:581–92.[Abstract/Free Full Text]
  16. Collins LM, Schafer JL, Kam C-H. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods 2001;6:330–51.[Medline]
  17. Bernhard J, Cella DF, Coates AS, Fallowfield L, Ganz PA, Moinpour CM, Mosconi P, Osoba D, Simes J, Hurny C. Missing quality of life data in cancer clinical trials: serious problems and challenges. Stat Med 1998;17:517–32.[CrossRef][Medline]
  18. Raghunathan TE. What do we do with missing data? Some options for analysis of incomplete data. Annu Rev Public Health 2004;25:99–117.[CrossRef][Medline]
  19. Allison PD. Missing Data. Newbury Park, CA: Sage; 2001.
  20. Diggle PJ, Kenward MG. Informative dropout in longitudinal data analysis (with discussion). Appl Stat 1994;43:49–73.[CrossRef]
  21. Little RJA. Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 1993;88:125–34.[CrossRef]
  22. Little RJA, Raghunathan T. Should imputation of missing data condition on all observed variables? Proceedings of the Section on Survey Research Methods. American Statistical Association; 1997:617–22.
  23. Raghunathan TE, Siscovick DS. A multiple imputation analysis of a case-control study of the risk of primary cardiac arrest among pharmacologically treated hypertensive. Appl Stat 1996;45:335–52.[CrossRef]
  24. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: Wiley; 2002.
  25. Cook RJ, Zeng L, Yi GY. Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics 2004;60:820–8.[CrossRef][Medline]
  26. Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ. Analyzing incomplete longitudinal clinical trial data. Biostatistics 2004;5:445–64.[Abstract]
  27. Muthén LK, Muthén BO. Mplus User’s Guide [computer software and manual]. Los Angeles, CA: Muthén & Muthén; 2004.
  28. Enders CK. Estimation by maximum likelihood. In: Everitt B, Howell DC, eds. Encyclopedia of Behavioral Statistics. West Sussex, UK: Wiley; 2005:1164–70.
  29. Eliason SR. Maximum Likelihood Estimation: Logic and Practice. Newbury Park, CA: Sage; 1993.
  30. Graham JW. Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling 2003;10:80–100.[CrossRef]
  31. Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc 1988;83:1198–1202.[CrossRef]
  32. Graham JW, Hofer SM, Donaldson SI, MacKinnon DP, Schafer JL. Analysis with missing data in prevention research. In: Bryant KJ, Windel M, eds. The Science of Prevention: Methodological Advances from Alcohol and Substance Abuse Research. Washington, DC: American Psychological Association; 1997:325–66.
  33. Kenward MG, Molenberghs G. Likelihood based frequentist inference when data are missing at random. Stat Sci 1998;13:236–47.[CrossRef]
  34. Yuan K-H, Bentler PM. Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. In: Becker M, Sobel M, eds. Sociological Methodology 2000. Malden, MA: Blackwell; 2000:165–200.
  35. Enders CK. Applying the Bollen-Stine bootstrap for goodness-of-fit measures to structural equation models with missing data. Multivariate Behav Res 2002;37:359–77.[CrossRef]
  36. Enders CK. A SAS macro for implementing the modified Bollen-Stine bootstrap for missing data: implementing the bootstrap using existing structural equation modeling software. Structural Equation Modeling 2005;12:620–41.
  37. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley; 1987.
  38. Allison PD. Multiple imputation for missing data: a cautionary tale. Sociol Methods Res 2000;28:301–9.[Abstract/Free Full Text]
  39. Schafer JL. Analysis of Incomplete Multivariate Data. New York, NY: Chapman and Hall; 1997.
  40. Li KH, Raghunathan TE, Rubin DB. Large-sample significance levels from multiple-imputed data using moment-based statistics and an F reference distribution. J Am Stat Assoc 1991;86:1065–73.[CrossRef]
  41. Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika 1999;86:948–55.[Abstract/Free Full Text]
  42. Schafer JL, Olsen MK. Multiple imputation for multivariate missing data problems: a data analyst’s perspective. Multivariate Behav Res 1998;33:545–71.[CrossRef]
  43. Schafer JL. NORM: Multiple Imputation of Incomplete Multivariate Data under a Normal Model [computer software]. University Park, PA: Department of Statistics, The Pennsylvania State University; 1999.



This article has been cited by other articles:


Home page
Journal of Travel ResearchHome page
J. Huang and C. H. C. Hsu
The Impact of Customer-to-Customer Interaction on Cruise Experience and Vacation Satisfaction
Journal of Travel Research, February 1, 2010; 49(1): 79 - 92.
[Abstract] [PDF]


Home page
Psychosom. Med.Home page
C. V. Igna, J. Julkunen, H. Vanhanen, P. Keskivaara, and M. Verkasalo
Depressive Symptoms and Serum Lipid Fractions in Middle-Aged Men: Physiologic and Health Behavior Links
Psychosom Med, November 1, 2008; 70(9): 960 - 966.
[Abstract] [Full Text] [PDF]


Home page
Psychosom. Med.Home page
K. M. Phillips, M. H. Antoni, S. C. Lechner, B. B. Blomberg, M. M. Llabre, E. Avisar, S. Gluck, R. DerHagopian, and C. S. Carver
Stress Management Intervention Reduces Serum Cortisol and Increases Relaxation During Treatment for Nonmetastatic Breast Cancer
Psychosom Med, November 1, 2008; 70(9): 1044 - 1049.
[Abstract] [Full Text] [PDF]


Home page
Psychosom. Med.Home page
R. J. Contrada, D. A. Boulifard, E. L. Idler, T. J. Krause, and E. W. Labouvie
Course of Depressive Symptoms in Patients Undergoing Heart Surgery: Confirmatory Analysis of the Factor Pattern and Latent Mean Structure of the Center for Epidemiologic Studies Depression Scale
Psychosom Med, November 1, 2006; 68(6): 922 - 930.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF)
Right arrow An erratum has been published
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrowRequest Permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Enders, C. K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Enders, C. K.
Related Collections
Right arrow Statistical Corner


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS