STATISTICAL CORNER
From the Integrative Medicine Service, Biostatistics Service, MSKCC, New York, NY.
Address correspondence and reprint requests to Andrew Vickers, PhD, Department of Medicine, Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Ave., New York, NY 10021. E-mail: vickersa{at}mskcc.org
| ABSTRACT |
|---|
Key Words: randomized controlled trials; analysis of variance; statistical interpretation of data
Abbreviations: ANOVA = analysis of variance; ANCOVA = analysis of covariance.
| INTRODUCTION |
|---|
| STATISTICAL ANALYSIS OF RANDOMIZED TRIALS |
|---|
The CONSORT group, which issues recommendations on the reporting of randomized trials, has therefore stated that the results of a trial should be stated as "a summary of results for each group, and the estimated effect size and its precision (e.g., a 95% confidence interval)." They go on to state that "although p-values may be provided ... results should not be reported solely as p-values" (2).
ANOVA produces p values by a method that does not require calculation of the difference between groups. Accordingly, the default setting for ANOVA results in most statistical software packages such as SPSS, SAS, and STATA is to give only F, p, and the degrees of freedom. Perhaps as a result, these are the values that are most typically reported in randomized trials of psychosomatic interventions. A typical example, selected at random from a paper in Health Psychology, is: "when the two interventions were compared, the [cognitive-behavioral] participants had significantly better scores on the ... POMS Vigor subscale F(1,46) = 6.60, p = .014." This gives us no idea by how much cognitive behavior therapy improves vigor and therefore whether it is worth receiving treatment.
Reporting of F and p values in isolation, without an estimate of effect size, is particularly problematic when differences between groups are not statistically significant. To illustrate, I give three possible "negative" results of the hypothetical psychotherapy trial in Table 1. Following the CONSORT recommendations, I give the results in each group, the effect size (in terms of the difference between means), and a 95% confidence interval, calculated from the standard errors (see Altman for an introduction to calculating a confidence interval for the difference between means (3)). The p value was obtained by ANOVA. For the sake of simplicity, I assume that anxiety is measured on a 0 to 10 scale, with higher scores indicating worse symptoms.
In scenario 1, treatment was considerably better than control, reducing anxiety scores by approximately 20%. In this case, the lack of statistical significance is an indication of a trial with insufficient power. In scenario 2, the treatment was similar to control, but the 95% confidence interval includes differences of clinical relevance; it could be that anxiety scores in the psychotherapy group are as much as 1.8 points (30%) lower. This would lead us to conclude that although there was no evidence of a treatment effect, it remains possible that treatment is of benefit. Conversely, in scenario 3, the 95% confidence interval includes only clinically trivial differences between groups; at best, psychotherapy could reduce anxiety scores by 0.6 points, or 10%. In this case, we might conclude that psychotherapy is unlikely to help. Reporting estimates and confidence intervals along with the p value allows us to draw appropriately varying conclusions from the three clinical trial results. Reporting only a p value from an ANOVA leads to the same conclusion from different findings: a failure to reject the null hypothesis.
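The kind of estimate recommended above can be sketched in a few lines. The following is a minimal illustration, not the calculation used for Table 1: the scores are invented, the function name `mean_difference_ci` is hypothetical, and the interval uses a large-sample normal quantile where a t quantile would be more exact for small groups.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_difference_ci(treated, control, level=0.95):
    """Return (difference, lower, upper) for mean(treated) - mean(control)."""
    diff = mean(treated) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control))
    # Normal quantile (about 1.96 for 95%); an assumption made to stay in the
    # standard library. A t quantile is preferable for small samples.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical anxiety scores on the 0 to 10 scale (not data from the article).
psychotherapy = [4.0, 5.5, 6.0, 3.5, 5.0, 6.5, 4.5, 5.0]
control = [6.0, 5.5, 7.0, 6.5, 5.0, 7.5, 6.0, 6.5]

diff, lower, upper = mean_difference_ci(psychotherapy, control)
print(f"difference = {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Reporting the three returned numbers alongside the group means gives the reader the size of the effect and its precision, which an F statistic and p value alone do not.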
| TRIALS WITH MORE THAN TWO GROUPS |
|---|
The researchers decided to run a second trial, this time randomizing patients to receive psychotherapy, a group relaxation class, or standard care alone. In our first scenario, the mean anxiety scores in the three groups were 5.5, 6.2, and 6.1, respectively. This would seem to indicate that psychotherapy, but not relaxation alone, is of benefit, and therefore that the effects of psychotherapy are not merely the result of attention from a practitioner or a relaxation component. An ANOVA of some data with these means gave F(2,117) = 2.4, p = .096, leading us to conclude that there is no overall difference between groups. This appears to contradict the previous trial result, which showed an effect of psychotherapy. Moreover, the conclusion has little connection to the study design, which concerns the degree to which relaxation and attention contribute to the effects of psychotherapy.
An alternative approach is to conduct a multivariable linear regression using predictor variables that reflect the questions asked by the researchers. One variable could be called "contact" and is coded 1 for both the relaxation and psychotherapy groups (who both get additional care) and 0 for controls. The second variable could be called "therapy" and is coded 1 for patients in the psychotherapy group and 0 otherwise. When this regression is run on the dataset for the first scenario, the coefficients for these variables, with positive values indicating a reduction in anxiety, are −0.1 (95% confidence interval [CI], −0.8 to 0.5; p = .6) and 0.7 (95% CI, 0.1 to 1.3; p = .035), leading us to the more intuitive conclusion that therapy is of benefit and that any effect of contact and relaxation is small. Note that the coefficients are equivalent, respectively, to the difference between groups for relaxation versus control and therapy versus relaxation. They would be reported along with the means and standard deviations of the baseline and follow-up anxiety scores for each group, as was done in Table 1.
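The "contact" and "therapy" coding can be illustrated with a small least-squares fit. This is a sketch under stated assumptions: the nine scores are invented so that the group means match the first scenario (5.5, 6.2, 6.1), the helper names `solve` and `ols` are hypothetical, and only point estimates are computed, not confidence intervals or p values. The raw coefficients are on the anxiety scale, so a negative value means lower anxiety; expressing them as reductions flips the sign.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical scores chosen so the group means are 5.5, 6.2 and 6.1.
groups = {
    "psychotherapy": [4.5, 5.5, 6.5],
    "relaxation": [5.2, 6.2, 7.2],
    "control": [5.1, 6.1, 7.1],
}

rows, y = [], []
for name, scores in groups.items():
    contact = 1 if name in ("psychotherapy", "relaxation") else 0
    therapy = 1 if name == "psychotherapy" else 0
    for score in scores:
        rows.append([1, contact, therapy])  # intercept, contact, therapy
        y.append(score)

intercept, b_contact, b_therapy = ols(rows, y)
print(f"contact: {b_contact:+.2f}  therapy: {b_therapy:+.2f}")
```

With this coding the intercept is the control mean, the "contact" coefficient is the relaxation mean minus the control mean, and the "therapy" coefficient is the psychotherapy mean minus the relaxation mean, so each coefficient answers one of the trial's questions directly.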
In a second scenario, the anxiety scores for psychotherapy, relaxation, and control are 5.3, 5.3, and 6.3. ANOVA gives F(2,117) = 6.6, p = .002; regression gives coefficients for "contact" and "therapy" as 1.0 (95% CI, 0.4 to 1.6; p = .002) and 0.0 (95% CI, −0.5 to 0.5; p = .8). Again, the conclusion from regression (that psychotherapy is effective, but that this is the result of relaxation and contact with a health professional) is more useful and relates more strongly to the study questions than the conclusion of the ANOVA, which is only that relaxation classes, psychotherapy, and usual care are not equivalent.
A multivariable regression is not the only alternative to ANOVA. One point of view is that we should not ask whether psychotherapy is superior to relaxation alone unless it is known that treatment, of whatever form, is better than control. Accordingly, we would combine the results from the psychotherapy and relaxation groups and compare with controls using a t test. If there was a significant difference, we would then compare psychotherapy and relaxation. Like regression, and in contrast to ANOVA, this method provides answers to specific questions of clinical relevance.
| TRIALS WITH REPEATED MEASURES |
|---|
Unfortunately, it is much more common in the literature to see what I consider to be an inappropriate use of ANOVA. When setting up an ANOVA, a researcher specifies the different effects of interest. In the case of our cancer study, in which we measure anxiety before and after a course of psychotherapy treatment or usual care, these effects are "time" (Do scores change between baseline and follow-up after treatment?) and "group" (Do scores depend on whether a patient is assigned to psychotherapy or control?). The problem is that these effects are uninteresting and irrelevant to the analysis of the randomized trial. We are not concerned with whether scores will change from baseline (it seems likely that they would) or whether overall anxiety scores, including pretreatment score, differ between groups (at baseline, they should be similar because of randomization). What we are interested in, and why we conducted the randomized trial, is whether the change over time is different between groups. This is technically known as the "time by treatment interaction," an unwieldy term that, in my view, reduces the interpretability of clinical trial results. Take the following example, based on a trial reported in the Archives of General Psychiatry: "There was no significant group effect F(1,18) = 1.2, p = .3. However, there was a significant time effect F(1,18) = 48, p < .001 and a significant time by treatment interaction F(1,18) = 11, p = .003." It is not immediately obvious to the nonstatistical reader that the treatment was effective; indeed, the statement "there was no significant group effect" might lead one to conclude exactly the opposite. Here the "group effect" includes the pretreatment mean and so is of little interest.
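For a two-arm trial with one baseline and one follow-up measure, the interaction can be written as a single contrast, which may make the quoted F statistic easier to read. The notation is an assumption introduced here for illustration and is not the article's: $\bar{Y}$ denotes a mean anxiety score, and $T$ and $C$ denote the psychotherapy and control groups.

$$
\Delta = \left(\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}\right) - \left(\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}\right)
$$

In this two-group, two-time case the interaction F statistic, with 1 numerator degree of freedom, is the square of the t statistic from a pooled-variance comparison of change scores between groups. Reporting $\Delta$ with its confidence interval therefore conveys the same test and also states how large the treatment effect is.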
An additional complication is one in which the trial incorporates more than one measure after randomization. Imagine that our colorectal cancer study involved an additional end point at 6 months, after the completion of adjuvant chemotherapy. Again, I present the possible results in terms of three scenarios shown in Figure 1. The simplest approach would be to report the results of the 6-week and 6-month follow-up separately, particularly because they appear to address separate questions: Can psychotherapy alleviate distress in the period immediately after diagnosis? Can psychotherapy lead to persisting improvements in a patient's psychologic response to cancer? The approach would then be to undertake a linear regression of the 6-week score separately from the 6-month score, using baseline score as a predictor variable for both, and estimate the coefficient for group (equivalent to the difference between groups). This approach would often be described as ANCOVA.
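The regression described here can be written out as a model. The symbols are assumptions chosen for illustration: $Y_{i,\text{follow-up}}$ is patient $i$'s anxiety score at the follow-up being analyzed, $Y_{i,\text{baseline}}$ is the pretreatment score, and $G_i$ is 1 for psychotherapy and 0 for control.

$$
Y_{i,\text{follow-up}} = \beta_0 + \beta_1\, Y_{i,\text{baseline}} + \beta_2\, G_i + \varepsilon_i
$$

The model would be fitted once with the 6-week score as the outcome and once with the 6-month score. In each fit, $\beta_2$ is the baseline-adjusted difference between groups, and it is this coefficient, with its confidence interval, that would be reported.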
An alternative would be to throw all assessments together into a single, repeated-measures ANOVA. One problem would be the interpretation of the time-by-group interaction. In my view, what clinicians and patients are interested in are the posttreatment results, in which a time-by-group interaction means that the short- and long-term effects of psychotherapy differ. This is the case in scenario 1, in which the effects of psychotherapy do not persist, and in scenario 3, in which differences between groups become larger over time. However, in traditional ANOVA models, the time-by-group interaction includes the baseline. Hence, we might well see a time-by-treatment interaction for scenarios 2 and 3, but not 1. The appropriate analysis is an extension of linear regression known as "longitudinal mixed models," "latent growth curve modeling," or "generalized linear modeling." An excellent description has been given in a prior paper in this series (7). Such models allow clear specification of the particular periods of time over which investigators want to examine whether the effects of treatment differ.
It should also be noted that in many cases in which repeated measures are taken, there is no need to examine time-by-treatment interactions. For example, in a randomized trial of two different surgical techniques, we might measure pain once or twice a day during the patient's hospital stay. Such a trial might best be analyzed by first calculating the mean of each patient's postoperative pain scores and then comparing these means between groups. A more complex analysis involving time-by-treatment interaction is not warranted because it is unlikely that the difference between groups will change importantly over time.
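The summary-measure approach for the surgical example might look like the following sketch. The pain scores, patient labels, and arm names are all hypothetical, and the final step shows only the difference in means; in practice the per-patient means would go into a two-sample comparison that also yields a confidence interval.

```python
from statistics import mean

# Hypothetical repeated postoperative pain scores per patient (0 to 10 scale).
pain_scores = {
    "technique_a": {"p1": [6, 5, 4, 3], "p2": [7, 6, 6, 5], "p3": [5, 4, 4, 3]},
    "technique_b": {"p4": [7, 7, 6, 6], "p5": [8, 6, 6, 4], "p6": [6, 6, 5, 5]},
}


def patient_means(patients):
    """Collapse each patient's repeated scores into one summary value."""
    return [mean(scores) for scores in patients.values()]


# Step 1: one number per patient.
summary = {arm: patient_means(patients) for arm, patients in pain_scores.items()}

# Step 2: compare the per-patient summaries between groups.
difference = mean(summary["technique_a"]) - mean(summary["technique_b"])
print(f"difference in mean pain = {difference:.2f}")
```

Because each patient contributes a single value, the comparison involves independent observations and no time-by-treatment term needs to be specified.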
| DISCUSSION |
|---|
However, ANOVA is especially prone to misuse: obtaining an F and p value from an ANOVA does not require calculation of a difference between groups, with the result that this estimate often goes unreported; for trials with more than two arms, ANOVA tests a hypothesis that is often uninteresting; for trials with repeated measures, ANOVA requires specification and analysis of effects that are extraneous to the principal study question of a randomized trial. In Table 2, I summarize this article by describing each problem associated with ANOVA and giving an alternative statistical approach. These alternatives should be considered in preference to ANOVA for the analysis of randomized trials.
| NOTES |
|---|
Received for publication September 16, 2004; revision received May 18, 2005.
DOI:10.1097/01.psy.0000172624.52957.a8
| REFERENCES |
|---|