Psychosomatic Medicine


What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models

Michael A. Babyak, PhD

From Duke University Medical Center, Durham, NC.



Figure 1. Example of a simple simulation study. A simulated population was created in which the equation y = 0.4x + error was true. Ten thousand random samples of N = 100 were drawn, and an ordinary least squares regression model, specified as y = bx + error, was estimated for each sample. The regression coefficient b was collected from each of the 10,000 models and plotted here by frequency. The location and shape of such a distribution can be examined to see whether it has the properties we would expect given our model assumptions.
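A simulation of the kind described in Figure 1 can be sketched in a few lines. This is only an illustration, not the code used for the article: the caption does not state how x or the error term were distributed, so standard normal draws are assumed for both, the seed is arbitrary, and the model is fit without an intercept to match the stated specification y = bx + error.

```python
import numpy as np

TRUE_SLOPE = 0.4     # population value from the caption: y = 0.4x + error
N_SAMPLES = 10_000   # number of random samples drawn
N = 100              # observations per sample

# Assumption: x and the error term are both standard normal; the seed is arbitrary.
rng = np.random.default_rng(seed=2004)
x = rng.standard_normal((N_SAMPLES, N))
error = rng.standard_normal((N_SAMPLES, N))
y = TRUE_SLOPE * x + error

# Ordinary least squares slope for the no-intercept model y = bx + error,
# computed for every sample at once: b = sum(x*y) / sum(x*x).
b = (x * y).sum(axis=1) / (x * x).sum(axis=1)

# Location and spread of the 10,000 estimates.
print(f"mean of b: {b.mean():.3f}")
print(f"SD of b:   {b.std(ddof=1):.3f}")

# Frequency counts for a histogram like the one in Figure 1.
counts, bin_edges = np.histogram(b, bins=30)
```

Under these assumptions the estimates should center near 0.4, and their spread shows how much b varies from sample to sample at N = 100.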

 


Figure 2. Pure noise variables still produce good R² values if the model is overfitted. The distribution of R² values from a series of simulated regression models containing only noise variables. The model contained 15 predictors, each consisting of randomly generated values, and a response variable, whose values were also randomly generated. Thus, the true model has an R² of 0. Four sets of 10,000 random samples were drawn, one set at each sample size: N = 50, N = 100, N = 150, and N = 200. The smoothed frequency distribution of the R² values generated by each of the 10,000 models is plotted here for the 4 sample size conditions. Note that even when the number of cases per predictor is reasonably good (200/15 = 13.3), there are, solely because of the chance of the draw, a fair number of nonzero R² values. When there were only approximately 50/15 = 3.3 observations per predictor, the frequency of large R² values was quite high.
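The pure-noise experiment in Figure 2 can be reproduced approximately as follows. This is a sketch under stated assumptions rather than the original procedure: the caption says only that the values were randomly generated, so standard normal noise is assumed, an intercept is included in each fitted model, and the function name `simulate_null_r2` and the default of 2,000 replicates (chosen to keep the run short; the figure used 10,000) are illustrative.

```python
import numpy as np

def simulate_null_r2(n, n_predictors=15, n_reps=2000, seed=0):
    """Return R^2 from n_reps OLS fits in which every variable is noise.

    Assumptions: predictors and response are independent standard normal
    draws, and each fitted model includes an intercept.
    """
    rng = np.random.default_rng(seed)
    r2 = np.empty(n_reps)
    for i in range(n_reps):
        X = rng.standard_normal((n, n_predictors))
        y = rng.standard_normal(n)
        design = np.column_stack([np.ones(n), X])   # add intercept column
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        ss_res = resid @ resid                       # residual sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()         # total sum of squares
        r2[i] = 1.0 - ss_res / ss_tot
    return r2

for n in (50, 100, 150, 200):
    r2 = simulate_null_r2(n)
    print(f"N = {n:3d}  cases per predictor = {n / 15:4.1f}  "
          f"mean R^2 = {r2.mean():.3f}  "
          f"95th percentile = {np.percentile(r2, 95):.3f}")
```

With normal noise and an intercept, the expected R² of such a model is k/(N - 1) for k predictors, so the mean should fall from roughly 0.31 at N = 50 to roughly 0.08 at N = 200 even though the true R² is 0.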

 


Figure 3. Results of the simulation study of logistic regression models by Peduzzi et al. Peduzzi et al. (9) studied the stability of logistic regression coefficients under a variety of events-per-predictor ratios. Recall that the limiting sample size for a logistic model is the number of events (when there are fewer events than nonevents). The x-axis represents the ratio of events per predictor in the model for the case of 7 predictors. The y-axis shows the percent relative bias in the regression weight compared with the known population weight. The results suggest that bias is unacceptably high when there are fewer than 10 to 15 events per predictor. Reproduced with permission (9).
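The events-per-predictor ratio on the x-axis of Figure 3 is simple arithmetic, sketched below. The function name and the example counts are hypothetical; the only rules taken from the caption are that the limiting sample size is the smaller of the event and nonevent counts, and that fewer than 10 to 15 events per predictor is a warning sign.

```python
def events_per_predictor(n_events, n_nonevents, n_predictors):
    """Ratio of the limiting sample size to the number of predictors.

    The limiting sample size is the less frequent outcome category,
    which is the number of events when events are rarer than nonevents.
    """
    limiting_n = min(n_events, n_nonevents)
    return limiting_n / n_predictors

# Hypothetical example: 1,000 patients, 70 of whom had the event,
# and a logistic model with 7 predictors (as in Figure 3).
ratio = events_per_predictor(n_events=70, n_nonevents=930, n_predictors=7)
print(f"events per predictor: {ratio:.1f}")

# Compare against the lower end of the 10-to-15 range from the caption.
if ratio < 10:
    print("below 10 events per predictor: coefficients may be badly biased")
```

Note that the 1,000 total cases in this example do not enter the ratio at all; only the 70 events do.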

 





Copyright © 2004 by the American Psychosomatic Society