Figure 2. Pure noise variables still produce good R² values if the model is overfitted. The distribution of R² values from a series of simulated regression models containing only noise variables. Each model contained 15 predictors, each consisting of randomly generated values, and a response variable whose values were also randomly generated. Thus, the true model has an R² of 0. Four sets of 10,000 random samples were drawn, one set at each sample size: N = 50, N = 100, N = 150, and N = 200. The smoothed frequency distribution of the 10,000 R² values is plotted here for each of the 4 sample size conditions. Note that even when the number of cases per predictor is reasonably good (200/15 = 13.3), a fair number of R² values fall noticeably above 0, solely because of the chance of the draw. When there were only approximately 50/15 = 3.3 observations per predictor, the frequency of large R² values was quite high.
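The simulation behind a figure of this kind can be sketched in a few lines. The following is a minimal illustration, not the original analysis code: it assumes standard normal noise for both predictors and response, ordinary least squares with an intercept, and NumPy as the tool, none of which the caption specifies. The function name `simulate_r2` and the seed are hypothetical, and the usage lines run 2,000 replications per condition rather than 10,000 to keep the run short.

```python
import numpy as np

def simulate_r2(n_obs, n_predictors=15, n_sims=10_000, seed=0):
    """Return R^2 values from OLS fits of a noise response on noise predictors.

    Assumption: predictors and response are independent standard normal draws,
    so the true R^2 is 0 and any fitted R^2 reflects chance alone.
    """
    rng = np.random.default_rng(seed)
    r2 = np.empty(n_sims)
    for i in range(n_sims):
        X = rng.standard_normal((n_obs, n_predictors))  # noise predictors
        y = rng.standard_normal(n_obs)                  # noise response
        A = np.column_stack([np.ones(n_obs), X])        # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS coefficients
        resid = y - A @ beta
        ss_res = resid @ resid                          # residual sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()            # total sum of squares
        r2[i] = 1.0 - ss_res / ss_tot
    return r2

# Mean R^2 for each sample size condition described in the caption
# (fewer replications than the figure, for speed).
mean_r2 = {n: simulate_r2(n, n_sims=2000).mean() for n in (50, 100, 150, 200)}
for n, m in mean_r2.items():
    print(f"N = {n:3d}: mean R^2 = {m:.3f}")
```

Under these normal-noise assumptions the expected R² is p/(N - 1), where p is the number of predictors, so the average should be near 15/49 ≈ 0.31 for N = 50 and near 15/199 ≈ 0.08 for N = 200. Plotting a histogram or kernel density of each returned array would give curves comparable to those in the figure.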