Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Spurious Correlation

Suppose you have a data matrix of 6 variables and 13 observations filled with random numbers. Now let us try a simple experiment: without loss of generality we pick the first of the six variables and try to model it using the 5 remaining variables. Ideally, we would expect that it is impossible to set up a regression model which shows a significant relationship between the five predictor variables and the first variable. However, if we actually perform the experiment and repeat it several times, we obtain considerable correlations between the predicted and the actual target values. This effect becomes stronger when more independent variables or fewer observations are used. You may also use the interactive example to gain some experience with this effect.
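The experiment described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available, normally distributed random numbers, and an ordinary least-squares fit with an intercept; the function name chance_correlation and the choice of 1000 repetitions are not part of the original text.

```python
import numpy as np

def chance_correlation(n_obs, n_vars, rng):
    """Fit the first column of a random matrix from the remaining columns
    and return the correlation between actual and predicted values."""
    data = rng.standard_normal((n_obs, n_vars))   # pure noise, no real relationship
    y = data[:, 0]                                # first variable is the target
    X = np.column_stack([np.ones(n_obs), data[:, 1:]])  # intercept + remaining variables
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares MLR fit
    y_hat = X @ coef
    return np.corrcoef(y, y_hat)[0, 1]

rng = np.random.default_rng(0)
# 13 observations, 6 variables (1 target + 5 predictors), repeated 1000 times
r_values = np.array([chance_correlation(13, 6, rng) for _ in range(1000)])
print(f"mean r = {r_values.mean():.2f}, largest r = {r_values.max():.2f}")
```

Although the data contain no relationship at all, the correlations between predicted and actual values are typically far from zero.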

In summary, chance correlations have a considerable effect in multivariate models. Thus it is important that the number of variables is low compared to the number of observations. A rule of thumb often quoted in the literature requires the number of observations to be at least 3 times the number of variables. However, it can easily be shown that this rule of thumb is quite useless, especially when extensive feature selection takes place.
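The weakness of the rule of thumb under feature selection can be illustrated with a small simulation. The following is a sketch under stated assumptions: 30 observations, 200 random candidate variables, and a simple selection step that keeps the 10 candidates most strongly correlated with the target, so that the final model formally satisfies the "3 times" rule. These numbers and the selection method are chosen for illustration only and do not come from the original text.

```python
import numpy as np

n_obs, n_candidates, n_selected = 30, 200, 10   # 30 >= 3 * 10, the rule is satisfied

def fit_r(X_vars, y):
    """Correlation between y and its least-squares prediction from X_vars."""
    X = np.column_stack([np.ones(len(y)), X_vars])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.corrcoef(y, X @ coef)[0, 1]

def one_run(rng):
    y = rng.standard_normal(n_obs)
    candidates = rng.standard_normal((n_obs, n_candidates))
    # correlation of every candidate variable with the target
    yc = y - y.mean()
    cc = candidates - candidates.mean(axis=0)
    corr = (cc.T @ yc) / (np.linalg.norm(cc, axis=0) * np.linalg.norm(yc))
    best = np.argsort(np.abs(corr))[-n_selected:]      # feature selection step
    r_with_selection = fit_r(candidates[:, best], y)
    r_without_selection = fit_r(candidates[:, :n_selected], y)  # arbitrary 10 variables
    return r_with_selection, r_without_selection

rng = np.random.default_rng(1)
results = np.array([one_run(rng) for _ in range(200)])
r_selected, r_unselected = results.mean(axis=0)
print(f"mean r with selection:    {r_selected:.2f}")
print(f"mean r without selection: {r_unselected:.2f}")
```

In both cases the model uses 10 variables and 30 observations, yet picking the variables from a large pool of random candidates inflates the apparent correlation considerably.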

Go to the DataLab to carry out some trial calculations on your own. Use the mathematical formula editor to fill a data matrix with random numbers and then try to establish a multiple linear regression (MLR) model between any number of independent variables and one selected target variable. Change the number of observations and repeat the experiment (start with 10 observations, then repeat the experiment with 20 and with 100 observations).
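If DataLab is not at hand, the same trial calculations can be approximated in Python. The sketch below assumes NumPy, 5 random independent variables, and 500 repetitions per setting; these choices and the helper name mean_chance_r are illustrative assumptions and have nothing to do with DataLab itself.

```python
import numpy as np

def mean_chance_r(n_obs, n_predictors, n_repeats, rng):
    """Average correlation between actual and MLR-predicted target values
    when target and predictors are independent random numbers."""
    r = []
    for _ in range(n_repeats):
        y = rng.standard_normal(n_obs)
        X = np.column_stack([np.ones(n_obs),
                             rng.standard_normal((n_obs, n_predictors))])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        r.append(np.corrcoef(y, X @ coef)[0, 1])
    return float(np.mean(r))

rng = np.random.default_rng(2)
# same sequence of sample sizes as suggested in the exercise
mean_r = {n: mean_chance_r(n, 5, 500, rng) for n in (10, 20, 100)}
for n, r in mean_r.items():
    print(f"{n:4d} observations: mean r = {r:.2f}")
```

The average chance correlation shrinks as the number of observations grows while the number of independent variables stays fixed.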