Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.

Correlation and Causality

Observing a correlation between two variables may mislead someone into seeing a causal relationship between these variables. However, this is often not the case. In general, correlations and causality can be interpreted in the following ways:

  1. a controls b
  2. b controls a
  3. a and b are determined by a third variable
  4. a and b exert a mutual influence on each other
  5. a and b do not control each other at all; the correlation is spurious (it occurs by chance)

The third situation is the most common in everyday practice. The last situation is of importance for small samples, because the correlation coefficient shows a broad distribution for low numbers of observations.

 

The following summary presents the most important aspects of correlation and causality:

Correlation by formal means: If two independent variables X and Y are divided by a variable Z which is correlated to either X or Y, the resulting variables X' = X/Z and Y' = Y/Z are correlated.

The same is true for variables which are normalized to a sum of 100 percent (as is often the case with tables of nutritive values). Such variables tend to show a negative correlation, because an increase in one component must be offset by a decrease in the others.

Correlation by inhomogeneity: If the distribution of the data is inhomogeneous (for example, if it consists of several distinct groups), a correlation is likely to occur. It is therefore advisable to plot the variables against each other (scatter plot of X vs. Y).

Example: Shoe size is correlated to income. The larger the shoe size, the higher the income. (Solution: women have, on average, smaller shoe sizes than men and earned less money. Neither group shows an internal correlation, but if both groups are pooled a "correlation" occurs.)

Example: The longer a student needs to finish a degree, the higher the income afterwards. (Solution: the time required to get a degree depends on the subject, e.g. the average time to graduate in philosophy is shorter than in chemistry. Within the group of chemists, the income increases with decreasing time of study, but again, pooling the data creates inhomogeneity and leads to the described correlation.)

Additional (hidden) variables: Variables X and Y are correlated only because a third parameter Z, which is not included in the data set, is correlated to both X and Y. This is particularly hard to discover, since the parameter Z may well be unknown. An important subclass of this type of correlation is time series, where time is the common variable. If both X and Y show a trend in time, a correlation will be observed.

Example: Shoe size is correlated to the calcium content of bones. (Solution: children have less calcium in their bones than adults, and the shoe size of children is naturally also smaller than that of adults. Age is the hidden variable.)

Outliers in the data: Outliers can cause high correlations if they are far enough away from the rest of the data.

Example: A common spike in the signals of an analytical instrument may result in a high correlation between these signals (note: spikes are a common problem in laboratories; they are caused, for example, by the switching of refrigerators).

As an important consequence, we have to state that mathematical correlation is no proof of causality. Correlations must not be interpreted in a causal way unless there is evidence of a causal relationship beyond the correlation.

Last Update: 2012-10-08