Correlation and Causality
Observing a correlation between two variables may mislead someone into assuming a causal relationship between them. However, this is often not the case. In general, a correlation between two variables a and b can arise in the following ways:
- a controls b
- b controls a
- a and b are determined by a third variable
- a and b exert a mutual influence on each other
- a and b do not control each other at all; the correlation is spurious (it happens to occur by chance)
The third situation is the most common in everyday practice. The last situation is important for small samples, because the correlation coefficient shows a broad distribution when the number of observations is low.
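The effect of sample size on chance correlations can be checked with a small simulation. The following is only a sketch: the helper names `pearson` and `r_spread`, the sample sizes (5 and 100), the number of trials, and the use of normally distributed data are all assumptions made for illustration.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_spread(n, trials=2000, seed=0):
    """Standard deviation of r over many samples of two independent variables."""
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        rs.append(pearson(xs, ys))
    return statistics.pstdev(rs)

spread_small = r_spread(5)    # only 5 observations per sample
spread_large = r_spread(100)  # 100 observations per sample
print(f"spread of r, n=5:   {spread_small:.2f}")
print(f"spread of r, n=100: {spread_large:.2f}")
```

Although the two variables are independent in every trial, the coefficients obtained from 5 observations scatter much more widely around zero than those obtained from 100 observations, so a large value of r is far more likely to occur by chance in the small sample.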
The following summary presents the most important aspects of correlation and causality:
|Correlation by formal means
||If two independent variables X and Y are divided by a variable Z which
is correlated to either X or Y, the resulting variables X' and Y' are correlated.
The same is true for variables which are normalized to a sum of 100
percent (as is often the case with tables of nutritive values). Such
variables tend to show a negative correlation, because an increase in one
component necessarily reduces the share of the others.
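Both formal effects can be reproduced with random numbers. This is a minimal sketch under stated assumptions: the three variables are drawn independently from an arbitrary uniform range (1 to 2), so Z is not correlated with X or Y at all, which is the simplest case in which the ratio effect already appears. The helper `pearson` and all variable names are illustrative.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
n = 5000
# Three independent, strictly positive variables (arbitrary range, an assumption).
x = [rng.uniform(1, 2) for _ in range(n)]
y = [rng.uniform(1, 2) for _ in range(n)]
z = [rng.uniform(1, 2) for _ in range(n)]

# The raw variables are uncorrelated.
r_raw = pearson(x, y)

# Dividing both by the same Z induces a positive correlation.
x_ratio = [a / c for a, c in zip(x, z)]
y_ratio = [b / c for b, c in zip(y, z)]
r_ratio = pearson(x_ratio, y_ratio)

# Normalizing the three parts to a sum of 100 percent induces a negative one.
totals = [a + b + c for a, b, c in zip(x, y, z)]
x_pct = [100 * a / t for a, t in zip(x, totals)]
y_pct = [100 * b / t for b, t in zip(y, totals)]
r_closed = pearson(x_pct, y_pct)

print(f"raw: {r_raw:.2f}  ratios: {r_ratio:.2f}  percentages: {r_closed:.2f}")
```

Neither the ratio nor the percentage correlation says anything about a relationship between X and Y; both are produced by the arithmetic alone.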
|Correlation by inhomogeneity
||If the distribution of the data is inhomogeneous (for example, if the
data consist of several distinct groups), a correlation is likely to occur.
It is therefore advisable to plot the variables against each other
(scatter plot of X vs. Y).
||Shoe size is correlated to income. The larger
the shoe size, the higher the income. (Explanation: women earned less money
than men. Neither group shows an internal correlation, but if both groups
are pooled, a "correlation" occurs.)
The longer students need to finish their studies,
the higher their income afterwards. (Explanation: the time required to get
a degree depends on the subject, e.g. the average time to graduate in philosophy
is shorter than in chemistry. Within the group of chemists,
income increases with decreasing time of study, but again, pooling
the data creates inhomogeneity and leads to the described correlation.)
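The pooling effect described in the shoe size example can be sketched with two simulated groups. All numbers below (group means, standard deviations, group size) are invented for illustration and are not taken from any real data set; `pearson` and `make_group` are hypothetical helper names.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)
n = 1000  # observations per group (assumption)

def make_group(shoe_mean, income_mean):
    """Shoe size and income drawn independently of each other within one group."""
    shoes = [rng.gauss(shoe_mean, 1.5) for _ in range(n)]
    incomes = [rng.gauss(income_mean, 400) for _ in range(n)]
    return shoes, incomes

# Two groups that differ in both means (invented values).
shoes_a, income_a = make_group(38, 2500)
shoes_b, income_b = make_group(43, 3200)

r_a = pearson(shoes_a, income_a)  # within group A
r_b = pearson(shoes_b, income_b)  # within group B
r_pooled = pearson(shoes_a + shoes_b, income_a + income_b)

print(f"group A: {r_a:.2f}  group B: {r_b:.2f}  pooled: {r_pooled:.2f}")
```

Within each group the coefficient stays close to zero, while the pooled data show a clear positive correlation that stems only from the difference between the group means. A scatter plot of the pooled data would show two separate clouds.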
|Additional (hidden) variables
||Variables X and Y are correlated, but in fact a third parameter Z,
which is not included in the data set, is correlated to both X and Y. This
is particularly hard to discover, since the parameter Z may well be unknown.
An important subclass of this type of correlation occurs in time series, where
time is the common variable. If both X and Y show a trend in time, a correlation
will be observed.
||Shoe size is correlated to the calcium content
of bones. (Explanation: children have less calcium in their bones than adults,
and naturally the shoe size of children is also smaller than that of adults.)
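For the time series case, the influence of the common variable can be illustrated with a first-order partial correlation, which removes the linear effect of Z from the correlation between X and Y. This is a sketch only: the slopes, noise levels, and series length are arbitrary assumptions, `pearson` and `partial_corr` are illustrative helper names, and the approach presupposes that Z (here: time) is known and measured, which is often not the case in practice.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(3)
n = 1000
t = list(range(n))  # time, playing the role of the common variable Z

# Two series with independent noise that both drift upwards over time.
x = [0.05 * ti + rng.gauss(0, 5) for ti in t]
y = [0.03 * ti + rng.gauss(0, 5) for ti in t]

r_xy = pearson(x, y)               # high, caused by the shared trend
r_xy_given_t = partial_corr(x, y, t)  # near zero once time is accounted for

print(f"r(X, Y) = {r_xy:.2f}   r(X, Y | time) = {r_xy_given_t:.2f}")
```

The two series are generated without any direct link between them, yet their plain correlation is high; it largely disappears as soon as the shared trend is taken into account.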
|Outliers in the data
||An outlier can cause a high correlation if it lies far enough away
from the rest of the data.
||A common spike in the signals of an analytical
instrument may result in a high correlation between these signals (note:
spikes are a common problem in laboratories; they are caused, for example,
by switching refrigerators).
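The influence of a single common spike can be sketched as follows. The two signals are simulated as independent noise, and the spike height (50 noise standard deviations) and signal length are arbitrary assumptions chosen to make the effect obvious; `pearson` is an illustrative helper.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(4)
n = 200

# Two independent noise signals: no real relationship between them.
signal_a = [rng.gauss(0, 1) for _ in range(n)]
signal_b = [rng.gauss(0, 1) for _ in range(n)]
r_clean = pearson(signal_a, signal_b)

# Add one spike at the same position in both signals (hypothetical height).
spiked_a = signal_a + [50.0]
spiked_b = signal_b + [50.0]
r_spiked = pearson(spiked_a, spiked_b)

print(f"without spike: {r_clean:.2f}   with spike: {r_spiked:.2f}")
```

A single shared data point out of 201 is enough to push the coefficient close to 1, which is why inspecting a scatter plot before interpreting r is worthwhile.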
As an important consequence, we have to state that mathematical correlation is no proof of causality. Correlations must not be interpreted in a causal way unless there is evidence of a causal relationship beyond the correlation.