Fundamentals of Statistics contains material from various lectures and courses by H. Lohninger on statistics and data analysis.

Distance and Similarity Measures

Distances between objects in multidimensional space form the basis for many multivariate methods of data analysis. Using a different method for calculating the distances may influence the results of a method considerably. Similarity of objects and distances between them are closely related and are often confused. While the term distance is used more precisely in a mathematical sense, the particular meaning of the term similarity often depends on the circumstances and its field of application.

In general, the distance dij between any two points in n-dimensional space may be calculated by the equation given by Minkowski:

    dij = [ Σk |xik - xjk|^p ]^(1/p)

with k being the index of the coordinates, and p determining the type of distance.

There are three special cases of the Minkowski distance:

  • p = 1: this distance measure is often called city block distance, or Manhattan distance.
  • p = 1, binary data: Hamming distance. The Hamming distance counts the number of bit positions in which two binary values differ.
  • p = 2: with p equalling 2 the Minkowski distance is reduced to the well-known Euclidean distance.
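The special cases above follow directly from the general formula; as an illustrative sketch (the function name minkowski is chosen here and is not part of the text):

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length vectors.

    p = 1 gives the city block (Manhattan) distance,
    p = 2 the well-known Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# p = 2: Euclidean distance of a 3-4-5 right triangle
print(minkowski([0, 0], [3, 4], 2))   # 5.0
# p = 1: city block distance of the same pair of points
print(minkowski([0, 0], [3, 4], 1))   # 7.0
```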

The various forms of the Minkowski distance do not account for different scales of the individual coordinates. If the coordinates span different ranges, the coordinate with the largest range will dominate the results. You therefore have to scale the data before calculating the distances. Furthermore, any correlations between variables (coordinates) will also distort the distances. In order to overcome these drawbacks, one should calculate the Mahalanobis distance, which accounts for both correlation and different scalings.

The Mahalanobis distance is related to the Euclidean distance, and yields the same values for uncorrelated and standardized data. It can easily be calculated by including the inverse covariance matrix C^-1 in the distance computation:

    dij = sqrt[ (xi - xj)' C^-1 (xi - xj) ]
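This computation can be sketched in a few lines, assuming NumPy is available and estimating the covariance matrix from a data matrix with one object per row (the function name mahalanobis is illustrative):

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between points x and y, using the inverse
    covariance matrix estimated from `data` (one object per row)."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ cov_inv @ d))
```

For uncorrelated, standardized data the covariance matrix is (close to) the identity, and the result coincides with the Euclidean distance.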
Another distance measure, which is rather a measure of the similarity of two objects, has been proposed by Jaccard (it is also called the Tanimoto coefficient):

    T = (x.y) / [ (x.x) + (y.y) - (x.y) ]

with (x.y) being the inner product of the two vectors x and y. Note that the Jaccard coefficient equals 1.0 for objects with zero distance. Furthermore, the Tanimoto coefficient can be applied to binary data as well:

T = Nxy / (Nx + Ny - Nxy)

with Nx, Ny ... the number of 1-bits in the vectors x and y, and
Nxy ... the number of 1-bits occurring at the same position in both x and y.
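Both the vector form and the binary form of the Tanimoto coefficient can be sketched directly from the formulas above (the function names are illustrative):

```python
def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient for two real-valued vectors:
    T = (x.y) / [(x.x) + (y.y) - (x.y)]."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)

def tanimoto_binary(x, y):
    """Tanimoto coefficient for two equal-length bit vectors:
    T = Nxy / (Nx + Ny - Nxy)."""
    nx, ny = sum(x), sum(y)
    nxy = sum(a & b for a, b in zip(x, y))
    return nxy / (nx + ny - nxy)

# Nx = 3, Ny = 2, Nxy = 2  ->  T = 2 / (3 + 2 - 2)
print(tanimoto_binary([1, 1, 0, 1], [1, 0, 0, 1]))
```

Note that identical objects (zero distance) yield T = 1.0 in both forms.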