Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and here for more.

Transformation of the Data Space

If modeling techniques are applied to high-dimensional multivariate problems, most methods fail to deliver a fair model because of the complexity of the data space. Although there is some relationship between the data, it cannot be modeled because the relationship is hidden by too many variables (or, stated from another point of view, the relationship is distributed over too large a number of variables). In this case some special pre-processing of the data may enhance the results considerably. The pre-processing should be applied with the knowledge about the data in mind. Generally speaking, data pre-processing is a means of introducing specific knowledge about the data. In mathematical terms, the pre-processing should transform the data space in a way that (1) less variables are needed for the model, and (2) the relationship between the descriptor variables and the target variable becomes simpler.

An extreme but comprehensible example will demonstrate the idea behind transformation of data space. Suppose you have two classes of objects which are described by three parameters x1, x2 and x3. Class 1 forms a cluster having a shape similar to an ellipsoid. The objects of class 2 are all located outside of the ellipsoid.

This is a simple example of a classification problem which can only be solved using non-linear methods. It cannot be solved using linear methods such as multiple linear regression. Now, transform the data space (x1, x2, x3) to another space which is defined by two new descriptors l1 and l2. The new descriptors specify the distance of the objects to the foci of the ellipsoid F1 and F2, respectively. Plotting class 1 in the space (l1-l2) transforms the elliptical cluster of class 1 to a rectangular region with a single (!) linear separating surface.

This simple example clearly shows that the transformation of the data space can ease a given problem considerably. In fact, by knowing that cluster 1 forms an ellipsoid, and introducing some knowledge on analytical geometry, we transformed the data space in such a way that (1) the number of necessary variables is decreased and (2)  the non-linear classification task becomes a linear problem (which is easy to solve using well established methods).