Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.

## Cross Validation

When setting up multivariate models, it is very important to check their validity. While the reliability of well-known linear models can usually be expressed by some theoretical measures (e.g. the F-value or the goodness of fit), the situation is less favourable for other methods, such as neural networks or other non-linear mappings. One particular way to assess the performance of a model is a procedure commonly called cross-validation. It is sometimes loosely referred to as bootstrapping, although bootstrapping in the strict sense is a different resampling technique which draws samples with replacement.

While there are several different flavors of cross-validation, the fundamental idea stays the same: the available data is split into two mutually exclusive sets, a larger one (the 'training' set) and a smaller one (the 'test' set). The larger set is used to set up the model, while the smaller set is used to validate it, i.e. the model is applied to the test set and its results are compared to the expected values (as given in the test set). This process is then repeated with different subsets until each object of the data set has been used exactly once in a test set.
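The splitting scheme described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `k_fold_splits` and its parameters are assumptions made for this example and do not come from the original text, and the objects are taken in their given order (in practice one would usually shuffle them first).

```python
def k_fold_splits(n_objects, n_folds):
    """Yield (training_indices, test_indices) pairs, one per repetition.

    Hypothetical helper for illustration; assumes 2 <= n_folds <= n_objects.
    """
    indices = list(range(n_objects))
    # Spread the objects as evenly as possible over the folds.
    fold_sizes = [
        n_objects // n_folds + (1 if i < n_objects % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # smaller set
        training = indices[:start] + indices[start + size:]   # larger set
        yield training, test
        start += size


# Each object ends up in exactly one test set.
for training, test in k_fold_splits(10, 5):
    print("training:", training, "test:", test)
```

The training and test sets never overlap within a repetition, and the test sets of all repetitions together cover the whole data set.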

The size of the test set for each repetition of the procedure can be adjusted to the user's needs, and mainly depends on the size of the entire data set and on the amount of time and effort available for the cross-validation. There are two conceivable extreme cases: (1) splitting the data set into two equal halves, and (2) selecting only a single object for the test set. The latter approach is also called full cross-validation (or leave-one-out cross-validation) and is in general the preferred approach, although it requires the model to be set up once for every object of the data set.
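To get a feeling for the effort involved in the two extreme cases, one can count how often the model has to be set up. The following sketch assumes a data set of 100 objects; the helper name `repetitions_needed` is made up for this illustration.

```python
import math


def repetitions_needed(n_objects, test_size):
    """Number of training/test repetitions until every object was tested once."""
    return math.ceil(n_objects / test_size)


n_objects = 100  # assumed size of the data set

# Extreme case (1): two equal halves, so the model is set up twice.
halves = repetitions_needed(n_objects, n_objects // 2)

# Extreme case (2): a single object per test set (full cross-validation),
# so the model is set up once per object.
single = repetitions_needed(n_objects, 1)

print(halves, single)  # prints: 2 100
```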

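In order to measure the performance of the model, one should calculate the PRESS value (predicted residual error sum of squares), i.e. the sum of the squared differences between the expected values and the values predicted by the model for the objects of the test sets.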
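As a rough illustration, the sketch below calculates PRESS by full (leave-one-out) cross-validation for a simple straight-line model. The choice of model, the function names `fit_line` and `press_leave_one_out`, and the sample data are assumptions made for this example; any other model could be plugged in instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b).

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def press_leave_one_out(xs, ys):
    """Sum of squared prediction errors, each object being left out once."""
    press = 0.0
    for i in range(len(xs)):
        # Training set: all objects except object i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        # Test set: object i only; compare prediction to expected value.
        residual = ys[i] - (a + b * xs[i])
        press += residual ** 2
    return press


# Made-up sample data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0]
print(press_leave_one_out(xs, ys))
```

A smaller PRESS value indicates better predictions for objects that were not used to set up the model; for data lying exactly on a straight line, this sketch yields a PRESS of (numerically) zero.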

Last Update: 2012-10-08