Fundamentals of Statistics contains material from various lectures and courses by H. Lohninger on statistics and data analysis.

Class Information

Two of the most frequent tasks in statistical data analysis are the classification (categorization) of observations and the interpretation of classified data. In its simplest form, classification assigns a class number (or, more generally, a class property) to each observation.

To make this possible, the class information has to be stored along with the data matrix, usually as ordinal or nominal values.

As an alternative to a dedicated storage vector (red region in the figure above), the class information may equally well be stored as a normal variable of the data matrix. The benefit is that class information, if coded in numerical form, can be used in calculations and can directly control, for example, the color coding of results. On the other hand, keeping class information inside the data matrix carries the risk of mixing it up with the explanatory variables, which can cause problems, for instance when a calculation is applied to all columns of the matrix without excluding the class column.
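The two storage options can be sketched in a few lines of Python. This is a minimal illustration, not the layout used by any particular software: the toy values, the `CLASS_COL` index, and the helper `column_means` are all assumptions made for this example.

```python
# Toy data matrix: 4 observations x 2 explanatory variables (made-up values).
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
]

# Option 1: class information in a dedicated vector, parallel to the rows.
classes = [1, 1, 2, 2]

# Option 2: class information appended as an ordinary column of the matrix.
CLASS_COL = 2  # hypothetical index of the class column
data_with_class = [row + [float(c)] for row, c in zip(data, classes)]


def column_means(matrix, skip=()):
    """Return the mean of every column whose index is not listed in `skip`."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return {
        j: sum(row[j] for row in matrix) / n_rows
        for j in range(n_cols)
        if j not in skip
    }


# With a separate class vector, only explanatory variables are averaged.
means_separate = column_means(data)

# With the class stored in the matrix, a naive call also "averages" the class codes.
means_naive = column_means(data_with_class)

# The mix-up is avoided by explicitly excluding the class column.
means_guarded = column_means(data_with_class, skip={CLASS_COL})
```

Here `means_naive` contains an entry for the class column, a number with no statistical meaning for nominal classes, whereas `means_guarded` matches the result obtained with the dedicated class vector.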

Displaying classified data

When displaying classified data we can distinguish three basic types:

  • the depiction of count data in the form of histograms or pie charts,
  • the representation of the data as a function of the corresponding class, and
  • the coloring of the data points according to the class information.

The first approach conveys the least detail and is normally used only for surveys or simplified presentations. As an example of the second approach, the following figure displays how the concentration of proline (an amino acid) in three kinds of red Italian wine depends on the variety of wine. The x-axis shows the class information (the variety of wine), and the y-axis shows the proline concentration.
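The data preparation behind the first two approaches can be sketched as follows. The variety names are taken from the wine example, but the proline values are invented placeholders rather than real measurements, and the variable names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Class information (wine variety) and one measured variable per observation.
# The numbers are made-up placeholders, not actual wine data.
varieties = ["Barolo", "Barolo", "Grignolino", "Barbera",
             "Grignolino", "Barbera", "Barolo"]
proline = [1100.0, 1300.0, 500.0, 600.0, 400.0, 700.0, 1200.0]

# Approach 1: count data, i.e. the heights of a bar chart or the slices of a pie chart.
counts = Counter(varieties)

# Approach 2: the variable as a function of the class.
# First collect the values belonging to each class ...
by_class = defaultdict(list)
for variety, value in zip(varieties, proline):
    by_class[variety].append(value)

# ... then give each class a position on the x-axis and pair it with the value.
class_order = sorted(by_class)
x_positions = {name: i for i, name in enumerate(class_order)}
points = [(x_positions[v], p) for v, p in zip(varieties, proline)]
```

The list `points` holds one (x, y) pair per observation, with x encoding the class and y the measured value, which is the layout of the class-dependent plot described above.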

Another example shows the coloring of a diagram according to the class information. In the following plot, the same red wines as above are displayed, with the variety of each wine indicated by class-controlled coloring (red = Barbera, green = Grignolino, blue = Barolo). Because the class is now encoded by color, it no longer occupies an axis; this frees a dimension for plotting a further variable, so more detail in the data becomes visible. As a result, the different clusters in our example can be seen more clearly.
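Class-controlled coloring amounts to a lookup from class to color. The sketch below uses the color assignment given in the text; the two measured variables and their values are hypothetical placeholders, since the variables shown in the original plot are not specified here.

```python
# Color assignment as stated in the text.
CLASS_COLORS = {"Barbera": "red", "Grignolino": "green", "Barolo": "blue"}

# Class information plus two measured variables per observation.
# `var_x` and `var_y` are hypothetical names with made-up values.
varieties = ["Barolo", "Grignolino", "Barbera", "Barolo", "Grignolino"]
var_x = [3.1, 1.9, 0.8, 2.9, 2.1]
var_y = [1100.0, 500.0, 600.0, 1250.0, 450.0]

# One color per data point, looked up from its class.
point_colors = [CLASS_COLORS[v] for v in varieties]

# Each point now carries three pieces of information: x, y, and class (as color).
colored_points = list(zip(var_x, var_y, point_colors))
```

A list such as `point_colors` can typically be handed to a plotting library's scatter function as the per-point color argument; both axes then remain available for measured variables.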