Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics and data analysis.

Classifier Performance

Evaluating the performance of a classifier largely depends on the type of classifier. In the most common (and simplest) case we deal with binary classifiers, which produce only two possible outcomes. While the methods for evaluating binary classifiers are well established and straightforward, the situation with multiclass classifiers, which produce more than two possible answers, is much more complicated. Furthermore, the situation may become even more difficult if several classifiers are combined to give the final decision. To keep things simple, the following introduction is restricted to binary classifiers.

When looking at binary classifiers we see that each observation is mapped to one of two binary labels (e.g. to YES and NO, 0 and 1, or healthy and sick) upon classification. The classifier response may be either correct or wrong with respect to the (unknown) reality and may be summarized in a confusion matrix, which contains the counts of all four possible combinations of classification results. If the response of the classifier is correct we speak of "true positive" or "true negative" results, depending on the actual class the observation belongs to. If the classifier is wrong with respect to the true class we speak of "false positive" or "false negative" decisions:

                             actually positive      actually negative
  classified as positive     true positive (TP)     false positive (FP)
  classified as negative     false negative (FN)    true negative (TN)

Correct classifications appear in the diagonal elements of the confusion matrix, and errors appear in the off-diagonal elements.

Some classifier models (such as discriminant analysis, for example) primarily provide a continuous output, which has to be compared to a classification threshold in order to deliver a binary result. In the case of a continuous response the confusion matrix may be visualized by plotting the continuous output of the classification model on one axis and the actual class (the "reality") on the other axis. The classification threshold is indicated by a broken line:

This diagram allows one to judge the reliability of a classifier visually, simply by looking at the distance from the threshold and the density of those observations which fall close to it.
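As a minimal sketch, applying a classification threshold to a continuous classifier output might look as follows in Python (the scores and the threshold of 0.5 are invented for illustration):

```python
# Hypothetical continuous classifier outputs (e.g. discriminant scores)
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]

# The classification threshold (the broken line in the diagram)
threshold = 0.5

# Each observation is mapped to one of the two binary labels
# by comparing its score to the threshold
decisions = ["YES" if s > threshold else "NO" for s in scores]
print(decisions)  # observations 4 and 5 exceed the threshold
```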

In order to specify the classifier performance in a more formal and quantitative way, several measures have been defined. The table below uses the following notation:

N .... total number of observations
TP ... number of true positive classifications
FP ... number of false positive classifications
TN ... number of true negative classifications
FN ... number of false negative classifications
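Given paired lists of actual classes and classifier decisions, the four counts can be obtained directly. The following Python sketch uses invented example data:

```python
# Hypothetical ground truth (1 = positive/sick, 0 = negative/healthy)
# and the corresponding classifier decisions
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each of the four cells of the confusion matrix
TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

N = TP + FP + TN + FN  # total number of observations
```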

True Positive Rate
(recall, hit rate, or sensitivity)
The TP rate is specified by the ratio of true positives to the total number of positives:

TP rate = TP / (TP + FN)

Example: the percentage of sick persons who have been diagnosed as sick.

False Positive Rate
(false alarm rate)
The FP rate is defined as the ratio of false positives to the total number of negatives:

FP rate = FP / (FP + TN)

Example: the percentage of actually healthy persons who have been classified as sick (the probability of a false alarm).
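Both rates follow directly from the confusion matrix counts; a short Python sketch with hypothetical numbers:

```python
# Hypothetical counts taken from a confusion matrix
TP, FN, FP, TN = 45, 5, 10, 90

tp_rate = TP / (TP + FN)  # sensitivity: sick persons diagnosed as sick
fp_rate = FP / (FP + TN)  # false alarm rate: healthy persons diagnosed as sick
```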

Precision
(PPV = positive predictive value)
The precision is the ratio of true positive decisions to the total number of positive decisions:

precision = TP / (TP + FP)

Example: the proportion of actually sick persons among those classified as sick.

Specificity
The specificity is the proportion of correctly classified negative observations to the total number of actually negative observations:

specificity = TN / (TN + FP)

Example: the percentage of persons classified as healthy among all actually healthy persons.
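Precision and specificity can be computed in the same way; a minimal Python sketch, again with hypothetical counts:

```python
# Hypothetical counts taken from a confusion matrix
TP, FN, FP, TN = 45, 5, 10, 90

precision = TP / (TP + FP)    # how many positive decisions were correct
specificity = TN / (TN + FP)  # how many actual negatives were recognized
```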

Negative Predictive Value
The negative predictive value is the ratio of correctly classified negative observations to the set of observations classified as negative:

NPV = TN / (TN + FN)

Example: the proportion of actually healthy persons among those who have been classified as healthy.
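A corresponding sketch for the negative predictive value, with hypothetical counts:

```python
# Hypothetical counts taken from a confusion matrix
TN, FN = 90, 5

# NPV: correctly classified negatives among all observations
# that were classified as negative
npv = TN / (TN + FN)
```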

Accuracy
The accuracy specifies the proportion of correctly classified observations with regard to the entire set of observations:

accuracy = (TP + TN) / N

Prevalence
The prevalence specifies the ratio of actually positive observations to all observations:

prevalence = (TP + FN) / N

Example: The percentage of sick people in the entire population.
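Accuracy and prevalence can be sketched with the same kind of hypothetical counts:

```python
# Hypothetical confusion matrix counts
TP, FN, FP, TN = 45, 5, 10, 90
N = TP + FN + FP + TN  # total number of observations

accuracy = (TP + TN) / N    # fraction of correct decisions
prevalence = (TP + FN) / N  # fraction of actually positive observations
```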

Positive Likelihood Ratio
The positive likelihood ratio is the ratio of the probability of a positive classification among the actually positive observations to the probability of a positive classification among the actually negative observations:

LR+ = [TP / (TP + FN)] / [FP / (FP + TN)] = TP rate / FP rate

Example: a positive likelihood ratio of 50 means that the probability of classifying an actually sick person as sick is 50 times higher than the probability of classifying an actually healthy person as (allegedly) sick.

Negative Likelihood Ratio
The negative likelihood ratio is the ratio of the probability of an erroneous negative classification among the actually positive observations to the probability of a negative classification among the actually negative observations:

LR- = [FN / (TP + FN)] / [TN / (TN + FP)]

Example: a negative likelihood ratio of 0.01 means that the probability of classifying an actually sick person as (allegedly) healthy is 100 times lower (1/0.01 = 100) than the probability of classifying an actually healthy person as healthy.
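Both likelihood ratios follow from the four rates defined above; a Python sketch with hypothetical counts:

```python
# Hypothetical confusion matrix counts
TP, FN, FP, TN = 45, 5, 10, 90

tp_rate = TP / (TP + FN)  # sensitivity
fp_rate = FP / (FP + TN)  # false alarm rate
fn_rate = FN / (TP + FN)  # missed positives among actual positives
tn_rate = TN / (TN + FP)  # specificity

lr_plus = tp_rate / fp_rate   # positive likelihood ratio
lr_minus = fn_rate / tn_rate  # negative likelihood ratio
```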

Receiver Operating Characteristics
Receiver Operating Characteristics (ROC) is a graphical method to visualize the trade-off between benefits (true positives) and costs (false positives). The ROC curve of a given classifier is obtained by plotting the TP rate against the FP rate as the classification threshold is varied.
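The ROC points of a classifier with continuous output can be traced by sweeping the threshold over all observed scores. The following Python sketch uses invented scores and labels:

```python
# Hypothetical continuous scores and the corresponding true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]

pos = sum(labels)         # number of actually positive observations
neg = len(labels) - pos   # number of actually negative observations

# One (FP rate, TP rate) point per threshold, from strict to lenient
roc = []
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    roc.append((fp / neg, tp / pos))
```

Lowering the threshold moves the operating point from the lower left (no positives at all) toward the upper right (everything classified as positive).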