Censored data are data that are only partly known. For example, some measurements may fall below the limit of detection: we know that they are smaller than that limit, but this does not imply that they are zero. Censored data may also arise from missing observations, for instance when some subjects withdraw prematurely from an ongoing study. In effect, censored data are missing values that come with additional information about the range in which the missing values fall.
We may distinguish the following types of censoring:
- right censored data: the exact values of data exceeding a particular threshold are unknown
- left censored data: the exact values of data falling below a particular threshold are unknown
- interval censored data: the exact values are unknown, but each is known to lie within a particular interval
Example: Consider a 10-year survival study of cancer patients. At the end of the study, 34 of the 120 subjects are still alive. Their survival times are right censored: we know only that these 34 patients survived at least 10 years. The median can still be calculated, because the survival times of more than 50% of the patients (86 of 120) are known. The mean, on the other hand, cannot be calculated, because it requires the survival times of all 120 patients.
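The argument about median and mean can be checked with a minimal sketch. The survival times below are hypothetical (the study's actual data are not given); they are constructed only so that 86 of 120 values lie at or below 10 years and 34 lie above, matching the counts in the example.

```python
import statistics

STUDY_LENGTH = 10.0  # years

# Hypothetical true survival times: 86 patients die within the study period,
# 34 survive beyond it. These numbers are illustrative, not real data.
true_times = [10.0 * (i + 1) / 86 for i in range(86)] \
           + [10.0 + 0.5 * (j + 1) for j in range(34)]

# What the study can record: the time capped at the study end,
# plus a flag saying whether the value is right censored.
observed = [(min(t, STUDY_LENGTH), t > STUDY_LENGTH) for t in true_times]
recorded = [t for t, _ in observed]
n_censored = sum(1 for _, censored in observed if censored)

# Fewer than half of the values are censored, so the middle of the sorted
# sample is fully observed and the median is unaffected by the censoring.
median_recorded = statistics.median(recorded)
median_true = statistics.median(true_times)

# The mean of the recorded values is only a lower bound for the true mean.
mean_recorded = statistics.mean(recorded)
mean_true = statistics.mean(true_times)

print(n_censored, median_recorded == median_true, mean_recorded < mean_true)
```

In this constructed sample the median of the recorded values equals the median of the true values, while the mean of the recorded values underestimates the true mean.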
Example: A chemical analysis of the strontium concentration in mushrooms resulted in the following values (mg/kg dry substance): 2.2, 23.2, 18.1, 7.9, <0.1, 2.5, 1.9, 0.4, <0.1, 11.6. As concentrations below the limit of detection (0.1 mg/kg) cannot be measured, the data are left censored.
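One possible way to hold such data in a program is to store each measurement together with a censoring flag. The following sketch parses the strontium values from the example; the function name `parse_measurement` and the string notation `"<0.1"` are assumptions made for illustration.

```python
# Strontium concentrations (mg/kg dry substance) as reported in the example.
raw = ["2.2", "23.2", "18.1", "7.9", "<0.1", "2.5", "1.9", "0.4", "<0.1", "11.6"]

def parse_measurement(text):
    """Return (value, censored). For an entry like '<0.1' the value is the
    limit of detection and censored is True (the true value lies below it)."""
    text = text.strip()
    if text.startswith("<"):
        return float(text[1:]), True
    return float(text), False

parsed = [parse_measurement(entry) for entry in raw]
n_censored = sum(1 for _, censored in parsed if censored)

print(n_censored, "of", len(parsed), "values are left censored")
```

Keeping the flag separate from the number preserves the information that the value is a bound rather than a measurement.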
How can we deal with censored data in practical situations? Specialized approaches for interpreting censored data, such as Kaplan-Meier curves or logrank tests, may be of little help in practical situations. One may therefore try to transform the censored data into numbers so that models can be calculated.
There are four main methods for transforming censored data into numeric values. None of them is optimal, and in most practical cases one would resort to the method requiring the least effort:
- In the simplest case, if data fall below the limit of detection, the censored values are replaced by half of the limit of detection.
- We may try to compute a model of the distribution of the data in the censored range. The censored values are then replaced by random values drawn from this distribution.
- We may try to estimate the censored values by means of a multiple regression model based on the other variables.
- Another way would be multiple imputation, which replaces each censored value several times by plausible values and combines the results obtained from the resulting data sets.
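The first two methods in the list can be sketched for the strontium data from the example above. Replacing by half of the limit of detection follows the text directly. For the random replacement, a uniform distribution between 0 and the limit of detection is used here purely as an assumption; the text does not say which distribution should model the censored range. The helper names `substitute` and `mean` are illustrative.

```python
import random

LOD = 0.1  # limit of detection in mg/kg, from the strontium example

# None marks a value reported as "<0.1".
measured = [2.2, 23.2, 18.1, 7.9, None, 2.5, 1.9, 0.4, None, 11.6]

def substitute(values, fill):
    """Replace each censored entry (None) by fill(); keep measured values."""
    return [fill() if v is None else v for v in values]

def mean(values):
    return sum(values) / len(values)

# Method 1: replace censored values by half of the limit of detection.
mean_half = mean(substitute(measured, lambda: LOD / 2))

# Bounds for comparison: censored values set to 0 and to the limit itself.
mean_low = mean(substitute(measured, lambda: 0.0))
mean_high = mean(substitute(measured, lambda: LOD))

# Method 2: draw replacements from an assumed distribution of the censored
# range (uniform on [0, LOD] is an assumption, not taken from the text).
rng = random.Random(0)
mean_drawn = mean(substitute(measured, lambda: rng.uniform(0.0, LOD)))

print(f"half-LOD: {mean_half:.2f}, bounds: {mean_low:.2f} to {mean_high:.2f}")
```

For this data set the mean after half-limit substitution is about 6.79 mg/kg, and any replacement between 0 and the limit keeps it between about 6.78 and 6.80 mg/kg. The choice of replacement matters little here because only 2 of 10 values are censored and most measurements lie far above the limit; this need not hold for other data.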
Note: The terms "censored" and "trimmed" data are sometimes used as synonyms, which is wrong: trimmed values are simply discarded if they exceed a certain (distribution-dependent) limit, whereas censored data are always processed with the limits in mind.