Exercise - Create a Data Set with Outliers
Detecting outliers is important, but it can be cumbersome. To
gain experience in detecting outliers, design two data sets with
the following features:
Data set 1:
  700 to 1500 data points, normally distributed, with no special measures
  taken against outliers (use the function "gauss" of the DataLab
  command Math/Transformation/Single Formula to create the data set).
Data set 2:
  Approx. 1000 data points, skewed to the right (hint: square the values
  of a normal distribution with zero mean to create the skewed data).
  Change two values of the data set so that one falls outside the
  +/-2.5 sigma range but within the +/-4 sigma range, and the other
  falls outside the +/-4 sigma range.
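For readers who want to prototype the two data sets outside DataLab, the following is a minimal sketch in Python with NumPy. It is not the DataLab "gauss" function; the seed, the sample size of 1000, and the planted offsets of 3 and 5 standard deviations are illustrative assumptions.

```python
import numpy as np

# Fixed seed is an assumption, chosen only to make the sketch reproducible.
rng = np.random.default_rng(seed=42)

# Data set 1: normally distributed, no outliers planted.
data1 = rng.normal(loc=0.0, scale=1.0, size=1000)

# Data set 2: squaring zero-mean normal values gives a right-skewed sample.
data2 = rng.normal(loc=0.0, scale=1.0, size=1000) ** 2

# Plant two values relative to the sample mean and standard deviation:
# one between 2.5 and 4 sigma, one beyond 4 sigma (offsets are assumptions).
mean2, sd2 = data2.mean(), data2.std(ddof=1)
data2[0] = mean2 + 3.0 * sd2
data2[1] = mean2 + 5.0 * sd2

# For right-skewed data the mean lies above the median.
print("mean:", data2.mean(), "median:", np.median(data2))
```

Replacing values shifts the mean and standard deviation slightly, so it is worth recomputing both afterwards and checking that the planted values still lie in the intended ranges.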
Apply the variance/IQR outlier test of DataLab
and report the list of outliers.
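The two kinds of test can be sketched as follows. This is an illustration, not DataLab's implementation: the function names are hypothetical, and the IQR rule is assumed to be the common Tukey fence of 1.5 times the interquartile range beyond the quartiles, which may differ from the limits DataLab uses.

```python
import numpy as np

def sigma_outliers(x, k=2.5):
    """Indices of values more than k sample standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.flatnonzero(np.abs(z) > k)

def iqr_outliers(x, factor=1.5):
    """Indices of values more than factor * IQR below Q1 or above Q3."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.flatnonzero((x < q1 - factor * iqr) | (x > q3 + factor * iqr))

# Small made-up sample: eleven values near 10 and one gross outlier at index 11.
sample = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0,
                   9.7, 10.3, 10.1, 9.9, 10.0, 25.0])
print(sigma_outliers(sample))  # flags index 11
print(iqr_outliers(sample))    # flags index 11
```

The sigma test relies on the mean and standard deviation, both of which are themselves pulled by extreme values, whereas the quartiles used by the IQR test are not.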
Please answer the following questions:
- How many values of data set 1 did you expect to fall outside the +/-2.5
sigma range, and how many values actually fall outside these limits?
- Remove the "outliers" of data set 1 which fall outside the +/-2.5 sigma range,
and repeat the test. What is the result? Is it OK to eliminate outliers
by such a stepwise approach?
- Compare the results of the 2.5 sigma test and the interquartile range test
for both data sets. Explain the difference in sensitivity.
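As a starting point for the first two questions, the sketch below computes the number of values a normal distribution is expected to place outside k sigma, and then repeats the 2.5 sigma test after removing the flagged values. The function name, the seed, and the three repetitions are assumptions made for illustration; your own counts will depend on the data set you created.

```python
import math
import numpy as np

def expected_outside(n, k=2.5):
    """Expected count of n normal values lying more than k sigma from the mean."""
    # For a standard normal Z, P(|Z| > k) = erfc(k / sqrt(2)).
    return n * math.erfc(k / math.sqrt(2.0))

for n in (700, 1000, 1500):
    print(n, round(expected_outside(n), 1))

# Stepwise removal: re-estimate mean and sigma after each trimming pass.
rng = np.random.default_rng(seed=0)  # seed is an assumption
x = rng.normal(size=1000)
for step in range(1, 4):
    z = np.abs((x - x.mean()) / x.std(ddof=1))
    print("pass", step, "points:", len(x), "flagged:", int((z > 2.5).sum()))
    x = x[z <= 2.5]  # keep only the values inside the 2.5 sigma range
```

Because each pass shrinks the estimated standard deviation, the limits tighten and further values can be flagged even though none was planted.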
You may now go directly to DataLab to
experiment with the data.