Fundamentals of Statistics contains material from various lectures and courses by H. Lohninger on statistics and data analysis.

Exercise - Estimation of Boiling Points from Chemical Structure

Chemical structures can be described in many different ways. One particular way is quite useful for setting up quantitative structure-property relationships (QSPR). For each chemical structure investigated, a large number of numerical descriptors are calculated. These descriptors may capture simple features, such as the number of carbon atoms in the structure, or more sophisticated ones, such as quantities derived from graph-theoretical calculations. After calculating the descriptors, you end up with a matrix containing these numbers (one row per structure, one column per descriptor) and a vector holding the chemical or physical property being investigated (e.g. the boiling point). You can then try to find a suitable set of variables and set up a multivariate regression model.
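The setup described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the descriptor values, the property values, and the helper name `fit_mlr` are made up for the example and are not taken from the BOILPTS data set or from DataLab.

```python
import numpy as np

# Hypothetical toy data, NOT the BOILPTS data set.
# Rows: six structures. Columns: two illustrative descriptors
# (number of carbon atoms, number of branches).
X = np.array([
    [1, 0],
    [2, 0],
    [3, 0],
    [4, 0],
    [4, 1],
    [5, 1],
], dtype=float)

# Made-up property vector, generated from a known linear rule so that
# the fitted coefficients can be checked against it.
true_coef = np.array([-150.0, 30.0, -10.0])  # intercept, per carbon, per branch
y = true_coef[0] + X @ true_coef[1:]

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

coef = fit_mlr(X, y)
y_hat = coef[0] + X @ coef[1:]  # model predictions for each structure
print(coef)
```

With real data the property values contain noise and non-linear effects, so the residuals `y - y_hat` will not vanish as they do in this constructed example.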

Use the data set BOILPTS and go to the DataLab to model the boiling point from the given structural descriptors. Try to combine different descriptors to find an optimum combination (just a hint: the model should result in a standard deviation of the residuals below 8.0, a quality of fit of about 0.97, and an F-statistic of about 2300). Try to answer the following questions:

  • How do you justify your selection of variables?
  • How do the MLR results compare with PCR?
  • Do you have any idea how to cope with the remaining non-linearity?
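For readers who want to check the three figures of merit from the hint outside of DataLab, or to compare MLR with PCR, the following Python sketch may help. It rests on several assumptions: the data are synthetic random numbers rather than BOILPTS, the function names are hypothetical, and the formulas are the common textbook definitions (residual standard deviation with n - p - 1 degrees of freedom, coefficient of determination as the quality of fit, overall F-statistic of the regression). DataLab may define these quantities slightly differently.

```python
import numpy as np

def fit_mlr(X, y):
    """Multiple linear regression with intercept; returns the fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def fit_pcr(X, y, k):
    """Principal component regression: regress y on the first k
    principal component scores of the mean-centered descriptors."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T  # projections onto the first k loadings
    A = np.column_stack([np.ones(len(X)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def fit_statistics(y, y_hat, p):
    """Residual standard deviation, r squared, and F-statistic
    for a model with p predictors plus an intercept."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)     # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    s_res = np.sqrt(sse / (n - p - 1))
    r2 = 1.0 - sse / sst
    f = ((sst - sse) / p) / (sse / (n - p - 1))
    return s_res, r2, f

# Synthetic stand-in data (assumption): three descriptors, two of them
# nearly collinear, which is the situation where PCR is usually considered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=50)
y = 2.0 + X @ np.array([1.5, -0.7, 0.4]) + 0.1 * rng.normal(size=50)

stats_mlr = fit_statistics(y, fit_mlr(X, y), p=3)
stats_pcr = fit_statistics(y, fit_pcr(X, y, k=2), p=2)
print("MLR:", stats_mlr)
print("PCR (2 components):", stats_pcr)
```

When all principal components are retained, PCR reproduces the MLR fit, so any difference between the two comes from the components that are dropped. For the remaining non-linearity, one option worth trying is to add a transformed descriptor (for example a squared term) as an extra column of `X` and to inspect the residuals again.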