Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Exercise - Estimation of Henry's Constant from Chemical Structure

The data set HENRYSEM contains various descriptors and properties of 157 substances, among them the logarithm of Henry's constant, the boiling point, and the melting point. The following variables are available:

ln(H)     logarithm of Henry's constant
melt.p.   melting point (deg. Celsius)
boil.p.   boiling point (deg. Celsius)
DENS20    density at 20 deg. Celsius
nD20      refractive index at 20 deg. Celsius
Hv(LB)    enthalpy of evaporation
compact   topological index indicating the compactness of a molecule
rad       topological radius
dia       topological diameter
nvz       number of branches in the molecule
Randic    Randic index
RdOz      modified Randic index
NMethyl   number of methyl groups in molecule
TJ        topological index J (defined by Balaban)
C         number of carbon atoms
H         number of hydrogen atoms
O         number of oxygen atoms
N         number of nitrogen atoms
SumH      number of hetero (non-H, non-C) atoms in molecule
MWgt      molecular weight
LOIX      topological index reflecting electronegativities

Use these data and go to the DataLab to model Henry's constant from the molecular descriptors. Try to compare several methods, i.e. multiple linear regression (MLR, in combination with forward selection of variables), principal component regression (PCR), and artificial neural networks (ANN, here RBF networks). Which of the models is "best"?
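For readers who want to try the comparison outside DataLab, the MLR and PCR part can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the DataLab procedure: the descriptor matrix X and the target y are random placeholder numbers standing in for the HENRYSEM table, the function names (cv_rmse, mlr, pcr, forward_selection) are hypothetical, and "best" is taken here to mean the lowest cross-validated RMSE, which is one possible criterion among several. RBF networks are not covered by the sketch.

```python
import numpy as np

def cv_rmse(fit_predict, X, y, k=5, seed=0):
    """k-fold cross-validated RMSE for any fit_predict(X_train, y_train, X_test)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred = fit_predict(X[train], y[train], X[test])
        sq_err += np.sum((y[test] - pred) ** 2)
    return np.sqrt(sq_err / len(y))

def mlr(X_tr, y_tr, X_te):
    """Multiple linear regression with intercept, fitted by least squares."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef[0] + X_te @ coef[1:]

def pcr(n_comp):
    """Principal component regression using the first n_comp components."""
    def fit_predict(X_tr, y_tr, X_te):
        # Standardize with training statistics only, so the test fold stays unseen
        mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
        sd[sd == 0] = 1.0
        Z_tr, Z_te = (X_tr - mu) / sd, (X_te - mu) / sd
        # Rows of Vt are the principal axes of the standardized training data
        _, _, Vt = np.linalg.svd(Z_tr, full_matrices=False)
        P = Vt[:n_comp].T
        # Regress y on the component scores
        return mlr(Z_tr @ P, y_tr, Z_te @ P)
    return fit_predict

def forward_selection(X, y, max_vars=5):
    """Greedy forward selection: add the variable that lowers the CV-RMSE most."""
    selected, best = [], np.inf
    while len(selected) < max_vars:
        scores = {j: cv_rmse(mlr, X[:, selected + [j]], y)
                  for j in range(X.shape[1]) if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:  # no further improvement: stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# PLACEHOLDER DATA ONLY: random numbers, not the HENRYSEM values.
# Replace X with the descriptor columns and y with ln(H) (or boil.p., melt.p.).
rng = np.random.default_rng(42)
X = rng.normal(size=(157, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.3, size=157)

selected, rmse_fs = forward_selection(X, y)
rmse_mlr_all = cv_rmse(mlr, X, y)
rmse_pcr = {a: cv_rmse(pcr(a), X, y) for a in (2, 4, 8)}

print("forward selection, columns:", selected, "CV-RMSE:", rmse_fs)
print("MLR on all columns, CV-RMSE:", rmse_mlr_all)
for a, r in rmse_pcr.items():
    print("PCR with", a, "components, CV-RMSE:", r)
```

The principal components are recomputed inside every cross-validation fold, so the test objects never influence the model that predicts them. Note that PCR with all components gives the same predictions as MLR on all variables, so any difference between the two methods comes from truncating the components.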

What about modeling the boiling points and the melting points by using this data set?

Do you have an explanation for the difference between the results for boiling points and melting points?