Connecting Chemometrics to Statistics - Part I: The Chemometrics Side

May 01, 2006
Volume 21, Issue 5

This series of columns has been running for a long time. Long-time readers will recall that it has even changed its name since its inception. The original name was "Statistics in Spectroscopy." This was a multiple pun, as it referred to the science of Statistics in the journal Spectroscopy and the science of Statistics in the science of Spectroscopy as well as statistics (the subject of the science of Statistics) in the journal Spectroscopy. [See our third column ever (1) for a discussion of the double meaning of the word "Statistics." The same discussion is found in the book based upon those first 38 columns (2).]

Our goal then, as now, was to bring the study of chemometrics and the study of statistics closer together. While there are isolated points of light, it seems that many people who study chemometrics have no interest in, and do not appreciate, the statistical background upon which many of our chemometric techniques are based, nor do they appreciate the usefulness of the techniques that we could learn from that discipline. Worse, there are some who actively denigrate and oppose the use of statistical concepts and techniques in the chemometric analysis of data. The first group can, perhaps, claim unfamiliarity (ignorance?) with statistical concepts. It is difficult, however, to find excuses for the second group.

Nevertheless, at the most fundamental level, there is a deep and close connection between the two disciplines. How could it be otherwise? Chemometric concepts and techniques are based upon principles that were formulated by mathematicians hundreds of years ago, even before the label "statistics" was applied to the subfield of mathematics that deals with the behavior and effect of random numbers on data. Recognition of statistics as a distinct subdiscipline of mathematics also goes back a long way, certainly long before the term "chemometrics" was coined to describe a subfield of that subfield.

Before we discuss the relationship between these two disciplines, it is, perhaps, useful to consider what they are. We have already defined "statistics" as ". . . the study of the properties of random numbers . . ." (3).

A definition of "chemometrics" is a little trickier to come by. The term originally was coined by Kowalski, but currently, many chemometricians use the definition by Massart (4). On the other hand, one compilation presents nine different definitions for "chemometrics" (5,6) (including "what chemometricians do," a definition that apparently was suggested only half humorously). But our goal here is not to get into the argument over the definition of the term, so for our current purposes, it is convenient to consider a somewhat simplified definition of "chemometrics" as meaning "multivariate methods of data analysis applied to data of chemical interest."

This definition is convenient because it allows us to jump directly to what is arguably the simplest chemometric technique in use, and to consider it as the prototype for all chemometric methods; that technique is multiple regression analysis. Written out in matrix notation, multiple regression analysis takes the form of a relatively simple matrix equation:

$$B = [A^{T}A]^{-1}A^{T}C$$

where B represents the vector of coefficients, A represents the matrix of independent variables, and C represents the vector of dependent variables.
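
As a rough illustration of how B, A, and C relate in practice, the following minimal sketch computes the regression coefficients numerically. It assumes the usual least-squares (normal-equation) solution, in which B satisfies (AᵀA)B = AᵀC. The function name `fit_multiple_regression` and the small data set are hypothetical, invented only for this example, and NumPy is an assumed tool rather than anything the column prescribes.

```python
import numpy as np

def fit_multiple_regression(A, C):
    """Return least-squares coefficients B such that C is approximated by A @ B.

    Assumes the normal-equation form (A^T A) B = A^T C.
    """
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    # Solve the linear system directly instead of forming the inverse explicitly;
    # this is algebraically equivalent but numerically better behaved.
    return np.linalg.solve(A.T @ A, A.T @ C)

# Hypothetical, noise-free data: 5 samples, 2 independent variables.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
true_B = np.array([0.5, -1.5])   # coefficients used to build the example
C = A @ true_B                   # dependent variable generated from A and true_B

B = fit_multiple_regression(A, C)
print(B)  # should reproduce true_B to within floating-point rounding
```

Because the example data contain no noise, the recovered coefficients match the ones used to generate C. With real measurements, the random error in C carries over into B, and when the columns of A are nearly collinear a routine such as `np.linalg.lstsq` is generally a safer choice than solving the normal equations directly.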