Connecting Chemometrics to Statistics - Part II: The Statistics Side

Jerome Workman Jr.

June 1, 2006

Spectroscopy

Volume 21

Issue 6

In this month's installment of "Chemometrics in Spectroscopy," the authors again explore that vital link between statistics and chemometrics, this time with an emphasis on the statistics side.

In part I of this column series, we worked out the relationship between the calculus-based approach to least-squares calculations and the matrix algebra approach to least-squares calculations, using a chemometrics-based approach (1). Now we need to discuss a topic squarely based in the science of statistics.

The topic we will discuss is analysis of variance (ANOVA). This is a topic we have discussed previously — in fact, several times. Put into words, ANOVA shows that when several different sources of variation act on a datum, the total variance of the datum equals the sum of the variances introduced by each individual source. We first introduced the mathematics of the underlying concepts behind this in (2), then discussed its relationship to precision and accuracy (3), the connection to statistical design of experiments (4–6), and its relation to calibration results (7,8).

All of those discussions, however, were based upon considerations of the effects of multiple sources of variability on only a single variable. To compare statistics with chemometrics, we need to enter the multivariate domain, and so we ask the question: "Can ANOVA be calculated on multivariate data?" The answer to this question, as our long-time readers will undoubtedly guess, is "Of course, otherwise we wouldn't have brought it up!"

Multivariate ANOVA

Therefore, we come to the examination of ANOVA of data depending upon more than one variable. The basic operation of any ANOVA is the partitioning of the sums of squares.

A multivariate ANOVA, however, has some properties different from those of the univariate ANOVA. To be multivariate, obviously there must be more than one variable involved. As we like to do, then, we consider the simplest possible case; and the simplest case beyond univariate is obviously to have two variables. The ANOVA for the simplest multivariate case (that is, the partitioning of sums of squares of two random variables, X and Y) proceeds as follows. From the definition of variance:

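$$\mathrm{Var}(X+Y) = \frac{\sum_{i=1}^{n}\left[(X_i+Y_i)-\overline{(X+Y)}\right]^2}{n-1} \tag{1}$$

expanding equation 1 and noting that

$$\overline{(X+Y)} = \bar{X}+\bar{Y}$$

results in:

$$\mathrm{Var}(X+Y) = \frac{\sum_{i=1}^{n}\left(X_i+Y_i-\bar{X}-\bar{Y}\right)^2}{n-1} \tag{2}$$

expanding still further:

$$\mathrm{Var}(X+Y) = \frac{\sum_{i=1}^{n}\left(X_i^2+Y_i^2+\bar{X}^2+\bar{Y}^2+2X_iY_i-2X_i\bar{X}-2X_i\bar{Y}-2Y_i\bar{X}-2Y_i\bar{Y}+2\bar{X}\bar{Y}\right)}{n-1} \tag{3}$$

Then we rearrange the terms as follows:

$$\mathrm{Var}(X+Y) = \frac{\sum_{i=1}^{n}\left[\left(X_i^2-2X_i\bar{X}+\bar{X}^2\right)+\left(Y_i^2-2Y_i\bar{Y}+\bar{Y}^2\right)+2\left(X_iY_i-X_i\bar{Y}-\bar{X}Y_i+\bar{X}\bar{Y}\right)\right]}{n-1} \tag{4}$$

and upon collecting terms and replacing Var(X + Y) with its original definition, this can finally be written as:

$$\frac{\sum_{i=1}^{n}\left[(X_i+Y_i)-\overline{(X+Y)}\right]^2}{n-1} = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}{n-1}+\frac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}{n-1}+\frac{2\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{n-1} \tag{5}$$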

The first two terms on the right-hand side of equation 5 are the variances of X and Y. The third term, the numerator of which is known as the cross-product term, is called the covariance between X and Y. We also note (almost parenthetically) here that multiplying both sides of equation 5 by (n – 1) gives the corresponding sums of squares; hence, equation 5 essentially demonstrates the partitioning of sums of squares for the multivariate case.
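For readers who like to see the numbers, the partitioning in equation 5 can be checked with a few lines of Python. This is only a minimal sketch: the data values are made up for illustration, and the helper names (`mean`, `sum_sq`, `cross_prod`) are our own, not anything taken from the earlier columns.

```python
import math

# Made-up illustrative data; any two equal-length series would do.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 9.0]
n = len(x)

def mean(v):
    """Arithmetic mean of a sequence."""
    return sum(v) / len(v)

def sum_sq(v):
    """Sum of squared deviations from the mean."""
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v)

def cross_prod(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

# The summed variable X + Y
s = [xi + yi for xi, yi in zip(x, y)]

var_sum = sum_sq(s) / (n - 1)         # left-hand side of equation 5
var_x = sum_sq(x) / (n - 1)           # first term on the right
var_y = sum_sq(y) / (n - 1)           # second term on the right
cov_xy = cross_prod(x, y) / (n - 1)   # covariance (third term, without the 2)

rhs = var_x + var_y + 2 * cov_xy

print("Var(X+Y)               :", var_sum)
print("Var(X)+Var(Y)+2Cov(X,Y):", rhs)
print("agree to rounding      :", math.isclose(var_sum, rhs))
```

Multiplying every term by (n – 1) leaves the comparison unchanged, which is the sums-of-squares form of the same statement.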

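Let's discuss some of the terms in equation 5. The simplest way to think about the covariance is to compare the third term of equation 5 with the numerator of the expression for the correlation coefficient. In fact, if we divide the last term on the right-hand side of equation 5 (leaving aside its factor of two) by the standard deviations (the square roots of the variances) of X and Y in order to scale the cross-product by the magnitudes of the X and Y variables and make the result dimensionless, we obtain:

$$\frac{\dfrac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}{n-1}}\;\sqrt{\dfrac{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}{n-1}}} \tag{6}$$

and after canceling the "n – 1"s, we get exactly the expression for R, the correlation coefficient:

$$R = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2\;\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}} \tag{7}$$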

There are several critical facts that come out of the partitioning of sums of squares and its consequences, as shown in equations 5 and 7. One is the fact that in the multivariate case, variances add only as long as the variables are uncorrelated — that is, the correlation coefficient (or the covariance) is zero.
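The point about uncorrelated variables can be illustrated numerically. The sketch below uses made-up data, deliberately constructed so that the cross-product term is exactly zero in one case and the correlation is perfect in the other; the function names are our own illustrative choices.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cross_prod(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

def variance(v):
    """Sample variance: sum of squares divided by n - 1."""
    return cross_prod(v, v) / (len(v) - 1)

def correlation(u, v):
    """Correlation coefficient: cross-product scaled by both sums of squares."""
    return cross_prod(u, v) / math.sqrt(cross_prod(u, u) * cross_prod(v, v))

def var_of_sum(u, v):
    return variance([ui + vi for ui, vi in zip(u, v)])

# Case 1: constructed so that the cross-product term vanishes.
x = [-1.0, 1.0, -1.0, 1.0]
y = [-1.0, -1.0, 1.0, 1.0]
r_uncorr = correlation(x, y)
print("R =", r_uncorr)
print("Var(X+Y) =", var_of_sum(x, y), " Var(X)+Var(Y) =", variance(x) + variance(y))

# Case 2: y2 is an exact linear function of x, so the two are perfectly correlated.
y2 = [2.0 * xi + 1.0 for xi in x]
r_corr = correlation(x, y2)
print("R =", r_corr)
print("Var(X+Y2) =", var_of_sum(x, y2), " Var(X)+Var(Y2) =", variance(x) + variance(y2))
```

In the first case the two printed variances agree; in the second, the simple sum of the variances falls short of Var(X + Y) by twice the covariance, as equation 5 requires.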

There are two (count them: two) more critical developments that come from this partitioning of sums of squares. First, the correlation coefficient is not just an arbitrarily chosen computation (or even concept), but as we've seen, bears a close and fundamental relationship to the whole ANOVA concept, which is itself a very fundamental statistical operation that data are subject to. As we've seen here, all of these quantities: standard deviation, correlation coefficient, and the whole process of decomposing a set of data into its component parts, are related very closely to each other, because they all represent various outcomes obtained from the fundamental process of partitioning the sums of squares.

The second critical fact that comes from equation 5 can be seen when you look at the chemometric cross-product matrices used for calibrations (least-squares regression, for example, as we discussed in reference 1). What is this cross-product matrix that is often so blithely written in matrix notation as A^T A as we saw in our previous column? Let's write one out (for a two-variable case like the one we are considering, with the X and Y values as the two columns of A) and see:

$$A^{T}A = \begin{bmatrix} \sum X_i^2 & \sum X_iY_i \\ \sum X_iY_i & \sum Y_i^2 \end{bmatrix}$$

Gosh darn it, those terms look familiar, don't they? If they don't, check equation 13b again in reference 1 and equation 5 in this column.

And note a fine point we've deliberately ignored until now: that in equation 5 the (statistical) cross-product term was multiplied by two. This translates into the two appearances of that term in the (chemometrics) cross-product matrix.

This is where we see the convergence of statistics and chemometrics. The cross-product matrix, which appears so often in chemometric calculations and is so casually used in chemometrics, thus has a very close and fundamental connection to what is one of the most basic operations of statistics, much though some chemometricians try to deny any connection. That relationship is that the sums of squares and cross-products in the cross-product matrix (as per the chemometric development of equation 10 in reference 1) equal the sum of squares of the original data (as per the statistics of equation 5). These relationships are not approximations, and not "within statistical variation," but, as we have shown, are mathematically (algebraically) exact quantities.
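A short Python sketch makes the correspondence concrete. It assumes (our assumption, for illustration) that the two columns of A hold the mean-centered X and Y values, so that the elements of A^T A are exactly the sums of squares and cross-products of equation 5; the data and the function names are made up and do not reproduce the notation of reference 1.

```python
import math

# Made-up illustrative data.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 9.0]

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

# Data matrix A: one row per observation, columns are centered X and Y.
A = [[xi, yi] for xi, yi in zip(center(x), center(y))]

def cross_product_matrix(rows):
    """Compute A^T A for a matrix stored as a list of rows."""
    ncols = len(rows[0])
    return [[sum(r[j] * r[k] for r in rows) for k in range(ncols)]
            for j in range(ncols)]

ata = cross_product_matrix(A)

# Diagonal: sums of squares of X and Y.
# Off-diagonal: the cross-product term, which appears twice.
total = sum(sum(row) for row in ata)

# Sum of squares of the summed variable X + Y, i.e. (n - 1) * Var(X + Y).
ss_sum = sum(v ** 2 for v in center([xi + yi for xi, yi in zip(x, y)]))

print("A^T A =", ata)
print("sum of all elements of A^T A:", total)
print("sum of squares of X + Y     :", ss_sum)
```

The two off-diagonal elements are identical, which is where the factor of two in equation 5 comes from, and the four elements together reproduce the sum of squares of X + Y.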

Furthermore, the development of these cross-products in the case of the chemometric development of a solution to a data-fitting problem came out of the application of the least-squares principle. In the case of the statistical development, neither the least-squares principle nor any other such principle was, or needed to be, applied. The cross-products were arrived at purely from a calculation of a sum of squares, without regard to what those sums of squares represented; they certainly were not designed to be a least-squares estimator of anything.

So here we have it: the connection between statistics and chemometrics. But this is only the starting point. It behooves all of us to pay more attention to this connection. There is a lot that statistics can teach all of us.

Jerome Workman, Jr. serves on the Editorial Advisory Board of Spectroscopy and is director of research, technology, and applications development for the Molecular Spectroscopy & Microanalysis division of Thermo Electron Corp. He can be reached by e-mail at: jerry.workman@thermo.com.

Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics (Suffern, NY). He can be reached via e-mail: hlmark@prodigy.net.

References

(1) H. Mark and J. Workman, Spectroscopy 21(5), 34–38 (2006).

(2) H. Mark and J. Workman, Spectroscopy 3(3), 40–42 (1988).

(3) H. Mark and J. Workman, Spectroscopy 5(9), 47–50 (1990).

(4) H. Mark and J. Workman, Spectroscopy 6(1), 13–16 (1991).

(5) H. Mark and J. Workman, Spectroscopy 6(4), 52–56 (1991).

(6) H. Mark and J. Workman, Spectroscopy 6(7), 40–44 (1991).

(7) H. Mark and J. Workman, Spectroscopy 7(3), 20–23 (1992).

(8) H. Mark and J. Workman, Spectroscopy 7(4), 12–14 (1992).
