Statistics, Part I: First Foundation

Author(s): Howard Mark, Jerome Workman Jr.

We present the first of a short set of columns dealing with the subject of statistics. This current series is organized as a “top down” view of the subject, as opposed to the usual literature (and our own previous) approach of giving a “bottom up” description of the multitude of equations that are encountered. We hope this different approach will succeed in giving our readers a more coherent view of the subject, as well as persuading them to undertake further study of the field.

As a rejuvenation of a previous attempt to disseminate and promulgate correct statistical understanding and methodology, we present the first of a short set of columns dealing with the subject of statistics. This current series is organized as a top-down view of the subject, as opposed to the usual literature (and our own previous) approach of giving bottom-up descriptions of the multitude of equations that are encountered. We hope this different approach will succeed in giving our readers a more coherent view of the subject, as well as persuading them to undertake further study of the field.

A good number of years ago we published a series of 37 columns called “Statistics in Spectroscopy,” which ran from the February 1987 to the January 1993 issues of Spectroscopy. That series was the predecessor of the current series of “Chemometrics in Spectroscopy” columns that you are now reading. There is also an eponymous book, Statistics in Spectroscopy, originally published by Academic Press (1) (and still available at amazon.com and barnesandnoble.com), based on that series of columns. The rationale for that series of columns was this: When I was first learning statistics, I was fortunate to have the consultation services of a first-class statistician available. I soon learned what could be done with data when the statistical methodology is handled properly and what statistics can accomplish when applied by someone truly knowledgeable. At that point my reaction was something like, “Every experimental scientist needs to learn this stuff!” At the time it seemed that, unfortunately, most chemists and spectroscopists did not have the time or inclination to learn this field of study, despite being exposed to some of the peripheral concepts and applications. This led to some incorrect and strange pronouncements appearing in the literature (sometimes, in the phrase generally attributed to the physicist Wolfgang Pauli, “not even wrong!”). We published those statistics columns, and ultimately the book, as an attempt to disseminate correct statistical concepts and methodology, and hopefully to motivate scientists to learn the proper use of the tools of that field. We received some positive feedback, and seemed to persuade at least some scientists to learn to use statistics properly. Over the intervening years, however, the field has crept back to a state in which authors espouse uninformed pronouncements, again giving incorrect results based on incorrect explanations. We hope to mitigate at least some of that behavior here, with this short series of columns.

The presentation here will be completely different from the previous ones (both in the literature and in our previous effort), and hopefully nobody will confuse what we say here with what we said there. Even better, readers may want to read that version as well as this one (the book and the previous series of columns are effectively the same). Anyone wanting to dig a little deeper and learn something about how the various equations commonly used arise out of the underlying concepts could do worse than to obtain that book (1) and peruse it or, alternatively, to obtain any of the many books written by statisticians and study from that, or to take a course in the subject.

Here, we’re going to take a different approach, however. Besides having a more limited venue in which to discuss what is a fairly large topic, what we want to do here is discuss the foundations of the science of statistics, which itself is the foundation of chemometrics, although hardly anybody these days seems to know that. Chemometrics (literally meaning the measurement of chemistry) would appear to have come into the world fully formed in the 1980s, with hardly any background to it. In the early days of chemometrics, the emphasis was (and to a large extent still is) on the multivariate mathematics and the ability to sling the multivariate equations, with little attention paid to many of the underlying historical concepts that formed the foundation of the mathematics. Nowadays, these fundamental principles are hidden in the intricacies of the multivariate equations that are routinely used and presented, but rarely described or analyzed.

These foundations were built by mathematicians using them to create the science of statistics, even before statistics was recognized as a distinct discipline. However, ignoring those principles, and the requirements they place on the application of the mathematics, leads to subtle, and sometimes not-so-subtle, errors in the results, lending some credence to Mark Twain’s famous dictum, “There are lies, damned lies and statistics.”

Incidentally, anyone reading this column with the expectation that they will learn all about statistics will be sorely disappointed. We could no more teach anyone statistics in the space of an essay of this length (even if that were the purpose) than a doctor could teach someone heart surgery that way, a chemist teach, say, organic chemistry, or even an automobile mechanic teach how to diagnose and repair an engine. What we hope to do in this limited space is to lay some groundwork and explain some of the fundamental concepts of the discipline. This will enable any interested person who decides they want to undertake further study to have learned some of the jargon and be able to pick up a real statistics book (or even better, take a course) and have some orientation toward the subject matter, so that it will present a shorter and less steep learning curve.

The word statistics itself appears to take its origin in the word state, based on the need for political states to collect the data needed to know and understand the makeup and demographics of the people under their jurisdiction. The data were needed for the state to rationally implement its activities of taxation, conscription, and so on. Over the years, the meaning has changed considerably and expanded to include formal methods for the analysis of scientific and other types of data. For our current purposes it can be thought to have the meaning “the study of the properties and behavior of random numbers” and how to use the knowledge gained from that study to draw valid conclusions from data, even in the face of randomness. A good statistician can figure out what your data are telling you and, even more importantly, what they are not telling you, a far cry from the opinion of the field espoused by Twain.

While a newcomer to the study of statistics can become confused and disoriented by the plethora of equations encountered, and the differences in those equations that arise when dealing with situations that are only slightly different, modern statistics rests, at the bottom, on three foundational principles. So, for the most part we will not use a mathematical presentation, and will minimize the use of equations. Another way to express this approach is to say that rather than looking at the subject from the bottom up by getting involved in the nitty-gritty of the equations, we will inspect it from the top down, concentrating on underlying principles that can apply to all major aspects of the use and application of statistical thought.

The Foundations

First Foundation: Probability Theory

The beginnings of the science of statistics can be traced to a famous event (famous, at least, among statisticians), wherein a nobleman was addicted to playing dice, but found that he was regularly losing money at that activity. He sought the assistance of Gerolamo Cardano, a mathematician already prominent in his day. (Please note some of the scene is dim here: How did he know a mathematician could help him? How did he know which one and how was he able to contact him?) In any case, after some consideration of the nature of the problem Cardano wrote down the 36 ways two dice can fall and counted the number of times each value (from 2 to 12) occurred. After making some assumptions about the dice (for example, that each throw of a die was independent of the other die; that the value thrown on each die occurred randomly and was independent of previous throws of that die; and that “sufficiently many” throws were made so that his conclusions were valid for the “long run”), he was able to calculate the probability of each value occurring. He then recommended betting strategies to the nobleman based on his findings, whereupon the nobleman started winning (over the long run, at least). We can imagine that Cardano was well-compensated for his advice! Thereupon, many other mathematicians of the day, whose names are now almost household words (depending on the household, of course): Galileo, Pascal, Fermat, Leibnitz, Bernoulli, Gauss, Newton, and other scientific luminaries throughout the years, got on the bandwagon of generating and disseminating the knowledge of this new branch of mathematics. In the process of doing that, they inevitably generalized their results and created the theories and theorems for describing and calculating the behavior of random data that we now recognize as the foundations of modern statistical analysis.
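The enumeration described above is easy to reproduce. The following is a minimal sketch in Python, assuming two fair six-sided dice that fall independently; the function name two_dice_probabilities is ours and is used purely for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_probabilities():
    """Return {total: probability} for the sum of two fair six-sided dice."""
    # List all 36 equally likely (first die, second die) outcomes.
    outcomes = list(product(range(1, 7), repeat=2))
    # Count how many of the 36 outcomes produce each total from 2 to 12.
    counts = Counter(a + b for a, b in outcomes)
    return {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}


if __name__ == "__main__":
    for total, p in two_dice_probabilities().items():
        print(f"{total:2d}: {p} ({float(p):.3f})")
```

Exact fractions are used rather than floating-point numbers so that the counts out of 36 remain visible in the result.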

The next step in this saga was the realization among scientists in all disciplines, but especially astronomers at that time, that the uncontrollable fluctuations that inevitably contaminated their measurements were also caused by random factors, and could be treated similarly to the fluctuations seen in gambling, by applying to the scientific data the same kind of methodology that was applied to gambling results.

In modern times, of course, we realize that virtually any and all measured data are subject to fluctuations from uncontrollable sources. We generally call these fluctuations noise or error. Metrologists, such as those at the Bureau of Standards, nowadays called the National Institute of Standards and Technology (NIST), go further and recognize that errors can be of two types: systematic and random. They designate the total departure of a given measurement from truth as the uncertainty of the measurement. Systematic errors are presumed to arise from mechanical, electrical, optical, or other physical causes. Then, by application of knowledge from the appropriate field, the source of the discrepancy can be tracked down and eliminated. For this reason, the science of statistics, and we ourselves here, are concerned with the effects of the random portion of the uncertainty, not least because it can help discern and, even better, calculate the systematic errors so that they can be corrected. Over the years, statisticians have discovered much about the nature and behavior of random errors, and have learned how to specify bounds for what can legitimately be said about the data that are subject to these errors.

One of the more important findings is that if the data are subject to fluctuations because of random error, then anything you calculate from those data will also be subject to fluctuations. Thus, if you select a set of samples and measure some property of each sample, the value of anything you calculate from the several measured values (their mean, say) will differ from the corresponding calculation made on data from a similar but other set of samples, purely because of the random fluctuations of the data and the fact that the random errors will necessarily differ between the two sets. That is, if you select a set of samples, measure some property of them and calculate their mean, then perform a corresponding calculation on data from a second group of samples, you will get a different value for the mean, purely because the data are different than those from the first set. This is true even for such basic calculated values as the mean or standard deviation (SD) of the data. More importantly, the magnitudes of these intersample differences tell us some other important things:

The future differences expected between samples similar to the ones at hand.

The differences between the measured values and truth, and because of that, an estimate of the true value of the measured characteristic that would be obtained if an error-free measurement were possible.

The larger the sample set you take, or the more samples you average together, the closer these averages tend to cluster around some constant value, which thus becomes a better approximation of the true value of what you’re measuring, as more and more measurements are made.
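The behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" value of 10.0, the noise SD of 2.0, the normally distributed errors, and the function name spread_of_means are all hypothetical choices made for the illustration.

```python
import random
import statistics


def spread_of_means(sample_size, n_sets=2000, seed=0):
    """SD of the means of n_sets simulated sample sets of a given size."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_sets):
        # Each reading is an assumed true value (10.0) plus random error (SD 2.0).
        readings = [rng.gauss(10.0, 2.0) for _ in range(sample_size)]
        means.append(statistics.fmean(readings))
    # The means differ from set to set; their SD measures how much.
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (4, 25, 100):
        print(f"sets of {n:3d} readings: SD of the means = {spread_of_means(n):.3f}")
```

Under these assumptions the SD of the means shrinks roughly in proportion to the square root of the number of readings averaged, which is the clustering around a constant value described above.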

To the statistician, a value calculated from a set of data subject to random fluctuations is what is known as a statistic. That is the specific technical meaning of the term in the field (the jargon).

An important difference between samples (or the data therefrom) and statistics is the distribution of the variations. The distribution of values of multiple measurements from a sample, or the distribution of values of measurements from multiple samples can be almost anything, and is determined by the physics or chemistry (or other applicable discipline) of the situation governing the behavior of the samples.

The distribution of values of statistics, on the other hand, has been shown with mathematical rigor to depend not so much on the samples or the physics as on the nature of the calculations performed on the measured data. Thus, the mean of a set of samples will follow a normal (Gaussian) distribution, the variance (square of the standard deviation) will follow a χ² (chi-squared) distribution, and so forth for other common statistical calculations. There is a caveat, though: For this to be true, the random variations of the data need to have certain prescribed characteristics. They must be truly random and independent of each other, and there must be a sufficient number of data points that short-term departures from the overall behavior are unimportant.
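A simulation can make this point concrete. The sketch below assumes strongly skewed raw data (drawn from an exponential distribution, chosen here only as a convenient non-normal example) and compares the skewness of the raw readings with the skewness of the means of many sample sets; a normal distribution has zero skewness. The function names and all numerical settings are our own illustrative choices.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed SD."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    third = statistics.fmean([(v - m) ** 3 for v in values])
    return third / var ** 1.5


def simulate(n_sets=4000, sample_size=50, seed=1):
    """Return (skewness of raw readings, skewness of the set means)."""
    rng = random.Random(seed)
    # Independent, random, but decidedly non-normal readings.
    sets = [[rng.expovariate(1.0) for _ in range(sample_size)]
            for _ in range(n_sets)]
    raw = [v for s in sets for v in s]
    means = [statistics.fmean(s) for s in sets]
    return skewness(raw), skewness(means)


if __name__ == "__main__":
    raw_skew, mean_skew = simulate()
    print(f"skewness of raw readings: {raw_skew:.2f}")
    print(f"skewness of set means:    {mean_skew:.2f}")
```

The means are far closer to symmetric than the readings they were calculated from, and larger values of sample_size bring them closer still, in keeping with the caveat about having a sufficient number of data points.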

The usage of these concepts is that once you understand the behavior of your data and how they are distributed, you have a priori knowledge of the probability of finding future data within specified boundaries (or limits). The boundaries are called the confidence limits and define what is known as a confidence interval, which is the range of values within which samples similar to the ones used to collect the original data are likely to be found. Samples outside the confidence limits are in what is known as the rejection region. This means what the name implies: We reject the hypothesis that a sample outside the confidence interval is the same as the ones within the confidence interval. Since the confidence interval contains those readings that are likely to occur for samples that are the same as those on which the confidence interval is based, a new sample that is found to be within the confidence interval is highly likely to be the same as those on which the confidence interval is based. It is important to note that this is only a probabilistic conclusion, and does not prove unequivocally that the new sample is the same as the old ones. In this discussion, for our current purposes we will call the sample set that we use to determine the distributional characteristics of these samples, along with the associated confidence interval and confidence limits, the reference set.

On the other hand, if the new sample under test is outside the confidence interval, it is highly unlikely that the new sample is the same as the previous ones; the further outside the confidence interval it is, the less likely it is to belong to (or be the same as) the reference set. Therefore, this is considered proof that the new sample is not the same as the ones used to generate the confidence interval. Note that the two conclusions are not symmetric in this respect. When a sample is rejected from consideration as being part of the reference sample set, that is a much stronger conclusion than if it is accepted as being part of the set. The term used to describe the condition that a reading is outside the confidence interval is to say that it is “statistically significant.” Non-statisticians tend to use that term very loosely, and to have it mean anything they want it to mean. However, the only statistically correct meaning is the one stated above: a reading outside a specified confidence interval.

In this column, we state these concepts informally, but there is a formal way this is done; the formalism for testing the hypothesis described above is called a hypothesis test. When performing a hypothesis test, the reading being tested is compared with the limits of the confidence interval and is either rejected if outside the confidence interval, or accepted as being the same as the samples in the reference set if within the confidence interval, based on whatever statistical criterion is being used to make the comparison. There is an asymmetry here, however: once you’ve found a statistically significant difference between the sample under test and the reference set, you definitely know that the sample has been objectively shown to be different from the reference set. If, on the other hand, you have not found a difference based on the one test you did, you have to realize that there may be other characteristics of the sample that are not the same as the corresponding characteristic in the reference set, but you don’t know because you haven’t yet tested it that way. Thus, a result showing statistical significance is a much stronger result than one that does not show statistical significance.
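As a rough illustration of the procedure, the sketch below computes confidence limits from a small reference set and tests new readings against them. It rests on several assumptions that are ours rather than part of the discussion above: the reference readings are invented, the readings are taken to be normally distributed, the confidence level is set at 95%, and the uncertainty in the estimated mean and SD is ignored. The function names are hypothetical.

```python
import statistics


def confidence_limits(reference, level=0.95):
    """Normal-theory limits expected to contain `level` of similar readings."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    # Two-sided multiplier, about 1.96 for a 95% level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd


def is_statistically_significant(reading, limits):
    """True if the reading falls in the rejection region (outside the limits)."""
    low, high = limits
    return reading < low or reading > high


# Hypothetical reference set, invented for this illustration.
reference_set = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
limits = confidence_limits(reference_set)

if __name__ == "__main__":
    print(f"confidence limits: {limits[0]:.3f} to {limits[1]:.3f}")
    for reading in (10.1, 11.0):
        verdict = "rejected" if is_statistically_significant(reading, limits) else "not rejected"
        print(f"reading {reading}: {verdict}")
```

A reading that is not rejected has only failed to show a difference on this one test; as noted above, that is a weaker conclusion than a rejection.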

The concepts behind using hypothesis tests to determine the relationship between sets of numbers are used in a wide variety of applications of statistics. Conventional texts dealing with the topic of statistics break the subject down into very small pieces and present the information to the reader piecemeal; this is the reason for the multitude of equations mentioned above. As stated earlier, the approach here will be somewhat different. Extensions of the basic concepts lead us to the two other foundational principles that were alluded to above. These two principles are analysis of variance and least squares.

These two principles underlie most, if not all, of the common algorithms used to analyze spectral data including the ones characterized as chemometrics, in addition to the various constructs that statisticians have developed themselves. A reader who wants to understand the concepts behind these commonly used algorithms needed to pull out recalcitrant information from uncooperative data can sometimes, perhaps, better understand the operation of the algorithms by relating them to the basic principles presented here.

Both of these principles make use of properties that all data have, even data that include a random contribution. Indeed, in some situations the random component of the data plays a key role in the formulation and application of the basic principle. Both principles are characterized by a (sometimes unexpected) underlying mathematical equality that exists between apparently different quantities. These two foundations will be discussed in forthcoming columns, soon to be found in your mailboxes.

Reference

(1) H. Mark and J. Workman, Statistics in Spectroscopy, second edition (Academic Press, New York, New York, 1991).

Jerome Workman Jr. serves on the Editorial Advisory Board of Spectroscopy and is the Executive Vice President of Engineering at Unity Scientific, LLC, in Brookfield, Connecticut. He is also an adjunct professor at U.S. National University in La Jolla, California, and Liberty University in Lynchburg, Virginia. His e-mail address is JWorkman04@gsb.columbia.edu

Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics, in Suffern, New York. He can be reached via e-mail: hlmark@nearinfrared.com