Outliers are fundamentally a very fuzzy notion. Here, we try to clear up what outliers are and how they affect your data.
This column is our attempt to make some sense out of what is fundamentally a very fuzzy notion. Since the twin sciences of statistics and chemometrics are firmly grounded in mathematics, there would seem to be no wiggle room for “fuzziness” in the results. However, the presence of noise and other forms of fuzziness in the data gives rise to fuzziness in the results nevertheless; even the most exact calculations can only give us approximations to what we would really like to know. Thus, the existence and treatment of outliers constitute a notion we have to deal with.
It’s a happy circumstance (that is, not exactly a coincidence, but not exactly planned, either; the two discussions were written separately and then we noticed how well they fit together) that this new topic follows hard on the heels of a previous topic, the discussion of statistics (1–3). This progression is definitely beneficial. If you keep in mind what we told you about statistics while reading what follows, we hope the two topics will shed light on each other. You will gain more insight into each topic by reading both the series on statistics and this installment on outliers than you would by reading the explanations for each topic alone.
Why is this so? Well, the science of statistics is, as a branch of mathematics, very general. It covers descriptions of the way data behave in all other fields of scientific endeavor, and even some that are not so scientific (recall that the origin of statistics was an initial attempt to describe and explain the results of gambling with dice) (1). Although the subject being investigated in that case was not particularly scientific, applying a rigorous scientific analysis to the situation certainly stood everyone involved in good stead. This is especially the case when the underlying subject is itself mathematical, as chemometrics is.
Although the original developers of statistics were largely concerned with astronomy, statistics and the results it provides can be (and are) applied to chemistry, physics, geology, and virtually every other subject of interest that includes the experimental measurement of one or more physical properties. The measurements in all fields contain error, and as we have previously discussed, the investigation and explanation of the effects of random error is exactly what statistics provides. Our current interest is the application to spectroscopy and related optical and mathematical phenomena, especially the use of optical instruments to perform chemical analysis through the application of the related branch of mathematics called chemometrics. But the application of the more sophisticated algorithms used in chemometrics does not change what happens to the data at the lower, more fundamental, levels where statistics applies. An example of this effect of randomness on the behavior of a chemometric algorithm was published a good number of years ago (4). The point of investigations such as that one is to demonstrate that any one measurement, even a multivariate measurement such as a spectrum, is only a single case obtained at random out of the multitude of possibilities that exist.
It is a similar situation with outliers. When we see an “outlier,” it is a single case out of a multitude of possible measurements (often unobserved) that might have been made if the instantaneous noise had a different value. So, when a reading attracts our attention and we suspect its validity, we need to use the known behavior of the data as well as the algorithms to determine whether our suspicions are justified, no matter how sophisticated the algorithm we applied to that data. So, let’s get started.
Everyone knows what an “outlier” is. Or do we? Or is an outlier another one of those things that “we think we ‘know’ that ain’t so” (5) (with apologies to Mark Twain)? Statisticians sometimes use alternative terminology to describe outliers: some use “discordant” observations, and others call them “aberrant” observations, but we will stick to “outlier” as the terminology we expect our readership to be most familiar with. As we see below, it is much easier to describe what an outlier is not rather than what it is.
So what is an outlier? How can we detect them? What are their causes? How can we make sure that a suspect data value really is an outlier? How can we distinguish outliers from other types of problems with data? Does it matter whether the data are univariate or multivariate? And finally, if we do in fact have an outlier on our hands, how can we deal with it?
Each one of these questions can be the subject of a book, and indeed, books have been written dealing with the subject (6). Although this book is somewhat dated, the fundamentals of data handling haven’t changed over the years, and it is one that happens to be on hand. Those readers interested in digging deeper into the topic can find more recent books, if they wish.
Unsurprisingly, there is also a lot of information available online. See, for example, the websites listed in references 7–11 (starting with Wikipedia) as well as innumerable others, of varying degrees of specialization and sophistication. What you do have to watch out for, however, are the alternative meanings of the word “outlier.” There seems to be at least one specialized meaning of the word in the context of the science of geology, and several of the “hits” found had to do with the title of a book with that name (but nothing to do with our current meaning of the term). So when searching for information about outliers, make sure that the search results are for the statistical, or mathematical, meaning of the word (what Wikipedia calls “disambiguation”).
Our main interest, which is presumably our readers’ as well, is in the detection and treatment of outliers in situations involving calibration of spectroscopic instruments used for chemical analysis, and which also involve application of one or another of the various chemometric algorithms to perform the analysis. However, we will couch our discussion in more general terminology because the concepts we deal with, and their applications, are in fact very general. We will revert to a more specific analysis and the applicable terminology when the discussion calls for it. So let’s begin at the beginning.
Where should we start? Let’s start by trying to figure out just what it is that we’re dealing with. Most of the online references mentioned earlier include explicit definitions of the term outlier (8–11). Let’s look at some of them:
The underlying (root) cause of this situation is the basic fact that when physical measurements are subject to random variations because of uncontrollable physical causes, a given measurement can be anywhere within some spread of values, due to the random variations alone. This spread, which can be characterized by the distribution of values within the range of the spread (more on that later), is the ultimate source of the “fuzziness” of the data. This distribution gives rise to equivalent fuzziness in any properties calculated from this more or less fuzzy data. Indeed, the whole science of statistics is based on the need and desire to be able to specify what can reasonably be inferred from data containing random contributions, and thus provides many of the tools that we can use to tell us when data are behaving (or misbehaving), and what conclusions we can draw from those data. We have discussed this topic previously and will not belabor it here. As we will see, one way to look at an outlier, therefore, is a situation where data are severely misbehaving.
So the fuzziness of the underlying notions inevitably gives rise to fuzziness in the definitions, and leads to a whole set of related questions: How different is “markedly different”? What does it mean for a value to “lie outside” the other values? How large does a distance have to be for it to be “abnormally” distant? And how do we measure “distance”? How do we know if an observation is “inconsistent” with other observations? Each definition raises its own questions.
We will not be able to give definitive answers to all these questions, nor can we provide a clear, concise, and unambiguous definition of an outlier within the confines of this column. What we will (hopefully) do is acquaint our readers with a set of tools (largely statistical tools) for answering the question in individual cases: Is this observation an outlier?
To address that question we need to look at some fundamentals about the way data behave, harking back to some of our earliest columns, which were eventually collected and published in book form (12). This book essentially discusses the basic properties of collections of data. One of the key properties of any collection of data is the distribution of the data in the collection. Data that are of interest to us will usually follow one of a small number of distributions: uniform, normal (Gaussian), t, chi-square (χ²), F, binomial, Poisson, and, less commonly, geometric and hypergeometric distributions. These distributions are all relatively easy to deal with since they are characterized by distinct and well-characterized mathematical definitions. These mathematical definitions tell us a very fundamental fact about the data: the proportion of data points falling in various ranges that encompass the data. This, in turn, translates into the probability of finding data points within various subranges within the span of the data. However, it may not always be easy to decide which distribution is the correct one for the data at hand.
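To make the link between a distribution and the proportion of data in a given range concrete, here is a minimal sketch in Python (our choice of language for illustration only). It assumes the data are normally distributed, and the function name `proportion_within` is ours; it uses only the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

def proportion_within(k, dist=NormalDist()):
    """Proportion of a normal population within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

# The familiar 68-95-99.7 rule falls straight out of the distribution's definition.
for k in (1, 2, 3):
    print(f"within +/-{k} sd: {proportion_within(k):.4f}")
# within +/-1 sd: 0.6827
# within +/-2 sd: 0.9545
# within +/-3 sd: 0.9973
```

The same calculation for a different distribution would give different proportions, which is why the choice of distribution matters so much in what follows.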
Additionally, data must always belong to some distribution, so any data values that do not follow one of those defined distributions will therefore, of necessity, follow an empirically defined distribution (that is, a distribution defined by the data). We have recently addressed this concept, using the term EDF to represent the empirical distribution function for data used in a calibration (13). These distributions also ultimately result in specifying probability values for data in various subranges, but those probabilities generally cannot be represented by an analytic mathematical expression. By using an empirically defined distribution, we obtain a reference to compare individual data values to, but using an empirically defined distribution to characterize a data set also creates some difficulties in trying to characterize individual data values. Determining exactly which distribution to compare against is likely to be even more difficult than with the mathematically defined distributions. In all cases, to decide whether a given data value is an outlier, it must be compared to other values in the data set, which presumably follow one of the known distributions or some empirical distribution.
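As a rough illustration of what an empirically defined distribution looks like in practice, the sketch below builds a textbook empirical distribution function from a sample. This is a generic construction, not necessarily the treatment in reference 13; the helper name `make_edf` and the readings are hypothetical.

```python
from bisect import bisect_right

def make_edf(sample):
    """Return a function giving the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def edf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return edf

# Hypothetical replicate readings (arbitrary units)
readings = [0.98, 1.02, 1.01, 0.97, 1.05, 1.00, 0.99, 1.03]
edf = make_edf(readings)

print(edf(1.00))  # 0.5 -> half of the readings are at or below 1.00
print(edf(1.05))  # 1.0 -> every reading is at or below the largest one
```

Note that the EDF is a step function with no analytic form, and with only a handful of readings its tails are very coarsely defined, which is part of the difficulty described above.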
At this point, therefore, it is perhaps easier to talk about what an outlier is not rather than what it is. The data in a coherent set will be distributed according to the probabilities defining the specifications for that distribution. This approach introduces a key concept: An outlier can only be characterized in terms of its probability of belonging to the specified distribution. Thus, a data value that is an “outlier” for one data set with a certain distribution may or may not be an outlier for a set with a different distribution. In this sense, then, we can define an outlier as any data point that cannot be considered to be part of a distribution that characterizes the rest of the data. After all, any finite data set must have a largest value and a smallest value for its observations. Therefore, to claim that a datum is an outlier simply because it is one of the extreme values is clearly misidentifying it as such.
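A small sketch may help show how the same value can look very different against two distributions. It assumes normal distributions with made-up means and standard deviations; the function name `two_sided_tail` and all numbers are ours, chosen only for illustration.

```python
from statistics import NormalDist

def two_sided_tail(value, mean, stdev):
    """Probability, under a normal model, of a reading at least as far
    from the mean (in either direction) as `value`."""
    z = abs(value - mean) / stdev
    return 2.0 * (1.0 - NormalDist().cdf(z))

suspect = 1.10  # hypothetical suspect reading

# Same reading, same mean, two different spreads
tight = two_sided_tail(suspect, mean=1.00, stdev=0.02)  # about 5 sd away
loose = two_sided_tail(suspect, mean=1.00, stdev=0.05)  # about 2 sd away

print(f"tight distribution: {tight:.2e}")
print(f"loose distribution: {loose:.4f}")
```

Against the tighter distribution the reading has a tail probability well below one in a million; against the looser one it is roughly 4.6%, which would hardly raise an eyebrow.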
If the suspect datum is not a legitimate part of the data set used for comparison, then it makes sense that a characteristic of that datum is that it will be “far away,” “differ markedly,” “lie outside,” or be an “abnormal distance” from the rest of the data, in some way.
It is the dependence on the data distribution that makes the definition of outliers (and everything else about them) so fuzzy, as we noted above. Although all statistics are affected by the randomness inherent in data sets, detection of outliers is perhaps more affected than other aspects of data analysis since, almost by definition, an outlier depends on single readings, so it is rare that a benefit can be achieved by data processing alone. More important is that data analysts pay attention to what the data are telling them, and do not depend on the computer alone to think for them. A nice example was published recently (14) where, in fact, a data transformation (second derivative) revealed an outlier. But the outlier was observed by “eyeballing” the spectra; the erroneous spectrum was seen to have “wiggles” in a spectral region where the other spectra were smooth. How many data analysts take the time or have the patience to check their data that carefully?
Our intuition about data tends to lead us to an assumption about those data, namely that the distribution is normal (that is, Gaussian). If that is not the case, however, say if the data come from a χ² distribution, then many intuitive assumptions about the data will be false. χ² distributions tend to be asymmetric with a long tail extending toward higher values. That long tail means that high values (even compared to the other values in the distribution) of χ² will be common, and therefore, ipso facto, not “unusual” or lying at an “abnormal distance” from the rest of the data.
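To put a number on that long tail, the sketch below compares the chance of exceeding "mean plus three standard deviations" under a normal distribution and under a χ² distribution. We pick 2 degrees of freedom purely for convenience, because in that special case the upper-tail probability has the simple closed form exp(-x/2); the function name is ours.

```python
import math
from statistics import NormalDist

def chi2_2dof_upper_tail(x):
    """P(X > x) for a chi-square variable with 2 degrees of freedom.
    Closed form exp(-x/2); valid only for this special case."""
    return math.exp(-x / 2.0) if x > 0 else 1.0

DOF = 2
mean = DOF                    # mean of chi-square equals its degrees of freedom
stdev = math.sqrt(2 * DOF)    # variance of chi-square is 2 * dof
cutoff = mean + 3 * stdev     # the "3 sd above the mean" threshold

chi2_tail = chi2_2dof_upper_tail(cutoff)
normal_tail = 1.0 - NormalDist().cdf(3.0)

print(f"chi-square (2 dof) beyond mean + 3 sd: {chi2_tail:.4f}")
print(f"normal beyond mean + 3 sd:             {normal_tail:.5f}")
```

Under these assumptions, values beyond that threshold turn up in roughly 1.8% of χ² readings versus about 0.13% of normal readings, more than ten times as often. A cutoff borrowed from normal-distribution intuition would therefore flag many perfectly legitimate χ² values.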
Conversely, since every set of data must have some value that is the largest, as well as one that is the smallest, simply having either of those properties is not enough to flag a datum as an “outlier.”
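A quick calculation suggests how ordinary an "extreme" value becomes as a data set grows. The sketch assumes independent, normally distributed readings (an assumption of ours, not a general rule), and the function name `prob_max_exceeds` is hypothetical.

```python
from statistics import NormalDist

def prob_max_exceeds(k, n):
    """Probability that the largest of n independent standard-normal
    readings lies more than k standard deviations above the mean."""
    phi = NormalDist().cdf(k)   # chance that a single reading stays below k
    return 1.0 - phi ** n       # chance that at least one of n exceeds k

for n in (10, 100, 1000):
    print(f"n = {n:4d}: P(max > 3 sd) = {prob_max_exceeds(3, n):.3f}")
```

With 10 readings, a value beyond three standard deviations is rare (a little over 1%), but with 1000 readings the largest value is more likely than not to lie out there (about 74%), without anything being wrong with it.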
We will continue our discussion of outliers in a future column of “Chemometrics in Spectroscopy” coming soon to your mailbox.
Jerome Workman Jr. serves on the Editorial Advisory Board of Spectroscopy and is the Executive Vice President of Engineering at Unity Scientific, LLC, in Milford, Massachusetts. He is also an adjunct professor at U.S. National University in La Jolla, California, and Liberty University in Lynchburg, Virginia.
Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics, in Suffern, New York. Direct correspondence to: SpectroscopyEdit@UBM.com