Calibration Transfer, Part IV: Measuring the Agreement Between Instruments Following Calibration Transfer

Sep 30, 2013

This is our 100th "Chemometrics in Spectroscopy" column; counting the earlier "Statistics in Spectroscopy" columns as well, there are now a total of 138. We began in 1986 and have worked since then to provide in-depth discussions of both basic and difficult subjects. In this newest series, we have been discussing multivariate calibration transfer (or simply calibration transfer) and the determination of acceptable error for spectroscopy. We have covered the basic theory of spectroscopic measurement in reflection, discussed the concepts of measuring and understanding instrument differences, and provided an overview of the mathematics used for transferring calibrations and testing transfer efficacy. In this installment, we discuss the statistical methods used to evaluate the agreement between two or more instruments (or methods) for reported analytical results. The emphasis is on acceptable analytical accuracy and confidence levels using two standard approaches: standard uncertainty (or relative standard uncertainty) and Bland-Altman "limits of agreement."

As we have discussed in this series (1–3), calibration transfer involves several steps. The spectra are initially measured on at least one instrument (the parent, primary, or master instrument) and combined with the corresponding reference chemical information (the actual values) to develop calibration models. These models are maintained on the original instrument over time, are used for the initial analyses, and are transferred to other instruments (child, secondary, or transfer instruments) so that those instruments can perform the analysis with minimal corrections and intervention. We note that the issue of calibration transfer disappears if the instruments are precisely alike: if instruments were truly the "same," a sample placed on any of them would yield precisely the same predicted result. Because instruments are not alike, and also change over time, calibration transfer techniques are applied to produce the best attempt at model or data transfer. As mentioned in the first installment of this series (1), there are important issues in attempting to match calibrations based on optical spectroscopy to the reference values, because the spectroscopy measures the volume fractions of the various components of a mixture.

Historically, instrument calibrations have been performed using the existing analytical methods to provide the reference values, and those existing methods have overwhelmingly reported their results as weight percents. Until this paradigm changes and analysts start using the correct units for reporting their results, we will have to live with this situation. We must also recognize that the reference values may come from a prescribed analysis method, from the weight fraction of materials, from the volume percent of composition, or sometimes from a phenomenological measurement with no known relation to underlying chemical entities, resulting from an arbitrary definition developed within a specific industry or application. Amongst ourselves, we have sometimes described such reference methods as "equivalent to throwing the sample at the wall and seeing how long it sticks!" The current assumption is that the nonlinearity caused by the differences between the spectroscopy and the reported reference values must be compensated for by calibration practices. This may not be as simple as presupposed, and requires further research.

Multivariate calibration transfer, or simply calibration transfer, is a set of software algorithms, together with physical materials (or standards) measured on multiple instruments, used to move calibrations from one instrument to another. All of the techniques used to date involve measuring samples on both the parent or primary (calibration) instrument and the child or secondary (transfer) instrument, and then applying one of a variety of algorithmic approaches for the transfer procedure. The most common approaches involve partial least squares (PLS) calibration models with bias or slope corrections applied to the predicted results, or piecewise direct standardization (PDS) combined with small adjustments to the bias or slope of the predicted values. Many other approaches have been published and compared, but for many users these are not practicable, or have not been adopted and made commercially available, for various reasons.
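The slope and bias correction mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical numbers, not data from any actual transfer study: `y_ref` holds reference values for a handful of transfer samples, and `y_child` holds the uncorrected predictions the parent-instrument model produces for spectra measured on the child instrument.

```python
import numpy as np

# Hypothetical transfer-sample data (not from a real instrument pair).
y_ref = np.array([10.2, 11.5, 12.9, 14.1, 15.6])    # reference values
y_child = np.array([10.9, 12.1, 13.6, 14.9, 16.3])  # child-instrument predictions

# Fit y_ref ~ slope * y_child + bias by ordinary least squares.
slope, bias = np.polyfit(y_child, y_ref, 1)

# The same slope and bias are then applied to every future
# prediction made on the child instrument.
y_corrected = slope * y_child + bias
```

Because the correction is fit by least squares, the corrected predictions agree with the reference values on average over the transfer set; whether they agree well enough sample-by-sample is exactly the question the statistical tests below address.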

In any specific situation, if the prescribed method for calibration transfer does not produce satisfactory results, the user simply measures more samples on the child (transfer) instrument until the model is, in effect, updated for the characteristics of that instrument. We have previously described the scenario in which a user with multiple products and constituents must check each constituent for the efficacy of calibration transfer. This is accomplished by measuring 10–20 product samples for each constituent, comparing the average laboratory reference value to the average predicted value for each constituent, and then adjusting each constituent model with a new bias value, resulting in an extremely tedious and unsatisfying procedure. Transfer of calibration is also accomplished by recalibrating on the child instrument or by blending samples measured on multiple instruments into a single calibration. Although the blending approach improves the robustness (or ruggedness) of predicted results across instruments using the same calibration, it is not suitable for all applications, for analytes with small net signals, or for achieving optimum accuracy.
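The per-constituent bias update described above reduces to a one-line calculation, repeated for every constituent of every product. The numbers below are invented placeholders standing in for the 10–20 product samples of one constituent:

```python
import numpy as np

# Hypothetical values for one constituent of one product
# (placeholders; a real check would use 10-20 measured samples).
lab_ref = np.array([4.10, 4.35, 3.98, 4.22, 4.15,
                    4.08, 4.30, 4.19, 4.05, 4.27])   # laboratory reference values
predicted = np.array([4.02, 4.25, 3.90, 4.15, 4.06,
                      4.01, 4.22, 4.10, 3.97, 4.18]) # child-instrument predictions

# The new bias is the gap between the average reference value
# and the average predicted value.
bias_adjustment = lab_ref.mean() - predicted.mean()

# Every subsequent prediction for this constituent is shifted by that amount.
adjusted = predicted + bias_adjustment
```

The tedium comes not from the arithmetic but from repeating this measurement-and-adjustment cycle for each constituent model separately.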

How to Tell if Two Instrument Predictions, or Method Results, Are Statistically Alike

The main question when comparing parent to child instrument predictions, a reference laboratory method to an instrument prediction, or the results of two completely different reference methods, is how to know when the differences are meaningful or significant and when they are not. Some difference is always expected, since an imperfect world allows for a certain amount of "natural" variation. But when are those differences statistically significant, or too great to be acceptable? A number of reference papers and guides tell us how to compute differences, diagnose their significance, and describe the types of errors involved between methods, instruments, and analytical techniques of many types. The analytical method may be based on spectroscopy and multivariate calibration, on other instrumental methods, or even on gravimetric methods. We have included several of the most noted references in the reference section of this column. One classic reference of importance for comparing methods is by Youden and Steiner (4). It describes some of the issues we discuss in this column, as well as details of collaborative laboratory tests, ranking of laboratories for accuracy, outlier determination, ruggedness tests for methods, and diagnosing the various types of errors in analytical results.
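One standard way to ask whether the differences between two sets of paired results are significant is to test whether their mean difference is distinguishable from zero, using the standard uncertainty of that mean. The sketch below uses invented paired results for two hypothetical methods A and B; the critical t value 2.365 is the two-sided 95% point for 7 degrees of freedom:

```python
import numpy as np

# Hypothetical paired results for the same 8 samples from methods A and B.
a = np.array([5.1, 6.3, 4.8, 7.2, 5.9, 6.6, 5.4, 7.0])
b = np.array([5.3, 6.4, 5.0, 7.1, 6.2, 6.8, 5.5, 7.3])

d = a - b                       # paired differences
n = d.size
mean_d = d.mean()               # mean difference between methods
s_d = d.std(ddof=1)             # sample standard deviation of the differences
sem = s_d / np.sqrt(n)          # standard uncertainty of the mean difference

t_crit = 2.365                  # two-sided 95% critical t, n - 1 = 7 df
ci = (mean_d - t_crit * sem, mean_d + t_crit * sem)

# If zero lies outside the confidence interval, the mean difference
# is statistically significant at the 95% level.
significant = not (ci[0] <= 0.0 <= ci[1])
```

Note that statistical significance answers only the first half of the question; whether a significant difference is also too large to be acceptable depends on the accuracy the application requires.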

Table I: Data used for illustration, instruments (or methods) A, B, C, and D
Let us begin with the set of measurement data shown in Table I. These are simulated data that are fairly representative of spectroscopy data following multivariate calibration. The data could refer to different methods (A, B, C, and D) or to different instruments; for our discussion we will take them to be from four different instruments, A to D, for 20 samples. The original data from the calibrated instrument are in A1, and the results of data transferred to the other instruments are represented by B, C, and D. There are duplicate measurements for A (A1 and A2) and for B (B1 and B2). From these data we will perform an analysis and look at levels of uncertainty and acceptability for the analytical results. Note: The C1 and D1 data are used in Figure 2 and will be referred to, along with A2 and B2, in the next installment of this column.
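One of the two approaches named at the outset, the Bland-Altman "limits of agreement," can be computed directly from paired columns such as A1 and B1. Since Table I is not reproduced here, the arrays below are placeholder values used only to show the calculation:

```python
import numpy as np

# Placeholder paired results for two instruments
# (stand-ins for columns such as A1 and B1 of Table I).
a = np.array([9.8, 10.4, 11.1, 9.5, 10.9, 10.2, 11.5, 9.9, 10.7, 10.0])
b = np.array([9.9, 10.6, 11.0, 9.7, 11.1, 10.1, 11.7, 10.1, 10.8, 10.2])

d = a - b                       # per-sample differences between instruments
bias = d.mean()                 # mean difference (the Bland-Altman bias)
s = d.std(ddof=1)               # sample standard deviation of the differences

# 95% limits of agreement: roughly 95% of differences between the two
# instruments are expected to fall within bias +/- 1.96 s.
loa_low = bias - 1.96 * s
loa_high = bias + 1.96 * s
```

In the Bland-Altman view, the two instruments "agree" if differences of size `loa_high` (or `loa_low`) are small enough to be analytically acceptable for the application, which is a judgment about required accuracy rather than a significance test.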