Multivariate Calibration for Spectral Analysis Based on P-Spline Signal Regression with Net Analyte Signal

April 1, 2013


In this article, a P-spline signal regression (PSR) with net analyte signal (NAS) method is proposed to construct a quantitative calibration model with high precision. The PSR with NAS method creates a basis coefficient vector by using a projection matrix for computing the NAS of the target analyte. PSR with NAS inherits the advantages of PSR, but is superior to PSR in terms of accuracy and flexibility. Two visible–near infrared (vis–NIR) spectral data sets from basic research experiments, a leaf chlorophyll experiment and a leaf water experiment, are used to evaluate the performance of PSR with NAS. The root mean squared error of prediction (RMSEP), residual predictive deviation (RPD), and correlation coefficient indicate that the PSR with NAS method gives better predictive accuracy than the other calibration methods tested. Moreover, the results imply that PSR with NAS adapts well to a complex spectra model. PSR with NAS is therefore a promising multivariate calibration approach for spectral analysis.

High-accuracy quantitative calibration is central to prediction accuracy in chemometrics (1–5). Calibration extracts analyte information by relating response variables (properties) to predictor variables (wavelengths). Linear models such as partial least squares (PLS) and principal component regression (PCR) are often used (6–11). A revealing fact about PLS and PCR is that the order of the regressors is immaterial; that is, if the wavelengths are permuted arbitrarily, the PLS or PCR vectors are permuted in the same way. Neither method takes into account the spatial nature of the regressor index (12).

P-spline signal regression (PSR) directly uses the ordered or array structure among the regressors (for example, along the wavelength) and forces the regression coefficients to be smooth along the order index. PSR does not necessarily require smoothness of the spectra, but only that the coefficient vector can be smoothed without doing much harm. Traditionally, the major obstacle with B-spline smoothers is the choice of the number and placement of knots. Too many (or few) knots will lead to overfitting (or underfitting), and optimization schemes are complicated nonlinear problems (12). P-splines circumvent this problem by combining regression on a B-spline basis with a difference penalty on the B-spline coefficients. The PSR approach goes a long way toward solving the multivariate calibration problem, but it has a shortcoming: its prediction quality is limited because uninformative signal from irrelevant components enters the model, which weakens the relationship between the spectra and the analyte of interest.

Accordingly, a PSR with a net analyte signal (NAS) method that creates a basis coefficient vector by using a projection matrix for computing the NAS of the target analyte is proposed in this article. The approach inherits the strengths of PSR and outperforms PSR in terms of precision and flexibility. The method is capable of analyzing the data situation commonly found in certain biological applications where the number of variables is several orders of magnitude larger than the number of observations. Correspondingly, this article focuses on two interesting and intensively studied visible–near infrared (vis–NIR) spectra data sets for analysis of leaf biochemical parameters to illustrate the advantages of the proposed method. The prediction performance of the proposed method is underlined by comparing it with other calibration methods such as PLS, PCR, and PSR in terms of root mean squared error of prediction (RMSEP), correlation coefficient, and residual predictive deviation (RPD). Also, the adaptability of the proposed method to the complex spectra model is discussed. In this research, a feasible PSR with NAS model with optimal parameters is selected for multivariate calibration.

Theory

P-Spline Signal Regression

Consider a standard regression approach:

$$y = X\beta + \varepsilon \qquad (1)$$

where y is the realization of the response, X is the m × p spectra matrix, β is the unknown regression coefficient vector, and ε is the residual vector. Typically the number p of regressors far exceeds the number m of observations.

The goal of PSR is smoothness in β. This is achieved through dimension reduction: β is first projected onto a rich B-spline basis that uses a moderate number of equally spaced knots (n-dimensional, n < p), that is, $\beta_{p \times 1} = B_{p \times n}\,\alpha_{n \times 1}$, where B is the B-spline basis matrix. For specifics on the B-spline basis, see reference 13. The vector α is the unknown vector of basis coefficients of modest dimension. Notice that equation 1 can be rewritten as

$$y = XB\alpha + \varepsilon = U\alpha + \varepsilon \qquad (2)$$

where

$$U = XB$$

Then PSR further increases smoothness by imposing a difference penalty on adjacent B-spline coefficients in the α vector (12).
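A B-spline basis on equally spaced knots can be evaluated with the Cox-de Boor recursion. The following is a minimal sketch, not the authors' Matlab code: the function name `bspline_basis_row`, the parameterization by a number of segments `nseg`, and the 100-point wavelength grid are assumptions made for illustration. With `nseg` segments and degree `deg`, the basis has n = nseg + deg functions, so seven segments and cubic splines give the n = 10 used later for the chlorophyll data.

```python
def bspline_basis_row(x, xl, xr, nseg, deg):
    """Evaluate nseg + deg equally spaced B-splines of degree deg at x in [xl, xr]."""
    dx = (xr - xl) / nseg
    # Knots extend deg intervals beyond both ends of [xl, xr].
    knots = [xl + (j - deg) * dx for j in range(nseg + 2 * deg + 1)]
    # Degree 0: indicator of the knot interval that contains x.
    b = [0.0] * (nseg + 2 * deg)
    b[min(int((x - xl) / dx), nseg - 1) + deg] = 1.0
    # Cox-de Boor recursion; equal spacing makes every denominator k * dx.
    for k in range(1, deg + 1):
        b = [
            (x - knots[j]) / (k * dx) * b[j]
            + (knots[j + k + 1] - x) / (k * dx) * b[j + 1]
            for j in range(len(b) - 1)
        ]
    return b

# Hypothetical wavelength grid spanning the chlorophyll scan range (494-1017 nm).
grid = [494.0 + 523.0 * i / 99 for i in range(100)]
B = [bspline_basis_row(x, 494.0, 1017.0, nseg=7, deg=3) for x in grid]
```

Each row of `B` holds the basis functions evaluated at one wavelength, so the rows are non-negative and sum to one. Multiplying a spectra matrix by `B` would give the reduced regressor matrix of modest dimension described above.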

The penalized least-squares solution simplifies as (12)

$$\hat{\alpha} = (U^{T}U + \lambda D_d^{T}D_d)^{-1}U^{T}y \qquad (3)$$

where $D_d$ is an (n − d) × n banded matrix of contrasts resulting from differencing adjacent rows of the identity matrix ($I_n$) d times. The order d of the difference penalty moderates the smoothing. The non-negative tuning parameter λ regularizes the penalty and can be chosen through a logarithmic grid search.
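The difference matrix can be built exactly as described, by differencing adjacent rows of the identity matrix d times. This is a small illustrative sketch; the function name `difference_matrix` is an assumption.

```python
def difference_matrix(n, d):
    """Return the (n - d) x n matrix obtained by differencing rows of I_n d times."""
    D = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(d):
        # Each pass replaces the rows by differences of adjacent rows.
        D = [[D[i + 1][j] - D[i][j] for j in range(n)] for i in range(len(D) - 1)]
    return D

D2 = difference_matrix(5, 2)
```

For n = 5 and d = 2 the first row is (1, −2, 1, 0, 0), the familiar second-difference contrast. With d = 0 the function returns the identity matrix, so the penalty reduces to a ridge penalty on the basis coefficients.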

PSR typically uses between 10 and 200 equally spaced B-splines. The degree of the B-spline can vary (q = 0 to 3), and the order of the difference penalty can also vary (d = 0 to 3). For fixed n, q, and d, increasing λ makes α smoother, and the optimal λ is searched for systematically by monitoring the root mean squared error of cross-validation (RMSECV). The resulting optima can be compared directly across the model parameter combinations. It is important to note that the penalty term introduces very little extra computational effort, because $U^{T}U$ and $U^{T}y$ do not have to be recomputed when the smoothness parameter λ is changed. Given a choice of parameters, the p-dimensional regression coefficient vector can be constructed from the B-spline basis matrix B:

$$\hat{\beta} = B\hat{\alpha}$$
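The computational point about the λ search can be sketched as follows: the cross-products of the reduced regressor matrix with itself and with y are formed once, and only the penalty weight changes inside the loop. This is a pure-Python illustration under stated assumptions, not the authors' implementation; the helper names, the tiny 4 × 3 matrix `U`, the response `y`, and the λ grid are all made up for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def psr_coefficient_path(U, y, D, lambdas):
    """Penalized least-squares basis coefficients for each lambda in the grid."""
    Ut = transpose(U)
    UtU = matmul(Ut, U)                                       # formed once
    Uty = [sum(u * v for u, v in zip(row, y)) for row in Ut]  # formed once
    P = matmul(transpose(D), D)                               # penalty D^T D, formed once
    n = len(P)
    path = {}
    for lam in lambdas:
        A = [[UtU[i][j] + lam * P[i][j] for j in range(n)] for i in range(n)]
        path[lam] = solve(A, Uty)
    return path

# Toy data (hypothetical): y is exactly U times (1, 2, 4).
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 2.0, 4.0, 7.0]
D = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]]  # first-order differences, n = 3
path = psr_coefficient_path(U, y, D, [0.0, 0.01, 1.0, 100.0, 1e6])
```

With λ = 0 the toy problem returns the unpenalized least-squares coefficients, and with a very large λ the first-order penalty pulls the coefficients toward a common value, which is the smoothing behaviour described above.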

P-Spline Signal Regression with Net Analyte Signal

NAS is defined as the part of a spectrum that is orthogonal to the subspace spanned by the spectra of all other components (14). NAS considers only the part of the spectrum that is usable for quantitation of the relevant component. Thus, it extracts the useful information of the analyte of interest and eliminates the information contributed by other components. In the newly proposed method, a projection matrix for computing NAS is the basis for further calculations such as the basis coefficient vector and the regression coefficient vector.
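The definition of NAS as an orthogonal projection can be illustrated with a minimal sketch. It assumes, for illustration only, that the spectra of the other components are known; the function names and the three-channel toy spectra are hypothetical. In the method described in this article the interferent spectra are not assumed to be available, and the projection matrix is instead estimated from the calibration data.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def net_analyte_signal(r, interferents):
    """Return the part of spectrum r orthogonal to the span of the interferent spectra."""
    # Orthonormalize the interferent spectra (modified Gram-Schmidt).
    basis = []
    for s in interferents:
        v = list(s)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip spectra that are linearly dependent on earlier ones
            basis.append([vi / norm for vi in v])
    # Remove from r its component along every orthonormal direction.
    nas = list(r)
    for q in basis:
        c = dot(nas, q)
        nas = [ni - c * qi for ni, qi in zip(nas, q)]
    return nas

# Toy example: two interferents occupy the first two channels.
interferents = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
nas = net_analyte_signal([3.0, 4.0, 5.0], interferents)
```

In the toy example only the third channel survives, because the first two channels are fully explained by the interferents. The result equals the spectrum multiplied by the identity matrix minus the orthogonal projector onto the interferent subspace.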

We propose to create a basis coefficient vector with a projection matrix for computing the NAS of the analyte of interest. The projection matrix tightens the relationship between the spectra and the analyte of interest by eliminating unwanted information, so more relevant basis coefficients can be estimated for quantitation of the analyte of interest and, in turn, a better regression coefficient vector can be obtained.

Accordingly, a modification of equation 3 can be presented as the formula

where G is the projection matrix for computing the NAS of the kth analyte (15), the analyte of interest. Then the basis coefficient estimates $\hat{\alpha}$ can be calculated through equation 4. Finally, the estimate of the regression coefficient vector β is given by

With a limited amount of information, typically found in complex situations in which only the measured spectra of a set of calibration samples and the response of the analyte of interest are available, the projection matrix computation is feasible (15–18).

The superscript "+" symbolizes the Moore-Penrose pseudoinverse. I is the unit matrix of size p × p.

The spectra matrix X is rebuilt using h significant components, yielding the matrix $\hat{X}$. PLS is used for calculation of the significant components (15). $\hat{r}$ is the sum of the rows of $\hat{X}$. ŷ is presented as the formula

The optimal value for the number h of significant components in the PSR with NAS method is determined by minimizing RMSECV. This prevents underfitting and overfitting, so the projection matrix for computing NAS and the basis coefficient vector can be accurately estimated and, furthermore, a reliable prediction can be obtained.
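Selecting a parameter by minimizing RMSECV follows the same pattern whatever the model. The sketch below assumes leave-one-out cross-validation and uses a deliberately simple stand-in model (a shrunken slope through the origin), not PSR with NAS; all function names and the toy data are hypothetical. In the setting of this article the candidate values would be the numbers h of significant components, or the values of λ.

```python
import math

def loo_rmsecv(fit, predict, X, y):
    """Leave-one-out RMSECV for a model supplied as fit/predict callables."""
    sq_err = 0.0
    for i in range(len(y)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])  # train without sample i
        sq_err += (y[i] - predict(model, X[i])) ** 2       # test on sample i
    return math.sqrt(sq_err / len(y))

def select_by_rmsecv(candidates, make_fit, predict, X, y):
    """Return the candidate with the smallest RMSECV and all scores."""
    scores = {c: loo_rmsecv(make_fit(c), predict, X, y) for c in candidates}
    return min(scores, key=scores.get), scores

# Stand-in model (an assumption for illustration): slope shrunk by a penalty lam.
def make_fit(lam):
    def fit(X, y):
        return sum(a * b for a, b in zip(X, y)) / (sum(a * a for a in X) + lam)
    return fit

def predict(slope, x):
    return slope * x

X = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
best, scores = select_by_rmsecv([0.0, 1.0, 10.0], make_fit, predict, X, y)
```

Because the toy response is exactly linear, the unpenalized candidate wins here. With noisy data an intermediate value would typically minimize RMSECV, which is the underfitting and overfitting trade-off referred to above.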

Comparison Between PSR and PSR with NAS

As mentioned earlier, the prediction quality of PSR is limited because uninformative signal from irrelevant components enters the model, which weakens the relationship between the spectra and the analyte of interest. Using the projection matrix for computing NAS in the proposed method helps to obtain a higher-precision multivariate calibration. Additionally, the choice of h in the proposed method guards against underfitting and overfitting, which also contributes to higher precision.

Apart from that, relying on the theory of PSR with NAS, ample flexibility for the regression coefficient vector can be achieved with different values of h. In this regard, PSR with NAS is a more flexible linear regression technique in comparison to PSR.

Experimental

Leaf biochemical parameters such as chlorophyll content and water content can provide valuable insight into the physiological performance of plants. Vis–NIR spectral analysis is a nondestructive and rapid technique that is applicable at different spatial scales. Accordingly, it is widely used for plant parameter estimation at the canopy and leaf levels (19–24).

Vis–NIR Analysis of Leaf Chlorophyll

A total of 55 Epipremnum aureum leaves with different green and sapless levels were selected. All of them were healthy and homogeneous in color without anthocyanin pigmentation or visible symptoms of damage. The sample spectrum for each leaf was obtained by averaging five repeated measurements. The resulting 55 sample spectra were used as predictor variables for the response variable, chlorophyll content, expressed in micrograms per square centimeter (μg/cm²).

An Ocean Optics spectrometer with a Y-type fiber diffuse reflectance accessory was used for spectral measurement. The light source was a white light. A white panel (Spectralon, Labsphere) was used as a 100% reflectance standard for all measurements. Spectra were scanned from 494 nm to 1017 nm at a resolution of 0.2 nm. The data were stored as absorbance.

To obtain reference values of chlorophyll content, each leaf was cut into fragments, extracted with 80% aqueous acetone, and then centrifuged. The absorption spectra of the acetone extract were measured with the same spectrophotometer. The concentration of chlorophyll was calculated according to the Porra formula (25):

where cChl is the concentration of chlorophyll, A646.6 is the absorbance measured at 646.6 nm, A663.6 is the absorbance measured at 663.6 nm, and A750 is the absorbance measured at 750 nm.

Vis–NIR Analysis of Leaf Water

A total of 31 samples of heterogeneous plant leaves with different water contents were obtained. All of them were healthy and homogeneous in color without anthocyanin pigmentation or visible symptoms of damage. The sample spectrum for each leaf was again obtained by averaging five repeated measurements. The resulting 31 sample spectra were used as predictor variables for the response variable, water content, expressed in percent.

The diffuse reflectance spectra of the leaves were obtained with an Evolution 300 spectrometer (Thermo Scientific) equipped with an integrating-sphere reflectance accessory. The scan range was 400–1100 nm with a resolution of 1 nm. The data were stored as reflectance.

Leaf relative water content (RWC), which is used as the reference value of water content, is determined by oven drying as

where FW is fresh weight and DW is dry weight. Each leaf was first weighed quickly using an analytical balance (Mettler Toledo AL104) after collecting the corresponding spectra, and FW was recorded. Then leaves were dried at 120 °C in a circulation oven for 20 min, and the temperature was reduced to 80 °C until the constant weight (dry weight, DW) was reached.

Calculation and Software

The observations were split into two groups: two-thirds of the observations were used as the calibration set for modeling and the remaining one-third as an independent external prediction set for evaluating the performance of the model. For ease of reproducibility, the split was systematic rather than random: every third observation (that is, numbers 3, 6, 9, and so on) was assigned to the prediction set. Correspondingly, the chlorophyll data were split into a calibration:prediction set of 37:18 samples, and the water data were split into a calibration:prediction set of 21:10 samples. All further calculations were performed with Matlab 7.6.0 (The Mathworks, Inc.).
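The split sizes follow directly from taking every third observation, as this short sketch shows. The function name `every_third_split` is an assumption, and observations are numbered from 1 as in the text.

```python
def every_third_split(n_obs):
    """Assign observations 3, 6, 9, ... to prediction and the rest to calibration."""
    numbers = range(1, n_obs + 1)
    prediction = [i for i in numbers if i % 3 == 0]
    calibration = [i for i in numbers if i % 3 != 0]
    return calibration, prediction

chl_cal, chl_pred = every_third_split(55)      # leaf chlorophyll data
water_cal, water_pred = every_third_split(31)  # leaf water data
```

The chlorophyll set yields 37 calibration and 18 prediction samples, and the water set yields 21 and 10, which matches the split reported above.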

Results and Discussion

Model for Leaf Chlorophyll Data

Figure 1 presents the raw spectra of the leaf chlorophyll experiment. PSR with NAS was used to construct a calibration model with the raw spectra as the regressors. Based on minimizing RMSECV, the optimal model parameters for the proposed method are n = 10, q = 3, d = 3, λ = 1, and h = 5. The selection of the penalty tuning parameter in the proposed method is shown in Figure 2. Figure 3 illustrates the selection of the number of significant components.

Figure 1: Original spectra of leaf chlorophyll experiment.

To estimate the effectiveness of quantitative calibration models for leaf chlorophyll content nondestructive measurement using vis-NIR spectroscopy, different strategies including PLS, PCR, and PSR were introduced to compare against the proposed method. By means of full cross-validation, the optimal model dimensionality was 3 in the PLS or PCR calibration; 10 equally-spaced cubic B-splines, a zero order difference penalty, and a tuning parameter with the value of 1 were determined in standard PSR.

Figure 2: Root mean squared errors of cross-validation of PSR with NAS using n = 10, q = 3, d = 3, h = 5, and different values of λ for the leaf chlorophyll data.

Model for Leaf Water Data

The spectra model of the leaf water experiment is more complex than the spectra model of the leaf chlorophyll experiment. When taking the physical characteristics of leaves into account, the difference in species for the vis–NIR analysis of leaf water induces additional incorrect or nonlinear factors and complicates the spectra model. Moreover, chlorophyll is the main compound dominating the optical character in the vis–NIR region. It has an obvious absorption peak and the information about it can be easily extracted from spectra. But the situation of water is different. The absorption peak of water in the shortwave NIR range (around 970 nm) is weak. Thus, the signal is weak even though a great deal of water exists in the leaf, which also makes the spectra model more complex.

Figure 3: Root mean squared errors of cross-validation of PSR with NAS using n = 10, q = 3, d = 3, λ = 1, and different values of h for the leaf chlorophyll data.

The raw spectra from the leaf water analysis are presented in Figure 4. The raw spectra were used to construct a PSR with NAS model. As before, RMSECV is used to tune the proposed method, and the optimal n = 50, q = 1, d = 0, λ = 100, and h = 20 are determined automatically. The choices for the penalty tuning parameter and the number of significant components in the proposed method are displayed in Figures 5 and 6, respectively.

Figure 4: Original spectra of leaf water experiment.

For the vis–NIR analysis of leaf water, the optimal model dimensionality is 5 by full cross-validation in the PLS calibration, and the optimal model dimensionality is 6 in the PCR calibration. The PSR approach takes 20 equally-spaced quadratic B-splines, a zero order difference penalty, and a tuning parameter with the value of 100.

Figure 5: Root mean squared errors of cross-validation of PSR with NAS using n = 50, q = 1, d = 0, h = 20, and different values of λ for the leaf water data.

Comparison of Prediction Results

To investigate the prediction accuracy of the proposed method, the prediction results obtained by PLS, PCR, PSR, and the proposed method are summarized in Tables I and II. The calibration sets are used to construct the model; root mean squared error of calibration (RMSEC) and correlation coefficient between the reference contents and the predicted values are utilized to evaluate fitting performance. The independent external prediction sets are used to validate the constructed model; RMSEP, correlation coefficient, and RPD are introduced to evaluate prediction performance. High correlation coefficient values and low RMSEC or RMSEP values are desired. RPD is a statistical indicator that is applied to evaluate how well a calibration model can predict a prediction data set (26,27), which is defined as the ratio between the standard deviation and the prediction standard error of the population in the prediction set. Therefore, the higher the RPD value, the greater power of the model to predict accurately, and an RPD value greater than three is generally considered to be desirable for predictive purposes (28).
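The three prediction statistics can be sketched as follows. This is an illustration under stated assumptions: RPD is taken here as the sample standard deviation of the reference values divided by RMSEP, whereas some authors use a bias-corrected standard error of prediction in the denominator. The function names and the five toy values are hypothetical and are not data from this study.

```python
import math

def rmsep(y_ref, y_pred):
    """Root mean squared error of prediction."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / len(y_ref))

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab / math.sqrt(saa * sbb)

def rpd(y_ref, y_pred):
    """Residual predictive deviation, assumed here to be SD(reference) / RMSEP."""
    mean = sum(y_ref) / len(y_ref)
    sd = math.sqrt(sum((r - mean) ** 2 for r in y_ref) / (len(y_ref) - 1))
    return sd / rmsep(y_ref, y_pred)

# Toy prediction set (made-up values).
y_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
stats = (rmsep(y_ref, y_pred), pearson_r(y_ref, y_pred), rpd(y_ref, y_pred))
```

Because RPD scales the prediction error by the spread of the reference values, it is unitless and can be compared across data sets with different units, such as μg/cm² for chlorophyll and percent for water.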

Figure 6: Root mean squared errors of cross-validation of PSR with NAS using n = 50, q = 1, d = 0, λ = 100 and different values of h for the leaf water data.

The parameters listed in Tables I and II are consistent within each data set. According to the RMSEC and correlation coefficient for the calibration sets, the PSR with NAS model has the best fitting performance for each data set. As shown in Table I, PSR with NAS also gives the best results for the prediction set, that is, the lowest RMSEP and the highest correlation and RPD. With PSR with NAS, the RMSEP is 1.5 μg/cm², a decrease of 32% compared to PLS, 35% compared to PCR, and 17% compared to PSR. Table II shows that for the leaf water data set, the best prediction accuracy is again obtained by PSR with NAS: the RMSEP is decreased by 42% compared to PLS, 63% compared to PCR, and 29% compared to PSR. The differences between the RMSEPs are pronounced. The RPD differs markedly between calibration methods, and the calibration by PSR with NAS again has the best RPD and correlation.

Table I: Comparison between different calibration methods for the leaf chlorophyll data

Satisfactory prediction results are obtained by the proposed method. The reasons are as follows:

  • The basis coefficient vector is created with the projection matrix for computing the NAS of chlorophyll. The projection tightens the relationship between the spectra and chlorophyll by eliminating unwanted information, so a better regression coefficient vector can be obtained.

  • The optimal value for h is determined by RMSECV, which guards against overfitting and underfitting and therefore helps in estimating the projection matrix for computing NAS and the basis coefficient vector accurately.

  • The method exploits additional structure, namely the ordering of the wavelengths along the signal, so the estimated coefficient vector emphasizes the signal information that is useful for predicting the chlorophyll content.

  • The decision about the number and position of B-spline knots is transferred to the optimization of a continuous smoothing parameter, and the combination of n, q, d, λ, and cross-validation benefits the extraction of chlorophyll content information.

Table II: Comparison between different calibration methods for the leaf water data

For comparing different data sets, RPD is more suitable than the other statistics, because RMSEC and RMSEP depend on the absolute error of each sample and the correlation coefficient reflects only the degree of correlation. The PCR model shows poor performance, especially for the complex leaf water data, and its RPD value for the leaf water data is not desirable for predictive purposes. According to RPD, the PSR with NAS model has the greatest power to predict accurately, particularly for the complex leaf water data, whereas the PLS and PSR models are compromises. From this point of view, PSR with NAS appears more robust than the other methods, especially when used for complex spectral data.

Adaptability to a Complex Spectra Model

Adaptability to a complex spectra model is an important criterion for assessing the usefulness of a multivariate calibration method. The prediction results of the different calibration methods for the two spectral data sets indicate the following:

  • With PSR with NAS, the RMSEP decreases by 35% relative to PCR (from 2.3 to 1.5) for the leaf chlorophyll data and by 63% relative to PCR (from 6.0 to 2.2) for the leaf water data. The improvement in prediction accuracy is thus much larger for the leaf water analysis than for the leaf chlorophyll analysis, which indicates that the method adapts better to the complex spectra model of leaf water.

  • Under different calibration strategies, the RPDs for different spectra data sets are calculated respectively. It is clear that, for the more complex spectra model, which is obtained by the leaf water experiment, the RPD value is very different for different calibration methods, and calibration using the PSR with NAS method has the best RPD value. Accordingly, PSR with NAS also has better adaptability and robustness to complex spectra model. In this regard, PSR with NAS is a better alternative for multivariate calibration.

  • For the complex spectra data of the leaf water analysis, the RPD of PLS model amounts to a relative gain of about 17% in comparison to the PLS model for the leaf chlorophyll analysis. Similarly, the RPD of PSR model amounts to a relative gain of about 17%, and the RPD of PSR with NAS model amounts to a relative gain of about 29%. Conversely, the RPD of PCR model is decreased by 21%. It should be noted that the best RPD is achieved by using PSR with NAS for these two sets of spectral data. These phenomena provide convincing evidence that the PSR with NAS method has distinct adaptability to the complex spectra model, which is very important for spectral analysis.
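The percentage decreases in RMSEP quoted in the first item of the list can be checked with a one-line calculation. The function name `percent_decrease` is an assumption; the RMSEP values are those given in the text.

```python
def percent_decrease(old, new):
    """Relative decrease from old to new, in percent."""
    return 100.0 * (old - new) / old

chl_drop = percent_decrease(2.3, 1.5)    # leaf chlorophyll RMSEP values
water_drop = percent_decrease(6.0, 2.2)  # leaf water RMSEP values
```

Rounded to whole percentages, these give 35% for the chlorophyll data and 63% for the water data.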

Conclusion

A P-spline signal regression with net analyte signal method is proposed in this article to improve the prediction performance of the quantitative calibration model. PSR with NAS is easy to interpret and compute, and it has strong connections to classical regression. PSR with NAS is straightforward to use: It uses the entire ("raw") signal and works without any data preprocessing. It is superior to PSR in terms of accuracy and flexibility. PSR with NAS quantification was successfully applied to two experimental spectral data sets for analysis of leaf biochemical parameters. For these two data sets, the procedure performs better than PLS, PCR, and PSR in terms of precision. Furthermore, because it takes full advantage of the relevant information, PSR with NAS adapts better to a complex spectra model, which makes it well suited to multivariate calibration and spectral analysis. Prediction accuracy is expected to improve further when the most informative wavelength bands, a suitable pretreatment method, and the PSR with NAS calibration are used together. Therefore, PSR with NAS is a highly competitive and promising method. The prediction performance of PSR with NAS for leaf biochemical parameter determination can be extended and made more stable for future practical applications.

Acknowledgments

This work is supported by Programs for Changjiang Scholars and Innovative Research Team (PCSIRT) in University of China (IRT0705) and National Natural Science Foundation (60708026).

References

(1) R.K.H. Galvao, M.C.U. Araujo, M.D. Martins, G.E. Jose, M.J.C. Pontes, E.C. Silva, and T.C.B. Saldanha, Chemom. Intell. Lab. Syst. 81, 60–67 (2006).

(2) E.V. Thomas, Anal. Chem. 72, 2821–2827 (2000).

(3) M.P.A. Ribeiro, T.F. Padua, O.D. Leite, R.L.C. Giordano, and R.C. Giordano, Chemom. Intell. Lab. Syst. 90, 169–177 (2008).

(4) M. Daszykowski, M.S. Wrobel, H. Czarnik-Matusewicz, and B. Walczak, Analyst 133, 1523–1531 (2008).

(5) X.Y. Zhang, Q.B. Li, and G.J. Zhang, Chemom. Intell. Lab. Syst. 107, 333–342 (2011).

(6) S. Serneels, C. Croux, and P.J. Van Espen, Chemom. Intell. Lab. Syst. 71, 13–20 (2004).

(7) S. Serneels and T. Verdonck, Comput. Stat. Data Anal. 53, 3855–3873 (2009).

(8) C.C. Felicio, L.P. Bras, J.A. Lopes, L. Cabrita, and J.C. Menezes, Chemom. Intell. Lab. Syst. 78, 74–80 (2005).

(9) J.A. Westerhuis, T. Kourti, and J.F. MacGregor, J. Chemom. 12, 301–321 (1998).

(10) Z. Ramadan, P.K. Hopke, M.J. Johnson, and K.M. Scow, Chemom. Intell. Lab. Syst. 75, 23–30 (2005).

(11) S. Gourvenec, J.A.F. Pierna, D.L. Massart, and D.N. Rutledge, Chemom. Intell. Lab. Syst. 68, 41–51 (2003).

(12) B.D. Marx and P.H.C. Eilers, Technometrics 41, 1–13 (1999).

(13) P.H.C. Eilers and B.D. Marx, Stat. Sci. 11, 89–121 (1996).

(14) L. Xu and I. Schechter, Anal. Chem. 68, 2392–2400 (1996).

(15) A. Lorber, K. Faber, and B.R. Kowalski, Anal. Chem. 69, 1620–1626 (1997).

(16) K.S. Booksh and B.R. Kowalski, Anal. Chem. 66, 782–791 (1994).

(17) A. Lorber, Anal. Chim. Acta 164, 293–297 (1984).

(18) H. Martens and T. Næs, Multivariate Calibration (Wiley, New York, 1989).

(19) A.A. Gitelson, Y. Gritz, and M.N. Merzlyak, J. Plant Physiol. 160, 271–282 (2003).

(20) R. Colombo, M. Meroni, A. Marchesi, L. Busetto, M. Rossini, C. Giardino, and C. Panigada, Remote Sens. Environ. 112, 1820–1834 (2008).

(21) G.A. Blackburn and J.G. Ferwerda, Remote Sens. Environ. 112, 1614–1632 (2008).

(22) D.A. Sims and J.A. Gamon, Remote Sens. Environ. 81, 337–354 (2002).

(23) L.H. Xue and L.Z. Yang, ISPRS J. Photogram. Remote Sens. 64, 97–106 (2009).

(24) P. Ceccato, S. Flasse, S. Tarantola, S. Jacquemoud, and J.M. Grégoire, Remote Sens. Environ. 77, 22–33 (2001).

(25) R.J. Porra, W.A. Thompson, and P.E. Kriedemann, Biochim. Biophys. Acta 975, 384–394 (1989).

(26) J. Pink, M. Naczk, and D. Pink, J. Agric. Food Chem. 46, 3667–3672 (1998).

(27) H.E. Smyth, D. Cozzolino, W.U. Cynkar, R.G. Dambergs, M. Sefton, and M. Gishen, Anal. Bioanal. Chem. 390, 1911–1916 (2008).

(28) F.J. Rambla, S. Garrigues, and M. de la Guardia, Anal. Chim. Acta 344, 41–53 (1997).

Xiaoyu Zhang, Qingbo Li, and Guangjun Zhang are with the Precision Opto-mechatronics Technology Key Laboratory of Education Ministry in the School of Instrumentation Science and Opto-electronics Engineering at Beihang University in Beijing, China. Please direct correspondence to: xiaoyuzhangzi@126.com