A training set sample selection method based on the simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) algorithm is proposed in this article. We applied this method in near-infrared (NIR) spectral analysis of the content of protein, water, oil, and starch in corn. It is concluded that SIMPLISMA can be used as an alternative path to select training samples for constructing a robust prediction model in NIR spectral analysis.
In the field of quantitative analysis using near-infrared (NIR) spectroscopy, calibration is necessary. Training set sample selection is an important aspect of multivariate calibration, and it has attracted the attention of several researchers. The aim of this technique consists of the following two items:
In this article, we mainly discuss the first case, which involves the training set sample selection method and subset selection from full calibration samples. Constructing a calibration model requires a sufficient number of samples with a wide distribution of the concentration (or nature) of the analyte. Selecting representative training samples, in particular, helps to establish a calibration model with higher efficiency and good accuracy. Several works have discussed the problem of selecting a representative subset from a large pool of samples (1,2,12–14). In general, there are three types of training set sample selection methods: random sampling; sample selection based on data analysis of the instrumental response variables x (that is, spectra); and sample selection based on data analysis of both the instrumental response variables x and the output vector y (that is, concentration or the nature of the analyte).
Random sampling is the most widely used method because of its simplicity: a group of data randomly extracted from a larger set tends to follow the statistical distribution of the entire set. However, random sampling guarantees neither that representative samples are included in the calibration set nor that samples on the boundaries of the data set are selected.
Therefore, sample selection methods based on data analysis of the instrumental response variables x are commonly used. Examples include the linear algebra techniques similar to spectral subtraction proposed by Honigs (12) and the frequently used Kennard–Stone (KS) algorithm. KS aims to cover the multidimensional space in a uniform manner by maximizing the Euclidean distances between the instrumental response vectors (x) of the selected samples (13,14). The advantage of the KS algorithm is that it ensures that the selected samples are uniformly distributed in accordance with their spatial distance. A drawback of KS, however, is that it requires data conversion and the calculation of the distances between samples, which is computationally demanding. Another shortcoming of the KS algorithm for multivariate calibration is that the statistics of the output vector (y) are not taken into account (2).
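The KS selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `euclidean` and `kennard_stone` and the toy one-variable "spectra" are assumptions made for this example. The sketch starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbor is farthest away.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two response vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kennard_stone(spectra, n_select):
    """Return the indices of n_select samples chosen by a Kennard-Stone-style rule.

    spectra is a list of equal-length lists of floats (one list per sample).
    """
    n = len(spectra)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    # Full pairwise distance matrix: this is the costly step for large sets.
    dist = [[euclidean(spectra[i], spectra[j]) for j in range(n)] for i in range(n)]

    # Seed the selection with the two samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Add the sample whose distance to its closest selected sample is largest.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


if __name__ == "__main__":
    toy_spectra = [[0.0], [1.0], [2.0], [9.0], [10.0]]
    print(kennard_stone(toy_spectra, 3))
```

Because the full distance matrix is built up front, the cost grows with the square of the number of samples, which is the computational burden mentioned above. Note also that no y-values enter the calculation.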
Recently, some studies of sample selection have focused on the data analysis of both the instrumental response variables x and the output vector y (1,2). It could be argued that the inclusion of y-information in the selection process might result in a more effective distribution of calibration samples in the multidimensional space, thus improving the predictive ability and robustness of the resulting model. In the work of Galvão and colleagues (1), an approach for considering joint x–y statistics in the selection of samples is proposed for the purpose of diesel analysis by NIR spectrometry. The method, termed sample set partitioning based on joint x–y distances (SPXY), extends the KS algorithm by encompassing both x- and y-differences in the calculation of inter-sample distances. In other words, unlike KS, SPXY takes the y-information into account. However, the algorithm does not account for collinearity between samples, which arises especially when hundreds of samples are used in calibration.
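As a hedged illustration of how a joint x–y distance might be formed, the sketch below divides the x-distance and the y-distance by their respective maxima over all sample pairs and sums them, then applies the same maximin selection used by KS. The function name `spxy`, the assumption of a single scalar y per sample, and the toy data are choices made for this example only; the scaling follows the commonly cited SPXY formulation and should be checked against reference (1).

```python
import math


def spxy(spectra, y, n_select):
    """Select n_select sample indices using a joint x-y distance (SPXY-style).

    spectra: list of equal-length lists of floats (x), one per sample.
    y: list of floats, one reference value per sample (assumed scalar here).
    """
    n = len(spectra)
    if len(y) != n:
        raise ValueError("spectra and y must have the same number of samples")
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    # Pairwise distances in x-space (Euclidean) and y-space (absolute difference).
    dx = [[math.sqrt(sum((a - b) ** 2 for a, b in zip(spectra[i], spectra[j])))
           for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]

    # Scale each distance by its maximum so x and y contribute comparably.
    max_dx = max(max(row) for row in dx) or 1.0
    max_dy = max(max(row) for row in dy) or 1.0
    dxy = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
           for i in range(n)]

    # Same maximin selection as Kennard-Stone, applied to the joint distance.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dxy[p[0]][p[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(dxy[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


if __name__ == "__main__":
    toy_spectra = [[0.0], [1.0], [2.0], [3.0]]
    toy_y = [0.0, 0.0, 5.0, 0.1]
    print(spxy(toy_spectra, toy_y, 3))
```

In the toy data, the third sample is not extreme in x but has an unusual y-value, so the joint distance brings it into the seed pair, whereas a purely x-based rule would start from the first and last samples.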
In this article, we extend the simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) method into a novel training set sample selection method that eliminates collinear samples. In this method, both x- and y-information are included in the sample selection process. The selected representative samples are then used for calibration, which improves modeling efficiency while achieving good accuracy.
SIMPLISMA was originally developed by Willem Windig and colleagues for pure variable selection (15). A pure variable is defined as a variable whose intensity mainly results from one of the components in the mixture under consideration (16). The pure variables can be selected straightforwardly from the purity spectrum and the standard deviation spectrum of the variables. A determinant-based function is used in SIMPLISMA to determine the independence of variables; its purpose is to eliminate collinearity among the selected variables.
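The following Python sketch illustrates the general idea of purity-based selection with a determinant weight. It is a simplified illustration under stated assumptions, not the exact formulation of references (15,16): purity is taken as the standard deviation divided by the mean plus an offset, the offset is assumed to be a fraction of the largest mean, columns are scaled to unit length before the determinant is computed, and the data are assumed to be non-negative. The names `determinant`, `select_pure_variables`, and `offset_fraction` are hypothetical.

```python
import math


def determinant(matrix):
    """Determinant of a small square matrix by Gaussian elimination."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: the variables are linearly dependent
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def select_pure_variables(data, n_pure, offset_fraction=0.05):
    """Return indices of n_pure variables (columns) picked by weighted purity.

    data: list of rows (samples), each a list of non-negative intensities.
    """
    n_rows, n_cols = len(data), len(data[0])
    columns = [[row[j] for row in data] for j in range(n_cols)]
    means = [sum(col) / n_rows for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n_rows)
            for col, m in zip(columns, means)]

    # The offset damps the purity of low-intensity (noisy) variables.
    alpha = offset_fraction * max(means)
    base_purity = [s / (m + alpha) for s, m in zip(stds, means)]

    # Unit-length columns, so the Gram determinant only measures independence.
    unit = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col)) or 1.0
        unit.append([v / norm for v in col])

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    selected = []
    for _ in range(n_pure):
        best_j, best_purity = None, -1.0
        for j in range(n_cols):
            idx = [j] + selected
            gram = [[dot(unit[p], unit[q]) for q in idx] for p in idx]
            # Weight near 0 means variable j is collinear with earlier picks.
            purity = determinant(gram) * base_purity[j]
            if purity > best_purity:
                best_j, best_purity = j, purity
        selected.append(best_j)
    return selected


if __name__ == "__main__":
    # Two components; variables 0 and 1 are pure, variable 2 is a mixture.
    toy = [[0.0, 3.0, 1.5], [1.0, 2.0, 1.5], [2.0, 3.0, 2.5], [5.0, 2.0, 3.5]]
    print(select_pure_variables(toy, 2))
```

The determinant weight is what removes collinearity: once a variable has been chosen, any candidate that is nearly a multiple of it receives a weight close to zero and is passed over, even if its unweighted purity is high.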
SIMPLISMA has been successfully used for the investigation of peak purity with liquid chromatography (LC) and diode-array detection (DAD) data (17). It has also been applied to second-derivative NIR spectra to resolve highly overlapping signals with a baseline problem (18). In addition, SIMPLISMA has been applied to the analysis of categorized pyrolysis mass spectra data, Fourier transform infrared (FT-IR) spectroscopy, microscopy data of a polymer laminate, Raman spectra of a reaction followed in time, and so on (19–22).
This article discusses the use of SIMPLISMA for selecting training set samples for multivariate calibration in NIR spectral analysis. In this study, SIMPLISMA is used to select a subset of calibration samples that are minimally redundant but still representative of the data set. This differs from pure variable selection by SIMPLISMA: the approach involves pure sample selection, using the SIMPLISMA principle to eliminate collinearity among samples. Our goal is to reduce computational work and obtain a robust multivariate model with stable prediction performance. The sample selection method takes into account both x and y statistics by combining sample selection with wavelength-variable selection, which is helpful in resolving multicomponent chemical or physical information of the mixture from overlapped spectra. The use of SIMPLISMA for sample selection is illustrated for the NIR spectral analysis of the protein, water, oil, and starch content in corn. In this application, sample selection by SIMPLISMA is also compared with the KS algorithm and the SPXY method.