A training set sample selection method based on the simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) algorithm
is proposed in this article. We applied this method in near-infrared (NIR) spectral analysis of the content of protein, water,
oil, and starch in corn. We conclude that SIMPLISMA can be used as an alternative approach for selecting training samples when
constructing a robust prediction model in NIR spectral analysis.
In the field of quantitative analysis using near-infrared (NIR) spectroscopy, calibration is necessary. Training set sample
selection is an important aspect of multivariate calibration, and it has attracted the attention of several researchers. It
serves two main purposes:
- Choosing the smallest set of training samples that can be used without significantly compromising the prediction performance
of the model, typically to save computational work (1–8). In some situations, using representative samples in multivariate
calibration benefits both the modeling efficiency and the prediction performance.
- Supporting the transfer of multivariate calibration models (9–11), in which a subset of samples must be measured on different
spectrometers. The measurements are then used to standardize the instrumental response of one or more spectrometers (slaves)
with respect to a single one (master). In this case, a small number of representative samples is chosen so that they contain
sufficient information about the variation among samples caused by different measurement conditions. This benefits practical
application.
In this article, we mainly discuss the first case: selecting a training subset from the full set of calibration samples.
Constructing a calibration model requires a sufficient number of samples with a wide distribution of the analyte concentration
(or other property). Selecting representative training samples helps establish a calibration model with higher efficiency and
good accuracy.
Several works have discussed the problem of selecting a representative subset from a large pool of samples (1,2,12–14). In
general, there are three types of training set sample selection methods: random sampling; sample selection based on data analysis
of instrumental response variables x (that is, spectra); and sample selection based on data analysis of both the instrumental response variables x and the output vector y (that is, the concentration or another property of the analyte).
Random sampling is the most widely used method because of its simplicity: a group of data randomly extracted from a larger
set follows the statistical distribution of the entire set. However, random sampling guarantees neither that representative
samples are included in the calibration set nor that the samples on the boundaries of the set are selected.
Therefore, sample selection methods based on data analysis of the instrumental response variables x are commonly used. Examples
include the linear algebra techniques similar to spectral subtraction proposed by Honigs (12) and the widely used Kennard–Stone
(KS) algorithm. KS aims to cover the multidimensional space uniformly by maximizing the Euclidean distances between the
instrumental response vectors (x) of the selected samples (13,14). Its advantage is that the selected samples are uniformly
distributed with respect to spatial distance. However, KS requires data conversion and the calculation of the spatial distances
between samples, which is computationally demanding. Another shortcoming of the KS algorithm for multivariate calibration is
that the statistics of the output vector (y) are not taken into account (2).
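The max-min selection rule described above can be sketched in a few lines of Python. This is a minimal illustration of the usual textbook form of KS, not the implementation used in the cited works; the function name `kennard_stone` and its arguments are hypothetical, and the full pairwise distance matrix is built in memory, which is only reasonable for small sample sets.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule.

    X is an (n_samples, n_wavelengths) array of spectra (assumed layout).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all spectra (rows of X).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select and remaining:
        # Distance from each candidate to its nearest already-selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the candidate whose nearest selected neighbour is farthest away.
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

For four one-dimensional "spectra" with values 0, 1, 2, and 10, `kennard_stone([[0], [1], [2], [10]], 3)` first picks the extreme pair (indices 0 and 3) and then index 2. Only x enters the distance, which illustrates why y statistics play no role in the selection.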
Recently, some studies of sample selection have focused on the data analysis of instrumental response variables x and output vector y (1,2). It could be argued that the inclusion of y-information in the selection process might result in a more effective distribution of calibration samples in the multidimensional
space, thus improving the predictive ability and robustness of the resulting model. In the work of Galvão and colleagues (1),
an approach for considering joint x–y statistics in the selection of samples is proposed for the purpose of diesel analysis by NIR spectrometry. The method, termed
sample set partitioning based on joint x–y distances (SPXY), extends the KS algorithm by encompassing both x- and y-differences in the calculation of inter-sample distances. In other words, in comparison to KS, SPXY also considers the y-information. However, the algorithm does not account for
collinearity between samples, which is present especially when hundreds of samples are used in calibration.
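A joint x–y distance of the kind SPXY relies on might be sketched as follows. The normalization used here (each distance matrix divided by its maximum before the two are summed) is an assumption about how the x- and y-differences are given equal weight, and the names `joint_xy_distance` and `maxmin_select` are hypothetical. The sketch also assumes that at least two samples differ in x and in y, so neither maximum is zero.

```python
import numpy as np

def joint_xy_distance(X, y):
    """Sum of x- and y-distances, each scaled by its own maximum (assumed form)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    # Pairwise Euclidean distances in spectral space and in property space.
    dx = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    dy = np.sqrt(((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))
    return dx / dx.max() + dy / dy.max()

def maxmin_select(dist, n_select):
    """KS-style max-min selection on a precomputed distance matrix."""
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(dist.shape[0]) if k not in selected]
    while len(selected) < n_select and remaining:
        # Distance from each candidate to its nearest already-selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy data: sample 2 is unremarkable in x but extreme in y.
D = joint_xy_distance([[0], [1], [2], [3]], [0, 0, 5, 0])
subset = maxmin_select(D, 3)
```

In this toy case the starting pair is samples 0 and 2 rather than the x-extremes 0 and 3, because the large y-difference of sample 2 contributes to the joint distance. Nothing in the distance penalizes a candidate that is a linear combination of samples already chosen.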
In this article, we extend the simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) method into a novel training
set sample selection method that eliminates collinear samples. The method includes both x- and y-information in the sample selection process. The representative samples selected for calibration
improve the modeling efficiency, while good accuracy can still be achieved.
SIMPLISMA was originally developed by Willem Windig and colleagues for pure variable selection (15). A pure variable is defined as a variable whose intensity mainly results from one of the components in the mixture under consideration (16).
Pure variables can be selected straightforwardly from the purity spectrum and the standard deviation spectrum of the variables.
SIMPLISMA uses a determinant-based function to assess the independence of variables, with the aim of eliminating collinearity
among them.
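The interplay between purity and the determinant-based independence check can be illustrated with a simplified Python sketch. It is not the published SIMPLISMA procedure: the purity is taken here as the standard deviation divided by the mean plus a small offset, the independence weight is the determinant of the Gram matrix of unit-length columns, and the function name `select_pure_columns` and the parameter `offset_frac` are hypothetical. Non-negative intensities are assumed.

```python
import numpy as np

def select_pure_columns(D, n_select, offset_frac=0.05):
    """Pick columns of D with high purity that are not collinear with earlier picks.

    Simplified sketch: purity = std / (mean + offset); independence weight =
    Gram determinant of the candidate together with the columns already chosen.
    """
    D = np.asarray(D, dtype=float)
    mean = D.mean(axis=0)
    std = D.std(axis=0)
    offset = offset_frac * mean.max()  # damps low-intensity, noisy columns
    purity = std / (mean + offset)
    # Scale columns to unit length so each determinant lies between 0 and 1.
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    U = D / norms
    gram = U.T @ U
    selected = []
    for _ in range(n_select):
        weights = np.empty(D.shape[1])
        for j in range(D.shape[1]):
            idx = [j] + selected
            # Close to 0 when column j is collinear with the selected columns.
            weights[j] = np.linalg.det(gram[np.ix_(idx, idx)])
        score = weights * purity
        if selected:
            score[selected] = -np.inf  # never pick the same column twice
        selected.append(int(np.argmax(score)))
    return selected

# Column 1 is exactly half of column 0, so the two are collinear.
demo = np.array([[2, 1, 1], [0, 0, 2], [2, 1, 1], [0, 0, 3]])
picked = select_pure_columns(demo, 2)
```

In the toy matrix, column 1 has the second-highest purity, but its determinant weight collapses to essentially zero once column 0 has been chosen, so column 2 is picked instead. The function operates on columns; applying the same idea to samples would mean passing the transposed matrix, which is an assumption about the general principle rather than a description of the exact procedure used in this article.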
SIMPLISMA has been successfully used for the investigation of peak purity with liquid chromatography (LC) and diode-array
detection (DAD) data (17). It has also been applied to second-derivative NIR spectra to resolve highly overlapping signals
with a baseline problem (18). In addition, SIMPLISMA has been applied to the analysis of categorized pyrolysis mass spectra
data, Fourier transform infrared (FT-IR) spectroscopy, microscopy data of a polymer laminate, Raman spectra of a reaction
followed in time, and so on (19–22).
This article discusses the use of SIMPLISMA for the selection of samples for a training set for multivariate calibration in
NIR spectral analysis. In this study, SIMPLISMA is used to select a subset of calibration samples that are minimally redundant
but still representative of the data set. Unlike pure variable selection by SIMPLISMA, this approach performs pure sample
selection, using the SIMPLISMA principle to eliminate collinearity among samples. Our goal is to reduce computational work
and obtain a robust multivariate model with stable prediction performance. The sample selection method takes into account
both x and y statistics by being combined with wavelength-variable selection, which helps resolve multicomponent chemical or physical
information of the mixture from overlapped spectra. The use of SIMPLISMA for sample selection is illustrated for the NIR spectral
analysis of the protein, water, oil, and starch content in corn. In this application, sample selection by SIMPLISMA is also
compared with the KS algorithm and the SPXY method.