Sample Selection Methods Tested Against Near-Infrared Spectral Information Entropy


Scientists from East China Jiaotong University, located in Nangchang, Jiangxi, China, recently tested different sample selection methods using near-infrared (NIR) spectral information entropy as a similarity criterion. Their findings were published in the Journal of Chemometrics (1).

Young woman examines a spectroscopy picture in a quantum physics laboratory | Image Credit: © luchschenF -

Young woman examines a spectroscopy picture in a quantum physics laboratory | Image Credit: © luchschenF -

Near-infrared (NIR) spectroscopy has been used in a wide variety of tasks in the past few years. The technique has been used to predict the harvest times of cabernet sauvignon grapes, detect Covid-19, and analyze emission lines from a supernova (in tandem with mid-infrared [MIR] spectroscopy (2–4). When using NIR, model constructions and maintenance updates are essential. Model construction, when being performed in machine learning, usually has a sample set divided into a calibration set and a validation set. The representativeness of the calibration set, and the reasonable distribution of the validation set affect the accuracy of the established model. Additionally, while maintaining and updating models, selecting the most informative updated samples can not only improve the model prediction accuracy, but also reduce the amount of sample preparation that is necessary.

For this study, spectral information entropy (SIE) is proposed as a similarity criterion for dividing sample sets, with this criterion being used to select updated samples. Two methods were used for comparing and verifying the superiority of this proposed method: the Kennard–Stone (KS) method, which is a way to perform a split between training and test set based on a distance metric between data points, spectra or labels, and the sample set portioning based on joint x–y distance (SPXY) method (5).

The model that was built after dividing the sample set with SIE was shown to have a good prediction effect compared to the sample sets that were divided with KS and SPXY. When predicting soluble solid content (SSC) and hardness, the prediction determination coefficient (R2P) was improved by over 15%, while the root mean square error (RMSE) of prediction was reduced by 50%. Regarding model updating, it was found that selecting a small number of updated samples using SIE can improve a correlation efficient (RP) by more than 80%, with updated models having prediction accuracies higher than those of the KS and SPXY methods. These results confirm that SIE can make the NIR analysis technique more reliable.


(1) Liu, Y.; He, C.; Jiang, X. Sample Selection Method Using Near-Infrared Spectral Information Entropy as Similarity Criterion for Constructing and Updating Peach Firmness and Soluble Solids Content Prediction Models. J. Chemom. 2023, 38 (2), e3528. DOI:

(2) Luo, Y.; Zhao, J.; Zhu, H.; Li, X.; Dong, J.; Sun, J. Prediction of the Harvest Time of Caberney Sauvignon Grapes Using Near-Infrared Spectroscopy. Spectroscopy 2024. (accessed 2024-3-25)

(3) Acevedo, A. Detecting Covid-19 Using Visible or Near-Infrared Spectroscopy and Machine Learning. Spectroscopy 2023. (accessed 2024-3-25)

(4) Wetzel, W. Observing Supernova 1987A with Near-infrared and Mid-infrared Spectroscopy. Spectroscopy 2024. (accessed 2024-3-25)

(5) The Kennard-Stone Algorithm. NIRPY Research 2022. (accessed 2024-3-25)