Scientists demonstrate a self-supervised learning framework that dramatically improves near-infrared spectroscopy classification results, even with minimal labeled data.
Near-infrared (NIR) spectroscopy, a cornerstone of non-destructive analysis, has long been valued for its simplicity, speed, and efficiency. Yet its effectiveness often hinges on large databases, complex preprocessing, and skilled feature selection, all of which demand significant domain expertise. Researchers at Fujian Agriculture and Forestry University have introduced a creative approach to overcome these limitations: a convolutional neural network (CNN)-based self-supervised learning (SSL) framework designed to excel even with small datasets. Published in Analytical Methods, the study promises to reshape spectral analysis by automating feature extraction and reducing reliance on labor-intensive data labeling (1–3).
Overcoming Challenges in NIR Spectroscopy
NIR spectroscopy commonly operates within the 780–2526 nm wavelength range, exploiting absorption patterns of hydrogen-containing groups such as O–H, C–H, and N–H. Despite its versatility in analyzing organic molecules, challenges such as broad, overlapping peaks and low signal-to-noise ratios often complicate direct data interpretation. Traditional machine learning (ML) methods, which rely heavily on preprocessing, feature selection, and model construction, risk signal distortion and information loss (1–3).
Deep learning has emerged as a promising alternative, but its reliance on large labeled datasets has limited its adoption in NIR spectroscopy, where labeling is costly and time-consuming. Addressing this gap, the Fujian team (Rongyue Zhao, Wangsen Li, Jinchai Xu, Linjie Chen, Xuan Wei, and Xiangzeng Kong, affiliated with the School of Future Technology and the College of Mechanical and Electrical Engineering) developed a novel SSL framework to extract critical spectral features with minimal human intervention (1).
The Self-Supervised Learning Framework
The proposed SSL model comprises two stages: pre-training and fine-tuning. During pre-training, the model utilizes pseudo-labeled data to learn intrinsic spectral features, setting initial parameters without requiring human-labeled samples. Fine-tuning then optimizes these parameters using a smaller set of labeled data. By leveraging this two-stage process, the model reduces the need for preprocessing while enhancing classification accuracy (1).
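To make the pre-training idea concrete, here is a minimal sketch of how pseudo-labeled data can be generated from unlabeled spectra. The article does not say which pseudo-labeling scheme the authors used, so the pretext task below (predicting which transformation was applied to a spectrum) is an assumption, and the names `TRANSFORMS` and `make_pseudo_labeled` are hypothetical.

```python
import random

# Hypothetical pretext transformations. The article does not specify the
# authors' pseudo-labeling scheme; these are illustrative assumptions.
def identity(spectrum):
    return list(spectrum)

def reverse(spectrum):
    # Flip the wavelength axis.
    return list(reversed(spectrum))

def shift(spectrum, k=5):
    # Circularly shift the spectrum by k points.
    if not spectrum:
        return []
    k %= len(spectrum)
    return list(spectrum[-k:]) + list(spectrum[:-k])

def scale(spectrum, factor=1.5):
    # Multiply every intensity by a constant factor.
    return [factor * x for x in spectrum]

TRANSFORMS = [identity, reverse, shift, scale]

def make_pseudo_labeled(spectra, seed=0):
    """Return (transformed_spectrum, pseudo_label) pairs.

    The pseudo-label is the index of the transformation that was applied,
    so no human-assigned class labels are needed.
    """
    rng = random.Random(seed)
    pairs = []
    for s in spectra:
        label = rng.randrange(len(TRANSFORMS))
        pairs.append((TRANSFORMS[label](s), label))
    return pairs
```

In a setup like this, a CNN would first be trained to predict the pseudo-label from the transformed spectrum, and its learned weights would then serve as the starting point for fine-tuning on the small human-labeled set.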
To validate the framework, the researchers applied it to their proprietary dataset of three tea tree varieties and to three publicly available datasets covering mango, tablet, and coal samples. Across all datasets, the model delivered remarkable results (1).
Performance Insights
The framework’s transformative potential is evident in comparative experiments. When tested with only 5% of labeled data, the SSL model outperformed traditional ML methods by a substantial margin. Even as labeled data availability increased, the SSL approach maintained superior accuracy, demonstrating its efficiency and adaptability (1).
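As a rough illustration of what a 5% labeled-data condition can look like in practice, the sketch below draws a small labeled subset from a dataset. The article does not describe how the authors selected their labeled samples; the per-class (stratified) sampling and the function name `stratified_labeled_subset` are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_labeled_subset(labels, fraction=0.05, seed=0):
    """Return sorted indices of a labeled subset drawn per class.

    Assumption: sampling is stratified, and every class keeps at least
    one labeled sample even when fraction * class_size rounds to zero.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        n = max(1, round(fraction * len(idxs)))
        chosen.extend(rng.sample(idxs, n))
    return sorted(chosen)
```

The remaining indices would stay unlabeled and could still be used in the pre-training stage, since that stage does not need human-assigned labels.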
Additionally, ablation studies confirmed the critical role of the pre-training phase, which enhanced model performance by up to 10.41%. The researchers attribute this success to the model’s ability to extract both local and global spectral features during pre-training, ensuring consistent generalization across datasets (1).
Implications for Spectral Analysis
This study highlights SSL’s potential to address long-standing challenges in spectral analysis. By automating feature extraction and minimizing data-labeling requirements, the CNN-based SSL framework reduces dependency on domain expertise while improving model reliability. The implications extend beyond NIR spectroscopy, offering a blueprint for advancing small-sample analyses in diverse fields, from agriculture to pharmaceutical products and environmental monitoring (1).
“Our results demonstrate that SSL can significantly enhance spectral analysis, even under the constraints of limited data availability,” the authors concluded. “This framework not only advances the capabilities of NIR spectroscopy but also opens doors for broader applications of SSL in analytical science” (1).
The study’s authors emphasize that this breakthrough sets the stage for more automated and scalable approaches to spectroscopy (1). By combining deep learning and self-supervised methodologies, the researchers have redefined what’s possible using NIR spectroscopy for classification, marking a pivotal step toward smarter, more efficient NIR analytical techniques (1).
References
(1) Zhao, R.; Li, W.; Xu, J.; Chen, L.; Wei, X.; Kong, X. A CNN-Based Self-Supervised Learning Framework for Small-Sample Near-Infrared Spectroscopy Classification. Anal. Methods 2025, 13 Jan. DOI: 10.1039/D4AY01970A.
(2) Yang, J.; Xu, J.; Zhang, X.; Wu, C.; Lin, T.; Ying, Y. Deep Learning for Vibrational Spectral Analysis: Recent Progress and a Practical Guide. Anal. Chim. Acta 2019, 1081, 6–17. DOI: 10.1016/j.aca.2019.06.012.
(3) Yang, L.; Sun, Q. Recognition of the Hardness of Licorice Seeds Using a Semi-Supervised Learning Method and Near-Infrared Spectral Data. Chemom. Intell. Lab. Syst. 2012, 114, 109–115. DOI: 10.1016/j.chemolab.2012.03.010.