A new tutorial provides a step-by-step, hands-on guide to using multivariate data analysis tools like PCA and PLS to extract meaningful insights from complex pharmaceutical data sets.
A recent tutorial published by researchers from the Dipartimento di Scienza Applicata e Tecnologia at Politecnico di Torino, in collaboration with Merck Serono SpA, provides guidance for mastering multivariate data analysis methods. The study, published in Chemometrics and Intelligent Laboratory Systems and led by Nicola Cavallini, aims to help researchers understand the complex data sets produced by modern analytical technologies such as near-infrared (NIR) spectroscopy and Raman spectroscopy (1).
NIR and Raman spectroscopy are analytical spectroscopic techniques routinely used in pharmaceutical analysis. NIR can identify impurities in drugs and pharmaceutical formulations, and it is often applied in quality assurance to ensure that the drugs sent to market are of high quality (2). Raman spectroscopy, being non-invasive, can test drug products that are already in their packaging (3), and because it requires no sample preparation, it also helps keep analysis costs down.
A collection of colorful capsules and tablets scattered on a surface. Generated by AI. | Image Credit: © Khatyjay - stock.adobe.com
In their tutorial, Cavallini and the team discuss every major stage of pharmaceutical data analysis. From raw data organization to predictive modeling, this case-study-based tutorial provides a clear and reproducible framework that is both accessible and informative (1).
Modern process analytical technology (PAT) tools routinely generate massive volumes of spectral data that contain a wealth of hidden chemical and physical information. However, extracting insights from these data requires more than sophisticated instrumentation (1). Cavallini and the team discuss the role of chemometric tools in navigating this complexity, particularly techniques such as principal component analysis (PCA), partial least squares (PLS) regression, and partial least squares-discriminant analysis (PLS-DA) (1).
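To give a flavor of what PCA does with spectral data, the sketch below runs PCA via a singular value decomposition on simulated "spectra" built from two underlying components. This is a minimal illustration in Python/NumPy, not the tutorial's own MATLAB code; the simulated data and variable names are purely illustrative.

```python
import numpy as np

# Simulated "spectra": 20 samples x 50 wavelength channels,
# generated from two underlying components plus a little noise.
rng = np.random.default_rng(0)
scores_true = rng.normal(size=(20, 2))
loadings_true = rng.normal(size=(2, 50))
X = scores_true @ loadings_true + 0.05 * rng.normal(size=(20, 50))

# PCA: mean-center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                       # sample coordinates in PC space
loadings = Vt                        # principal component directions
explained = s**2 / np.sum(s**2)      # fraction of variance per PC

# With two true components, the first two PCs should capture
# nearly all of the variance.
print(float(explained[:2].sum()))
```

In exploratory analysis, a scatter plot of the first two columns of `scores` would reveal clustering of samples, much as the tutorial uses PCA to explore its freeze-dried formulation data set.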
Their tutorial demonstrates this by describing a real-world data set involving multiple freeze-dried pharmaceutical formulations. Beginning with a detailed explanation of the data set's structure and characteristics, the authors methodically lead the reader through a complete data analysis pipeline (1). At each step in this process, which includes data preprocessing, exploratory analysis, regression modeling, and classification, the researchers explain what to do and how to execute it (1).
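The article does not detail which preprocessing methods the tutorial applies, but standard normal variate (SNV) correction is one common choice for NIR spectra, as it suppresses the scatter effects that can mask chemical information. A minimal sketch, assuming SNV as the preprocessing step:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum
    (each row) by its own mean and standard deviation, removing
    per-spectrum offset and multiplicative scatter effects."""
    spectra = np.asarray(spectra, dtype=float)
    means = spectra.mean(axis=1, keepdims=True)
    stds = spectra.std(axis=1, keepdims=True)
    return (spectra - means) / stds

# Two toy "spectra" that differ only by an offset and a gain:
raw = np.array([[1.0, 2.0, 3.0, 4.0],
                [3.0, 5.0, 7.0, 9.0]])   # row 1 = 2 * row 0 + 1
corrected = snv(raw)

# After SNV the two rows coincide, because the per-spectrum
# offset and scaling have been removed.
print(np.allclose(corrected[0], corrected[1]))  # → True
```

Preprocessing choices like this one strongly influence downstream PCA and PLS results, which is why the tutorial treats them as a distinct, deliberate stage of the pipeline.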
The tutorial also demonstrates how increasing levels of sucrose and arginine in the formulations influence the clustering and regression results, offering insight into how formulation variables affect the final product. It also uncovers subtler patterns, such as the impact of the operator performing the analysis and the session in which data were collected, highlighting the method's sensitivity not just to sample composition but also to procedural variability (1).
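Relating a formulation variable such as sucrose level to whole spectra is exactly the kind of task PLS regression handles. The sketch below fits a single-response PLS model with the NIPALS algorithm on simulated spectra whose peak intensity scales with an excipient level; the data, level values, and function names are illustrative assumptions, not the tutorial's actual data set or code.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression via NIPALS: extract latent directions in X
    that covary maximally with y, then build regression coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)            # weight vector
        t = Xk @ w                           # scores
        p = Xk.T @ t / (t @ t)               # X loadings
        q.append((yk @ t) / (t @ t))         # y loading
        W.append(w); P.append(p)
        Xk = Xk - np.outer(t, p)             # deflate X
        yk = yk - q[-1] * t                  # deflate y
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)      # regression coefficients
    return B, x_mean, y_mean

def pls1_predict(X, B, x_mean, y_mean):
    return (X - x_mean) @ B + y_mean

# Simulated spectra: a Gaussian band whose intensity tracks a
# hypothetical excipient level (4 levels x 5 replicates).
rng = np.random.default_rng(1)
levels = np.repeat([1.0, 2.0, 3.0, 4.0], 5)
band = np.exp(-0.5 * ((np.arange(60) - 30) / 4.0) ** 2)
X = np.outer(levels, band) + 0.01 * rng.normal(size=(20, 60))

B, xm, ym = pls1_fit(X, levels, n_components=2)
pred = pls1_predict(X, B, xm, ym)
rmse = float(np.sqrt(np.mean((pred - levels) ** 2)))
```

In the same spirit, inspecting the scores of such a model against metadata like operator or measurement session is how the tutorial surfaces procedural variability alongside compositional effects.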
This practical approach to chemometrics matters because the pharmaceutical industry has placed growing emphasis on quality control, process optimization, and regulatory compliance. Accordingly, the authors encourage critical thinking during each stage of drug analysis (1). At each stage, key questions are posed and discussed, allowing readers to reflect on the decisions they make during their own analyses. The inclusion of fully commented MATLAB code furthers this educational goal, allowing even those with limited programming experience to adapt the scripts to their own data sets (1).
By presenting a tutorial that is both technically rigorous and practically approachable, Cavallini and his co-authors provide a roadmap for advancing data literacy in one of the world’s most scientifically demanding industries (1). Ultimately, the tutorial shows that with the right tools and approach, even complex, high-dimensional spectral data can become a source of actionable insight.