A new tutorial provides a step-by-step, hands-on guide to using multivariate data analysis tools like PCA and PLS to extract meaningful insights from complex pharmaceutical data sets.
A recent tutorial by researchers from the Dipartimento di Scienza Applicata e Tecnologia at Politecnico di Torino, in collaboration with Merck Serono SpA, offers guidance for researchers who want to master multivariate data analysis methods. The study, published in Chemometrics and Intelligent Laboratory Systems and led by Nicola Cavallini of Politecnico di Torino, aims to help researchers understand the complex data sets produced by modern analytical technologies such as near-infrared (NIR) spectroscopy and Raman spectroscopy (1).
NIR and Raman spectroscopy are analytical techniques routinely used in pharmaceutical analysis. NIR can identify impurities in drugs and pharmaceutical formulations, and it is often used in quality assurance to verify that drugs sent to market are of high quality (2). Raman spectroscopy, because it is non-invasive, can test drug products that are already in the package (3). Because Raman requires no sample preparation, using it in pharmaceutical analysis also helps keep costs down.
In their tutorial, Cavallini and the team discuss every major stage of pharmaceutical data analysis. From raw data organization to predictive modeling, this case-study-based tutorial provides a clear and reproducible framework that is both accessible and informative (1).
One current trend in pharmaceutical analysis is that modern process analytical technologies (PAT) are commonly used to generate massive volumes of spectral data that contain a wealth of hidden chemical and physical information. However, extracting insights from these data requires more than just sophisticated instrumentation (1). Cavallini and the team discuss the role of chemometric tools in navigating this complexity, particularly through techniques such as principal component analysis (PCA), partial least squares (PLS) regression, and partial least squares-discriminant analysis (PLS-DA) (1).
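To make the idea of PCA concrete, here is a minimal Python sketch of how a set of spectra can be reduced to a few principal components. It is an illustration only, not the authors' code (the tutorial ships MATLAB scripts). The synthetic "spectra", the `pca_scores` helper, and all parameter values are assumptions made up for this example.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD for X (samples x variables); returns scores, loadings, explained ratios."""
    Xc = X - X.mean(axis=0)                      # mean-center each variable (wavelength)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)        # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for real spectra (hypothetical): 30 samples x 200 wavelengths,
# one Gaussian band whose height follows a made-up concentration, plus small noise.
rng = np.random.default_rng(0)
wavelengths = np.linspace(0.0, 1.0, 200)
band = np.exp(-((wavelengths - 0.5) ** 2) / 0.005)
concentration = rng.uniform(0.0, 1.0, size=30)
X = np.outer(concentration, band) + 0.01 * rng.standard_normal((30, 200))

scores, loadings, explained = pca_scores(X)
print("explained variance ratios:", explained)
```

In this toy case, a single band drives almost all of the variation, so the first component's scores track the simulated concentration. Real spectral data sets are rarely this clean, which is why the choice of preprocessing and number of components matters.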
Their tutorial demonstrates this using a real-world data set involving multiple freeze-dried pharmaceutical formulations. Beginning with a detailed explanation of the data set's structure and characteristics, the authors methodically lead the reader through a complete data analysis pipeline (1). At each step in this process, which includes data preprocessing, exploratory analysis, regression modeling, and classification, the researchers explain exactly what to do and how to do it (1).
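As a rough picture of what the preprocessing and regression steps of such a pipeline can look like, the Python sketch below applies standard normal variate (SNV) scaling and then fits a single-response PLS model with the NIPALS algorithm. This is not the tutorial's workflow or code. SNV is chosen here only as one common spectral preprocessing option, and the simulated spectra and the `snv`, `pls1_fit`, `pls1_predict`, and `simulate` helpers are hypothetical names introduced for illustration.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / std

def pls1_fit(X, y, n_components):
    """Fit single-response PLS with NIPALS; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # weight vector: direction of covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in the original X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    """Predict the response for new (already preprocessed) spectra."""
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

# Simulated spectra (hypothetical): an analyte band plus a fixed matrix band,
# distorted by a random multiplicative scatter factor and baseline offset.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
band_a = np.exp(-((x - 0.3) ** 2) / 0.005)
band_b = np.exp(-((x - 0.7) ** 2) / 0.005)

def simulate(n):
    conc = rng.uniform(0.2, 1.0, size=n)
    scatter = rng.uniform(0.8, 1.2, size=(n, 1))
    offset = rng.uniform(-0.1, 0.1, size=(n, 1))
    pure = np.outer(conc, band_a) + band_b
    noise = 0.005 * rng.standard_normal((n, x.size))
    return scatter * pure + offset + noise, conc

X_train, y_train = simulate(40)
X_test, y_test = simulate(20)

model = pls1_fit(snv(X_train), y_train, n_components=3)
y_pred = pls1_predict(model, snv(X_test))
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f"test RMSE: {rmse:.3f}")
```

Evaluating on spectra that were not used for fitting, as done here with `X_test`, is what gives the error estimate its meaning. The number of components (three in this sketch) is an arbitrary choice that would normally be selected by cross-validation.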
The tutorial also demonstrates how increasing levels of sucrose and arginine in the formulations influence the clustering and regression results, offering insight into how formulation variables affect the final product. It also uncovers subtler patterns, such as the impact of the operator performing the analysis and the session in which data were collected, highlighting the method's sensitivity not just to sample composition but also to procedural variability (1).
This practical approach to chemometrics is important because the pharmaceutical industry has put more emphasis on quality control, process optimization, and regulatory compliance. As a result, the authors are keen to encourage critical thinking during each stage of drug analysis (1). At each stage, key questions are posed and discussed, which allows readers to reflect on the decisions they make during their own analyses. The inclusion of fully commented MATLAB code furthers this educational goal, allowing even those with limited programming experience to adapt the scripts for their own data sets (1).
By presenting a tutorial that is both technically rigorous and practically approachable, Cavallini and his co-authors provide a roadmap for advancing data literacy in one of the world’s most scientifically demanding industries (1). Ultimately, the tutorial shows that with the right tools and approach, even complex, high-dimensional spectral data can become a source of actionable insight.