In the second part of this three-part interview, Ayanjeet Ghosh of the University of Alabama and Rohit Bhargava of the University of Illinois Urbana-Champaign discuss how machine learning (ML) is used in data analysis and go into more detail about the model they developed in their study.
A recent study in Applied Spectroscopy introduced a two-step regressive neural network model that improves discrete frequency infrared (IR) imaging for biomedical use, especially in studying protein structures in tissues affected by neurodegenerative diseases (1). Unlike traditional methods like principal component analysis (PCA), which are less interpretable and require dense spectral data, this model uses only seven wavenumbers to accurately reconstruct high-resolution spectra and predict structural features (1). As a result, it significantly accelerates both data acquisition and analysis, offering a more efficient and scalable solution for IR imaging in biomedical research (1).
Two of the authors of this study, Ayanjeet Ghosh, who is a professor in the Department of Chemistry and Biochemistry at the University of Alabama, and Rohit Bhargava, who is a professor in the Department of Bioengineering at the University of Illinois, Urbana-Champaign, recently sat down with Spectroscopy to discuss their findings (2,3).
In the second part of this three-part interview, Ghosh and Bhargava discuss how machine learning (ML) is used in data analysis and go into more detail about the model they developed in their study.
How is machine learning (ML) used for data analysis, and why is principal component analysis (PCA) insufficient for analyzing sparsely sampled discrete frequency IR data, especially in biomedical applications?
ML approaches have been widely used for chemical imaging, specifically for both Fourier transform infrared (FT-IR) and discrete frequency IR (DFIR) microscopies, wherein spectral parameters, such as intensities and frequencies, have been leveraged to distinguish between different disease states, such as early-stage versus metastatic cancer, or to identify chemical signatures underlying pathological markers, such as the composition of protein aggregates in neurodegenerative diseases. PCA is a dimensionality reduction technique typically used in conjunction with FT-IR imaging. DFIR does not require tools like PCA because it already provides only the specific spectral data relevant to chemical characterization of a specific specimen; the sparse spectral sampling in DFIR makes dimensionality reduction unnecessary.
Could you walk us through the architecture and design of the two-step regressive neural network model you developed? How does it address the challenges of curve fitting at scale?
Our two-step neural network is designed to perform two key steps necessary for quantification of protein secondary structures from discrete frequency data. It first reconstructs the full spectra from seven wavenumbers, and it then predicts the areas under the curve (AUCs) of the underlying spectral components for structural quantification, which is typically done using band fitting.
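The two steps described above can be pictured as two regressors chained together. The following Python sketch only illustrates that data flow and is not the authors' model: the layer sizes, the ReLU activation, the 101-point output grid, the choice of one AUC per component, and all function names are assumptions made for illustration, and the weights are random rather than trained.

```python
import random

N_BANDS = 7         # stated in the interview: seven sampled wavenumbers
N_POINTS = 101      # hypothetical length of the reconstructed spectrum
N_COMPONENTS = 3    # assumed: one AUC per structural component
HIDDEN = 16         # hypothetical hidden-layer width


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_mlp(sizes, rng):
    """Build a list of dense layers from consecutive layer sizes."""
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def run_mlp(x, layers):
    """Forward pass: ReLU on hidden layers, linear output for regression."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
reconstructor = make_mlp([N_BANDS, HIDDEN, N_POINTS], rng)     # step 1 (untrained)
quantifier = make_mlp([N_POINTS, HIDDEN, N_COMPONENTS], rng)   # step 2 (untrained)


def predict_pixel(sparse_absorbances):
    """Chain the two steps for one pixel: 7 values -> full spectrum -> AUCs."""
    full_spectrum = run_mlp(sparse_absorbances, reconstructor)
    aucs = run_mlp(full_spectrum, quantifier)
    return full_spectrum, aucs


# Made-up absorbances at seven bands for a single pixel.
spectrum, aucs = predict_pixel([0.10, 0.25, 0.60, 0.90, 0.55, 0.30, 0.12])
print(len(spectrum), len(aucs))  # prints: 101 3
```

Because each pixel needs only a fixed sequence of multiplications and additions, a pipeline of this shape avoids running an iterative fit for every pixel.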
Our approach is approximately 3000 times faster than Gaussian fitting (a figure based exclusively on the computational resources available to us at the time), which is particularly relevant for large images with more than 1 million pixels.
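For readers unfamiliar with the Gaussian fitting used as the baseline here, the quantity being replaced can be written out. This is a generic textbook form under the assumption of purely Gaussian band shapes; the symbols (amplitude A_k, center, width) are introduced for illustration and do not come from the study.

```latex
% Spectrum modeled as a sum of K Gaussian bands (K = 3 in the simulated training data)
S(\tilde{\nu}) \approx \sum_{k=1}^{K} A_k
  \exp\!\left( -\frac{(\tilde{\nu} - \tilde{\nu}_k)^2}{2\sigma_k^2} \right)

% Area under the curve of band k, integrated over all wavenumbers
\mathrm{AUC}_k = \int_{-\infty}^{\infty} A_k
  \exp\!\left( -\frac{(\tilde{\nu} - \tilde{\nu}_k)^2}{2\sigma_k^2} \right) d\tilde{\nu}
  = A_k \, \sigma_k \sqrt{2\pi}
```

In conventional band fitting, the amplitudes, centers, and widths are found by an iterative optimization that is repeated for each pixel, which is why the cost grows quickly for images with more than 1 million pixels.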
Your model requires only seven wavenumbers to generate high-resolution spectral predictions. How did you determine the optimal spectral frequencies to sample, and how generalizable is this selection across tissue types or applications?
Our goal was to use the minimum number of spectral bands to reconstruct full-resolution spectra. We trained our models on largely simulated spectral data composed of three components, representative of the most common secondary structural elements in proteins. The number of bands was chosen empirically based on performance tests comparing the mean absolute error (MAE) of the model as a function of band count. The specific wavenumbers chosen were not necessarily tied to specific structures but were selected to best reconstruct the spectrum. We found that our model performance was slightly better with hand-picked bands compared to uniformly spaced bands across the amide I range.
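The kind of test described above, reconstruction error as a function of band count, can be sketched in a few lines of Python. Everything specific in this example is an assumption made for illustration: the 1600 to 1700 cm^-1 grid, the band centers and widths, the amplitudes, and the function names. Linear interpolation stands in for the trained neural network, so the numbers it prints say nothing about the authors' results, and only uniformly spaced bands are tried.

```python
import math

# Hypothetical amide I window sampled every 1 cm^-1.
GRID = [1600.0 + i for i in range(101)]
# Hypothetical (center, sigma) pairs in cm^-1 for three Gaussian components.
BANDS = [(1630.0, 10.0), (1655.0, 10.0), (1680.0, 10.0)]


def simulate(amplitudes):
    """Sum of three Gaussian bands evaluated on GRID."""
    return [
        sum(a * math.exp(-0.5 * ((x - c) / s) ** 2) for a, (c, s) in zip(amplitudes, BANDS))
        for x in GRID
    ]


def uniform_band_indices(n_bands):
    """Indices of n_bands evenly spaced grid points, endpoints included."""
    last = len(GRID) - 1
    return [round(i * last / (n_bands - 1)) for i in range(n_bands)]


def reconstruct(indices, spectrum):
    """Stand-in reconstructor: linear interpolation between sampled bands."""
    out = []
    for left, right in zip(indices, indices[1:]):
        for j in range(left, right):
            t = (j - left) / (right - left)
            out.append((1 - t) * spectrum[left] + t * spectrum[right])
    out.append(spectrum[indices[-1]])
    return out


def mean_absolute_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


truth = simulate([0.5, 1.0, 0.3])  # made-up component amplitudes
mae_by_band_count = {
    n: mean_absolute_error(truth, reconstruct(uniform_band_indices(n), truth))
    for n in range(3, 12)
}
for n, err in mae_by_band_count.items():
    print(f"{n} bands: MAE = {err:.4f}")
```

In a real study the same loop would be run over many simulated spectra and with a trained model in place of `reconstruct`, and the band count would be chosen where additional bands stop reducing the error appreciably.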
The data chosen to train the models was designed to capture the possible variations of the amide I IR spectra as typically observed in biological specimens. Hence, this model should be generalizable across different tissue types. We have recently verified this by comparing the model output with band fitting for breast cancer tissue biopsies. However, retraining of the models may be necessary for specific applications where the spectra are known to be composed of additional structural components.