Quantification of Adenine Residues in the PolyA Tail of Oligonucleotides Using Raman Spectroscopy

https://doi.org/10.56530/spectroscopy.ia8480b4

Raman spectroscopy emerges as a powerful tool for quantifying adenine in nucleic acid therapeutics, enhancing analytical efficiency and quality assessment.

Nucleic acid therapeutics, including DNA, RNA, and their modified forms, are revolutionizing the biopharmaceutical industry with applications ranging from infectious disease vaccines to gene and CAR-T cell therapies. Despite their enormous potential, challenges related to cost, efficiency, and rapid analytical methods remain bottlenecks in accelerating the journey from discovery to manufacturing. Traditional techniques like mass spectrometry (MS), high-performance liquid chromatography (HPLC), ultraviolet–visible (UV-vis), and sequencing are effective and serve as standard analytical tools for ensuring the quality of nucleic acid therapeutics. Although Raman spectroscopy is well-explored in protein therapeutics for its rapid, non-destructive, and high molecular specificity analysis in aqueous phases with minimal sample preparation, its application in nucleic acid therapeutics has been limited. In this proof-of-concept study, we demonstrate the feasibility of Raman spectroscopy for the quantification of adenine residues in the polyA tail using oligonucleotides as example, highlighting its potential for broader applications in nucleic acid therapeutics.

Nucleic acid therapeutics represent an emerging technology with a wide range of applications, revolutionizing the landscape of modern biopharmaceutical industries (1). By leveraging the encoded genetic information in their sequences, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and other forms of modified or conjugated nucleic acids offer innovative solutions for various health issues. The successful deployment of messenger RNA (mRNA) vaccines during the Covid-19 pandemic exemplifies the enormous potential of these advanced treatments (2). Beyond infectious diseases, nucleic acid therapeutics are also making significant strides in gene therapy and chimeric antigen receptor T (CAR-T) cell therapy, offering new avenues for the treatment of cancer and genetic diseases (3,4). As research and development continue to advance, the scope and impact of nucleic acid therapeutics are poised to expand, providing hope for more effective and targeted medical interventions. However, from discovery to manufacturing, this nascent field faces many challenges leading to increased cost, operational time, and lower process efficiency, particularly in the absence of established robust and reliable analytical methods (5).

One of the needs in nucleic acid therapeutics is fast, reliable, and cost-effective analytics for the identification and quality assessment of nucleic acids (6). Like any other process of drug product manufacturing, the identity of the product and critical quality attributes (for example, 5’ capping efficiency, the number of modified nucleotides, and the length of the polyA tail in mRNA) must be established before further downstream processing to determine the stability and efficacy of the therapeutic. Currently, incumbent technologies such as mass spectrometry (MS), high-performance liquid chromatography (HPLC), and nuclear magnetic resonance (NMR) are the preferred analytical methods. An alternative to these technologies is Raman spectroscopy. The application of Raman spectroscopy is well-established for monitoring and control of upstream and downstream processes for monoclonal antibody production (7–9). Recently, there is a growing interest in implementing Raman spectroscopy for the development of nucleic acid therapeutics. A successful example is the demonstration of in-line process Raman for real time monitoring of the in-vitro transcription (IVT) reaction(10).

Raman spectroscopy is a vibrational technique that utilizes inelastic scattering of photons to provide detailed information about the molecular composition and structure of a given sample (11). This technique is based on the Raman effect, where incident light interacts with the sample, resulting in a shift in energy that corresponds to specific vibrational modes of the molecules present. Through this inelastic scattering process, a unique Raman spectrum is generated, serving as a distinctive fingerprint for each molecule. This enables the identification and characterization of various chemical entities.

Raman spectroscopy is highly effective for the analysis of dry or aqueous nucleic acids. The nucleic acids are made up of nucleotide bases that are rich in unsaturated double bonds, heteroatomic aromatic rings, and single bonds between carbon, hydrogen, nitrogen, oxygen, and sometime sulfur in the modified nucleotides(12). These electron cloud densities are easily distorted or polarized when these vibrating bonds are under the influence of the electrical field of the incident light. This generates a high-induced electrical dipole moment that results in strong Raman scattering (11). Each nucleotide has distinct chemical bonds with associated Raman cross-sectional scattering that leads to a unique Raman fingerprint (11). Nucleic acids are ordered assemblies of nucleotides or modified nucleotides. Consequently, depending on the sequence, each nucleic acid sequence also has its own unique Raman signature. This high molecular specificity makes Raman spectroscopy a powerful tool for identification, structural analysis (for example, single stranded RNA vs. double-stranded RNA), assessment of quality attributes, and monitoring and control of manufacturing process of nucleic acid therapeutics (10,13).

In this study, we demonstrate the potential of Raman spectroscopy for assessing the number of adenines in the polyA tail of oligonucleotide. The results of this study serve as a proof of concept that Raman spectroscopy could be broadly applied to evaluate critical quality attributes of nucleic acid therapeutics, including estimating the length of the polyA tail of mRNA.

Materials and Methods

Nine different single stranded DNA oligonucleotides (P1, P3, P4, P5, P6, T1, T2, T3, and T4) were obtained from Life Technologies Corporation (ThermoFisher Scientific). The samples were split into training (P1, P3, P4, P5, and P6) and test samples (T1, T2, T3, and T4) and used to train and test the chemometric models respectively. The sequences for these oligonucleotides are shown in Table I. All these oligonucleotides share a common sequence (ATGCGTAGC) but differ in the numbers of adenines in the polyA tail.

The oligonucleotides were received in lyophilized form at the concentration of approximately 1 μmol, and they were diluted with distilled water to approximately 1 mM solution. A droplet of 10 μL volume of the ~1 mM solution was placed in a hydrophobic coated stainless-steel surface sourced from BioTools. The Raman spectra were collected using a micro-immersible probe integrated with Thermo Scientific MarqMetrix All-In-One Process Raman Analyzer. The high surface tension of aqueous solution facilitated contact with the optical surface of micro-immersible probe. The acquisition parameters for acquiring Raman spectra were set to power of 450 mW, exposure time of 3000 ms, and a total average of three. All spectra were preprocessed using Automatic Whittaker Filter (asymmetry = 0.001 and lambda = 1000) to remove the baseline followed by normalized using the infinity norm calculated in the region 1470 to 1500 cm^-¹. The preprocessed Raman spectra were used to train the classification model for identification of the nucleotide-based on sequence. The classification model was developed using principal component analysis (PCA) based soft independent modeling by class analogy (SIMCA) algorithm. Similarly, the principal component regression (PCR) was developed to quantify the number of adenine residue in the polyA tail of oligonucleotide. The PCR model was internally cross-validated to minimize overfitting by using a leave-one-out cross validation strategy. Finally, the predictive performance of the regression model was evaluated on the independent test samples shown in Table I. The oligonucleotide solution for the test samples were prepared like the training sample and the Raman data were acquired using the same acquisition parameters as the training data.

All data preprocessing and chemometric analysis were performed using chemometric software SOLO 9.3.1 (2024) from Eigenvector Research, Inc.

Results and Discussion

The overlay of Raman spectra of training oligonucleotides (P1, P3, P4, P5, and P6) in the spectral region 800 to 1810 cm^-¹ is shown in Figure 1a. The oligonucleotides used in this study were composed of adenine (A), guanine (G), cytosine (C), and thymine (T) residues, linked through phosphodiester bonds in a sequence utilizing deoxyribose sugars as the backbone. Adenine and guanine, which feature fused pyrimidine and imidazole rings, are classified as purines, whereas cytosine and thymine, containing only pyrimidine rings, are classified as pyrimidines (14). These structural differences among adenine, guanine, cytosine, and thymine result in distinct Raman signatures (15) thymine, guanine, and cytosine are calculated in the frame of density functional theory (DFT). Consequently, different oligonucleotides exhibit different Raman signals depending on the number of adenines, guanines, cytosines, and thymines present in their sequences.

The spectra shown in Figure 1a were preprocessed by removing the baseline and normalized to the region 1470 to 1500 cm^-¹ for correcting path length differences as shown in Figure 1b.

The peak at approximately 1486 cm^-¹ is primarily attributed to the vibrational mode of the imidazole ring in guanine as shown in Figure 2, which is free of interferences from adenine. Thus, all spectra were normalized using the peak at approximately 1486 cm^-¹ to remove differences in intensity because of differences in the concentrations between samples.

The normalized spectra shown in Figure 1b exhibit distinct Raman intensity profiles that linearly increase with the number of adenines in the polyA tail. These Raman peaks can be attributed to the specific energies of the molecular vibrations of adenine residues (16) using vibrational spectroscopy techniques coupled to DFT calculations for the isolated molecule and the solid. In both cases, the N9H-amino tautomer was found to be the predominant species, followed by the N7H-amino form. Excellent agreement was achieved between experiment and theory, both for the wavenumbers and intensities without the need for scaling. The Raman band at approximately 1025 cm^-¹ can be attributed to the twisting of the NH₂ (t(NH₂)) functional group attached to the C6 carbon and the stretching of the C6-N1 bond (ν(C6-N1)) of the pyrimidine ring. The Raman band at approximately 1127 cm^-¹ can be attributed to contributions from the in-plane deformation of the C8-H (δ(C8-H)) bond of the imidazole ring and the stretching of C-N bonds within the pyrimidine ring (ν(C3-N3), ν(C4-N9), ν(C5-C6), ν(C5-N7)). The strong Raman feature at approximately 1109 cm^-¹, as well as those at 1334 and 1372 cm^-¹, are because of a mixture of various C–N stretching modes of the pyrimidine and imidazole rings, and the in-plane C-H deformation associated with C2 and C8 carbon. Similarly, the Raman bands at approximately 1420 and 1580 cm^-¹ result from a mixture of various C-C and C-N bond stretching modes, with some additional contributions from NH₂ scissoring. These assignments of Raman bands to the specific vibration modes of adenine clearly demonstrate that in Figure 1b the oligonucleotides are distinguished based on the Raman signatures of adenine.

Identification and Classification of Oligonucleotide

Before performing the regression analysis, the data were analyzed by developing the classification models using principal components. A principal component analysis (PCA)-based soft independent modeling by class analogy (SIMCA) model was developed for the classification of P1, P3, P4, P5, and P6 using the preprocessed spectra shown in Figure 1b. Three spectra for each oligonucleotide were collected as independent test samples and fed into the model. The predicted results are shown in Table II as a confusion matrix. All the samples were correctly identified to the appropriate class with no false predictions. The values for the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) were 1, 1, 0, and 0, respectively (confusion matrix not shown). Thus, the high sensitivity (TPR) and specificity (TNR) of the SIMCA classification model indicate that the model is capable of correctly identifying positive (lower type II error) and negative cases (lower type I error)(17). This clearly demonstrated that Raman spectroscopy with chemometrics can be used to classify or identify nucleic acids based on the length of polyA tail.

Quantification of the PolyA Tail of the Oligonucleotide

A principal component regression (PCR) model was developed to mathematically correlate the Raman spectra shown in Figure 1b with the number of adenine residues in the polyA tail region, as outlined in Table I, after mean-centering all data. The root mean square error of cross-validation (RMSECV) was approximately 0.60 number of adenine residues that approximately round off to an error of ± 1 adenine. The R² value for the cross-validation of the model was approximately 0.98, indicating that the model captures 98% of the variance in the training data. These statistics collectively indicate good qualities of the model.

Figure 3a displays the plot of the root mean square error of calibration (RMSEC), the root means square error of cross-validation (RMSECV), and the ratio of RMSECV to RMSEC for each PCR model developed with an increasing number of principal components (PCs), as labeled on the x-axis. The optimum number of PCs for the regression model was determined using internal cross-validation strategy (18). In this study, leave-one-out cross-validation was used. The two principal component model was selected as an optimum model as adding more PCs did not significantly improve RMSECV while it significantly increased the ratio of RMSECV to RMSEC.

The loadings for the two principal components (PCs) used to develop the PCR model are shown in Figure 3b. PC1 explains 96% of the variance in the training spectra. The loading of PC1 closely resembles the Raman signature of adenine, as explained above. The selectivity ratio plot for the PCR model is shown in Figure 3c. The selectivity ratio provides a numerical assessment of the importance of each variable (Raman shift) in a regression model (19)or equivalently, the predictive component from orthogonal partial least squares (OPLS). It is calculated as the ratio of explained to residual variance for each Raman shift. The larger the selectivity ratio, the more useful the given Raman shifts are for prediction. For the PCR model developed in this study, the Raman shifts associated with the Raman signature of adenine have selectivity ratios exceeding the threshold value of 2.7. Thus, both the loading of PC1 and the selectivity ratio plot clearly demonstrate that the PCR model is specifically based on differences in the number of adenine residues in the oligonucleotide.

To evaluate whether the model was based on random chance (indicative of overfitting) or true correlation, a permutation test was performed (20). In this test, the response variables (Y block; number of adenines in the polyA tail) were shuffled and randomly assigned to the spectral data. The fraction sum of squares Y (SSQ Y) captured for calibration and cross-validation were calculated, transformed to a centered and standardized form, and plotted against the correlation of the permuted Y-block to the original Y-block (x-axis), as shown in Figure 3d. The magnitude of SSQ Y on the y-axis of Figure 3d indicates the distance of the data point from the mean in units of standard deviation. For random models, the mean is zero, as indicated by the dotted line. The permuted or random models created by 1,000 iterations of permutation are shown on the left end, while the original unpermuted model is shown on the right end of Figure 3d. The original unpermuted model is approximately four standard deviations away from the random models, indicating that the original model is based on true correlation and not overfitted.

Additionally, pairwise Wilcoxon signed-rank tests, pairwise signed-rank tests, and randomization t-tests performed on the original and permuted models resulted in p-values that are tenfold lower than 0.05 at 95% confidence (data not shown). These probability tests provide confidence in concluding that the developed PCR model is statistically different from a random model and therefore reliable.

The predictive performance of the PCR model for the quantification of adenine in the polyA tail was evaluated by applying the model on the independent test samples. The result is shown in Figure 4. The root mean square error of prediction (RMSEP) was ~1.1. This indicate that the developed model with limited data set is reliable for quantifying the number of adenines in the polyA tail for oligonucleotides that are different by approximately three adenine residues (3* RMSEP = ~3). By augmenting more data and implementing different preprocessing or algorithms, more accurate models can be developed to improve the prediction errors. Thus, Raman-based polyA quantification can be a viable option for a standard ion pair reversed phase high performance liquid chromatography (IP-RP-HPLC) in a process where the tolerance of accuracy and precision for polyA quantification is above the prediction error from Raman.

Conclusion

In conclusion, we demonstrated the potential of Raman spectroscopy for the identification and classification of oligonucleotide as well as the quantifying the number of adenines in the polyA tail of oligonucleotides. Although we used deoxyribose short chain oligonucleotides, the applications of Raman spectroscopy extend to other nucleic acids and their derivatives, including RNA, DNA, and modified or conjugated nucleic acids. Additionally, in literature, Raman spectroscopy has been shown to provide valuable insights into the conformational analysis of nucleic acids, such as differentiating between single-stranded and double-stranded RNA and identifying conformational isomers of circular DNA (13). Thus, combining our findings with existing literature, it is evident that Raman spectroscopy holds immense potential for nucleic acid research, process development, and manufacturing.

References

Kulkarni, J. A.; Witzigmann, D.; Thomson, S. B.; Chen, S.; Leavitt, B. R.; Cullis, P. R.; van der Meel, R. The Current Landscape of Nucleic Acid Therapeutics. Nat. Nanotechnol. 2021, 16 (6), 630–643. DOI: 10.1038/s41565-021-00898-0.
Rohner, E.; Yang, R.; Foo, K. S.; Goedel, A.; Chien, K. R. Unlocking the Promise of mRNA Therapeutics. Nat. Biotechnol. 2022, 40 (11), 1586–1600. DOI: 10.1038/s41587-022-01491-z.
Liu, C.; Shi, Q.; Huang, X.; Koo, S.; Kong, N.; Tao, W. mRNA-Based Cancer Therapeutics. Nat. Rev. Cancer 2023, 23 (8), 526–543. DOI: 10.1038/s41568-023-00586-2.
Shen, T.; Zhang, Y.; Zhou, S.; Lin, S.; Zhang, X.-B.; Zhu, G. Nucleic Acid Immunotherapeutics for Cancer. ACS Appl. Bio Mater. 2020, 3 (5), 2838–2849. DOI: 10.1021/acsabm.0c00101.
Ouyang, J.; Zhan, X.; Guo, S.; Cai, S.; Lei, J.; Zeng, S.; Yu, L. Progress and Trends on the Analysis of Nucleic Acid and Its Modification. J. Pharm. Biomed. Anal. 2020, 191, 113589. DOI: 10.1016/j.jpba.2020.113589.
Analytical Procedures for Quality of mRNA Vaccines and Therapeutics (Draft Guidelines: 3rd Edition) | USP-NF. https://www.uspnf.com/notices/analytical-procedures-mrna-vaccines-20240802 (accessed 2024-08-26).
Villa, J.; Zustiak, M.; Kuntz, D.; Zhang, L.; Khadka, N.; Broadbelt, K.; Woods, S. Use of Lykos and TruBio Software Programs for Automated Feedback Control to Monitor and Maintain Glucose Concentrations in Real Time.
Villa, J.; Zustiak, M.; Martin, A.; Zhang, L.; Khadka, N. A Novel Strategy of Using Process Raman for Feedback Control of Viable Cell Density in Perfusion Cell Culture.
Nolasco, M.; Pleitt, K.; Khadka, N. Using a Process Raman Analyzer as an In-Line Tool for Accurate Protein Quantification in Downstream Processes.
Clenet, D.; Potisopon, S. Spectral Monitoring of in Vitro Transcription. WO2024074726A1, April 11, 2024. https://patents.google.com/patent/WO2024074726A1/en (accessed 2024-08-30).
Rzhevskii, A. Modern Raman Microscopy: Technique and Practice; Cambridge Scholars Publishing, 2021.
Peticolas, W. L.; Tsuboi, M. The Raman Spectroscopy of Nucleic Acids. In Infrared and Raman Spectroscopy of Biological Molecules; Theophanides, T. M., Ed.; Springer Netherlands: Dordrecht, 1979; pp 153–165. DOI: 10.1007/978-94-009-9412-6_11.
Morla-Folch, J.; Xie, H.; Alvarez-Puebla, R. A.; Guerrini, L. Fast Optical Chemical and Structural Classification of RNA. ACS Nano 2016, 10 (2), 2834–2842. DOI: 10.1021/acsnano.5b07966.
Minchin, S.; Lodge, J. Understanding Biochemistry: Structure and Function of Nucleic Acids. Essays Biochem. 2019, 63 (4), 433-456. DOI: 10.1042/EBC20180038.
Santamaria, R.; Charro, E.; Zacarías, A.; Castro, M. Vibrational Spectra of Nucleic Acid Bases and Their Watson–Crick Pair Complexes. J. Comput. Chem. 1999, 20 (5), 511–530. DOI: 10.1002/(SICI)1096-987X(19990415)20:5<511::AID-JCC4>3.0.CO;2-8.
Lopes, R. P.; Valero, R.; Tomkinson, J.; Marques, M. P. M.; Carvalho, L. A. E. B. de. Applying Vibrational Spectroscopy to the Study of Nucleobases – Adenine as a Case-Study. New J. Chem. 2013, 37 (9), 2691–2699. DOI: 10.1039/C3NJ00445G.
Ting, K. M. Sensitivity and Specificity. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G. I., Eds.; Springer US: Boston, MA, 2016; pp 1–1. DOI: 10.1007/978-1-4899-7502-7_758-1.
Lever, J.; Krzywinski, M.; Altman, N. Model Selection and Overfitting. Nat. Methods 2016, 13 (9), 703–704. DOI: 10.1038/nmeth.3968.
Kvalheim, O. M. Interpretation of Partial Least Squares Regression Models by Means of Target Projection and Selectivity Ratio Plots. J. Chemom. 2010, 24 (7–8), 496–504. DOI: 10.1002/cem.1289.
Tools: Permutation Test - Eigenvector Research Documentation Wiki. https://www.wiki.eigenvector.com/index.php?title=Tools:_Permutation_Test (accessed 2024-08-26).