Mass Spectral Libraries: The Next Generation

Fundamentally new manipulations of data contained in mass spectral databases and libraries are envisioned for the next generation of mass spectral libraries.

We described mass spectral libraries and their uses in this forum in 1994 (1), and we can assume that at least a few things have changed in the intervening 15 years. Certainly the standard mass spectral libraries contain more archived spectra, the computers used to search the libraries are much faster, and the software is more sophisticated. In the larger perspective, mass spectrometry (MS) is now applied to a much wider variety of analytical problems than before, and more types of mass spectra (using different ionization methods) are recorded. Major advances in instrumentation means that more accurate mass measurements are recorded, leading to information that more accurately predicts the ion empirical formula. Overall, the basic use of a mass spectral library remains the same, and that is to assist in the identification of molecular structure, and the identification of a particular compound, via a match between a measured mass spectrum and the mass spectrum stored within the library for the known compound. It has often been suggested (informally at least) that the breadth of modern spectral libraries and the speed with which they can be searched will result in a decay of both basic and advanced spectral interpretation skills on the part of the analyst. Phrased within a simple question: "Why interpret when you can just search?" There is an apparent push toward search rather than interpretation. This push is not so much driven by the increased speed of the library search with modern data systems but is instead catalyzed and amplified by the fact that mass spectra are acquired much faster than before, and hundreds of samples (if not more) can be analyzed in a single day. Although it can seem that there is little time for thoughtful spectral interpretation, interpretation short courses such as those offered annually at ASMS remain full of those wanting to learn the skill. There are certain applications, such as forensic MS, in which the truly unknown nature of some of the samples can reduce the usefulness of a library and magnifies the interpretive skill of the analyst. Interpretive acumen is a valuable skill that will never be replaced.

Kenneth L. Busch

Where do libraries of mass spectral data get their content? In the early days, mass spectra were shared freely among the members of the user community. The informal collection process soon evolved into a standard collection process and format, which included an evaluation of the spectral quality (and therefore the propriety for inclusion in the library). Users were encouraged to submit mass spectra recorded for new compounds, and many service laboratories in academic settings did so over a period of many years. However, as the libraries grew, the newly contributed mass spectra were more and more likely to be unique to a particular synthesis or research project and perhaps less likely to be of interest to members of the broader community, who were unlikely to encounter that particular compound themselves. The incremental cost for including the additional mass spectra was very low, however, and each new spectral–structure correlation could, in theory, be used to confirm and extend, or offer an exemption to, the standard "rules" of spectral interpretation. Concurrent with the growing importance of MS in support to the research efforts of the pharmaceutical industry, it also became clear that valuable competitive information could be gleaned from the mass spectra submitted by one's colleagues (and competitors). The uninhibited flow of mass spectral contributions into libraries slowed, and the dissemination of mass spectral information from the base of industrial users became much more tightly controlled. Within a very short time, the contribution of new data became almost the exclusive venue of the academic laboratories, and even there, the implications for potential proprietary information made an impact. Finally, the electron ionization (EI) mass spectra that formed the core of the original mass spectral libraries were supplanted by mass spectra recorded with other ionization methods including chemical ionization and later electrospray ionization mass spectra. Large, complex sets of data from gas chromatography (GC)–MS, liquid chromatography (LC)–MS, and MS-MS also needed to be accommodated within a modernized format and structure. Annotating and transferring such large data sets was cumbersome and awkward, and to this day, we still have not fully managed the glut of mass spectral data produced by our analytical instruments.

As libraries evolved, it also became clear that their potential uses were much broader than the simple identification of a compound via match of a spectrum with an entry in the library. The automated processing of large data sets derived from GC–MS and LC–MS for overlapped components constitutes the challenge of peak deconvoluton. In teaching about MS, I count on a humorous chuckle from those in the class when I refer to an "academically complex mixture of three components," and the relatively simple spectral processing used to unravel those three overlapped peaks. In reality, sample mixtures comprise hundreds of individual components across a wide range of concentrations. Overlapped peaks eluted from a chromatographic column are inevitable in the analysis of a real complex mixture. Only rarely will a well-shaped peak be found that consists of a single component, and the information contained in mass spectral libraries clearly could be used to assist in the peak deconvolution process.

One potential difficulty in such a library-assisted process is the completeness of the library. If one could assume that every compound in a complex mixture is represented accurately by an entry in the library, then the peak deconvolution challenge would be simplified. However, not only is a GC–MS data set likely to be a combination of compounds in the library and those not yet so archived, but there is also a variability in instrument and acquisition parameters that leads to a certain "natural" variability in the spectra recorded both during the analysis, and therefore archived in the library. In a one-on-one comparison, this variability is reflected in the match factor or quality factor. Say a perfect match might generate a match factor of 100; it is quite acceptable to observe match factors of 85–90 for confirmed compounds. In a deconvolution algorithm, proper assessment of this variability must be included because its effect will be magnified when the two overlapping components are of significantly different concentrations. It is reasonable to forecast the increasing use of libraries of mass spectral data used for deconvolution, based upon four factors: the increasing size of libraries of mass spectral data; the increasing availability within libraries of mass spectral data of several types (for example, both electron and chemical ionization) for the same mixture with the same chromatographic elution profile; increased computational power and the speed that results; and the ability to use several approaches to reach a deconvolution of the same complex data set. Of these factors, the increased computational power is probably most apparent to the average user. The true impact of the other three factors will not be apparent until we truly reach the next generation in the use of mass spectral libraries through data heuristics, as described later in this column.

Ausloos and colleagues (2) provide a critical evaluation of mass spectral libraries along with an instructive chronology of the history of their development. Representing the culmination of an extensive body of work on mass spectral libraries, the Automated Mass Spectral Deconvolution and Identification System (AMDIS) was first released in 1996, and it is now included with the NIST Mass Spectral Library (3). The AMDIS program provides the capability to "extract" the mass spectrum of each component in a mixture, with the spectral data drawn from a full GC–MS or LC–MS data set. After the program generates the "pure" mass spectrum, it searches for a match within the reference library. The program iterates through a process of four discrete sequential steps: noise analysis, component perception, spectral deconvolution, and finally, compound identification. Other substructure generation programs (such as MOLGEN) were applied to the interpretation of mass spectra and are constantly optimized and expanded and their performance compared (4).

Even as the first comprehensive libraries were being assembled and deconvolution, search, and compare algorithms developed, another approach was to create heuristic interpretive aids — to duplicate, in computer programs, the interpretive strategies used by a human. The heuristic approach extends beyond the ability to codify and capture the "rules" of spectral interpretation. "Heuristic" means that the programs attempted to use interpretive experience to create more optimized and refined means to aid in the interpretation of mass spectra. Remember that short courses that teach interpretation lay out a series of common-sense rules (algorithms) and then add experience in the guise of similar problems previously solved, analogous fragmentation processes, and that intuition that leads to the "aha" moment that leads to the solution of the interpretive problem. The sum total of that interpretive skill is what a heuristic program seeks to emulate. Consider an analogy to computer programs that play chess: The programming of the movement of the pieces is simple, but the earliest simple programs lost to human players that could better strategize and think ahead within a larger perspective. Eventually, the programs became more sophisticated and developed the same abilities. Most experts now concede that the computer–human chess competition can be called a draw at this point.

It is instructive to read about the first heuristic programs (5) and to evaluate them in the context of both the spectral knowledge of the time and the computer resources available. We forget just how rudimentary some of our abilities were, from the first data systems running on PDP-8 computers, to pen and ink plotters, to large Winchester hard drives with an (amazing) 20 MB of storage. Do column readers remember Beynon's book (6), which was a compilation of the exact masses calculated for empirical ion formulas, assembled from the output of a mainframe computer program? Does anyone remember using the printed eight-peak index of mass spectral data, or the printed library, to search for the mass spectra that matched the unknown? Now, such tasks are accomplished easily with personal computers, even with the large mass spectral libraries currently available. Even as some libraries of mass spectral data now contain hundreds of thousands of mass spectra, research continues into computational methods useful for linking spectral information with structure information. The goal in one recent work was to explore methods for deducing structure for an unknown compound when the library did not contain a spectrum for that compound, in which case the structure is derived through similarities in spectra and the implicated substructures (7). New data-mining methods explore the databases for new information (8). Seven "golden rules" for the heuristic derivation of molecular formula from exact mass spectral databases have been described (9), offering new computational twists on a process exceedingly simple on its face. In a test of accuracy of results obtained in compliance with the rules (developed on over 420,000 molecular formula), the correct formulas were retrieved at 80–99% probability with 5% absolute isotope ratio deviation and 3 ppm measured mass accuracy. Interestingly, this heuristic approach was developed based upon information for C-, H-, N-, S-, O-, and P-containing molecular formulas drawn from the PubChem database. This database now contains information on almost 10 million compounds. But, as Figure 1 shows, it is a database of lower molecular mass compounds, while MS has advanced to encompass analyses of much higher molecular mass compounds. We discuss this dilemma in the final part of this column.

Figure 1

As new library-related tools are developed and new applications are envisioned, users have developed custom libraries and the standard search of a mass spectral library has become anything but standard. It is conceptually straightforward to add to the standard positive ion, electron ionization library with a collection of chemical ionization spectra, electrospray ionization mass spectra, or a mass spectrum of negative ions, and to manipulate and search the data using the same algorithms. Exact mass data and chromatographic retention times can be linked directly into customized molecular features databases for specialized applications, such as pesticide analysis (10). The nature of the archiving process need not be changed significantly for a collection of MS-MS data, although the strategies for search and match evolve. An example is the creation of a computational link between product ion MS-MS spectra and standard EI mass spectra on the library (11). Furthermore, because "standard conditions" for the collection of MS-MS spectra have not been established, the concern of instrument reproducibility over time and its impact on library searching has been explored, with the conclusion that the search results generally maintain their usefulness (12). The role of the library has evolved with the changing nature of the data stored within it. Compound identification is no longer the sole final goal. GC–MS data are interpreted with the aid of a library and an analysis scheme that includes biotoxicity classifiers (13). In another example, use of libraries of LC–MS data in applications of forensic technology has been explored (14). An intriguing recent example of a customized library with a targeted goal of medical diagnosis is the creation of a library of matrix-assisted laser desorption ionization mass spectra, with the stored data evaluated using statistical means, used as part of an approach for colorectal cancer diagnosis (15). New open-source software to manage mass spectral libraries has been developed (16).

What's in the future? One of the first problems encountered is the limitations of the database world outside of MS. As mentioned with respect to Figure 1, PubChem (as an example) limits itself to a lower molecular mass world. What about salts? Ionic compounds? Noncovalently bound complexes? Extending the mass range of the library to 10,000 or 100,000 Da is reasonable, and we must remember that we seldom look at a full mass spectrum from m/z 50 to m/z 10,000. Not only is our spectral window of interest much narrower, but the library must record the charge of the ion as well as its mass and accommodate charge-changing reactions as well as mass-changing reactions. Far from being a mature part of our science, mass spectral libraries are truly at the base of the innovation curve, and they are poised to enter the next generation. The future will bring a true HAL (heuristic automated library), combining features that represent the new generation of library searches and assisted interpretation and quality control. The past evolution of information resources from printed to electronic to cloud hosts is instructive in our forecast. A printed library, starting with those early pages in the API binder, through to the multiple volumes of the mass spectral indices of the 1980s, is a tangible form of the mass spectral library. The search algorithms in computerized libraries were matched with expanding libraries stored on CDs and hard drives, with the libraries growing in size and complexity and expanding to include additional types of data. The issue of distribution of these tools and resources, and the need for updated versions, persists. But these are issues rooted in the past and are sidestepped easily in the future. Can there be a cloud-based mass spectral search library? If issues of proprietary or privileged information can be managed, the creation of a real-time (or almost real-time) library of mass spectral data is a lofty goal. Searches are a few seconds of internet connection time. Eventually, the same artificial intelligence applied to the interpretation of data will be applied to the collection of data. We need that discriminatory power, because the number of mass spectra recorded electronically on a daily basis can be estimated as between 10⁶ and 10⁷ individual mass spectra. If only 1% of these spectra were contributed to an online search system, and even if most of these were mass spectra for background signals derived from various GC or LC columns, were recorded for common contaminants, or were identified as the mass spectra of common chemicals of commerce, or as surface-resident compounds on objects sampled with ambient MS, then the potential value of the "cloud" networked library, with its diversity as well as its immediacy, becomes apparent. The development of such a resource will require the support and sponsorship of a professional organization.

Kenneth L. Busch is curious about how many participants attending current mass spectrometry meetings have ever held a printed library of mass spectra within their hands, or thumbed through the pages of such a volume, or attempted to do a manual search of an unknown spectrum. The authoritative heft of printed compilations is drawn from far more than their bulk. The carefully vetted, certified mass spectrum may be a relic of the first golden age of mass spectrometry. In this data-rich world, we now discard more mass spectra than we save. The author is solely responsible for the contents of this column, and can be reached at: wyvernassoc@yahoo.com

References

(1) K.L. Busch, Spectroscopy 9(9), 24 (1994).

(2) P. Ausloos, C.L. Clifton, S.G. Lias, A.I. Mikaya, S.E. Stein, D.V. Tchekhovskoi, O.D. Sparkman, V. Zaikin, and D. Zhu, J. Amer. Soc. Mass Spectrom. 10, 287 (1999).

(3) http://chemdata.nist.gov/mass-spc/amdis/

(4) E.L. Schymanski, M. Meringer, and W. Brack, Anal. Chem . 81(9), 3608–3617 (2009).

(5) R.K. Lindsay, B.G. Buchanan, E.A. Feigenbaum, and J. Lederberg, Artif. Intell. 61(2), 209–261 (1993).

(6) J.H. Beynon and A.E. Williams, Mass and Abundance Tables for Use in Mass Spectrometry (Elsevier, New York, 1963).

(7) W. Demuth, M. Karlovits, and K. Varmuza, Anal. Chim. Acta 516, 75–85 (2004).

(8) Y. Liang and F. Gam, Anal. Chim. Acta , 446(1–2), 115–120 (2001).

(9) T. Kind and O. Fiend, BMC Bioinformatics 8, 105 (2007). Available at http://www.biomedcentral.com/1471-2105/8/105.

(10) I. Ferrer, A. Fernandez-Alba, J.A. Zweigenbaum, and E.M. Thurman, Rapid. Comm. Mass Spectrom. 20, 3659–3668 (2006).

(11) J.-T. Chen, C.-G. Juo, C.-F. Yeh, and G.-R. Her, Rapid Comm. Mass Spectrom . 10, 179–182 (1996).

(12) M. Gergov, W. Weinmann, J. Meriluoto, J. Uusitalo, and I. Ojanpera, Rapid Comm. Mass Spectrom. 18, 1039–1046 (2004).

(13) E.L. Schymanski, C. Meinert, M. Meringer, and W. Brack, Anal. Chim. Acta 615, 136–147 (2008).

(14) M. Gergov, Ph.D. Dissertation "Library Search-Based Drug Analysis in Forensic Toxicology by Liquid Chromatography-Mass Spectrometry," Helsinki University of Technology, Helsinki, Finland, 2004.

(15) S. Cristoni, L. Molin, A. Lai, L.R. Bernardi, S. Pucciarelli, M. Agostini, C. Bedin, D. Nitti, R. Seraglia, O. Repetto, V.F. Dibari, R. Orlandi, P. Martinez- Lozano Sinues, and P. Traldi, Rapid Comm. Mass Spectrom. 23, 2839–2845 (2009).

(16) B. Thielen, S. Heinen, and D. Schomburg, BMC Bioinformatics 10, 229 (2009). Available at http://www.biomedcentral.com/1471-2105/10/229.