DreaMS: Self-Supervised Molecular Representations from Mass Spectra


Construction of GeMS dataset

To enable self-supervised learning, we mined new large datasets of metabolite MS/MS spectra from the GNPS part of the MassIVE repository, which we named GeMS. MassIVE is a community-driven resource with billions of mass spectra from various biochemical and environmental studies. However, it primarily focuses on proteomics and often contains low-quality data as a result of its uncurated nature. Therefore, we developed a series of algorithms to identify, filter and cluster the metabolomics spectra of MassIVE into high-quality, non-redundant datasets. In this section, we describe our procedure; a more detailed analysis and statistics are available in our technical report79.

Selecting LC–MS/MS experiments from MassIVE

We start the mining of MassIVE by selecting all .mzML and .mzXML data files from all 4,467 MassIVE datasets (as of November 2022) that are explicitly marked as metabolomics studies with the ‘GNPS’ prefix in their names. This selection yields 338,649 distinct files, among which 249,422 contain MS/MS data with a total of 814 million MS/MS spectra. By filtering out empty or corrupted spectra with invalid m/z or intensity values (for example, negative intensity or multiple identical m/z values), we obtain a complete, unprocessed version of GeMS, comprising 714 million MS/MS spectra.

Estimating quality of MS data

To obtain higher-quality subsets, we apply file-level and spectrum-level quality criteria to the collected spectra. File-level criteria assess the ordering of spectra based on retention times and tandem MS levels. We discard files with unordered retention times, invalid sequences of MS levels (for example, MS3 following MS1 without MS2), missing MS1 data or fewer than three spectra. Notably, we estimate MS instrument accuracy by evaluating the deviation of similar m/z values within extracted ion chromatograms (XICs). More precisely, the algorithm constructs a set of XICs for MS1 base peak masses and then estimates the accuracy of the instrument as the median of standard deviations within individual XICs (Algorithm 1).

The spectrum-level quality criteria operate in several steps. Initially, spectra with a low number of peaks or a low intensity amplitude (that is, the maximum intensity divided by the minimum intensity) are filtered out. Subsequently, spectra with non-singly charged precursor ions and spectra with excessively high m/z values (>1,000 Da) are excluded. These steps are crucial for retaining only small metabolite molecules. We keep only spectra acquired in positive ionization mode and filter out those estimated to be non-centroided (Algorithm 2).

Algorithm 1 Estimate the absolute accuracy of a mass spectrometry instrument

Require: Sequence of MS1 spectra from LC–MS experiment.

Ensure: Estimated absolute accuracy of mass spectrometry instrument.

 1: M1 ← M/z values of all base peaks   M/z values for 1st round of XICs

 2: M2 ← {}        M/z values for 2nd round of XICs

 3: for m ∈ M1 do

 4:  X ← XIC(m, 0.5)  Set of peaks forming XIC for m/z m and 0.5-Da absolute tolerance

 5:  if |X| ≥ 5 then

 6:   M2 ← M2 ∪ {MedianMz(X)}

 7:  end if

 8: end for

 9: A ← {}    Accuracy estimates within individual XICs

 10: for m ∈ M2 do

 11:  X ← XIC(m, 0.01)        XIC with lower 0.01-Da tolerance

 12:  if |X| ≥ 5 then

 13:   A ← A ∪ {StdDevMz(X)}

 14:  end if

 15: end for

 16: return Median(A)
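For illustration, the following Python sketch mirrors Algorithm 1 under the assumption that MS1 spectra are provided as pairs of NumPy arrays; the XIC construction (one most intense peak per scan within tolerance) and the function and argument names are illustrative simplifications rather than the exact production pipeline.

```python
import numpy as np

def estimate_instrument_accuracy(ms1_spectra, coarse_tol=0.5, fine_tol=0.01, min_peaks=5):
    """Sketch of Algorithm 1: median of m/z standard deviations within XICs.

    ms1_spectra: list of (mz, intensity) NumPy array pairs, one pair per MS1 scan.
    """
    def xic(target_mz, tol):
        # Collect, per scan, the most intense peak within `tol` of `target_mz`.
        peaks = []
        for mz, inten in ms1_spectra:
            within = np.abs(mz - target_mz) <= tol
            if within.any():
                idx = np.flatnonzero(within)[np.argmax(inten[within])]
                peaks.append(mz[idx])
        return np.asarray(peaks)

    # 1st round: coarse 0.5-Da XICs seeded by the base peak of every scan.
    base_peak_mzs = [mz[np.argmax(inten)] for mz, inten in ms1_spectra if len(mz)]
    m2 = []
    for m in base_peak_mzs:
        x = xic(m, coarse_tol)
        if len(x) >= min_peaks:
            m2.append(np.median(x))

    # 2nd round: tight 0.01-Da XICs; accuracy = median of per-XIC std devs.
    accuracies = []
    for m in m2:
        x = xic(m, fine_tol)
        if len(x) >= min_peaks:
            accuracies.append(np.std(x))
    return float(np.median(accuracies)) if accuracies else float("nan")
```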

By varying the filtering thresholds, we create three GeMS variants: GeMS A (42 million spectra), GeMS B (100 million spectra) and GeMS C (201 million spectra). GeMS A imposes a low threshold on the estimated instrument error (approximately four decimal places in m/z ratios) and is therefore the most stringent. GeMS B is primarily filtered by unknown charge values and is less stringent than GeMS A. GeMS C further relaxes the criteria applied to GeMS B and is mainly filtered based on criteria related to spectral peak values. Figure 2b provides the details of the applied filters for each subset.

Algorithm 2 Estimate the type of a spectrum

Require: Spectrum m/z values \({\bf{m}}\in {{\mathbb{R}}}^{n}\) and intensities \({\bf{i}}\in {{\mathbb{R}}}^{n}\).

Ensure: Estimated spectrum type.

 1: if n < 5 then

 2:  return CENTROID

 3: end if

 4: b ← argmax i       Index of base peak

 5: \(S\leftarrow \{s\in \{1,\ldots, n\}| (\forall {s}^{{\prime} }\in \{0,\ldots, s-b\})({i}_{b+{s}^{{\prime} }} > \frac{{i}_{b}}{2})\}\)

 6: if \(\max S-\min S < 3\) or \({{\bf{m}}}_{\max S}-{{\bf{m}}}_{\min S} > \frac{\max {\bf{m}}-\min {\bf{m}}}{1000}\) then

 7:  return CENTROID

 8: else

 9:  if \((\exists j\in \{1,\ldots ,n\})({i}_{j}=0)\) then

 10:   return PROFILE

 11:  else

 12:   return THRESHOLDED

 13:  end if

 14: end if
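A minimal Python reading of Algorithm 2 is shown below, assuming the peak list is sorted by m/z; the set-based condition in line 5 of the algorithm is interpreted here as a contiguous half-maximum window around the base peak, which may differ in detail from the original implementation.

```python
import numpy as np

def estimate_spectrum_type(mz, intensity):
    """Sketch of Algorithm 2: classify a spectrum as CENTROID, PROFILE or THRESHOLDED."""
    n = len(mz)
    if n < 5:
        return "CENTROID"
    b = int(np.argmax(intensity))
    # Contiguous run of peaks around the base peak with intensity above half of the base peak.
    lo, hi = b, b
    while lo - 1 >= 0 and intensity[lo - 1] > intensity[b] / 2:
        lo -= 1
    while hi + 1 < n and intensity[hi + 1] > intensity[b] / 2:
        hi += 1
    narrow_in_points = (hi - lo) < 3
    wide_in_mz = (mz[hi] - mz[lo]) > (mz.max() - mz.min()) / 1000
    if narrow_in_points or wide_in_mz:
        return "CENTROID"
    # Profile spectra retain zero-intensity points; thresholded ones do not.
    return "PROFILE" if np.any(intensity == 0) else "THRESHOLDED"
```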

Clustering mass spectra with LSH

The filtering pipeline ensures the quality of individual spectra, but it does not address biases across the GeMS datasets related to the natural abundance of metabolites. To tackle this, we employ the random projections algorithm80 for efficient clustering and deduplication of mass spectra. This algorithm belongs to the family of locality-sensitive hashing (LSH) methods and enables linear-time clustering of MS/MS spectra.

In the first step, we vectorize mass spectra via binning. Specifically, each spectrum is represented as a vector \({\bf{s}}\in {{\mathbb{R}}}^{n}\) with n equal-width bins covering the range of m/z values of interest. The value of si then corresponds to the summed intensity of the peaks whose m/z values fall within the i-th bin.

In the subsequent step, for a binned spectrum \({\bf{s}}\in {{\mathbb{R}}}^{n}\), we calculate the corresponding hash h(s) using a mapping \(h:{{\mathbb{R}}}^{n}\to {\{0,1\}}^{m}\) defined as

$$h({\bf{s}})=[{\bf{Ws}}\ge 0],\,\text{where}\,\,{\bf{W}}\in {{\mathbb{R}}}^{m,n}\,\,\text{and}\,\,{{\bf{W}}}_{ij} \sim {\mathcal{N}}(0,1),$$

where [ ] indicates an element-wise Iverson bracket, meaning that [xi] = 1 if xi is true and 0 otherwise. Essentially, each element of the Ws product is a dot product of s and a random n-dimensional hyperplane. Each of the m hyperplanes splits the n-dimensional space into two complementary subspaces, thereby determining the subspace to which s belongs, based on the sign of each dot product. These signs represent the bits of the resulting m-dimensional hash. Given that every hyperplane intersects the origin, the likelihood of two binned spectra si and sj sharing the same hash is a function of their cosine similarity80:

$${\mathbb{P}}\left(h\left({\textbf{s}}_i\right) = h\left({\textbf{s}}_j\right)\right) = 1 - \arccos\left(\underbrace{\frac{{\textbf{s}}_i^{\top}{\textbf{s}}_j}{\|{\textbf{s}}_i\| \|{\textbf{s}}_j\|}}_{{\rm{Cosine}}\,{\rm{similarity}}}\right)\frac{1}{\uppi},$$

(1)

where \({\mathbb{P}}\) denotes the joint probability over random hyperplanes. In essence, with a sufficient number of hyperplanes, random projections effectively approximate cosine similarity, which is the primary method for comparing mass spectra.

To cluster the spectra of GeMS, we use m = 64 random hyperplanes and 1-Da bins covering the m/z range from 0 to 1,000 Da (that is, n = 1,000). By varying the number of retained spectra per cluster, we establish two additional subsets for each of the A, B and C variants of GeMS with, at most, 10 and 1,000 allowed cluster representatives, denoted with additional suffixes such as GeMS-A1 or GeMS-B1000. Figure 2c demonstrates the sizes of the resulting clustered datasets.
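As an illustration, the NumPy sketch below implements the binning and hashing steps with the parameters stated above; spectra mapped to the same 64-bit hash fall into the same LSH cluster. The function names and exact binning conventions are assumptions for the sake of the example.

```python
import numpy as np

N_BINS, N_PLANES, MZ_MAX = 1000, 64, 1000.0
rng = np.random.default_rng(0)
W = rng.standard_normal((N_PLANES, N_BINS))  # random hyperplanes through the origin

def bin_spectrum(mz, intensity):
    """Vectorize a spectrum by summing intensities within equal-width m/z bins."""
    s = np.zeros(N_BINS)
    idx = np.clip((np.asarray(mz) / MZ_MAX * N_BINS).astype(int), 0, N_BINS - 1)
    np.add.at(s, idx, intensity)
    return s

def lsh_hash(s):
    """h(s) = [Ws >= 0]: one sign bit per hyperplane, packed into 8 bytes."""
    bits = (W @ s >= 0).astype(np.uint8)
    return np.packbits(bits).tobytes()
```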

GeMS data format

We store GeMS datasets as compressed tensors in our new .hdf5-based file format, designed primarily for deep learning. Supplementary Table 2 outlines the format specifications, detailing all data and metadata entities retained from the input .mzML or .mzXML files.

Murcko histograms algorithm for splitting molecular datasets

A universal and reliable protocol for supervised learning on spectral libraries is crucial for fine-tuning our pre-trained DreaMS model. The commonly used technique is to split a spectral library into training and validation folds, ensuring that no molecules share identical structures (technically, the first 14 characters of InChI keys) between the folds. However, we identify three issues with this protocol that may limit the generalization capabilities and, therefore, the practical utility of the final fine-tuned model.

First, spectral libraries often contain closely similar structures (ref. 79, Section 4.1), such as those resulting from click chemistry. Consequently, molecules with minor structural differences are often assigned to different training/validation folds, introducing data leakage for tasks such as fingerprint prediction, where small structural details may not substantially impact performance metrics. Second, structure-disjoint splits are agnostic to the fragmentation nature of MS/MS. For instance, two molecules differing only in the length of the carbon chain connecting two subfragments have distinct structures, yet such chains can be easily fragmented by CID, resulting in nearly identical spectra. Third, the structure-disjoint approach often assigns entire molecules and their abundant fragments (such as the fragments of sugars) to different folds, increasing the chance of overfitting to abundant substructures. To address these issues, we designed a new algorithm, Murcko histograms, based on Murcko scaffolds64, for splitting molecular structures into training/validation folds.

Algorithm 3 Definition of a Murcko histogram

Require: Molecular graph G = (V, E), V = {1, …, n}, E ⊆ {{u, v} | u, v ∈ V, u ≠ v}.

Ensure: Murcko histogram h.

 1: G ← MurckoScaffold(G)

 2: VR ← {Vr ⊆ V | |Vr| > 3 ∧ Vr contains all atoms of a (fused) ring}

 3: VL ← {v ∈ V | deg(v) > 1 ∧ v is not in any ring}

 4: h ← a map \({{\mathbb{N}}}^{2}\to {\mathbb{N}}\) initialized as \((\forall i,j\in {{\mathbb{N}}}^{2})(h(i,j)=0)\)

 5: for Vr ∈ VR do

 6:  \(r\leftarrow \sum \{| {V}_{r}\cap {V}_{r}^{{\prime} }| /2\ |\ {V}_{r}^{{\prime} }\in {V}_{R}\setminus \{{V}_{r}\}\}\)

 7:  l ← |{v ∈ VL | v is adjacent to an atom in Vr}|

 8:  h(r, l) ← h(r, l) + 1

 9: end for

 10: return h

To address the first issue, we built our method upon coarse-grained Murcko scaffolds. To tackle the second issue of insensitivity to fragmentation, our method takes molecular fragments as its primary design principle. To address the third issue, we define a heavily relaxed notion of molecular similarity, ensuring that the distinction between folds is well defined.

In particular, our algorithm computes a histogram defined in terms of the counts of scaffold substructures (Algorithm 3). Given the Murcko scaffold of a molecule64, the algorithm operates on two separate groups of its atoms. The first group consists of sets of atoms, with each set determining a ring (line 2 in the algorithm), whereas the second group includes all atoms connecting these rings (that is, linkers; line 3). For each ring, the algorithm calculates a pair of natural numbers: the number of neighboring rings and the number of adjacent linkers (denoted as r, l in lines 5–9). These pairs define the domain of the resulting histogram, where the values represent the counts of such pairs within a molecule (lines 4, 10). Extended Data Fig. 4a shows examples of Murcko histograms and the corresponding molecular structures.
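The RDKit-based sketch below illustrates the idea of Algorithm 3. It treats each SSSR ring separately rather than merging fused ring systems and counts neighboring rings by shared atoms, so the exact counts may deviate from the original definition; the function name and the example are illustrative.

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko_histogram(smiles):
    """Sketch of Algorithm 3: histogram of (neighboring rings, adjacent linkers) pairs."""
    scaffold = MurckoScaffold.GetScaffoldForMol(Chem.MolFromSmiles(smiles))
    rings = [set(r) for r in scaffold.GetRingInfo().AtomRings()]
    ring_atoms = set().union(*rings) if rings else set()
    # Linker atoms: non-ring scaffold atoms with degree > 1 (line 3 of Algorithm 3).
    linkers = {a.GetIdx() for a in scaffold.GetAtoms()
               if a.GetIdx() not in ring_atoms and a.GetDegree() > 1}
    hist = Counter()
    for ring in rings:
        r = sum(1 for other in rings if other is not ring and ring & other)  # neighboring rings
        l = sum(1 for v in linkers                                           # adjacent linkers
                if any(scaffold.GetBondBetweenAtoms(v, u) is not None for u in ring))
        hist[(r, l)] += 1
    return hist

# Diphenylmethane: two isolated rings, each attached to one linker atom.
# With this simplified counting, the result is Counter({(0, 1): 2}).
print(murcko_histogram("c1ccc(Cc2ccccc2)cc1"))
```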

The Murcko histogram-disjoint training/validation splitting resolves the first two aforementioned issues by being insensitive to minor atomic details and by taking into account the fragments of molecular scaffolds instead. We further address the third issue by defining a way to compare the histograms that is more relaxed than simple identity (Algorithm 4). Specifically, we define a distance on Murcko histograms as the difference in the histogram values over rings only, disregarding the number of neighboring linkers. Using this definition, we relocate samples from the validation fold to the training fold if their distance is less than 5, and we do not perform this relocation if the minimum number of rings in one of the molecules is less than 4. Note that these parameters make the boundary between training and validation folds interpretable, and, by varying them, we can balance the number of validation examples against the degree of similarity between training and validation folds in terms of scaffold substructures.

Unlike structure-disjoint splitting, our method eliminates virtually all near-duplicate training/validation leaks, resulting in a two-fold reduction in average Morgan Tanimoto similarity65 between the molecules corresponding to training and validation spectra (Extended Data Fig. 4b).

With this approach, we define approximately 90%/10% training/validation splits for MoNA (25,319/3,557 spectra corresponding to 5,524/831 molecules) as well as for the combined MoNA and NIST20 datasets (439,927/43,040 spectra corresponding to 25,476/2,454 molecules). These splits are used to fine-tune the pre-trained DreaMS model, and, throughout the text, we refer to them as Murcko histogram-disjoint splits. As mentioned previously, the name originates from the use of Murcko scaffolds64 as the basis for the algorithm. We anticipate that our training/evaluation protocol based on Murcko histograms will stimulate further research into the development of a new generation of models with enhanced generalization toward the undiscovered dark metabolome5.

Algorithm 4 Definition of a Murcko subhistogram relation

Require: Two Murcko histograms h1 and h2, a minimum number of rings k to compute the non-identity relation and a minimum difference in ring-only Murcko histogram m to consider the histograms different. The default values are k = 4 and m = 5.

Ensure: True if one of h1, h2 is a subhistogram of the other in Murcko rings, False otherwise.

 1: if \(\min \{{\sum }_{i,j\in {\mathbb{N}}}{h}_{1}(i,j),{\sum }_{i,\;j\in {\mathbb{N}}}{h}_{2}(i,j)\} < k\) then

 2:  return h1 = h2

 3: end if

 4: \(d\leftarrow {\sum }_{i\in {\mathbb{N}}}| ({\sum }_{j\in {\mathbb{N}}}{h}_{1}(i,j)-{\sum }_{j\in {\mathbb{N}}}{h}_{2}(i,j))|\)

 5: if d < m then

 6:  return True

 7: else

 8:  return False

 9: end if
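For reference, Algorithm 4 translates directly into the following Python sketch, assuming histograms are represented as Counter objects keyed by (r, l) pairs, as in the sketch above.

```python
from collections import Counter

def murcko_subhistogram_relation(h1, h2, k=4, m=5):
    """Sketch of Algorithm 4: True if h1 and h2 should not be split across folds."""
    # With fewer than k rings in either molecule, fall back to exact identity.
    if min(sum(h1.values()), sum(h2.values())) < k:
        return h1 == h2

    def ring_marginal(h):
        # Collapse the linker dimension, keeping only ring-neighbor counts.
        out = Counter()
        for (r, _l), count in h.items():
            out[r] += count
        return out

    m1, m2 = ring_marginal(h1), ring_marginal(h2)
    d = sum(abs(m1[r] - m2[r]) for r in set(m1) | set(m2))
    return d < m
```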

DreaMS neural network architecture

The DreaMS neural network architecture (Fig. 3b) can be decomposed into three main consecutive modules. Given a mass spectrum, the model first encodes each spectral peak into a high-dimensional continuous representation with PeakEncoder. Then, it processes the entire set of encoded peaks with SpectrumEncoder, a series of transformer encoder blocks58, where each block learns relationships between peaks and progressively enriches their representations. Finally, a task-specific PeakDecoder projects the resulting transformer representations according to the training objective. Each of the modules is described in detail below.

PeakEncoder

We represent each raw mass spectrum as a matrix \({\bf{S}}\in {{\mathbb{R}}}^{2,n+1}\), constructed as

$${\bf{S}}=\left[\begin{array}{ccccc}{m}_{0}&{m}_{1}&{m}_{2}&\ldots \,&{m}_{n}\\ 1.1&{i}_{1}&{i}_{2}&\ldots \,&{i}_{n}\end{array}\right],$$

(2)

where each column, indexed by j {1, …, n}, corresponds to one of the n (we set n = 60 for pre-training and n = 100 for fine-tuning) spectral peaks and is represented as the continuous vector \({[{m}_{j},{i}_{j}]}^{\top }\in {\mathbb{R}}\times [0,1]\), denoting the pair of m/z and relative intensity values (m/z denoted by m and intensity denoted by i). Additionally, we prepend a precursor m/z m0 and assign it an artificial intensity of 1.1. We term this additional peak the precursor token and utilize it as a master node57 for aggregating spectrum-level information. If a spectrum has more than n peaks, we select the n most intense ones; if it has fewer than n peaks, we pad the matrix S with zeros.

Rather than treating each m/z ratio as a single continuous value, we process it using a mass-tolerant modification of Fourier features \(\Phi :{\mathbb{R}}\to {[-1,1]}^{2B}\) (ref. 59), dependent on B predefined frequencies \({\bf{b}}\in {{\mathbb{R}}}^{B}\). Specifically, the features are constructed with sine and cosine functions

$$\Phi {(m)}_{2i-1}=\sin (2\uppi {b}_{i}m),\qquad \Phi {(m)}_{2i}=\cos (2\uppi {b}_{i}m),$$

(3)

where each frequency bi is uniquely associated with either a low frequency capturing the integer part of a mass \(m\in {\mathbb{R}}\) or a high frequency capturing its decimal part, forming together a vector of frequencies

$${\mathbf{b}} = \left[\underbrace{\frac{1}{m_{\rm{max}}}, \frac{1}{m_{\rm{max}} - 1}, \dots}_{{\rm{Low}}\,{\rm{frequencies}}}, \frac{1}{1}, \underbrace{\frac{1}{k m_{\rm{min}}}, \frac{1}{(k - 1) m_{\rm{min}}}, \dots, \frac{1}{ m_{\rm{min}}}}_{{\rm{High}}\,{\rm{frequencies}}} \right]^{\top} \in {\mathbb{R}}^B.$$

(4)

Here constants \({m}_{\min }\in (0,1)\) and \({m}_{\max }\in (1,\infty )\) represent the minimum decimal mass of interest (that is, the absolute instrument accuracy) and the maximum integer mass of interest, and \(k\in {\mathbb{N}}\) is such that kmmin is the closest value to 1. For instance, when training DreaMS on GeMS-A spectra, we set mmin = 10−4 and mmax = 1,000 according to the construction of GeMS datasets. This schema yields 1,000 low frequencies and 5,000 high frequencies (that is, the overall dimensionality of the vector b is 6,000).

Furthermore, we process the Fourier features given by equation (3) with a feed-forward neural network \({{\rm{FFN}}}_{F}:{{\mathbb{R}}}^{2B}\to {{\mathbb{R}}}^{{d}_{m}}\) (dm = 980 in our final model). We hypothesize that the sensitivity of Fourier features to both large and small differences in masses allows FFNF to learn the space of plausible molecular masses given by elemental compositions. Our instantiation of frequencies outperforms both random initialization59 and the log-spaced sinusoidal variant proposed for proteomics14,81 (Extended Data Fig. 3; ref. 79). Notably, because peaks do not form a sequence of tokens but, rather, a set, we do not encode their positions, in contrast with classic positional encoding58.

The concatenation of the output of FFNF with the output of another shallow feed-forward network \({{\rm{FFN}}}_{P}:{{\mathbb{R}}}^{2}\to {{\mathbb{R}}}^{{d}_{p}}\) (dp = 44 in our final model) applied to raw m/z and intensity values forms the complete PeakEncoder: \({{\mathbb{R}}}^{2}\to {{\mathbb{R}}}^{{d}_{m}+{d}_{p}}\):

$${\rm{PeakEncoder}}(m,i)={{\rm{FFN}}}_{F}(\Phi (m))\,\parallel \,{{\rm{FFN}}}_{P}(m,i),$$

(5)

where ∥ denotes concatenation. Column-wise application of PeakEncoder to the matrix S yields a high-dimensional representation of the corresponding spectrum \({{\bf{S}}}_{0}\in {{\mathbb{R}}}^{d,n}\), where d = dm + dp (d = 1,024 in our final model) is the dimensionality of the representation, and n is the number of peaks.
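To make the construction concrete, a PyTorch sketch of the Fourier-feature encoding and PeakEncoder is given below. The feed-forward depths, the absence of frequency subsampling and the module names are assumptions for illustration; with m_min = 10−4 this construction yields 10,000 high frequencies, so the 5,000 reported above suggests an additional subsampling step not reproduced here.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Mass-tolerant Fourier features Φ(m) with fixed frequencies (equations (3)-(4))."""
    def __init__(self, m_min=1e-4, m_max=1000.0):
        super().__init__()
        low = 1.0 / torch.arange(m_max, 0.0, -1.0)                 # 1/m_max, ..., 1/1
        k = round(1.0 / m_min)                                     # k such that k * m_min ≈ 1
        high = 1.0 / (torch.arange(float(k), 0.0, -1.0) * m_min)   # 1/(k m_min), ..., 1/m_min
        self.register_buffer("b", torch.cat([low, high]))          # B frequencies

    def forward(self, m):                                          # m: (..., 1)
        arg = 2 * torch.pi * self.b * m
        return torch.cat([torch.sin(arg), torch.cos(arg)], dim=-1)  # (..., 2B)

class PeakEncoder(nn.Module):
    """PeakEncoder(m, i) = FFN_F(Φ(m)) || FFN_P(m, i), equation (5); FFN depths are illustrative."""
    def __init__(self, d_m=980, d_p=44):
        super().__init__()
        self.phi = FourierFeatures()
        two_b = 2 * self.phi.b.numel()
        self.ffn_f = nn.Sequential(nn.Linear(two_b, d_m), nn.ReLU(), nn.Linear(d_m, d_m))
        self.ffn_p = nn.Sequential(nn.Linear(2, d_p), nn.ReLU(), nn.Linear(d_p, d_p))

    def forward(self, peaks):          # peaks: (batch, n_peaks, 2) columns of S as (m/z, intensity)
        m = peaks[..., :1]
        return torch.cat([self.ffn_f(self.phi(m)), self.ffn_p(peaks)], dim=-1)  # (..., d_m + d_p)
```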

SpectrumEncoder

Given the output of PeakEncoder, SpectrumEncoder: \({{\mathbb{R}}}^{d,n}\to {{\mathbb{R}}}^{d,n}\) updates the representations of peaks by exchanging information between individual peaks via the self-attention mechanism. This is achieved through a sequence of l (l = 7 in our final model) transformer encoder layers (that is, BERT43), alternating multi-head self-attention blocks with peak-wise feed-forward networks. Starting from S0, each i-th block gradually updates the representation of the spectrum from Si−1 to Si. Throughout the text, we denote the columns of Sl (that is, representations of individual peaks) as s0, …, sn. We refer to the first columns of such matrices, representing precursor tokens, as DreaMS.

An important property of the transformer encoder is its equivariance to permutations of tokens82. Combined with the position-invariant encoding of peaks through PeakEncoder, this implies that the same two peaks in different spectra will have identical attention scores in the first attention layer, regardless of the total number of peaks or noise signals between these two peaks. Multiple related works in metabolomics and proteomics have employed the transformer architecture to encode mass spectra25,28,32,83,84,85. To further strengthen the inductive bias of the transformer toward the relations between peaks, we explicitly enrich the attention mechanism with all pairwise m/z differences including neutral losses. In each transformer layer, the attention score Aij between the i-th and j-th peaks is computed as:

$${{\bf{A}}}_{ij}=\frac{{{\bf{q}}}_{i}^{\top }{{\bf{k}}}_{j}+\mathop{\sum }\nolimits_{k=1}^{2t}\left(\Phi {({m}_{i})}_{k}-\Phi {({m}_{j})}_{k}\right)}{\sqrt{d}},$$

(6)

where \({{\bf{q}}}_{i}^{\top }{{\bf{k}}}_{j}\) is the standard dot-product attention term, and \(\mathop{\sum }\nolimits_{k=1}^{2t}\left(\Phi {({m}_{i})}_{k}-\Phi {({m}_{j})}_{k}\right)\) is an additional Graphormer-like term61. Element-wise differences in Fourier features enable the transformer to directly attend to precise m/z differences, enhancing its capacity to learn fragmentation patterns and its robustness to shifts in absolute m/z values. This is particularly important, for instance, in scenarios where m/z values are shifted due to the masses of ionization adducts. We use eight attention heads per transformer block in our final model.
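Taken literally, the additive term of equation (6) is a pairwise bias added to the dot-product scores before the softmax, as in the single-head sketch below; the function name is illustrative, and no learned projection of the Fourier differences is applied.

```python
import torch

def attention_scores(q, k, phi_m, d):
    """Literal reading of equation (6): dot-product attention plus a bias given by
    the summed element-wise differences of the peaks' Fourier features.

    q, k: (n_peaks, d_head) queries and keys; phi_m: (n_peaks, 2B) Fourier features Φ(m)."""
    dot = q @ k.T                                   # standard q_i^T k_j term
    feat_sum = phi_m.sum(dim=-1)                    # Σ_k Φ(m_i)_k per peak
    bias = feat_sum[:, None] - feat_sum[None, :]    # Σ_k (Φ(m_i)_k − Φ(m_j)_k)
    return (dot + bias) / d ** 0.5
```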

In contrast to BERT, we use a pre-norm variant of transformer86, remove biases in linear layers and use ReLU activations. We utilize the implementation of transformer provided by Nguyen et al.87.

PeakDecoder

Depending on the training objective, we use linear layers of different shapes (referred to as heads) to refine and project the final hidden representations of peaks given by the SpectrumEncoder.

For both m/z masking and retention order pre-training objectives, we employ simple linear projections followed by suitable activation functions, mapping the representations of peaks into the corresponding domains of predictions:

$${\hat{{\bf{y}}}}_{{\rm{mass}}}={\rm{softmax}}\left({{\bf{W}}}_{{\rm{mass}}}{{\bf{s}}}_{k}\right),\qquad {\hat{y}}_{{\rm{order}}}=\sigma \left({{\bf{W}}}_{{\rm{order}}}\left({{\bf{s}}}_{0}^{(i)}\parallel {{\bf{s}}}_{0}^{(\!j)}\right)\right),$$

(7)

where \({{\bf{s}}}_{k}\in {{\mathbb{R}}}^{d}\) denotes the hidden representation of a masked peak k M from a set of masked indices M {1, …, n}. It is projected by \({{\bf{W}}}_{{\rm{mass}}}\in {{\mathbb{R}}}^{c,d}\) and the Softmax function to obtain the predicted probability vector \({\hat{{\bf{y}}}}_{{\rm{mass}}}\,\in \,{{\mathbb{R}}}^{c}\) with c classes corresponding to the discretized mass bins to be reconstructed. Next, \({\hat{y}}_{{\rm{order}}}\) denotes the predicted probability that a spectrum i precedes the spectrum j in chromatography. The probability is predicted by concatenating two precursor embeddings \({{\bf{s}}}_{0}^{(i)},{{\bf{s}}}_{0}^{(\!j)}\) corresponding to the two spectra and applying the linear projection \({{\bf{W}}}_{{\rm{order}}}\in {{\mathbb{R}}}^{1,2d}\) followed by the sigmoid function σ.

For supervised fine-tuning tasks, we employ two variants of linear heads. The first variant is given by single linear layers operating solely on the precursor token representations \({{\bf{s}}}_{0}\in {{\mathbb{R}}}^{d}\):

$${\hat{{\bf{y}}}}_{{\rm{props}}}={{\bf{W}}}_{{\rm{props}}}{{\bf{s}}}_{0},\qquad {\hat{y}}_{{\rm{F}}}=\sigma ({{\bf{W}}}_{{\rm{F}}}{{\bf{s}}}_{0}),\qquad {\bf{z}}={{\bf{W}}}_{{\rm{emb}}}{{\bf{s}}}_{0},$$

(8)

where \({{\bf{W}}}_{{\rm{props}}}\in {{\mathbb{R}}}^{11,d}\), \({{\bf{W}}}_{{\rm{F}}}\in {{\mathbb{R}}}^{1,d}\) followed by sigmoid σ and \({{\bf{W}}}_{{\rm{emb}}}\in {{\mathbb{R}}}^{d,d}\) yield the predictions of 11 molecular properties \({\hat{{\bf{y}}}}_{{\rm{props}}}\), the probability of fluorine presence \({\hat{y}}_{{\rm{F}}}\) and the spectral embedding z, respectively.

For the task of predicting molecular fingerprints, we find a head with richer representation capacity to slightly improve the performance:

$${\hat{{\bf{y}}}}_{{\rm{fp}}}={{\bf{W}}}_{{\rm{fp1}}}\mathop{\sum }\limits_{i=0}^{n}{\rm{ReLU}}({{\bf{W}}}_{{\rm{fp0}}}{{\bf{s}}}_{i}).$$

(9)

Here the projections \({{\bf{W}}}_{{\rm{fp0}}}\in {{\mathbb{R}}}^{d,d}\) and \({{\bf{W}}}_{{\rm{fp1}}}\in {{\mathbb{R}}}^{4096,d}\) are arranged into the DeepSets-like88 head to output 4,096 fingerprint elements. In this case, the head operates on the hidden representations of all peaks si rather than solely on the precursor peak s0 as in equation (8). The details of pre-training and fine-tuning objectives are discussed in the following sections.
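For reference, equation (9) corresponds to the following DeepSets-like head; the bias-free linear layers follow the plain projection matrices of the equation, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class FingerprintHead(nn.Module):
    """DeepSets-like head of equation (9): per-peak ReLU projection, sum over peaks,
    then a linear map to 4,096 fingerprint elements."""
    def __init__(self, d=1024, n_bits=4096):
        super().__init__()
        self.w_fp0 = nn.Linear(d, d, bias=False)
        self.w_fp1 = nn.Linear(d, n_bits, bias=False)

    def forward(self, peak_states):    # (batch, n_peaks + 1, d) final hidden states s_0, ..., s_n
        return self.w_fp1(torch.relu(self.w_fp0(peak_states)).sum(dim=1))
```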

Self-supervised pre-training

The objective of self-supervised pre-training for DreaMS is defined by minimizing a weighted sum of two losses:

$${{\mathcal{L}}}_{{\rm{DreaMS}}}=0.8{{\mathcal{L}}}_{{\rm{mass}}}+0.2{{\mathcal{L}}}_{{\rm{order}}},$$

(10)

where \({{\mathcal{L}}}_{{\rm{mass}}}\) represents the masked modeling loss, quantifying the error of the model in reconstructing the masses of randomly masked peaks, and \({{\mathcal{L}}}_{{\rm{order}}}\) denotes the retention order prediction error. Each training example within a mini-batch consists of two spectra with indices i, j sampled from the same LC–MS/MS experiment. Here, we further detail the computation of both \({{\mathcal{L}}}_{{\rm{mass}}}\) and \({{\mathcal{L}}}_{{\rm{order}}}\) losses for the example pair i, j.

To compute the \({{\mathcal{L}}}_{{\rm{mass}}}\) loss, we randomly sample a predefined ratio (30%) of peaks M(i), M(j) ⊆ {1, …, n} from both spectra i and j, proportionally to their intensities. Then, we replace the masses of the sampled peaks in the spectra with −1.0, while keeping the intensities unchanged, and utilize the original mass values \({{\bf{m}}}^{(i)}\in {{\mathbb{R}}}^{| {M}^{(i)}| }\) and \({{\bf{m}}}^{(\!j)}\in {{\mathbb{R}}}^{| {M}^{(\!j)}| }\) as the prediction labels. Instead of directly predicting the continuous values m(i), m(j), we categorize them into c equal-width bins ranging from 0 to the maximum m/z of the training dataset (1,000 Da for GeMS-A subsets; Fig. 2b) and train the model to predict the correct bins. This classification approach89, rather than regression, is adopted to better capture the inherent uncertainty of mass reconstruction, as it accounts for the possibility that several masses may be equally plausible for a masked peak. A regression model may converge to predicting the average value, whereas a classification model would learn to assign equal probability to each plausible mass.

Specifically, we convert continuous mass values into degenerate categorical distributions, represented by binary matrices \({{\bf{Y}}}_{\rm{mass}}^{(i)}\in {\{0,1\}}^{| {M}^{(i)}|, c}\) and \({{\bf{Y}}}_{\rm{mass}}^{(\!j)}\in {\{0,1\}}^{| {M}^{(\!j)}|, c}\), where rows correspond to masked peaks and columns correspond to mass bins. The elements of the matrices are ones in bins containing the corresponding masses and zeros elsewhere. In detail, for a masked peak l M(k) in spectrum k {i, j} and bin b {0, …, c − 1}, the corresponding matrix element is

$${y}_{{\rm{mass}},l,b}^{(k)}=\left[{m}_{l}^{(k)}\in \left[b\frac{1,000}{c},(b+1)\frac{1,000}{c}\right)\right],$$

(11)

where [ ] indicates the Iverson bracket, implying [x] = 1 if x is true and 0 otherwise. The terms \(\frac{1,000}{c}\) represent the m/z range (0, 1,000) discretized into c bins. We set c = 20,000 in our final model.

Then, the model is trained to predict a categorical distribution for each of the masked peaks \({\hat{{\bf{Y}}}}_{{\rm{mass}}}^{(i)},{\hat{{\bf{Y}}}}_{{\rm{mass}}}^{(\!j)}\) (equation (7), left), and the reconstruction error is evaluated using the cross-entropy loss in the space of discretized mass values:

$${{\mathcal{L}}}_{{\rm{mass}}}\left({\hat{{\bf{Y}}}}_{{\rm{mass}}}^{(i)},{{\bf{Y}}}_{{\rm{mass}}}^{(i)},{\hat{{\bf{Y}}}}_{{\rm{mass}}}^{(\!j)},{{\bf{Y}}}_{{\rm{mass}}}^{(\!j)}\right)=-\frac{1}{2}\sum _{k\in \{i,\,j\}}\sum _{l\in {M}^{(k)}}{{{\bf{y}}}_{{\rm{mass}},l}^{(k)}}^{\top }\log \left({\hat{{\bf{y}}}}_{{\rm{mass}},l}^{(k)}\right),$$

(12)

where the first sum from the left averages the results across two sampled spectra i and j, and the second sum iterates over all masked peaks M(k) in spectrum k. The dot-product \({{{\bf{y}}}_{{\rm{mass}},l}^{(k)}}^{\top }\log ({\hat{{\bf{y}}}}_{{\rm{mass}},l}^{(k)})\) calculates the cross-entropy between a ground-truth degenerate distribution \({{\bf{y}}}_{{\rm{mass}},l}^{(k)}\), which contains a 1 for the correct mass bin of peak l in spectrum k and 0s elsewhere, and the corresponding predicted distribution over bins \({\hat{{\bf{y}}}}_{{\rm{mass}},l}^{(k)}\). Minimizing \({{\mathcal{L}}}_{\rm{mass}}\) effectively maximizes the likelihood of predicting the correct mass bins, and the loss is minimal when all the bins are predicted correctly.
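In code, the discretization of equation (11) and the cross-entropy of equation (12) reduce to a few lines, as in the sketch below; it assumes the model outputs one logit vector per masked peak and that masked peaks from both spectra are stacked along the first dimension.

```python
import torch
import torch.nn.functional as F

def masked_mass_loss(logits, true_mz, mz_max=1000.0, n_bins=20000):
    """Cross-entropy over discretized mass bins (equations (11)-(12)).

    logits: (n_masked, n_bins) predictions for the masked peaks;
    true_mz: (n_masked,) original m/z values of those peaks."""
    bins = torch.clamp((true_mz / mz_max * n_bins).long(), max=n_bins - 1)  # bin index of each mass
    return F.cross_entropy(logits, bins)
```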

The second component of the DreaMS loss, \({{\mathcal{L}}}_{{\rm{order}}}\), is given by a binary cross-entropy classification loss. The model is trained to predict the retention order of two spectra i and j within the LC–MS/MS experiment by estimating the probability \({\hat{y}}_{{\rm{order}}}\) that spectrum i precedes spectrum j in chromatography (equation (7), right). The actual probability yorder is either 0 or 1:

$${{\mathcal{L}}}_{{\rm{order}}}=-\left(\;{y}_{{\rm{order}}}\log\left({\hat{y}}_{{\rm{order}}}\right)+\left(1-{y}_{{\rm{order}}}\right)\log \left(1-{\hat{y}}_{{\rm{order}}}\right)\right).$$

(13)

We pre-train DreaMS on the GeMS-A10 dataset and retain the 60 highest peaks when forming training batches. Additionally, with a 20% probability, we augment a spectrum by adding a random scalar from (0, 50) to all its m/z values. Such modification forces the neural network to learn relationships between spectral peaks rather than memorizing precise masses, a property important for making the model more robust to, for example, different ionization adducts.

Linear probing of the emergence of molecular structures

Every 2,500 pre-training iterations, we conduct linear probing—a technique enabling us to evaluate the gradual emergence of molecular structures during self-supervision. Specifically, we freeze a model and train a single linear layer \({{\bf{W}}}_{{\rm{probe}}}\in {{\mathbb{R}}}^{166,d}\), followed by a sigmoid activation function (that is, a logistic regression) to predict 166 MACCS fingerprint bits from precursor token embeddings, utilizing a random subsample of 6,000 examples from the Murcko histogram split of NIST20 and MoNA. We employ a binary cross-entropy loss function (equation (13)) for learning individual fingerprint bits. We select MACCS fingerprints as the probing objective because they offer an interpretable description of a molecular structure, allowing each predicted bit to be reconstructed back to a molecular substructure.

We report the average validation recall in predicted bits as a function of pre-training time (Fig. 3c) to illustrate the model’s progressively improving ability to discover the substructures of ground-truth molecules. For each iteration, we display the highest recall within 100 probing epochs. Notably, although the figure depicts only the increase in recall, this improvement is achieved without any decline in precision. In fact, precision slightly increases from 0.81 to 0.84 within the same evaluation setup.

Related work in supervised learning on mass spectra

The self-supervised pre-training objectives introduced in our work, namely masked m/z prediction (equation (12)) and retention order prediction (equation (13)), are related to the supervised tasks of spectrum simulation, MS/MS intensity prediction and retention time prediction given a molecular structure. However, because DreaMS does not utilize molecular structures as input, there are considerable differences between our approach and the corresponding supervised methods. First, spectrum simulation or intensity prediction methods are typically trained to reconstruct complete MS/MS spectra via dot-product-based regression loss functions20,21,90,91. In contrast, our approach focuses on reconstructing only a fraction of randomly sampled m/z values using a cross-entropy loss. Second, retention order prediction is generally treated as a molecular property prediction, often formulated as a regression task to predict retention times91,92,93. Because we do not assume knowledge of molecular structures, we train DreaMS to classify which spectrum elutes first in chromatography by sampling two spectra from the same LC–MS/MS experiment.

The primary goal of self-supervised pre-training is to obtain embeddings of MS/MS spectra that can be utilized for various downstream applications. Prior works derived embeddings in a supervised manner by employing contrastive learning approaches. For instance, GLEAMS derives embeddings of peptide MS/MS spectra and applies them for downstream clustering15. Goldman et al.32 investigated the MS/MS spectra embeddings derived by the supervised contrastive MIST model in the context of chemical classes of small molecules.

Transfer learning to spectrum annotation tasks

In this section, we discuss how we transfer the knowledge obtained by the DreaMS model during the self-supervised pre-training to make predictions in scenarios of practical interest. Specifically, we describe how we fine-tune the pre-trained model end to end (that is, fine-tuning all parameters in the PeakEncoder and SpectrumEncoder layers) for various downstream mass spectrometry tasks, using task-specific heads instead of PeakDecoder.

Spectral similarity

The cosine similarity on unsupervised DreaMS embeddings exhibits a strong correlation with Tanimoto similarity (Fig. 4a). However, we observe that it lacks sensitivity to small structural differences among molecules with nearly identical masses (Extended Data Fig. 5b). To address this limitation, we refine the embedding space through contrastive fine-tuning. Specifically, we utilize triplet margin loss function94 to disentangle the embeddings of spectra that share similar molecular masses:

$${{\mathcal{L}}}_{{\rm{emb}}}({\bf{z}},{{\bf{z}}}^{+},{{\bf{z}}}^{-})=\max \{\cos ({\bf{z}},{{\bf{z}}}^{-})-\cos ({\bf{z}},{{\bf{z}}}^{+})+{\it{\Delta}}, 0\},$$

(14)

where \({\bf{z}}\in {{\mathbb{R}}}^{d}\) denotes the embedding of a randomly sampled reference spectrum; \({{\bf{z}}}^{+},{{\bf{z}}}^{-}\in {{\mathbb{R}}}^{d}\) are the embeddings of positive and negative examples, respectively; and Δ > 0 (Δ = 0.1 in our final setting) is the contrastive margin. The positive example is defined as a spectrum of the same molecule as the reference spectrum (having the same 14-character prefix in the InChI key), whereas the negative example is given by a spectrum corresponding to a different molecule but with a similar molecular mass (at most 0.05-Da difference). The \({{\mathcal{L}}}_{\rm{emb}}\) loss function optimizes the embedding space so that the reference spectra are closer to the positive examples than to the negative ones. The contrastive margin Δ, intuitively, measures the minimum required gap between the corresponding positive and negative similarities. The proximity between two embeddings a and b is measured by cosine similarity:

$$\cos ({\bf{a}},{\bf{b}})=\frac{{{\bf{a}}}^{\top }{\bf{b}}}{\max \{\parallel {\bf{a}}\parallel \parallel {\bf{b}}\parallel, \epsilon \}},$$

(15)

where ϵ, set to 10−8, is a constant for numerical stability.
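A PyTorch sketch of the contrastive objective given by equations (14) and (15) is shown below; the batch-mean reduction and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(z, z_pos, z_neg, margin=0.1):
    """Cosine triplet loss of equation (14).

    z, z_pos, z_neg: (batch, d) embeddings of reference, positive and negative spectra."""
    sim_pos = F.cosine_similarity(z, z_pos, dim=-1, eps=1e-8)   # cos(z, z+), equation (15)
    sim_neg = F.cosine_similarity(z, z_neg, dim=-1, eps=1e-8)   # cos(z, z-)
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()
```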

The aim of the fine-tuning is to adjust the embedding space with minimal supervision while retaining the knowledge acquired during self-supervised pre-training and avoiding biases stemming from the scarcity of spectral libraries. Therefore, we conduct contrastive training on a refined subset of the MoNA histogram-disjoint split containing approximately 25,000 spectra corresponding to 5,500 unique InChI connectivity blocks and do not use any spectra from NIST20 for training. To form the subset, we retain only the spectra satisfying the A quality conditions (as shown in Fig. 2b), having [M+H]+ adducts and 60-eV collision energy. To simulate the performance evaluation on a new spectral library, we evaluate the cosine similarity in the refined embedding space on the high-quality subset of NIST20 satisfying the A filtering conditions. We additionally exclude from the validation all NIST20 examples whose InChI key connectivity blocks are present in MoNA. We consider three molecular similarity tasks: estimating the Tanimoto similarity between Morgan fingerprints of the underlying molecules, determining the spectra corresponding to the same molecules within a pool of candidate spectra with similar precursor masses and searching for structural analogs of the precursor molecule of a query spectrum.

Specifically, in the case of the Tanimoto similarity approximation problem, we measure Pearson correlation between DreaMS cosine similarities and Tanimoto similarities on binary Morgan fingerprints (number of bits = 4,096, radius = 2) using approximately 82,000 pairs of spectra sampled from NIST20 so that they maximize the entropy of the distribution of ground-truth similarities. We benchmark our method against the official implementation (https://github.com/matchms/ms2deepscore) of the state-of-the-art MS2DeepScore model13 (as depicted in Fig. 4a).

For the second task of retrieving mass spectra corresponding to the same molecule, we measure the area under the receiver operating characteristic curve (AUROC), which evaluates the classification performance under different similarity thresholds. We sample approximately 750,000 binary classification examples from NIST20 in a way that makes positive class examples correspond to pairs of spectra having the same underlying molecular structure (as measured by the same 14-character prefix in the InChI key) and negative class examples correspond to pairs of spectra having similar precursor masses (with, at most, 10-ppm precursor m/z difference). We benchmark our method against spectral entropy, the state-of-the-art method, as well as 43 other baseline approaches9 (as illustrated in Fig. 4b). We use the implementation of all the methods from the official spectral entropy GitHub repository (https://github.com/YuanyueLi/SpectralEntropy).

For the third task of analog search, we utilize the same dataset as used for the Tanimoto similarity benchmark but measure a different metric. Specifically, we base this metric on the maximum common edge subgraph (MCES)66, which expresses the edit distance on molecular graphs. This metric approximates the number of structural changes, thereby allowing us to formalize the notion of structural analogs as molecules with an MCES distance of less than or equal to a certain MCES threshold k. We compare the analog search performance of DreaMS, DreaMS (zero-shot), MS2DeepScore, modified cosine similarity and spectral entropy using the AUROC, with true labels defined as pairs of spectra whose molecules have an MCES distance of less than or equal to k {0, 1, …, 7} (Fig. 4c and Extended Data Fig. 5b).

For the visualization of fine-tuned embeddings (Figs. 4d and 5 and Extended Data Fig. 7), we utilize the UMAP algorithm67, with cosine similarity set as the metric. Figure 4d and Extended Data Fig. 7 display 100,000 random embeddings of NIST20 spectra, with all precursor InChI keys disjoint from the precursors of the MoNA subset used for the spectral similarity fine-tuning. Level set plots in Extended Data Fig. 7 present 10 levels of various molecular properties when binning the UMAP axes into 200 bins. Sample-average embeddings in Fig. 5 are computed for 2,810 food samples (that is, .mzML files; 6 million spectra in total) from the MSV00008490 dataset (https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=ce3254fe529d43f48077d7ad55b7da09), which have textual food descriptions assigned in the metadata table within the dataset repository. We assigned color categories to individual data points using the ChatGPT-4 model76. Specifically, we provided the model with a list of all unique sample descriptions and queried the model with the following prompt: ‘Please summarize all the individual textual descriptions into the minimal number of categories, including a ‘Miscellaneous’ category’. We then iteratively asked the model to further reduce the number of categories by merging related ones (for example, combining ‘Cheese’ and ‘Dairy products’ into ‘Cheese & Dairy Products’). After obtaining 13 categories, we excluded all 948 samples forming the ‘Miscellaneous’ category, such as those containing ‘supplement’ or ‘extract’ in their descriptions.

Molecular fingerprint prediction

The next problem that we tackle with DreaMS is the prediction of molecular fingerprints. We adapt our model via supervised fine-tuning and validate it on the MIST CANOPUS benchmark32 to evaluate the performance against the state-of-the-art deep learning model MIST.

In detail, we fine-tune DreaMS to directly predict molecular fingerprints via the cosine similarity loss function \({{\mathcal{L}}}_{{\rm{fp}}}\) between true yfp and predicted \({\hat{{\bf{y}}}}_{{\rm{fp}}}\) fingerprints:

$${{\mathcal{L}}}_{{\rm{fp}}}\left({\hat{{\bf{y}}}}_{{\rm{fp}}},{{\bf{y}}}_{{\rm{fp}}}\right)=1-\cos \left({\hat{{\bf{y}}}}_{{\rm{fp}}},{{\bf{y}}}_{{\rm{fp}}}\right),$$

(16)

where cos is the cosine similarity given by equation (15), discussed previously in the context of comparing embeddings of spectra.

For the fine-tuning and evaluation, we use the CANOPUS benchmark from the official GitHub repository (https://github.com/samgoldman97/mist). Specifically, we use the MIST codebase to generate fingerprints and the candidate pools of molecules for the evaluation. Each pool corresponds to a single spectrum along with positive and negative candidate molecules mined from PubChem. The positive candidates correspond to molecules in PubChem that have the same 14-character prefix in the InChI key as the true underlying molecule, including the true molecule itself. The negative candidates are given by the molecules sharing the same molecular formula. Then, the retrieval performance is evaluated using the accuracy at top k metrics for k ∈ {1, 5, 10, 20, 50, 100, 200}, measuring the fraction of spectra that have at least one positive molecule among the top k candidates ranked by the cosine similarity between the predicted fingerprint and the fingerprints of the candidate molecules (Fig. 4e and Extended Data Table 1).

It is worth noting that the MIST CANOPUS benchmark is limited in size, covering 8,000 spectra from 7,000 unique molecules. Given this fact and the recent assessment of MIST’s generalization capabilities after being trained on the MIST CANOPUS dataset95, the fine-tuning of DreaMS to predict molecular fingerprints should be understood as a proof of concept that our approach can achieve competitive performance with MIST without relying on domain-specific components (for example, SIRIUS peak formula annotations), rather than as a production-ready fingerprint predictor.

Molecular property prediction

Next, we fine-tune DreaMS to predict molecular properties. For this, we reproduce the evaluation protocols proposed previously by Voronov et al.25 and Gebhard et al.26.

Specifically, we fine-tune our model to jointly predict r = 11 selected molecular properties from spectra, averaging the squared error for each of the properties:

$${{\mathcal{L}}}_{{\rm{props}}}({\hat{{\bf{y}}}}_{{\rm{props}}},{{\bf{y}}}_{{\rm{props}}})=\frac{1}{r}\parallel {\hat{{\bf{y}}}}_{{\rm{props}}}-{{\bf{y}}}_{{\rm{props}}}{\parallel }^{2},$$

(17)

where \({{\bf{y}}}_{{\rm{props}}}\in {{\mathbb{R}}}^{r}\) denotes the vector containing ground-truth molecular properties, such as quantitative estimation of drug-likeness (QED), synthetic accessibility and Bertz complexity (see Fig. 4 for the complete list). Because different properties have different scales and are measured in different units, we normalize them before feeding them to the loss function. In particular, we map each property to the [0, 1] interval via min-max scaling based on the statistics from the training data.

For the training, validation and testing, we use the MoNA and NIST20 dataset splits prepared using our Murcko histograms algorithm. First, inspired by Gebhard et al.26, we evaluate the performance of DreaMS on predicting molecular complexity from mass spectra. In detail, we estimate the capability of DreaMS to predict the Bertz complexity of a molecule from its mass spectrum by measuring its relative prediction error under different minimum true complexity thresholds of interest. The relative prediction error is defined as \(| \,{y}_{{\rm{Bertz}}}-{\hat{y}}_{{\rm{Bertz}}}| /{y}_{{\rm{Bertz}}}\) and measures the performance of predicting complexity \({\hat{y}}_{{\rm{Bertz}}}\) robustly under varying absolute values of the true complexity yBertz26. We compare our method against XGBoost26,96 trained on 10,000-dimensional binned spectra with 0.1-Da bin size and the state-of-the-art spectral property predictor MS2Prop25, reimplemented and retrained to predict Bertz complexity among other properties (Fig. 4f). We also evaluate our method and XGBoost on predicting ten other properties addressed by MS2Prop (Fig. 4g). Our reimplementation of MS2Prop uses the hyperparameters described in the original publication25 and the same values as DreaMS for the unspecified hyperparameters (such as batch size and learning rate).

Fluorine detection

We evaluate the performance of DreaMS on detecting fluorinated molecules from mass spectra.

Our fluorine detector is fine-tuned using a binary cross-entropy loss function \({{\mathcal{L}}}_{{\rm{F}}}\) with additional focal loss terms97 accounting for class imbalance. For each training example, the loss is computed as:

$${{\mathcal{L}}}_{{\rm{F}}}\left(\,{\hat{y}}_{{\rm{F}}},{y}_{{\rm{F}}}\right)=-{\alpha }_{\rm{F}}{\left(1-{p}_{{\rm{F}}}\right)}^{\gamma }\log {p}_{{\rm{F}}},$$

(18)

where \({\hat{y}}_{{\rm{F}}}\) is the predicted fluorine presence probability, and yF is the 0 or 1 label, depending on the ground-truth presence of fluorine. Next, pF is the probability assigned to the true class (so that −log pF is the standard binary cross-entropy term), and αF and γ are the focal loss terms:

$${p}_{\rm{F}}=\left\{\begin{array}{ll}{\hat{y}}_{{\rm{F}}},\quad &{\rm{if}}\,{y}_{{\rm{F}}}=1\\ 1-{\hat{y}}_{{\rm{F}}},\quad &{\rm{otherwise}},\end{array}\right.\qquad {\alpha }_{{\rm{F}}}=\left\{\begin{array}{ll}\alpha, \quad &{\rm{if}}\,{y}_{{\rm{F}}}=1\\ 1-\alpha, \quad &{\rm{otherwise}},\end{array}\right.$$

(19)

where α = 0.8 increases the loss for the underrepresented fluorine-containing examples and decreases it otherwise (the training data contain approximately 80% of examples without fluorine); γ = 0.5 down-weights well-classified examples so that the loss prioritizes misclassified ones.
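A sketch of the focal binary cross-entropy of equations (18) and (19) is shown below, assuming tensors of predicted probabilities and float 0/1 labels; the function name is illustrative.

```python
import torch

def fluorine_focal_loss(y_hat, y, alpha=0.8, gamma=0.5):
    """Focal binary cross-entropy of equations (18)-(19) for fluorine detection."""
    p = y * y_hat + (1 - y) * (1 - y_hat)     # probability assigned to the true class
    a = y * alpha + (1 - y) * (1 - alpha)     # class-balancing weight alpha_F
    return (-a * (1 - p) ** gamma * torch.log(p)).mean()
```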

We fine-tune DreaMS on the spectra from MoNA and NIST20 using the Murcko histograms algorithm for training/validation splitting. Subsequently, we test the performance of the model on our in-house dataset, consisting of 17,052 [M+H]+ Orbitrap mass spectra (3,900 spectra of 1,175 unique fluorinated molecules and 13,152 spectra of 4,055 unique non-fluorinated molecules), by measuring precision and recall under different thresholds (Fig. 4h). The dataset is available under MassIVE accession number MSV000094528 (https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=676a38e2dd574a15905e807d78cf1e57). As a baseline, we use SIRIUS 5.6.3 with the possible adducts set to [M+H]+, the instrument to Orbitrap and the maximum number of fluorine atoms to 5 (the maximum in the dataset)30. With values lower than 5, we observe a substantial drop in recall but no improvement in precision.

We prioritize high precision over recall, as we consider it the most practically important metric when searching for new fluorinated molecules for further wet-lab characterization, given the difficulty of wet-lab experiments. Consequently, we estimate the coverage of mass spectra with confident predictions when the model operates in a high-precision regime with a precision of 90%. Specifically, we set two predicted probability thresholds (0.46 and 0.75) for classifying spectra as containing or not containing fluorine so that the model achieves 90% precision in both cases. Notably, we find that only 5% of spectra have uncertain predictions (with predicted probabilities in the [0.46, 0.75] interval), whereas the rest of the spectra are covered by high-confidence predictions (Fig. 4i).

DreaMS Atlas

In this section, we provide a description of how the DreaMS Atlas is constructed. We start by outlining the process of selecting and annotating nodes for the DreaMS Atlas, followed by the process of connecting the nodes to form a graph structure.

The construction process begins with generating DreaMS embeddings for 76 million spectra comprising the GeMS-C1 subset of the GeMS dataset. This subset represents LSH cluster representatives of 201 million GeMS-C spectra, covering the entire MassIVE GNPS repository. Spectra from blank samples, identified by specific suffixes in their names (for example, ‘blank’, ‘no_inj’, ‘noinj’, ‘empty’, ‘solvent’ or ‘wash’), are excluded. Additionally, we enrich individual nodes with DreaMS molecular property and fluorine presence predictions, along with relevant metadata obtained from the MassIVE repository, such as information about the studied species, the respective study description and the instrument used for spectrum acquisition. Finally, we include embeddings of mass spectra from the MoNA and NIST20 spectral libraries. To avoid redundancy in the spectral libraries with respect to molecular structures, we merge spectra sharing identical canonical SMILES but differing in adduct species from both MoNA and NIST20, resulting in 79,000 merged spectra from 819,000 library entries.

Next, we employ the NN-Descent algorithm77 to compute an approximate 3-NN graph, where nodes represent DreaMS embeddings and edges represent similarities between these embeddings. To further refine the LSH clustering, 3-NN neighborhoods sharing DreaMS similarities above 0.9 are clustered into single nodes, and the k-NN graph is reconstructed for 34 million nodes representing the clusters. More precisely, to cluster the nodes, we iterate over all nodes sorted in descending order by their degrees and run a breadth-first search (BFS) from each node. The BFS stops if either an edge has a DreaMS similarity smaller than 0.9 or the DreaMS similarity between the starting node and the new candidate node is smaller than 0.9. All the nodes aggregated through the BFS are collapsed to a single cluster and are represented by a starting node. This algorithm allows us to cluster the graph in linear time. It is worth noting that, by defining neighborhoods based on similarity thresholds rather than the number of hops, this algorithm adjusts the graph topology, preventing over-representation of certain spectra.
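The BFS-based cluster collapse can be sketched as follows, assuming the k-NN graph is given as an adjacency mapping from each node to its (neighbor, edge similarity) pairs and `emb_sim` returns the DreaMS similarity between two nodes' embeddings; both names are illustrative.

```python
from collections import deque

def collapse_knn_graph(neighbors, emb_sim, threshold=0.9):
    """Collapse k-NN neighborhoods with DreaMS similarity above `threshold` into single nodes.

    neighbors: dict mapping node -> list of (neighbor, edge_similarity);
    emb_sim(u, v): DreaMS similarity between the embeddings of nodes u and v."""
    cluster_of = {}
    # Visit nodes in descending order of degree, as in the Atlas construction.
    for start in sorted(neighbors, key=lambda u: -len(neighbors[u])):
        if start in cluster_of:
            continue
        cluster_of[start] = start
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, edge_sim in neighbors[u]:
                if v in cluster_of:
                    continue
                # Stop expanding when either the edge similarity or the similarity
                # to the starting node drops below the threshold.
                if edge_sim < threshold or emb_sim(start, v) < threshold:
                    continue
                cluster_of[v] = start
                queue.append(v)
    return cluster_of  # maps every node to its cluster representative
```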

This procedure results in the creation of a final 3-NN graph representing the DreaMS Atlas. We utilize the PyNNDescent implementation of NN-Descent by McInnes et al. (https://github.com/lmcinnes/pynndescent), which provides functionalities for managing the vector database of the k-NN graph, such as querying the graph with new DreaMS embeddings not present in the DreaMS Atlas or extending the DreaMS Atlas with new embeddings.

Benchmarking of MS/MS clustering with LSH and DreaMS embeddings

Throughout our work, we employed three techniques to cluster MS/MS spectra: LSH, DreaMS k-NN aggregation and k-NN classification in the principal component analysis (PCA) space of DreaMS embeddings. This section provides details on each of the techniques and an assessment of their clustering quality.

To reduce redundancy in the GeMS dataset and create a more balanced representation of the underlying molecular structures, we employed an LSH clustering technique. The LSH technique works by comparing the hashes of binned spectra and has two key parameters: the m/z bin width and the number of LSH hyperplanes. Extended Data Fig. 1a illustrates the performance of LSH clustering with varying m/z bin widths across different numbers of hyperplanes. The metrics include the mean intra-cluster modified cosine similarity, the mean precursor m/z standard deviation per cluster, the dataset size (expressed as the fraction of clusters relative to the original dataset size) and LSH computation time. These metrics were computed using 94,837 spectra from the MassSpecGym dataset98, specifically the spectra of [M+H]+ ions with at least three signals having relative intensities greater than 10%. As the number of hyperplanes increases, LSH clusters show improved consistency in approximating cosine similarity and maintaining precursor m/z values while retaining more unclustered spectra. Interestingly, the m/z bin width has a minor effect on the resulting clusters but substantially impacts the algorithm’s running time.

For constructing the GeMS subsets, we used LSH clustering with 30 hyperplanes and an m/z bin width of 1.0. The leftmost group of bars in Extended Data Fig. 1b presents the performance of this LSH configuration for clustering precursor molecules. The metrics shown include precision and recall based on the 2D InChI key identities, along with the dataset size after clustering.

By design, LSH is a high-precision, low-recall algorithm. To improve recall for constructing the DreaMS Atlas, we combined multiple LSH clusters into new larger clusters. This was achieved using a BFS aggregation in the k-NN graph of the DreaMS Atlas, applying a specific DreaMS similarity cutoff to form new clusters. As shown in Extended Data Fig. 1b, a similarity cutoff of 0.9 increases recall while maintaining a median precision of 1 (the second group of bars). Although lower cutoffs yield higher recall, they also increase the rate of false positives. To ensure that each node in the DreaMS Atlas represents a single molecular structure, we used the 0.9 cutoff. Extended Data Fig. 1c provides an example of a cluster refined using DreaMS k-NN with a 0.9 cutoff that results in a precision lower than 1.0. The cluster contains 30 spectra, 28 of which correspond to the same structure, whereas the remaining two correspond to different but closely related structures. Nevertheless, all the structures are very alike and exhibit nearly identical, potentially indistinguishable, MS/MS spectra.

The third clustering technique, k-NN classification in the PCA space of DreaMS embeddings, was used to explore the organization of the self-supervised DreaMS embedding space. Our goal was to evaluate the linearity of the embedding space with respect to the underlying molecular structures. To do so, we clustered the embeddings in affine PCA subspaces of varying dimensionality, grouping k nearest neighbors into individual clusters, and evaluated their molecular composition. Extended Data Fig. 1d generalizes Fig. 3e and quantitatively demonstrates the linear organization of self-supervised DreaMS representations with respect to molecular structures, thus highlighting their robustness to mass spectrometry conditions. The figure shows k-NN majority voting accuracy for 2D InChI keys in the DreaMS embedding space of NIST20 spectra. Nearest neighbors were defined using cosine similarity, and the accuracy was estimated using 100,000 random spectra. In the affine PCA subspace of 1,023 dimensions (embedding dimensionality minus 1), accuracy reaches 0.865 for k = 3 and 0.999 for k = 1, showcasing the linearity of the embedding space in grouping spectra from identical compounds. In other words, the embeddings are clustered according to molecular structures, even in the linear subspace of the original embedding space.

Hyperparameters, ablation studies, implementation details and benchmarking

We report the hyperparameters used for pre-training and fine-tuning in Supplementary Tables 3 and 4, respectively. Extended Data Fig. 3 summarizes the key ablation studies for self-supervised pre-training, highlighting the most important features of our method. First, pre-training on the high-quality GeMS-A10 subset of GeMS, containing 24 million MS/MS spectra, results in better performance compared to training on larger but lower-quality subsets, such as GeMS-A1000 or GeMS-B, which include up to 100 million spectra (Extended Data Fig. 3a). Additionally, we found that retaining 60 peaks (as the sequence length for the transformer) instead of 30 leads to improved performance, emphasizing the importance of lower-intensity peaks beyond the most intense ones.

Second, pre-processing m/z values with mass-tolerant Fourier features and subsequently passing them through a high-capacity feed-forward network substantially enhances performance. For instance, tokenizing m/z values with binning decreases performance, similar to using only two feed-forward layers instead of four or employing random Fourier features or sinusoidal embeddings instead of our pre-defined mass-tolerant sine and cosine waves (Extended Data Fig. 3b). Other important components of the architecture include relatively large 1,024-dimensional embeddings and m/z shift augmentations.

Third, framing the masked peak prediction task as classification, rather than regression, has a critical impact on the quality of the resulting embeddings (Extended Data Fig. 3c). Although other design choices, such as the fraction of masked peaks, predicting retention orders or masking peaks based on their intensities, contribute to performance improvements, they are not as impactful as using the cross-entropy classification loss function.

For both pre-training and fine-tuning, we used the Adam optimizer with default parameters99. All models were trained using either four AMD MI250X GPUs or eight NVIDIA A100 GPUs in a distributed data parallel (DDP) mode. The final DreaMS model was pre-trained for 48 h, and its fine-tuning runtime never exceeded several hours. With 8 NVIDIA A100 GPUs, the generation speed of embeddings (that is, forward pass through the trained model) averages approximately 1.2 ± 0.002 million embeddings per hour, where the standard deviation is calculated based on 12 chunks comprising 79 million mass spectra from GeMS-C1.

We used matchms100 and pyOpenMS101 Python libraries for processing mass spectra. All neural networks were implemented in PyTorch102 and trained using PyTorch Lightning103.

We plan to continuously report the performance of more advanced fine-tuning protocols developed in future work within the MassSpecGym benchmarking environment98.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
