Role of metabolism in shaping enzyme structures over 400M years

4 months ago 4

Main

Enzymes evolve as part of the metabolic network, a large, interconnected system that possesses a topology dependent on evolution and the chemical properties of its metabolites^3,4,5. Because of the central role of metabolism, enzymes are important drug targets, biomarkers and a focus of bioengineering^6,7,8. Despite its critical role across disciplines, our understanding of the global biochemical constraints that shape enzyme function—and therefore its evolution—remains incomplete.

Comparing enzyme sequences across evolution has revealed various constraints that act at the amino acid level, such as the chemical identity of side chains and epistatic interactions^9,10,11. Moreover, the sequence of enzymes is shaped by the costs of enzyme production¹². Indeed, metabolic cost optimization is observed at the species and molecular levels. For example, under many conditions, cells prefer cost-effective fermentation over oxidative metabolism, despite the latter producing stoichiometrically higher ATP amounts¹³. Furthermore, especially high-abundance enzymes have evolved by incorporating less energetically costly amino acids^14,15,16.

We hypothesized that the systematic accessibility of protein structures enabled by structural prediction¹ would allow us to integrate structural biology with evolutionary genomics and expand our understanding of the relationship between metabolism and protein evolution. We leveraged the extensive characterization of the Saccharomycotina subphylum, which represents 400 million years of evolution and includes Saccharomyces cerevisiae and Candida albicans^2,17,18,19, and examined 11,269 AlphaFold2-predicted and experimentally determined enzyme structures that belong to 424 orthologue groups (orthogroups) associated with 361 metabolic reactions in 224 metabolic pathways. Linking these structures with phenotypic data, enzyme properties and metabolic network reconstructions, we identified structural changes that are associated with metabolic constraints. We report how structural evolution depends on metabolic properties across species, pathways and molecular levels.

Mapping evolution in enzyme structures

We selected 26 out of 332 highly phylogenetically diverse yeast species of the Saccharomycotina subphylum^19,20 (Supplementary Note 1) and included the model fission yeast Schizosaccharomyces pombe as an outgroup to root protein trees (Fig. 1a and Supplementary Table 1.1). Then, for enzymes present in the YeastPathways database, we initially assigned orthologues on the basis of sequence-based clusters¹⁹. From these sequences, we obtained 1,301 structures from AlphaFoldDB²¹ and a further n = 9,968 structures that we predicted using AlphaFold v.2.0.1 at the start of our project (Supplementary Note 2). Our final dataset consisted of 11,269 enzyme structures organized in 424 orthogroups.

Prediction quality was assessed using the predicted local distance difference test (pLDDT) score^22,23, which revealed that our dataset included well-structured proteins (mean pLDDT = 90.4, mean coefficient of variation (c.v.) = 0.15). Compared with the overall structures, the terminal regions were predicted with a lower quality (first 10% of the sequence: mean pLDDT = 79.1, mean c.v. = 0.29; central 80%: mean pLDDT = 92.2, mean c.v. = 0.13; last 10%: mean pLDDT = 87.3, mean c.v. = 0.18) (Extended Data Fig. 1a). Then, we validated and refined orthogroup assignments using hierarchical clustering on the basis of a bidirectional template modelling score²⁴ (Extended Data Fig. 1b,c). On the basis of a linkage cut-off of 0.2, 29 sequence-based orthogroups were split into distinct structural orthogroups (Supplementary Table 1.5), which improved the average template modelling score (from 0.71 to 0.77).

To benefit from the extensive characterization of S. cerevisiae, we calculated pairwise alignments for each orthogroup to S. cerevisiae enzyme structures using the matchmaker algorithm of UCSF Chimera²⁵. To link these structures to metabolic constraints (Fig. 1b), we calculated averaged mapping ratios (MRs) and conservation ratios (CRs). The MR quantifies the percentage of amino acids that are 1:1 mappable to a S. cerevisiae enzyme structure, whereas the CR quantifies the percentage of mapped residues identical to those of the reference structure. The CR was tightly correlated with a CR based on the amino acid types (Extended Data Fig. 1d), thus capturing physicochemical properties. In agreement with this, a strong inverse relationship to the change of the octanol–water partition coefficient is observed (Extended Data Fig. 1e; P < 1 × 10⁻³¹, Spearman’s ρ = −0.49).

We illustrate the MR and CR for the orthogroup of 5-formyltetrahydrofolate cyclo-ligase (27 structures, CR = 0.40, MR = 0.87; Fig. 1c and Extended Data Fig. 1f; Fau1p in S. cerevisiae). The core structures of the enzymes map well and the secondary structural elements (helical or extended) revealed a mean MR of 95.4%, whereas the regions without secondary structures (random coils), which have a higher conformational flexibility, have a mean MR of 77.3%. The median MR for all orthogroups was 87.4% (interquartile range (IQR) = 78.3–93.9%) (Fig. 1d), with missing mapping mainly in low-pLDDT scoring regions (Extended Data Fig. 1g), that is, the carboxy and amino termini (Extended Data Fig. 1a) and random coil regions (60% of the unmapped and 36% of the mapped regions). Although the MR generally correlated with the CR (Spearman’s ρ = 0.55, adjusted P (P_adj.) < 1 × 10⁻⁴¹) (Fig. 1d), both the MR and the CR reflect different properties of structural divergence. As the larger structural rearrangements reflected by the MR were less frequent in our orthogroups, we focused on the CR, for which we observed a high degree of diversity (median CR = 62.9%, IQR = 53.6%, 71.2%, total range = 24.5%, 89.2%) that could be linked to metabolic evolution. Here we refer to sequence divergence in structurally mapped regions as divergence (low CR) and sequence similarity in structurally mapped regions as conservation (high CR).

Biochemical constraints

The impact of metabolic specialization

We asked whether metabolic specializations at the species level are reflected in the protein structures and linked divergence to the growth properties of yeast in 21 different carbon sources^17,18,19,26 (Extended Data Fig. 2a). Enzymes of species able to ferment glucose, raffinose, galactose and sucrose exhibited the smallest P values for differences in average CR between subgroups, alongside enzymes from species that grew aerobically on d-xylose (P_adj.< 1 × 10⁻⁸³, two-sided Wilcoxon signed-rank test). Enzymes from anaerobically fermenting species had a higher conservation relative to the structure from S. cerevisiae, which also ferments (Fig. 2a and Extended Data Fig. 2b). Although this finding corresponds to their closer phylogenetic relationship, some of the largest differences in CR were detected in the orthogroups of enzymes involved in central carbon metabolism and the electron transport chain (ETC), for example, Kgd2p (tricarboxylic acid (TCA) cycle) and Cox7p (respiratory chain). These more divergent orthogroups also included Met10p (methionine and sulfur cycle), Ath1p (trehalose metabolism) and Erg1p (ergosterol biosynthesis). We also observed cases in which the CR in non-fermenting species was higher than in the fermenting species. Again, the enzymes were directly related to oxidative metabolism, including Ndi1p, Ald5p, Idp1p and Ilv6p. Moreover, gene ontology (GO)-slim terms ‘membrane’, ‘lipid metabolism’, ‘endoplasmic reticulum’ and ‘endomembrane system’ were enriched in the first quartile of orthogroups with the largest differences in CR between subgroups for glucose fermentation (P_adj.< 1 × 10⁻², Fisher’s exact test) (Extended Data Fig. 2c).

To study an example of metabolic specialization, we focused on the xylose use pathway. Of the 26 species examined, 12 can grow on d-xylose, 8 cannot and 6 have a conditional phenotype (Extended Data Fig. 2a). Several enzymes required for xylose use, such as transketolase, enzymes in the thiamine biosynthetic pathway and the ETC, were among those with the highest change in CR (Fig. 2b). Notably, the CR measured in relation to the two S. cerevisiae acetyl-CoA synthase paralogues, the aerobic (Acs1p) or the anaerobic (Acs2p), behaved differently depending on the capacity to use xylose (Fig. 2b, Extended Data Fig. 2d,e and Supplementary Note 3.1), indicating specialization for Acs1p. Thus, species that specialize metabolically show different patterns of divergence in enzymes that are related to the relevant metabolic traits.

The impact of pathway membership

Next, we projected diversity in the orthogroups onto a genome-scale reconstruction of the metabolic network (Fig. 2c). Then, we performed a pathway enrichment analysis on the 25% most divergent and conserved enzymes (Extended Data Fig. 3a,b). The most conserved enzyme structures belonged to pathways for purine biosynthesis, specific amino acid biosynthesis as well as central metabolism (P_adj. < 0.05, Fisher’s exact test; Extended Data Fig. 3a,b). The same pathways also showed early enrichment in the receiver operating characteristic curve and high area under the curve (AUC) values (Fig. 2d and Extended Data Fig. 3c). Consistently, these orthogroups were enriched in the GO-slim terms ‘generation of precursor metabolites and energy’ and ‘nucleobase-containing small molecule metabolic process’. The most divergent enzymes tended to be more broadly distributed across the metabolic network, with the exception that enzymes belonging to ‘lipid metabolic process’ were enriched. For instance, Oar1p (CR = 0.384), Tes1p (CR = 0.391) and Eci1p (CR = 0.441) for fatty acid oxidation and Tsc10p (CR = 0.411) for sphingolipid synthesis were within the top-5%-quantile of divergence. We speculate that lipidomes of cells have a greater flexibility to adapt to the metabolic environment. Furthermore, the UDP-N-acetylglucosamine transferases Alg13p and Alg14p, related to N-linked glycosylation, were also among the most divergent orthogroups.

The impact of molecular function

Next, we annotated the biochemical activities by extracting enzyme classification (EC) identifiers from UniProtDB. Of 468 matching EC entries, 238 (51%) were supported by a direct PubMed Evidence code. Our enzyme structures with functional annotation covered 44% (361 out of 817) of all EC classifiers, distributed across 224 metabolic pathways (S. cerevisiae annotation; Supplementary Table 1.4) and encompassed all major enzyme classes, including 119 oxidoreductases, 191 transferases, 55 hydrolases, 49 lyases, 21 isomerases and 29 ligases. The dataset also contains two translocases, but, because of their low number, we refrained from making general conclusions for this EC class.

We noted that in each orthogroup, most enzymes seem to catalyse the same or highly similar reactions, and observed no changes in EC function at levels 1, 2 or 3. A change in the EC level 4 classification was detected in only five orthogroups. For example, in OG1390, containing Hsu1p, Str2p and YML082p in S. cerevisiae, a slight change in the catalysed reaction occurs^27,28, which might be facilitated by changes in a domain close to the binding site (Supplementary Note 3.2).

We detected a clear relationship between enzyme class and diversity. Oxidoreductases are enriched for high CR, and hydrolases for low CR (Fig. 2e and Extended Data Fig. 3b). The high conservation of oxidoreductases was explained by their prominent role in glycolysis, gluconeogenesis and the TCA cycle. After excluding these central metabolic pathways, oxidoreductases were not more conserved than other enzymes (Extended Data Fig. 3d). Conversely, the increased diversity of hydrolases was not explained by their role in specific metabolic pathways (Extended Data Fig. 3e).

Furthermore, we report a role for non-catalytic protein–small molecule interactions. First, metal-binding enzymes are more conserved than non-metal-dependent enzymes (two-sided Wilcoxon–Mann–Whitney U-test, P < 1 × 10⁻⁴, 7.6% decrease in median conservation, Cliff’s Δ = 0.23; Fig. 2f). Furthermore, we observed that enzymes with a higher number of intracellular inhibitors²⁹ are more conserved (Kendall τ = 0.22, P_adj. < 1 × 10⁻⁴; Extended Data Fig. 3f). For instance, Gnd2p, a central enzyme of the pentose phosphate pathway, is inhibited by at least 45 cellular metabolites²⁹, and its orthogroup was one of the least divergent. We thus concluded that characteristics of enzymes, such as pathway membership, dependency on metal ions and the number of small molecule interactions, constrain divergence.

Abundance and flux diversity constraints

Protein abundance is important for sequence conservation^30,31. Given that enzyme expression is contingent on the specific activity and flux of the enzyme, we hypothesized that abundance could be a mechanism through which metabolism influences structural evolution. We obtained nine of the examined species from the UK National Collection of Yeast Cultures²⁰ (Fig. 3a) and used proteomics to estimate their protein abundance (Fig. 3a, Extended Data Fig. 4a,b and Supplementary Methods). We found that low-abundance enzymes were more diverse than high-abundance enzymes (Fig. 3b; Spearman’s ρ = 0.48, P_adj. <1 × 10⁻²⁷). This relationship was dependent on the enzyme class; abundance and diversity were highly interdependent for isomerases, but not for hydrolases (Fig. 3c).

Next, we estimated metabolic flux through each pathway on the basis of genome-scale metabolic models for each of the 26 species¹⁸. We benchmarked the fluxes to ¹³C flux measurements for central metabolites, which are captured by both approaches, and obtained good agreement (Pearson’s r = 0.8 for median (¹³C flux); Extended Data Fig. 4c). Notably, we revealed only a weak correlation between flux and CR (Fig. 3d; Spearman’s ρ = 0.19, P_adj.< 1 × 10⁻⁴). Instead, a stronger signal was detected for flux variability (Spearman’s ρ = −0.27, P_adj. < 1 × 10⁻⁸). Consistently, orthogroups with low CR exhibited a wide range of fluxes (Fig. 3e and Extended Data Fig. 4d). There were exceptions to this trend, such as in the orthogroup containing the sphingosine kinase Ysr3p, indicating highly variable flux and high diversity (low CR), despite high abundance. Moreover, we also detected a dependence on the enzyme class. Although divergence was strongly linked to the flux carried by oxidoreductases, for which all tested measures (median flux, variability of the flux and species in which flux was present for the orthogroup) correlated with conservation, other enzyme classes, especially hydrolases (Fig. 3d; P_adj. > 0.7 for all three measures), lacked these relationships. We also estimated enzyme processivity (k_cat) for each protein sequence in a species-specific manner³², and found a weak relationship between conservation and k_cat. However, we report that the log-transformed variability (standard deviation) of k_cat is higher in orthogroups that are diverse (Extended Data Fig. 4e; Spearman’s ρ = −0.27, P_adj. < 1 × 10⁻⁸). Thus, both flux and k_cat seem to be associated with structural evolution primarily through their variability, rather than their absolute values.

In parallel, we noticed that enzymes can escape the typical relationship between abundance and conservation due to unique functional constraints. Our attention was drawn to the thiamine (vitamin B1) biosynthetic pathway, for which the orthogroup containing Thi5p, Thi11p, Thi12p and Thi13p was highly conserved, despite low abundance (Fig. 3b). Notably, thiamine is extremely energetically costly to synthesize, as two of its reaction steps are catalysed by suicide enzymes that lose an essential amino acid after catalysis³³ (Fig. 3f). This orthogroup thus illustrated that evolutionary diversification is not necessarily directly constrained by abundance, but more likely by cost.

Hierarchy of cost optimization

We thus calculated the average cost per amino acid for each protein averaged over the orthogroup using established cost metrics^{12,14,15,34,35,36} (Supplementary Note 4). Previous studies indicated that high-abundance proteins evolve a less costly amino acid composition^14,15,16. Consistently, low-abundant, diverse enzymes had more costly amino acid compositions than high-abundant enzymes (Fig. 4a). However, the overall protein chain length, a proxy for the total protein cost, showed only a weak correlation with the mean averaged cost per amino acid (Spearman’s ρ = −0.13, P_adj. < 1 × 10⁻²), indicating that cost optimization might be due to selection acting on specific structural elements. Notably, these relationships were indicated by cost models that are both dependent and independent of species-specific metabolic networks (Supplementary Note 4).

To test the role of structural elements, we next identified small-molecule-binding sites using two orthogonal strategies: (1) binding-site information deposited in UniProt to account for directly coordinating residues (UPb) and (2) residues near small molecules from experimentally determined structures obtained from the RCSB Protein Data Bank (PDB) to also account for the physicochemical environment (CSb). Next, to distinguish between surface and core residues, we used the relative amino acid solvent-accessible surface area and designated each residue as either a surface-exposed or core residue³⁷. We found that the amino acid composition is optimized differently in these structural elements. In general, the average cost of core residues was higher than the cost of surface residues, whereas binding sites had a more variable cost. Among surface residues, we also observed increased costs for membrane-bound proteins and short proteins (<100 amino acids), such as Qcr8p or Kti11p in S. cerevisiae, that might be part of larger complexes (Extended Data Fig. 5a).

Moreover, we observed a hierarchy of structural evolution. Enzymes with more conserved surfaces tended to be less costly. This effect, although less strong, was also present for core residues, but was markedly diminished for binding sites, suggesting that surface residues are the primary sites for cost optimization (Fig. 4b). Moreover, the more expensive aromatic amino acids, such as phenylalanine and tyrosine, were more frequent in the variable and low-abundance enzymes. By contrast, the conserved, high-abundant enzymes contained higher amounts of the least expensive amino acids, glycine, glutamate and alanine (Fig. 4c,d and Extended Data Fig. 5b,c).

Addressing the molecular level, we detected a significant association between CR and amino acid content in at least one structural feature, for 11 of the 20 canonical amino acids (P_adj. < 1 × 10⁻⁴; Fig. 4e and Extended Data Fig. 5d). For instance, surfaces and coil regions of highly conserved enzymes had higher glycine content (Fig. 4f), whereas the core and helical regions had higher alanine content. Presumably, these small and inexpensive amino acids can replace many other amino acids, and their physicochemical properties are more suited to either the core or the surface. Furthermore, the surfaces of the variable enzyme structures had a higher cysteine content. This is presumably because cysteine often performs dedicated functions related to its high reactivity, such as catalytic activity or stabilization through disulfide bridges³⁸. Notably, there were no significant associations between the presence of any amino acid in binding sites and the CR of the enzymes.

Conservation of structural features

Hierarchy of structural evolution

At the substructure level, the core of the enzyme was more conserved than the surface (Fig. 5a). Furthermore, the binding sites had the highest CR (Fig. 5b and Extended Data Fig. 6a), and orthogroups containing fewer variable enzymes were significantly enriched for possessing a known binding site (AUC = 0.62 and 0.66; Fig. 5c). An exception was the orthogroup of the fatty acid synthetase subunit Fas2p, in which the binding site (n = 21 (UPb), 95 (CSb) amino acids) was more variable than the overall protein (Extended Data Fig. 6b,c).

At the level of the secondary structure, most helical portions (for example, α-helices) varied more than did extended portions (for example, β-sheets) (Fig. 5d). This result is robust to the tendency of AlphaFold2 to wrongly predict random coil regions (2.28%) as helical more than as extended³⁹ (Supplementary Note 2). To our surprise, helical portions were also more variable than mapped random coils, which are assumed to have a higher dynamic flexibility (Extended Data Fig. 6d and Supplementary Note 5). One explanation for this might be the presence of turns in the random coil regions, which show high conservation (Extended Data Fig. 7a,b).

Another factor that can influence structural evolution is larger architectures (protein folds). To assess this, we extracted fold information for those proteins in our dataset that were present in The Encyclopedia of Domains database⁴⁰. Indeed, amino acids in these folds are slightly more conserved (two-sided Wilcoxon signed-rank test, P < 1 × 10⁻⁴⁴, adjusted Cliff’s Δ = 0.64). Consistently, the two most dominant folds, the Rossmann fold (CATH fold 3.40.50) and TIM barrel (CATH fold 3.20.20), also exhibit higher conservation (two-sided Wilcoxon signed-rank test, P < 1 × 10⁻⁴, adjusted Cliff’s Δ = 0.36 (Rossmann fold) and P < 1 × 10⁻⁷, adjusted Cliff’s Δ = 0.86 (TIM barrel); Extended Data Fig. 6e). The orthogonal bundle (CATH architecture 1.10) was enriched in the most conserved enzymes (Fisher’s exact test, P_adj. < 1 × 10⁻³).

Conserved clusters capture interactions

To identify important functional residues, we conducted evolutionary selection analysis by estimating non-synonymous (dN) and synonymous (dS) substitution rates at both the whole-protein and site levels⁴¹ (Extended Data Fig. 8 and Supplementary Note 6.1). We observed a higher probability of neutral drift (dN/dS < 1) in surface residues (Fig. 5e). Moreover, in specific cases with (1) low conservation and high whole-protein dN/dS values (ETC proteins and Pox1p); or (2) the presence of sites under positive selection across all species, for example the GAPDH orthogroup containing Tdh1/2/3p, we observed residues with evidence of positive selection in specific branches using a branch-site model^42,43. In Pox1p, positively selected residues are found in the homodimer interface (Extended Data Fig. 9a,b), whereas in GAPDH, positively selected residues were primarily on surface-exposed residues, and did not affect the multimeric protein–protein interaction (PPI) sites. Moreover, one positively selected residue in GAPDH is within distance of the substrate–cofactor binding interface (Extended Data Fig. 9c,d). Finally, in the ETC supercomplex, 96 out of 157 unique residues with signatures of positive selection were on internal PPI sites (Extended Data Fig. 9e–g). Thus, despite surfaces being most divergent, followed by core residues and low diversification in the substrate binding sites, positive selection can occur in all structural elements (Supplementary Note 6.2).

In parallel, we noticed that fully conserved structural regions are often organized in clusters. Generating a network of fully conserved residues, resulted in a median number of 4.00 (IQR = 2.00–6.00) clusters with a median size of 18.55 amino acids (IQR = 14.00–23.64) per cluster (Supplementary Table 5.1). For example, in the pyruvate kinase orthogroup (Cdc19p of S. cerevisiae), these clusters correspond to the allosteric activation site for fructose-1,6-bisphosphate, the substrate-binding site for phosphoenolpyruvate and to a symmetric PPI site (Fig. 5f). We thus speculated that these clusters could correspond to metabolite-binding or other interaction sites. Indeed, they contain twice as many substrate- and ligand-binding-site residues (Fig. 5g and Extended Data Fig. 6f). Moreover, 91% (CSb) and 97% (UPb) of the annotated binding sites overlapped with at least one cluster. On the other hand, 27% (CSb) or 50% (UPb) of these clusters did not overlap with annotated binding sites. To test whether clusters can be identified as containing known binding sites, we trained a histogram-based gradient boosting classification tree on physicochemical properties. Evaluating ten fivefold cross-validations, we obtained an average balanced accuracy of 0.63 and average AUC of 0.68 on test data removed before training (Extended Data Fig. 6g), outperforming two random models (balanced accuracy = 0.57, AUC = 0.60 (random binding site) and balanced accuracy = 0.53, AUC = 0.54 (randomized labels); Extended Data Fig. 6h). Projecting the prediction confidence onto individual amino acids confirms that residues in known binding sites are more likely to be predicted as part of a binding site (P < 1 × 10⁻³⁸, two-sided Wilcoxon signed-rank test, adjusted Cliff’s Δ = 0.78). Moreover, we tested the degree to which these clusters can correspond to other types of interaction site and observed a slight over-representation of known PPI sites as extracted from the RCSB PDB (Extended Data Fig. 6i; Cliff’s Δ = 0.23 (compared with small-molecule-binding sites with 0.84 (CSb) and 0.92 (UPb)), which is consistent with a high conservation of PPI sites (Extended Data Fig. 6j). In agreement, enzymes with more physical PPIs (Extended Data Fig. 6k; Kendall τ = 0.32; P_adj. < 1 × 10⁻²⁴) and enzymes involved in protein complexes⁴⁴ (ComplexomeDB; Fig. 5c) were more highly conserved. Thus, although highly conserved clusters are dominated by small-molecule-binding sites, they also reflect PPI sites. Notably, some known small-molecule-binding sites were not in highly conserved clusters. This situation might be explained by the small size of some small-molecule-binding sites (such as metal interaction sites), but also points to the need for better annotation of existing sites.

Discussion

The metabolic network is evolutionarily ancient, and its origins are commonly explained by two prevailing hypotheses. One suggests that its topological organization emerged as a consequence of enzyme evolution, whereas the other proposes that the metabolic network structure originated from non-enzymatic reactions⁴⁵. Although metabolic evolution has probably experienced elements of both, increasing amounts of experimental evidence favour the second scenario. For example, many enzyme-catalysed reactions resemble non-enzymatic reactions as promoted by metal ions found frequently in Archaean sediment^46,47. Moreover, despite considerable divergence in the enzyme sequences, the basic structure of the metabolic network remains conserved^3,48. At the same time, modern metabolic pathways are highly efficient and respond to the environment, which suggests that they are optimized during evolution⁴⁹.

Studies into enzyme evolution support the model in which metabolic pathways evolve alongside chemical topologies, followed by their evolutionary optimization. Comparative genomics and detailed structural and functional investigations have described enzyme evolution as a dynamic process shaped by genetic innovations, biological constraints, cost and ecological interactions^5,50,51,52. Functional clustering of enzymes and their domains into families has provided insights into the extent of gene duplication and divergence across the tree of life^{40,53,54,55,56}. Furthermore, studies on niche adaptation have demonstrated how metabolic capabilities are lost and gained through both the expansion of gene families and the functional diversification of promiscuous enzymes^2,18,19. These findings suggest that enzyme evolution follows selective pressures that balance catalytic efficiency with cellular metabolic demands and resource allocation.

We speculated that the ability to generate protein structures systematically and across species barriers¹ enables the integration of high-resolution structural data into functional genomic approaches, aiding the understanding of evolutionary processes. Such an approach is certainly constrained by the accuracy of structural prediction^39,57. We focused on 26 diverse species selected from the Saccharomycotina subphylum to build on well-annotated genome sequences, enzyme functional annotations, genome-scale metabolic network reconstructions and proteomic data. The 11,269 enzyme structures examined cover most metabolic pathways, GO terms and enzyme classes present in their metabolic networks. We asked whether metabolic properties that differ between species, pathways and enzyme classes can help to explain the range of sequence diversity in structurally conserved regions observed in the different orthogroups. In this way, we identified metabolic constraints that influence structural evolution at different biological scales, from the level of the organism to those of pathways, individual enzymes and enzyme substructures. We also identified pathways, enzyme classes and, in some instances, structural elements that contribute to these relationships.

At the scale of the organism, we found that structural evolution is influenced by niche specialization, such as nutritional preferences. This included changes from oxidative to fermentative metabolism, which are a dominant metabolic module in yeast, other microorganisms and higher organisms. Changes from fermentation to respiration represent a major metabolic shift as fermentation is faster and less costly in terms of resource allocation, but oxidative metabolism has a better stoichiometry for ATP production, and imposes constraints on antioxidant metabolism^13,58,59.

At the pathway scale, our data suggest that enzyme structural evolution depends on pathway membership, the type of reaction catalysed and interactions with other metabolites. Enzymes involved in central carbon metabolism, oxidoreductases and metal-binding enzymes were the most constrained, whereas hydrolases and enzymes of more peripheral metabolic pathways, such as those functioning in lipid metabolism or protein glycosylation, diversified. Interestingly, we report an interdependency between enzyme conservation, metabolic flux and processivity, but find that variability of flux and processivity was more important than the total amount, indicating for these properties that the dynamic nature of metabolism constrains enzyme evolution more than do static properties.

At the structural level, our study confirmed that high-abundance enzymes evolve towards a more cost-efficient amino acid composition^14,15,16 with cost optimization depending on enzyme class and structural element, and distinct amino acid substitutions prevailing in different contexts. Most optimization occurs on surface regions, with the least in binding sites. Furthermore, structural features exhibit specific trends in amino acid substitutions. For example, alanine is more prevalent in the core, whereas glycine residues are more common on the surface of highly conserved enzymes.

It has been proposed that the diversification of enzyme structures can lead to new metabolic capabilities^4,5,55. Our dataset is consistent with this possibility but highlights that most of the structure-driven diversification results in shifts between chemically similar reactions. Our dataset identifies no example of a higher order change in enzyme function, and at the molecular level, binding sites are highly conserved and not optimized for costs. Therefore, we observe the formation of clusters of high structural conservation in small-molecule-binding sites. Our data suggest that these clusters can be used to annotate previously undescribed binding sites, whereas some serve other functions such as PPI.

Across all our analyses, we obtain a consistent picture that the relationship between structure and evolutionary constraints is dominated by catalytic function. Notably, hydrolases differ in several properties (Extended Data Fig. 10a–e) and escape most of the otherwise highly pronounced relationships. This is not due to overall structural differentiation, as the MR was not significantly different to other enzyme classes (Extended Data Fig. 10f), neither was it due to changes in binding site conservation (Extended Data Fig. 10g,h). We speculate that a contributing factor to this situation is that hydrolases do not require a cofactor⁶⁰, that their reaction mechanism is thus less constrained in evolution, and that they participate in a diverse spectrum of metabolic processes. By contrast, the high conservation of oxidoreductases is predominantly explained by their role in central metabolism.

Overall, this dominance of the catalytic constraints across all layers investigated is consistent with a model in which metabolic enzymes evolve alongside the chemical topology of the metabolic network, with structural components involved in catalysis changing the least. An alternative hypothesis, that constraints on enzyme structure drive changes in metabolism, would result in more flexible binding sites in conserved structures. This notion does not rule out that structural changes in enzymes could cause changes in substrate specificity and amplification of promiscuous activities that could pave the way for the evolution of metabolic pathways, for instance, through the selection of a promiscuous reaction. These findings illuminate the relationship between enzyme function, metabolic environment and structural evolution, providing innovative strategies for enzyme annotation and metabolic network engineering.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Protein expression levels (raw data are available via ProteomeXchange with identifier PXD064343), predicted structures, alignments and other data are available at Figshare (https://doi.org/10.6084/m9.figshare.29117966)⁶¹. Supplementary tables are included in the Supplementary Information. Source data are provided with this paper.

Code availability

Functions used to map the structures and visualize the data are available at GitHub (https://github.com/OliverLemke/structure_comparison (including an example of how to use the functions) and https://github.com/OliverLemke/basic_plots). Functions used to assemble MSAs, phylogenetic trees and to do evolutionary selection analysis are available at GitHub (https://github.com/heineike02/diverse_yeast). Additional code is available at Figshare (https://doi.org/10.6084/m9.figshare.29117966)⁶¹.

References

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Opulente, D. A. et al. Genomic factors shape carbon and nitrogen metabolic niche breadth across Saccharomycotina yeasts. Science 384, eadj4503 (2024).
Article CAS PubMed PubMed Central Google Scholar
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A.-L. The large-scale organization of metabolic networks. Nature 407, 651–654 (2000).
Article ADS CAS PubMed Google Scholar
Noda-Garcia, L., Liebermeister, W. & Tawfik, D. S. Metabolite–enzyme coevolution: from single enzymes to metabolic pathways and networks. Annu. Rev. Biochem. 87, 187–216 (2018).
Article CAS PubMed Google Scholar
Ribeiro, A. J. M., Riziotis, I. G., Borkakoti, N. & Thornton, J. M. Enzyme function and evolution through the lens of bioinformatics. Biochem. J 480, 1845–1863 (2023).
Article CAS PubMed Google Scholar
Pavlova, N. N. & Thompson, C. B. The emerging hallmarks of cancer metabolism. Cell Metab. 23, 27–47 (2016).
Article CAS PubMed PubMed Central Google Scholar
Stine, Z. E., Schug, Z. T., Salvino, J. M. & Dang, C. V. Targeting cancer metabolism in the era of precision oncology. Nat. Rev. Drug Discov. 21, 141–162 (2022).
Article CAS PubMed Google Scholar
Porter, J. L., Rusli, R. A. & Ollis, D. L. Directed evolution of enzymes for industrial biocatalysis. ChemBioChem 17, 197–203 (2016).
Article CAS PubMed Google Scholar
Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature 490, 535–538 (2012).
Article ADS CAS PubMed Google Scholar
Suárez-Díaz, E. Molecular evolution in historical perspective. J. Mol. Evol. 83, 204–213 (2016).
Article ADS PubMed Google Scholar
Zuckerkandl, E. & Pauling, L. Molecules as documents of evolutionary history. J. Theor. Biol. 8, 357–366 (1965).
Article ADS CAS PubMed Google Scholar
Chen, Y. & Nielsen, J. Yeast has evolved to minimize protein resource cost for synthesizing amino acids. Proc. Natl Acad. Sci. USA 119, e2114622119 (2022).
Article CAS PubMed PubMed Central Google Scholar
Nilsson, A. & Nielsen, J. Metabolic trade-offs in yeast are caused by F1F0-ATP synthase. Sci. Rep. 6, 22264 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Akashi, H. & Gojobori, T. Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl Acad. Sci. USA 99, 3695–3700 (2002).
Article ADS CAS PubMed PubMed Central Google Scholar
Barton, M. D., Delneri, D., Oliver, S. G., Rattray, M. & Bergman, C. M. Evolutionary systems biology of amino acid biosynthetic cost in yeast. PLoS ONE 5, e11935 (2010).
Article ADS PubMed PubMed Central Google Scholar
Raiford, D. W. et al. Do amino acid biosynthetic costs constrain protein evolution in Saccharomyces cerevisiae? J. Mol. Evol. 67, 621–630 (2008).
Article ADS CAS PubMed Google Scholar
Kurtzman, C., Fell, J. W. & Boekhout, T. The Yeasts: A Taxonomic Study (Elsevier, 2011).
Lu, H. et al. Yeast metabolic innovations emerged via expanded metabolic network and gene positive selection. Mol. Syst. Biol. 17, e10427 (2021).
Article CAS PubMed PubMed Central Google Scholar
Shen, X.-X. et al. Tempo and mode of genome evolution in the budding yeast subphylum. Cell 175, 1533–1545 (2018).
Article CAS PubMed PubMed Central Google Scholar
Wu, J. et al. Yeast diversity in relation to the production of fuels and chemicals. Sci. Rep. 7, 14259 (2017).
Article ADS PubMed PubMed Central Google Scholar
Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Article CAS PubMed Google Scholar
Guo, H.-B. et al. AlphaFold2 models indicate that protein sequence determines both structure and dynamics. Sci. Rep. 12, 10696 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
Article CAS PubMed PubMed Central Google Scholar
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Article CAS PubMed PubMed Central Google Scholar
Meng, E. C., Pettersen, E. F., Couch, G. S., Huang, C. C. & Ferrin, T. E. Tools for integrated sequence-structure analysis with UCSF Chimera. BMC Bioinform. 7, 339 (2006).
Article Google Scholar
Opulente, D. A. et al. Factors driving metabolic diversity in the budding yeast subphylum. BMC Biol. 16, 26 (2018).
Article PubMed PubMed Central Google Scholar
Oss, S. B. V. et al. On the illusion of auxotrophy: met15Δ yeast cells can grow on inorganic sulfur, thanks to the previously uncharacterized homocysteine synthase Yll058w. J. Biol. Chem. 298, 102697 (2022).
Article PubMed PubMed Central Google Scholar
Yu, J. S. L. et al. Inorganic sulfur fixation via a new homocysteine synthase allows yeast cells to cooperatively compensate for methionine auxotrophy. PLoS Biol. 20, e3001912 (2022).
Article CAS PubMed PubMed Central Google Scholar
Alam, M. T. et al. The self-inhibitory nature of metabolic networks and its alleviation through compartmentalization. Nat. Commun. 8, 16018 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Bédard, C., Cisneros, A. F., Jordan, D. & Landry, C. R. Correlation between protein abundance and sequence conservation: what do recent experiments say? Curr. Opin. Genet. Dev. 77, 101984 (2022).
Article PubMed Google Scholar
Mata, J. & Bahler, J. Correlations between gene expression and gene conservation in fission yeast. Genome Res. 13, 2686–2690 (2003).
Article CAS PubMed PubMed Central Google Scholar
Li, F. et al. Deep learning-based k_cat prediction enables improved enzyme-constrained model reconstruction. Nat. Catal. 5, 662–672 (2022).
Perli, T., Wronska, A. K., Ortiz‐Merino, R. A., Pronk, J. T. & Daran, J. Vitamin requirements and biosynthesis in Saccharomyces cerevisiae. Yeast 37, 283–304 (2020).
Article CAS PubMed Google Scholar
Craig, C. L. & Weber, R. S. Selection costs of amino acid substitutions in ColE1 and ColIa gene clusters harbored by Escherichia coli. Mol. Biol. Evol. 15, 774–776 (1998).
Article CAS PubMed Google Scholar
Seligmann, H. Cost-minimization of amino acid usage. J. Mol. Evol. 56, 151–161 (2003).
Article ADS CAS PubMed Google Scholar
Wagner, A. Energy constraints on the evolution of gene expression. Mol. Biol. Evol. 22, 1365–1374 (2005).
Article CAS PubMed Google Scholar
Shrake, A. & Rupley, J. A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79, 351–371 (1973).
Article CAS PubMed Google Scholar
White, M. E. H., Gil, J. & Tate, E. W. Proteome-wide structural analysis identifies warhead- and coverage-specific biases in cysteine-focused chemoproteomics. Cell Chem. Biol. 30, 828–838 (2023).
Article CAS PubMed Google Scholar
Stevens, A. O. & He, Y. Benchmarking the accuracy of AlphaFold2 in loop structure prediction. Biomolecules 12, 985 (2022).
Article CAS PubMed PubMed Central Google Scholar
Lau, A. M. et al. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science 386, eadq4946 (2024).
Article CAS PubMed Google Scholar
Álvarez-Carretero, S., Kapli, P. & Yang, Z. Beginner’s guide on the use of PAML to detect positive selection. Mol. Biol. Evol. 40, msad041 (2023).
Article PubMed PubMed Central Google Scholar
Yang, Z., Wong, W. S. W. & Nielsen, R. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22, 1107–1118 (2005).
Article CAS PubMed Google Scholar
Zhang, J., Nielsen, R. & Yang, Z. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22, 2472–2479 (2005).
Article CAS PubMed Google Scholar
Benschop, J. J. et al. A consensus of core protein complex compositions for Saccharomyces cerevisiae. Mol. Cell 38, 916–928 (2010).
Article CAS PubMed Google Scholar
Ralser, M., Varma, S. J. & Notebaart, R. A. The evolution of the metabolic network over long timelines. Curr. Opin. Syst. Biol. 28, 100402 (2021).
Article CAS Google Scholar
Keller, M. A. et al. Conditional iron and pH-dependent activity of a non-enzymatic glycolysis and pentose phosphate pathway. Sci. Adv. 2, e1501235 (2016).
Article ADS PubMed PubMed Central Google Scholar
Muchowska, K. B., Varma, S. J. & Moran, J. Nonenzymatic metabolic reactions and life’s origins. Chem. Rev. 120, 7708–7744 (2020).
Article CAS PubMed Google Scholar
Aulakh, S. K. et al. The molecular landscape of cellular metal ion biology. Cell Syst. https://doi.org/10.1016/j.cels.2025.101319 (2025).
Ng, C. Y., Wang, L., Chowdhury, A. & Maranas, C. D. Pareto optimality explanation of the glycolytic alternatives in nature. Sci. Rep. 9, 2633 (2019).
Article ADS PubMed PubMed Central Google Scholar
Clifton, B. E. et al. Evolution of cyclohexadienyl dehydratase from an ancestral solute-binding protein. Nat. Chem. Biol. 14, 542–547 (2018).
Article CAS PubMed Google Scholar
Nicoll, C. R. et al. Ancestral-sequence reconstruction unveils the structural basis of function in mammalian FMOs. Nat. Struct. Mol. Biol. 27, 14–24 (2020).
Article CAS PubMed Google Scholar
Voordeckers, K. et al. Reconstruction of ancestral metabolic enzymes reveals molecular mechanisms underlying evolutionary innovation through gene duplication. PLoS Biol. 10, e1001446 (2012).
Article CAS PubMed PubMed Central Google Scholar
Barrio-Hernandez, I. et al. Clustering-predicted structures at the scale of the known protein universe. Nature 622, 637–645 (2023).
Blum, M. et al. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 53, D444–D456 (2025).
Article PubMed Google Scholar
Copley, S. D. Evolution of new enzymes by gene duplication and divergence. FEBS J. 287, 1262–1283 (2020).
Article CAS PubMed PubMed Central Google Scholar
Durairaj, J. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
Terwilliger, T. C. et al. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat. Methods 21, 110–116 (2024).
Article CAS PubMed Google Scholar
Di Bartolomeo, F. et al. Absolute yeast mitochondrial proteome quantification reveals trade-off between biosynthesis and energy generation during diauxic shift. Proc. Natl Acad. Sci. USA 117, 7524–7535 (2020).
Article ADS PubMed PubMed Central Google Scholar
Stincone, A. et al. The return of metabolism: biochemistry and physiology of the pentose phosphate pathway. Biol. Rev. Camb. Philos. Soc. 90, 927–963 (2015).
Article PubMed Google Scholar
Hoyos, P., Hernáiz, M. J. & Alcántara, A. R. in Comprehensive Biotechnology 3rd edn (ed. Moo-Young, M.) 334–373 (Pergamon, 2017).
Lemke, O. et al. The role of metabolism in shaping enzyme structures over 400 million years. Figshare https://doi.org/10.6084/m9.figshare.29117966 (2025).

Download references

Acknowledgements

We thank J. Bahler, J. Yu, J. Muenzner, J. Hartl, C. Jakobsen, T. Alam, M. White, S. Alvarez-Carretero and Z. Yang for scientific advice; J. Tamanini for editing support and L. Szyrwiel, D. Ludwig and M. Muelleder for preparation and support for the proteomics measurements. O.L. and N.C. are supported by the European Research Council (ERC) under grant agreement ERC-SyG-2020 951475 awarded to J.B. and M.R. B.M.H. and S.K.A. are supported by Wellcome Trust grant (IA 200829/Z/16/Z). A.Z. is supported by Biotechnology and Biological Sciences Research Council (BBSRC) grant number BB/Y000730/1, Marius Jakulis Jason Foundation, Swedish Research council (Vetenskapsrådet) grant no. 2023-04254, 2019-05356, Formas grant 2019-01403. The data handling and computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. The computations for structure predictions were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre. J.L.S. is a Howard Hughes Medical Institute Awardee of the Life Sciences Research Foundation. Research in the laboratory of A.R. is supported by the National Science Foundation (DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R01 AI153356) and the Burroughs Wellcome Fund.

Funding

Open access funding provided by Max Planck Society.

Author information

Author notes

These authors contributed equally: Oliver Lemke, Benjamin Murray Heineike

Authors and Affiliations

Department of Biochemistry, Charité-Universitätsmedizin Berlin, Berlin, Germany
Oliver Lemke, Benjamin Murray Heineike, Nir Cohen, Leonard Spranger, Federica Agostini, Cory Thomas Lee & Markus Ralser
Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité, Berlin, Germany
Oliver Lemke & Markus Ralser
Center for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
Benjamin Murray Heineike, Simran Kaur Aulakh & Markus Ralser
Molecular Biology of Metabolism Laboratory, The Francis Crick Institute, London, UK
Benjamin Murray Heineike, Simran Kaur Aulakh & Markus Ralser
Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
Sandra Viknander, Feiran Li, Jens Nielsen & Aleksej Zelezniak
Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA
Jacob Lucas Steenwyk
Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
Jacob Lucas Steenwyk & Antonis Rokas
Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
Jacob Lucas Steenwyk
Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
Jacob Lucas Steenwyk & Antonis Rokas
Shmunis School of Biomedical and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv, Israel
Judith Berman
Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Dortmund, Germany
Toni Ingolf Gossmann
Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania
Aleksej Zelezniak
Randall Centre for Cell & Molecular Biophysics, Kingʼs College London, London, UK
Aleksej Zelezniak
Max Planck Institute for Molecular Genetics, Berlin, Germany
Markus Ralser

Authors

Oliver Lemke
Benjamin Murray Heineike
Sandra Viknander
Nir Cohen
Feiran Li
Jacob Lucas Steenwyk
Leonard Spranger
Federica Agostini
Cory Thomas Lee
Simran Kaur Aulakh
Judith Berman
Antonis Rokas
Jens Nielsen
Toni Ingolf Gossmann
Aleksej Zelezniak
Markus Ralser

Contributions

O.L., B.M.H., T.I.G. and M.R. conceptualized the work. O.L., B.M.H., S.V., N.C., J.L.S. and T.I.G. developed the methodology. O.L., B.M.H., S.V., N.C., J.L.S., L.S. and F.L. implemented the software for the analysis. O.L., B.M.H., S.V., F.L., F.A. and C.T.L. validated the data. O.L., B.M.H., N.C., J.L.S. and L.S. performed the formal analysis and investigation of the data. S.V. generated the structures. A.Z. and M.R. provided resources. O.L., B.M.H., S.V., N.C., J.L.S., L.S., F.L., F.A. and S.K.A. were involved in data curation. O.L., B.M.H. and M.R. wrote the original draft. O.L., B.M.H. and M.R. revised the manuscript. O.L. and B.M.H. prepared the figures. T.I.G. and M.R. supervised and O.L., B.M.H. and M.R. administered the project. J.N., A.R., A.Z., J.B. and M.R. provided funding or resources to at least one co-author.

Corresponding author

Correspondence to Markus Ralser.

Ethics declarations

Competing interests

M.R. and A.Z. are founders and shareholders of Eliptica. J.L.S. is an adviser for ForensisGroup. J.L.S. is a scientific consultant for FutureHouse. Over the course of this project, J.L.S. was a Bioinformatics Visiting Scholar at MantleBio. A.R. is a scientific consultant for LifeMine Therapeutics. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The predicted protein structures are well-structured, but small orthogroup refinements are required.

a) Mean pLDDT score as calculated by Alphafold2 versus the relative, length-normalised amino acid index for all our simulated structures, divided into 100 bins, due to varying protein lengths. Error bars denote the standard deviation coloured by the coefficient of variation. b + c) Structural alignment-based orthogroup refinement illustrated on orthogroup 1022. b) Hierarchical clustering of structures using US-align scores. The two resulting clusters are highlighted. Cluster 1 (blue) contains the reference structure S. cerevisiae Gre3p (Uniprot: P38715) and is named OG1022_REF_Scer_AF-P38715-F1-model and cluster 2 (red) contains Nit3p (Uniprot: P49954) and is named OG1022_REF_Scer_AF-P49954-F1-model. c) Phylogenetic tree based on US-align structural alignment of the entire original orthogroup 1022. The clusters based on hierarchical clustering correspond to structures that cluster together in the protein tree. d) Mean CR versus mean amino acid type CR. The dotted line denotes the identity line, other lines as in e). e) Mean CR versus variability of the octanol-water partition coefficient (KOW). Solid line is the best linear fit and the dashed line denotes the axis median. f) Example of the structural alignment with respect to the reference structure. In addition, in the bottom panel other features, like pLDDT, solvent-accessible surface-area (SASA), secondary structure prediction (DSSP), fully conserved residues and fully conserved amino acid types are depicted. Colour according to amino acids. In case of light grey, no residue could be aligned to the reference. A dark grey colour indicates an agreement with respect to the reference structure for the respective amino acid. The more dark grey, the higher the conservation ratio. g) Maximum-normalised violin plot of the pLDDT for the density distribution containing every single residue grouped by residues that could be mapped (n = 5,527,262) and that could not be mapped (n = 1,454,754) respectively.

Source data

Extended Data Fig. 2 Nutrient utilization constraining structural evolution in Saccharomycotina.

a) Nutrient utilization for different carbon sources according to Kurtzman et al., 2011^17,18. 1 denotes growth, 0 denotes lack of growth, v denotes variable growth, and n denotes missing data. b) Mean conservation ratio per protein calculated for species that ferment glucose versus species that do not ferment glucose in Kurtzman et al.¹⁷. Pathways related to aerobic respiration or anaerobic fermentation are highlighted. The dotted line denotes the identity line, the solid line denotes the linear fit and the dashed line denotes the axis median. c) Significant GO slim enrichments of the first quartile of the orthogroups containing proteins that show the largest differences in mean conservation ratio between species that ferment and species that do not ferment glucose. Size indicates the number of genes associated with the GO slim term in our selection, and colour indicates adjusted p-value. In addition the area under the receiver operating characteristic curve (AUC) is shown. d) Average-linkage hierarchical clustering of the pairwise CRs for the orthogroup containing Acs1p and Acs2p in S. cerevisiae. e) CR of Acs1,2p orthologues with respect to Acs1p/Acs2p. The labels are colored according to the capability of D-Xylose utilization as shown in A.

Source data

Extended Data Fig. 3 Metabolic factors constraining structural evolution in Saccharomycotina.

a) Upset-plot of the significantly enriched pathways, ordered by group dependency as shown in (b). b) Enrichment of the orthogroups containing the 25% most conserved (red, upper panel) and the 25% most diverse enzymes (blue, lower panel) for metabolic pathways from the yeast pathways database and GO slim terms. Size indicates the number of genes associated with the term or pathway in our selection, and colour indicates adjusted p-value. In addition, the AUC is shown. Super pathway is abbreviated as “Spw”. c) ROC-curves for highly enriched pathways. The number in the legend indicates the AUC. The dashed line denotes the identity line representing random sampling. d) Distribution of the mean conservation ratio of orthogroups assigned to oxidoreductases (red). In addition, the distributions for oxidoreductases that are (blue) or that are not (orange) members of the enriched pathways from (c) and Fig. 2d are depicted. The average distribution for all orthogroups is shown in black. e) Distribution of the mean conservation ratio of orthogroups assigned to hydrolases (blue). In addition, the distributions for hydrolases that are (green) or that are not (orange) members of the GO-term “lipid metabolic process” are depicted. The average distribution for all orthogroups is shown in black. f) Mean conservation ratio versus the number of inhibitors (n = 182). Solid line is the best linear fit and the dashed line denotes the axis median.

Source data

Extended Data Fig. 4 The proteomics data are of high-quality. Metabolic properties influence structural evolution.

a) Completeness of proteomic measurements for the selected orthogroups. For 397 of the original orthogroups protein abundance was measured in at least one species. A black line indicates a measured protein abundance of the orthogroup in the respective species. The bar plots account for the row and column sum. Grey dotted lines indicate the first, second and third quartiles. b) Scatter plot of the technical coefficient of variation (quality control samples) versus the biological coefficient of variation (measured species). The dotted line denotes the identity line and the dashed line denotes the axis median. c) Median simulated flux compared to ¹³C experimental flux in 7 different datasets. The final plot compares median simulated flux to the median ¹³C experimental flux. Only reactions that had data from at least 5 studies were used. d) Spearman’s correlation of median predicted flux, coefficient of variations and Kendall’s τ correlation of the number of budding yeast species (n = 329) in which a flux>0 was predicted¹⁸ versus mean protein abundance and mean orthogroup conservation ratio. A * indicates an adj. p-value < 0.05. e) Mean conservation ratio versus standard deviation of the log2-transformed averaged kcat. Solid line is the best linear fit and the dashed line denotes the axis median.

Source data

Extended Data Fig. 5 Amino acid costs are optimised depending on the structural feature and the physicochemical environment.

a) Conservation ratio of the surface (left), core (centre) and binding site (CSb) (right) versus their respective mean normalised costs. Membrane bound (red, n = 84) and short protein chains (purple, n = 8) are highlighted respectively. For the surface a bimodal cost distribution is observed. The dashed line denotes the axis median. b + c) Bar plot of the median relative amino acid content sorted in descending order. The error bars denote the MAD. The colouring of the bars indicates Spearman’s correlation of the relative amino acid content of the entire protein versus b) the mean overall conservation rate and c) the log-transformed protein abundance. d) Boxplot of Spearman’s correlation coefficients as shown in Fig. 4e (conservation ratio vs. relative amino acid content of different structural features). The line denotes the median and the boxes denote the first and third quartile, the whiskers in maximum 1.5-times the interquartile range. Outliers are shown as single dots.

Source data

Extended Data Fig. 6 There is a hierarchy of evolution of structural features with surface residues being least constrained and binding sites the most.

a) Distribution of the mean conservation ratio for all mapped parts (blue) and only for the binding site (purple and red). The purple distribution depicts the CSb definition, the red distribution the UPb definition (left). In the right panel the pink distribution accounts for residues present in the purple distribution (CSb) and not present in the red one (UPb). b + c) Mean conservation ratio of the overall protein versus the conservation of the respective binding sites for b) the CSb definition and c) the UPb definition. Orthogroups below the dashed identity line are highlighted. Solid line is the best linear fit. Dashed line denotes the axis median. d) Mean conservation ratio of the random coiled versus mean conservation ratio of the helical parts (left) and the extended parts (right) respectively. Lines as in b). e) CR versus CR of known domains. Enzymes with the two most dominant folds are highlighted. Lines as in b). f + i) Percentage of either all or only fully conserved residues that are identified f) as binding sites for the UPb definition (n = 317) of binding sites or i) as protein-protein-interaction (PPI) sites (n = 106). The line denotes the median and the boxes denote the first and third quartile, the whiskers extend up to 1.5-times the interquartile range. *** indicates a p-value < 1E-4, two-sided Wilcoxon signed-rank test. g + h) ROC-curves for a machine-learning model predicting known binding-sites. The performance of the test set, that was removed before training the model, for 50 runs as well as the micro and macro averages are shown. For comparison in g) the original labels and in h) randomized labels were used for training and testing. j) Distribution of the mean CR for all mapped parts (blue) and only for the PPI sites (brown) and the surface (orange) respectively. The pink distribution depicts surface residues that are part of a coordination site only. k) Mean CR versus the log10-transformed number of unique physical interactions taken from BioGRID via the Yeastmine interface (n = 525). Lines as in b).

Source data

Extended Data Fig. 7 Our predicted structures capture multimeric structures well. Glycine and tight turns are highly conserved.

a) Boxplot for CR in tight turns, random coils or the whole protein. b) Distribution of CR for different turn lengths compared to random coils. c + d) Boxplot for c) MR of multimeric (n = 106) and monomeric structures (n = 34) and g) proteins in a complex as defined in ComplexomeDB (n = 128) or not (n = 401) performing two-sided Wilcoxon-Mann-Whitney test. *** indicates a p-value < 1E-4. e + f) Distribution of relative e) glycine and f) alanine content for different turn lengths compared to all random coil residues. g) Boxplot for glycine conservation in tight turns, random coils or the whole protein. The line denotes the median and the boxes denote the first and third quartile, the whiskers extend up to 1.5-times the interquartile range. A *** indicates a p-value < 1E-4, two-sided Wilcoxon signed-rank test.

Source data

Extended Data Fig. 8 Refinement of the dN/dS calculation to remove skewed values.

a) dN (left) and dS (right) versus mean CR. Red dots denote a dS > 3, the solid line indicates the best linear fit and the dashed line denotes the axis median. b) dN/dS versus mean CR. Colouring according to a). The solid line indicates the best linear fit for the blue highlighted orthogroups and the dashed lines denote the axis median for the two groups. c + d) Significantly enriched pathways and GO slim terms of c) the blue (dS ≤ 3, n = 272) and red (dS > 3, n = 94) highlighted orthogroups with respect to the full set and d) orthogroups containing enzymes with the 25% highest dN/dS values. Dot size indicates the number of genes associated with the term or pathway in our selection, and colour indicates adjusted p-value. In addition the AUC is shown. e + f) Enrichment factor for e) enzyme classes and f) the GO term “lipid-metabolic process” according to dN/dS. The dashed line indicates random sampling. g) Probability of a residue being under neutral drift is inversely correlated with CR. h) dN/dS versus the mean CR for orthogroups with dS < 3. Orthogroups that contain proteins of the electron transport chain as well as Pox1p and Tdh1/2/3p are highlighted. The solid line indicates the best linear fit and the dashed line denotes the axis median.

Source data

Extended Data Fig. 9 Positive selection in Pox1p, Tdh1/2/3p and the ETC.

a) Multiple sequence alignment of selected regions of Pox1p orthologs. Asterix indicates residues under positive selection in the branch highlighted with the black box on the left. Other highlighted residues interact with positively selected residues in the structure. The dark square indicates the branch containing S. cerevisiae and W. anomalous orthologs for which residues under positive selection are highlighted in (b). b) Structure of the fatty-acyl coA oxidase homodimer interface obtained by using the crystal structure of the Y. lipolytica AOx protein, ylACO1 (right), (PDB:5Y9D). This crystal structure was also used as a reference to superimpose S. cerevisiae Pox1p structure from Alphafold2 (left). The positively selected amino acids Lys580 and Gln653 as well as the coordinating amino acids Tyr385 and Asn631 are highlighted. The red and the orange surfaces denote different protein chains of the homodimeric complex. The inset table denotes the residues present for various orthologs. Grey shading in the table denotes residues and branches that were under positive selection. c) Multiple sequence alignment of GAPDH enzymes in the region surrounding the K101 residue positively selected in the branch indicated by the dark square. d) Left: Tetrameric structure of GAPDH 3 from S. cerevisiae (PDB:3PYM) with positively selected residues highlighted in white. Right: Zoom into the binding site (predicted structure) with cofactor NAD and ligand GAP extracted from PDB:5JYA. The positively selected Lys101 as well as the coordinating Glu102 and the catalytically important Cys150 are highlighted. e) Multiple sequence alignment of Qcr2 highlighting residues detected as under positive selection in designated branches. f) Connectivity network of the cytochrome-c-oxidase/cytochrome bc1 complex (PDB: 6HU9). The node size indicates the length of the protein chain. Square nodes denote mitochondrially-encoded genes. The nodes are coloured with respect to the conservation ratio, where grey denotes missing orthogroups in our data selection. The edges indicate physical interactions between the proteins and edge width indicates the number of contacts between the surfaces of the two protein chains. The edges are coloured according to the number of positively selected amino acids from either protein chain on the surface between the two protein chains. g) Crystal structure of the cytochrome-c-oxidase/cytochrome bc1 complex. The structure consists of a homodimer made up of two ubiquinol cytochrome c reductase (complex III) complexes surrounded by two cytochrome c oxidase (Complex IV) complexes. Protein chains are coloured with respect to their conservation rate (left part) or chain ID (right part). Red dots denote positions where positive selection was detected in at least one branch. In addition, four interfaces are shown in detail, highlighting positively selected residues between the two surfaces. The respective positions in the complex are denoted in coloured boxes. Multiple sequence alignment illustrations created with Jalview.

Source data

Extended Data Fig. 10 Hydrolases escape most relationships and differ significantly in several investigated properties.

a) Pearson correlation of the mean MR and the mean CR for different enzyme classes. A * indicates an adj. p-value < 1e-2. b) Mean CR versus mean MR with orthogroups containing hydrolases highlighted. The solid line denotes the best linear fit. c-f) Distribution of c) flux coefficient of variation, d) CR, e) log2-transformed protein abundance and f) MR for different enzyme classes. A * indicates an adj. p-value < 1E-2, a *** an adj. p-value < 1E-4. g + h) CR of the binding site using g) the CSb and h) the UPb definition for different enzyme classes.

Source data

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Lemke, O., Heineike, B.M., Viknander, S. et al. The role of metabolism in shaping enzyme structures over 400 million years. Nature (2025). https://doi.org/10.1038/s41586-025-09205-6

Download citation

Received: 27 May 2024
Accepted: 28 May 2025
Published: 09 July 2025
DOI: https://doi.org/10.1038/s41586-025-09205-6

Read Entire Article

Role of metabolism in shaping enzyme structures over 400M years

Main

Mapping evolution in enzyme structures

Biochemical constraints

The impact of metabolic specialization

The impact of pathway membership

The impact of molecular function

Abundance and flux diversity constraints

Hierarchy of cost optimization

Conservation of structural features

Hierarchy of structural evolution

Conserved clusters capture interactions

Discussion

Reporting summary

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data figures and tables

Supplementary information

Rights and permissions

About this article

Cite this article

Related

Multilingualism protects against accelerated aging

PS5 Has Now Officially Outsold Every Xbox Console Ever Relea...

cjsawk: A Self-Hosted Compiler for a Subset of C, JavaScript...