When Galileo wrote that “the universe cannot be read until we have learned the language and become familiar with the characters in which it is written,” he was talking about mathematics — the language that has allowed physicists to capture the fundamental rules of our universe with simple formulas like Newton's second law of motion (F=ma), Einstein’s energy-mass equivalence (E=mc²), and the Schrödinger equation (Hψ=Eψ).
These mathematical expressions have given us the power to engineer everything from rocket propulsion systems to nuclear power plants to quantum computers. Yet when we apply this mathematical lens to biology, it often fails to yield breakthroughs of the same magnitude.
The relationship between mathematics and biology has always been complicated. The traditional language of mathematics that works so well for physics—differential equations, probability theory, and statistics—simply doesn't map as cleanly onto biological systems. This isn't to say math isn't useful for biology. For example, the Lotka-Volterra model accurately captures predator-prey dynamics using systems of differential equations. Similarly, Hill functions can describe biological processes like oxygen binding to hemoglobin, demonstrating that mathematical formulas can indeed capture certain biological phenomena.
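For readers who want to see the equations themselves: the Lotka-Volterra model couples a prey population x and a predator population y through a pair of differential equations, and the Hill function describes the fraction θ of binding sites occupied by a ligand L. (Symbols follow the standard textbook forms; α, β, γ, δ are rate constants, K is the half-saturation constant, and n is the Hill coefficient.)

```latex
\frac{dx}{dt} = \alpha x - \beta x y, \qquad
\frac{dy}{dt} = \delta x y - \gamma y, \qquad
\theta = \frac{[L]^n}{K^n + [L]^n}
```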
However, the most interesting and important biological problems resist such neat mathematical encapsulation. Why is this the case? Three major challenges stand in the way: dimensionality, interconnectedness, and diversity.
Biological systems often exist in a dimensional middle ground—too big and complex for simple reductionist approaches, yet too small and specific to be explained by statistical generalizations. For example, a single cell contains thousands of interacting genes and proteins, which are far too many to be explained with simple equations, yet too few for statistical averaging to smooth out the noise. Additionally, in biology everything seems to interact with everything else. Isolating components for mathematical analysis means losing the critical context that gives these components meaning. Finally, the diversity of living systems defies static characterization — biology is constantly evolving and changing; what was true for yesterday's organism may not be true for tomorrow's.
This has left bioengineers in a frustrating position. While physicists can reliably predict the trajectory of a rocket or the behavior of an electrical circuit, bioengineers often find themselves navigating by intuition and trial-and-error. The language of traditional mathematics just doesn't seem to be the native tongue of biology.
At this point, you may be thinking, "Machine learning algorithms are just math—neural networks are simply functions with parameters, parameter estimation and optimization are calculus, and statistics tie everything together." This is technically true, but it misses a crucial distinction.
Traditional mathematical modeling in biology typically involves creating simplified, human-interpretable equations that approximate biological phenomena. These models are reductionist by necessity—they must be simple enough for a human to understand and manipulate. Additionally, they're usually built on assumptions about linearity, independence of variables, and equilibrium states that rarely hold true in living systems.
Machine learning represents a fundamentally different approach. Rather than imposing human-interpretable equations onto biological data, it allows complex mathematical relationships to emerge from the data itself. A deep neural network with millions of parameters doesn't offer a neat equation you can write on a blackboard, but it can capture non-linear, context-dependent relationships that traditional models simply cannot.
Think of it this way: both Tolstoy’s War and Peace and a stop sign can be written using the English language, but they represent fundamentally different forms of communication. Similarly, the mathematics of machine learning and the mathematics of traditional biological modeling share foundations but represent profoundly different approaches to understanding complexity.
What if biology has been waiting for a different kind of language—one that embraces complexity rather than simplifying it? Machine learning appears to be this new language, offering a framework that aligns remarkably well with biological systems.
The parallels between machine learning and natural language are instructive here. Natural language is complex, full of exceptions, contextual, and constantly evolving—just like biology. Traditional rule-based approaches to language processing failed for decades until machine learning methods emerged that could capture these complexities. It's perhaps no coincidence that many algorithms now successfully applied to biological problems, such as hidden Markov models, were originally developed for analyzing human language. In fact, Andrey Markov first developed his statistical models in the early 1900s to analyze patterns in Russian poetry and literature. These same mathematical principles, later extended into hidden Markov models for speech recognition in the 1970s, proved remarkably effective for analyzing biological sequences like DNA—treating genetic code as a language with its own statistical patterns and grammatical rules, not unlike the literary works Markov first studied.
Consider the contextual nature of both language and biology. In English, the word "bank" can mean a financial institution or the side of a river—the meaning depends entirely on context. Biology displays this same context dependence. For example, take the p53 protein, which is often called the guardian of the genome. Typically, we think of p53 as a tumor suppressor that triggers cell death in damaged cells by upregulating various “death effector” proteins. In other contexts, however, such as embryonic development, p53 promotes cell survival and growth. Traditional mathematical models struggle to capture this kind of context dependence; machine learning models, on the other hand, excel at it.
Machine learning thrives on precisely the challenges that make biology difficult to describe with traditional mathematics. For example, machine learning models can handle systems with thousands of interacting parts, or high-dimensional data—a neural network with thousands of nodes is no more difficult to train than one with dozens. Additionally, machine learning models naturally represent biology as an integrated network, capturing complex, non-linear interactions that simple equations cannot. Furthermore, where traditional mathematics struggles with biological variation, machine learning feeds on it. Variation becomes data, and each different cell, species, or DNA sequence becomes a training example.
This relationship between machine learning and biology goes deeper than just modeling biological processes. In fact, the internal workings of cells themselves suggest why machine learning might be so well-suited to understanding biological systems and predicting their behaviors.
Consider how cells interpret their environment. A cell doesn't have direct access to abstract concepts like "heat stress" or "viral infection." Instead, it uses a sophisticated symbolic language to represent these states—a language built around transcription factors.
Transcription factors are proteins that can rapidly shift between active and inactive states in response to environmental signals. When activated, they bind to specific DNA sequences and regulate the expression of target genes. In essence, they function as symbols that represent specific, complex, environmental conditions or cellular states.
Take the heat shock response as an example. When a cell experiences high temperature, heat shock proteins become unfolded, triggering the activation of heat shock factor 1 (HSF1). This activated transcription factor then binds to DNA, increasing the production of proteins that help the cell cope with heat stress. HSF1's activity becomes a symbol that represents "heat stress" within the cell's internal language.
The complexity deepens when we consider how transcription factors interact. The inflammatory transcription factor NF-κB doesn't simply activate a fixed set of genes in all circumstances. Instead, its activity depends on which other transcription factors are active, which cell type it's operating in, and even the timing and duration of its activation. In immune cells, NF-κB might trigger an inflammatory response, while in neural cells, it might promote survival and plasticity. This context-dependent action mirrors how words in natural language take on different meanings in different contexts.
This symbolic, context-dependent nature of cellular signaling explains why machine learning approaches have been so successful in modeling biological systems. Neural networks, with their distributed representations and context-sensitivity, capture the way cells themselves process information.
What's remarkable is how this cellular symbolism mirrors the way machine learning operates. In a neural network, high-dimensional input data is compressed into latent representations—abstract patterns that capture essential features while discarding noise. Similarly, a transcription factor distills complex environmental information into a binary state (active or inactive) that the cell can use to make decisions.
This concept of latent spaces offers another window into why machine learning aligns so well with biology. In cell biology, high-dimensional data like gene expression profiles or microscopy images can be projected into lower-dimensional spaces that capture meaningful biological variation. Each dimension in this latent space ideally corresponds to some biological process or state—cell cycle phase, differentiation stage, or stress response.
Consider a practical example: single-cell RNA sequencing. This technology can measure the expression of thousands of genes across thousands of individual cells. Traditional analysis might involve selecting a handful of marker genes, using them to identify discrete cell types (e.g., myocytes versus endothelial cells), and then identifying genes that are differentially expressed between cell types—effectively imposing a human-defined latent space onto the data. Machine learning approaches, however, can discover latent spaces directly from the data, revealing biological patterns that might not align with our preconceptions.
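As a toy illustration of a discovered latent space, the sketch below builds a synthetic expression matrix containing two hidden cell populations and projects it into two dimensions with PCA (computed via SVD). The data and population structure are fabricated for the example; the point is that the first latent dimension separates the populations without any hand-picked marker genes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic single-cell expression matrix: 200 cells x 1000 genes.
# Two made-up cell populations differ in a block of 50 genes.
n_cells, n_genes = 200, 1000
X = rng.normal(0, 1, (n_cells, n_genes))
labels = np.array([0] * 100 + [1] * 100)
X[labels == 1, :50] += 3.0  # population 2 up-regulates the first 50 genes

# Project into a 2-D latent space with PCA (SVD on the centered matrix).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
latent = Xc @ Vt[:2].T  # each cell becomes a 2-D point

# The first latent dimension separates the two populations even though
# no marker genes were chosen by hand.
gap = abs(latent[labels == 0, 0].mean() - latent[labels == 1, 0].mean())
print(f"separation along latent dimension 1: {gap:.1f}")
```

Real single-cell pipelines use the same idea with nonlinear methods (autoencoders, UMAP) on top of an initial PCA step.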
These machine-discovered latent spaces often reveal surprising biology. For instance, researchers analyzing immune cell data might discover a latent dimension that separates cells based on their metabolic state—a feature that might not have been captured by traditional marker-based approaches. This mirrors how cells themselves organize information: transcription factors don't neatly map to human-defined categories like "inflammation" or "proliferation," but rather to complex, context-dependent cellular states.
A perfectly disentangled representation would map each latent dimension to exactly one biological process. While this ideal is rarely achieved in practice, even partially disentangled representations can reveal insights about how biological systems organize information.
Again, this mirrors how cells themselves operate. A transcription factor like HSF1 acts as a dimension in the cell's internal latent space—it distills complex environmental information into a single axis of variation that influences multiple downstream processes. Other transcription factors represent other dimensions in this space, collectively forming a compressed representation of the cell's environment.
This recognition has given rise to a new field at the intersection of molecular biology and machine learning, sometimes called "Predictive Biology". Unlike traditional approaches that focus on cataloging molecular functions or mapping interaction networks, Predictive Biology places prediction at the center of biological understanding.
This represents a profound epistemological shift. Where molecular biology asks "what does this molecule do?" and systems biology asks "how do these molecules interact?", Predictive Biology asks "can we predict what happens next?" It suggests that understanding in biology comes not from reducing systems to components, but from building models that can anticipate how biological systems will respond to new conditions.
This shift mirrors how cells themselves operate. A cell doesn't need to understand the physics of heat transfer to respond appropriately to temperature changes. It simply needs internal representations (transcription factors) that reliably predict which proteins will be beneficial under those conditions. The cell's internal model isn't concerned with causality in an abstract sense—it's concerned with prediction.
Consider the practical implications of this approach. Traditional molecular biology might spend years characterizing the function of a single protein, while systems biology might map its interactions with other molecules. These approaches have yielded tremendous insights, but they're limited by their throughput and scalability. Predictive Biology takes a different approach: gather diverse data spanning the range of biological possibilities, train models to predict outcomes from inputs, and use these models to search global solution spaces. For example, rather than methodically testing mutations to a protein one by one, a Predictive Biologist might measure the activity of thousands of random variants, train a model to predict activity from sequence, and then use this model to identify optimal sequences across the entire space of possibilities. This approach has already yielded proteins with unprecedented properties, far from the range that traditional approaches might have explored.
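The train-then-search loop described above can be sketched in a few lines. Everything here is a stand-in: the "assay" is a made-up linear function of sequence plus noise, and ridge regression plays the role of the predictive model. The shape of the workflow—measure random variants, fit sequence-to-activity, rank unseen candidates—is the point.

```python
import numpy as np

rng = np.random.default_rng(1)
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    """Flatten a peptide into a binary feature vector (length x 20)."""
    v = np.zeros((len(seq), len(AA)))
    for i, aa in enumerate(seq):
        v[i, AA.index(aa)] = 1.0
    return v.ravel()

# Hypothetical ground truth: activity is a hidden linear function of sequence.
L = 10
true_w = rng.normal(0, 1, L * len(AA))
def measure_activity(seq):            # stand-in for a lab assay
    return one_hot(seq) @ true_w + rng.normal(0, 0.1)

# Step 1: "assay" a few thousand random variants.
library = ["".join(rng.choice(list(AA), L)) for _ in range(3000)]
X = np.array([one_hot(s) for s in library])
y = np.array([measure_activity(s) for s in library])

# Step 2: train a model to predict activity from sequence (ridge regression).
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Step 3: use the model to rank unseen sequences and pick the best candidate.
candidates = ["".join(rng.choice(list(AA), L)) for _ in range(5000)]
scores = np.array([one_hot(s) for s in candidates]) @ w
best = candidates[int(np.argmax(scores))]
print("top candidate:", best)
```

Real campaigns replace the linear model with deep networks and the random candidate pool with guided search, but the logic is the same.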
What makes biology particularly challenging—and particularly well-suited to machine learning approaches—is its inherent messiness. Biology doesn't respect human-defined categories and rarely follows simple rules without exceptions.
Consider the concept of a gene. We might think of a gene as a discrete unit of heredity with a specific function, but the reality is far messier. A single gene can produce multiple protein variants through alternative splicing, can be regulated differently in different cell types, and can have entirely different functions depending on context.
Take the SOX9 gene as an example. In developing embryos, SOX9 is crucial for male sex determination. In cartilage, it regulates chondrocyte differentiation. In the intestine, it maintains stem cell populations. And in pancreatic cancer, it can promote tumor progression. These aren't simply variations on a theme—they're fundamentally different biological roles for the same gene.
This context-dependent functionality mirrors how words function in natural language. The word "set" has over a dozen different definitions depending on context—it can be a tennis term, a mathematical concept, or a verb meaning "to place." Natural language processing struggled with this complexity until machine learning approaches emerged that could capture contextual meanings.
Similarly, traditional mathematical approaches to biology struggle with this context-dependence. An ordinary differential equation describing SOX9's role in cartilage development would likely fail completely if applied to its role in cancer. Machine learning models, however, can learn these context-dependent relationships directly from data.
The messiness extends beyond individual genes. Biological systems rarely respect the neat categories we impose on them. Is aging a metabolic phenomenon, an inflammatory process, or a consequence of DNA damage? The answer is all of the above and more—these processes interact in complex ways that defy simple categorization. This is precisely where machine learning shines. Rather than forcing biological complexity into human-defined categories, machine learning models can learn patterns directly from data, capturing the messy, context-dependent reality of biological systems.
If machine learning truly is the native language of biology, what does this mean for bioengineering? Just as understanding the mathematics of physics enabled us to build rockets, nuclear power plants, and computers, understanding how machine learning describes biology could unlock a new era of biological design.
We're already seeing glimpses of this future. Protein engineers no longer rely solely on rational design based on physics principles; they use machine learning models trained on evolutionary data to predict which amino acid sequences will fold into functional structures. Metabolic engineers don't attempt to write differential equations for entire cellular pathways; they use machine learning to predict how genetic modifications will affect metabolite production.
But the potential goes far beyond these examples. Imagine designing cellular therapies that can sense and respond to complex disease environments, microbiomes tailored to specific environmental challenges, or organisms engineered to produce materials with unprecedented properties. These applications require navigating biological complexity far beyond what traditional approaches can handle.
This doesn't mean traditional approaches have no value. Just as engineers still use Newton's equations for everyday physics problems, biologists will continue to use mathematical models where they apply. But for the most complex biological problems, machine learning will increasingly be the language of choice.
Moreover, we might expect a new generation of hybrid approaches that combine the interpretability of traditional models with the predictive power of machine learning. For instance, neural ordinary differential equations allow us to incorporate mechanistic understanding into machine learning models, potentially offering the best of both worlds.
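The structural idea behind such hybrids can be shown in miniature: the rate of change of the system state combines a known mechanistic term with a small neural-network correction, and the whole thing is integrated like an ordinary ODE. The network below is untrained and its parameters are arbitrary—a real neural ODE would learn them by backpropagating through the solver—but the sketch shows where mechanism and learning meet.

```python
import numpy as np

rng = np.random.default_rng(2)
# Tiny untrained MLP standing in for the learned part of the dynamics.
W1, b1 = rng.normal(0, 0.1, (8, 1)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (1, 8)), np.zeros(1)

def mechanistic(x):
    return -0.5 * x            # known first-order decay term

def neural_correction(x):
    h = np.tanh(W1 @ x + b1)   # learned residual dynamics (here: random weights)
    return W2 @ h + b2

def dxdt(x):
    # Hybrid derivative: mechanism + neural correction.
    return mechanistic(x) + neural_correction(x)

# Integrate from x(0) = 1 to t = 1 with Euler's method.
x, dt = np.array([1.0]), 0.01
for _ in range(100):
    x = x + dt * dxdt(x)
print(f"x(1) = {float(x[0]):.3f}")
```

Because the correction term is small, the trajectory stays close to the pure-decay solution e^(-0.5t); training would push the correction toward whatever the mechanistic term misses.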
We're standing at a Galileo moment for biology. After centuries of struggling to read the book of life in the language of mathematics, we may have finally found its native tongue. The universe, as Galileo noted, is written in the language of mathematics. But life, it seems, speaks the language of patterns, abstractions, symbols, and predictions—a language that we translate with machine learning.
Did you enjoy this piece? If so, you may also want to check out other articles in Decoding Biology’s Data Science & Machine Learning collection.