MetaGraph compresses data archives into a search engine for scientists



The Internet has Google. Now biology has MetaGraph. Detailed today in Nature¹, the search engine can quickly sift through the staggering volumes of biological data housed in public repositories.

“It’s a huge achievement,” says Rayan Chikhi, a biocomputing researcher at the Pasteur Institute in Paris. “They set a new standard” for analysing raw biological data — including DNA, RNA and protein sequences — from databases that can contain millions of billions of DNA letters, amounting to ‘petabases’ of information, more entries than all the webpages in Google’s vast index.

Although MetaGraph is tagged as ‘Google for DNA’, Chikhi likens the tool to a search engine for YouTube, because the tasks are more computationally demanding. In the same way that YouTube searches can retrieve every video that features, say, red balloons even when those key words don’t appear in the title, tags or description, MetaGraph can uncover genetic patterns hidden deep within expansive sequencing data sets without needing those patterns to be explicitly annotated in advance.

“It enables things that cannot be done in any other way,” Chikhi says.


Indexing life’s library

The motivation behind MetaGraph was to address an accessibility problem in sequencing data sets. The size of these repositories has risen at a blistering pace in the past few decades, but this growth has presented challenges for the scientists using the data they contain. Raw sequencing reads are fragmented, noisy and too numerous to search directly. “The volume of the data, paradoxically, is the main inhibitor of us actually using the data,” says Babaian.

According to study author André Kahles, a bioinformatician at the Swiss Federal Institute of Technology (ETH) Zurich in Switzerland, MetaGraph could help researchers to ask biological questions of repositories such as the Sequence Read Archive (SRA), a public database containing in excess of 100 million billion DNA letters².

The researchers tackled the problem using mathematical ‘graphs’ that link overlapping DNA fragments together, much like sentences that share the same words lining up in a book index.
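The idea of linking overlapping fragments and then looking them up like index entries can be illustrated with a toy example. The sketch below assumes a de Bruijn-style graph, in which each node is a short substring (a k-mer) and an edge joins two k-mers that overlap in a read, plus a table recording which samples contain each k-mer. The function names, the tiny k-mer length, the sample names and the 80% match threshold are all hypothetical choices made for illustration; this is not MetaGraph's actual code or interface, and it omits the compression that makes the real index practical.

```python
from collections import defaultdict

K = 4  # k-mer length; chosen tiny for illustration (assumption)


def kmers(seq, k=K):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k=K):
    """Build a toy annotated graph from {sample name: [reads]}.

    edges:  k-mer -> set of k-mers that directly follow it in some read
    labels: k-mer -> set of sample names in which the k-mer occurs
    """
    edges = defaultdict(set)
    labels = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            previous = None
            for kmer in kmers(read, k):
                labels[kmer].add(name)
                if previous is not None:
                    edges[previous].add(kmer)
                previous = kmer
    return edges, labels


def search(query, labels, k=K, min_fraction=0.8):
    """Return {sample: fraction of the query's k-mers found in that sample}."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        for name in labels.get(kmer, ()):
            hits[name] += 1
    total = len(query_kmers)
    return {n: c / total for n, c in hits.items() if c / total >= min_fraction}


# Hypothetical samples: two short reads that overlap, and one unrelated read.
samples = {
    "gut_sample": ["ACGTAC", "GTACGG"],
    "soil_sample": ["TTGACC"],
}
edges, labels = build_index(samples)

# The shared k-mer "GTAC" stitches the two gut reads into one path.
print(sorted(edges["GTAC"]))     # ['TACG']

# The query spans both reads, so no single read contains it in full.
result = search("CGTACG", labels)
print(result)                    # {'gut_sample': 1.0}
```

The query matches even though it straddles two separate reads, which is the benefit of indexing overlapping pieces rather than whole reads.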

The researchers integrated data from seven publicly funded data repositories, creating 18.8 million unique DNA and RNA sequence sets and 210 billion amino-acid sequence sets across all clades of life, spanning viruses, bacteria, fungi, plants and animals, including humans. They also developed a search engine for these sequences, in which users enter text prompts to search the integrated archives of raw data.

“It is a totally new way to interact with this body of data,” says Kahles. “It’s compressed, but accessible on the fly.”


To demonstrate the utility of MetaGraph, the study authors used it to scan 241,384 human gut microbiome samples for genetic indicators of antibiotic resistance around the world, building on work that used an earlier version of the tool to track drug-resistance genes in bacterial strains that live in subway systems across major urban centres³. The authors say they performed the analysis in about an hour on a high-powered computer.