Google for the Tree of Life: How a Biological Search Engine is Revolutionizing Life Sciences

Hajk-Georg Drost discusses the broad biosphere genomics applications of DIAMOND in front of the ‘DIAMOND-Miner’, the prominent workstation in his lab that ran the initial benchmarks for their Nature Methods publication.
Supercomputer at the MPCDF in Garching | The experimental study to simulate tree-of-life scale protein searches in the Earth Biogenome era was performed in collaboration with Klaus Reuter and John Kennedy at the MPCDF using >20,000 CPU cores.

"To preserve the biodiversity on our planet, we need to decode the genetic material of the species that are still alive today and try to learn as much as possible from its composition", explains Hajk-Georg Drost the main motivation behind his work at the Max Planck Institute (MPI) for Biology in Tübingen. He and his doctoral student Benjamin Buchfink have developed the next generation of "DIAMOND," a biological search engine for protein sequences. The scientific software makes it possible to compare protein segments across all organisms in the tree of life and seeks to provide important insights for biodiversity research, as well as for medicine in the fight against diseases.

The first prototype of the biological search engine "DIAMOND" (Double Index Alignment of Next Generation Sequencing) was initially developed for specialized microbiome analyses by Benjamin Buchfink in the research group of Daniel Huson, Professor of Algorithmic Bioinformatics at the University of Tübingen. Equipped with this initial version of the software, Benjamin Buchfink joined the MPI and became a member of the Research Group for Computational Biology, newly founded in 2019 by Hajk-Georg Drost in the Department of Molecular Biology. As group leader, Hajk-Georg immediately realized the great potential of DIAMOND and its programmer: "Benjamin is the best C++ coder I know", he proudly admits.

When Hajk-Georg first learned about DIAMOND, he immediately envisioned the application for the Earth Biogenome Project and the advancement of DIAMOND into the field of biodiversity genomics. The Earth Biogenome Project is an intercontinental consortium that aims to decode the genomes of 1.5 million eukaryotic species (all organisms that have a nucleus in their cells, like animals, plants, algae or fungi) between 2020 and 2030 in order to conserve their genetic makeup before extinction. "Our big vision is to learn as much as possible from this vast genetic resource to facilitate the future of molecular life sciences", Hajk-Georg points out. "However, searching against 1.5 million eukaryotic genomes results in a huge amount of output data that can easily contain several hundred terabytes of search results and even up to several petabytes for very ambitious research questions. With the currently established software 'BLAST' it would take more than 100 years to compare all 1.5 million genomes against each other. But if we want to learn from these genomes to achieve progress today, we cannot afford to wait 100 years."

MPCDF in Garching | The future of biology is data-driven and will continue to rely on intelligent open-source software able to scale in the computing cloud.

Benjamin and Hajk-Georg did succeed in developing DIAMOND further, making it significantly faster and more precise. In collaboration with Klaus Reuter and John Kennedy from the Max Planck Computing and Data Facility (MPCDF) in Garching, the two bioinformaticians tested the scalability of DIAMOND on a supercomputer and showed that DIAMOND runs up to 10,000 times faster than the previous gold standard, BLAST. This saves valuable time, allowing scientists to focus on the next steps in addressing the loss of biodiversity rather than waiting for a sequence search to finish. The study introducing the next generation of DIAMOND was published in the prominent scientific journal Nature Methods in 2021.

The daily life of Hajk-Georg and Benjamin takes place exclusively in front of the computer. "It can easily happen that one works on a critical part of the program code for several days in a row while only focusing on a few lines of code. That can be very frustrating sometimes", Benjamin admits when asked about setbacks in the development process. "But when these few lines of code can later make the difference between the tool running 1,000 times faster or not, then all this effort was worth it", he adds.

For Hajk-Georg, the DIAMOND software is only one of many scientific software projects his group is currently working on. His academic ambition is not new. Even the results of his bachelor thesis at the University of Halle led to a publication that became the cover story of the internationally renowned journal Nature in 2012. Later in his scientific career, at the University of Cambridge in England, he continued to publish important findings in genomics and epigenetics and developed open-source software that can be used to answer fundamental questions in many areas of biology and medicine.

For the future, Hajk-Georg and Benjamin hope for more manpower. "It is only thanks to the generous resources of our department, wonderful collaborations, and the Max Planck Society in general that our team manages to catalyze the work of thousands of life scientists by equipping them with software to automate genomic data retrieval, sequence search, and predictive analytics at tree-of-life scale. Still, I think it's a pity that there is so little investment in scientific software development and predictive modeling in life sciences, despite the enormous academic value it generates for this data-driven research community", notes Hajk-Georg. It is therefore particularly remarkable that the two bioinformaticians have solved such a big problem in natural science with their new version of DIAMOND. "If we had just two or three more people, we could develop so much more, and faster", Hajk-Georg states. "I've already submitted grant proposals and I really hope that funding agencies will also see the potential of our research line." Either way, they have already started working on the next groundbreaking steps in biodiversity genomics: with the help of the MPCDF in Garching, they are attempting to employ open-source data technologies used by Google, Facebook, and Amazon to build an efficient distributed database infrastructure able to analyze petabyte-scale DIAMOND search results. This would greatly simplify data analytics and predictive modeling within the Earth Biogenome Project and would benefit many other studies across the life science sector.