DSSS - Sensitive clustering of 20 billion protein sequences at tree-of-life scale using DIAMOND2 DeepClust
- Date: Jun 2, 2023
- Time: 03:00 PM - 04:00 PM (Local Time Germany)
- Speaker: Hajk-Georg Drost
- MPI for Biology Tübingen
- Location: NO.002, MPI für Intelligente Systeme
Our understanding of the origin and natural variation of the global biosphere is largely derived from morphological insights with data collections reaching back to the time of Aristotle. Sequencing the genomes and annotating the protein sequences across the tree of life will transform our access to evolutionary information and may provide a roadmap to characterizing the molecular principles underlying biodiversification. The key to accessing this reservoir of genomic information for molecular exploration and functional annotation is the comparative method, usually enabled by sequence similarity assessments. We introduce DIAMOND2 DeepClust, an ultra-fast and sensitive sequence clustering method optimized to perform protein sequence similarity clustering at low identity levels (e.g. down to 20% identity). Using DIAMOND2 DeepClust, we present an experimental study based on clustering the protein universe, currently comprising ~20 billion protein sequences, and show how to overcome computational bottlenecks in the biosphere genomics era.