BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
Xin Wang, Carlos Oliver
arXiv:2510.01632 · q-bio.BM · Published 2025-10-02
Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as k-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures that is used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
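To make the partition-then-quantize idea concrete, here is a minimal forward-pass sketch in NumPy. It is not the authors' implementation: the function names (`pool_blobs`, `quantize`, `protein_embedding`), the mean pooling, the Euclidean nearest-neighbour lookup, and the final averaging are all assumptions chosen for illustration. The blob assignment is taken as given, and the differentiable machinery described in the abstract is omitted.

```python
import numpy as np

def pool_blobs(residue_emb, blob_ids, n_blobs):
    """Mean-pool residue embeddings within each blob.

    Assumes a hard, non-overlapping assignment: blob_ids[i] is the single
    blob that residue i belongs to (hypothetical input, not from the paper).
    """
    dim = residue_emb.shape[1]
    sums = np.zeros((n_blobs, dim))
    np.add.at(sums, blob_ids, residue_emb)  # scatter-add residues into blobs
    counts = np.bincount(blob_ids, minlength=n_blobs).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]  # avoid division by zero

def quantize(blob_emb, codebook):
    """Map each blob embedding to its nearest codebook vector (Euclidean)."""
    diff = blob_emb[:, None, :] - codebook[None, :, :]
    dist_sq = (diff ** 2).sum(axis=-1)  # shape: (n_blobs, n_codes)
    codes = dist_sq.argmin(axis=1)      # discrete "vocabulary" indices
    return codes, codebook[codes]

def protein_embedding(residue_emb, blob_ids, n_blobs, codebook):
    """Assumed readout: average the quantized blob vectors."""
    blob_emb = pool_blobs(residue_emb, blob_ids, n_blobs)
    codes, quantized = quantize(blob_emb, codebook)
    return codes, quantized.mean(axis=0)
```

Calling `protein_embedding` with per-residue embeddings from any encoder, an integer blob index per residue, and a codebook array returns the discrete code assigned to each blob together with a single protein-level vector. The hard `argmin` used here is not differentiable, so a trainable version would need a different mechanism for that step.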
Topics: Molecular Representation & Learning
Tags: protein-function
arXiv categories: q-bio.BM, cs.AI