Isotropy and Geometry of Pretrained Protein LMs

Sheikh Azizul Hakim, Kowshic Roy, M Saifur Rahman

arXiv:2510.10655 · q-bio.OT · Published 2025-10-12

Large pretrained language models have transformed natural language processing, and their adaptation to protein sequences -- viewed as strings of amino acid characters -- has advanced protein analysis. However, the distinct properties of proteins, such as variable sequence lengths and the lack of word-sentence analogs, necessitate a deeper understanding of protein language models (LMs). We investigate the isotropy of protein LM embedding spaces using average pairwise cosine similarity and the IsoScore method, revealing that models like ProtBERT and ProtXLNet are highly anisotropic, utilizing only 2--14 dimensions for global and local representations. In contrast, multi-modal training in ProteinBERT, which integrates sequence and gene ontology data, enhances isotropy, suggesting that diverse biological inputs improve representational efficiency. We also find that embedding distances correlate only weakly with alignment-based similarity scores, particularly at low similarity.
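
As a rough illustration of the first isotropy measure named above, the following minimal sketch computes average pairwise cosine similarity on synthetic vectors. It does not reproduce the authors' protocol: the abstract does not state which layers, tokens, or pooling were used, so the toy data, the dimensions, and the helper names `cosine` and `average_pairwise_cosine` are assumptions made here for illustration. A value near 0 suggests directions are spread evenly; a value near 1 suggests the embeddings crowd into a narrow cone, which is the usual signature of anisotropy.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pairwise_cosine(embeddings):
    """Mean cosine similarity over all unordered pairs of distinct embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data standing in for protein LM embeddings.
rng = random.Random(0)
dim, n = 16, 200

# Roughly isotropic cloud: independent zero-mean Gaussian coordinates.
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Anisotropic cloud: the same noise plus a large offset shared by every
# vector, so all embeddings point in nearly the same direction.
anisotropic = [[x + 5.0 for x in vec] for vec in isotropic]

iso_score = average_pairwise_cosine(isotropic)
aniso_score = average_pairwise_cosine(anisotropic)

print(f"isotropic cloud:   {iso_score:.3f}")
print(f"anisotropic cloud: {aniso_score:.3f}")
```

In this toy setup the shared offset alone drives the average similarity up, which mirrors the common observation that a dominant mean direction accounts for much of the measured anisotropy. The loop over all pairs is quadratic in the number of embeddings, so a real analysis would likely subsample pairs or use a vectorized library.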

Topics: Large Language Models & Materials

Tags: protein-llm
