OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Graph Language Foundation Modeling
Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Guntaas Shergill, Nicholas Hadas, Lars Schimmelpfennig, Levi Kaster, Di Huang, Guangfu Li, S. Peter Goedegebuure, David DeNardo, Li Ding, Ryan C. Fields, J Philip Miller, Pirooz Eghtesady, Carlos Cruchaga, William Buchser, Jonathan Cooper, Marco Sardiello, Patricia Dickson, Yixin Chen, Michael Province, Philip Payne, Fuhai Li
arXiv:2504.02148·cs.AI·Published 2025-04-02·Updated 2026-02-03
With the rapid growth of large-scale single-cell omic datasets, omic foundation models (FMs) have emerged as powerful tools for advancing research in life sciences and precision medicine. However, most existing omic FMs rely primarily on numerical transcriptomic data, treating sorted genes as sequences, and lack explicit integration of the biomedical prior knowledge and signaling interactions that are critical for scientific discovery. Here, we introduce the Text-Omic Signaling Graph (TOSG), a novel data structure that unifies human-interpretable biomedical textual knowledge, quantitative omic data, and signaling network information. Using this framework, we construct OmniCellTOSG, a large-scale resource comprising approximately half a million meta-cell TOSGs derived from around 80 million single-cell and single-nucleus RNA-seq profiles across organs and diseases. We further develop CellTOSG-FM, a multimodal graph language FM, to jointly analyze textual, omic, and signaling network context. Across diverse downstream tasks, CellTOSG-FM outperforms existing omic FMs and provides interpretable insights into disease-associated targets and signaling pathways.
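To make the idea of a TOSG concrete, the following is a minimal illustrative sketch of a graph whose nodes carry both a text description and an omic value, with directed signaling edges between them. It is not the OmniCellTOSG schema or API: the class names (`EntityNode`, `TextOmicSignalingGraph`), the field names, the method `downstream`, and the placeholder entities `GENE_A` and `GENE_B` with their values are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityNode:
    """One biological entity (hypothetical layout, not the dataset's schema)."""
    name: str          # identifier such as a gene symbol
    description: str   # human-interpretable biomedical text
    expression: float  # quantitative omic measurement for one meta-cell


@dataclass
class TextOmicSignalingGraph:
    """Toy container joining text, omic values, and signaling edges."""
    nodes: dict[str, EntityNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # (source, target)

    def add_node(self, node: EntityNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, source: str, target: str) -> None:
        # Only allow signaling edges between entities already in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, target))

    def downstream(self, name: str) -> list[str]:
        """Return the direct signaling targets of the named entity."""
        return [target for source, target in self.edges if source == name]


# Placeholder entities and values, invented purely for this example.
tosg = TextOmicSignalingGraph()
tosg.add_node(EntityNode("GENE_A", "Placeholder text for an upstream receptor.", 2.5))
tosg.add_node(EntityNode("GENE_B", "Placeholder text for a downstream effector.", 0.7))
tosg.add_edge("GENE_A", "GENE_B")
print(tosg.downstream("GENE_A"))
```

The point of the sketch is only that each node holds two modalities (text and a number) while the edge list holds the third (signaling topology), which is the combination the abstract describes a graph language FM as analyzing jointly.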
Topics: Generative Models & Discovery
Tags: scientific-discovery
arXiv categories: cs.AI, cs.LG