Tokenization for Molecular Foundation Models

Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

arXiv:2409.15370 · cs.LG · Published 2024-09-19 · Updated 2025-07-08

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in materials science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, facilitating applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
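To make the closed-vocabulary limitation concrete, here is a minimal sketch of how coverage gaps can appear. It is not the paper's evaluation code and not the Smirk tokenizer. The regular expression is a simplified atom-level SMILES pattern chosen for illustration, and the names `tokenize` and `unk_rate`, along with the toy molecules, are hypothetical.

```python
import re

# Simplified atom-level SMILES pattern (an assumption for illustration only;
# it does not cover the full OpenSMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.]"
)


def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)


def unk_rate(smiles_list, vocab):
    """Fraction of tokens that a closed vocabulary would map to an unknown token."""
    tokens = [tok for smi in smiles_list for tok in tokenize(smi)]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)


# Build a closed vocabulary from a tiny, hypothetical training set.
train = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = {tok for smi in train for tok in tokenize(smi)}

# Bracket atoms such as [Fe+2] or [C@H] are single tokens, so any bracket
# atom absent from the training data falls outside the vocabulary.
test = ["CCO", "[Fe+2]", "C[C@H](N)C(=O)O"]
rate = unk_rate(test, vocab)
print(f"unknown-token rate: {rate:.2f}")
```

In this toy setup, every bracket atom is a single vocabulary entry, so the set of possible tokens is effectively unbounded. An open-vocabulary scheme avoids this by decomposing such tokens into smaller reusable pieces.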

Topics: Property Prediction & ADMET

Tags: molecular-representation, property-prediction

arXiv categories: cs.LG, cs.AI, physics.chem-ph, q-bio.BM
