Notation-level confounding: When inconsistent molecular notations mislead chemical language models

Yosuke Kikuchi, Yasuhiro Yoshikai, Shumpei Nemoto, Ayako Furuhama, Takashi Yamada, Hiroyuki Kusuhara, Tadahaya Mizuno

arXiv:2505.07139 · q-bio.QM · Published 2025-05-11 · Updated 2026-02-12

Chemical language models (CLMs) are increasingly used for molecular design and property prediction. Because these models learn from textual encodings of molecules, differences in how such encodings are generated may affect their behavior. In cheminformatics, the term canonical SMILES implies a single standardized notation, yet different toolkits define distinct canonicalization rules, yielding multiple canonical strings for the same molecule. To examine how this variability arises and why it matters, we surveyed 264 CLM papers in PubMed and found that about half did not specify their canonicalization procedure, limiting transparency and reproducibility. Using a molecular translation framework, we show that when multiple valid notations are mixed or left undocumented, inconsistent notations distort latent representations and, in some benchmarks, can spuriously inflate predictive accuracy, a phenomenon we term notation-level confounding. These findings demonstrate how subtle differences in SMILES generation can mislead CLMs and highlight the importance of explicitly reporting preprocessing tools and settings.
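As a rough illustration of what notation-level confounding can look like, the Python sketch below uses only the standard library. It is not the authors' molecular translation framework. The molecules, labels, and helper names (`notation_style`, `style_only_accuracy`, `MIXED`, `CONSISTENT`) are invented for this example, and the style detector is a crude heuristic rather than a SMILES parser. Each string is a valid SMILES, and the two datasets encode the same six molecules. In `MIXED`, the positive class happens to be written in aromatic (lowercase) form and the negative class in Kekulé form, as might occur if the two classes came from sources preprocessed with different toolkits.

```python
import re
from collections import Counter, defaultdict

# Outside brackets, the only lowercase letters in SMILES are aromatic atom
# symbols plus the "l" of Cl and the "r" of Br.
AROMATIC_SYMBOLS = set("bcnops")

def notation_style(smiles):
    """Crude heuristic (assumption, not a parser): report 'aromatic' if any
    lowercase aromatic atom symbol appears outside bracket atoms."""
    outside = re.sub(r"\[[^\]]*\]", "", smiles)  # drop bracket atoms
    if AROMATIC_SYMBOLS & set(outside):
        return "aromatic"
    return "kekule_or_aliphatic"

def style_only_accuracy(records):
    """Accuracy of guessing each label from notation style alone, using the
    majority label observed for that style (no chemistry involved)."""
    by_style = defaultdict(Counter)
    for smiles, label in records:
        by_style[notation_style(smiles)][label] += 1
    correct = sum(max(counts.values()) for counts in by_style.values())
    return correct / len(records)

# Invented toy data: notation style coincides with the label.
MIXED = [
    ("c1ccccc1", 1),        # benzene, aromatic form
    ("c1ccccc1O", 1),       # phenol, aromatic form
    ("c1ccncc1", 1),        # pyridine, aromatic form
    ("C1=CC=CC=C1C", 0),    # toluene, Kekule form
    ("C1=CC=C(N)C=C1", 0),  # aniline, Kekule form
    ("C1=CC=CC=C1Cl", 0),   # chlorobenzene, Kekule form
]

# Same molecules and labels, all written in aromatic form.
CONSISTENT = [
    ("c1ccccc1", 1),
    ("c1ccccc1O", 1),
    ("c1ccncc1", 1),
    ("Cc1ccccc1", 0),
    ("Nc1ccccc1", 0),
    ("Clc1ccccc1", 0),
]

print(style_only_accuracy(MIXED))       # 1.0
print(style_only_accuracy(CONSISTENT))  # 0.5
```

In the mixed dataset a predictor that looks only at writing style scores perfectly, even though style carries no chemical information. Once every string follows one convention, the same predictor falls to chance. A check of this kind is only a sanity test on preprocessing and does not replace reporting the toolkit and settings used to generate the SMILES.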

Topics: Generative Design & Molecule Optimization

Tags: chemical-llm, molecular-generation, property-prediction

arXiv categories: q-bio.QM
