Reliable OOD Virtual Screening with Extrapolatory Pseudo-Label Matching

Yunni Qu, Bhargav Vaduri, Karthikeya Jatoth, James Wellnitz, Dzung Dinh, Seth Veenbaas, Jonathan Chapman, Alexander Tropsha, Junier Oliva

arXiv:2406.01825 · cs.LG · Published 2024-06-03 · Updated 2026-03-24

Machine learning (ML) models are increasingly deployed for virtual screening in drug discovery, where the goal is to identify novel, chemically diverse scaffolds while minimizing experimental costs. This creates a fundamental challenge: the most valuable discoveries lie in out-of-distribution (OOD) regions beyond the training data, yet ML models often degrade under distribution shift. Standard novelty-rejection strategies ensure reliability within the training domain but limit discovery by rejecting precisely the novel scaffolds most worth finding. Moreover, experimental budgets permit testing only a small fraction of nominated candidates, demanding models that produce reliable confidence estimates. We introduce EXPLOR (Extrapolatory Pseudo-Label Matching for OOD Uncertainty-Based Rejection), a framework that addresses both challenges through extrapolatory pseudo-labeling on latent-space augmentations, requiring only a single labeled training set and no access to unlabeled test compounds, mirroring the realistic conditions of prospective screening campaigns. Through a multi-headed architecture with a novel per-head matching loss, EXPLOR learns to extrapolate to OOD chemical space while producing reliable confidence estimates, with particularly strong performance in high-confidence regions, which is critical for virtual screening where only top-ranked candidates advance to experimental validation. We demonstrate state-of-the-art performance across chemical and tabular benchmarks using different molecular embeddings.
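The emphasis on high-confidence regions can be made concrete with a small selective-prediction sketch. It is not EXPLOR itself and is not taken from the paper: it only illustrates, on invented toy scores, how one might measure accuracy on the most confident fraction of binary predictions. The function name `accuracy_at_coverage`, the confidence definition (distance of the predicted probability from 0.5), and the example numbers are all assumptions made for illustration.

```python
def accuracy_at_coverage(probs, labels, coverage):
    """Accuracy on the most confident `coverage` fraction of binary predictions.

    Hypothetical helper for illustration; not part of EXPLOR.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    # Assumed confidence measure: distance of the predicted probability from 0.5.
    ranked = sorted(
        zip(probs, labels),
        key=lambda pair: abs(pair[0] - 0.5),
        reverse=True,
    )

    # Keep only the top-ranked (most confident) predictions; reject the rest.
    n_keep = max(1, int(round(coverage * len(ranked))))
    kept = ranked[:n_keep]

    # A prediction is correct when thresholding at 0.5 matches the label.
    correct = sum((p >= 0.5) == bool(y) for p, y in kept)
    return correct / n_keep


# Invented toy scores and labels, not results from the paper.
probs = [0.97, 0.91, 0.08, 0.62, 0.45, 0.55, 0.12, 0.70]
labels = [1, 1, 0, 0, 1, 1, 0, 1]

top_half = accuracy_at_coverage(probs, labels, 0.5)
everything = accuracy_at_coverage(probs, labels, 1.0)
```

On this toy data the most confident half is classified more accurately than the full set, which is the behavior a screening campaign relies on when only top-ranked candidates are tested.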

Topics: Generative Design & Molecule Optimization, Molecular Representation & Learning, Property Prediction & ADMET

Tags: chemical-space, drug-discovery, molecular-representation

arXiv categories: cs.LG, cs.AI
