Spatial statistics for screening molecular structures

Pranoy Ray, Surya R. Kalidindi

arXiv:2605.17147·cond-mat.mtrl-sci·Published 2026-05-16

The dominant paradigm in computational materials discovery relies on heavily parameterized deep architectures, including message-passing graph networks and equivariant models, that require millions of DFT-labeled training structures and produce non-convex latent representations that complicate continuous optimization for inverse design. These architectures are impractical in data-scarce regimes, which is the typical case in molecular screening, and exhibit well-documented limitations in capturing chemically disordered configurations and chiral geometries. This review presents feature engineering based on spatial statistics as a physically rigorous and immediately deployable alternative. Molecular structures are encoded as voxelized scalar fields, and two-point auto- and cross-correlations are evaluated deterministically via Fast Fourier Transforms, explicitly transferring the burden of spatial pattern recognition from the learning algorithm to a closed-form, physics-informed operation. Principal component analysis of the resulting correlation maps yields low-dimensional, strictly convex representations that support lean neural networks (<100k trainable parameters) and non-parametric surrogate models, achieving sub-2% prediction error with as few as 10 training samples. Demonstrated across periodic crystals, chemically disordered high-entropy alloys, and non-periodic organic molecules, this framework enables Bayesian active learning and zero-shot extrapolation on commodity hardware, which current large-scale architectures cannot replicate at equivalent data budgets.
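The pipeline the abstract describes (voxelized field, FFT-based two-point correlations, then PCA) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes periodic boundaries, uses synthetic random fields in place of real molecular structures, and the function names `two_point_correlation` and `pca_scores` are hypothetical. Non-periodic molecules would need padding or another boundary treatment that is not shown here.

```python
import numpy as np

def two_point_correlation(field_a, field_b=None):
    """Periodic two-point correlation f(r) = mean_x a(x) * b(x + r), via FFT.

    With field_b omitted this is the autocorrelation of field_a.
    (Hypothetical helper; assumes periodic boundary conditions.)
    """
    a = np.asarray(field_a, dtype=float)
    b = a if field_b is None else np.asarray(field_b, dtype=float)
    fa = np.fft.fftn(a)
    fb = np.fft.fftn(b)
    # Correlation theorem: ifft(conj(A) * B)[r] = sum_x a(x) * b(x + r)
    corr = np.fft.ifftn(np.conj(fa) * fb).real / a.size
    # Move the zero-shift vector to the centre of the map
    return np.fft.fftshift(corr)

def pca_scores(maps, n_components):
    """Project flattened correlation maps onto their leading principal components."""
    X = maps.reshape(len(maps), -1)
    X = X - X.mean(axis=0)  # centre each voxel feature across samples
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Synthetic stand-in data: 10 random scalar fields on an 8x8x8 voxel grid
rng = np.random.default_rng(0)
fields = rng.random((10, 8, 8, 8))

maps = np.stack([two_point_correlation(f) for f in fields])
scores = pca_scores(maps, n_components=3)  # low-dimensional features per structure
```

The rows of `scores` are the kind of low-dimensional features that a small regressor or a non-parametric surrogate could then be trained on. The choice of three components here is arbitrary.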

Topics: Process Modeling & System Identification

Tags: active-learning, dft, materials-discovery, surrogate-modeling

arXiv categories: cond-mat.mtrl-sci, cond-mat.dis-nn, cond-mat.mes-hall
