Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems

Mohammed Azeez Khan, Aaron D'Souza, Vijay Choyal

arXiv:2601.06916 · cs.LG · Published 2026-01-11 · Updated 2026-01-21

Efficient materials discovery requires reducing costly first-principles calculations for training machine-learned interatomic potentials (MLIPs). We develop an active learning (AL) framework that iteratively selects informative structures from the Materials Project and Open Quantum Materials Database (OQMD) using compositional and property-based descriptors with a neural network ensemble model. Query-by-Committee enables real-time uncertainty quantification. We compare four strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach. Experiments across four material systems (C, Si, Fe, and TiO2) with 5 random seeds demonstrate that diversity sampling achieves competitive or superior performance, with 10.9% improvement on TiO2. Our approach achieves equivalent accuracy with 5-13% fewer labeled samples than random baselines. The complete pipeline executes on Google Colab in under 4 hours per system using less than 8 GB RAM, democratizing MLIP development for resource-limited researchers. Open-source code and configurations are available on GitHub. This multi-system evaluation provides practical guidelines for data-efficient MLIP training and highlights integration with symmetry-aware architectures as a promising future direction.
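The two selection ideas named in the abstract, Query-by-Committee uncertainty and farthest-point diversity selection, can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function names, the use of the standard deviation across ensemble members as the disagreement score, and the Euclidean distance on descriptor vectors are all assumptions made for illustration, and the k-means clustering stage that precedes farthest-point refinement in the paper is omitted here.

```python
import numpy as np


def committee_uncertainty(predictions):
    """Disagreement score per candidate (assumed: std across ensemble members).

    predictions: array of shape (n_models, n_candidates).
    """
    predictions = np.asarray(predictions, dtype=float)
    return predictions.std(axis=0)


def select_most_uncertain(predictions, batch_size):
    """Indices of the batch_size candidates with the largest disagreement."""
    scores = committee_uncertainty(predictions)
    # argsort is ascending, so reverse it to put the most uncertain first.
    return np.argsort(scores)[::-1][:batch_size]


def farthest_point_selection(features, batch_size, start=0):
    """Greedy farthest-point selection on descriptor vectors.

    features: array of shape (n_candidates, n_features).
    start: index of the first selected point (hypothetical default).
    """
    features = np.asarray(features, dtype=float)
    selected = [start]
    # Distance from every candidate to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    while len(selected) < batch_size:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)


if __name__ == "__main__":
    # Toy ensemble of 3 models scoring 3 candidate structures.
    preds = [[1.0, 2.0, 3.0],
             [1.0, 2.5, 5.0],
             [1.0, 1.5, 1.0]]
    print(select_most_uncertain(preds, batch_size=2))

    # Toy 2-D descriptors for 4 candidate structures.
    feats = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]]
    print(farthest_point_selection(feats, batch_size=3))
```

A hybrid strategy could, for example, restrict farthest-point selection to the most uncertain candidates; how the paper combines the two criteria is not specified in the abstract.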

Topics: Large Language Models & Materials, Quantum Chemistry & Force Fields

Tags: ab-initio, active-learning, materials-discovery, mlip

arXiv categories: cs.LG
