Learning where to learn: Training data distribution optimization for scientific machine learning

Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang

arXiv:2505.21626 · cs.LG · Published 2025-05-27 · Updated 2025-12-05

In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
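As a rough illustration of the learning-where-to-learn problem described above, the following toy sketch treats it as a two-level problem: the inner level fits a model on samples from a candidate training distribution, and the outer level picks the candidate with the lowest average error across several deployment regimes. Everything here is an assumption made for illustration and is not taken from the paper: the 1D target `sin(x)`, the cubic least-squares model, the Gaussian training and deployment families, and all function names. In particular, the outer search is a plain grid search over Gaussian parameters, standing in for the bilevel and alternating optimization methods the abstract mentions.

```python
import numpy as np


def target(x):
    # Hypothetical ground-truth map to be learned (an assumption for this toy).
    return np.sin(x)


def features(x):
    # Cubic polynomial features; the "model" is linear in these features.
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)


def fit(train_mean, train_std, n=200, seed=0):
    # Inner problem: least-squares fit on samples from N(train_mean, train_std^2).
    rng = np.random.default_rng(seed)
    x = rng.normal(train_mean, train_std, size=n)
    coef, *_ = np.linalg.lstsq(features(x), target(x), rcond=None)
    return coef


def deployment_error(coef, deploy_means, deploy_std=0.5, n=2000, seed=1):
    # Outer objective: mean squared error averaged over deployment regimes.
    rng = np.random.default_rng(seed)
    errors = []
    for m in deploy_means:
        x = rng.normal(m, deploy_std, size=n)
        errors.append(np.mean((features(x) @ coef - target(x)) ** 2))
    return float(np.mean(errors))


def outer_objective(theta, deploy_means):
    # Train on the distribution parameterized by theta, then score on deployment.
    return deployment_error(fit(*theta), deploy_means)


# Hypothetical deployment regimes, centered away from a naive training choice.
deploy_means = [-2.0, 0.0, 3.0]

# Nonadaptive baseline: a narrow training distribution centered at zero.
baseline = (0.0, 0.5)

# Candidate training distributions (mean, std); the baseline is included
# so the selected design can never score worse than it.
candidates = [baseline] + [
    (float(m), s) for m in np.linspace(-3.0, 3.0, 13) for s in (0.5, 1.0, 2.0)
]

best = min(candidates, key=lambda th: outer_objective(th, deploy_means))
baseline_err = outer_objective(baseline, deploy_means)
best_err = outer_objective(best, deploy_means)

print("baseline design:", baseline, "avg deployment MSE:", baseline_err)
print("selected design:", best, "avg deployment MSE:", best_err)
```

Fixed random seeds make the comparison deterministic, so differences between designs reflect the choice of training distribution and not sampling noise. A grid search only scales to very low-dimensional parametric families; it is used here purely to keep the sketch short.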

Topics: Scientific Machine Learning & PINNs

Tags: partial-differential-equations, scientific-machine-learning

arXiv categories: cs.LG, math.OC, stat.ML
