Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

Vivin Vinod, Peter Zaspel

arXiv:2410.11392 · physics.chem-ph · Published 2024-10-15 · Updated 2025-03-25

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods, in which training data of differing accuracies, or fidelities, are used together. These methods usually employ a fixed scaling factor, $γ$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $γ$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC-compute-time-informed scaling factors, denoted $θ$, that vary with the QC compute times at the different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of the contribution of each fidelity to the model error. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $Γ$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
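To make the role of the scaling factor concrete, the following is a minimal sketch of how a fixed $γ$ could set the training-set size at each fidelity and how the resulting data-generation cost might be tallied. It assumes a geometric rule in which each step down in fidelity multiplies the sample count by $γ$; that rule, the function names, and the per-sample compute times are illustrative assumptions, not values or code from the paper or from QeMFi.

```python
def samples_per_fidelity(n_target: int, gamma: int, n_fidelities: int) -> list[int]:
    """Training-set sizes ordered from lowest fidelity to the target fidelity.

    Assumed rule (not taken from the paper): each step down in fidelity
    multiplies the number of training samples by gamma.
    """
    return [n_target * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]


def training_data_cost(sizes: list[int], time_per_sample: list[float]) -> float:
    """Total time-cost of generating the training data across all fidelities."""
    return sum(n * t for n, t in zip(sizes, time_per_sample))


# Two samples at the target fidelity, five fidelities, gamma = 3.
sizes = samples_per_fidelity(n_target=2, gamma=3, n_fidelities=5)

# Hypothetical per-sample compute times (arbitrary units), cheapest fidelity first.
times = [1.0, 2.0, 4.0, 8.0, 16.0]
cost = training_data_cost(sizes, times)

print(sizes)
print(cost)
```

Plotting model error against a cost computed this way, for different choices of the scaling factor, is the kind of comparison the $Γ$-curve is described as providing.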

Topics: Quantum Chemistry & Force Fields

Tags: quantum-chemistry

arXiv categories: physics.chem-ph, cs.LG, physics.comp-ph
