TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training

Sebastian Loeschcke, David Pitt, Robert Joseph George, Jiawei Zhao, Cheng Luo, Yuandong Tian, Jean Kossaifi, Anima Anandkumar

arXiv:2501.02379 · cs.LG · Published 2025-01-04 · Updated 2025-05-30

Scientific problems require resolving multi-scale phenomena across different resolutions and learning solution operators in infinite-dimensional function spaces. Neural operators provide a powerful framework for this, using tensor-parameterized layers to capture complex, multi-dimensional relationships. However, scaling neural operators to high-resolution problems leads to significant computational demands, making the training of industrial-scale models prohibitive. In this work, we introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with optimizing large tensor-structured weights. Our approach, based on a robust tensor decomposition, factorizes gradients as the sum of a low-rank tensor and a sparse one to efficiently capture information within optimizer states, including outliers. Additionally, we provide a recipe for mixed-precision training of TensorGRaD, achieving further memory savings without sacrificing accuracy. We showcase the effectiveness of TensorGRaD on Fourier Neural Operators, a class of models crucial for solving partial differential equations (PDEs). We provide theoretical guarantees for TensorGRaD, demonstrating its fundamental advantage over matrix-based gradient compression methods. We empirically demonstrate large improvements across various PDE tasks, including the challenging turbulent Navier-Stokes case at a Reynolds number of $10^5$. TensorGRaD reduces total memory usage by over 50% while maintaining and sometimes even improving accuracy.
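
The decomposition described in the abstract, a gradient tensor written as a low-rank tensor plus a sparse one, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the selection rule or the low-rank format, so the sketch assumes a top-k magnitude rule for the sparse part and a truncated HOSVD (a Tucker-style approximation) for the low-rank part. The function names `unfold`, `mode_product`, and `sparse_plus_low_rank`, and the parameters `ranks` and `sparse_frac`, are hypothetical.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that rows index the given mode."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_product(t, m, mode):
    """Multiply tensor t by matrix m along the given mode."""
    out = np.tensordot(m, t, axes=(1, mode))  # contracted mode lands at axis 0
    return np.moveaxis(out, 0, mode)

def sparse_plus_low_rank(grad, ranks, sparse_frac):
    """Split grad into a sparse tensor and a Tucker-style low-rank tensor.

    Assumption (not taken from the paper): the sparse part keeps the
    largest-magnitude entries, and the low-rank part is a truncated HOSVD
    of what remains.
    """
    # Sparse part: keep the k largest-magnitude entries (captures outliers).
    k = max(1, int(sparse_frac * grad.size))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.reshape(grad.shape)

    # Low-rank part: leading left singular vectors of each unfolding.
    residual = grad - sparse
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(residual, mode), full_matrices=False)
        factors.append(u[:, :r])

    # Project the residual onto the factor subspaces to get the small core.
    core = residual
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)

    # Expand the core back to full size to form the low-rank approximation.
    low_rank = core
    for mode, u in enumerate(factors):
        low_rank = mode_product(low_rank, u, mode)
    return sparse, low_rank, core, factors

# Toy gradient with one large outlier entry.
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 5, 6))
grad[0, 0, 0] = 50.0
sparse, low_rank, core, factors = sparse_plus_low_rank(
    grad, ranks=(2, 2, 2), sparse_frac=0.05
)
```

Under these assumptions, an optimizer would only need to keep `core`, `factors`, and the nonzero values and indices of `sparse` rather than a full-size tensor, which is the kind of saving the abstract refers to. The full-size `low_rank` tensor is materialized here only to make the approximation easy to inspect.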

Topics: Scientific Machine Learning & PINNs

Tags: neural-operators, partial-differential-equations

arXiv categories: cs.LG
