Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Massimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini, Giacomo Piperno, Pasqua D'Ambra

arXiv:2501.03743·math.NA·Published 2025-01-07

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of NVIDIA GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, requires fewer global synchronizations and data communication steps than standard CG, which improves strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it overlaps data communication with computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
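The synchronization cost mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' GPU solver and it omits preconditioning. It assumes a dense NumPy matrix for a small 1D Poisson problem and a monomial Krylov basis; the function names `poisson_1d`, `cg_count_reductions`, and `krylov_gram` are hypothetical. The first routine is textbook CG with a counter for inner products, each of which would be a global reduction on a cluster. The second shows the kind of aggregation an s-step variant relies on: many inner products gathered into one Gram-matrix product.

```python
import numpy as np


def poisson_1d(n):
    """Dense 1D Poisson matrix tridiag(-1, 2, -1); dense storage only for brevity."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)


def cg_count_reductions(A, b, tol=1e-10, max_iter=1000):
    """Textbook (unpreconditioned) CG.

    Returns the solution, the iteration count, and the number of inner
    products, each of which would be a global reduction on a cluster.
    """
    x = np.zeros_like(b)
    r = b.copy()                  # residual for the zero initial guess
    p = r.copy()
    rr = r @ r                    # initial reduction
    reductions = 1
    stop = tol * tol * rr         # test ||r||^2 <= tol^2 * ||b||^2
    iters = 0
    while rr > stop and iters < max_iter:
        Ap = A @ p
        alpha = rr / (p @ Ap)     # first reduction of the iteration
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # second reduction of the iteration
        reductions += 2
        p = r + (rr_new / rr) * p
        rr = rr_new
        iters += 1
    return x, iters, reductions


def krylov_gram(A, r, s):
    """Monomial basis V = [r, A r, ..., A^(s-1) r] and its Gram matrix.

    G = V^T V holds s*s inner products computed by one matrix product,
    which in a distributed setting would map to a single reduction.
    """
    V = np.empty((r.size, s))
    V[:, 0] = r
    for j in range(1, s):
        V[:, j] = A @ V[:, j - 1]
    G = V.T @ V
    return V, G


A = poisson_1d(50)
b = np.ones(50)
x, iters, reductions = cg_count_reductions(A, b)
V, G = krylov_gram(A, b, s=4)
print(iters, reductions, G.shape)
```

Over s iterations the textbook loop performs 2s separate inner products, whereas the Gram matrix collects s² of them in one dense product. That single, larger operation is the sort of aggregated work that suits a high-throughput accelerator; how the s-step recurrences then use these quantities is left to the paper.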

Topics: Scientific Machine Learning & PINNs

Tags: partial-differential-equations, scientific-machine-learning

arXiv categories: math.NA
