Jigsaw: Training Multi-Billion-Parameter AI Weather Models With Optimized Model Parallelism

Update: 2025-10-24

Description

Jigsaw: Training Multi-Billion-Parameter AI Weather Models With Optimized Model Parallelism

Authors: Deifilia Kieckhefen, Markus Götz, Lars H. Heyen, Achim Streit, and Charlotte Debus (Karlsruhe Institute of Technology, Helmholtz AI)

• The paper introduces WeatherMixer (WM), a multi-layer perceptron (MLP)-based architecture designed for atmospheric forecasting, which serves as a competitive alternative to Transformer-based models. WM's workload scales linearly with input size, addressing the scaling challenges and quadratic computational complexity associated with the self-attention mechanism in Transformers when dealing with gigabyte-sized atmospheric data.

• A novel parallelization scheme called Jigsaw parallelism is proposed, combining both domain parallelism and tensor parallelism to efficiently train multi-billion-parameter models. Jigsaw is optimized for large input data by fully sharding the data, model parameters, and optimizer states across devices, eliminating memory redundancy.

• Jigsaw effectively mitigates hardware bottlenecks, particularly I/O-bandwidth limitations frequently encountered in training large scientific AI models. Due to its partitioned data loading (domain parallelism), the scheme achieves superscalar weak scaling in I/O-bandwidth-limited systems.

• The method demonstrates excellent scaling behavior on high-performance computing systems, exceeding state-of-the-art performance in strong scaling in computation–communication-limited systems. The training was successfully scaled up to 256 GPUs, reaching peak performances of 9 and 11 PFLOPs.

• Beyond hardware efficiency, Jigsaw improves predictive performance: by partitioning the model across more GPUs (model parallelism) instead of relying solely on data parallelism, it naturally enforces smaller global batch sizes, which empirically helps mitigate the problematic large-batch effects observed in AI weather models, leading to lower loss values.
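As a rough illustration of the two ideas in the description, the sketch below splits one linear layer across a small grid of simulated devices: the input is cut along its rows (domain parallelism) and the weight matrix along its columns (tensor parallelism). It also shows the batch-size arithmetic behind the last point. This is a generic textbook-style decomposition, not the Jigsaw algorithm from the paper. All function names are hypothetical, and `per_replica_batch` is an assumed parameter. Unlike the fully sharded scheme described above, this simple grid still replicates each input shard across the tensor axis.

```python
# Illustrative sketch only: generic domain + tensor sharding of one linear
# layer in pure Python. Not the Jigsaw scheme; all names are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (domain sharding)."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks (tensor sharding)."""
    step = len(m[0]) // parts
    return [[row[j * step:(j + 1) * step] for row in m] for j in range(parts)]


def sharded_linear(x, w, domain_parts, tensor_parts):
    """Compute x @ w block by block, one block per simulated device."""
    x_shards = split_rows(x, domain_parts)
    w_shards = split_cols(w, tensor_parts)
    out = []
    for xs in x_shards:
        # Each (input shard, weight shard) pair is one device's local work.
        blocks = [matmul(xs, ws) for ws in w_shards]
        # Stitch the column blocks back together, row by row.
        for r in range(len(xs)):
            out.append([v for blk in blocks for v in blk[r]])
    return out


def global_batch_size(num_gpus, model_parallel_degree, per_replica_batch=1):
    """Data-parallel replicas = GPUs / GPUs per model copy (assumed exact)."""
    return (num_gpus // model_parallel_degree) * per_replica_batch


if __name__ == "__main__":
    x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 "grid points", 2 features
    w = [[1, 0, 2, 1], [0, 1, 1, 2]]       # 2 inputs -> 4 outputs
    assert sharded_linear(x, w, 2, 2) == matmul(x, w)
    print(global_batch_size(256, 1), global_batch_size(256, 4))
```

With a fixed GPU count, raising the model-parallel degree lowers the number of data-parallel replicas and therefore the global batch size, which is the effect the last point refers to.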

In Channel

Differentiable and accelerated spherical harmonic and Wigner transforms

2025-12-19 13:02

Score-based diffusion nowcasting of GOES imagery

2025-12-11 12:46

FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution

2025-12-04 15:57

Beyond the Training Data: Confidence-Guided Mixing of Parameterizations in a Hybrid AI-Climate Model

2025-11-28 15:07

Climate in a Bottle: Towards a Generative Foundation Model for the Kilometer-Scale Global Atmosphere

2025-11-23 13:50

Probabilistic Measures for Fair AI and NWP Model Comparison

2025-11-07 13:09

Jigsaw: Training Multi-Billion-Parameter AI Weather Models With Optimized Model Parallelism

2025-10-24 13:36

XiChen: An observation-scalable fully AI-driven global weather forecasting system with 4D variational knowledge

2025-10-18 16:08

FuXi Weather: A data-to-forecast machine learning system for global weather

2025-10-03 13:52

Probabilistic Emulation of a Global Climate Model with Spherical DYffusion

2025-09-25 19:53

FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale

2025-09-16 17:39

Can AI weather models predict out-of-distribution gray swan tropical cyclones?

2025-08-16 16:27

Probabilistic Emulation of a Global Climate Model with Spherical DYffusion

2025-08-09 16:05

Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciarán

2025-08-03 14:28

Early Warning of Complex Climate Risk with Integrated Artificial Intelligence

2025-07-04 16:34

On Some Limitations of Current Machine Learning Weather Prediction Models

2025-06-27 20:29

Artificial intelligence for modeling and understanding extreme weather and climate events

2025-06-15 20:01

Fixing the Double Penalty in Data-Driven Weather Forecasting Through a Modified Spherical Harmonic Loss Function

2025-06-08 16:37

Climate-invariant machine learning

2025-05-09 12:37

ClimaX: A foundation model for weather and climate

2025-05-02 13:25

Amirpasha