WeatherBench 2
A benchmark for the next generation of data-driven global weather models
Overview
Weather forecasting using machine learning (ML) has seen rapid progress in recent years. With ML models now approaching the skill of operational physics-based models, these advances promise to improve the accuracy of weather forecasts worldwide. To reach this goal, it is important to evaluate novel methods openly and reproducibly, using objective and established metrics.
For this purpose, we introduce WeatherBench 2, a framework for evaluating and comparing weather forecasting models. This website displays up-to-date scores of many state-of-the-art ML and physics-based models. In addition, WeatherBench 2 provides open-source evaluation code and publicly available, cloud-optimized ground-truth and baseline datasets, including a comprehensive copy of the ERA5 dataset used to train most ML models. For more information on how to use the WeatherBench evaluation framework and how to add new models to the benchmark, please check out the documentation.
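As a minimal sketch, the cloud-optimized (Zarr) datasets can be opened directly with xarray (assuming xarray and gcsfs are installed). The bucket path below is only a placeholder; the actual dataset locations and available resolutions are listed in the documentation.

```python
import xarray as xr

# Open a cloud-optimized Zarr store from Google Cloud Storage.
# The path is a placeholder; see the WeatherBench 2 documentation for
# the real dataset locations and available resolutions/chunkings.
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5/<path-to-store>.zarr")

# Inspect the available variables, pressure levels and time range.
print(era5)
```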
Currently, the focus of WeatherBench 2 is on global, medium-range (1-15 day) prediction. In the future, we will explore adding evaluation and baselines for other tasks, such as nowcasting, short-range (0-24 hour) prediction, and long-range (15+ day) prediction. The research community can file a GitHub issue to share ideas and suggestions directly with the WeatherBench 2 team.
Headline scorecards
There is no single metric for measuring weather forecast performance. For example, one end user might be worried about wind gusts, while another might care more about average temperatures. For this reason, WeatherBench 2 contains a range of metrics, which you can find in the navigation at the top of this page. To provide a concise summary, we defined several key "headline" metrics that closely mirror the routine evaluation done by weather agencies and the World Meteorological Organization. It is important to remember that these metrics capture some important aspects of forecast quality, but by no means all of them.
The scorecards below show the skill (measured by the global root mean squared error) of different physical and ML-based methods relative to ECMWF's IFS HRES, one of the world's best operational weather models, on a number of key variables. For a detailed explanation of the different skill metrics and variables, check out the FAQ.
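The global RMSE reported here is typically computed with latitude (area) weighting, so that the shrinking grid cells near the poles do not dominate the average. Below is a minimal, illustrative sketch of such a metric and of expressing it relative to a baseline, assuming `forecast` and `truth` are xarray DataArrays on a regular latitude-longitude grid with dimensions (time, lat, lon); the official metric implementations live in the open-source WeatherBench 2 evaluation code.

```python
import numpy as np
import xarray as xr

def global_rmse(forecast: xr.DataArray, truth: xr.DataArray) -> float:
    """Latitude-weighted global RMSE for fields with dims (time, lat, lon)."""
    weights = np.cos(np.deg2rad(forecast.lat))  # grid-cell area shrinks towards the poles
    mse = ((forecast - truth) ** 2).weighted(weights).mean(("time", "lat", "lon"))
    return float(np.sqrt(mse))

def relative_rmse(model_rmse: float, baseline_rmse: float) -> float:
    """Relative difference vs. a baseline; negative values beat the baseline."""
    return (model_rmse - baseline_rmse) / baseline_rmse
```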
Scorecard for upper-level variables for the year 2020. IFS forecasts evaluated against IFS analysis. All other models evaluated against ERA5. Order of ML models reflects publication date. For more detail, visit Deterministic Scores.
Scorecard for surface variables for the year 2020. IFS forecasts evaluated against IFS analysis. All other models evaluated against ERA5. Order of ML models reflects publication date. Precipitation is evaluated using the SEEPS score. For more detail, visit Deterministic Scores. See FAQ for details on how climatology is computed.
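For illustration only, a very simple climatology can be derived from ERA5 as a day-of-year mean, reusing the `era5` dataset opened above. WeatherBench 2's reference climatology is computed more carefully, so this sketch is not the exact procedure; see the FAQ for the actual definition.

```python
# Naive day-of-year climatology; WeatherBench 2 uses a more careful
# definition (see the FAQ), so this is only an illustration.
climatology = era5.groupby("time.dayofyear").mean("time")
```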
Participating models
Missing an existing model or have a new model that you'd like to see added? Check out this guide for how to participate or submit a GitHub issue. Please also refer to the FAQ for more detailed information.
Model / Dataset | Source | Method | Type | Initial conditions | Horizontal resolution **
---|---|---|---|---|---
ERA5 | ECMWF | Physics-based | Reanalysis | n/a | 0.25°
IFS HRES | ECMWF | Physics-based | Forecast (deterministic) | Operational | 0.1°
IFS ENS | ECMWF | Physics-based | Forecast (50 member ensemble) | Operational | 0.2° *
ERA5 Forecasts | ECMWF | Physics-based | Hindcast (deterministic) | ERA5 | 0.25°
Keisler (2022) | Ryan Keisler | ML-based | Forecast (deterministic) | ERA5 | 1°
Pangu-Weather | Huawei | ML-based | Forecast (deterministic) | ERA5 | 0.25°
GraphCast | Google DeepMind | ML-based | Forecast (deterministic) | ERA5 | 0.25°
* Since June 2023, IFS ENS is also run at 0.1° resolution.
** Resolution is approximate for some models that don't use an equiangular grid.