WeatherBench 2

A benchmark for the next generation of data-driven global weather models

Overview

Weather forecasting using machine learning (ML) has seen rapid progress in recent years. With ML models now approaching the skill of operational physics-based models, these advances hold the promise of more accurate weather forecasts worldwide. To reach this goal, it is important to evaluate novel methods openly and reproducibly, using objective and established metrics.

For this purpose, we introduce WeatherBench 2, a framework for evaluating and comparing various weather forecasting models. This website displays up-to-date scores of many of the state-of-the-art ML and physics-based models. In addition, WeatherBench 2 consists of open-source evaluation code and publicly available, cloud-optimized ground-truth and baseline datasets, including a comprehensive copy of the ERA5 dataset used for training most ML models. For more information on how to use the WeatherBench evaluation framework and how to add new models to the benchmark, please check out the documentation.
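All datasets are stored as cloud-optimized Zarr archives that can be opened directly with standard Python tools. Below is a minimal sketch, assuming xarray with Zarr/GCS support is installed; the path and variable names are illustrative, so consult the documentation for the actual dataset catalog:

```python
import xarray as xr

# Hypothetical path to a WeatherBench 2 ERA5 Zarr store on Google Cloud;
# see the documentation for the real dataset paths.
ERA5_PATH = "gs://weatherbench2/datasets/era5/<resolution-and-chunking>.zarr"

# Opening is lazy: data is only downloaded for the slices you actually index.
era5 = xr.open_zarr(ERA5_PATH)

# Example: select 500 hPa geopotential at a single time.
z500 = era5["geopotential"].sel(level=500, time="2020-01-01T00")
```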

Currently, the focus of WeatherBench 2 is on global, medium-range (1-15 day) prediction. In the future, we will explore adding evaluation and baselines for other tasks, such as nowcasting and one-day (0-24 hour) and long-range (15+ day) prediction. The research community can file a GitHub issue to share ideas and suggestions directly with the WeatherBench 2 team.

Headline scorecards

There is no single metric for measuring weather forecast performance. For example, one end user might be worried about wind gusts, while another might care more about average temperatures. For this reason, WeatherBench 2 contains a range of metrics, which you can find in the navigation at the top of this page. To provide a concise summary, we defined several key "headline" metrics that closely mirror the routine evaluation done by weather agencies and the World Meteorological Organization. It is important to remember that these metrics measure some, but not all, of the aspects that make a good forecast.

The scorecards below show the skill (measured by the global root mean squared error) of different physical and ML-based methods relative to ECMWF's IFS HRES, one of the world's best operational weather models, on a number of key variables. For a detailed explanation of the different skill metrics and variables, check out the FAQ.

Scorecard for upper-level variables for the year 2020. (Quasi-)operational models are evaluated against IFS analysis; all other models are evaluated against ERA5. The order of ML models reflects publication date. For more detail, visit Deterministic Scores.

Scorecard for surface variables for the year 2020. (Quasi-)operational models are evaluated against IFS analysis; all other models are evaluated against ERA5. The order of ML models reflects publication date. Precipitation is evaluated with the SEEPS score, using ERA5 as ground truth for all models. For more detail, visit Deterministic Scores. See the FAQ for details on how climatology is computed.
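For orientation, the deterministic scorecards are built from a latitude-weighted global RMSE, with skill expressed relative to the IFS HRES baseline. Below is a minimal sketch of such a computation, assuming xarray data with latitude/longitude coordinates; the function and names are illustrative, not the WeatherBench 2 API:

```python
import numpy as np
import xarray as xr

def weighted_rmse(forecast: xr.DataArray, truth: xr.DataArray) -> xr.DataArray:
    """Globally averaged RMSE with cosine-latitude area weights."""
    weights = np.cos(np.deg2rad(forecast["latitude"]))
    weights = weights / weights.mean()  # normalize so weights average to 1
    squared_error = (forecast - truth) ** 2
    return np.sqrt((squared_error * weights).mean(("latitude", "longitude")))

# The scorecards show the relative difference to the baseline, computed
# per variable and lead time; negative values mean lower error than IFS HRES:
# relative_pct = 100 * (rmse_model - rmse_hres) / rmse_hres
```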

Probabilistic scorecards

The scorecard below shows the skill (measured by the continuous ranked probability score, CRPS) of physical and ML-based probabilistic models, relative to ECMWF's IFS ENS model.

Probabilistic scorecard for upper-level variables for the year 2020. (Quasi-)operational models (blue) are evaluated against IFS analysis; all other models (red) are evaluated against ERA5. The order of ML models reflects publication date. For more detail, visit Probabilistic Scores.
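For reference, the CRPS of an ensemble forecast X against an observation y can be estimated as E|X - y| - 0.5 E|X - X'|, where X and X' are independent ensemble members. A minimal numpy sketch of this standard estimator (illustrative, not the WeatherBench 2 implementation):

```python
import numpy as np

def crps_ensemble(members: np.ndarray, truth: np.ndarray) -> np.ndarray:
    """CRPS per grid point: E|X - y| - 0.5 * E|X - X'|.

    `members` has shape (n_members, *grid); `truth` has shape (*grid).
    """
    skill = np.abs(members - truth).mean(axis=0)           # E|X - y|
    spread = np.abs(members[:, None] - members[None, :])   # |X_i - X_j|
    return skill - 0.5 * spread.mean(axis=(0, 1))          # combine terms

# Example: a 50-member ensemble on a small lat/lon grid.
rng = np.random.default_rng(0)
ens = rng.normal(size=(50, 4, 8))
obs = rng.normal(size=(4, 8))
print(crps_ensemble(ens, obs).shape)  # -> (4, 8)
```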

Participating models

Missing an existing model or have a new model that you'd like to see added? Check out this guide for how to participate or submit a GitHub issue. Please also refer to the FAQ for more detailed information.

| Model / Dataset | Source | Method | Type | Initial conditions | Horizontal resolution ** |
| --- | --- | --- | --- | --- | --- |
| ERA5 | ECMWF | Physics-based | Reanalysis | | 0.25° |
| IFS HRES | ECMWF | Physics-based | Forecast (deterministic) | Operational | 0.1° |
| IFS ENS | ECMWF | Physics-based | Forecast (50-member ensemble) | Operational | 0.2° * |
| Pangu-Weather (operational) | Huawei | ML-based | Forecast (deterministic) | Operational IFS | 0.25° |
| GraphCast (operational) | Google DeepMind | ML-based | Forecast (deterministic) | Operational IFS | 0.25° |
| ERA5 forecasts | ECMWF | Physics-based | Hindcast (deterministic) | ERA5 | 0.25° |
| Keisler (2022) | Ryan Keisler | ML-based | Forecast (deterministic) | ERA5 | 1° |
| Pangu-Weather | Huawei | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| GraphCast | Google DeepMind | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| FuXi | Fudan University, Shanghai | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| SphericalCNN | Google Research | ML-based | Forecast (deterministic) | ERA5 | 1.4° x 0.7° |
| NeuralGCM 0.7° | Google Research | Hybrid | Forecast (deterministic) | ERA5 | 0.7° |
| NeuralGCM ENS | Google Research | Hybrid | Forecast (ensemble) | ERA5 | 1.4° |

* Since June 2023, IFS ENS is also run at 0.1° resolution.

** Resolution is approximate for some models that don't use an equiangular grid.