WeatherBench 2
A benchmark for the next generation of data-driven global weather models
Overview
Weather forecasting using machine learning (ML) has seen rapid progress in recent years. With ML models now approaching the skill of operational physics-based models, there is promise that these advances will soon make it possible to improve the accuracy of weather forecasts worldwide. To reach this goal, it is important to openly and reproducibly evaluate novel methods with objective and established metrics.
For this purpose, we introduce WeatherBench 2, a framework for evaluating and comparing weather forecasting models. This website displays up-to-date scores for many state-of-the-art ML and physics-based models. In addition, WeatherBench 2 includes open-source evaluation code and publicly available, cloud-optimized ground-truth and baseline datasets, including a comprehensive copy of the ERA5 dataset used for training most ML models. For more information on how to use the WeatherBench evaluation framework and how to add new models to the benchmark, please check out the documentation.
Currently, WeatherBench 2 focuses on global, medium-range (1-15 day) prediction. In the future, we will explore adding evaluation and baselines for other tasks, such as nowcasting, one-day (0-24 hour) prediction, and long-range (15+ day) prediction. The research community can file a GitHub issue to share ideas and suggestions directly with the WeatherBench 2 team.
Headline scorecards
There is no single metric for measuring weather forecast performance. For example, one end user might be worried about wind gusts, while another might care more about average temperatures. For this reason, WeatherBench 2 contains a range of metrics, which you can find in the navigation at the top of this page. To provide a concise summary, we defined several key "headline" metrics that closely mirror the routine evaluation done by weather agencies and the World Meteorological Organization. It is important to remember that these metrics capture some, but not all, of the important aspects of a good forecast.
The scorecards below show the skill (measured by the global root mean squared error, RMSE) of different physical and ML-based methods relative to ECMWF's IFS HRES, one of the world's best operational weather models, on a number of key variables. For a detailed explanation of the different skill metrics and variables, check out the FAQ.
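As a rough illustration of what a "global RMSE relative to a baseline" score involves, here is a minimal sketch in Python with NumPy. It is not the WeatherBench 2 evaluation code. The function names are hypothetical, and two choices are assumptions made for this example: grid cells are weighted by the cosine of latitude to approximate their area, and the relative score is expressed as a percentage difference from the baseline RMSE.

```python
import numpy as np


def global_rmse(forecast, truth, lat_deg):
    """Latitude-weighted RMSE for 2D fields shaped (lat, lon).

    Assumption: cos(latitude) weights approximate grid-cell area on an
    equiangular grid; the weights are normalized to have mean 1.
    """
    weights = np.cos(np.deg2rad(lat_deg))
    weights = weights / weights.mean()
    squared_error = (forecast - truth) ** 2
    # Broadcast the per-latitude weights across the longitude axis.
    return float(np.sqrt((squared_error * weights[:, None]).mean()))


def relative_rmse_percent(model_rmse, baseline_rmse):
    """Percentage difference from the baseline; negative means lower error."""
    return 100.0 * (model_rmse - baseline_rmse) / baseline_rmse


# Hypothetical 5-degree grid with synthetic fields, for illustration only.
lat = np.linspace(-87.5, 87.5, 36)
truth = np.zeros((36, 72))
model = np.full((36, 72), 1.8)     # stand-in for an ML forecast
baseline = np.full((36, 72), 2.0)  # stand-in for the reference forecast

model_score = global_rmse(model, truth, lat)
baseline_score = global_rmse(baseline, truth, lat)
relative = relative_rmse_percent(model_score, baseline_score)
```

With a constant error field, the weighted RMSE equals the absolute error itself, so the synthetic model above scores about 10% below the synthetic baseline.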
Probabilistic scorecards
The scorecard below shows the skill (measured by the continuous ranked probability score, CRPS) of physical and ML-based probabilistic models relative to ECMWF's IFS ENS model.
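For readers unfamiliar with CRPS, the sketch below shows the standard ensemble estimator for a single scalar observation: the mean absolute error of the members minus half the mean absolute difference between all pairs of members. The function name is hypothetical, and this is an illustration under those assumptions rather than the WeatherBench 2 implementation, which may differ in details such as the estimator variant and spatial weighting.

```python
import numpy as np


def crps_ensemble(members, obs):
    """Ensemble CRPS estimate for one scalar observation.

    CRPS ~= mean|x_i - y| - 0.5 * mean|x_i - x_j|, where the second mean
    runs over all member pairs (including i == j).
    """
    members = np.asarray(members, dtype=float)
    skill = np.abs(members - obs).mean()
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return float(skill - 0.5 * spread)


# A two-member ensemble that brackets the observation.
bracketing = crps_ensemble([0.0, 2.0], obs=1.0)

# With a single member, CRPS reduces to the absolute error.
single = crps_ensemble([3.0], obs=1.0)
```

Lower values are better. Because the score reduces to the absolute error for a one-member ensemble, it is often read as a probabilistic generalization of the mean absolute error.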
Participating models
Is an existing model missing, or do you have a new model that you'd like to see added? Check out this guide for how to participate, or submit a GitHub issue. Please also refer to the FAQ for more detailed information.
| Model / Dataset | Source | Method | Type | Initial conditions | Horizontal resolution ** |
|---|---|---|---|---|---|
| | ECMWF | Physics-based | Reanalysis | | 0.25° |
| | ECMWF | Physics-based | Forecast (deterministic) | Operational | 0.1° |
| | ECMWF | Physics-based | Forecast (50-member ensemble) | Operational | 0.2° * |
| Pangu-Weather (operational) | Huawei | ML-based | Forecast (deterministic) | Operational IFS | 0.25° |
| GraphCast (operational) | Google DeepMind | ML-based | Forecast (deterministic) | Operational IFS | 0.25° |
| | ECMWF | Physics-based | Hindcast (deterministic) | ERA5 | 0.25° |
| | Ryan Keisler | ML-based | Forecast (deterministic) | ERA5 | 1° |
| | Huawei | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| | Google DeepMind | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| | Fudan University, Shanghai | ML-based | Forecast (deterministic) | ERA5 | 0.25° |
| | Google Research | ML-based | Forecast (deterministic) | ERA5 | 1.4x0.7° |
| | Google Research | Hybrid | Forecast (deterministic) | ERA5 | 0.7° |
| | Google Research | Hybrid | Forecast (ensemble) | ERA5 | 1.4° |
* Since June 2023, IFS ENS is also run at 0.1° resolution.
** Resolution is approximate for some models that don't use an equiangular grid.