WeatherBench 2
A benchmark for the next generation of data-driven global weather models
Overview
Weather forecasting using machine learning (ML) has seen rapid progress in recent years. With ML models now approaching the skill of operational physics-based models, these advances promise to improve the accuracy of weather forecasts worldwide. To reach this goal, it is important to evaluate novel methods openly and reproducibly, using objective and established metrics.
For this purpose, we introduce WeatherBench 2, a framework for evaluating and comparing weather forecasting models. This website displays up-to-date scores of many state-of-the-art ML and physics-based models. In addition, WeatherBench 2 provides open-source evaluation code and publicly available, cloud-optimized ground-truth and baseline datasets, including a comprehensive copy of the ERA5 dataset used to train most ML models. For more information on how to use the WeatherBench evaluation framework and how to add new models to the benchmark, please check out the documentation.
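As a minimal sketch, the cloud-optimized (Zarr) datasets can be opened directly with xarray (assuming xarray and gcsfs are installed). The bucket path below is only a placeholder; the actual dataset locations and available resolutions are listed in the documentation.

```python
import xarray as xr

# Open a cloud-optimized Zarr store from Google Cloud Storage.
# The path is a placeholder; see the WeatherBench 2 documentation for
# the real dataset locations and available resolutions/chunkings.
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5/<path-to-store>.zarr")

# Inspect the available variables, pressure levels and time range.
print(era5)
```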
Currently, the focus of WeatherBench 2 is on global, medium-range (1-15 day) prediction. In the future, we will explore adding evaluation and baselines for other tasks, such as nowcasting, short-range (0-24 hour) prediction, and long-range (15+ day) prediction. The research community can file a GitHub issue to share ideas and suggestions directly with the WeatherBench 2 team.
Headline scorecards
There is no single metric for measuring weather forecast performance. For example, one end user might be worried about wind gusts, while another might care more about average temperatures. For this reason, WeatherBench 2 contains a range of metrics, which you can find in the navigation at the top of this page. To provide a concise summary, we defined several key "headline" metrics that closely mirror the routine evaluation done by weather agencies and the World Meteorological Organization. It is important to remember that these metrics capture some important aspects of forecast quality, but by no means all of them.
The scorecards below show the skill (measured by the global root mean squared error) of different physical and ML-based methods relative to ECMWF's IFS HRES, one of the world's best operational weather models, on a number of key variables. For a detailed explanation of the different skill metrics and variables, check out the FAQ.
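The global RMSE reported here is typically computed with latitude (area) weighting, so that the shrinking grid cells near the poles do not dominate the average. Below is a minimal, illustrative sketch of such a metric and of expressing it relative to a baseline, assuming `forecast` and `truth` are xarray DataArrays on a regular latitude-longitude grid with dimensions (time, lat, lon); the official metric implementations live in the open-source WeatherBench 2 evaluation code.

```python
import numpy as np
import xarray as xr

def global_rmse(forecast: xr.DataArray, truth: xr.DataArray) -> float:
    """Latitude-weighted global RMSE for fields with dims (time, lat, lon)."""
    weights = np.cos(np.deg2rad(forecast.lat))  # grid-cell area shrinks towards the poles
    mse = ((forecast - truth) ** 2).weighted(weights).mean(("time", "lat", "lon"))
    return float(np.sqrt(mse))

def relative_rmse(model_rmse: float, baseline_rmse: float) -> float:
    """Relative difference vs. a baseline; negative values beat the baseline."""
    return (model_rmse - baseline_rmse) / baseline_rmse
```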
Scorecard for upper-level variables for the year 2020. IFS forecasts evaluated against IFS analysis. All other models evaluated against ERA5. Order of ML models reflects publication date. For more detail, visit Deterministic Scores.
Scorecard for surface variables for the year 2020. IFS forecasts evaluated against IFS analysis. All other models evaluated against ERA5. Order of ML models reflects publication date. Precipitation is evaluated using the SEEPS score. For more detail, visit Deterministic Scores. See FAQ for details on how climatology is computed.
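For illustration only, a very simple climatology can be derived from ERA5 as a day-of-year mean, reusing the `era5` dataset opened above. WeatherBench 2's reference climatology is computed more carefully, so this sketch is not the exact procedure; see the FAQ for the actual definition.

```python
# Naive day-of-year climatology; WeatherBench 2 uses a more careful
# definition (see the FAQ), so this is only an illustration.
climatology = era5.groupby("time.dayofyear").mean("time")
```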
Participating models
Missing an existing model or have a new model that you'd like to see added? Check out this guide for how to participate or submit a GitHub issue. Please also refer to the FAQ for more detailed information.
Model / Dataset | Source | Method | Type | Initial conditions | Horizontal resolution **
---|---|---|---|---|---
ERA5 | ECMWF | Physics-based | Reanalysis | n/a | 0.25°
IFS HRES | ECMWF | Physics-based | Forecast (deterministic) | Operational | 0.1°
IFS ENS | ECMWF | Physics-based | Forecast (50 member ensemble) | Operational | 0.2° *
ERA5 Forecasts | ECMWF | Physics-based | Hindcast (deterministic) | ERA5 | 0.25°
Keisler (2022) | Ryan Keisler | ML-based | Forecast (deterministic) | ERA5 | 1°
Pangu-Weather | Huawei | ML-based | Forecast (deterministic) | ERA5 | 0.25°
GraphCast | Google DeepMind | ML-based | Forecast (deterministic) | ERA5 | 0.25°
* Since June 2023, IFS ENS is also run at 0.1° resolution.
** Resolution is approximate for some models that don't use an equiangular grid.