How does WeatherBench 2 differ from WeatherBench 1?

The original 2020 WeatherBench paper was based on initial efforts to predict global weather with ML techniques in 2018. These models were run at relatively low resolutions. Since then, the field has progressed rapidly. WeatherBench 2 improves upon its predecessor in several aspects:

  • Higher resolution data

  • A scalable open-source evaluation framework

  • More metrics (for example, spectra and bias metrics)

  • Probabilistic evaluation and baselines

  • Interactive graphics on this website that allow a deep dive into the results

  • Better treatment of baselines: physics-based models are evaluated against their own analysis

Where can I find more details about the participating models?


Climatology

The climatology is used to compute certain skill scores, in particular ACC and SEEPS, and also serves as a baseline forecast. Here, we follow the process described in “Scale-dependent verification of ensemble forecasts” to compute a climatology based on ERA5 data from 1990 to 2019. See the WeatherBench 2 paper for more details.

A probabilistic version of the climatology was created by taking each of the 30 years from 1990 to 2019 as an ensemble member.
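As a rough illustration of this construction, the sketch below treats each year's value for the same calendar day as one ensemble member. The data layout (a dict mapping year to a list of daily values) and the function name `climatological_ensemble` are assumptions made for this example, not the WeatherBench 2 implementation.

```python
def climatological_ensemble(values_by_year, day_index):
    """Return one ensemble member per year for the given day index.

    values_by_year is a hypothetical mapping {year: [value for each day]}.
    """
    return [values_by_year[year][day_index] for year in sorted(values_by_year)]


# Toy data: 30 years (1990 to 2019), 3 days each, with synthetic values.
toy = {
    year: [float(year - 1990 + day) for day in range(3)]
    for year in range(1990, 2020)
}

members = climatological_ensemble(toy, day_index=1)
print(len(members), "members for day index 1")
```

The resulting list can then be scored like any other ensemble forecast, for example with the CRPS.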

Note that a 30-year climatology will include some climate drift, especially for temperature. Here, we do not apply any measure to correct for this.


IFS HRES

Our primary baseline comes from ECMWF's operational IFS model, which is widely regarded as one of the best medium-range numerical weather prediction models. Since 2016, the IFS in its HRES configuration has been run at 0.1 degree (roughly 9 km) horizontal resolution. The operational model is updated regularly, approximately once or twice a year, which means that the exact model configuration might change during the evaluation period. Updates usually bring slight improvements in most, though not all, evaluation metrics, and changes in the IFS are typically gradual. ECMWF publishes a comprehensive model description and a schedule of model upgrades. Initial conditions are created every 6 hours with an ensemble 4D-Var system that uses information from the previous assimilation cycle's forecast as well as observations in a +/- 3-hour window. After accounting for the time needed for data assimilation and forward simulation, forecasts have a latency of 5.75 to 7 hours after initialization. Forecasts that start at 00 and 12 UTC run up to a lead time of 10 days; 06 and 18 UTC initializations are run for 3.75 days.

Note that with the 2023 upgrade to a 9 km ensemble resolution, HRES no longer exists as a separate run; it is now the ensemble control run.


IFS ENS

ECMWF also runs an ensemble version (ENS) of the IFS, at 0.2 degree resolution for the 2020 evaluation period used here. Since the 2023 upgrade, the ENS also has a 0.1 degree resolution. The ensemble consists of a control run and 50 perturbed members. The initial conditions of the perturbed members are created by running an ensemble of data assimilations (EDA) in which observation, model and boundary condition errors are represented by perturbations. The difference between each EDA analysis and the EDA mean is then used as a perturbation to the HRES initial conditions. In addition, to represent forecast uncertainty more accurately, singular vector perturbations are added to the initial conditions. Model uncertainties during the forecasts are represented by stochastically perturbed parameterization tendencies (SPPT), in which spatially and temporally correlated perturbations are added to the model physics tendencies. Ensemble forecasts are initialized at 00 and 12 UTC and run out to 15 days.
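The EDA-based part of this procedure can be sketched in a few lines: each EDA member's deviation from the EDA mean is added to the HRES analysis. This is a toy illustration only. States are flattened to short lists of floats, the function name is hypothetical, and singular vector perturbations, SPPT and the mapping from EDA members to the 50 perturbed members are all left out.

```python
def perturbed_initial_conditions(hres_analysis, eda_analyses):
    """Re-centre EDA perturbations on the HRES analysis.

    hres_analysis: list of floats (a flattened toy state).
    eda_analyses: list of EDA member states, each the same length.
    """
    n = len(eda_analyses)
    # Mean of the EDA members at each grid point.
    eda_mean = [sum(column) / n for column in zip(*eda_analyses)]
    # HRES analysis plus (member minus EDA mean) for every member.
    return [
        [h + (e - m) for h, e, m in zip(hres_analysis, member, eda_mean)]
        for member in eda_analyses
    ]


hres = [10.0, 20.0]
eda = [[9.0, 21.0], [11.0, 19.0], [10.0, 23.0]]
initial_conditions = perturbed_initial_conditions(hres, eda)
print(initial_conditions)
```

By construction, the mean of the perturbed states equals the HRES analysis, so the perturbations only add spread around it.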


IFS ENS Mean

We also include the IFS ENS mean as a baseline, which we computed by simply averaging over the 50 members. The ensemble mean does not represent a realistic forecast but often performs very well on deterministic metrics.

ERA5 forecasts

For research purposes, ECMWF ran a set of 10-day hindcasts initialized from ERA5 states at 00/12 UTC with the same IFS model version used to create ERA5. These forecasts are available in the MARS archive. Note that data is available at 6 h intervals up to a lead time of 5 days, and at 12 h intervals from 5 to 10 days.
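For reference, the lead times implied by this schedule can be enumerated as follows. Whether the archive includes a 0 h step, and whether the 120 h step belongs to the 6-hourly part, are assumptions of this sketch.

```python
def era5_forecast_lead_hours():
    """List lead times in hours: 6-hourly to day 5, then 12-hourly to day 10."""
    six_hourly = list(range(0, 5 * 24 + 1, 6))                    # 0 h to 120 h
    twelve_hourly = list(range(5 * 24 + 12, 10 * 24 + 1, 12))     # 132 h to 240 h
    return six_hourly + twelve_hourly


lead_hours = era5_forecast_lead_hours()
print(lead_hours)
```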

The ERA5 forecasts provide a like-for-like baseline for an AI model initialized from and evaluated against ERA5. They benefit from the same assimilation window as ERA5, which is longer than the one used for the operational initial conditions, and are run at 0.25-degree resolution, similar to many modern AI methods. Because of the lower resolution and older model version relative to the operational IFS HRES in 2020, one would expect the operational model to be more skillful by itself.

Keisler (2022) Graph Neural Network

In "Forecasting Global Weather with Graph Neural Networks", a graph neural network architecture is used with an encoder that maps the original 1-degree latitude-longitude grid to an icosahedron grid, on which several rounds of message-passing computations are performed before decoding back into latitude-longitude space. The model takes as input the atmospheric states at t=0 and t=-6h and predicts the state at t=6h. The model's outputs are fed back as inputs autoregressively to forecast longer time horizons. The state consists of 6 three-dimensional variables at 13 pressure levels. The model is trained on ERA5 data, with 1991, 2004 and 2017 used for validation, 2012, 2016 and 2020 for testing, and the remaining years from 1979 to 2020 for training. It is trained to minimize the cumulative error over up to 12 time steps (3 days). Model training took 5.5 days on a single Nvidia A100 GPU.
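The autoregressive rollout described here can be sketched generically: a one-step model takes the two most recent states and returns the next one, and its output is fed back in. The `rollout` helper and the toy step function below are hypothetical stand-ins, not Keisler's code; a real model would operate on full atmospheric states rather than a scalar.

```python
def rollout(step_fn, state_prev, state_now, n_steps):
    """Chain a two-input, one-step model forward n_steps times."""
    forecasts = []
    for _ in range(n_steps):
        state_next = step_fn(state_prev, state_now)
        forecasts.append(state_next)
        # The newest prediction becomes the current state for the next step.
        state_prev, state_now = state_now, state_next
    return forecasts


def toy_step(prev, now):
    """Stand-in 'model': linear extrapolation of a scalar state."""
    return now + (now - prev)


# 12 steps of 6 h each cover the 3-day training horizon mentioned above.
trajectory = rollout(toy_step, state_prev=1.0, state_now=2.0, n_steps=12)
print(trajectory)
```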


Pangu-Weather

Pangu-Weather is a data-driven weather model based on a transformer architecture. It predicts the state of the atmosphere at time t + dt based on the current state at time t. The state is described by 5 upper-air variables and 4 surface variables on a 0.25-degree horizontal grid (same as ERA5) with 13 vertical levels for the upper-air variables. The model is trained using ERA5 data from 1979 to 2017 (incl.) with 2019 for validation and 2018, 2020 and 2021 for testing. Here we evaluate forecasts for 2020. Four different model versions are trained for different prediction time steps dt = 1h, 3h, 6h, 24h. To create forecasts for an arbitrary lead time, predictions from the four models are chained together autoregressively, using the fewest possible number of steps. For example, to create a 31-h forecast, a 24-h forecast is followed by a 6-h and then a 1-h forecast. The maximum lead time for the data used here is 7 days. Training the model took 16 days on 192 Tesla-V100 GPUs. Creating a prediction with the trained model takes around 1.5 s on a single GPU. Inference code for Pangu-Weather can be found at https://github.com/198808xc/Pangu-Weather.
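The step-chaining rule can be illustrated with a greedy decomposition: always take the largest available time step that still fits. The function name `plan_steps` is hypothetical and this is not taken from the Pangu-Weather code. For the step sizes 24, 6, 3 and 1 h, the greedy choice also yields the fewest steps, because each step size is a multiple of the next smaller one.

```python
def plan_steps(lead_hours, available=(24, 6, 3, 1)):
    """Split an integer lead time (hours) into the largest available steps."""
    if lead_hours < 0:
        raise ValueError("lead_hours must be non-negative")
    steps = []
    remaining = lead_hours
    for dt in sorted(available, reverse=True):
        count, remaining = divmod(remaining, dt)
        steps.extend([dt] * count)
    return steps


print(plan_steps(31))   # the 31-h example from the text
print(plan_steps(168))  # the 7-day maximum lead time
```

For the 31-h example this gives a 24-h step, then a 6-h step, then a 1-h step, matching the description above.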

Pangu-Weather (oper.)

This is a version of Pangu-Weather initialized with the operational IFS HRES analysis.


GraphCast

GraphCast is similar in structure to Keisler (2022) but operates on a higher-resolution input with 6 upper-level variables on a 0.25 degree horizontal grid with 37 vertical levels, and additionally 5 surface variables. The model is also trained autoregressively up to a time horizon of 12 time steps (3 days). Training took around four weeks on 32 TPU v4 devices. Creating a single 10-day forecast takes less than a minute on a single TPU. Here, we evaluate a version of GraphCast that was trained on ERA5 data from 1979 to 2019 (incl.). See Suppl. 5.1 of Lam et al. (2022) for details. Code for GraphCast can be found at https://github.com/google-deepmind/graphcast.

GraphCast (oper.)

This is a version of GraphCast fine-tuned and initialized with the operational IFS HRES analysis. For more detail, refer to https://github.com/google-deepmind/graphcast.


FuXi

FuXi is an autoregressive, cascaded ML weather forecast system based on a transformer architecture, with specific models trained for short-range (0-5 days), medium-range (5-10 days) and long-range (10-15 days) prediction. For details, refer to the paper.

Spherical CNN

Spherical CNNs generalize CNNs to functions on the sphere, by using spherical convolutions as the main linear operation. For details, refer to the paper. Code can be found at https://github.com/google-research/spherical-cnn.

Neural General Circulation Models (NeuralGCM)

NeuralGCM combines a differentiable solver for atmospheric dynamics with ML components. The deterministic simulations are run at 0.7 degrees and a 50-member ensemble was created at 1.4 degrees. For details, refer to the paper.

What do the different metrics mean?

For detailed equations, refer to the paper.

Deterministic metrics

ACC (Anomaly Correlation Coefficient)

Correlation coefficient between the forecast and ground-truth anomalies with respect to the 30-year climatology. In contrast to the RMSE, the ACC focuses more on the sign of the anomaly than on the absolute magnitude of the error.

Bias

The average bias of the forecast with respect to the ground truth. For example, a positive temperature bias indicates that, for the given time period and geographical region, the forecasts tended to be too warm.

MAE (Mean Absolute Error)

Compared to the RMSE, the MAE does not focus disproportionately on large deviations.

MSE (Mean Squared Error)

Common skill metric that emphasizes large deviations.

RMSE (Root Mean Squared Error)

Common skill metric that emphasizes large deviations.

SEEPS (Stable Equitable Error in Probability Space)

SEEPS is a precipitation metric based on "no rain", "light rain" and "heavy rain" categories. See paper for details.
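To make the definitions above concrete, here is a minimal sketch of bias, MAE, MSE, RMSE and ACC on plain lists of numbers. It is a simplification and not the WeatherBench 2 implementation: it ignores area weighting and the order of spatial and temporal averaging, and it uses the uncentred form of the ACC, which is one common definition. SEEPS is omitted because it needs climatological precipitation categories. Refer to the paper for the exact equations.

```python
import math


def bias(forecast, truth):
    """Mean error; positive means the forecast is too high on average."""
    return sum(f - t for f, t in zip(forecast, truth)) / len(truth)


def mae(forecast, truth):
    return sum(abs(f - t) for f, t in zip(forecast, truth)) / len(truth)


def mse(forecast, truth):
    return sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth)


def rmse(forecast, truth):
    return math.sqrt(mse(forecast, truth))


def acc(forecast, truth, climatology):
    """Uncentred anomaly correlation with respect to a climatology."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    t_anom = [t - c for t, c in zip(truth, climatology)]
    numerator = sum(a * b for a, b in zip(f_anom, t_anom))
    denominator = math.sqrt(
        sum(a * a for a in f_anom) * sum(b * b for b in t_anom)
    )
    return numerator / denominator


# Toy values (arbitrary units), purely for illustration.
forecast = [2.0, 4.0, 6.0]
truth = [1.0, 5.0, 6.0]
climatology = [3.0, 3.0, 3.0]

bias_value = bias(forecast, truth)
mae_value = mae(forecast, truth)
rmse_value = rmse(forecast, truth)
acc_value = acc(forecast, truth, climatology)
print(bias_value, mae_value, rmse_value, acc_value)
```

In this toy case the errors are +1, -1 and 0, so the bias cancels to zero while MAE and RMSE do not, which shows why bias alone is not a measure of accuracy.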

Probabilistic metrics

CRPS (Continuous Ranked Probability Score)

The CRPS measures how well the forecast distribution (e.g. the ensemble) matches the ground truth. To achieve good CRPS values, the forecasts have to be reliable, i.e. the forecast uncertainty has to match the actual uncertainty, and sharp, i.e. a smaller uncertainty is better.

Ensemble Mean RMSE (Ensemble Mean Root Mean Squared Error)

The RMSE of the ensemble mean. Also often called "skill".

Spread

The standard deviation of the ensemble.

Spread/skill ratio

In a reliable forecast, the spread should match the Ensemble Mean RMSE, giving a ratio of 1. Values below 1 indicate that the ensemble is underdispersive (too confident); values above 1 indicate that it is overdispersive (not confident enough). Note that this is a necessary but not sufficient condition for a reliable forecast.
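The sketch below illustrates the probabilistic metrics on toy data. It uses the standard ensemble estimator of the CRPS, E|X - y| - 0.5 E|X - X'|, and a spread/skill ratio built from the mean ensemble variance and the mean squared error of the ensemble mean. The function names are hypothetical, and the estimators in the paper may differ in details such as finite-ensemble corrections and area weighting.

```python
import math


def crps_ensemble(members, observation):
    """CRPS estimate for one ensemble forecast of a scalar quantity."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    error_term = sum(abs(x - observation) for x in members) / m
    # Mean absolute difference between all pairs of members.
    spread_term = sum(abs(a - b) for a in members for b in members) / (m * m)
    return error_term - 0.5 * spread_term


def spread_skill_ratio(ensembles, observations):
    """Ratio of ensemble spread to ensemble-mean RMSE over several cases.

    ensembles: list of member lists, one per forecast case.
    observations: list of verifying values, one per forecast case.
    """
    squared_errors = []
    variances = []
    for members, obs in zip(ensembles, observations):
        m = len(members)
        mean = sum(members) / m
        squared_errors.append((mean - obs) ** 2)
        variances.append(sum((x - mean) ** 2 for x in members) / (m - 1))
    skill = math.sqrt(sum(squared_errors) / len(squared_errors))
    spread = math.sqrt(sum(variances) / len(variances))
    return spread / skill


crps_value = crps_ensemble([1.0, 2.0, 3.0], observation=2.0)
ratio = spread_skill_ratio([[1.0, 3.0], [2.0, 6.0]], [3.0, 2.0])
print(crps_value, ratio)
```

With a single member the spread term vanishes and the CRPS reduces to the absolute error, which is why CRPS and MAE can be compared directly.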

What do the different levels mean?

In weather modeling, it is common to use pressure, measured in hPa, as a vertical coordinate instead of altitude. Here is a rough conversion based on the Standard Atmosphere:

  • 850 hPa = 1.5 km

  • 700 hPa = 3 km

  • 500 hPa = 5.5 km
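These values can be reproduced with the barometric formula for the troposphere of the International Standard Atmosphere. The sketch below assumes the usual ISA constants and is only valid below roughly 11 km; the real height of a pressure level varies with the weather.

```python
def pressure_to_altitude_km(pressure_hpa):
    """Approximate altitude (km) of a pressure level in the ISA troposphere."""
    t0 = 288.15        # sea-level temperature in K
    p0 = 1013.25       # sea-level pressure in hPa
    lapse = 0.0065     # temperature lapse rate in K/m
    g = 9.80665        # gravitational acceleration in m/s^2
    r_dry = 287.053    # specific gas constant of dry air in J/(kg K)
    exponent = r_dry * lapse / g
    altitude_m = (t0 / lapse) * (1.0 - (pressure_hpa / p0) ** exponent)
    return altitude_m / 1000.0


for level in (850, 700, 500):
    print(level, "hPa is about", round(pressure_to_altitude_km(level), 1), "km")
```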

What do the different variables mean?



10m U/V Component of Wind

Wind components at 10 meters height. The U component is the wind speed in the zonal direction (East-West), the V component is the wind speed in the meridional direction (North-South).

10m Wind Speed

Wind speed at 10 meters height.

10m Wind Vector

Wind vector at 10 meters height. This is only applicable to RMSE, where we compute a special RMSE that combines the U and V components. For all other metrics, this will be empty. See paper for details.

6/24h Precipitation

Accumulation of precipitation (rain, hail, snow, etc.) over a 6/24 hour period.

2m Temperature

Temperature at 2 meters height.


Geopotential

Roughly speaking, this is the height of the pressure level. This, again, is roughly equivalent to the pressure distribution at a fixed altitude.

Sea Level Pressure

The pressure a given location would have if the surface pressure were extrapolated to sea level.

Specific Humidity

This is the mass of water vapor (measured in grams) per kg of air. It is related to relative humidity through temperature and pressure.


Temperature

Temperature at a given vertical level.

U/V Component of Wind

The U component is the wind speed in the zonal direction (East-West), the V component is the wind speed in the meridional direction (North-South).

Wind Speed

Wind speed at a given vertical level.

Wind Vector

Wind vector at a given vertical level. This is only applicable to RMSE, where we compute a special RMSE that combines the U and V components. For all other metrics, this will be empty. See paper for details.
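As an illustration of the wind variables, the sketch below derives wind speed from the U and V components and computes one plausible form of the combined RMSE, in which the squared errors of both components are summed before averaging. This form is an assumption of the example; the exact definition is given in the paper.

```python
import math


def wind_speed(u, v):
    """Magnitude of the horizontal wind from its two components."""
    return math.hypot(u, v)


def wind_vector_rmse(u_forecast, v_forecast, u_truth, v_truth):
    """RMSE of the wind vector: both component errors enter each term."""
    squared = [
        (uf - ut) ** 2 + (vf - vt) ** 2
        for uf, vf, ut, vt in zip(u_forecast, v_forecast, u_truth, v_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


speed = wind_speed(3.0, 4.0)
vector_rmse = wind_vector_rmse([3.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(speed, vector_rmse)
```

Unlike the RMSE of wind speed, this quantity also penalizes errors in wind direction.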

How are the regions defined?

We adopted the regions from the ECMWF scorecards.

  • Tropics: -20° < lat < 20°

  • Extra-tropics: |lat| > 20°

  • Northern hemisphere: lat > 20°

  • Southern hemisphere: lat < -20°

  • Europe: 35° < lat < 75°, -12.5° < lon < 42.5°

  • North America: 25° < lat < 60°, -120° < lon < -75°

  • North Atlantic: 25° < lat < 60°, -70° < lon < -20°

  • North Pacific: 25° < lat < 60°, 145° < lon < -130° (the box wraps across the date line)

  • East Asia: 25° < lat < 60°, 102.5° < lon < 150°

  • AusNZ: -45° < lat < -12.5°, 120° < lon < 175°

  • Arctic: 60° < lat < 90°

  • Antarctic: -90° < lat < -60°
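A point-in-region test for these boxes might look like the sketch below. It transcribes only a few of the regions, assumes longitudes are expressed in or convertible to the range -180° to 180°, and treats a box whose western bound is larger than its eastern bound (as for the North Pacific) as wrapping across the date line. The names `REGIONS` and `in_region` are hypothetical.

```python
# (lat_min, lat_max), (lon_min, lon_max); None means all longitudes.
REGIONS = {
    "Tropics": ((-20.0, 20.0), None),
    "Europe": ((35.0, 75.0), (-12.5, 42.5)),
    "North Pacific": ((25.0, 60.0), (145.0, -130.0)),
    "Arctic": ((60.0, 90.0), None),
}


def in_region(name, lat, lon):
    """Return True if the point (lat, lon) lies inside the named box."""
    (lat_min, lat_max), lon_bounds = REGIONS[name]
    if not lat_min < lat < lat_max:
        return False
    if lon_bounds is None:
        return True
    lon_min, lon_max = lon_bounds
    # Normalise the longitude to the range [-180, 180).
    lon = (lon + 180.0) % 360.0 - 180.0
    if lon_min <= lon_max:
        return lon_min < lon < lon_max
    # The box crosses the date line: accept either side of it.
    return lon > lon_min or lon < lon_max


print(in_region("North Pacific", 40.0, 180.0))
print(in_region("Europe", 50.0, 350.0))
```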

What do “vs Analysis” and “vs ERA5” mean?

This refers to the ground truth against which the models were evaluated. Both are best guesses of the state of the atmosphere created by ECMWF’s 4D-Var data assimilation. ERA5 is a reanalysis dataset that was created in retrospect (i.e., it is not available in “real time”), while “analysis” refers to the operational best guess of the atmosphere that is used to initialize the forecast models (like IFS HRES). It is typical to evaluate ML models that were trained on ERA5 against ERA5 data and operational models against their own analysis. The differences matter most at early lead times. For a detailed discussion, please refer to the WeatherBench 2 paper.

What is the best model currently?

WeatherBench provides an objective, open, and reproducible benchmark for weather forecasts across a variety of metrics, rather than determining a single ‘best model.’ Weather forecasting is a multi-faceted problem with a variety of use cases, and no single metric fits all of them. Therefore, it is important to look at a number of different metrics and consider how the forecast will be applied. Note also that some of the data-driven models are initialized from conditions (namely ERA5) that would not be available for operational forecasts. Studies have suggested that models trained on ERA5 still perform well when initialized with operational initial conditions. Regardless, care must be taken when interpreting the results.

How good is the ground truth used? What about observations?

For the analysis here, we use the ERA5 reanalysis and ECMWF's operational analysis as ground truth (for the differences, see the previous question and the paper). Both are analysis products, meaning that they result from model simulations that are kept close to the available observations.

These (re-)analysis datasets are the best available guess of the global state of the atmosphere, but they are not equivalent to direct observations, e.g., from weather stations. Generally, analyses tend to represent the large-scale weather well. This means that upper-level variables like 500 hPa geopotential or 850 hPa temperature are well estimated. For near-surface weather, deviations from observations can be larger. Still, for variables like temperature and wind speed, (re-)analysis products provide reasonable estimates. In regions with large orographic variations, larger discrepancies with respect to station observations can occur.

Precipitation, on the other hand, is much more challenging. Firstly, high-quality precipitation observations from ground-based radars are only available for a small fraction of the globe. Furthermore, precipitation is not directly assimilated into ERA5 or the operational analysis. In other words, the "precipitation" values in the (re-)analysis products are biased towards what the underlying model thinks precipitation could be in the given weather situation rather than actual observations. For this reason, analysis datasets should not be used to evaluate precipitation forecasting skill. We still include some precipitation evaluation here, mainly to show the methodology, but advise against over-interpreting the results.

In future iterations of WeatherBench 2, we plan to add observation datasets, including better precipitation observations.

How can I use WeatherBench 2 and add my model on this website?

To get started with the evaluation code, check out the documentation. If you want your model to be added to WeatherBench 2, refer to this guide. Don’t hesitate to reach out via a GitHub issue.