A dataset of building footprints to support social good applications.
Building footprints are useful for a range of important applications, from population estimation, urban planning and humanitarian response, to environmental and climate science. This large-scale open dataset contains the outlines of buildings derived from high-resolution satellite imagery in order to support these types of uses. The project is based in Ghana, with an initial focus on the continent of Africa and new updates on South Asia and South-East Asia.
The dataset contains 817 million building detections, across an inference area of 39.1M km² within Africa, South Asia and South-East Asia.
For each building in this dataset we include the polygon describing its footprint on the ground, a confidence score indicating how sure we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.
More explanation is in the FAQ.
Potential use cases of the data include:
Building footprints are a key ingredient for estimating population density. In areas of rapid change, or where census information is out of date, population estimates are vital for many kinds of planning and statistics.
To plan the response to a flood, drought, or other natural disaster, it is useful to be able to assess the number of buildings or households affected. This is also useful for disaster risk reduction, e.g. to estimate the number of buildings in a particular hazard area.
Knowledge of settlement density is useful for understanding human impact on the natural environment. For example, it helps with estimating energy needs and carbon emissions in a certain area, or pressure on protected areas and wildlife due to urbanisation.
In many areas buildings do not have formal addresses, which can make it difficult for people to access social benefits and economic opportunities. Building footprint data can help with the rollout of digital addressing systems such as Plus Codes.
Knowing the density of population and settlements helps to anticipate demand for vaccines and the best locations for facilities. This data is also useful for precision epidemiology, as well as eradication efforts such as mosquito net distribution.
Buildings data can be used to help calculate statistical indicators for national planning, such as the number of houses in the catchment areas of schools and health centres, mean travel distances to the nearest hospital, or forecasts of demand for transportation systems.
Zoom or search to explore, and click on building outlines to see metadata. Alternatively, look at the data in Earth Engine.
Curious about why some buildings appear offset from the satellite imagery? Or why detection doesn't work as well in areas with dense or complex buildings? See details about data limitations and quality in the FAQs and technical report.
A deep learning model was trained to determine the footprints of buildings from high-resolution satellite imagery. Our accompanying technical report describes the methodology used to generate the first version of the dataset. We have, however, made further improvements for the current version, v2.
The data is shared under the Creative Commons Attribution (CC BY-4.0) license and the Open Data Commons Open Database License (ODbL) v1.0 license. As the user, you can pick which of the two licenses you prefer and use the data under the terms of that license.
We wanted to make the data compatible for ingestion by those working with ODbL-licensed datasets (namely the OpenStreetMap community) while enabling people who don't use ODbL licensing to use it under the terms of the CC BY-4.0 license. We hoped to take away the burden of figuring out whether the two licenses were compatible by simply releasing the dataset under both.
Yes. However, to maintain the quality of OSM, please be mindful of the need for human review when adding machine-generated features and, where possible, carry out that review with the benefit of local knowledge. Errors in the data to look out for include false detections and inaccurate shapes (see more about accuracy below). We also recommend starting by filtering out building detections that have a confidence score below the estimated 90% precision threshold.
The buildings on Google Maps come from a variety of sources, including the model used to generate this dataset. So there is some overlap, but the sets of footprints are not exactly the same.
As the imagery in Google Maps is updated over time, the specific images used to identify these buildings are not necessarily the same images that are currently published in Google Maps. If there is a misalignment between these two sets of imagery, buildings displayed in the data explorer map may appear to be offset from the underlying imagery.
You can view a timeline of the imagery for a specific area using the Historical Imagery feature in Google Earth Pro, which may show this offset between different images and dates. To learn a little more about satellite imagery offset see these sites (1, 2). Also see the technical report for details about data limitations and quality.
Despite having a diverse set of training data, some scenarios are challenging for the building detection pipeline, including: 1) geological or vegetation features which can be confused with built structures; 2) settlements with many contiguous buildings not having clear delineations; 3) areas characterised by small buildings, which can appear only a few pixels wide at the given image resolution; and 4) rural or desert areas, where buildings constructed with natural materials tend to visually blend into the surrounding area. See the technical report for more details.
The data is subject to both omission and commission errors, of these types:
Imagery completeness errors: for some areas, up-to-date satellite imagery may not have been available, or there were buildings on the ground that were not visible from the satellite image, or there was cloud cover.
Detection errors: estimated precision and recall curves for our detection model, based on a held-out test set, are as shown below. For more details, including confidence score thresholds, click on the plot. The tradeoff between false positives and false negatives varies between subsets of the test set. These subsets are grouped by the fraction of the example image that is covered by buildings: examples with less than 5% coverage are classified as low, between 5% and 20% as medium, and above 20% as high.
Our model sometimes wrongly detects buildings where there are actually rocks or vegetation features, for example.
By choosing the confidence score threshold at which buildings are filtered out, the tradeoff between precision and recall can be controlled. We provide suggested thresholds with each download tile to obtain estimated 80% and 90% precision levels.
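As a rough illustration of this tradeoff, the sketch below computes precision and recall for a list of scored detections at a chosen threshold. The function name, the sample scores and the labels are all made up for illustration. It also makes a simplifying assumption: recall is measured only against true buildings that have a detection at some score, so buildings missed entirely by the model are not counted.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float],
    is_building: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (precision, recall) when keeping detections with score >= threshold.

    Simplification: recall is measured only against true buildings that appear
    in the detection list, so buildings with no detection at all are ignored.
    """
    if len(scores) != len(is_building):
        raise ValueError("scores and is_building must have the same length")
    # Labels of the detections that survive the threshold.
    kept = [label for score, label in zip(scores, is_building) if score >= threshold]
    true_positives = sum(kept)
    total_buildings = sum(is_building)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_buildings if total_buildings else 1.0
    return precision, recall


# Made-up scores and labels, for illustration only.
EXAMPLE_SCORES = [0.95, 0.90, 0.80, 0.70, 0.65]
EXAMPLE_LABELS = [True, True, False, True, False]

for t in (0.6, 0.85):
    p, r = precision_recall(EXAMPLE_SCORES, EXAMPLE_LABELS, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy data, raising the threshold removes both false detections but also drops one real building, which is the tradeoff described above.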
Yes, see plots below. To address this we provide a CSV file with suggested score thresholds to obtain specific precision levels for each download tile.
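A minimal sketch of how the per-tile thresholds might be loaded into a lookup table is shown here. The column names `tile_id` and `threshold_90` and the sample values are hypothetical placeholders, not the real header of the thresholds file; check the downloaded CSV for the actual column names and pass them in.

```python
import csv
import io
from typing import Dict, TextIO


def load_thresholds(
    csv_file: TextIO,
    tile_column: str,
    threshold_column: str,
) -> Dict[str, float]:
    """Map each tile identifier to its suggested score threshold."""
    reader = csv.DictReader(csv_file)
    return {row[tile_column]: float(row[threshold_column]) for row in reader}


# Made-up sample with hypothetical column names, for illustration only.
SAMPLE_CSV = "tile_id,threshold_90\ntile_a,0.72\ntile_b,0.68\n"

thresholds = load_thresholds(io.StringIO(SAMPLE_CSV), "tile_id", "threshold_90")
print(thresholds["tile_a"])
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample.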
Dataset freshness is determined by the availability of the high-resolution source imagery that we use to detect buildings. While we have tried to include the most recent images possible, particularly in populated areas, in some cases the most recent image for a location was several years old or not available to us at all. To check freshness for a particular area, use the Historical Imagery function in Google Earth Pro, which shows the specific dates and imagery (check for imagery before the inference date given in the version history below). Furthermore, we have not processed imagery for the entire continent: to check whether a particular region has been included, use the dataset explorer map above, which visualises all buildings in the dataset.
We filtered detections to include only those with confidence score 0.6 or greater. Depending on the application, it may be necessary to filter at a higher threshold (e.g. with the score thresholds above to achieve 90% precision).
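Filtering at a higher threshold can be sketched as follows. This assumes each building row has been parsed into a dictionary (for example with `csv.DictReader`) and that the confidence score is stored under a key named `confidence`; that key name and the sample rows are assumptions for illustration, so adjust them to match the actual CSV header.

```python
from typing import Dict, Iterable, Iterator

# Minimum confidence score already applied to the published dataset.
DATASET_MIN_CONFIDENCE = 0.6


def filter_by_confidence(
    rows: Iterable[Dict[str, str]],
    threshold: float,
    confidence_key: str = "confidence",  # assumed column name
) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose confidence score is at least `threshold`."""
    for row in rows:
        if float(row[confidence_key]) >= threshold:
            yield row


# Made-up rows, for illustration only.
EXAMPLE_ROWS = [
    {"id": "a", "confidence": "0.62"},
    {"id": "b", "confidence": "0.75"},
    {"id": "c", "confidence": "0.91"},
]

kept = list(filter_by_confidence(EXAMPLE_ROWS, threshold=0.75))
print([row["id"] for row in kept])
```

Because the function is a generator, it can be applied to a large tile row by row without holding the whole file in memory.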
We currently provide 3 options:
The data is organised into tiles that can be directly downloaded. Alternatively, the example Colab notebook shows how to download data for a specific region, given the geometry of the area of interest.
The underlying satellite imagery is not part of this dataset. However, the source imagery used for detections can be viewed in Google Earth Pro. Different time frames can be viewed using the Historical Imagery function.
In this Colab notebook, we demonstrate some analysis methods on the data for a specific country or region:
We hope to continue improving this dataset, both by refreshing it using new source imagery and by refining the detection model to improve accuracy. Based on community feedback, we may extend the dataset to new areas or add additional features: please send any queries or requests using the contact details below.
Like the current version, v2, the initial version will be hosted in the cloud and can be accessed using the same commands, with v2 replaced by v1.
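The substitution can be sketched as a small helper that rewrites the version segment of a v2 path. The helper name is hypothetical, and the example uses the thresholds path `gs://open-buildings-data/v2/score_thresholds_s2_level_4.csv`; whether every v2 file has a v1 counterpart at the rewritten path is an assumption that should be checked before relying on it.

```python
V2_THRESHOLDS_PATH = "gs://open-buildings-data/v2/score_thresholds_s2_level_4.csv"


def path_for_version(v2_path: str, version: str) -> str:
    """Rewrite the '/v2/' segment of a dataset path to another version."""
    marker = "/v2/"
    if marker not in v2_path:
        raise ValueError(f"expected {marker!r} in path: {v2_path}")
    # Replace only the first occurrence so later path parts are untouched.
    return v2_path.replace(marker, f"/{version}/", 1)


print(path_for_version(V2_THRESHOLDS_PATH, "v1"))
```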
The current version, v2, adds new regions in South and South-East Asia, has improved accuracy, and is based on more up-to-date satellite imagery. Click to view differences in precision-recall curves. The test set was grouped by the fraction of the example image that is covered by buildings, such that examples with < 5% are classified as low, between 5% and 20% are medium and above 20% are high.
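The grouping rule can be written as a short function. This is only a sketch with a hypothetical function name; how values of exactly 5% and 20% are handled is not stated in the text, so placing both in the medium group is an assumption.

```python
def density_bucket(building_fraction: float) -> str:
    """Classify an example image by the fraction of it covered by buildings."""
    if not 0.0 <= building_fraction <= 1.0:
        raise ValueError("building_fraction must be between 0 and 1")
    if building_fraction < 0.05:
        return "low"
    # Assumption: the 5% and 20% boundaries both fall in the medium group.
    if building_fraction <= 0.20:
        return "medium"
    return "high"


for fraction in (0.01, 0.10, 0.50):
    print(fraction, density_bucket(fraction))
```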
Additionally, we now provide points data; see Data Format for details.
If this dataset is useful, please consider citing our technical report:
W. Sirko, S. Kashubin, M. Ritter, A. Annkah, Y.S.E. Bouchareb, Y. Dauphin, D. Keysers, M. Neumann, M. Cisse, J.A. Quinn. Continental-scale building detection from high resolution satellite imagery. arXiv:2107.12283, 2021.
Please contact email@example.com with any feedback.
The dataset consists of 3 parts: building polygons, building points and score thresholds.
Building polygons and points are stored in spatially sharded CSVs with one CSV per S2 cell level 4. Each row in the CSV represents one building polygon or point and has the following columns:
The estimated score thresholds are stored as one CSV. Each row in the CSV represents one S2 cell level 4 bucket and has the following columns:
The polygon data (73 GB total) is composed of a set of CSV files, one per level 4 S2 cell, each up to 5.4 GB in size. Similarly, the points data (20 GB total) consists of files of up to 1.52 GB each. Select a download method below.
To manually download polygons data for a specific cell, click on the map below.
This Colab notebook shows how data can be downloaded for a specific country or region.
Download all polygons (73 GB total) using
gsutil cp -R
and all points (20 GB total) using
gsutil cp -R
Metadata files can be downloaded as follows:
gsutil cp gs://open-buildings-data/v2/score_thresholds_s2_level_4.csv
v1: inference carried out during April 2021 on imagery covering 19.4M km² of Africa.
v2: inference carried out during August 2022 on imagery covering 39.1M km² of Africa, South and South-East Asia.
See FAQ for comparison of versions.