The Data Cards Playbook

Text, images, code, tables, or visualizations should each have a reason to be presented in their entirety: to help readers justify their decisions without shifting responsibility entirely onto them.

The Data Card should always be an accurate and truthful representation of the dataset that advances the reader’s mental model of the dataset and its implications in practice. This is critical to building long-term trust and accountability with readers of Data Cards.

Provide readers with facts that can both persuade and dissuade them in their decisions.

This reduces selection bias and unintended misdirection or omissions that can lead to preventable misuse. Note, however, that this is not the same as providing contradictory information, which will only reduce the credibility of your Data Card.

The authors of the Translated Wikipedia Biographies Data Card wanted readers to gain a clear sense of the gender and geographic diversity represented in their dataset. However, gender that is not self-reported can be a misleading construct. To prevent misuse of the dataset, the authors clearly state that non-binary individuals are not represented in it, and explain the reasons behind that decision.

Information framed at unsuitable fidelities or without the right context forces readers to make assumptions about the dataset.

Provide sufficient context and use language that is suitable for your readers' proficiency levels. This could mean annotating example datapoints with clear descriptions, or providing descriptions with increasing amounts of detail for more advanced users.
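As a rough illustration of descriptions with increasing amounts of detail, the sketch below pairs one example datapoint with an ordered list of descriptions and returns the one that matches a reader's proficiency level. The class name `AnnotatedExample`, its fields, the `describe` method, and the placeholder datapoint are all hypothetical; they are not part of the Data Cards Playbook or of any particular Data Card.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedExample:
    """One example datapoint plus descriptions at increasing levels of detail.

    Hypothetical structure for illustration only.
    """
    datapoint: dict
    # Descriptions ordered from least detailed (index 0) to most detailed.
    descriptions: list = field(default_factory=list)

    def describe(self, level: int) -> str:
        """Return the description for a proficiency level.

        Levels outside the available range are clamped to the nearest one.
        """
        if not self.descriptions:
            return ""
        index = max(0, min(level, len(self.descriptions) - 1))
        return self.descriptions[index]


# Placeholder datapoint and descriptions, invented for this sketch.
example = AnnotatedExample(
    datapoint={"text": "An example sentence.", "label": "example_label"},
    descriptions=[
        "A single sentence with a category label.",
        "A single sentence; 'label' is the category assigned by an annotator.",
    ],
)

print(example.describe(0))  # least detailed description
print(example.describe(1))  # most detailed description
```

Keeping the descriptions in one ordered list means a card can show the short version by default and reveal more detail on request, without maintaining separate copies of the example.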

GEM is an NLP benchmarking environment that provides task-specific datasets for model evaluation. The GEM Sportsett Basketball Dataset Data Card relies on a mix of references, explanations, and examples to support the claims in its answer. These details minimize reader assumptions about why this dataset was included in GEM's collection of datasets.

In contrast, the E2E NLG Dataset Data Card in GEM answers the same question with high-level, subjective descriptions that are open to interpretation and offers no further explanation. The description is neither as concise nor as contextualized, which forces readers to make large assumptions.
