The Data Cards Playbook

Reader-induced assumptions, along with other missing or unknown information, can create a fragmented and underwhelming reading experience. Adding cues and context across your Data Card can help readers establish a more accurate understanding of your dataset.

Readers will not always know which questions were originally answered and which were left out. Without the context needed to interpret results and claims correctly, readers may fill in the blanks with assumptions, which runs counter to informed decision-making.

Context supports accuracy in interpretation and reproducibility.

Surface the underlying qualitative and anecdotal assumptions, conditions of analysis, seeds, alternative considerations, and any other justifications that could have changed the results of the analysis. Doing so supports reproducibility of analyses, enhances readers’ understanding of the results, and can set the stage for future research and investigations.
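As a minimal sketch of how this context could be captured alongside a result, the Python below records a question, a seed, and the surrounding assumptions in one structure. It assumes that "seeds" refers to random seeds. The `AnalysisContext` class, its field names, the `sample_for_review` helper, and all example values are hypothetical illustrations and are not part of the Data Cards template.

```python
import json
import random
from dataclasses import dataclass, field, asdict


@dataclass
class AnalysisContext:
    """Hypothetical record of the context behind one reported result."""
    question: str  # the question the analysis answers
    seed: int      # random seed used, so the run can be repeated
    conditions: list[str] = field(default_factory=list)    # conditions of analysis
    assumptions: list[str] = field(default_factory=list)   # qualitative or anecdotal assumptions
    alternatives: list[str] = field(default_factory=list)  # options considered but not taken


def sample_for_review(items, k, seed):
    """Draw a reproducible sample: the same seed always yields the same items."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# Illustrative values only; none of these come from a real dataset.
context = AnalysisContext(
    question="How often do labels disagree between annotators?",
    seed=1234,
    conditions=["Sampled 5 of 100 items for manual review"],
    assumptions=["Items are assumed to be independent of each other"],
    alternatives=["Reviewing all items was considered but not done"],
)

items = list(range(100))
sample = sample_for_review(items, k=5, seed=context.seed)

# Text that could be placed in a Data Card next to the reported result.
print(json.dumps(asdict(context), indent=2))
```

Keeping the seed in the same record as the assumptions means a reader who reruns the sampling step with that seed gets the same sample, and can see which conditions would need to change to produce a different result.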

Objective information (such as the statistical attributes of labels in a dataset) is often easier to document than subjective information (such as the labeling instructions or policies) or speculative information (such as potentially unsafe uses of labels). However, being easier to document doesn’t guarantee that the information will be understood accurately. Help readers parse objective information by providing supportive implicit context, and vice versa.

The difference between the Limited Context and Summarized Context versions of the WikiDialog Data Card demonstrates how summaries and context can make content more accessible to a wider audience. Compared with the Limited Context version, the Summarized Context version makes the same content appear more transparent and credible, because the analysis setup and results are shown to be rooted in scientific rigor.

An answer should clearly reflect the question asked.

Readers should be able to infer the questions that you are trying to answer from your content, without you having to restate each question. This is particularly important if your Data Card does not have any visual or formatting cues (such as subtitles or headings) that organize answers.

A good rule of thumb is that a reader should be able to convert an answer back to its original question with minimal deviations. Alternatively, introduce cues that readers can use to understand and contextualize your responses.

The WikiDialog Data Card describes the criteria that make a use case unsuitable, but does not provide any clear examples of unsuitable use cases. A reader could easily assume that the question asked was "Was this dataset created in accordance with Google's AI Principles?" or "What AI principles were considered in the creation of the dataset?". In contrast, the More Inclusive Annotated People Data Card clearly describes unsafe use cases and the reasons they are unsafe.
