Use Hypotheticals with Caution

Some explanations require you to reveal the context of a data point, which can be problematic if your dataset has access or license restrictions, contains protected data, or requires you to obfuscate individual data points.

This is common in healthcare and finance. In other cases, you may need to explain a combination of behaviors that no single data point can illustrate on its own. In either situation, you can create “hypothetical” data points for focused explanations, but with the necessary cautions. Hypothetical data points designed to illustrate these behaviors are a good alternative, provided they are crafted with care and clearly described as fictional data points created specifically for examples and explanations.

Use hypotheticals to convey diversity in data.

When explaining a mechanic or behavior that would require you to reveal the details of a protected data point, create hypotheticals that represent both typical and outlier data points in your dataset and that support your explanation.

Use a similarity metric to verify that your hypothetical is reasonably representative of a significant number of data points. Always let readers know that these are hypotheticals, and be open to feedback if readers point out the inapplicability of your hypothetical in their contexts.
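One way to run such a check is sketched below. It is a minimal illustration, not a prescribed method: it assumes purely numeric features, z-scores each column, and uses Euclidean distance as the similarity metric. The function names (`standardize`, `coverage`), the `radius` threshold, and the sample rows are all hypothetical and would need to be adapted to your dataset.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; return the scaled rows and per-column (mean, std)."""
    cols = list(zip(*rows))
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    scaled = [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]
    return scaled, stats


def coverage(hypothetical, rows, radius=1.0):
    """Fraction of real rows within `radius` of the hypothetical.

    Distance is Euclidean in z-scored feature space (an assumed metric).
    """
    scaled, stats = standardize(rows)
    h = [(v - m) / s for v, (m, s) in zip(hypothetical, stats)]
    near = sum(1 for r in scaled if math.dist(h, r) <= radius)
    return near / len(rows)


# Made-up rows of (age, income) used only to exercise the functions.
real_rows = [[30, 50000], [32, 52000], [31, 51000], [29, 49000], [70, 200000]]
typical_share = coverage([31, 51000], real_rows)   # hypothetical near the cluster
outlier_share = coverage([70, 200000], real_rows)  # hypothetical near the outlier
print(typical_share, outlier_share)
```

A high coverage value suggests the hypothetical resembles many real data points, while a low value marks it as an outlier-style example, which should be described as such to readers. What counts as a "significant number" of data points remains a judgment call for the dataset's authors.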

Hypotheticals that don’t mimic actual data points can have unintended effects.

A hypothetical that is not representative of the dataset in real-world contexts can suggest or amplify misinformation, and can mislead readers about the mechanics and dynamics of the dataset when they use it.

Hypotheticals must be carefully crafted and bounded in their proximity to the real world. Don’t use hypotheticals in any analysis that claims to generalize over the entire dataset. Always explain the logic used to create each hypothetical and the rationale for using a hypothetical in the first place.

The FIT400M dataset is an internal dataset at Google. To give external readers clarity on what the dataset contains, the authors generated content for their Data Card that illustrates the content of its data points. The structure of the data points is further described in the accompanying table.