Foundation

Thoughtfully Analyze Your Dataset

Analyses are a reflection of more than your dataset.

Overview

Data is information about people, cultures, or business contexts that has been captured in a structured way for a specific purpose. However, people, cultures, and businesses are nuanced, and each is described by several dimensions at once, to varying degrees. An analysis of a dataset therefore offers a window into the thought that has gone into the dataset itself.

An intersectional analysis of people, for example, can explore the combinations of human factors within a dataset to identify potential disproportionate outcomes, such as when a model trained on a dataset performs better for one subgroup than for others. A disaggregated analysis breaks down the dataset based on different factors to reveal important patterns for subgroups or marginalized populations that are typically masked by larger, aggregate data, so readers can anticipate outcomes.
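
As a minimal sketch of the difference between an aggregate, a disaggregated, and an intersectional view, the Python below groups a small set of model predictions first by one factor and then by a combination of two. The records, the factor names (`region`, `age_group`), and the `accuracy_by` helper are hypothetical and invented for illustration; they are not part of any Data Cards tooling.

```python
from collections import defaultdict

# Hypothetical evaluation records: two factors plus whether the model's
# prediction was correct. All values are invented for illustration.
RECORDS = [
    {"region": "north", "age_group": "18-40", "correct": True},
    {"region": "north", "age_group": "18-40", "correct": True},
    {"region": "north", "age_group": "41+", "correct": True},
    {"region": "north", "age_group": "41+", "correct": True},
    {"region": "south", "age_group": "18-40", "correct": True},
    {"region": "south", "age_group": "18-40", "correct": True},
    {"region": "south", "age_group": "41+", "correct": False},
    {"region": "south", "age_group": "41+", "correct": False},
]


def accuracy_by(records, factors):
    """Return {group_key: (accuracy, count)} for each combination of factors."""
    totals = defaultdict(lambda: [0, 0])  # group key -> [n_correct, n_total]
    for record in records:
        key = tuple(record[f] for f in factors)
        totals[key][0] += int(record["correct"])
        totals[key][1] += 1
    return {key: (c / n, n) for key, (c, n) in totals.items()}


overall = accuracy_by(RECORDS, [])                       # aggregate, keyed by ()
by_region = accuracy_by(RECORDS, ["region"])             # disaggregated
by_both = accuracy_by(RECORDS, ["region", "age_group"])  # intersectional

for name, result in [("overall", overall), ("region", by_region), ("both", by_both)]:
    for key, (acc, n) in sorted(result.items()):
        print(f"{name:8} {key!s:22} accuracy={acc:.2f} n={n}")
```

In this made-up data the aggregate accuracy looks acceptable, each single factor shows a moderate gap, and only the combination of both factors shows a subgroup for which every prediction is wrong. Reporting the group size next to each metric helps readers judge how much weight a small slice can bear.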

Intersectional and disaggregated analyses (IDA) are effective ways to communicate a range of plausible outcomes under different circumstances in a Data Card by establishing clear relationships in a dataset. IDA can offer readers vital clues about the representation in your dataset (such as how labels correlate with sensitive entities), gaps in your dataset (e.g. the dataset only has photographs taken in daytime), and the relationships between variables that can subsequently cause AI models to learn spurious correlations or pick up on proxies. These analyses become even more useful when they are situated in real-world circumstances that reflect the experience impacted users may have with a product or service that uses your dataset.
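
One way to surface clues of this kind is a simple cross-tabulation of labels against another factor, together with a check for combinations that never occur. The sketch below uses only the Python standard library. The rows, the column names (`label`, `time_of_day`), the lists of expected values, and the helper functions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical image metadata rows; all values are invented for illustration.
ROWS = [
    {"label": "pedestrian", "time_of_day": "day"},
    {"label": "pedestrian", "time_of_day": "day"},
    {"label": "pedestrian", "time_of_day": "day"},
    {"label": "vehicle", "time_of_day": "day"},
    {"label": "vehicle", "time_of_day": "night"},
    {"label": "vehicle", "time_of_day": "night"},
]


def cross_tab(rows, a, b):
    """Count how often each (row[a], row[b]) pair occurs."""
    return Counter((row[a], row[b]) for row in rows)


def missing_cells(counts, a_values, b_values):
    """List expected (a, b) combinations that have no examples (coverage gaps)."""
    return [pair for pair in product(a_values, b_values) if counts[pair] == 0]


def label_share(counts, b_value):
    """Share of each label among rows where the second factor equals b_value."""
    total = sum(n for (_, b), n in counts.items() if b == b_value)
    return {a: n / total for (a, b), n in counts.items() if b == b_value}


counts = cross_tab(ROWS, "label", "time_of_day")
gaps = missing_cells(counts, ["pedestrian", "vehicle"], ["day", "night"])

print("gaps:", gaps)
print("labels at night:", label_share(counts, "night"))
print("labels in daytime:", label_share(counts, "day"))
```

Here the gap check reports that there are no night-time pedestrian examples, and the night-time label share shows a single label. A model trained on such data could learn to treat time of day as a proxy for the label, which is the kind of relationship worth stating in a Data Card.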

Presenting IDA results in a Data Card helps readers proactively build an intuition about how their ML model, for example, will perform on subsets (also known as slices) in your dataset. While this requires dataset creators to be more diligent in their analysis of the dataset and its presentation in the Data Card, it can ultimately lead to better product outcomes for your stakeholders.

Key Takeaways

  • Intersectional and disaggregated analyses can help readers better intuit how to use your dataset in their models.
  • Work with experts, product teams, and individuals with lived experience to frame disaggregated or intersectional analyses.
  • Intersectional and disaggregated analyses are often rooted in contexts that need to be explained to readers, or that require additional support so readers can interpret the results appropriately.

Actions

  1. Explore before you begin your analysis. Develop an intuition for the skews and imbalances in your dataset by exploring it in a tool such as TensorFlow Data Validation (TFDV) or Know Your Data (KYD), or in the context of a model using the Learning Interpretability Tool (LIT). Use the results to inform your analysis design.
  2. Design your analysis carefully. The results of these analyses are heavily influenced by the goals of your evaluation, the access to expertise and resources to conduct the analysis, when and where you conduct the analysis, and the contexts of the AI models in which the analysis is conducted.
  3. Start with factors relevant to your intended use. Align on demographic, sociocultural, behavioral, and morphological factors that can most affect your intended use cases when creating groups of interest, and branch out from there.
  4. Report, don’t comment. Note that factors and assumptions that affect fairness analyses exist in historically and culturally specific social constructs that are hard to quantify. Be cautious of adding commentary that can confuse the reader. Instead, provide ways to reproduce analyses that can help readers calibrate results in their own context.
  5. Plan for the future. Account for additional factors that may crop up in the future by looking at the representation in your dataset, keeping values constant across different scenarios, or combining your analysis with a range of values of additional factors relevant to your dataset.
  6. Non-reproducible results require more context. If metrics cannot be reproduced by downstream stakeholders, provide enough context around the analysis. If a reader can use this information to weigh the pros and cons of the dataset, it can build trust in the dataset.
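
Actions 4 and 6 both depend on giving readers enough detail to rerun an analysis, or at least to place its results in context. One possible approach, sketched below with the Python standard library, is to publish the analysis settings and a fingerprint of the data alongside the results. The report fields, the `min_group_size` threshold, and both helper functions are hypothetical choices for illustration, not a prescribed Data Card format.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """SHA-256 over a canonical JSON form of the rows.

    Rows are serialized with sorted keys and then sorted, so the fingerprint
    does not depend on row order or on key order within a row.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def build_report(rows, factors, min_group_size):
    """Bundle group counts with the settings needed to reproduce them."""
    counts = {}
    for row in rows:
        key = "|".join(str(row[f]) for f in factors)
        counts[key] = counts.get(key, 0) + 1
    return {
        "factors": list(factors),
        "min_group_size": min_group_size,  # assumed threshold, set by the analyst
        "n_rows": len(rows),
        "dataset_sha256": dataset_fingerprint(rows),
        "group_counts": dict(sorted(counts.items())),
        # Small groups are flagged rather than hidden, so readers can see them.
        "small_groups": sorted(k for k, n in counts.items() if n < min_group_size),
    }


# Hypothetical rows, invented for illustration.
rows = [
    {"region": "north", "age_group": "18-40"},
    {"region": "north", "age_group": "18-40"},
    {"region": "south", "age_group": "41+"},
]
report = build_report(rows, ["region", "age_group"], min_group_size=2)
print(json.dumps(report, indent=2))
```

A reader who has access to the same data can recompute the fingerprint and the counts to confirm they match. A reader who does not can still see which factors were used, how large each group was, and which groups were too small to support firm conclusions.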

Considerations

  • What are the goals of your analysis? Do you have the necessary expertise and resources to run a comprehensive and accurate analysis?
  • If a stakeholder or third party were to reproduce your analysis, what information would they need to see in your Data Card? Does the Data Card contain sufficient background information for stakeholders to arrive at an accurate conclusion about your dataset in a comparative study?
  • In presenting the results of your analysis, have you clearly described the rationale behind the groups of interest, factors, and assumptions that shaped the design of the analysis?
