What’s in the data?
Author(s): Trupti Bavalatti
Originally published on Towards AI.
Text-to-image (T2I) generative AI models have revolutionized content creation by transforming text into photorealistic and imaginative images. Although they are widely adopted across industries, these models struggle to align their outputs with ethical and societal norms because they are trained on large-scale data scraped from the internet, which inevitably contains harmful text-image pairs and encodes stereotypes. Misuse of such models to generate unsafe content can be mitigated by implementing a robust safety stack that includes filtering training data, fine-tuning models, and applying post-training mitigations such as prompt and image classifiers.
Academic research on T2I safety mostly relies on public datasets of labeled prompt-to-image pairs. These datasets, annotated for safety violations, are key to training and evaluating safety components, so their composition is critical. Studying it ensures balanced representation of harmful categories, surfaces biases or coverage gaps, and highlights labeling inconsistencies that could affect model performance. A thorough analysis also helps researchers interpret findings accurately, avoiding over-generalizations and ensuring that research conclusions align with the dataset’s strengths and limitations.
This article examines the strengths and limitations of one popular dataset, the “Inappropriate Image Prompts (I2P)” dataset [1], which is designed to evaluate the safety of image generation models. It contains 4,700 real-world prompts covering categories such as nudity, violence, and other potentially harmful content. We will look at how diverse the dataset is in terms of harm-class composition, the topics most represented in it, and its syntactic and semantic diversity.
Coverage of harm
To ensure consistent labeling against a standard taxonomy, the author-assigned labels in the dataset were discarded and the AIR taxonomy [2] was used instead. AIR is a comprehensive taxonomy that unifies classes of harm from the content guidelines of tech companies such as OpenAI and Meta, as well as from several government regulations. By giving ChatGPT a system prompt that asks it to classify each I2P prompt and map it to an L2 category in the AIR taxonomy, we obtain the coverage of harm classes shown in Figure 1.
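To make the labeling step concrete, here is a minimal sketch of how such a classification pass could be run with the OpenAI API. The system prompt, the category list, and the `gpt-4o-mini` model name below are illustrative assumptions, not the exact configuration used to produce Figure 1.

```python
# Sketch of the labeling step: ask ChatGPT to map each I2P prompt to an
# AIR L2 harm category. The system prompt and category list are
# illustrative placeholders, not the exact ones used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative subset of AIR L2 categories; the full taxonomy is larger.
AIR_L2_CATEGORIES = [
    "Sexual Content", "Violence & Extremism", "Hate/Toxicity",
    "Self-Harm", "Child Harm", "Criminal Activities",
    "Misinformation", "Harmful - Other",
]

SYSTEM_PROMPT = (
    "You are a content-safety annotator. Classify the user-supplied "
    "text-to-image prompt into exactly one of these AIR L2 categories: "
    + ", ".join(AIR_L2_CATEGORIES)
    + ". Respond with the category name only."
)

def classify_prompt(prompt: str) -> str:
    """Return the AIR L2 label ChatGPT assigns to a single I2P prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",           # placeholder; any chat model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=0,                 # deterministic labels for auditing
    )
    return response.choices[0].message.content.strip()
```

Tallying the returned labels over all 4,700 prompts (for example with `collections.Counter`) yields a coverage chart like the one described in Figure 1.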
While sexual content makes up the majority of the prompts, prompts that result in violent or hateful images also have high representation, followed by self-harm and child-harm and then a long tail of other harm classes. Critical areas that lack adequate coverage include discrimination against protected groups, criminal activities, social issues, children, and self-harm (the last two are especially concerning given the severity of their real-world impact). Moreover, prompts related to dehumanization, political misinformation, terrorism, fraud, and eating disorders are severely underrepresented. This imbalance not only limits models’ ability to identify and mitigate diverse forms of harm but also raises concerns about their ability to generalize to real-world scenarios. The narrow focus of existing datasets can lead to overfitting to specific patterns of harm, reduced effectiveness in identifying novel or subtle forms of harmful content, and potential biases when dealing with diverse cultural contexts.
The large share of the harmful-other label from ChatGPT tells us that the guidelines the authors followed are inconsistent with what is available in the far more comprehensive AIR taxonomy. In fact, if you read the AI content generation policies across tech companies, they differ significantly in both breadth of coverage and interpretation of harm. Broad industry efforts, such as the formation of MLCommons, a collaboration between experts from different tech companies and academia, are critical in defining a consistent, agreed-upon taxonomy. Such initiatives can streamline safety guidelines, ensuring that datasets and models across organizations align on what constitutes harm and how it is categorized. This uniformity would reduce ambiguity in labeling and evaluation, foster collaboration across the industry, and ultimately help models adhere to ethical and societal norms on a global scale.
Topics in the prompts
To identify topics in the prompts, we can analyze the most frequent words and bigrams. The analysis of the most frequent words in the dataset reveals several noteworthy patterns, as illustrated in Figure 2. The word “woman” appears with remarkable frequency, along with “man” and “body,” which can be partially attributed to the dataset’s focus on sexual content. Examining bigrams (two-word sequences) further reinforces the observations from the individual word analysis. Certain bigrams representing adult actor names (redacted) appear with notable frequency, suggesting a potential over-representation of certain sexual-content scenarios or individuals in the dataset. Bigrams such as “highly detailed” or the artist name “greg rutkowski” reflect the many stylization prompts that explicitly ask for images “in the style of” someone, underscoring specific thematic biases.
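A short sketch of this frequency analysis, assuming the I2P prompts have been loaded into a list of strings (the tokenizer and stopword list below are simplified placeholders):

```python
# Count the most frequent unigrams and bigrams across all prompts.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "by", "with", "on", "for", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase a prompt and split it into word tokens, dropping stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def top_ngrams(prompts: list[str], n: int = 1, k: int = 20) -> list[tuple]:
    """Return the k most frequent n-grams across all prompts."""
    counts = Counter()
    for prompt in prompts:
        tokens = tokenize(prompt)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Toy example; a real run would pass all 4,700 I2P prompts.
prompts = ["portrait of a woman, highly detailed, in the style of greg rutkowski"]
print(top_ngrams(prompts, n=1))  # most frequent words
print(top_ngrams(prompts, n=2))  # most frequent bigrams
```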
Syntactic diversity of prompts
Syntactic diversity looks at the variety in linguistic structure. The patterns of words, phrases, and sentences show how often different sequences of words or structures appear; higher diversity indicates fewer repetitions and a broader range of language forms. Syntactic diversity can be measured by calculating n-gram distinctness within a dataset. This approach evaluates each prompt separately and calculates the proportion of unique n-grams in the prompt relative to the total number of n-grams present. The intra-distinctness score provides insight into the syntactic variability of individual prompts, indicating how varied language patterns are within single prompts. Examining the distribution of intra-distinctness scores across all prompts allows us to see how diverse prompts are, on average, within a dataset. Figure 4 illustrates these scores for the I2P dataset.
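A minimal sketch of the intra-distinctness computation described above, assuming whitespace tokenization (a real analysis would use a proper tokenizer):

```python
# Per-prompt distinct-n: the proportion of unique n-grams relative to all
# n-grams in that prompt. A histogram of these scores gives a distribution
# like the one described for Figure 4.
def distinct_n(tokens: list[str], n: int) -> float:
    """Fraction of unique n-grams in a single tokenized prompt."""
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    if not ngrams:          # prompt shorter than n tokens
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def intra_distinct_scores(prompts: list[str], n: int) -> list[float]:
    """Per-prompt distinct-n scores for a list of prompts."""
    return [distinct_n(p.lower().split(), n) for p in prompts]

# Example: unigram, bigram, and trigram score distributions on toy prompts.
prompts = ["a dark dark alley at night", "portrait of a knight, highly detailed"]
for n in (1, 2, 3):
    print(n, intra_distinct_scores(prompts, n))
```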
All three (unigram, bigram, trigram) curves have peaks near a score of 1, indicating that many prompts in the dataset are highly similar in terms of their token-level structure. This suggests low lexical diversity within the dataset, as prompts are often repetitive or use similar language patterns. The unigram curve (blue) is broader and slightly more spread out compared to bigrams (orange) and trigrams (green), indicating greater diversity at the single-word level than at higher n-gram levels. The bigram and trigram curves are more sharply peaked, showing that as the n-gram complexity increases, the dataset becomes less diverse in its combinations of word pairs and triplets. Thus, the lack of diversity at higher n-gram levels could limit the robustness of models trained on this dataset, particularly in handling unseen or semantically complex inputs.
Semantic diversity of the prompts
Semantic diversity examines the range of meanings across prompts. By comparing meanings using language-model embeddings, we can see how conceptually distinct or related prompts are; higher diversity means a wider spectrum of ideas, topics, and themes. Prompt embeddings and cosine distance metrics capture the conceptual distinctness between prompts by analyzing their semantic relationships in a high-dimensional embedding space. For embedding generation, we can use the all-MiniLM-L6-v2 model, a lightweight yet effective sentence transformer that produces 384-dimensional dense vector representations of text. The model captures semantic relationships in short text sequences well while remaining resilient to syntactic variations, making it well suited for analyzing the semantic content of text-to-image prompts.
For two prompt embeddings a and b, the cosine similarity is defined as cos(a, b) = (a · b) / (‖a‖ ‖b‖), and the cosine distance is 1 − cos(a, b).
Semantic diversity is measured using cosine distance, calculated for each pair of prompts using the equation in Figure 5. The distribution of average pairwise cosine distance for the I2P dataset is shown in Figure 6.
A high average cosine distance (closer to 1) suggests that the dataset contains semantically diverse prompts, meaning that the textual prompts likely represent a broad range of topics or ideas. This is critical for ensuring that the dataset covers a variety of use cases, edge cases, and harmful content categories.
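A sketch of how these scores could be computed with the sentence-transformers library; the toy prompts below are placeholders, and a full run would use all 4,700 I2P prompts:

```python
# Embed every prompt with all-MiniLM-L6-v2, then compute each prompt's
# average cosine distance to all other prompts. Plotting these averages
# gives a distribution like the one described for Figure 6.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim sentence embeddings

prompts = [
    "a photorealistic portrait of a woman",
    "an oil painting of a medieval battle",
    "a cute cat sleeping on a windowsill",
]

# Normalized embeddings make the dot product equal to cosine similarity.
embeddings = model.encode(prompts, normalize_embeddings=True)

similarity = embeddings @ embeddings.T            # pairwise cosine similarity
distance = 1.0 - similarity                       # pairwise cosine distance
np.fill_diagonal(distance, 0.0)                   # ignore self-distance

# Average pairwise cosine distance for each prompt (excluding itself).
avg_distance = distance.sum(axis=1) / (len(prompts) - 1)
print(avg_distance)
```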
Conclusion
To build robust and comprehensive safety systems for T2I models, it is essential to expand dataset diversity across all harm categories, ensuring balanced and representative coverage of potential misuse cases. A diverse dataset helps address a wide range of challenges, from detecting overtly harmful content like explicit imagery and violence to more nuanced issues like biases or misinformation. Achieving this requires the development of clear and unambiguous taxonomies that systematically classify harm categories, along with consistent labeling guidelines to ensure uniformity and accuracy across annotations. Diverse syntax and semantics in prompts are equally important, enabling models to handle varied phrasing and complex inputs while reducing the risk of failures in detecting harmful intent. Furthermore, datasets must represent a broad range of topics across different domains, languages, and cultural contexts to minimize biases and enhance the global applicability of safety systems. By addressing these aspects upstream, at the dataset level, researchers can preemptively mitigate many safety challenges, resulting in models and content moderation systems that are more effective and reliable in diverse, real-world scenarios. A well-curated and inclusive dataset serves as the foundation for robust AI safety, ensuring that generative models can better align with ethical, societal, and cultural norms, ultimately fostering trust and confidence in these technologies.
References