Co-occurrences in the Pile Dataset

Subgroup-Disease Associations

April 29, 2023

Written by BittermanLab

Pile Dataset Analysis

Our first step was to measure how often biomedical keywords co-occur within the Pile dataset.

Datasets

"The Pile" is an 825 GB English text corpus designed for pre-training autoregressive LLMs, the same class of model as ChatGPT. Building on previous work, our analysis counts how often a given demographic keyword is mentioned near a disease keyword. We repeat this count across the entire dataset for each disease and demographic term to obtain totals at several window sizes. In addition, we collected real-world prevalence estimates from the National Health Interview Survey results.

Findings

Variation Across Windows: Our analysis showed consistent disease rankings across token window sizes of 50, 100, and 250. This consistency suggests that our findings are robust to the choice of window size.
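
One way to quantify how similar two rankings are is a rank correlation. The sketch below computes Spearman's rho between disease rankings at two window sizes. It is an illustration only: the source does not say which consistency measure was used, the helper names are hypothetical, the counts are made-up placeholders rather than results, and the simple formula assumes no tied counts.

```python
def ranking(counts):
    """Return items ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


def spearman_rho(counts_a, counts_b):
    """Spearman rank correlation for two count tables over the same items (no ties assumed)."""
    rank_a = {item: r for r, item in enumerate(ranking(counts_a))}
    rank_b = {item: r for r, item in enumerate(ranking(counts_b))}
    n = len(rank_a)
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder counts for illustration only; these are NOT figures from the analysis.
window_50 = {"diabetes": 120, "asthma": 80, "arthritis": 40}
window_100 = {"diabetes": 260, "asthma": 170, "arthritis": 90}
window_250 = {"diabetes": 510, "asthma": 300, "arthritis": 330}

same_order = spearman_rho(window_50, window_100)   # identical ordering
one_swap = spearman_rho(window_50, window_250)     # two diseases swap places
print(same_order, one_swap)
```

A value of 1 means the two windows rank the diseases identically, and lower values indicate more reordering.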

Demographic Distributions: We observed notable disparities in the dataset's representation of different demographic groups compared to real-world disease prevalence data. For instance, White individuals were overrepresented, while Pacific Islanders and Indigenous groups were underrepresented.
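
Over- and underrepresentation can be made concrete by comparing each group's share of co-occurrence mentions with its share of real-world cases. The sketch below is one assumed way to express that comparison, not necessarily the metric used in this analysis. The function names, group labels, and all numbers are hypothetical placeholders.

```python
def shares(values):
    """Convert raw counts into fractions that sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


def representation_ratio(cooccurrence_counts, real_world_cases):
    """Ratio of corpus share to real-world share per group.

    A ratio above 1 means the group appears more often in the corpus
    than its real-world share would suggest; below 1 means less often.
    """
    corpus = shares(cooccurrence_counts)
    real = shares(real_world_cases)
    return {group: corpus[group] / real[group] for group in corpus}


# Placeholder values for illustration only; NOT data from the Pile or the survey.
toy_counts = {"group_a": 600, "group_b": 300, "group_c": 100}
toy_cases = {"group_a": 400, "group_b": 400, "group_c": 200}

ratios = representation_ratio(toy_counts, toy_cases)
for group, ratio in ratios.items():
    label = "over" if ratio > 1 else "under"
    print(f"{group}: {ratio:.2f} ({label}represented)")
```

In this toy input, only `group_a` has a ratio above 1, and the other two groups fall below it.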

Comparison of disease rankings between the Pile, LLM logits, and real-world data.

Visit our project page at Cross-Care Downloads to explore our methods and results in detail and to access the full dataset.

Continue reading about what models thought was the most common ...