April 29, 2023
Our first step was to measure how often biomedical keywords co-occur within the Pile dataset.
"The Pile" is an 825 GB English text corpus designed for pre-training autoregressive LLMs, the same style of model as ChatGPT. Our analysis builds on previous work to count how often a specific demographic keyword is mentioned near a disease keyword, where "near" means within a fixed window of text around the disease mention. We repeat this process across the whole dataset for different diseases and demographic words to get the final totals for each window size. In addition, we collected real-world prevalence figures from the National Health Interview Survey results.
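To make the counting step concrete, here is a minimal sketch of windowed co-occurrence counting. It is an illustration under stated assumptions, not the project's actual pipeline: the function names, the simple regex tokenizer, the restriction to single-word keywords, the distance measured in tokens, and the example sentences and window sizes are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the real pipeline may tokenize differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def count_cooccurrences(docs, disease_terms, demographic_terms, window_sizes):
    """Count (disease, demographic) pairs whose token distance is within each window."""
    counts = {w: Counter() for w in window_sizes}
    for doc in docs:
        tokens = tokenize(doc)
        # Record the position of every keyword mention in this document.
        disease_hits = [(i, t) for i, t in enumerate(tokens) if t in disease_terms]
        demo_hits = [(i, t) for i, t in enumerate(tokens) if t in demographic_terms]
        for i, disease in disease_hits:
            for j, demo in demo_hits:
                distance = abs(i - j)
                # A pair counts toward every window size that is wide enough to contain it.
                for w in window_sizes:
                    if distance <= w:
                        counts[w][(disease, demo)] += 1
    return counts


# Hypothetical toy documents, keywords, and window sizes for illustration only.
docs = [
    "Asthma is common among hispanic children.",
    "Studies of diabetes in many different cohorts over several decades "
    "eventually included asian participants.",
]
counts = count_cooccurrences(
    docs,
    disease_terms={"asthma", "diabetes"},
    demographic_terms={"hispanic", "asian"},
    window_sizes=(5, 20),
)
for window, pair_counts in counts.items():
    print(window, dict(pair_counts))
```

In this toy example the asthma mention sits 4 tokens from its demographic keyword, so it is counted at both window sizes, while the diabetes mention is 10 tokens away and is only counted at the larger window. Summing such counts over every document gives per-window totals of the kind described above.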
Visit our project page at Cross-Care Downloads to explore our methods and results in detail and to access the full dataset.
Continue reading about what models thought was the most common ...
Check out the repo here
Cross-Care Repo