A new study from MIT researchers has uncovered significant transparency issues in the datasets used to train large language models, issues that could affect the accuracy and fairness of the resulting models.
The research team, led by Alex "Sandy" Pentland from MIT's Media Lab, conducted a systematic audit of more than 1,800 text datasets from popular hosting sites. They found that more than 70 percent of these datasets lacked complete licensing information, while roughly 50 percent contained errors in the licensing details that were provided.
To address this problem, the team developed the Data Provenance Explorer, a tool that automatically generates summaries of a dataset's creators, sources, licenses, and allowable uses. The tool is meant to help AI practitioners select appropriate training datasets, potentially improving model accuracy in real-world applications.
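The study describes the tool's output rather than its implementation, but a minimal sketch of what such a provenance summary and licensing audit could look like is shown below. All field and function names here (ProvenanceSummary, audit_licensing) are illustrative assumptions, not the actual schema or API of the Data Provenance Explorer.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceSummary:
    """Hypothetical record of one dataset's provenance metadata."""
    name: str
    creators: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)        # e.g. "web crawl", "forum scrape"
    licenses: list[str] = field(default_factory=list)       # e.g. "CC-BY-4.0"
    allowable_uses: list[str] = field(default_factory=list) # e.g. "research", "commercial"

def audit_licensing(datasets: list[ProvenanceSummary]) -> list[ProvenanceSummary]:
    """Flag datasets with no license metadata at all -- the kind of gap
    the MIT audit found in more than 70 percent of the datasets examined."""
    return [d for d in datasets if not d.licenses]

if __name__ == "__main__":
    corpus = [
        ProvenanceSummary(
            name="webtext-sample",
            creators=["Example Lab"],
            sources=["web crawl"],
            licenses=["CC-BY-4.0"],
            allowable_uses=["research", "commercial"],
        ),
        ProvenanceSummary(name="forum-dump", sources=["forum scrape"]),
    ]
    for d in audit_licensing(corpus):
        print(f"Missing license information: {d.name}")
```

In this sketch, the audit simply surfaces datasets whose license field is empty; a practitioner could then exclude them or trace their provenance before training on them.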
The study also revealed a concentration of dataset creators in the Global North, which could limit a model's cultural relevance when it is deployed in other regions. Additionally, the researchers noted a significant increase in restrictions placed on datasets created in 2023 and 2024, possibly reflecting concerns about unintended commercial use.