A new study from MIT researchers has uncovered widespread transparency problems in the datasets used to train large language models, with potential consequences for the accuracy and fairness of AI systems.

The research team, led by Alex "Sandy" Pentland of MIT's Media Lab, conducted a systematic audit of more than 1,800 text datasets from popular dataset-hosting sites. They found that more than 70 percent of these datasets lacked complete licensing information, and roughly 50 percent contained errors in the licensing information that was provided.

Robert Mahari, a graduate student at MIT and Harvard Law School, emphasised the importance of data transparency: "One of the best ways to understand the capabilities and limitations of an AI model is understanding what data it was trained on. When you have misattribution and confusion about where data came from, you have a serious transparency issue."

To address this problem, the team developed the Data Provenance Explorer, a tool that automatically generates summaries of a dataset's creators, sources, licences, and allowable uses. This tool aims to help AI practitioners select appropriate training datasets, potentially improving model accuracy in real-world applications.
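
To make the idea concrete, here is a minimal Python sketch of what such a provenance summary might look like. The `DatasetProvenance` fields, `summary` method, and `flag_incomplete` helper are hypothetical illustrations, not the Data Provenance Explorer's actual schema or API.

```python
from dataclasses import dataclass


@dataclass
class DatasetProvenance:
    """Hypothetical provenance record; field names are illustrative only."""
    name: str
    creators: list[str]
    sources: list[str]
    licences: list[str]      # e.g. "CC-BY-4.0", "Apache-2.0"
    allowed_uses: list[str]  # e.g. "research", "commercial"

    def summary(self) -> str:
        """Render a human-readable provenance summary for one dataset."""
        return "\n".join([
            f"Dataset:      {self.name}",
            f"Creators:     {', '.join(self.creators) or 'UNKNOWN'}",
            f"Sources:      {', '.join(self.sources) or 'UNKNOWN'}",
            f"Licences:     {', '.join(self.licences) or 'MISSING'}",
            f"Allowed uses: {', '.join(self.allowed_uses) or 'UNSPECIFIED'}",
        ])


def flag_incomplete(records: list[DatasetProvenance]) -> list[str]:
    """Return names of datasets with no licence information: the kind of
    gap the MIT audit found in more than 70 percent of datasets."""
    return [r.name for r in records if not r.licences]


if __name__ == "__main__":
    record = DatasetProvenance(
        name="example-corpus",
        creators=["Example Lab"],
        sources=["https://example.org/corpus"],
        licences=[],  # missing licence, so it gets flagged below
        allowed_uses=["research"],
    )
    print(record.summary())
    print("Incomplete:", flag_incomplete([record]))
```

A structured record like this makes licensing gaps machine-checkable: rather than trusting whatever metadata a hosting site displays, a practitioner can audit an entire candidate training corpus in one pass.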

The study also revealed that dataset creators are concentrated in the Global North, which could limit a model's cultural relevance when it is deployed in other regions. The researchers also noted a sharp increase in restrictions placed on datasets created in 2023 and 2024, possibly reflecting concerns about unintended commercial use.
