The datasets used to train AI models need more checks for harmful and illegal material

This Atlantic conversation between Matteo Wong and Abeba Birhane touches on some critical issues surrounding the use of large datasets to train AI models.

The conversation underscores the urgent need for anti-racist practice, improved labour rights, and rigorous oversight in the development and use of AI technologies, particularly where training datasets are concerned. It also highlights the responsibility of tech companies and researchers to address these challenges and to prioritize AI systems that are fair, transparent, and free from harmful biases.

The LAION dataset, one of the largest publicly available image datasets, was found to contain disturbing content, including images depicting the sexual abuse of children. This shows how pervasive harmful and illegal material is in the datasets used to train AI models. Toxic or illegal content in training data can directly influence the behaviour and output of AI models: models trained on such data are likely to replicate and propagate the harmful decision-making and representations it contains. Birhane therefore advocates for greater transparency and for open-sourcing datasets so that their contents can be better understood and scrutinized. Open-sourcing also allows for collaborative efforts to improve data quality and mitigate the risks associated with harmful content.

As datasets grow larger, it becomes increasingly difficult to audit them effectively. Birhane further points out that the scale of data collection often comes at the cost of auditability, making it challenging to understand the full extent of racist, exclusionary, and otherwise problematic content within these datasets. There’s also a notable asymmetry in resources between building AI systems and auditing and cleaning up datasets. While building AI models has become more accessible and cost-effective, comprehensive dataset auditing still demands substantial intellectual and computational resources.

See: "Building AI safely is getting harder and harder" in The Atlantic.

Header image from the original Atlantic article.