New multimodal dataset will help in the development of ethical AI systems
October 23, 2024
By Shaina Raza and Deval Pandya
The Vector Institute’s AI Engineering team has developed Newsmediabias-plus (NMB+), a new multimodal dataset. It includes full-text articles alongside comprehensive publication details. It also features extensive bias categorization, addressing critical issues such as gender and racial biases, and specific topics including ideological leanings and framing, gender discrimination, and environmental concerns.
NMB+ is designed for academic researchers, NGOs, and socially focused groups. This aligns with Vector’s goal of addressing both near- and long-term risks by providing practical tools for safe AI systems. Potential uses include:
Developed by Shaina Raza, Vector Institute Applied Machine Learning Scientist, Responsible AI, the dataset builds on the previously released UnBIAS work by incorporating images alongside text.
The dataset includes around 90,000 news articles from May 2023 to September 2024, curated from a broad spectrum of reliable sources, including major news outlets from around the globe. These articles were gathered through open data sources using Google RSS, adhering to research ethics guidelines [1], [2].
Various machine learning models were built to evaluate the dataset’s effectiveness in detecting biases and fake content, demonstrating its versatility and utility. This benchmarking process shows how the dataset performs across different modalities, including text and images, highlighting its potential for training advanced AI models designed to combat disinformation.
Each entry in the dataset features full article text, publication details (date, outlet, URL), bias assessments for both text and images, topic categorizations, and image descriptions and analyses.
A commitment to ethical AI governance requires designing transparent AI systems that can be understood and audited, holding developers accountable for the content their AI tools generate, and establishing clear ethical standards for the development and deployment of AI technologies. Developers and researchers should focus on building robust and transparent algorithms, integrating ethical considerations and personal information protection in data, and collaborating with experts across disciplines to enhance disinformation detection techniques. This commitment also requires continuously adapting AI tools to counter evolving disinformation tactics.
NMB+’s development and use are governed by strict ethical standards to align regulatory requirements with technical work. Comprehensive human reviews have been implemented to ensure the accuracy and reliability of the data and its labels. The dataset also underwent extensive audits to validate the data collection and labeling methodologies. These audits involved independent reviewers who assessed the dataset for adherence to ethical standards and accuracy. They examined the data sources, collection procedures, and labeling criteria to ensure that all elements meet established research integrity and reliability guidelines. This thorough review helps to confirm that the dataset is both robust and trustworthy for use in training and evaluating AI systems.
Researchers, technologists, and the general public are invited to explore the NMB+ dataset and delve into the findings. The dataset is accessible on Vector’s Hugging Face page under a non-commercial license. Details can be found on the News Media Bias Plus page.
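For readers who want to picture what working with the data might look like, here is a minimal Python sketch of one entry, based only on the contents described earlier (full text, publication details, bias assessments for text and images, topics, and image descriptions). The class name, every field name, the "biased"/"unbiased" label values, and the sample records are assumptions made for illustration. They are not the dataset's published schema, which should be checked on the Hugging Face page.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class NewsEntry:
    """One record. All field names are hypothetical, not the real schema."""
    article_text: str
    published: date
    outlet: str
    url: str
    text_bias_label: str            # assumed label values: "biased" / "unbiased"
    image_bias_label: str           # assumed label values: "biased" / "unbiased"
    topics: List[str] = field(default_factory=list)
    image_description: Optional[str] = None


def entries_with_any_bias(entries, biased_value="biased"):
    """Return entries flagged as biased in the text, the image, or both."""
    return [
        e for e in entries
        if biased_value in (e.text_bias_label, e.image_bias_label)
    ]


# Placeholder records for demonstration only; not taken from the dataset.
sample = [
    NewsEntry("Placeholder text A.", date(2024, 1, 5), "Example Outlet",
              "https://example.com/a", "unbiased", "unbiased", ["environment"]),
    NewsEntry("Placeholder text B.", date(2024, 2, 9), "Example Outlet",
              "https://example.com/b", "biased", "unbiased", ["politics"]),
    NewsEntry("Placeholder text C.", date(2024, 3, 1), "Example Outlet",
              "https://example.com/c", "unbiased", "biased", ["gender"]),
]

flagged = entries_with_any_bias(sample)
for entry in flagged:
    print(entry.url, entry.text_bias_label, entry.image_bias_label)
```

Keeping the text and image assessments as separate fields mirrors the multimodal design described above: an article can be flagged because of its wording, its accompanying image, or both.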
[1] Does my data collection activity require ethics review? | Research | University of Waterloo