Google Launches WAXAL, an Open Source Dataset for African Languages


Published by Clare Adamson on February 4, 2026

A Large-Scale African Language Resource

Google's WAXAL dataset comprises more than 11,000 hours of voice data and nearly 2 million recordings, making it one of the largest open-source datasets focused solely on African languages.

This project marks a major advancement in inclusion and linguistic representation in voice-enabled AI. Developers can use the dataset to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems, which power applications such as voice assistants and automated call centers.

For African startups, WAXAL will lower the cost of building local-language AI products. WAXAL also reduces dependence on foreign datasets that often fail to capture regional dialects.

For Africans By Africans

Google worked with local African institutions, which led the data collection. They include Makerere University in Uganda, Digital Umuganda in Rwanda, the University of Ghana, and the African Institute for Mathematical Sciences (AIMS).

Participants recorded speech in their natural accents and speaking styles. At the University of Ghana, more than 7,000 volunteers contributed voice recordings, making the project truly collaborative.

This local approach improves data quality and cultural accuracy, ensuring that African languages are represented authentically.

Data Ownership and Ethical Collaboration

Rather than simply extracting data, Google built mutually beneficial partnerships. Each research institution retains full ownership of the data it collected. As equal collaborators, the organisations can reuse the data for research and education.

This model supports long-term innovation in Africa’s AI ecosystem and encourages ethical, transparent data practices.

Open-Source Format

The full WAXAL dataset is publicly available under an open license on Hugging Face. Anyone can access it free of charge, which levels the playing field. Open access matters most to students and startups that may lack the resources to pay for licensed data.

“This dataset provides the critical foundation for students, researchers, and entrepreneurs to build technology on their own terms, in their own languages,” says Aisha Walcott-Bryant, Head of Google Research Africa.

A Growing Movement for African Language Inclusion

This initiative joins other projects, such as Lelapa AI and N-ATLAS, in advancing the inclusion of African languages in voice technology. Together, they signal the importance of including underrepresented languages in the AI economy.

Main Image: Courtesy of Google