A Global First: How a New Sustainable Data Framework & License Are Transforming Language AI

AI shouldn’t just extract from Africa; it should empower it.
Despite the rapid evolution of artificial intelligence, African languages remain vastly underrepresented in AI models. A major reason for this is the lack of sustainable, ethical, and community-driven approaches to language dataset creation.
Lelapa AI, in collaboration with Way With Words and Data Science for Social Impact (DSFSI), is proud to introduce the Esethu Framework, a pioneering data governance approach that fundamentally reimagines how low-resource language datasets are created, curated, and sustained.
This framework challenges the traditional exploitative data-sharing model, ensuring that African language speakers are not just contributors to AI research but beneficiaries of its growth.
What is the Esethu Framework?
The Esethu Framework is a sustainable data curation model that gives African communities greater control over their linguistic data while ensuring ongoing reinvestment into new African language datasets.
Key features include:
- Sustainable Licensing: The Esethu License is a first-of-its-kind, community-centric licensing scheme that creates clear pathways for ethical commercialization. Foreign companies that use African language data pay it forward by funding more language data collection.
- Community-Led Development: Local linguists and native speakers play a key role in dataset creation, making sure the data is authentic and diverse.
- Scalability & Replicability: The framework is not just for isiXhosa; it can be applied to any low-resource African language that needs better AI representation.
“Lelapa AI has created a novel data framework that prioritises the African language technology ecosystem! It ensures that commercial use is open to African entities, while revenue from non-African companies funds local data creation,” says Jenalea Rajab, Research Lead at Lelapa AI.
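As a rough illustration of the rule described in the quote above, the following minimal sketch models who would owe a reinvestment fee. It is a simplification for explanation only and is not the text or logic of the Esethu License itself. The names `LicenseRequest` and `owes_reinvestment_fee` are hypothetical, and the treatment of non-commercial use as fee-free is an assumption not stated in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseRequest:
    """Hypothetical description of an organisation asking to use a dataset."""
    organisation: str
    is_african_entity: bool
    commercial_use: bool


def owes_reinvestment_fee(request: LicenseRequest) -> bool:
    """Return True when a request would fund new data creation.

    Assumption: only commercial use by non-African entities is charged.
    Commercial use by African entities is open, as the quote describes;
    non-commercial use is treated as free here purely for illustration.
    """
    return request.commercial_use and not request.is_african_entity


if __name__ == "__main__":
    examples = [
        LicenseRequest("Local startup", is_african_entity=True, commercial_use=True),
        LicenseRequest("Overseas vendor", is_african_entity=False, commercial_use=True),
        LicenseRequest("Overseas lab", is_african_entity=False, commercial_use=False),
    ]
    for req in examples:
        print(req.organisation, "owes fee:", owes_reinvestment_fee(req))
```

The actual terms, including how an African entity is defined and how fees are set, are in the Esethu License linked under "Access the Resources".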
A Proof of Concept – Introducing The ViXSD Dataset
As the first dataset developed under the Esethu Framework, the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD) is an open-source Automatic Speech Recognition (ASR) dataset for isiXhosa.
- 10 hours of high-quality isiXhosa speech data
- Diverse speakers across dialects, age groups, and regions
- Ethical licensing that ensures future isiXhosa data growth
This dataset opens new opportunities for building voice AI, transcription tools, and multilingual natural language processing (NLP) models that better serve isiXhosa speakers.
But this is just the beginning! The same process can be applied to build datasets for other African languages, supporting long-term, sustainable language AI development across the continent.
Why This Matters for African AI & NLP
For too long, African language data has been freely used by AI giants without reinvesting in the communities that created it. The Esethu Framework changes this model by bringing:
- Better AI for African Languages: More data means better speech recognition, multilingual assistants, and voice technology for African users.
- A Self-Sustaining Ecosystem: Licensing fees from non-African companies fund the creation of even more datasets for underrepresented languages.
- AI with True Inclusion: It’s time for African languages to drive AI innovation, not just be an afterthought.
Access the Resources
🔗 Explore the Esethu Framework & Research Paper
🔗 Download the ViXSD Dataset
🔗 Access the Esethu License
Share this initiative to redefine how AI serves African languages!