TL;DR
Multilingual language models often claim support for hundreds of languages, but do these systems work well in low-resource languages with limited data?
AFROBENCH, a benchmark developed by Lelapa AI and collaborators, answers this question by evaluating multilingual NLP systems across 64 African languages, 15 tasks and 22 datasets. The results reveal a clear performance gap driven by uneven language data and digital representation.
These findings highlight the need for resource-efficient AI: efficient, lightweight, and scalable language models designed for constrained environments. Resource-efficient systems achieve this by using less data, running on less computing power, and prioritising the tasks that matter most in real-world, low-bandwidth settings. This work directly informs Lelapa AI's ecosystem, including Vulavula, InkubaLM, and the Esethu framework.
The Hidden Performance Gap in Multilingual AI
Artificial intelligence (AI) systems increasingly claim multilingual capability. Language models can generate text, translate content and answer questions across dozens of languages.
A more important question remains: how well do these systems perform in low-resource languages?
This question matters because AI systems are rapidly expanding into regions where language data is scarce and computing infrastructure is constrained. These conditions are common across much of the Global South, where linguistic diversity is high but digital datasets remain limited.
Research led by Lelapa AI and collaborators introduces AFROBENCH ("How Good are Large Language Models on African Languages?"), a benchmark designed to evaluate multilingual language models across a wide range of low-resource languages and tasks. The results show that model performance varies significantly with how much language data is available and how well each language is represented online.
In AI research, a benchmark is a standard way to test how well models perform across tasks like understanding, translation and reasoning. Most benchmarks focus on high-resource languages with large online data, which can hide performance gaps.
Many languages across Africa, Asia and Latin America fall into the low-resource category, where digital data is far more limited. As a result, models trained mainly on high-resource languages often perform unevenly when applied to these under-represented languages. AFROBENCH addresses this by evaluating languages with limited digital data, making those gaps visible.
Designing AI Systems That Work Where Data Is Limited
The same patterns highlighted by AFROBENCH inform how Lelapa AI builds resource-efficient AI systems. Instead of relying on massive datasets and compute, Lelapa focuses on data efficiency and designing for the real conditions where the AI will be used.
In practice, this means:
- training on smaller, high-quality, language-specific datasets to improve signal over volume
- designing models that handle code-switching, dialects, and local accents common in real conversations
- ensuring systems perform reliably in low-bandwidth, noisy-audio, and compute-constrained environments
Evaluation work such as AFROBENCH helps identify where models succeed and where improvements are needed.
These insights inform the development of InkubaLM, Lelapa AI's small language model designed for multilingual environments with constrained resources.
They also guide real-world applications such as Vulavula, Lelapa AI's multilingual speech-to-text, translation, and conversational AI platform for customer-service systems.
In parallel, the Esethu framework explores community-aligned approaches to language data creation so that language technologies can grow alongside the communities that use them.
This is the data-efficiency advantage: strong performance without dependence on large-scale data or infrastructure, designed for the environments where most languages are used.
Why Multilingual AI Struggles With Low-Resource Languages
Multilingual language models are trained on large collections of internet text. This approach works well for languages with abundant data.
However, the internet is not linguistically balanced. A small number of languages dominate online content while thousands have limited digital presence.
This imbalance creates two challenges for AI language representation:
- Languages with large datasets provide strong learning signals during training
- Languages with limited datasets contribute far weaker signals, so models learn them less thoroughly
The result is a performance gap. Models often perform well in high-resource languages but struggle when operating in languages with limited digital representation.
Lelapa AI's data and resource-efficiency approach targets this gap by training on smaller, high-quality, and language-specific datasets, allowing models to learn effectively even when data is limited.
Introducing AFROBENCH: Measuring Multilingual AI Performance
To better understand these gaps, Lelapa AI and collaborators developed AFROBENCH, a multilingual evaluation benchmark.
AFROBENCH measures language model performance across:
- 64 African languages
- 15 natural language processing tasks
- 22 evaluation datasets
These tasks span core multilingual NLP capabilities such as language understanding, text generation, translation, reasoning and question answering.
By evaluating many languages across multiple tasks, AFROBENCH provides a clearer view of how multilingual AI systems behave in real linguistic environments. This makes AFROBENCH one of the most comprehensive frameworks for assessing multilingual AI performance in low-resource settings.
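To make this kind of evaluation concrete, here is a minimal sketch of how per-language scores might be aggregated across tasks and compared against a high-resource reference. The languages, task names, and score values below are invented for illustration; they are not AFROBENCH's actual numbers or methodology.

```python
from statistics import mean

# Hypothetical per-task accuracy scores (0-100) for a few languages.
# Real AFROBENCH scores come from 22 datasets spanning 15 tasks.
scores = {
    "eng": {"classification": 91.0, "translation": 88.0, "qa": 85.0},
    "swa": {"classification": 78.0, "translation": 64.0, "qa": 59.0},
    "zul": {"classification": 70.0, "translation": 52.0, "qa": 47.0},
}

def language_averages(scores):
    """Average each language's scores across all evaluated tasks."""
    return {lang: round(mean(task_scores.values()), 1)
            for lang, task_scores in scores.items()}

def performance_gap(averages, reference="eng"):
    """Gap between each language and a high-resource reference language."""
    ref = averages[reference]
    return {lang: round(ref - avg, 1) for lang, avg in averages.items()}

avgs = language_averages(scores)   # e.g. {"eng": 88.0, "swa": 67.0, "zul": 56.3}
gaps = performance_gap(avgs)       # e.g. {"eng": 0.0, "swa": 21.0, "zul": 31.7}
```

Averaging across many tasks, rather than reporting a single task, is what lets a benchmark like this show whether a gap is systematic or task-specific.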
What the Data Evaluation Reveals
The AFROBENCH evaluation highlights several consistent patterns.
First, multilingual language models perform far better in languages that dominate internet datasets, particularly English.
Second, model performance strongly correlates with the amount of digital text available for a language.
Third, specialised models trained for specific tasks can outperform large general-purpose models when working with low-resource languages.
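The second finding, that performance tracks data availability, can be illustrated with a simple Spearman rank correlation. The web-text volumes and benchmark scores below are invented for illustration only; this is not AFROBENCH's analysis or data.

```python
def rank(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    """Spearman rank correlation coefficient for untied data."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Invented figures: available web text (GB) vs. benchmark score per language.
web_text_gb = [5000.0, 40.0, 12.0, 3.0, 1.0]
bench_score = [88.0, 60.0, 67.0, 55.0, 49.0]

correlation = spearman(web_text_gb, bench_score)  # high, but not perfect
```

A rank correlation is the natural choice here because data volume spans orders of magnitude across languages, so the relationship with score is monotonic rather than linear.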
These findings highlight a key insight: simply making models bigger and training them on more data does not guarantee good performance across all languages.
A Research Agenda for Global Language AI
If the global AI industry wants language models that work across the world's languages, it must first measure where current systems succeed and where they fall short.
AFROBENCH provides one of the most comprehensive multilingual NLP evaluations available today, covering 64 languages, 15 tasks and 22 datasets. The benchmark reveals where multilingual language models perform well, where performance gaps remain and what the future of resource-efficient AI must address.
Download the full paper, AFROBENCH: How Good are Large Language Models on African Languages?, to explore the benchmark, methodology and findings in detail.
