Nigerian Language AI: Datasets, Models & Tools (2026)

Nigeria has three major indigenous languages — Hausa, Yoruba, and Igbo — each with tens of millions of native speakers, plus Nigerian Pidgin English which is understood by an estimated 60–100 million Nigerians as a lingua franca. These are significant world languages. Yet the AI tooling available for them is a fraction of what exists for much smaller European languages.

The reason is structural: African language data is underrepresented in the internet data used to train large models. Languages with fewer speakers in high-income countries receive less attention from commercial AI labs. This guide collects everything useful that does exist — so that developers and researchers building in this space do not have to rediscover the same scattered resources.

Hausa language AI resources

Speech datasets

Mozilla Common Voice — Hausa. The largest freely available Hausa speech corpus. As of April 2026, the validated dataset contains approximately 8–10 hours of audio from volunteer contributors, predominantly from Nigeria and Niger. Audio quality varies. Ground-truth transcripts are generally good but should be audited for diacritic completeness (ƙ, ɗ, ɓ) before use in model training. Download at commonvoice.mozilla.org. License: CC0.

ALFFA (African Language Technology). The ALFFA project collected Hausa speech data for ASR research. The dataset is available through academic request and contains approximately 5 hours of read speech. The orthography uses standard Hausa romanisation with diacritics. This is higher quality than Common Voice for ASR training because it was recorded in controlled conditions. License: academic research use.

GlobalPhone Hausa. A small Hausa corpus collected as part of the GlobalPhone multilingual speech database. Very clean audio, very small dataset (~1 hour). Commercially licensed; contact the Karlsruhe Institute of Technology.

Text corpora

JW300 Hausa-English. A large parallel text corpus derived from Jehovah's Witnesses publications, which are translated into a very large number of languages including Hausa. The JW300 Hausa-English corpus contains approximately 300,000 sentence pairs. Vocabulary is somewhat biased toward religious terminology but the corpus is invaluable for machine translation and language model training. Available on HuggingFace datasets.

MENYO-20k. A multi-domain Yoruba-English dataset that also includes a small Hausa component. Domains include news, social media, and formal text. Available on HuggingFace. Produced by the Masakhane community.

Masakhane Hausa corpora. The Masakhane project (masakhane.io) has produced multiple Hausa language resources including news translation benchmarks, NER datasets, and POS tagging datasets. These are research-grade annotated corpora valuable for building Hausa NLP pipelines beyond simple transcription.

OPUS Hausa. The OPUS repository aggregates parallel text data across many languages. Hausa is represented via the JW300, CCAligned, and WikiMatrix datasets. Access at opus.nlpl.eu.

Pre-trained models for Hausa

Whisper (OpenAI) — baseline. The openai/whisper-small model on HuggingFace supports Hausa as a language option. Base WER on conversational Nigerian Hausa is approximately 38–42%. Acceptable as a starting point for fine-tuning; not production-quality without further training.

MMS (Massively Multilingual Speech, Meta). Meta's MMS project trained speech models on data from Bible recordings in over 1,100 languages, including Hausa. The MMS-300M model achieves better Hausa WER than base Whisper on read speech, but its training domain (religious text) means it performs poorly on conversational business speech. Available on HuggingFace: facebook/mms-300m.

AfriBERTa. A BERT-style language model pre-trained on text from 11 African languages including Hausa. Useful for Hausa text classification, NER, and sequence labelling tasks. Available at castorini/afriberta_large on HuggingFace.

Maraba fine-tuned Whisper (inference API). Maraba's fine-tuned model achieves approximately 18% WER on conversational Nigerian Hausa. Available via the Hausa STT API at ₦5/min.

Key papers

Aliyu et al. (2020): "HausaNLP at SemEval-2020 Task 12: Data Augmentation, One vs. Rest and Language Adaptation for Hausa Offensive Language Detection"
Abdulmumin et al. (2022): "Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages" — includes Hausa MT benchmarks
Olatunji et al. (2022): "AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR" — includes Hausa speakers

Yoruba language AI resources

Speech datasets

Mozilla Common Voice — Yoruba. Similar size to the Hausa Common Voice dataset. Yoruba diacritics (ẹ, ọ, ṣ, tone marks) are generally present in the ground-truth transcripts, making this better quality for training purposes than the equivalent for some other languages. License: CC0.

Yoruba Speech Corpus (University of Lagos). A research corpus of Yoruba speech collected by Nigerian academic institutions. Not widely distributed but referenced in several papers. Contact through the NLP department at the University of Lagos.

AfriSpeech-200. A 200-hour pan-African accented speech dataset that includes clinical and general domain recordings from Yoruba speakers. This is the highest-quality open Yoruba speech dataset for conversational/clinical domain. Available on HuggingFace: intron/afrispeech-200. License: CC BY 4.0.

Text corpora

MENYO-20k. The primary Yoruba-English parallel corpus from the Masakhane community. 20,000 sentence pairs across multiple domains. Essential for any Yoruba MT or language model work. Available on HuggingFace.

YorùbáTwi Benchmark (YTB). A Yoruba text benchmark covering news classification, NER, and sentiment. Produced by Masakhane. Useful for evaluating Yoruba NLP models.

IgboNLP / Yoruba Wikipedia. The Yoruba Wikipedia (yo.wikipedia.org) is small but is one of the few sources of Yoruba encyclopaedic text. Can be accessed via the Wikipedia dumps at dumps.wikimedia.org.

Pre-trained models for Yoruba

Whisper + MMS baseline. Both OpenAI Whisper and Meta MMS support Yoruba. MMS performs marginally better on read speech due to Bible training data including Yoruba; Whisper is better on out-of-domain conversational speech after fine-tuning.

YoruBERT / AfroLM. African language BERT-style models that include Yoruba in their pre-training. AfroLM is particularly noteworthy — it is a multilingual African language model trained on data from 23 African languages. Available on HuggingFace: bonadossou/afrolm_active_learning.

Tonal language models. Yoruba's tonal complexity means standard language models that ignore tone marks produce suboptimal outputs for Yoruba. The masakhane/m2m_418M_seed_yoruba_english_53k model was fine-tuned specifically for Yoruba-English translation with proper diacritic handling.

Key papers

Shode et al. (2022): "YOSM: A Yoruba Sentiment Corpus for Movie Reviews"
Adebara and Abdul-Mageed (2022): "Towards Afrocentric NLP for African Language Technologies: Where We Are and Where We Need to Go"
Olatunji et al. (2021): "Yoruba TTS: A Dataset of Yoruba Speech for Low-Resource Text-to-Speech"

Igbo language AI resources

Speech datasets

Mozilla Common Voice — Igbo. Smaller than the Hausa and Yoruba sets; approximately 2–4 hours of validated audio as of early 2026. Diacritic coverage (ị, ụ, ọ) in ground-truth transcripts is inconsistent and requires auditing. License: CC0.

AfriSpeech-200 — Igbo component. The AfriSpeech dataset includes Igbo speaker recordings. This is likely the best open speech data for Igbo NLP research. Available on HuggingFace.

Text corpora

IgboNLP datasets (Uchenna Magulumorok and collaborators). A collection of Igbo NLP datasets including POS-tagged text, NER annotations, and parallel Igbo-English text. Available through the IgboNLP GitHub organisation.

JW300 Igbo-English. Large parallel corpus, same caveats as the Hausa version regarding domain bias. Available on HuggingFace datasets.

Igbo Wikipedia. Small but existent. Access via Wikipedia dumps.

Pre-trained models for Igbo

AfriBERTa, AfroLM. Both include Igbo in their pre-training data. AfriBERTa is the more established model; AfroLM has better multilingual coverage of African languages.

Igbo-specific fine-tuned models. As of 2026, there are very few Igbo-specific fine-tuned models in public repositories. This is an active research gap. The Masakhane community wiki tracks current work.

Nigerian Pidgin language AI resources

Nigerian Pidgin English (Naijá) is spoken by an estimated 60–100 million Nigerians and is the true lingua franca of Nigeria — the language that speakers of different ethnic backgrounds default to when communicating across language groups. It is structurally distinct from English, with its own grammar, vocabulary, and pragmatics.

Datasets

naija-sentiment. A sentiment analysis dataset in Nigerian Pidgin. Available on HuggingFace.

NaijaSenti. A large-scale Pidgin sentiment corpus (alongside Hausa, Igbo, and Yoruba) produced by the Masakhane NLP community. Approximately 15,000 annotated tweets per language. Available on HuggingFace: Davlan/naijasenti.

BBC Pidgin corpus. BBC Pidgin publishes news in Nigerian Pidgin English at bbc.com/pidgin. This is a substantial body of formal Pidgin text that can be crawled for language model training data. The BBC's terms of service should be reviewed before any scraping project.

Models

No dedicated Pidgin ASR model exists as of 2026. Pidgin speakers on phone calls are typically handled using the English STT model — Pidgin vocabulary and grammar are close enough to English that a Nigerian English-tuned model achieves moderate accuracy. This is an active gap that Maraba is working to address.

Multilingual and cross-lingual resources

Masakhane (masakhane.io). The most important African NLP community. Masakhane has produced datasets, models, and papers covering 50+ African languages, with significant Nigerian language representation. Their GitHub organisation at github.com/masakhane-io is the first place to look for any African language NLP resource. They also run workshops at major NLP conferences (ACL, EMNLP).

AfricanNLP Workshop. The African NLP Workshop at ACL is a key venue for research on African language technologies. Workshop proceedings from 2019–2025 are available on the ACL Anthology and contain the most current research on Nigerian language AI.

Deep Learning Indaba. Africa's premier ML research conference. A significant proportion of the papers presented there cover African language technology. Proceedings and slides are available at deeplearningindaba.com.

African Voices Dataset (AVD, SOAS). A dataset of African language speech recorded for phonetic research at SOAS University of London. Includes some Nigerian language data. Available through SOAS research data archives.

OPUS (opus.nlpl.eu). The OPUS collection aggregates parallel text across hundreds of languages. Nigerian languages are represented via JW300, WikiMatrix, CCAligned, and other subcorpora. OPUS is the best starting point for finding any parallel Nigerian language text data.

Tools and libraries

Pyarabic / Hausa text utilities. There is no widely adopted Python library specifically for Hausa text processing analogous to NLTK for English. The practical approach is to build directly on Python's Unicode string handling (which correctly handles ƙ, ɗ, ɓ etc.) combined with the unicode and unicodedata standard library modules.

The critical rule: never apply .lower() to Hausa, Yoruba, or Igbo text. Python's str.lower() and JavaScript's String.toLowerCase() are designed for European languages and can corrupt Nigerian language diacritics. Use case-insensitive regex with the Unicode flag, or build explicit lookup tables if case normalisation is needed.

spaCy language models. As of 2026, spaCy does not have official models for Hausa, Yoruba, or Igbo. The Masakhane community has produced spaCy-compatible models for some African languages — check their HuggingFace organisation (huggingface.co/masakhane) for availability.

HuggingFace transformers. The primary library for working with all the pre-trained models listed above. The HuggingFace Hub at huggingface.co is the best place to search for Nigerian language models using search terms like "hausa", "yoruba", "igbo", "nigerian".

The inference option: Maraba API

For developers who want to build on top of Nigerian language AI without training their own models, Maraba is building inference APIs (currently in private beta — invite-only):

Hausa STT API — ₦5/min, ~18% WER on conversational Nigerian Hausa
Yoruba TTS API — ₦3 per 1,000 characters, with correct tonal rendering
Igbo STT API — ₦5/min, ~22% WER on conversational Nigerian Igbo
Language Detection API — identify Hausa, Yoruba, Igbo, or English automatically, ₦2 per request

These are not positioned as replacements for open research — they are the practical option when you need to ship a product rather than train a model. If you are a researcher, the datasets and models listed above are your path. If you are a developer building a product, the API is likely faster and more cost-effective than running your own inference.

What is missing

Honest assessment of the gaps in Nigerian language AI as of 2026:

Large-scale conversational speech data. Every available Hausa, Yoruba, and Igbo speech corpus is small by the standards of English ASR. Mozilla Common Voice is community-driven and relies on volunteers; the growth rate is slow. What is needed is a coordinated effort to record conversational speech — not read speech — from a diverse range of Nigerian speakers.
Dialect-specific resources. "Hausa" covers Kano Hausa, Sokoto Hausa, Bauchi Hausa, and Hausa spoken in Niger — all with meaningful phonological differences. Most existing resources sample from one dialect region. Dialect diversity in training data is an unsolved problem.
Pidgin speech data. Nigerian Pidgin has essentially no speech corpus. Given that it is the true national lingua franca, this is a significant gap.
Commercial-domain text. Most existing text corpora for Nigerian languages are religious (JW300), news (MENYO-20k), or social media (NaijaSenti). Business and commercial language — prices, products, schedules, logistics — is underrepresented. This is the domain that matters most for voice AI in commercial settings.

If you are an academic or researcher working on any of these problems, the Maraba team is interested in collaboration. We have proprietary call data (properly anonymised) and are open to data sharing agreements with research institutions.

Build on Nigerian language AI today

Start with the Maraba API for production inference. Sign up free and try the Hausa, Yoruba, and Igbo STT endpoints with included free-tier credits.

Request beta →