Every major AI speech recognition system — Whisper, Google Speech-to-Text, AWS Transcribe, Azure Cognitive Services — was primarily trained on American and British English. Some support "English (Nigeria)" as a locale option. Most do not. And even those that do offer it treat Nigerian English as a minor dialect of American English, not as a distinct linguistic environment shaped by four dominant languages with their own phonological rules, tonal systems, and vocabulary.
The result: these models fail Nigerian callers in ways that matter.
The problem isn't accent. It's phonology.
Nigerian English speakers don't just have different accents. They operate in a linguistic environment where:
- Yoruba is a tonal language with three distinct tones (high, mid, low) and diacritics like ẹ, ọ, ṣ that carry semantic meaning
- Hausa has glottalised consonants (the implosives ɓ and ɗ, the ejective ƙ, and the glottalised glide ƴ) that don't exist in English phonology
- Igbo uses nasal vowels and a two-tone system with characteristic syllable-final nasalisation
- Nigerian English itself follows phonological patterns distinct from British or American English — syllable timing, vowel reduction, consonant clusters
A model trained on American English has never heard these sounds in its training data. It guesses. On telephone-quality audio — with compression, background noise, and the acoustic characteristics of Nigerian mobile networks — it guesses badly.
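The diacritics point is concrete, not cosmetic. As a minimal illustration (plain Python standard library, not any model pipeline described here), stripping combining marks from a Yoruba word, which is effectively what an ASCII-centric text pipeline does, destroys exactly the tone and vowel-quality information that distinguishes words:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks -- roughly what an
    ASCII-centric pipeline does to Yoruba text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "àárọ̀" (morning): two tone marks plus the underdot vowel ọ.
word = "àárọ̀"
print(strip_diacritics(word))  # prints "aaro"
```

The tone marks and the underdot on ọ are separate combining characters under Unicode decomposition; once dropped, tonally distinct Yoruba words collapse into a single ambiguous form.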
Code-switching is the real test
The hardest problem — and the one that most distinguishes Nigerian callers from any other market — is mid-sentence code-switching. A typical caller doesn't stay in one language for the duration of a call. They move between languages the way fluent multilingual speakers always do: naturally, mid-clause, based on what's easier to express in which language.
A sample sentence from a real call: "Ẹ káàsán — I want to book an appointment, and please tell me the doctor dey available on Saturday?" ("Ẹ káàsán" is a Yoruba greeting, "good afternoon"; "dey" is Nigerian Pidgin for "is".)
This opens in Yoruba, switches to standard English, then closes with Nigerian Pidgin. A US-trained model hears the Yoruba greeting as noise, produces gibberish for the Pidgin, and gets the English clause — the least informative part of the sentence — roughly right.
Maraba handles this correctly because it was trained on it. Every line of training data includes the kind of mixed-language utterances real Nigerian callers produce.
What "purpose-built" actually means
Maraba's speech recognition is built on a Whisper base model fine-tuned on telephone-quality audio from Nigerian speakers across all four languages. "Telephone-quality" matters: 8kHz audio with the codec distortion and background noise characteristics of Nigerian mobile networks is acoustically different from studio recordings or broadband audio. A model trained on the latter performs badly on the former.
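The acoustic gap between broadband and telephone audio is easy to reproduce in preprocessing. A minimal sketch (using numpy/scipy, not Maraba's actual pipeline): band-limit 16 kHz audio to the roughly 300–3400 Hz telephone passband, then downsample to 8 kHz:

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

def to_telephone_quality(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Approximate narrowband telephone audio: band-limit to
    roughly 300-3400 Hz, then resample to 8 kHz."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    band_limited = sosfilt(sos, audio)
    # 16 kHz -> 8 kHz (resample_poly applies its own anti-alias filter).
    return resample_poly(band_limited, up=1, down=sr // 8_000)

# One second of a 1 kHz test tone at 16 kHz.
t = np.linspace(0, 1, 16_000, endpoint=False)
tone = np.sin(2 * np.pi * 1_000 * t).astype(np.float32)
narrowband = to_telephone_quality(tone)
print(narrowband.shape)  # prints (8000,)
```

A model fine-tuned on audio degraded this way (plus real network recordings) sees at training time the frequency range it will actually receive from a phone line, instead of broadband detail that will never be present.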
The training corpus includes:
- Hausa speech samples covering Northern Nigerian regional variation (Kano, Katsina, Sokoto cadence differences)
- Yoruba samples with appropriate tonal annotation — critical for distinguishing words that differ only by tone
- Igbo samples from multiple dialectal regions (Owerri, Onitsha, Enugu)
- Nigerian English samples from Lagos, Abuja, Port Harcourt, and secondary cities
- Code-switched samples — the most important training category, and the hardest to collect
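One way to picture that last category is as data. A hypothetical record format (the field names and the `pcm` label choice are illustrative, not Maraba's schema) pairing an utterance with per-segment language labels, using the sample sentence from earlier:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    lang: str  # "yo", "ha", "ig", "en", or "pcm" (ISO 639-3 for Nigerian Pidgin)

@dataclass
class TrainingSample:
    audio_path: str       # hypothetical path to an 8 kHz recording
    segments: list[Segment]

    @property
    def transcript(self) -> str:
        return " ".join(s.text for s in self.segments)

    @property
    def is_code_switched(self) -> bool:
        # More than one language within a single utterance.
        return len({s.lang for s in self.segments}) > 1

sample = TrainingSample(
    audio_path="calls/0001.wav",
    segments=[
        Segment("Ẹ káàsán —", "yo"),
        Segment("I want to book an appointment, and please tell me", "en"),
        Segment("the doctor dey available on Saturday?", "pcm"),
    ],
)
print(sample.is_code_switched)  # prints True
```

Labels at this granularity are what make code-switched data expensive to collect: an annotator has to mark language boundaries inside a single utterance, not just tag whole recordings.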
Why this matters for your business
If your AI receptionist mishears a caller's name, misses their appointment date, or transcribes "àárọ̀" (morning) as noise, you're not running an AI receptionist. You're running a call answering service that occasionally produces a garbled log.
The downstream consequences are real: wrong bookings, unanswered escalations, callers who hang up and don't call back. In a market where every missed call is a lost customer, an AI that doesn't understand your callers is worse than no AI at all — it creates the illusion of coverage while failing silently.
Purpose-built Nigerian language AI isn't a differentiator. It's the minimum bar for the product to work.
Listen to Maraba answer in Yoruba, switch to English, and back again — without missing a word.
Hear a live demo →