
Mansa Technical Research Blog

Author: Odunola Jenrola




Speech remains humanity’s most natural and universal mode of communication. At Spitch, we are committed to advancing the frontier of speech technologies tailored to African contexts and beyond.


In developing our speech recognition system, we aimed to balance customer-specific preferences with competitive accuracy and efficient inference, under the constraints of limited compute and data resources. To this end, we outlined key design objectives our model must address:

  1. Context-aware transcription: The model should adapt to domain-specific conventions (e.g., transcribing “Naira” as “₦” in sales conversations but spelling it out in other settings).
  2. Representation of rare and culturally specific words: African names and linguistic expressions are diverse and semantically rich, requiring enhanced handling of low-frequency tokens.
  3. Timestamp alignment: Word and sentence-level timestamps are necessary to support downstream applications such as dubbing and synchronization.
  4. Speaker diarization: The system should distinguish and track speakers across dialogue segments.
  5. Efficiency: The model must maintain a low memory footprint suitable for deployment in resource-constrained environments.

We present Mansa, an Audio-Text Language Model post-trained for Automatic Speech Recognition (ASR), representing the first iteration of this research direction. Mansa aims to address the objectives outlined above and exhibits the following key capabilities:

  1. Good transcription accuracy: Achieves competitive recognition performance across diverse acoustic and linguistic conditions.
  2. Robust handling of disfluencies: Performs particularly well at recognizing and transcribing natural speech patterns that include hesitations, repetitions, and false starts (e.g., “I I think we should...”, “uhm, you know, maybe tomorrow”). This makes it more reliable for conversational and spontaneous speech settings.
  3. Precise temporal alignment: Provides reliable word- and sentence-level timestamps suitable for downstream synchronization and dubbing tasks.
  4. Prompt-guided spelling adaptation: Can be instructed on how to correctly spell rare or domain-specific words, enhancing its handling of low-frequency and culturally specific vocabulary.
  5. Resource efficiency: Maintains low memory and compute requirements, making it suitable for deployment in constrained environments.


Architecture

[Figure: Mansa architecture]


We’ve observed that several architectural principles developed for vision language models can be effectively extended to speech-language modeling. We drew significant inspiration from advances in vision-language models such as the InternVL [1] and QwenVL [2] series, as well as recent developments in audio-language models [3][4][5][6].

Mansa adopts a dual-tower architecture comprising an audio encoder and a language decoder. The audio encoder maps speech into high-dimensional representations, while the decoder generates text conditioned on both the encoded speech and its own language context. Specifically, we use the encoder of Whisper-large-v3 [7] as the speech backbone. Whisper is a supervised encoder–decoder model trained on hundreds of thousands of hours of labeled multilingual speech, which transfers well to downstream ASR and alignment tasks.

For our language component, we use Qwen3-0.6B-base [8]. In ablation studies, we found that alignment between modalities (audio and text) converged faster with base models than with post-aligned versions. We think that this behavior arises from the base model’s relatively unconstrained embedding space, which facilitates easier adaptation.

To align the embedding dimensions of the audio encoder and language model, an adapter module sits between them. The adapter consists of a pooling layer that reduces temporal length by half and a linear layer that projects the audio encoder outputs to match the language model's embedding dimensions.
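As an illustration, a minimal adapter of this kind could look like the following PyTorch sketch; the pooling choice, the hidden sizes (1280 for the Whisper-large-v3 encoder, 1024 for Qwen3-0.6B), and the module names are assumptions for illustration rather than the exact implementation:

import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Halves the encoder's temporal length, then projects to the LM embedding size."""
    def __init__(self, enc_dim: int = 1280, lm_dim: int = 1024):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)  # 2x temporal downsampling
        self.proj = nn.Linear(enc_dim, lm_dim)             # match the LM embedding dim

    def forward(self, h_c: torch.Tensor) -> torch.Tensor:
        # h_c: (B, T, enc_dim) audio encoder outputs
        pooled = self.pool(h_c.transpose(1, 2)).transpose(1, 2)  # (B, T // 2, enc_dim)
        return self.proj(pooled)                                  # (B, T // 2, lm_dim)

# Whisper-large-v3 emits 1500 encoder frames for a 30-second clip; pooling leaves 750.
h_a = AudioAdapter()(torch.randn(2, 1500, 1280))  # -> shape (2, 750, 1024)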


Finally, to allow for word and sentence-level timestamping, Mansa introduces a Connectionist Temporal Classification (CTC) [10] decoder for audio-text alignment. The CTC head provides temporal alignment between audio frames and text tokens, allowing accurate timing annotations necessary for downstream synchronization and dubbing tasks.

In total, Mansa comprises roughly 1.4B parameters.


Training

Mansa went through multiple stages of training.

Continual Pretraining: A significant limitation of most open-source base language models lies in the under-representation of African data during pretraining. To further enhance its contextual and cultural alignment, we performed full parameter continued pretraining of our language model on a 2-billion-token corpus curated to reflect African linguistic patterns, idiomatic expressions, and domain-specific knowledge.

Multimodal Alignment: Here we introduce the Audio Encoder and Adapter module to our language model. Given an audio-text pair (x,y), the audio input x is first passed through the audio encoder and adapter to get speech embeddings:


h_c = \mathrm{AudioEncoder}(x)
h_a = \mathrm{Adapter}(h_c)

Similarly, the text sequence y = (y_0, y_1, ..., y_{n-1}) is passed through the model’s embedding layer to obtain the text embeddings:

h_t = \mathrm{TextEmbeddingLayer}(y_{0:n-1})


The resulting embeddings are concatenated along the time dimension to form the multimodal input sequence:


z = [h_{a}; h_{t}]

This combined representation is then passed into the transformer decoder to model dependencies across both modalities:

o = \mathrm{Transformer}(z)


The model is optimized using the standard cross-entropy loss between the predicted output tokens and the ground truth:

\mathcal{L}_{\text{CE}} = - \sum_{t} \log P(y_t \mid h_{a}, \mathbf{y}_{<t})


In this stage, we keep the Audio Encoder frozen and set the language model's learning rate to 10% of the Adapter's learning rate to prevent overfitting or catastrophic forgetting. We train on 20,000 hours of audio.
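A simplified sketch of this stage is shown below, assuming a Hugging Face-style causal LM that accepts inputs_embeds; the function names, learning-rate values, and shapes are illustrative, not the exact training code:

import torch
import torch.nn.functional as F

def alignment_loss(audio_encoder, adapter, lm, embed_layer, x, y_ids):
    """Cross-entropy over the text tokens, conditioned on the audio prefix."""
    with torch.no_grad():                      # the audio encoder stays frozen in this stage
        h_c = audio_encoder(x)                 # (B, T, enc_dim)
    h_a = adapter(h_c)                         # (B, T', lm_dim)
    h_t = embed_layer(y_ids)                   # (B, n, lm_dim) text embeddings
    z = torch.cat([h_a, h_t], dim=1)           # concatenate along the time dimension
    logits = lm(inputs_embeds=z).logits        # (B, T' + n, vocab)
    # Position T'-1 predicts y_0 and position T'+i predicts y_{i+1}, so shift by one.
    pred = logits[:, h_a.size(1) - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), y_ids.reshape(-1))

def build_optimizer(adapter, lm, adapter_lr=1e-4):
    # Language model trains at 10% of the adapter's learning rate; the frozen encoder is excluded.
    return torch.optim.AdamW([
        {"params": adapter.parameters(), "lr": adapter_lr},
        {"params": lm.parameters(),      "lr": 0.1 * adapter_lr},
    ])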

Prompt-conditioned supervised fine-tuning (SFT):

At this stage, our objective is to allow for prompt-conditioned transcriptions, where the model is given text prompts to guide its generations.

We next construct a balanced training dataset. Audio samples are stratified into bins based on duration, and an equal number of examples is drawn from each bin to ensure uniform length distribution. For each audio–text pair, we apply a Named Entity Recognition (NER) model to the ground-truth transcription to extract salient entities (e.g., names, locations, and domain-specific terms). These extracted entities serve as text prompts during training.
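For illustration, the entity-extraction step can be sketched with the Flair toolkit we also use at evaluation time; the tagger checkpoint, prompt template, and example sentence below are assumptions, not our exact pipeline:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-large")  # pretrained NER tagger

def build_prompt(transcript: str) -> str:
    """Extract salient entities from the ground-truth text and format them as a prompt."""
    sentence = Sentence(transcript)
    tagger.predict(sentence)
    entities = [span.text for span in sentence.get_spans("ner")]
    return ("Spelling hints: " + ", ".join(entities)) if entities else ""

# e.g. "Spelling hints: Odunola, Chinedu, Lagos"
print(build_prompt("Odunola sent the invoice to Chinedu before leaving Lagos."))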

In curating our 5000-hour supervised subset, we also paid particular attention to verbatim transcription quality, especially for multispeaker and conversational audio. Our data strategy prioritized samples retaining disfluencies and fillers rather than removing them, ensuring that Mansa learns realistic conversational dynamics instead of cleaned transcripts.

To encourage robustness and prevent over-reliance on the prompt, we introduce prompt dropout and corruption augmentation:

  • In 50% of the samples, the text prompt is replaced with an empty string, simulating unprompted transcription.
  • In an additional 15%, the prompt is either replaced with irrelevant words, partially removed, or corrupted at the token level.

These augmentations promote generalization and protect against performance degradation when prompts are unavailable or noisy. During this stage, the audio encoder remains frozen, and only the adapter and language model parameters are updated. We compute the SFT loss only over the transcription tokens, not the prompt tokens.
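A sketch of the augmentation logic follows; the 50%/15% splits come from the text above, while the corruption modes and distractor words are illustrative:

import random

DISTRACTORS = ["invoice", "Tuesday", "market", "update", "blue"]  # illustrative fillers

def augment_prompt(prompt_tokens: list[str]) -> list[str]:
    """Prompt dropout and corruption applied per training sample."""
    r = random.random()
    if r < 0.50:                           # 50%: simulate unprompted transcription
        return []
    if r < 0.65:                           # additional 15%: corrupt the prompt
        mode = random.choice(["irrelevant", "partial", "token_noise"])
        if mode == "irrelevant":
            return random.sample(DISTRACTORS, k=3)
        if mode == "partial":
            return prompt_tokens[: max(1, len(prompt_tokens) // 2)]
        return [t if random.random() > 0.3 else random.choice(DISTRACTORS)
                for t in prompt_tokens]
    return prompt_tokens                   # remaining 35%: keep the prompt intact

# During SFT, label positions covering the prompt (and the audio prefix) are set to
# -100, so the cross-entropy loss is computed only over the transcription tokens.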

Temporal Alignment:

In this stage, the audio encoder is unfrozen to enable end-to-end optimization and smoother gradient flow across modalities. We attach a CTC head to the higher-resolution, pre-adapter encoder features and, at inference time, perform forced alignment with the ctc-segmentation algorithm to produce word-level boundaries:

o_{\mathrm{ctc}} = \mathrm{CTCDecoder}(h_{c})


The CTC loss is computed between the logprobs of this output and the ground-truth transcription. The final training objective combines the CTC loss and the supervised fine-tuning (SFT) loss through a weighted sum:

\mathcal{L} = \alpha \mathcal{L}_{\mathrm{CTC}} + (1 - \alpha)\mathcal{L}_{\mathrm{SFT}}


where α ∈ [0, 1] is a hyperparameter.
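A sketch of this combined objective using PyTorch's built-in CTC loss is shown below; the CTC head is the single linear projection over the pre-adapter encoder features h_c, while the vocabulary size, blank index, and α value are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 4096                               # illustrative; tied to the tokenizer in practice
ctc_head = nn.Linear(1280, VOCAB)          # single projection over encoder features h_c
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def combined_loss(h_c, targets, input_lens, target_lens, sft_loss, alpha=0.3):
    """L = alpha * L_CTC + (1 - alpha) * L_SFT."""
    # nn.CTCLoss expects log-probabilities shaped (T, B, V).
    log_probs = F.log_softmax(ctc_head(h_c), dim=-1).transpose(0, 1)
    l_ctc = ctc_criterion(log_probs, targets, input_lens, target_lens)
    return alpha * l_ctc + (1.0 - alpha) * sft_loss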

Results

Transcription Accuracy


[Figure: WER results across benchmark datasets]



We evaluate Mansa using benchmark datasets from the Hugging Face ASR leaderboard [9], measuring Word Error Rate (WER) across diverse acoustic conditions and domains.

While Mansa generally trails larger commercial models like GPT-4o-transcribe, it demonstrates competitive performance and particularly excels on conversational speech with disfluencies. Notably, Mansa achieves the lowest WER on the AMI corpus (0.0773), outperforming GPT-4o-transcribe (0.2237), GPT-4o-mini-transcribe (0.1576), and Whisper (0.1095). The AMI dataset contains natural meeting recordings with frequent hesitations, interruptions, and overlapping speech, characteristics that align closely with our training strategy for verbatim transcription quality.

Mansa also shows strong results on SPGISpeech (0.0486), a financial domain dataset.

On clean speech benchmarks, Mansa achieves respectable scores as well, though it does not match the performance of models trained on significantly larger datasets.


Prompting

[Figure: Prompted transcription results on naija_english]

naija_english is an evaluation dataset of conversations rich in African named entities and context.

Using the Flair toolkit, we extract named entities from every transcription in the evaluation set and feed those entities back as prompts to Mansa, GPT-4o-transcribe, and GPT-4o-mini-transcribe. Whisper is excluded because it does not support prompting. The results show that Mansa performs competitively against these much larger models.

Inference

Given that our APIs are called thousands of times daily and that compute resources at Spitch are limited, we designed an efficient deployment strategy to maximize throughput under constrained hardware conditions.

The Audio Encoder and Adapter modules jointly process inputs of fixed dimensions (B, L, D), corresponding to batch size, sequence length, and embedding dimension respectively, in a single forward pass. This fixed-shape property enables the precomputation of CUDA graphs for several predetermined batch sizes, effectively eliminating Python overhead from repeated kernel launches. For requests smaller than the target batch size of any precomputed graph, input tensors are zero-padded to match the required shape.
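A sketch of how a fixed-shape encoder+adapter forward pass can be captured as a CUDA graph and replayed on zero-padded requests is shown below; the model callable, bucket sizes, and assumption that the model returns a single tensor are illustrative, and the production serving code differs:

import torch

def capture_graph(model, batch_size, seq_len, dim):
    """Capture one CUDA graph for a fixed (B, L, D) input shape."""
    static_in = torch.zeros(batch_size, seq_len, dim, device="cuda", dtype=torch.bfloat16)
    # Warm up on a side stream before capture, as PyTorch recommends.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_out = model(static_in)      # assumed to return a single tensor
    return graph, static_in, static_out

def run_padded(graph, static_in, static_out, batch):
    """Zero-pad the request up to the captured batch size, replay, and slice the result."""
    b = batch.size(0)
    static_in.zero_()
    static_in[:b].copy_(batch)
    graph.replay()                         # no Python-side kernel launches
    return static_out[:b].clone()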



Since the CTC decoder is a single linear projection layer, it is co-located with the encoder within the same inference environment to minimize data transfer latency.

For the language decoder, we use vLLM [11], specifically the v0 engine, which supports ingestion of precomputed input embeddings directly into the model runtime via the --enable-prompt-embeds flag. All components operate in bfloat16 precision, providing a balance between numerical stability and memory efficiency.
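For illustration, the offline analogue of this setup looks roughly like the sketch below. It assumes a recent vLLM release where the engine accepts precomputed embeddings through a prompt_embeds field when enable_prompt_embeds is set; argument names can differ across vLLM versions, the model path is a placeholder, and in production the embeddings are streamed to a running server rather than built in-process:

import torch
from vllm import LLM, SamplingParams

# Offline engine with prompt-embedding ingestion enabled (mirrors --enable-prompt-embeds).
llm = LLM(model="Qwen/Qwen3-0.6B-Base", enable_prompt_embeds=True, dtype="bfloat16")
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# z: concatenated audio + prompt embeddings produced by the encoder/adapter stage.
z = torch.zeros(750, 1024, dtype=torch.bfloat16)   # placeholder shape for illustration
outputs = llm.generate({"prompt_embeds": z}, sampling)
print(outputs[0].outputs[0].text)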

Inference is fully asynchronous: audio embeddings are generated and transmitted to the vLLM server without blocking, allowing concurrent processing of multiple requests. This design further enables batch aggregation, where multiple concurrent requests are grouped dynamically and executed in a single forward pass, significantly improving throughput and reducing latency.
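A minimal sketch of that aggregation pattern with asyncio is shown below; encode_batch and send_to_vllm are hypothetical stand-ins for the padded encoder forward pass and the async vLLM client call, and the batch-size/timeout values are illustrative:

import asyncio

MAX_BATCH = 8        # maximum requests grouped into one forward pass
FLUSH_MS = 10        # how long to wait for more requests before flushing
queue: asyncio.Queue = asyncio.Queue()

async def aggregator():
    """Group concurrent requests, run one padded encoder pass, and hand off results."""
    while True:
        batch = [await queue.get()]
        try:
            while len(batch) < MAX_BATCH:
                batch.append(await asyncio.wait_for(queue.get(), FLUSH_MS / 1000))
        except asyncio.TimeoutError:
            pass
        audios, futures = zip(*batch)
        embeddings = encode_batch(list(audios))      # hypothetical: one padded forward pass
        for fut, emb in zip(futures, embeddings):
            fut.set_result(emb)                      # unblock the waiting request

async def transcribe(audio):
    """Per-request path: enqueue the audio, await its embedding, then decode via vLLM."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((audio, fut))
    embedding = await fut
    return await send_to_vllm(embedding)             # hypothetical async client call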

Concerns

Through both controlled experiments and user feedback, we identified the following limitations in the current iteration of Mansa:

  • Latency: While transcription accuracy meets practical deployment standards, Mansa does not yet operate at optimal inference speeds for its size. We are restricted to NVIDIA A10s for deployment at Spitch, but even so, current speeds indicate a need for further optimization of the inference pipeline and deployment architecture. We are also experimenting with an option that lets users consume the CTC transcription results directly for real-time speeds, at a slight cost in accuracy.
  • Code-switching Ability: Although Mansa was primarily trained on English and Pidgin speech, a large portion of our target users frequently code-switch between English and African languages during conversation. Under such conditions, the model occasionally hallucinates or produces incorrect transcriptions. We plan to release a multilingual version soon.

Future

We continue to advance our speech recognition research to meet the objectives outlined above. Specifically:

  • Diarization Integration: Exploring approaches to integrate a speaker diarization module without compromising transcription accuracy or latency.
  • Model Scaling: Expanding model capacity and improving both the quality and diversity of training data to achieve stronger generalization and robustness.
  • Architectural Refinement: Ablation studies focused on the adapter module to improve multimodal alignment between audio and text.
  • Direct Audio-to-Audio and Audio-to-Text Generation: Extending the system to support both transcription and speech output from audio input for translation.


Mansa.v1 is online and accessible today via the Spitch API and Spitch Studio.


References

  1. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv:2508.18265
  2. Qwen2.5-VL Technical Report. arXiv:2502.13923
  3. Audio Flamingo 2 (2025): An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. arXiv:2503.03983
  4. Fun-ASR Technical Report. arXiv:2509.12508
  5. Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition. arXiv:2407.04675
  6. FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv:2501.14350
  7. Radford et al. (2022): Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356
  8. Qwen3 Technical Report. arXiv:2505.09388
  9. Vaibhav et al. (2025): Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. arXiv:2510.06961
  10. Graves et al. (2006): Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006.
  11. Kwon et al. (2023): Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). arXiv:2309.06180