OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models Paper • 2604.00688 • Published Apr 1 • 14
GlotSuite Collection GlotSuite: Paving the Way for Bringing Generative AI to Underserved Communities • 17 items • Updated Apr 15 • 3
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts Paper • 2604.12978 • Published Apr 14 • 5
Simba Speech Series Collection Simba bridges the digital divide with a unified suite for African AI: the largest open-source speech benchmark and models covering 61 languages • 13 items • Updated Feb 12 • 1
Paza Collection Paza is a collection of speech models & benchmarks for low resource languages by the Microsoft Research Africa - Nairobi Lab • 4 items • Updated 24 days ago • 6
HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models Paper • 2511.01066 • Published Nov 2, 2025 • 2
Tamazight Collection https://huggingface.co/Tamazight-NLP • 2 items • Updated Dec 26, 2025 • 1
Faithful Persona-based Conversational Dataset Generation with Large Language Models Paper • 2312.10007 • Published Dec 15, 2023 • 11
Awal -- Community-Powered Language Technology for Tamazight Paper • 2510.27407 • Published Oct 31, 2025 • 1
M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks Paper • 2407.03791 • Published Jul 4, 2024 • 2
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data Paper • 2506.00469 • Published May 31, 2025 • 4
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6, 2025 • 514
OLDI and friends Collection This collection groups the datasets that have been featured as part of WMT’s Open Language Data Initiative shared task. • 5 items • Updated Mar 25 • 5
Article There is no such thing as a tokenizer-free lunch catherinearnett • Sep 25, 2025 • 98
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling Paper • 2403.10691 • Published Mar 15, 2024 • 1
Article Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research omarkamali • Jul 19, 2025 • 8