model support per CrispASR — pure C++ inference with GGUF quantisation (no Python needed)

#48

by cstr - opened Apr 16


We've built a complete C++ runtime for Voxtral-Mini-3B in CrispASR, a multi-backend ASR tool based on ggml. One binary, one GGUF file — no Python, no PyTorch, no pip install.

What works:

Full transcription pipeline (mel → Whisper encoder → Mistral 3B LLM decode)
Q4_K / Q5_0 / Q8_0 / F16 quantisation (2.5 GB Q4_K vs 6+ GB BF16)
Word-level timestamps via CTC forced alignment (-am canary-ctc-aligner.gguf or -am qwen3-forced-aligner.gguf)
Temperature sampling + best-of-N decoding (--best-of 5 -tp 0.3)
Streaming from mic/stdin (--stream, --mic, --live)
Audio Q&A mode (--ask "What language is this?"; Voxtral 3B is a full audio LLM, not just ASR)
Speech translation (--translate -tl de)
Speaker diarisation, language ID, SRT/VTT/JSON output
GPU acceleration via CUDA / Metal / Vulkan (ggml backends)
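
To make the flags above a bit more concrete, here is a rough sketch of how they might be combined with the -m / -f options from the quick start. The flags themselves are the ones listed above; the `run` wrapper, the `DRY_RUN` switch and the assumption that each flag can simply be appended to a basic transcription command are mine, so check `--help` on your build before relying on it.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of CrispASR): prints the command line when
# DRY_RUN=1 (the default here), and executes it when DRY_RUN=0.
CRISPASR="${CRISPASR:-./build/bin/crispasr}"
MODEL="${MODEL:-voxtral-mini-3b-2507-q4_k.gguf}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Note: echo flattens the arguments, so quoting is not preserved.
        echo "$CRISPASR $*"
    else
        "$CRISPASR" "$@"
    fi
}

# Word-level timestamps using a separate aligner model
run -m "$MODEL" -f audio.wav -am canary-ctc-aligner.gguf

# Temperature sampling with best-of-N decoding
run -m "$MODEL" -f audio.wav --best-of 5 -tp 0.3

# Audio Q&A instead of plain transcription
run -m "$MODEL" -f audio.wav --ask "What language is this?"

# Speech translation into German
run -m "$MODEL" -f audio.wav --translate -tl de
```

Run it as-is to preview the four command lines, then set DRY_RUN=0 once the binary and model file are in place.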

Quick start:

# Build
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8

# Auto-download model and transcribe
./build/bin/crispasr --backend voxtral -m auto -f audio.wav

# Or use pre-quantised GGUF from HF
./build/bin/crispasr -m voxtral-mini-3b-2507-q4_k.gguf -f audio.wav -osrt

Pre-quantised GGUFs: cstr/voxtral-mini-3b-2507-GGUF
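
If you have a folder of recordings, the second quick-start command can be wrapped in a loop. This is only a sketch: `transcribe_dir` is a hypothetical helper name, the model filename is the one from the example above, and I am assuming the binary exits non-zero on failure.

```shell
#!/bin/sh
# Binary and model can be overridden from the environment.
CRISPASR="${CRISPASR:-./build/bin/crispasr}"
MODEL="${MODEL:-voxtral-mini-3b-2507-q4_k.gguf}"

# Hypothetical helper: run the quick-start SRT command on every .wav in a directory.
transcribe_dir() {
    dir="$1"
    for f in "$dir"/*.wav; do
        # Skip the unexpanded pattern when the directory has no .wav files.
        [ -e "$f" ] || continue
        "$CRISPASR" -m "$MODEL" -f "$f" -osrt || echo "failed: $f" >&2
    done
}

# Example usage:
#   transcribe_dir ./recordings
```

The loop keeps going when one file fails and reports the failure on stderr, so a single bad recording does not abort the whole batch.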

CrispASR supports 11 ASR backends in the same binary, including Whisper, Parakeet, Canary, Cohere, Granite, Qwen3, wav2vec2, and both Voxtral variants.

cstr changed discussion title from CrispASR — pure C++ inference with GGUF quantisation (no Python needed) to model support per CrispASR — pure C++ inference with GGUF quantisation (no Python needed) Apr 16
