Hey everyone,
I’ve been working on adding vision capabilities to GPT-OSS (an open-weight Mixture-of-Experts model) and just released an early preview: GPT-OSS-20B-Vision
The Setup
One NVIDIA DGX Spark. A hotel room in Dubai. 7 days of development, 3.5 days of training. No cluster, no team, no corporate backing. Just me trying to make GPT-OSS see.
What I Built
A vision-language model for GPT-OSS-20B using QLoRA and a multi-scale visual feature extraction approach I’m calling PseudoDeepStack. Instead of using only the final layer of the vision encoder (which is what most VLMs do), I extract features from multiple depths — capturing low-level detail, mid-level structure, and high-level semantics in every visual token. Zero additional inference cost, significantly richer representations.
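To make the idea concrete, here’s a minimal PyTorch sketch of what a multi-depth fusion projector could look like. The layer indices, the concatenation-based fusion, and the hidden sizes are my illustrative assumptions, not the released training config (the real details are in the model card):

```python
import torch
import torch.nn as nn

class PseudoDeepStackProjector(nn.Module):
    """Illustrative sketch of multi-depth feature fusion.

    Takes hidden states from several depths of a vision encoder
    (e.g. SigLIP run with output_hidden_states=True), concatenates
    them per visual token, and projects into the LLM embedding space.
    The visual token count never changes, so inference cost stays flat.
    """

    def __init__(self, vision_dim=1152, llm_dim=2880, layers=(7, 15, 23, -1)):
        super().__init__()
        # Assumed depths: low-level detail, mid-level structure,
        # high-level semantics, plus the final layer.
        self.layers = layers
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * len(layers), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states):
        # hidden_states: tuple of [batch, num_patches, vision_dim] tensors,
        # one per encoder layer.
        feats = [hidden_states[i] for i in self.layers]
        fused = torch.cat(feats, dim=-1)   # wider channels, same token count
        return self.proj(fused)            # [batch, num_patches, llm_dim]


# Quick shape check with dummy features (25 layers, 729 patches, dim 1152):
if __name__ == "__main__":
    dummy = tuple(torch.randn(2, 729, 1152) for _ in range(25))
    tokens = PseudoDeepStackProjector()(dummy)
    print(tokens.shape)  # torch.Size([2, 729, 2880])
```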
The Big Lesson: Standard VLM Approaches Fail on MoE
This was the most interesting finding. Projector-only training completely fails on MoE models. I trained a full projector alignment + instruction tuning pipeline (the standard LLaVA approach) and got pure garbage. The expert routing can’t handle visual tokens it’s never seen — they get misrouted and the output is incoherent.
I burned a full training run discovering this before finding a solution that works. If anyone else is working on adding vision to MoE architectures — save yourself the pain. The routing needs to adapt to the new modality, not just the projector.
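As a rough illustration of what “let the routing adapt” can look like with QLoRA/PEFT: extend the LoRA target modules beyond attention to cover the MoE router (and optionally the expert projections). This is a hedged sketch, not my exact training config, and the module names are placeholders:

```python
from peft import LoraConfig

# Hypothetical LoRA config: include the expert router (and expert MLP
# projections) so routing can adapt to visual tokens, instead of leaving
# the MoE layers frozen as in projector-only training. Module names are
# placeholders; inspect model.named_modules() for your checkpoint.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "router",                                 # MoE routing (placeholder name)
        "gate_proj", "up_proj", "down_proj",      # expert MLPs (placeholder names)
    ],
)

# model = get_peft_model(base_llm, lora_config)
# The vision projector itself stays fully trainable alongside the adapters.
```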
Honest Assessment
This is a proof of concept at 22% of the planned training run. Setting expectations:
Works: Object recognition, scene understanding, basic visual Q&A, simple image descriptions.
Doesn’t work yet: Fine-grained OCR, accurate counting, complex spatial reasoning, anything that requires production-level VLM quality.
The current training configuration hits a capacity ceiling. The architecture and pipeline are sound — a higher-capacity configuration with more data should break through. Full technical details, loss curves, and training config are in the model card.
What’s Next
The architecture works. The pipeline works. What I need is compute to train the production version — roughly $3,000 of GPU time would get both GPT-OSS-20B and 120B to production quality. Everything stays fully open — weights, code, training logs.
If you have access to compute or want to sponsor GPU hours, reach out: vincentkaufmann (at) protonmail.com
Donation details and full architecture breakdown are on the model page.
If you’re working on VLMs for MoE architectures, or if you’ve got ideas on how to push this further, I’d love to hear from you.
Built with PyTorch, llama.cpp, SigLIP, and way too much Domino’s pizza.
