GPT-OSS-20B-Vision: Adding multimodal capabilities to GPT-OSS with a novel multi-scale approach, trained on a single DGX Spark

Hey everyone,

I’ve been working on adding vision capabilities to GPT-OSS (OpenAI's open-weight MoE model) and just released an early preview: GPT-OSS-20B-Vision

The Setup

One NVIDIA DGX Spark. A hotel room in Dubai. 7 days of development, 3.5 days of training. No cluster, no team, no corporate backing. Just me trying to make GPT-OSS see.

What I Built

A vision-language model for GPT-OSS-20B using QLoRA and a multi-scale visual feature extraction approach I’m calling PseudoDeepStack. Instead of using only the final layer of the vision encoder (as most VLMs do), I extract features from multiple depths, capturing low-level detail, mid-level structure, and high-level semantics in every visual token. Since the encoder already computes those intermediate hidden states, this adds essentially zero inference cost while giving significantly richer representations.
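A minimal PyTorch sketch of the idea, for the curious. The tap layers, hidden sizes, and projector shape here are illustrative placeholders, not the released training config:

```python
import torch
import torch.nn as nn

class PseudoDeepStackProjector(nn.Module):
    """Fuse vision-encoder features from several depths into one visual token stream.

    Illustrative sketch: tap layers, hidden sizes, and projector shape are
    placeholders, not the actual GPT-OSS-20B-Vision config.
    """

    def __init__(self, vision_hidden=1152, llm_hidden=2880, tap_layers=(6, 12, 18, 26)):
        super().__init__()
        self.tap_layers = tap_layers
        # Single projection over the concatenated multi-scale features.
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden * len(tap_layers), llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, hidden_states):
        # hidden_states: tuple of [batch, num_patches, vision_hidden], one entry per
        # encoder layer (run SigLIP with output_hidden_states=True to get these).
        feats = [hidden_states[i] for i in self.tap_layers]   # low -> high level
        stacked = torch.cat(feats, dim=-1)                    # [B, P, vision_hidden * K]
        return self.proj(stacked)                             # [B, P, llm_hidden] visual tokens
```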

The Big Lesson: Standard VLM Approaches Fail on MoE

This was the most interesting finding. Projector-only training completely fails on MoE models. I trained a full projector alignment + instruction tuning pipeline (the standard LLaVA approach) and got pure garbage. The expert routing can’t handle visual tokens it’s never seen — they get misrouted and the output is incoherent.

I burned a full training run discovering this before finding a solution that works. If anyone else is working on adding vision to MoE architectures — save yourself the pain. The routing needs to adapt to the new modality, not just the projector.
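Concretely, the fix is to widen the trainable surface beyond the projector so the router and experts actually see gradients from visual tokens. A hedged sketch of what that can look like with QLoRA via peft; module names and hyperparameters are illustrative, and some GPT-OSS implementations fuse expert weights, which needs custom LoRA handling rather than plain name matching:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base model (QLoRA-style setup).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", quantization_config=bnb, device_map="auto"
)

# Projector-only training leaves the MoE router blind to visual tokens.
# Include attention, expert MLPs, and (crucially) the router among the LoRA
# targets so routing can adapt to the new modality. Names below are illustrative
# and depend on how the checkpoint implements its expert layers.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # expert MLPs (if unfused)
        "router",                                 # MoE routing
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```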

Honest Assessment

This is a proof of concept at 22% of the planned training. Setting expectations:

Works: Object recognition, scene understanding, basic visual Q&A, simple image descriptions.

Doesn’t work yet: Fine-grained OCR, accurate counting, complex spatial reasoning, anything that requires production-level VLM quality.

The current training configuration hits a capacity ceiling. The architecture and pipeline are sound — a higher-capacity configuration with more data should break through. Full technical details, loss curves, and training config are in the model card.

What’s Next

The architecture works. The pipeline works. What I need is compute to train the production version: roughly $3,000 of GPU time would get both GPT-OSS-20B and 120B to production quality. Everything stays fully open — weights, code, training logs.

If you have access to compute or want to sponsor GPU hours, reach out: vincentkaufmann (at) protonmail.com

Donation details and full architecture breakdown are on the model page.

If you’re working on VLMs for MoE architectures, or if you’ve got ideas on how to push this further, I’d love to hear from you.

Built with PyTorch, llama.cpp, SigLIP, and way too much Domino’s pizza.


I’d say the proof-of-concept looks solid, and with more GPU time this could get really powerful. Props for being so honest about what works and what doesn’t; it makes it feel realistic.


Have you tried it out? Any feedback?
