AG-BPE: Attention-Guided Tokenization Achieving State-of-the-Art Compression with 12x Smaller Vocabularies
Introduction
I’ve developed Attention-Guided BPE (AG-BPE), a novel tokenization approach that enhances traditional Byte-Pair Encoding by incorporating semantic awareness through contextual attention scores. The key innovation is a hybrid scoring mechanism that combines frequency statistics with Transformer-based contextual understanding.
Key Results
- Compression ratio: 3.77x (competitive with GPT-4 tokenizers)
- Vocabulary efficiency: over 12x smaller than industry-standard vocabularies (16K vs. 200K+ tokens)
- Decoding speed: up to 30x faster than standard tokenizers (0.03 ms vs. 0.8-0.9 ms)
- Zero out-of-vocabulary tokens on complex multilingual test cases
- Perfect morphological awareness: correctly isolates morphemes such as the "-ing" suffix
Technical Innovation
The core contribution is the hybrid merge scoring:
MergeScore(p) = Freq(p) + λ · AttentionScore(p)
where AttentionScore(p) is produced by a lightweight Transformer encoder (6 layers, 12 attention heads, 768 hidden dimensions) that provides contextual guidance for merge decisions.
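To make the scoring rule concrete, here is a minimal Python sketch of a single AG-BPE merge step. The `attention_score` stub and the λ value of 0.5 are illustrative assumptions, not the actual encoder or weighting used in the paper.

```python
from collections import Counter

# Hypothetical sketch of one AG-BPE merge step (not the author's exact code).
# `attention_score(pair, corpus)` stands in for the contextual score produced
# by the lightweight Transformer encoder; here it is just a stub.

LAMBDA = 0.5  # assumed weighting; the paper's actual value may differ


def count_pair_frequencies(corpus):
    """Count adjacent token-pair frequencies across a tokenized corpus."""
    freqs = Counter()
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            freqs[(a, b)] += 1
    return freqs


def attention_score(pair, corpus):
    """Placeholder for the contextual score from the attention encoder."""
    return 0.0  # a real implementation would aggregate attention weights


def best_merge(corpus, lam=LAMBDA):
    """Pick the pair maximizing MergeScore(p) = Freq(p) + lam * AttentionScore(p)."""
    freqs = count_pair_frequencies(corpus)
    return max(
        freqs,
        key=lambda pair: freqs[pair] + lam * attention_score(pair, corpus),
    )


def apply_merge(tokens, pair):
    """Replace every occurrence of `pair` in a token sequence with its merge."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The only change relative to vanilla BPE is inside `best_merge`: the pair with the highest combined score is merged, rather than simply the most frequent one.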
Community Traction
- 300+ views and 200+ downloads in the first month on Zenodo
- Organic discovery and engagement from NLP community
- Available with full reproducible code and datasets
The Research Breakthrough
I also identified what I call the "Pre-Training Pitfall": pre-training the attention module actually degrades performance by 45%. This challenges conventional wisdom in the field.
Root cause: representational shift. A pre-trained attention module's representations become obsolete as successive BPE merges create token relationships it never saw during training.
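To illustrate the alternative schedule this points to, below is a hedged sketch of concurrent training: the encoder starts fresh and is updated after each round of merges, so its contextual scores track the evolving vocabulary instead of going stale. Every class and function name here is an illustrative stub, not the author's actual code.

```python
# Hypothetical sketch of the concurrent-training schedule argued for above.
# The point is only the ordering: merge rounds alternate with encoder updates.

class AttentionEncoderStub:
    """Stand-in for the 6-layer, 12-head encoder; returns neutral scores."""

    def score(self, pair, corpus):
        return 0.0

    def update(self, corpus):
        pass  # a real encoder would take gradient steps on the current tokens


def merge_round(corpus, encoder, n_merges):
    """Placeholder for n_merges AG-BPE merges scored with the current encoder."""
    return corpus  # a real implementation would grow the vocabulary here


def train_concurrently(corpus, rounds=10, merges_per_round=100):
    encoder = AttentionEncoderStub()  # initialized fresh, NOT pre-trained
    for _ in range(rounds):
        # 1) A round of merges scored by Freq(p) + lambda * encoder.score(p).
        corpus = merge_round(corpus, encoder, merges_per_round)
        # 2) Update the encoder on the corpus as tokenized *right now*, so its
        #    scores follow the new token relationships instead of drifting
        #    obsolete (the representational-shift failure mode above).
        encoder.update(corpus)
    return encoder
```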
Paper: "The Pre-Training Pitfall: Why Contextual Guidance for BPE Must Be Trained Concurrently, Not A-Priori". Demo: AG BPE, a Hugging Face Space by RDTvlokip.
Looking for Feedback & Collaboration
I’m particularly interested in:
- Scaling experiments with larger vocabularies and datasets
- Integration possibilities with existing tokenization pipelines
- Theoretical analysis of the representational shift phenomenon I discovered
- Potential applications beyond traditional NLP tasks
Research Question
Traditional BPE relies purely on frequency statistics, but should tokenization be semantically aware? My results suggest that incorporating contextual understanding can achieve superior compression while maintaining linguistic coherence.
Théo Charlet - AI Researcher
GitHub: RDTvlokip