potion-mxbai-2m-512d
A static embedding model trained on 2M sentences that improves on our previous 1M model, pushing the state of the art for static embeddings further.
Highlights
- 71.28 avg on MTEB English (STS + Classification + PairClassification)
- +1.32 points over potion-base-32M (71.28 vs 69.96)
- +2.23 STS points over our 1M model (71.59 vs 69.36)
- 500x faster than transformer-based embedding models on CPU
- ~32MB model size (63K vocab × 512 dims, float16)
- Pure NumPy inference, no GPU needed
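Static embedding inference is just an array lookup plus a mean pool, which is why it runs so fast on CPU. A minimal sketch of the idea, with a toy vocabulary and random weights standing in for the real model (this is illustrative, not the actual model2vec internals):

```python
import numpy as np

# Toy stand-ins: the real model ships a large tokenizer vocab and a (vocab, 512) float16 matrix.
vocab = {"hello": 0, "world": 1, "static": 2, "embeddings": 3, "are": 4, "fast": 5}
weights = np.random.default_rng(0).standard_normal((len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    """Mean-pool the static vectors of the known tokens in a sentence."""
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    return weights[ids].astype(np.float32).mean(axis=0)

emb = encode("Hello world")
print(emb.shape)  # (512,)
```

No attention, no matrix multiplies over the sequence: one gather and one mean per sentence.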
How It Was Made
- Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture)
- Custom vocabulary: 56K tokens built from 2M C4 English sentences via corpus frequency analysis
- Distillation: model2vec distillation with 512-dim PCA
- Tokenlearn pre-training: contrastive-loss training on 2M C4 sentences using tokenlearn
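The distillation step can be pictured as: embed every vocabulary token once through the teacher, then compress the resulting matrix to 512 dimensions with PCA. A rough sketch with random vectors standing in for teacher outputs (the real pipeline uses model2vec's distillation, not this code):

```python
import numpy as np

rng = np.random.default_rng(0)
teacher_dim, vocab_size, out_dim = 1024, 2000, 512  # mxbai-embed-large-v1 outputs 1024-dim vectors

# Stand-in for per-token teacher embeddings; in reality each vocab token is run through the teacher.
teacher_embs = rng.standard_normal((vocab_size, teacher_dim))

# PCA via SVD of the mean-centered matrix, keeping the top 512 principal components.
centered = teacher_embs - teacher_embs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_embs = centered @ vt[:out_dim].T  # (vocab_size, 512) static embedding table
print(static_embs.shape)
```

Tokenlearn then fine-tunes this table against sentence-level targets, which is where the training corpus size matters.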
Benchmark Results
| Model | STS | Classification | PairClassification | Avg |
|---|---|---|---|---|
| potion-mxbai-2m-512d (this) | 71.59 | 65.44 | 76.80 | 71.28 |
| potion-mxbai-512d (1M) | 69.36 | 65.52 | 77.32 | 70.73 |
| potion-base-32M | 65.74 | 65.96 | 78.17 | 69.96 |
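The Avg column is the unweighted mean of the three category scores, e.g. for this model:

```python
sts, cls, pair = 71.59, 65.44, 76.80
avg = round((sts + cls + pair) / 3, 2)
print(avg)  # 71.28
```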
What Changed from 1M to 2M
| Category | 1M | 2M | Delta |
|---|---|---|---|
| STS | 69.36 | 71.59 | +2.23 |
| Classification | 65.52 | 65.44 | -0.08 |
| PairClassification | 77.32 | 76.80 | -0.52 |
| Overall | 70.73 | 71.28 | +0.55 |
Doubling the training data primarily improved semantic textual similarity, which is the core task tokenlearn's contrastive loss optimizes for.
Usage
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
With Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
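Either path returns plain NumPy arrays, so downstream similarity is a one-liner. A small hedged example with cosine similarity; random vectors stand in here for the rows of `embeddings` produced above:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice pass embeddings[0] and embeddings[1].
rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)
print(cosine_sim(a, a))  # ~1.0 up to float rounding
```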
Training Details
- Featurization: 2M C4 sentences across 10 L4 GPUs in parallel (~15 min)
- Training: Tokenlearn contrastive loss, batch size 512, L4 GPU
- Total cost: ~$3-4 on Modal
Citation
```bibtex
@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title  = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year   = {2024},
  url    = {https://github.com/MinishLab/model2vec}
}
```