potion-mxbai-2m-512d
A static embedding model trained on 2M sentences that improves on our previous 1M model, pushing the state of the art for static embeddings further.
Highlights
- 71.28 avg on MTEB English (STS + Classification + PairClassification)
- +1.32 points over potion-base-32M (71.28 vs 69.96)
- +2.23 STS points over our 1M model (71.59 vs 69.36)
- 500x faster than transformer-based embedding models on CPU
- ~32MB model size (63K vocab × 512 dims, float16)
- Pure NumPy inference, no GPU needed
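Static embedding inference is just an array lookup plus a mean pool, which is why it runs so fast on CPU. A minimal sketch of the idea, with a toy vocabulary and random weights standing in for the real model (this is illustrative, not the actual model2vec internals):

```python
import numpy as np

# Toy stand-ins: the real model ships a large tokenizer vocab and a (vocab, 512) float16 matrix.
vocab = {"hello": 0, "world": 1, "static": 2, "embeddings": 3, "are": 4, "fast": 5}
weights = np.random.default_rng(0).standard_normal((len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    """Mean-pool the static vectors of the known tokens in a sentence."""
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    return weights[ids].astype(np.float32).mean(axis=0)

emb = encode("Hello world")
print(emb.shape)  # (512,)
```

No attention, no matrix multiplies over the sequence: one gather and one mean per sentence.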
How It Was Made
- Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture)
- Custom vocabulary: 56K tokens built from 2M C4 English sentences via corpus frequency analysis
- Distillation: model2vec distillation with 512-dim PCA
- Tokenlearn pre-training: contrastive-loss training on 2M C4 sentences using tokenlearn
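The distillation step can be pictured as: embed every vocabulary token once through the teacher, then compress the resulting matrix to 512 dimensions with PCA. A rough sketch with random vectors standing in for teacher outputs (the real pipeline uses model2vec's distillation, not this code):

```python
import numpy as np

rng = np.random.default_rng(0)
teacher_dim, vocab_size, out_dim = 1024, 2000, 512  # mxbai-embed-large-v1 outputs 1024-dim vectors

# Stand-in for per-token teacher embeddings; in reality each vocab token is run through the teacher.
teacher_embs = rng.standard_normal((vocab_size, teacher_dim))

# PCA via SVD of the mean-centered matrix, keeping the top 512 principal components.
centered = teacher_embs - teacher_embs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_embs = centered @ vt[:out_dim].T  # (vocab_size, 512) static embedding table
print(static_embs.shape)
```

Tokenlearn then fine-tunes this table against sentence-level targets, which is where the training corpus size matters.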
Benchmark Results
| Model | STS | Classification | PairClassification | Avg |
|---|---|---|---|---|
| potion-mxbai-2m-512d (this) | 71.59 | 65.44 | 76.80 | 71.28 |
| potion-mxbai-512d (1M) | 69.36 | 65.52 | 77.32 | 70.73 |
| potion-base-32M | 65.74 | 65.96 | 78.17 | 69.96 |
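The Avg column is the unweighted mean of the three category scores, e.g. for this model:

```python
sts, cls, pair = 71.59, 65.44, 76.80
avg = round((sts + cls + pair) / 3, 2)
print(avg)  # 71.28
```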
What Changed from 1M to 2M
| Category | 1M | 2M | Delta |
|---|---|---|---|
| STS | 69.36 | 71.59 | +2.23 |
| Classification | 65.52 | 65.44 | -0.08 |
| PairClassification | 77.32 | 76.80 | -0.52 |
| Overall | 70.73 | 71.28 | +0.55 |
Doubling the training data primarily improved semantic textual similarity, which is the core task tokenlearn's contrastive loss optimizes for.
Usage
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
With Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])
```
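Either path returns plain NumPy arrays, so downstream similarity is a one-liner. A small hedged example with cosine similarity; random vectors stand in here for the rows of `embeddings` produced above:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice pass embeddings[0] and embeddings[1].
rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)
print(cosine_sim(a, a))  # ~1.0 up to float rounding
```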
Training Details
- Featurization: 2M C4 sentences across 10 L4 GPUs in parallel (~15 min)
- Training: Tokenlearn contrastive loss, batch size 512, L4 GPU
- Total cost: ~$3-4 on Modal
Citation
```bibtex
@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title  = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year   = {2024},
  url    = {https://github.com/MinishLab/model2vec}
}
```