potion-mxbai-2m-512d

A static embedding model trained on 2M sentences that improves on our previous 1M model, pushing the state of the art for static embeddings further.

Highlights

  • 71.28 avg on MTEB English (STS + Classification + PairClassification)
  • +1.32 points over potion-base-32M (71.28 vs 69.96)
  • +2.23 STS points over our 1M model (71.59 vs 69.36)
  • 500x faster than transformer-based embedding models on CPU
  • ~32MB model size (63K vocab x 512 dims, float16)
  • Pure numpy inference, no GPU needed
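The "pure numpy" point means inference is just an embedding-table lookup followed by mean pooling. A minimal sketch under toy assumptions (the vocabulary, table sizes, and whitespace tokenizer below are illustrative, not the model's actual internals):

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table; the real model stores
# a float16 matrix of shape (vocab_size, 512) and looks rows up by token id.
vocab = {"hello": 0, "world": 1, "static": 2, "embeddings": 3}
table = np.random.default_rng(0).normal(size=(len(vocab), 512)).astype(np.float16)

def encode(sentence: str) -> np.ndarray:
    """Mean-pool the token vectors, then L2-normalize. All CPU, all numpy."""
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    vec = table[ids].astype(np.float32).mean(axis=0)
    return vec / np.linalg.norm(vec)

emb = encode("Hello world")  # one 512-dim unit vector
```

Because there is no transformer forward pass, encoding is a handful of array operations per sentence, which is where the CPU speedup comes from.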

How It Was Made

  1. Teacher: mixedbread-ai/mxbai-embed-large-v1 (335M params, BERT-large architecture)
  2. Custom vocabulary: 56K tokens built from 2M C4 English sentences via corpus frequency analysis
  3. Distillation: model2vec distillation with 512-dim PCA
  4. Tokenlearn pre-training: Contrastive loss training on 2M C4 sentences using tokenlearn
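Step 3's PCA reduction can be sketched in plain numpy. The sizes below are toy stand-ins for the teacher's output vectors and the real 512-dim target; this is an illustration of the technique, not model2vec's actual implementation:

```python
import numpy as np

# Hypothetical teacher output: one vector per vocabulary token
# (toy sizes; the real teacher is mxbai-embed-large-v1).
rng = np.random.default_rng(0)
teacher_vecs = rng.normal(size=(1000, 64))  # 1000 "tokens", 64 teacher dims

def pca_reduce(X: np.ndarray, dims: int) -> np.ndarray:
    """Project token vectors onto the top principal components
    (512 dims in the real model)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix yields the principal directions in Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dims].T

student_vecs = pca_reduce(teacher_vecs, dims=16)
```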

Benchmark Results

Model                        STS    Classification  PairClassification  Avg
potion-mxbai-2m-512d (this)  71.59  65.44           76.80               71.28
potion-mxbai-512d (1M)       69.36  65.52           77.32               70.73
potion-base-32M              65.74  65.96           78.17               69.96

What changed from 1M to 2M

Category            1M     2M     Delta
STS                 69.36  71.59  +2.23
Classification      65.52  65.44  -0.08
PairClassification  77.32  76.80  -0.52
Overall             70.73  71.28  +0.55

Doubling the training data primarily improved semantic textual similarity, which is the core task tokenlearn's contrastive loss optimizes for.
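For intuition, an in-batch contrastive (InfoNCE-style) loss of the kind referenced above can be written in a few lines of numpy. This is a generic sketch, not tokenlearn's exact objective; names and the temperature value are illustrative:

```python
import numpy as np

def info_nce(student: np.ndarray, teacher: np.ndarray, temp: float = 0.07) -> float:
    """Each student embedding should be most similar to its own teacher
    embedding, with the other rows in the batch acting as negatives."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = (s @ t.T) / temp
    # Softmax cross-entropy with the diagonal as the positive pair.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 32))
loss_matched = info_nce(t, t)                     # perfectly aligned pairs: low loss
loss_random = info_nce(rng.normal(size=(8, 32)), t)  # unrelated pairs: high loss
```

Minimizing this loss directly rewards sentence-level similarity structure, which is why more training data shows up mainly in the STS column.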

Usage

from model2vec import StaticModel

# Load the model from the Hugging Face Hub and encode on CPU
model = StaticModel.from_pretrained("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])

With Sentence Transformers:

from sentence_transformers import SentenceTransformer

# The same model also loads directly as a SentenceTransformer
model = SentenceTransformer("blobbybob/potion-mxbai-2m-512d")
embeddings = model.encode(["Hello world", "Static embeddings are fast"])

Training Details

  • Featurization: 2M C4 sentences across 10 L4 GPUs in parallel (~15 min)
  • Training: Tokenlearn contrastive loss, batch size 512, L4 GPU
  • Total cost: ~$3-4 on Modal

Citation

@article{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec}
}