AI & ML interests

None defined yet.

Recent Activity

tomaarsen updated a model about 2 months ago

cross-encoder-testing/mxbai-rerank-base-v2-STv6

tomaarsen updated a model about 2 months ago

cross-encoder-testing/Qwen3-Reranker-0.6B-STv6

tomaarsen new activity about 2 months ago

cross-encoder-testing/Qwen3-Reranker-0.6B-STv6:Update chat_template.jinja

View all activity

tomaarsen

posted an update about 2 hours ago

Post

🤗 Announcing the Ettin Reranker family: six new state-of-the-art CrossEncoder rerankers for search from 17M to 1B parameters, plus the full training data and the ~150-line recipe. Built on the Ettin ModernBERT encoders, Apache 2.0. Details:

All six were trained with the same single-stage pointwise MSE distillation recipe, with mixedbread-ai/mxbai-rerank-large-v2 (1.54B) as the teacher. Only the learning rate and per-device batch size change between sizes. The 1B student matches the teacher within 0.0001 NDCG@10 on MTEB(eng, v2) Retrieval, the 150M is the strongest reranker I tested in the under-600M range, and the 17M beats the 33M ms-marco-MiniLM-L12-v2 by +0.051 NDCG@10 at roughly half the parameter count.

Speed matters as much as quality for a reranker, since it determines whether the model fits the latency budget between retrieval and showing results. Our 17M is the fastest reranker in the whole comparison at 7517 pairs/sec on an H100. Our 150M runs 2.3x faster than the two other 150M ModernBERT-base rerankers (gte-reranker-modernbert-base and granite-embedding-reranker-english-r2) because the modular Transformer module propagates unpadded inputs through every layer rather than just the FA2 attention kernel. And our 1B is 2.4x faster than its 1.5B teacher while matching it on quality.

I bootstrapped the training recipe with the new train-sentence-transformers Agent Skill shipped in Sentence Transformers v5.5.0. Install it with hf skills add train-sentence-transformers --claude and ask Claude Code (or Codex / Cursor / Gemini CLI) to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data.

I wrote a blog post walking through usage, results across six embedder pairings, the speed story, and the complete training script. Check it out, or just point your Agent to the URL:

https://huggingface.co/blog/ettin-reranker

Collection: https://huggingface.co/collections/cross-encoder/ettin-rerankers

tomaarsen

posted an update 7 days ago

Post

331

🤖 I've just published Sentence Transformers v5.5.0, headlined by a new train-sentence-transformers Agent Skill that lets your AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, ...) train and finetune embedding, reranker, and sparse encoder models for you. Plus training losses & fixes. Details:

The skill bundles curated guidance for the whole training workflow across all three model types: base model selection, loss and evaluator choice, hard-negative mining, distillation, LoRA, Matryoshka, multilingual training, static embeddings, etc. It also ships production-ready training template scripts the agent can adapt. Install it with hf skills add train-sentence-transformers, then just describe what you want, e.g. "finetune a reranker on my (question, answer) pairs, mine hard negatives, and push it to the Hub".

On the loss side: EmbedDistillLoss is a new embedding-level distillation loss for SentenceTransformer. Instead of distilling teacher scores like MarginMSELoss, it aligns the student's embeddings directly with pre-computed teacher embeddings, wtih an optional learnable projection for when the student and teacher dimensions differ. Second, ADRMSELoss is a new listwise learning-to-rank loss for CrossEncoder from the Rank-DistilLLM paper, aimed at the LLM-distillation reranking setting.

encode() and predict() also gained a per-call processing_kwargs override, so you can change processor settings like max_length, a vision-language model's image resolution, or a video's fps, for a single call without rebuilding the model.

The Agent Skill is the part of this release I'm most keen for people to try. Curious to hear how it works for you. I've been using it myself a lot to quickly set up some training runs that immediately use a bunch of best practices.

> pip install sentence-transformers==5.5.0
> hf skills add train-sentence-transformers

The full release notes: https://github.com/huggingface/sentence-transformers/releases/tag/v5.5.0

3 replies

tomaarsen

posted an update about 1 month ago

Post

1002

🌐 I've just published Sentence Transformers v5.4 to make the project fully multimodal for embeddings and reranking. The release also includes a modular CrossEncoder, and automatic Flash Attention 2 input flattening. Details:

You can now use SentenceTransformer and CrossEncoder with text, images, audio, and video, with the same familiar API. That means you can compute embeddings for an image and a text query using model.encode(), compare them with model.similarity(), and it just works. Models like Qwen3-VL-Embedding-2B and jinaai/jina-reranker-m0 are supported out of the box.

Beyond multimodal, I also fully modularized the CrossEncoder class. It's now a torch.nn.Sequential of composable modules, just like SentenceTransformer has been. This unlocked support for generative rerankers (CausalLM-based models like mxbai-rerank-v2 and the Qwen3 rerankers) via a new LogitScore module, which wasn't possible before without custom code.

Also, Flash Attention 2 now automatically skips padding for text-only inputs. If your batch has a mix of short and long texts, this gives you a nice speedup and lower VRAM usage for free.

I wrote a blog post walking through the multimodal features with practical examples. Check it out if you want to get started, or just point your Agent to the URL: https://huggingface.co/blog/multimodal-sentence-transformers

This release has set up the groundwork for more easily introducing late-interaction models (both text-only and multimodal) into Sentence Transformers in the next major release. I'm looking forward to it!

tomaarsen

updated 2 models about 2 months ago

cross-encoder-testing/mxbai-rerank-base-v2-STv6

Text Ranking • 0.5B • Updated Mar 25 • 3

cross-encoder-testing/Qwen3-Reranker-0.6B-STv6

Text Ranking • 0.6B • Updated Mar 25 • 6 • 1

tomaarsen

in cross-encoder-testing/Qwen3-Reranker-0.6B-STv6 about 2 months ago

Update chat_template.jinja

#1 opened 5 months ago by

noooop9527

tomaarsen

updated a model 3 months ago

cross-encoder-testing/reranker-bert-tiny-gooaq-bce-v6

Text Ranking • 4.39M • Updated Mar 4 • 5.9k

tomaarsen

updated a model 5 months ago

cross-encoder-testing/ctxl-rerank-v2-instruct-multilingual-1b-STv6

Text Ranking • 1B • Updated Dec 22, 2025 • 3

tomaarsen

published a model 5 months ago

cross-encoder-testing/ctxl-rerank-v2-instruct-multilingual-1b-STv6

Text Ranking • 1B • Updated Dec 22, 2025 • 3

tomaarsen

updated 2 models 5 months ago

cross-encoder-testing/Qwen3-Reranker-0.6B-seq-cls-STv6

Text Ranking • 0.6B • Updated Dec 19, 2025 • 4 • 2

cross-encoder-testing/mxbai-rerank-large-v2-STv6

Text Ranking • 2B • Updated Dec 19, 2025 • 6

tomaarsen

published 4 models 5 months ago

posted an update 5 months ago

Post

4456

🐦‍🔥 I've just published Sentence Transformers v5.2.0! It introduces multi-processing for CrossEncoder (rerankers), multilingual NanoBEIR evaluators, similarity score outputs in mine_hard_negatives, Transformers v5 support and more. Details:

- CrossEncoder multi-processing: Similar to SentenceTransformer and SparseEncoder, you can now use multi-processing with CrossEncoder rerankers. Useful for multi-GPU and CPU settings, and simple to configure: just device=["cuda:0", "cuda:1"] or device=["cpu"]*4 on the model.predict or model.rank calls.

- Multilingual NanoBEIR Support: You can now use community translations of the tiny NanoBEIR retrieval benchmark instead of only the English one, by passing dataset_id, e.g. dataset_id="lightonai/NanoBEIR-de" for the German benchmark.

- Similarity scores in Hard Negatives Mining: When mining for hard negatives to create a strong training dataset, you can now pass output_scores=True to get similarity scores returned. This can be useful for some distillation losses!

- Transformers v5: This release works with both Transformers v4 and the upcoming v5. In the future, Sentence Transformers will only work with Transformers v5, but not yet!

- Python 3.9 deprecation: Now that Python 3.9 has lost security support, Sentence Transformers no longer supports it.

Check out the full changelog for more details: https://github.com/huggingface/sentence-transformers/releases/tag/v5.2.0

I'm quite excited about what's coming. There's a huge draft PR with a notable refactor in the works that should bring some exciting support. Specifically, better multimodality, rerankers, and perhaps some late interaction in the future!

tomaarsen

published a model 6 months ago

cross-encoder-testing/reranker-bert-tiny-gooaq-bce-v6

Text Ranking • 4.39M • Updated Mar 4 • 5.9k

tomaarsen

posted an update 7 months ago

Post

4626

🤗 Sentence Transformers is joining Hugging Face! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face! Details:

Today, the Ubiquitous Knowledge Processing (UKP) Lab is transferring the project to Hugging Face. Sentence Transformers will remain a community-driven, open-source project, with the same open-source license (Apache 2.0) as before. Contributions from researchers, developers, and enthusiasts are welcome and encouraged. The project will continue to prioritize transparency, collaboration, and broad accessibility.

Read our full announcement for more details and quotes from UKP and Hugging Face leadership: https://huggingface.co/blog/sentence-transformers-joins-hf

We see an increasing wish from companies to move from large LLM APIs to local models for better control and privacy, reflected in the library's growth: in just the last 30 days, Sentence Transformer models have been downloaded >270 million times, second only to transformers.

I would like to thank the UKP Lab, and especially Nils Reimers and Iryna Gurevych, both for their dedication to the project and for their trust in myself, both now and two years ago. Back then, neither of you knew me well, yet you trusted me to take the project to new heights. That choice ended up being very valuable for the embedding & Information Retrieval community, and I think this choice of granting Hugging Face stewardship will be similarly successful.

I'm very excited about the future of the project, and for the world of embeddings and retrieval at large!

1 reply

tomaarsen

posted an update 8 months ago

Post

5875

ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Model details:
- 2 model sizes:
- jhu-clsp/mmBERT-small
- jhu-clsp/mmBERT-base
- Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
- Maximum sequence length of 8192 tokens, on the high end for encoders
- Trained on 1833 languages using DCLM, FineWeb2, and many more sources
- 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
- Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
- In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
- Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

Check out the full blogpost with more details. It's super dense & gets straight to the point: https://huggingface.co/blog/mmbert

Based on these results, mmBERT should be the new go-to multilingual encoder base models at 300M and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.

tomaarsen

posted an update 10 months ago

Post

4550

😎 I just published Sentence Transformers v5.1.0, and it's a big one. 2x-3x speedups of SparseEncoder models via ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more:

1️⃣ Faster ONNX and OpenVINO backends for SparseEncoder models
Usage is as simple as backend="onnx" or backend="openvino" when initializing a SparseEncoder to get started, but I also included utility functions for optimization, dynamic quantization, and static quantization, plus benchmarks.

2️⃣ New n-tuple-scores output format from mine_hard_negatives
This new output format is immediately compatible with the MarginMSELoss and SparseMarginMSELoss for training SentenceTransformer, CrossEncoder, and SparseEncoder losses.

3️⃣ Gathering across devices
When doing multi-GPU training using a loss that has in-batch negatives (e.g. MultipleNegativesRankingLoss), you can now use gather_across_devices=True to load in-batch negatives from the other devices too! Essentially a free lunch, pretty big impact potential in my evals.

4️⃣ Trackio support
If you also upgrade transformers, and you install trackio with pip install trackio, then your experiments will also automatically be tracked locally with trackio. Just open up localhost and have a look at your losses/evals, no logins, no metric uploading.

5️⃣ MTEB Documentation
We've added some documentation on evaluating SentenceTransformer models properly with MTEB. It's rudimentary as the documentation on the MTEB side is already great, but it should get you started.

Plus many more smaller features & fixes (crash fixes, compatibility with datasets v4, FIPS compatibility, etc.).

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v5.1.0

Big thanks to all of the contributors for helping with the release, many of the features from this release were proposed by others. I have a big list of future potential features that I'd love to add, but I'm

AI & ML interests

Recent Activity

Team members 1

cross-encoder-testing's activity

Update chat_template.jinja