Consider the simple example below.
I take three simple sentences, compute their embeddings, and compute the pairwise cosine similarity between them.
I am puzzled by the results. I would have expected "this is a cat" to be very similar to "this is a dog", and both "this is a cat" and "this is a dog" to be dissimilar to "this is a banana".
However, taken at face value, "this is a banana" is more similar to "this is a dog" than the two animal sentences are to each other…

Is this expected?! What do you think?
from transformers import pipeline
from numpy import dot
from numpy.linalg import norm

def mycos(x, y):
    # cosine similarity between two vectors
    return dot(x, y) / (norm(x) * norm(y))

# the feature-extraction pipeline returns one vector per token;
# [0][0] keeps only the embedding of the first token ([CLS])
mypipe = pipeline('feature-extraction', 'distilbert-base-uncased-finetuned-sst-2-english')
one = mypipe('this is a cat')[0][0]
two = mypipe('this is a dog')[0][0]
three = mypipe('this is a banana')[0][0]
mycos(one,two)
Out[55]: 0.5795413454711928
mycos(one,three)
Out[56]: 0.19475422728604236
mycos(two,three)
Out[57]: 0.5881860164213862
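To rule out a bug in the similarity function itself, here is a quick sanity check of `mycos` on toy vectors (made-up vectors, not model embeddings) — it behaves as expected, so the surprising numbers come from the embeddings, not the math:

```python
import numpy as np

def mycos(x, y):
    # same cosine similarity as above
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# toy vectors, not model embeddings
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])

print(mycos(a, b))  # 0.0  (orthogonal)
print(mycos(a, c))  # ~0.7071  (45 degrees apart)
print(mycos(a, a))  # 1.0  (identical)
```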
Thanks!