Sentence Similarity With Sentence-Transformers in Python

31,548 views

James Briggs

3 years ago

🎁 Free NLP for Semantic Search Course:
www.pinecone.io/learn/nlp
Hard mode: • Sentence Similarity Wi...
All we ever seem to talk about nowadays is BERT this, BERT that. I want to talk about something else, but BERT is just too good - so this video will be about BERT for sentence similarity.
A big part of NLP relies on similarity in high-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text, then perform several transformations on it.
It's high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful high-dimensional magic can be.
The logic is this:
- Take a sentence, convert it into a vector.
- Take many other sentences, and convert them into vectors.
- Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them - more on that here.
- We now have a measure of semantic similarity between sentences - easy!
At a high level, there's not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too.
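As a minimal sketch of those steps in Python (using the sentence-transformers library; the example sentences are made up, and the model name is the one used in the video, per the comments below):
```
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# load a pretrained sentence embedding model (see the comments below
# for newer alternatives like all-mpnet-base-v2)
model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = [
    "the cat sat on the mat",          # placeholder examples
    "a feline rested on the rug",
    "stock markets fell sharply today",
]

# steps 1 & 2: convert each sentence into a dense vector
embeddings = model.encode(sentences)

# step 3: cosine similarity between the first sentence and the rest
scores = cosine_similarity([embeddings[0]], embeddings[1:])
print(scores)  # the paraphrase should score far higher than the unrelated sentence
```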
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
Medium article:
towardsdatascience.com/bert-f...
🎉 Sign-up For New Articles Every Week on Medium!
/ membership
📖 If membership is too expensive - here's a free link:
towardsdatascience.com/bert-f...
👾 Discord
/ discord
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?YouTu...

Comments: 53
@eugenesheely5288 2 years ago
Lots of help, man. Left a like and subscribed - great job!
@theDrewDag 3 years ago
Subbed after this video. Will keep on checking your content regularly, James. Keep it up!
@jamesbriggs 3 years ago
That's awesome, thanks Andrea!
@lfmtube 2 years ago
Thank you! Super clear. A new subscriber to your channel!
@lakshminarasimhansrinivasa4523 2 years ago
Thank you so much for a simple tutorial!
@wolfjos1995 2 years ago
You are the best! Thanks for the tutorial!
@jasondubon 1 year ago
This is great!!
@jacopoattolini2085 3 years ago
Exactly what I was looking for in a clear and quick video. You gained a subscriber
@jamesbriggs 3 years ago
Awesome to have you here!
@jacopoattolini2085 3 years ago
@@jamesbriggs I have just implemented your code. Works like a charm! One quick question: do you suggest cleaning the sentences with tokenizers/lemmatizers and other NLP techniques before passing them to model.encode(), or leaving them as they are?
@jamesbriggs 3 years ago
@@jacopoattolini2085 for transformers you'd typically want to keep full words, so I wouldn't use lemmatization/stemming or stopword removal - tokenization in some cases yes; for URLs, for example, it can be a good idea. Also, depending on your data source (social media for sure), it can be useful to add unicode normalization, where you'd want to use NFKC in most cases
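(A minimal sketch of the NFKC normalization mentioned above, using Python's standard library; the example string is hypothetical:)
```
import unicodedata

text = "ﬁve ｗｉｄｅ letters™"  # contains a ligature, fullwidth chars, and a compatibility symbol
# NFKC folds compatibility characters into their canonical forms
clean = unicodedata.normalize("NFKC", text)
print(clean)  # five wide lettersTM
```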
@jacopoattolini2085 3 years ago
@@jamesbriggs thanks! Will try to experiment a bit. I am working with job descriptions data so maybe it is better to use the full sentence without transformations
@doyourealise 2 years ago
subscribed, and thanks for this tutorial :)
@imykadolkar4243 2 years ago
Wonderful explanation. I had a couple of questions: 1. How is this model different from the deep learning Siamese model? Or is it the same? 2. Do you have any video explaining the internal or theoretical workings of this model? Thanks once again
@SS-cz2de 3 years ago
This was very informative. Thanks a lot!
@jamesbriggs 3 years ago
Welcome! Thanks for watching!
@chiweilin6021 1 year ago
Thanks for your tutorial. I wonder about using pooler_output as the embedding instead of mean pooling the hidden states of each word. Is pooler_output more suited to downstream tasks, so we shouldn't use it as a sentence representation?
@jonjino 3 years ago
Hi James, good video. I've been trying to get semantic similarity on more abstract concepts, e.g. between "number" and "integer", or "vector" and "list". I attempted this with a custom word2vec vocabulary and pre-trained BERT, but it doesn't produce great results, with other words like "string" appearing closer to "integer" under cosine similarity. Is there a specific approach you would use for fine-tuning on a problem like this?
@jamesbriggs 3 years ago
It might be better to try the BERT token ID embeddings rather than word2vec embeddings - might be more accurate :)
@youwang9156 1 year ago
Just wondering, compared to the OpenAI embedding API, which is better? Thank you for your video
@nathenmiranda2822 1 year ago
For anyone getting stuck, switch out 'bert-base-nli-mean-tokens' for 'sentence-transformers/all-mpnet-base-v2'. Hugging Face says the model in this video gives poor quality and redirects you to better/newer models. The model you'll switch to has the best average performance as of now. I'm about to try sending this to a database so I have a column of the 'sentences' and a column with the 'scores', so I can sort by score. Any help sending the final array to a df would be much appreciated!
@jamesbriggs 1 year ago
Hey Nathen, yes the original SBERT models are pretty outdated - the other model you suggested is much better (and in general, MPNet models make great sentence transformers), thanks for sharing. For your df problem, you should be able to convert the arrays to lists, so assuming you have a sentence list in `sentences` and a score array in `scores`, you can do something like:
```
df = pd.DataFrame({
    "sentences": sentences,
    "scores": scores.tolist()
})
```
You may also need to flatten the scores array, so you'd change the above to `scores[0].tolist()`. Hope that helps :)
@huntercoleman1347 6 months ago
Does this tutorial start in the middle of another tutorial? The third step - model = SentenceTransformer(model_name) - does not work. Are there things we are supposed to download first?
@giveupforu 3 years ago
Hey James, nice tutorial! Do you know the advantage of using sentence BERT over the average embeddings of all words in a sentence using word2vec?
@jamesbriggs 3 years ago
They're generally much more expressive thanks to the bidirectional, multi-head attention mechanisms inside the BERT encoder layers - so we would generally expect sentence BERT to outperform word2vec
@vaishnavip2327 1 year ago
How can we save this model as joblib, so we can use it for deployment?
@annadinardo4932 3 years ago
Thank you so much for the awesome explanation! Do you think this method could be applied also when working with whole paragraphs of texts, and not just single sentences? Or is this library not suited for comparing longer texts? Thank you!
@jamesbriggs 3 years ago
It depends on the length of your text. It's not necessarily restricted to sentences, but is limited (with the model we use here) to 128 tokens, a token being a word/word-piece (sometimes a single word is split into 2+ tokens). So with 128 tokens you have a fair bit of flexibility on length :)
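(For reference, a minimal sketch of how to check or change that limit in sentence-transformers, assuming the model from the video:)
```
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

# tokens beyond this limit are truncated before encoding
print(model.max_seq_length)  # 128 for this model

# can be raised, up to the underlying transformer's limit (512 for BERT)
model.max_seq_length = 256
```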
@annadinardo4932 3 years ago
@@jamesbriggs Thank you so much for your answer!
@fernandoandutta 2 years ago
Hi James, I wonder if you could answer a very simple question. If I use "model.encode(sentences)", is there a way to make it faster? By default, does encode apply max_length=128, or is that value 512, the traditional value applied in BERT? If it's the latter, can you adjust it down to the smaller value? In your other video, it is very clear how to get the mean after considering max_length=128 (at 6:30 minutes at kzbin.info/www/bejne/oIezlWqietudqsk). However, is it possible to adjust this value for model.encode if by default a value higher than 128 was applied before averaging? Thanks a lot in advance. Sincerely, F.
@chihuenfong1971 2 years ago
Where can I download your ipynb file?
@nithinpnair2116 2 years ago
How can we train and deploy this sentence similarity model in SageMaker? My ultimate aim is to deploy this model as a REST API, so that I can use it from a different application. If you have already made any videos on this, please share the link.
@jamesbriggs 2 years ago
Nothing specific to SageMaker unfortunately, but I do have an entire (free) course on sentence similarity models here: www.pinecone.io/learn/nlp/ There are many chapters on different approaches to training, and videos are embedded within each chapter - I hope that helps :)
@vinaykakanuru 2 years ago
Why are we going with BERT transformers when we can do the same thing using TfidfVectorizer? Can you please make a video on the pros and cons of these two approaches? If already posted, please share the link as a reply. Thank you.
@jamesbriggs 2 years ago
Here's a video comparing TF-IDF, BM25, and BERT - kzbin.info/www/bejne/sJrMd2Sbe7JmlZY And another on traditional similarity metrics too (Jaccard, w-shingling, and Levenshtein) - kzbin.info/www/bejne/d4qZY61tfdeanrs I'm currently working on a big series covering similarity search in depth, so there will be plenty more content on this topic over the next couple of months :)
@harryrichard2154 3 years ago
Hey James, thanks for the tutorial. How do we fine-tune the BERT model for sentence similarity? Thank you once again for the tutorial.
@jamesbriggs 3 years ago
I believe that if you fine-tune using generic language comprehension methods like MLM and NSP, it should enhance the similarity vectors that BERT produces too. I haven't done this though, so I can't say for sure, but it's something I'm pretty interested in trying and I expect I'll work on it soon - when I do, there'll be a video :)
@harryrichard2154 3 years ago
Thank you for your response. I am kind of confused right now: in the other video where you built it from scratch, if I do that with custom data, will there be any changes in my results?
@jamesbriggs 3 years ago
@@harryrichard2154 in the MLM/NSP videos, I fine-tune BERT, so I take the existing BERT model and train it some more on custom data. This essentially fine-tunes the weights inside the BERT network to be optimized for the custom dataset (e.g. to better understand the custom style of language). So yes, you would get different results, as for sentence similarity it is those internal weights that we are extracting :) Hope that helps
@shahi_gautam 2 years ago
Instead of two arrays, how can we do this with two dataframes (df1 and df2), taking one cell from a column of df1 and matching it against all cells of a column of df2, and so on?
@jamesbriggs 2 years ago
It would probably be best to loop through your rows in df1, pull out the value in each df1 cell, then compare that against the full column of df2. You will want to extract both as arrays though - I don't believe sentence-transformers deals directly with df objects
@jamesbriggs 2 years ago
But at the same time, if you're dealing with a lot of data, I'd recommend something more efficient than this implementation of cosine similarity - Faiss would be a good option, for example; much faster. I'm working on a Faiss series at the moment; you can find the first video (which is all you need to get started) on my channel page :)
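(A minimal sketch of the loop approach James describes above; the column name "text" and the example rows are placeholders:)
```
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")

df1 = pd.DataFrame({"text": ["first query sentence"]})
df2 = pd.DataFrame({"text": ["candidate one", "candidate two", "candidate three"]})

# encode the full df2 column once, up front
df2_vecs = model.encode(df2["text"].tolist())

for text in df1["text"]:
    vec = model.encode([text])                    # encode one df1 cell
    scores = cosine_similarity(vec, df2_vecs)[0]  # compare against every df2 cell
    print(text, scores)
```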
@shahi_gautam 2 years ago
@@jamesbriggs It didn't work; the precise problem is described at stackoverflow.com/questions/68624306/cosine-similarity-between-columns-of-two-different-dataframe/68626354?noredirect=1#comment121282908_68626354 Any opinion?
@eugenesheely5288 2 years ago
The results I'm getting are hit or miss. I'm inserting the string I want to analyze into the first index of the list and running it as your vid showed. I might have one on English ships and get the top result as something to do with the sea and a ship, but the string "you're fighting like cats and dogs" gives me an incoherent code (an autogenerated image name, I think), despite there being multiple sentences with fighting, cats, and dogs in them. Thoughts? It seems to fail more often than not.
@jamesbriggs 2 years ago
Hey Eugene, it can be a bit hit and miss, but overall the performance should be quite good. For the incoherent code, there's a possibility that by pure chance it was encoded to a similar vector space as your cats-and-dogs sentence. I assume you're doing all of this with a larger dataset? If so, I would recommend using something like Faiss, which handles all of the distance computations (and is much faster) - however, in terms of accuracy, this *should* only help if something weird is happening with your cosine similarity function. Tutorial here kzbin.info/www/bejne/qXzcp6aaettpqM0 Let me know if it helps!
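(A minimal Faiss sketch along the lines James suggests - inner product over L2-normalized vectors is equivalent to cosine similarity; the array here is placeholder data standing in for model.encode output on a ~7,000-sentence dataset:)
```
import faiss
import numpy as np

# stand-in for model.encode(sentences) on a ~7,000-sentence dataset
embeddings = np.random.rand(7000, 768).astype("float32")
faiss.normalize_L2(embeddings)  # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
index.add(embeddings)

query = embeddings[:1].copy()          # query with the first vector
scores, ids = index.search(query, 5)   # top-5 nearest neighbors
print(ids, scores)
```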
@eugenesheely5288 2 years ago
@@jamesbriggs data set is almost 7,000. Thanks for the tip I'll give it a try tonight.
@eugenesheely5288 2 years ago
I was using SequenceMatcher this afternoon for another project since it made more sense there (companies in two datasets spelled slightly differently that need matching). I applied it to my original project and it seems to work much better and faster, albeit not as sophisticated as SentenceTransformer. It works better if I strip out determiners and other useless words. "Fighting like cats and dogs" gives me back the string "raining cats and dogs". "Youth Rebel" gives me "Youth Reading", which is close but not ideal, though still usable. I'll try to find time for your Faiss vid tomorrow night and let you know. Oh, and I tried this on a dataset of almost 200,000 strings - it runs in about a couple of minutes.
@jamesbriggs 2 years ago
@@eugenesheely5288 yes, I've seen SequenceMatcher used - afaik it's calculating syntax similarity, rather than the 'semantic' similarity of sentence transformers. For your use-case of finding misspellings, I think your approach is ideal. And yes, I definitely wouldn't recommend using this code on large datasets, or even slightly not-small datasets - 100% go with Faiss for that. Let me know how it goes!
@eugenesheely5288 2 years ago
@@jamesbriggs syntax vs semantics is a very nice way to explain the differences. I'll keep you updated.
@fujadede7509 6 months ago
It is a helpful video, but can you send me these implementations?