13:41 - In the BM25 formula, it should be f(q, D) in the denominator of the first term. This produces the so-called "saturation" effect (easiest to see with b = 0): once a term appears frequently enough, additional occurrences stop boosting the score significantly. Frequent terms stabilize at a high weight while infrequent terms still receive a smaller weight.
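A minimal sketch of that saturation behavior (hypothetical parameter values; with b = 0 the term-frequency factor reduces to f·(k1+1)/(f+k1), which is bounded above by k1+1):

```python
# Term-frequency saturation in BM25: with b = 0, a term's score
# contribution is f * (k1 + 1) / (f + k1). As the frequency f grows,
# this approaches k1 + 1, so extra occurrences add less and less.
def saturation(f: float, k1: float = 1.2) -> float:
    return f * (k1 + 1) / (f + k1)

gains = [saturation(f) for f in (1, 2, 5, 10, 100)]
# Contribution keeps rising, but stays below the k1 + 1 ceiling.
assert all(b > a for a, b in zip(gains, gains[1:]))
assert gains[-1] < 1.2 + 1
```

Each step up in frequency adds less than the previous one, which is exactly the "stabilize at a high weight" behavior described above.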
@mayursanmugam4050 1 year ago
Found this after stumbling around for a good overview of BM25 & SBERT. This is a fantastic initial introduction - enough detail and introduces the right concepts that people can double down on for further learning. Thank you James!
@LuisRomaUSA 3 years ago
Not many views yet, but please don't stop making content. This is the best video I have found in a week of searching.
@jamesbriggs 3 years ago
haha happy to hear, I've committed to making videos so I'll be here for a long time 😅 check out the similarity search playlist if you're interested in these things, just finished it!
@mcnubn 1 year ago
Really helped clear up BM25 for me! Huge thank you for sharing this!
@bujin1977 2 years ago
Thanks, this was very helpful! I've recently started using SQL Server's full text search capabilities to drive course searches on our college website, but it was all a bit of a "black box" thing. No idea how it worked, I just trusted that it *did* work! Until I got a query from someone who wanted to know how to alter their search results to change the order that we display them on the website. I'm no stranger to complicated mathematical formulae, but I took one look at the BM25 formula on wikipedia and cried! Your explanation made it so much easier to understand what was going on. Now comes the hard part. Explaining how the staff member in question can alter their data to boost their results... 😬
@MehdiMirzapour 11 months ago
Great work! You are a great teacher! Although I already knew these concepts, I enjoyed watching it a lot.
@lukekim4760 3 years ago
I am into document similarity ranking and I love your videos! Thank you so much :)
@jamesbriggs 3 years ago
Great to hear! I made a full (and free) course on semantic search if you're interested :) www.pinecone.io/learn/nlp
@tomwalczak4992 3 years ago
Really good, simple explanations. Also really liked your Udemy course.
@jamesbriggs 3 years ago
hey Thomas, yes I remember you left a review on the course? Great to see you here too and thanks!
@tomwalczak4992 3 years ago
@@jamesbriggs Yup ;) I'm getting into NLP so your videos have been super useful. Just finished my first project that uses both sparse and dense embeddings: share.streamlit.io/tomwalczak/pubmed-abstract-analyzer And as you say in the video, dense embeddings and complex models don't always work better, at least not out-of-the-box. Looking forward to more vids :)
@jamesbriggs 3 years ago
@@tomwalczak4992 That's a very cool project, first one too? I'm impressed! Awesome to see you're getting into it though, looking forward to seeing you around!
@23232323rdurian 1 year ago
28:50....both B and G SHARE this phrase and its several words, so THAT's why they share a high similarity score...
@AjayShivranBCSE 2 years ago
Great work man!
@peshero8866 10 days ago
In the implementation of BM25, why is f(t, D) = freq in the denominator? Wouldn't it be len(sentence)?
@li-pingho1441 1 year ago
extremely simple explanation!!!!!!!!
@arpitqw1 1 day ago
BM25 would generate similarity scores directly, but the other two are embedding techniques, so cosine similarity or dot product would be required to get search results.
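The distinction in the comment above can be sketched with a toy cosine-similarity function (the vectors here are purely illustrative, not real model embeddings):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d embeddings for a query and two documents.
query = [0.9, 0.1, 0.3]
doc_a = [0.8, 0.2, 0.4]  # points in a similar direction to the query
doc_b = [0.1, 0.9, 0.2]  # points in a different direction
# The embedding closer in direction to the query scores higher.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

BM25, by contrast, computes its relevance score straight from term statistics, so no separate vector-comparison step is needed.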
@parth191079 9 months ago
This is super helpful! Thank you for this video.
@UnpluggedPerformance 3 years ago
that Bert outcome is certainly cool!!!!! you made my day man!! awesome! how can we support you? (besides likes etc.)
@jamesbriggs 3 years ago
comments like this! Really happy it helped :)
@yonahcitron226 1 year ago
great explanations! thanks!
@ruimelo1039 2 years ago
I'm doing a uni project on this subject and your explanation was on point! Thank you
@leonardvanduuren8708 2 years ago
Masterful !! Thx for this and all your other stuff !!
@jamesbriggs 2 years ago
Glad you're enjoying them!
@Data_scientist_t3rmi 2 years ago
Excellent video thank you!
@23232323rdurian 1 year ago
point taken and understood about eg..the similar meaning, but using different words.....however in practice just a straight-forward word-to-word with frequency stats works pretty good because: words have usage frequencies, so anybody MEANING to say is gonna say not .....like 100-to-1 odds.....and ? well that's extremely rare..... is gonna be 100s of times more frequent in this context than ..... ==then further ACROSS languages (eg English, Japanese) the word frequencies dont necessarily translate....sometimes frequent English words are infrequent in Japanese and vice versa....
@asedaaddai-deseh8152 3 years ago
Great explanation!
@li-pingho1441 1 year ago
thank you so muchhhhhhh
@qwerty8669 2 years ago
Thanks this was helpful
@abhishekrathi6253 3 years ago
Nice explanation
@szymonskorupinski5237 3 years ago
Great work!
@wenzeloong 3 years ago
It's a great video!! I need your opinion, sir James. In this video you use cosine similarity to calculate the distance. What do you think about combining these methods with ANN (approximate nearest neighbor) using angular distance? Is that better than using cosine similarity?
@jamesbriggs 3 years ago
Hey Iven, thanks! I think you should absolutely use ANN - definitely if you have lots of vectors. As for cosine similarity vs angular similarity, angular similarity can distinguish better between already very similar vectors, but I'm not sure if it is too important in most use-cases. Most applications from pretty smart people tend to use cosine similarity, so that is (for me) evidence that cosine similarity is 'good enough' If you're interested in ANN and more of this, I have a big playlist on it here kzbin.info/aero/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc Hope it helps :)
@wenzeloong 3 years ago
@@jamesbriggs Thank you for your opinion and the playlist is quite amazing..! It helps me a lot.. thank you !
@smcgpra3874 1 year ago
Can we classify tabular data where each row is one dataset?
@pfinardii 3 years ago
Hi James, fantastic video!!! A question: when using BERT to extract dense representations from the hidden_state or last_hidden_state layers, we perform masked_embeddings = embeddings * mask (where mask is the attention_mask BERT output) to put a 0 value in the padding tokens. Maybe we also need to consider the special tokens [CLS] and [SEP]? The attention mask for these special tokens is 1, so when using some hidden layer from BERT we'd need to perform a slice: masked_embeddings = masked_embeddings[:, 1:sep_token_pos], where sep_token_pos is the [SEP] position in the sequence: [[CLS], tokens of the sequence, [SEP], [PAD], [PAD], ...]
@jamesbriggs 3 years ago
hey Paulo, good question. I believe the other sentence transformer models that build these embeddings keep both, but I have never seen them explicitly state that they do (or why) in any papers, so I can't say for sure, sorry! Nonetheless, my understanding is that the CLS and SEP tokens are included within the embeddings as they still contain useful information about the input data. The CLS token itself can actually be used in building sentence embeddings (although it is ineffective compared to mean pooling afaik). The significance of that is that the CLS token contains enough information to be (somewhat) effectively used as a single vector representation of the whole sequence; it therefore carries useful information that would be lost if removed. As for the SEP token, I don't believe it is as important as the CLS, but I can't say I know how relevant it is. I'd be curious to see a comparison of embedding performance with/without the CLS/SEP tokens though. I'm sure it has been tested but I've never seen something like that mentioned
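The masked mean-pooling operation under discussion can be sketched in plain Python (toy 2-d vectors; a real implementation would operate on torch tensors taken from the model output and the tokenizer's attention_mask):

```python
# Mean pooling over token embeddings, ignoring padding positions:
# multiply each token vector by its attention-mask value (1 = real
# token, 0 = padding), sum, then divide by the count of real tokens.
def mean_pool(token_embeddings, attention_mask):
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = sum(attention_mask)
    for vec, m in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vec[i] * m
    return [s / count for s in summed]

# Toy embeddings: two real tokens and one padding token.
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the padding token is excluded from the average
assert mean_pool(embeddings, mask) == [2.0, 3.0]
```

Dropping [CLS]/[SEP] from the pool, as Paulo asks about, would just mean zeroing their mask positions too before averaging.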
@pfinardii 3 years ago
@@jamesbriggs Hi James, I did a test with MNR loss. During the tokenization process I set the tokenizer parameter add_special_tokens=False and got 0.83 against 0.81 with the default value (True). I still need to test with only the [SEP] token removed to make the results more robust. Thanks for the reply :)
@jamesbriggs 3 years ago
@@pfinardii oh so it's better? Wow I'll have to try it too - that's awesome :)
@UnpluggedPerformance 3 years ago
bro super good explanations
@Krobongo 2 years ago
I'm a bit confused. Is SBERT just the embedding layer, which is fed to a ML Model, or is it also the model itself to do e.g. text classification?
@venkateshkulkarni2227 3 years ago
I think the bert-base-nli-tokens are deprecated now according to the Hugging Face website. Which Sentence Transformer model should we now use for SBERT?
@jamesbriggs 3 years ago
I like mpnet models the most for generic sentence vectors huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base I cover a load of models, training methods, etc in this playlist: kzbin.info/aero/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO Hope it helps :)
@peterthomas7523 2 years ago
Excellent video as always :) kinda makes me wonder why I bother spending my grant money on training courses when your whole channel is simply better. I had a question about using S-BERT for similarities between documents, rather than sentences within a document. Could I just average the embedding for the sentences within each document and calculate cosine similarity between these? Or is there a better way? Thanks!
@jamesbriggs 2 years ago
you can do this but it's not that effective; another option would be to compare all paragraphs and take an average score, or create some sort of threshold like "if 5 paragraphs similarity > 0.8" etc. It's hard to do. I have a free 'course' on semantic search here, hopefully you can save some more of your grant money: www.pinecone.io/learn/nlp/
@peterthomas7523 2 years ago
@@jamesbriggs Thanks a lot :) I've been working through your pinecone course and am really liking it so far!
@AlexGuemez 2 years ago
Is there a way to "reverse" TFIDF to see if Google uses it in its algorithm?
@jamesbriggs 2 years ago
I've not heard of a way but it could be possible - Google's algorithms use a lot of different things though (BERT included), so I'm not sure if it would be possible to identify specific parts of it like TF-IDF
@wilfredomartel7781 2 years ago
How can I train SBERT on a specific domain?
@jamesbriggs 2 years ago
hey I have a few articles+videos on this, what does your training data look like? If you have sentence pairs + scores you can use MSE loss which I cover at the end of: www.pinecone.io/learn/gpl/ If you don't have training data and just text data you can use unsupervised methods like GPL (above), GenQ, or TSDAE (all found here): www.pinecone.io/learn/nlp/ If you have sentence pairs *without* labels you can use softmax or preferably MNR loss: www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/
@鹏程李-w8z 3 years ago
Would love some scripts or subtitles for your video, thank you!
@edgar23vargas53 3 years ago
Hey is there any way we can get in contact with you?
@jamesbriggs 3 years ago
Yes on the 'About' page of my YT channel you'll be able to find my email
@edgar23vargas53 3 years ago
@@jamesbriggs DMed you on Instagram
@edgar23vargas53 3 years ago
@@jamesbriggs DMed you
@jamesbriggs 3 years ago
@@edgar23vargas53 got it
@edgar23vargas53 3 years ago
@@jamesbriggs shot you an email
@gorgolyt 3 years ago
You need to tighten up your math notation. Writing f(t, D) for the "total number of terms in the document" is really confusing and makes no sense. What is t in this function? You either need to sum over all t in D, which you haven't written, or you should just get rid of the t and use some function g(D) to denote the total number of terms in the document. When you get onto BM25 it's even worse, I'm not sure your explanation of your notation is even correct. It should be f(q, D) on the denominator, the same q that is in the numerator, not f(t, D), whatever that means.
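For reference, the standard BM25 scoring function with consistent notation (f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in words, and avgdl is the average document length in the collection):

```latex
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here the same f(q_i, D) appears in numerator and denominator, which is the point this comment raises: the term frequency drives both the boost and its saturation.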