13:41 - In the BM25 formula, it should be f(q, D) in the denominator of the first term. This produces the so-called "saturation" effect (easiest to see with b = 0): once a term appears frequently enough, additional occurrences stop boosting the score significantly. Frequent terms stabilize at a high weight while infrequent terms still receive a smaller weight.
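A minimal sketch of that saturation behavior (hypothetical parameter values; with b = 0 the term-frequency factor reduces to f·(k1+1)/(f+k1), which is bounded above by k1+1):

```python
# Term-frequency saturation in BM25: with b = 0, a term's score
# contribution is f * (k1 + 1) / (f + k1). As the frequency f grows,
# this approaches k1 + 1, so extra occurrences add less and less.
def saturation(f: float, k1: float = 1.2) -> float:
    return f * (k1 + 1) / (f + k1)

gains = [saturation(f) for f in (1, 2, 5, 10, 100)]
# Contribution keeps rising, but stays below the k1 + 1 ceiling.
assert all(b > a for a, b in zip(gains, gains[1:]))
assert gains[-1] < 1.2 + 1
```

Each step up in frequency adds less than the previous one, which is exactly the "stabilize at a high weight" behavior described above.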
@mayursanmugam4050 1 year ago
Found this after stumbling around for a good overview of BM25 & SBERT. This is a fantastic initial introduction - enough detail and introduces the right concepts that people can double down on for further learning. Thank you James!
@LuisRomaUSA 3 years ago
Not many views yet, but please don't stop making content. This is the best video I have found in a week of searching.
@jamesbriggs 3 years ago
haha happy to hear, I've committed to making videos so I'll be here for a long time 😅 check out the similarity search playlist if you're interested in these things, just finished it!
@mcnubn 1 year ago
Really helped clear up BM25 for me! Huge thank you for sharing this!
@bujin1977 2 years ago
Thanks, this was very helpful! I've recently started using SQL Server's full text search capabilities to drive course searches on our college website, but it was all a bit of a "black box" thing. No idea how it worked, I just trusted that it *did* work! Until I got a query from someone who wanted to know how to alter their search results to change the order that we display them on the website. I'm no stranger to complicated mathematical formulae, but I took one look at the BM25 formula on wikipedia and cried! Your explanation made it so much easier to understand what was going on. Now comes the hard part. Explaining how the staff member in question can alter their data to boost their results... 😬
@MehdiMirzapour 11 months ago
Great work! You are a great teacher! Although I already knew these concepts, I enjoyed watching it a lot.
@lukekim4760 3 years ago
I am into document similarity ranking and I love your videos! Thank you so much :)
@jamesbriggs 3 years ago
Great to hear! I made a full (and free) course on semantic search if you're interested :) www.pinecone.io/learn/nlp
@tomwalczak4992 3 years ago
Really good, simple explanations. Also really liked your Udemy course.
@jamesbriggs 3 years ago
hey Thomas, yes I remember you left a review on the course? Great to see you here too and thanks!
@tomwalczak4992 3 years ago
@@jamesbriggs Yup ;) I'm getting into NLP so your videos have been super useful. Just finished my first project that uses both sparse and dense embeddings: share.streamlit.io/tomwalczak/pubmed-abstract-analyzer And as you say in the video, dense embeddings and complex models don't always work better, at least not out-of-the-box. Looking forward to more vids :)
@jamesbriggs 3 years ago
@@tomwalczak4992 That's a very cool project, first one too? I'm impressed! Awesome to see you're getting into it though, looking forward to seeing you around!
@23232323rdurian 1 year ago
28:50....both B and G SHARE this phrase and its several words, so THAT's why they share a high similarity score...
@AjayShivranBCSE 2 years ago
Great work man!
@peshero8866 10 days ago
In the implementation of BM25, why is f(t, D) = freq in the denominator? Wouldn't it be len(sentence)?
@li-pingho1441 1 year ago
extremely simple explanation!!!!!!!!
@arpitqw1 1 day ago
BM25 would generate similarity scores directly, but the other two are embedding techniques, so cosine similarity or dot product would be required to get search results.
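The distinction in the comment above can be sketched with a toy cosine-similarity function (the vectors here are purely illustrative, not real model embeddings):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d embeddings for a query and two documents.
query = [0.9, 0.1, 0.3]
doc_a = [0.8, 0.2, 0.4]  # points in a similar direction to the query
doc_b = [0.1, 0.9, 0.2]  # points in a different direction
# The embedding closer in direction to the query scores higher.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

BM25, by contrast, computes its relevance score straight from term statistics, so no separate vector-comparison step is needed.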
@parth191079 9 months ago
This is super helpful! Thank you for this video.
@UnpluggedPerformance 3 years ago
that Bert outcome is certainly cool!!!!! you made my day man!! awesome! how can we support you? (besides likes etc.)
@jamesbriggs 3 years ago
comments like this! Really happy it helped :)
@yonahcitron226 1 year ago
great explanations! thanks!
@ruimelo1039 2 years ago
I'm doing a uni project on this subject and your explanation was on point! Thank you
@leonardvanduuren8708 2 years ago
Masterful !! Thx for this and all your other stuff !!
@jamesbriggs 2 years ago
Glad you're enjoying them!
@Data_scientist_t3rmi 2 years ago
Excellent video thank you!
@23232323rdurian 1 year ago
point taken and understood about eg..the similar meaning, but using different words.....however in practice just a straight-forward word-to-word with frequency stats works pretty good because: words have usage frequencies, so anybody MEANING to say is gonna say not .....like 100-to-1 odds.....and ? well that's extremely rare..... is gonna be 100s of times more frequent in this context than ..... ==then further ACROSS languages (eg English, Japanese) the word frequencies dont necessarily translate....sometimes frequent English words are infrequent in Japanese and vice versa....
@asedaaddai-deseh8152 3 years ago
Great explanation!
@li-pingho1441 1 year ago
thank you so muchhhhhhh
@qwerty8669 2 years ago
Thanks this was helpful
@abhishekrathi6253 3 years ago
Nice explanation
@szymonskorupinski5237 3 years ago
Great work!
@wenzeloong 3 years ago
It's a great video!! I need your opinion, sir James. In this video you use cosine similarity to calculate the distance. What do you think about combining these methods with ANN (approximate nearest neighbor) using angular distance? Is that better than using cosine similarity?
@jamesbriggs 3 years ago
Hey Iven, thanks! I think you should absolutely use ANN - definitely if you have lots of vectors. As for cosine similarity vs angular similarity, angular similarity can distinguish better between already very similar vectors, but I'm not sure if it is too important in most use-cases. Most applications from pretty smart people tend to use cosine similarity, so that is (for me) evidence that cosine similarity is 'good enough' If you're interested in ANN and more of this, I have a big playlist on it here kzbin.info/aero/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc Hope it helps :)
@wenzeloong 3 years ago
@@jamesbriggs Thank you for your opinion and the playlist is quite amazing..! It helps me a lot.. thank you !
@smcgpra3874 1 year ago
Can we classify tabular data where each row is one dataset?
@pfinardii 3 years ago
Hi James, fantastic video!!! A question: when using BERT to extract dense representations from the hidden_state or last_hidden_state layers, we perform masked_embeddings = embeddings * mask (where mask is the attention_mask BERT output) to put a 0 value in the padding tokens. Maybe we also need to consider the special tokens [CLS] and [SEP]? The attention mask for these special tokens is 1, so when using some hidden layer from BERT we'd need to perform a slice: masked_embeddings = masked_embeddings[:, 1:sep_token_pos], where sep_token_pos is the [SEP] position in the sequence: [[CLS], tokens of the sequence, [SEP], [PAD], [PAD], ...]
@jamesbriggs 3 years ago
hey Paulo, good question. I believe the other sentence transformer models that build these embeddings keep both, but I have never seen them explicitly state that they do (or why) in any papers, so I can't say for sure, sorry! Nonetheless, my understanding is that the CLS and SEP tokens are included within the embeddings as they still contain useful information about the input data. The CLS token itself can actually be used in building sentence embeddings (although it is ineffective compared to mean pooling afaik). The significance of that is that the CLS token contains enough information to be (somewhat) effectively used as a single vector representation of the whole sequence; it therefore carries useful information that would be lost if removed. As for the SEP token, I don't believe it is as important as the CLS, but I can't say I know how relevant it is. I'd be curious to see a comparison of embedding performance with/without the CLS/SEP tokens though. I'm sure it has been tested but I've never seen something like that mentioned
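The masked mean-pooling operation under discussion can be sketched in plain Python (toy 2-d vectors; a real implementation would operate on torch tensors taken from the model output and the tokenizer's attention_mask):

```python
# Mean pooling over token embeddings, ignoring padding positions:
# multiply each token vector by its attention-mask value (1 = real
# token, 0 = padding), sum, then divide by the count of real tokens.
def mean_pool(token_embeddings, attention_mask):
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = sum(attention_mask)
    for vec, m in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vec[i] * m
    return [s / count for s in summed]

# Toy embeddings: two real tokens and one padding token.
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the padding token is excluded from the average
assert mean_pool(embeddings, mask) == [2.0, 3.0]
```

Dropping [CLS]/[SEP] from the pool, as Paulo asks about, would just mean zeroing their mask positions too before averaging.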
@pfinardii 3 years ago
@@jamesbriggs Hi James, I did a test with MNR loss. During the tokenization process I set the tokenizer parameter add_special_tokens=False and got 0.83 against 0.81 with the default value (True). I still need to test with only the [SEP] token removed to make the results more robust. Thanks for the reply :)
@jamesbriggs 3 years ago
@@pfinardii oh so it's better? Wow I'll have to try it too - that's awesome :)
@UnpluggedPerformance 3 years ago
bro super good explanations
@Krobongo 2 years ago
I'm a bit confused. Is SBERT just the embedding layer, which is fed to a ML Model, or is it also the model itself to do e.g. text classification?
@venkateshkulkarni2227 3 years ago
I think the bert-base-nli-tokens are deprecated now according to the Hugging Face website. Which Sentence Transformer model should we now use for SBERT?
@jamesbriggs 3 years ago
I like mpnet models the most for generic sentence vectors huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base I cover a load of models, training methods, etc in this playlist: kzbin.info/aero/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO Hope it helps :)
@peterthomas7523 2 years ago
Excellent video as always :) kinda makes me wonder why I bother spending my grant money on training courses when your whole channel is simply better. I had a question about using S-BERT for similarities between documents, rather than sentences within a document. Could I just average the embedding for the sentences within each document and calculate cosine similarity between these? Or is there a better way? Thanks!
@jamesbriggs 2 years ago
you can do this but it's not that effective; another option would be to compare all paragraphs and take an average score, or create some sort of threshold like "if 5 paragraphs similarity > 0.8" etc. It's hard to do. I have a free 'course' on semantic search here, hopefully you can save some more of your grant money: www.pinecone.io/learn/nlp/
@peterthomas7523 2 years ago
@@jamesbriggs Thanks a lot :) I've been working through your pinecone course and am really liking it so far!
@AlexGuemez 2 years ago
Is there a way to "reverse" TFIDF to see if Google uses it in its algorithm?
@jamesbriggs 2 years ago
I've not heard of a way but it could be possible - Google's algorithms use a lot of different things though (BERT included), so I'm not sure if it would be possible to identify specific parts of it like TF-IDF
@wilfredomartel7781 2 years ago
How can I train SBERT on a specific domain?
@jamesbriggs 2 years ago
hey I have a few articles+videos on this, what does your training data look like? If you have sentence pairs + scores you can use MSE loss which I cover at the end of: www.pinecone.io/learn/gpl/ If you don't have training data and just text data you can use unsupervised methods like GPL (above), GenQ, or TSDAE (all found here): www.pinecone.io/learn/nlp/ If you have sentence pairs *without* labels you can use softmax or preferably MNR loss: www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/
@鹏程李-w8z 3 years ago
Would love some scripts or subtitles for your video, thank you!
@edgar23vargas53 3 years ago
Hey is there any way we can get in contact with you?
@jamesbriggs 3 years ago
Yes on the 'About' page of my YT channel you'll be able to find my email
@edgar23vargas53 3 years ago
@@jamesbriggs DMed you on Instagram
@edgar23vargas53 3 years ago
@@jamesbriggs DMed you
@jamesbriggs 3 years ago
@@edgar23vargas53 got it
@edgar23vargas53 3 years ago
@@jamesbriggs shot you an email
@gorgolyt 3 years ago
You need to tighten up your math notation. Writing f(t, D) for the "total number of terms in the document" is really confusing and makes no sense. What is t in this function? You either need to sum over all t in D, which you haven't written, or you should just get rid of the t and use some function g(D) to denote the total number of terms in the document. When you get onto BM25 it's even worse, I'm not sure your explanation of your notation is even correct. It should be f(q, D) on the denominator, the same q that is in the numerator, not f(t, D), whatever that means.
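For reference, the standard BM25 scoring function with consistent notation (f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in words, and avgdl is the average document length in the collection):

```latex
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here the same f(q_i, D) appears in numerator and denominator, which is the point this comment raises: the term frequency drives both the boost and its saturation.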