BM25: The Most Important Text Metric in Data Science

11,271 views

ritvikmath

1 day ago

Comments: 27
@rabiumuhammedeffect423 8 months ago
This is way way more informative than my lecturer's lecture
@shohrehhaddadan8922 7 months ago
Not only does this video explain the specific metric, but it also teaches how to analyze metrics! amazing!
@ritvikmath 7 months ago
Glad it was helpful!
@gopesh97 7 months ago
What a superb explanation! I really appreciate how you broke down the problem and its solution into manageable pieces. Your clarity and approach are consistently impressive! 🍻
@rajbhandari9605 2 months ago
This is super awesome. I have worked as a search engineer who already understood TF-IDF and had been trying to understand BM25 for a long time, and it just flew over my head. Yours is the first explanation that just clicked for me. Thank you!!
@haloandavatar1177 11 months ago
Remarkably well explained, and with such concise elegance too. Extra +100 pts for explaining in layman's terms what a partial derivative is.
@hungchen6604 1 year ago
Analyzing an equation using derivatives is brilliant. Thanks for yet another outstanding video, as always.
@twoplustwo5 2 months ago
🎯 Key points for quick navigation:

00:00:01 📚 Introduction to TF-IDF and BM25
- TF-IDF and BM25 are essential for ensuring the textual relevance of search results.
- TF-IDF combines term frequency and inverse document frequency to rank documents by relevance.
- Importance of considering query uniqueness in relevance scoring.

00:02:10 📝 Limitations of TF-IDF
- TF-IDF does not account for document length, leading to potential misranking.
- Relevance can be skewed by term frequency without considering document brevity.
- Real-world example contrasting document length's effect on relevance.

00:05:00 🚀 Need for the New Metric: BM25
- BM25 is introduced to address TF-IDF's shortcomings.
- BM25 brings diminishing returns to term frequency, combating keyword stuffing.
- Importance of penalizing longer documents to prevent manipulation.

00:08:03 🔍 Understanding BM25 Structure
- Explanation of the BM25 formula and its components.
- Iterative improvement from previous versions to achieve optimal matching.
- b and k are tunable parameters impacting document ranking.

00:12:38 📈 BM25 Partial Derivatives and Impact
- Partial derivatives show how document scores change with each property.
- Positive yet diminishing returns for increasing term frequency.
- Negative impact from increasing document length relative to the average.

00:16:49 🎯 Demonstrating BM25's Superiority Over TF-IDF
- Practical example contrasting BM25 with TF-IDF in document scoring.
- BM25 selects concise, relevant documents over lengthier, less relevant ones.
- Encouragement to adopt added complexity only when it improves performance.

Made with HARPA AI
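The scoring function summarized in these chapters can be sketched in a few lines. This is a hypothetical, minimal implementation of the standard Okapi BM25 single-term score, with k1 and b as the tunable parameters mentioned above; the exact IDF smoothing (the +1 inside the log) is one common variant and not necessarily the precise form used in the video.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """Okapi BM25 score contribution of a single query term.

    tf: term frequency in this document
    doc_len / avg_doc_len: this document's length vs. the corpus average
    n_docs: documents in the corpus; df: documents containing the term
    k1 controls term-frequency saturation, b controls length normalization.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

Setting b = 0 turns off the length penalty entirely, and a larger k1 delays saturation, which is why the video treats both as knobs worth tuning.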
@javiertorcal5053 1 month ago
Simply an awesome explanation! Congrats
@ritvikmath 1 month ago
Glad you liked it!
@MilesLabrador 1 year ago
This was wonderfully explained, thank you for the great video!
@rsilveira79 10 months ago
Great explanation, thanks!
@naughtrussel5787 1 year ago
Awesome video, very clear and helpful!
@rishidixit7939 10 days ago
Amazing Explanation
@ArunKumar-bp5lo 5 months ago
what a great explanation
@wildlifeonscreen 1 year ago
Thanks for your video! Doesn't TF already take into account the length of the document? It's a proportion of the number of times a word appears out of the total number of words in the document. So, issue #1 wouldn't be a problem?
@ugestacoolie5998 11 months ago
Yeah, I was thinking that too. Long documents wouldn't matter either, because the ratios are equivalent (e.g., 1/10 = 100/1000), so I don't quite get it either.
@EOI-KSA 8 months ago
Rightly pointed out! But yes, the other parts of the problem still make sense.
@roct07 4 months ago
In the video, raw frequency is used as-is; what you're describing is frequency normalized by document length. While that is fine, it scales linearly, whereas BM25's term-frequency component saturates: it gives a smaller and smaller gain as the count increases instead of a constant increase, which produces better results in real use cases.
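The contrast this comment describes can be seen numerically. This sketch (hypothetical helper names) compares a length-normalized TF, where each extra occurrence adds a constant amount, against BM25's saturating TF component, which is bounded above by k1 + 1:

```python
K1 = 1.2  # BM25 saturation parameter

def linear_tf(tf, doc_len):
    # Length-normalized TF: each extra occurrence adds a constant 1/doc_len.
    return tf / doc_len

def bm25_tf(tf, k1=K1):
    # BM25's TF component: approaches k1 + 1, so marginal gains shrink.
    return tf * (k1 + 1) / (tf + k1)

# Marginal gain of one more occurrence in a 100-word document:
linear_gains = [linear_tf(t + 1, 100) - linear_tf(t, 100) for t in range(1, 5)]
bm25_gains = [bm25_tf(t + 1) - bm25_tf(t) for t in range(1, 5)]
# linear_gains are all the same size; bm25_gains shrink toward zero,
# so stuffing a keyword 100 times barely moves the score.
```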
@rajbhandari9605 2 months ago
TF takes only the current document into account for the query term q. For example, take Doc1: "cat cat cat dog dog dog dog dog dog", Doc2: "cat cat", and Doc3: "dog". For the query "cat", TF is higher for Doc1 because it contains "cat" 3 times versus 2 times in Doc2. Similarly, Doc1 scores higher than Doc3 for the query "dog". Effectively, the longer the document, the higher the score, with no penalty. In the extreme case, if you create a document 10x the length of the average document, it will rank higher. That's the problem with TF-IDF: it does not take document length into account, which BM25 does with the term 'theta'. Note that IDF only considers the frequency of the query term "cat" across the whole corpus, not the document length.
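The three-document example above can be checked numerically. This sketch uses raw term counts for TF (as in the comment) and a standard Okapi BM25 form with its usual length-normalization term; the exact IDF smoothing is an assumption:

```python
import math

docs = {
    "Doc1": "cat cat cat dog dog dog dog dog dog".split(),
    "Doc2": "cat cat".split(),
    "Doc3": "dog".split(),
}
avg_len = sum(len(d) for d in docs.values()) / len(docs)

def raw_tf(term, doc):
    # Raw term count, with no length penalty at all.
    return doc.count(term)

def bm25(term, doc, k1=1.2, b=0.75):
    tf = doc.count(term)
    df = sum(term in d for d in docs.values())  # documents containing the term
    idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * len(doc) / avg_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Raw TF ranks Doc1 first for "cat" (3 occurrences beat 2),
# but BM25's length penalty prefers the short, focused Doc2.
```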
@daspradeep 1 year ago
randomly landed on your channel, your explanations are fundamentally so awesome 🙏🏽
@data_pathavan4585 1 month ago
I have some doubts about the explanation: term frequency takes the total number of words in the document as its denominator, so 1/10 might carry more weight in the TF-IDF calculation.
@vishnum9613 6 months ago
Isn't term frequency already dependent on the total number of words in the document?
@Hemewl 1 year ago
So why not just apply length normalization to the documents, so that the TF of "cat" in A = 1/10 and the TF of "cat" in B = 10/1000?
@siddhantrai7529 1 year ago
I suppose we are doing that here as well, but instead of treating each document as IID and normalizing independently, we also take the relative difference in size into account. It's like a weighted normalization. Plain normalization would be like comparing a bunch of sigmoids with cross-entropy. More specifically, the calculation tries to take mutual information among the documents into account.
@Thebrotrain 1 year ago
Why would you use additional shorthand when trying to teach something? I get that the paper is small, but adding one more thing for the learner to keep track of is a bad teaching strategy.
@tantzer6113 2 months ago
This particular IDF has nothing to do with occupation, apartheid, or genocide.