CatBoost Part 1: Ordered Target Encoding

38,073 views

StatQuest with Josh Starmer

1 day ago

Comments: 84
@statquest 1 year ago
Corrections: 4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1, or 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2.
To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book, The StatQuest Illustrated Guide to Machine Learning, or a Study Guide or Merch!!! statquest.org/statquest-store/
@mrcoet 1 year ago
Thank you! I'm doing my master's thesis and I'm checking your channel every day waiting for Transformers. Thank you again!
@statquest 1 year ago
I'm still working on it.
@dihancheng952 1 year ago
@statquest Same, eagerly waiting here.
@firstkaransingh 1 year ago
Finally a video on CatBoost. I was waiting for a proper explanation.
@statquest 1 year ago
Bam! :)
@xaviernogueira 1 year ago
Glad to see CatBoost! Would love to hear more about data leakage mitigation.
@statquest 1 year ago
Thanks! Yes, I think at one point I need to do a video just on all the types of leakage.
@aghazi94 1 year ago
I have been waiting for this for so long. Thanks a lot!
@statquest 1 year ago
BAM!
@joy5636 1 year ago
Wow, I am so excited to see the CatBoost topic! Thank you!
@statquest 1 year ago
BAM! :)
@AllNightNightwish 1 year ago
Hi Josh, I agree with your point here about it being unnecessary (also having seen the previous, longer explanation you posted a while back). However, I think their main point and contribution was not the mitigation in a single tree, but throughout the ensemble. If I understand it correctly, by using ordered boosting and randomization over each tree, they guarantee that there is no leakage between the separate trees, because none of the samples have ever seen the original value. They use multiple models trained on different fractions of the dataset for each tree, just so they can make predictions that don't have any leakage at all. I'm still not sure that it wouldn't just work with leave-one-out encoding, but given that context it seems to be more useful at least.
@statquest 1 year ago
Part 2 in this series (which comes out in less than 24 hours) shows how the trees are built using the same approach that limits leakage. I guess one of my issues with CatBoost making such a big deal about leakage is that, even though other methods (XGBoost, LightGBM, Random Forests, etc.) might result in leakage, they still perform well - and the whole point of avoiding leakage is simply to have a model perform well.
@murilopalomosebilla2999 1 year ago
It may be silly, but having a boosting method with cat in its name is really cool haha
@statquest 1 year ago
Bam! :)
@tapiotanskanen3494 1 year ago
1:57 - Is this correct? In chapter 3.2, "Greedy TS", they talk about a problem: "This estimate is noisy for low-frequency categories...", but your example has a (maximally) high-frequency category. Later they stipulate: "Assume the i-th feature is categorical, all its values are unique, ...". To me this means there is only a single row for each category. In other words, each category (label) is unique, i.e. we have exactly one example per category (label).
@statquest 1 year ago
The video is correct. If you keep reading the manuscript, just a few more paragraphs, you'll get to the section titled "Leave-one-out TS", and you'll see what I'm talking about in this video.
@texla-kh9qx 1 year ago
The video is talking about the example with a constant categorical feature introduced in the "Leave-one-out TS" section of their paper. However, I think the formula for target statistics in this video is different from the one in the paper, though the conclusion is still the same. Put another way, a categorical feature that originally has a uniform value carries no information at all. After the target statistics encoding, that categorical feature is transformed into a numerical feature with binary values which exactly distinguishes the binary target classes. This is clearly target leakage, as you can make a perfect prediction relying on a single feature.
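To make that leakage concrete, here is a small sketch (my own illustration, not code from the paper) of leave-one-out target statistics applied to a constant categorical feature: the feature carries no information, yet its encoding alone separates the two classes perfectly.

```python
def leave_one_out_encode(categories, targets, prior=0.05):
    # Per-category totals over the WHOLE column.
    totals = {}  # category -> (row count, sum of targets)
    for c, y in zip(categories, targets):
        n, s = totals.get(c, (0, 0))
        totals[c] = (n + 1, s + y)
    # Encode each row from all OTHER rows of its category:
    # (sum excluding self + prior) / (count excluding self + 1).
    return [(totals[c][1] - y + prior) / ((totals[c][0] - 1) + 1)
            for c, y in zip(categories, targets)]

colors = ["Blue"] * 6               # a constant, information-free feature
loves_troll_2 = [1, 0, 1, 0, 1, 0]
enc = leave_one_out_encode(colors, loves_troll_2)
# Every target = 1 row gets (3 - 1 + 0.05) / 6 and every target = 0 row
# gets (3 - 0 + 0.05) / 6: the encoded value alone reveals the target.
print(enc)
```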
@TJ-hs1qm 1 year ago
Hey Josh, I was wondering if you could do a series on graph theory and NLP? Exploring this stuff would be really helpful. Thanks!
@statquest 1 year ago
I'll keep that in mind.
@ravi122133 10 months ago
@statquest, I think in the paper they take the case where each sample has a unique category to show that it leads to leakage, not the case where all samples have the same category (Section 3.2, Greedy TS, of the CatBoost paper).
@statquest 10 months ago
Yes, but in either case, you could just remove that column.
@frischidn3869 1 year ago
Hello, thanks for the video. I want to ask: what if the target variable (Loves Troll 2) is multiclass (Like, Dislike, So-so)? How will the encoding work then for Favorite Color? And should we encode the target variable first, such as 0 = Dislike, 1 = So-so, 2 = Like, before we then proceed to CatBoost encoding the feature (Favorite Color)?
@statquest 1 year ago
When there are more than 2 classes, the equation changes, but just a little bit. You can find it in the documentation: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
@frischidn3869 1 year ago
@statquest It says there "The label values are integer identifiers of target classes (starting from "0")". So I have to encode the target variable first, outside the CatBoost algorithm, as 0, 1, 2 if there are 3 classes?
@statquest 1 year ago
@frischidn3869 Sounds like it.
@dl569 1 year ago
Can't wait to see Transformers, PLEASE!!!!!!
@statquest 1 year ago
Working on it! :)
@davidguo1267 1 year ago
Thanks for the explanation. By the way, have you talked about backpropagation through time in recurrent neural networks? If not, are you planning to talk about it?
@statquest 1 year ago
Backpropagation through time is just "unroll the RNN and then do normal backpropagation". I have thought about doing a video on it and have notes, but it's not a super high priority right now. Instead I want to get to transformers.
@daniellaicheukpan 1 year ago
Hi Josh, thanks for your videos. I have one question: in your example, the color blue can be encoded to several numerical values. Assume that I trained and deployed this model; when new data comes in with color = blue, which has no "Loves Troll 2" column, how can the model know which value to encode the color to? Thanks so much.
@statquest 1 year ago
You use all of the color blue samples in the original training dataset.
@daniellaicheukpan 1 year ago
@statquest Does that mean taking the average?
@statquest 1 year ago
@daniellaicheukpan I was thinking more along the lines of plugging all of the blue rows into the equation. That might be the same as taking the average, but I haven't worked that out.
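One plausible reading of "plugging all of the blue rows into the equation" is sketched below (my own illustration; CatBoost's actual inference-time behavior may differ): at prediction time there is no target to leak, so all training rows of the category can be used in the same (count + prior) / (n + 1) form.

```python
def encode_at_inference(category, train_categories, train_targets, prior=0.05):
    # Use ALL training rows of this category: the new row has no target,
    # so there is nothing to leak and no need for an ordering.
    n = sum(1 for c in train_categories if c == category)
    k = sum(y for c, y in zip(train_categories, train_targets) if c == category)
    return (k + prior) / (n + 1)

train_colors = ["Blue", "Blue", "Red", "Blue"]
train_targets = [1, 0, 1, 1]
print(encode_at_inference("Blue", train_colors, train_targets))
# 3 Blue rows, 2 of them with target = 1: (2 + 0.05) / (3 + 1) = 0.5125
```

In general this need not equal the mean of the per-row ordered encodings, since each of those uses a different denominator, which is presumably why the reply hedges.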
@ericchang927 1 year ago
Great video!!! Could you please also introduce LightGBM? Thanks!
@statquest 1 year ago
I'll keep that in mind. I have some notes on it already so hopefully I can do it soon.
@tessa10001 1 year ago
Where was this when I wrote my master's thesis with CatBoost :(
@statquest 1 year ago
Late bam?
@beautyisinmind2163 1 year ago
Is categorical boosting only suitable for data with categorical features, or can we use it even if our data has no categorical features? When using it on continuous features, does it require any conversion?
@statquest 1 year ago
You can certainly use CatBoost on a dataset that doesn't have any categorical features. And it wouldn't require conversion.
@matteomorellini5974 1 year ago
Hi Josh, first of all thanks for your amazing work and passion. I'd like to suggest a video about Optuna, which, at least in my case, would be extremely helpful.
@statquest 1 year ago
I'll keep that in mind.
@matteomorellini5974 1 year ago
@statquest Thanks Josh❤️
@johndavid5907 1 year ago
Hi there sir, can you tell me whether the value the prior variable holds is the significance level value?
@statquest 1 year ago
0.05 is often used as a threshold for statistical significance, but in this case, that concept has nothing to do with how we assign a value to the prior. In theory, the prior could be anything, like 12, and that's not even an option for the threshold for statistical significance.
@shubhamgupta6551 1 year ago
How is ordered target encoding applied at the time of scoring? There will not be any target variable, and we don't have a single value for a category, i.e., the Blue color is encoded multiple times with different values.
@statquest 1 year ago
We use the entire training dataset to encode new data.
@heteromodal 1 year ago
Hey Josh - is there a mathematical justification for the prior in the numerator being defined as 0.05? And regardless of whether a justification exists :) - is it always the case, or is it just what you saw in their examples, with no certainty that it's a fixed value? Thank you as always for a great video!
@statquest 1 year ago
I saw 0.05 used as the prior here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic and, on that page, it says you can set the prior. But I've looked in the documentation and I can't find where it is set, so I really don't know if it is always the case or not.
@heteromodal 1 year ago
@statquest Thank you!
@BlueRS123 1 year ago
Will you cover LightGBM?
@statquest 1 year ago
I've got notes on it and when I have time I will.
@BlueRS123 1 year ago
@statquest Cool! Are videos on gradient descent optimizers planned, too? (Momentum, Adam, etc.)
@statquest 1 year ago
@BlueRS123 I've got notes for Adam as well, so it's just a function of finding some time.
@c.nbhaskar4718 1 year ago
Great tutorial, but I am eagerly waiting for a StatQuest on Transformers.
@statquest 1 year ago
Working on it!
@EvanZamir 1 year ago
Can Lightning be used with CatBoost?
@statquest 1 year ago
Lightning AI provides a platform to do things easily in the cloud. So, anytime you have a ton of data or a large model, Lightning can help.
@luiscarlospallaresascanio2374 1 year ago
What did you use to translate the text into Spanish? :0 I had already seen other translated videos, but I didn't think they would make the change so quickly.
@statquest 1 year ago
I use Google's "Aloud".
@EvanZamir 1 year ago
My guess is the ordered target encoding acts like a form of regularization.
@statquest 1 year ago
Yes, that makes sense to me.
@guimaraesalysson 1 year ago
In this simple example of people who like certain colors and whether or not they liked the movie, wouldn't "leakage" make sense? After all, if, for example, 90% of people who like blue liked the movie, wouldn't knowing that the next person's favorite color is blue already provide information? Why is the leak a leak in this case?
@statquest 1 year ago
Leakage comes from using the same row's target value to modify its value for Favorite Color. This is typically dealt with by using k-fold target encoding - kzbin.info/www/bejne/a2mcn3Z9mrx6Z9k
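For context, k-fold target encoding can be sketched like this (my own illustration; the round-robin fold assignment is a simplification, real implementations usually shuffle): each row is encoded using target statistics computed only from rows outside its own fold, so its own target never contributes to its encoding.

```python
def kfold_target_encode(categories, targets, k=2, prior=0.05):
    folds = [i % k for i in range(len(categories))]  # round-robin folds
    encoded = []
    for i, cat in enumerate(categories):
        # Counts from every row OUTSIDE row i's fold with the same category.
        n = sum(1 for j, c in enumerate(categories)
                if folds[j] != folds[i] and c == cat)
        s = sum(y for j, (c, y) in enumerate(zip(categories, targets))
                if folds[j] != folds[i] and c == cat)
        encoded.append((s + prior) / (n + 1))
    return encoded

colors = ["Blue", "Blue", "Blue", "Blue"]
loves_troll_2 = [1, 0, 1, 0]
print(kfold_target_encode(colors, loves_troll_2))
# Rows in the same fold share out-of-fold statistics, so they get the
# same encoding; rows in different folds can differ.
```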
@nitinsiwach1989 1 year ago
Not only is the motivation unjustifiable; the way target encoding is done by CatBoost also makes no sense. Even in your toy example, the different categories are numerically exactly the same when encoded, and there is absolutely no reason that should be the case.
@statquest 1 year ago
Noted
@Joaopedro_ 1 year ago
Send a shout-out to Caio Ducati
@statquest 1 year ago
:)