CatBoost Part 1: Ordered Target Encoding

  34,347 views

StatQuest with Josh Starmer

1 day ago

Comments: 84
@statquest 1 year ago
Corrections: 4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1, and 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2.
To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book, The StatQuest Illustrated Guide to Machine Learning, or a Study Guide or Merch!!! statquest.org/statquest-store/
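The encoding discussed in the video (and the correction above) can be sketched in a few lines of Python. This is an illustrative reimplementation, not CatBoost's actual code: the prior of 0.05 and the (OptionCount + prior) / (n + 1) formula follow the video's worked example, and the toy data below is made up for illustration.

```python
def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row's category using only the rows that come
    before it, so a row's own target never leaks into its encoding."""
    encoded = []
    for i, cat in enumerate(categories):
        # Earlier rows (in the randomized order) with the same category
        prev = [t for c, t in zip(categories[:i], targets[:i]) if c == cat]
        # OptionCount: earlier same-category rows with target = 1.
        # (Per the correction above, with targets 0/1/2 a row with
        # target 1 would also count earlier rows with target 2.)
        option_count = sum(1 for t in prev if t == 1)
        encoded.append((option_count + prior) / (len(prev) + 1))
    return encoded

# Toy data in the spirit of the video's example
colors = ["Blue", "Red", "Blue", "Blue", "Red"]
loves_troll_2 = [1, 0, 1, 0, 1]
print(ordered_target_encode(colors, loves_troll_2))
```

Note that the same category gets a different encoded value in each row: the first Blue row is encoded as (0 + 0.05) / 1 = 0.05, the second as (1 + 0.05) / 2 = 0.525, and so on.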
@mrcoet 1 year ago
Thank you! I'm doing my master's thesis and I'm checking your channel every day waiting for Transformers. Thank you again!
@statquest 1 year ago
I'm still working on it.
@dihancheng952 1 year ago
@@statquest same eagerly waiting here
@aghazi94 1 year ago
I have been waiting for this for so long. Thanks a lot!
@statquest 1 year ago
BAM!
@AllNightNightwish 1 year ago
Hi Josh, I agree with your point here about it being unnecessary (also having seen the previous longer explanation you posted a while back). However, I think their main point and contribution was not the mitigation in a single tree, but throughout the ensemble. If I understand it correctly, by using ordered boosting and randomization over each tree they guarantee that there is no leakage between the separate trees, because none of the samples have ever seen the original value. They use multiple models trained on different fractions of the dataset for each tree, just so they can make predictions that don't have any leakage at all. I'm still not sure that it wouldn't just work with leave-one-out encoding, but given that context it seems to be more useful at least.
@statquest 1 year ago
Part 2 in this series (which comes out in less than 24 hours), shows how the trees are built using the same approach that limits leakage. I guess one of my issues with CatBoost making such a big deal about leakage is that, even though other methods (XGBoost, lightGBM, Random Forests, etc) might result in leakage, they still perform well - and the whole point of avoiding leakage is simply to have a model perform well.
@joy5636 1 year ago
Wow, I am so excited to see the CatBoost topic! Thank you!
@statquest 1 year ago
BAM! :)
@murilopalomosebilla2999 1 year ago
It may be silly, but having a boosting method with cat in its name is really cool haha
@statquest 1 year ago
Bam! :)
@tessa10001 1 year ago
Where was this when I wrote my master's thesis with CatBoost :(
@statquest 1 year ago
Late bam?
@dl569 1 year ago
Can't wait to see Transformers, PLEASE!!!!!!
@statquest 1 year ago
Working on it! :)
@firstkaransingh 1 year ago
Finally a video on CatBoost. I was waiting for a proper explanation.
@statquest 1 year ago
Bam! :)
@TJ-hs1qm 1 year ago
Hey Josh, I was wondering if you could do a series on graph theory and NLP? Exploring this stuff would be really helpful. Thanks!
@statquest 1 year ago
I'll keep that in mind.
@ericchang927 1 year ago
Great video!!! Could you please also introduce LightGBM? Thanks!
@statquest 1 year ago
I'll keep that in mind. I have some notes on it already so hopefully I can do it soon.
@ravi122133 7 months ago
@statquest I think in the paper they take the case where each sample has a unique category to show that it leads to leakage, not the case where all samples have the same category (Section 3.2, Greedy TS, of the CatBoost paper).
@statquest 7 months ago
Yes, but in either case, you could just remove that column.
@matteomorellini5974 1 year ago
Hi Josh, first of all thanks for your amazing work and passion. I'd like to suggest a video about Optuna, which, at least in my case, would be extremely helpful.
@statquest 1 year ago
I'll keep that in mind.
@matteomorellini5974 1 year ago
@@statquest thanks Josh❤️
@frischidn3869 1 year ago
Hello, thanks for the video. I want to ask: what if the target variable (Loves Troll 2) is multiclass (Like, Dislike, So-so)? How would the encoding work then for Favorite Color? And should we encode the target variable first, such as 0 = Dislike, 1 = So-so, 2 = Like, before we proceed to CatBoost-encoding the feature (Favorite Color)?
@statquest 1 year ago
When there are more than 2 classes, the equation changes, but just a little bit. You can find it in the documentation: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
@frischidn3869 1 year ago
@@statquest It says there, "The label values are integer identifiers of target classes (starting from "0")". So I have to encode the target variable outside the CatBoost algorithm as 0, 1, 2 if it has 3 classes?
@statquest 1 year ago
@@frischidn3869 Sounds like it.
@c.nbhaskar4718 1 year ago
Great tutorial, but I am eagerly waiting for a StatQuest on Transformers.
@statquest 1 year ago
Working on it!
@davidguo1267 1 year ago
Thanks for the explanation. By the way, have you talked about backpropagation through time in recurrent neural networks? If not, are you planning to talk about it?
@statquest 1 year ago
Backpropagation through time is just "unroll the RNN and then do normal backpropagation". I have thought about doing a video on it and have notes, but it's not a super high priority right now. Instead I want to get to transformers.
@shubhamgupta6551 1 year ago
How is ordered target encoding applied at scoring time? There will not be any target variable, and we don't have a single value for a category; i.e., the color Blue is encoded multiple times with different values.
@statquest 1 year ago
We use the entire training dataset to encode new data.
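A sketch of how that could work, using the same smoothing formula as the video's example (this is my reading of Josh's answer, not CatBoost's documented implementation):

```python
def encode_new_sample(category, train_categories, train_targets, prior=0.05):
    # At scoring time there is no target to leak, so we can use ALL
    # training rows with this category, not just the "earlier" ones.
    matches = [t for c, t in zip(train_categories, train_targets) if c == category]
    option_count = sum(1 for t in matches if t == 1)
    return (option_count + prior) / (len(matches) + 1)

# Training data: three Blue rows, two of which have target 1
colors = ["Blue", "Red", "Blue", "Blue", "Red"]
loves_troll_2 = [1, 0, 1, 0, 1]
print(encode_new_sample("Blue", colors, loves_troll_2))  # (2 + 0.05) / (3 + 1)
```

This gives every new Blue sample one fixed value, unlike the per-row values used during training.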
@BlueRS123 1 year ago
Will you cover LightGBM?
@statquest 1 year ago
I've got notes on it and when I have time I will.
@BlueRS123 1 year ago
@@statquest Cool! Are videos of gradient descent optimizers planned, too? (Momentum, Adam, etc.)
@statquest 1 year ago
@@BlueRS123 I've got notes for Adam as well, so it's just a function of finding some time.
@EvanZamir 1 year ago
Can Lightning be used with CatBoost?
@statquest 1 year ago
Lightning AI provides a platform to do things easily in the cloud. So, anytime you have a ton of data or a large model, Lightning can help.
@johndavid5907 1 year ago
Hi there sir, can you tell me whether the value that the prior variable holds is the significance-level value?
@statquest 1 year ago
0.05 is often used as a threshold for statistical significance, but in this case, that concept has nothing to do with how we assign a value to the prior. In theory, the prior could be anything, like 12, and that's not even an option for the threshold for statistical significance.
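One way to see the prior's role: it acts as a pseudo-count that sets the default encoding for categories with few (or no) matching rows, not a significance threshold. A quick numeric illustration, using the (OptionCount + prior) / (n + 1) formula from the video:

```python
def encode(option_count, n, prior):
    # (OptionCount + prior) / (n + 1), as in the video's example
    return (option_count + prior) / (n + 1)

# For a category with no earlier rows, the encoding is just the prior,
# so the prior is the "default guess" for rare categories.
print(encode(0, 0, 0.05))    # 0.05
print(encode(0, 0, 12.0))    # 12.0 -- any value works; nothing ties it to 0.05
# With more data, the prior's influence shrinks:
print(encode(90, 99, 0.05))  # 0.9005
```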
@beautyisinmind2163 1 year ago
Is categorical boosting only suitable for data with categorical features, or can we use it even if our data has no categorical features? When using it on continuous features, does it require any conversion?
@statquest 1 year ago
You can certainly use CatBoost on a dataset that doesn't have any categorical features. And it wouldn't require conversion.
@daniellaicheukpan 1 year ago
Hi Josh, thanks for your videos. I have one question: in your example, the color Blue can be encoded to several numerical values. Assume that I trained and deployed this model; when new data comes with color = Blue, which has no "Loves Troll 2" column, how can the model know which value to encode the color to? Thanks so much.
@statquest 1 year ago
You use all of the color blue samples in the original training dataset.
@daniellaicheukpan 1 year ago
@@statquest So that means taking the average?
@statquest 1 year ago
@@daniellaicheukpan I was thinking more along the lines of plugging all of the blue rows into the equation. That might be the same as taking the average, but I haven't worked that out.
@EvanZamir 1 year ago
My guess is the ordered target encoding acts like a form of regularization.
@statquest 1 year ago
Yes, that makes sense to me.
@luiscarlospallaresascanio2374 1 year ago
What did you use to translate the text into Spanish? :0 I'd seen other translated videos, but I didn't think you'd make the switch so quickly.
@statquest 1 year ago
I use Google's "Aloud".
@guimaraesalysson 1 year ago
In this simple example of people who liked certain colors and whether or not they liked the movie, wouldn't "leakage" make sense? After all, if, for example, 90% of people who like blue liked the movie, wouldn't knowing that the next person's favorite color is blue already provide information? Why is the leak a leak in this case?
@statquest 1 year ago
Leakage comes from using the same row's target value to modify its value for Favorite Color. This is typically dealt with by using k-fold target encoding - kzbin.info/www/bejne/a2mcn3Z9mrx6Z9k
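The k-fold idea can be sketched as follows (a minimal illustration, not the exact implementation from the linked video; the smoothing formula is borrowed from the CatBoost example, though k-fold target encoding is often done with a plain mean):

```python
def kfold_target_encode(categories, targets, k=2, prior=0.05):
    """Encode row i using only rows outside row i's fold, so no row's
    own target contributes to its own encoded value."""
    folds = [i % k for i in range(len(categories))]  # simple round-robin folds
    encoded = []
    for i, cat in enumerate(categories):
        # Same-category rows from the OTHER folds only
        matches = [t for j, (c, t) in enumerate(zip(categories, targets))
                   if folds[j] != folds[i] and c == cat]
        option_count = sum(1 for t in matches if t == 1)
        encoded.append((option_count + prior) / (len(matches) + 1))
    return encoded

print(kfold_target_encode(["Blue", "Blue", "Blue", "Blue"], [1, 0, 1, 0], k=2))
```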
@Joaopedro_ 10 months ago
Give a shout-out to Caio Ducati
@statquest 10 months ago
:)
@nitinsiwach1989 10 months ago
Not only is the motivation unjustifiable; the way target encoding is done by CatBoost also makes no sense. Even in your toy example, the different categories are numerically exactly the same when encoded, and there is absolutely no reason that should be the case.
@statquest 10 months ago
Noted
@junaidbutt3000 1 year ago
Clear and concise as always, Josh! I was wondering if there is a natural way to extend the OptionCount metric to multiclass problems? It makes sense in binary classification: we count the observations where a category c co-occurs with the positive class of the target variable (1 in this case). If this were adapted to multiclass problems, how would we adapt the encoding equation?
@statquest 1 year ago
Great question - and the CatBoost documentation has a good description of how it works for more classes: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
@texla-kh9qx 1 year ago
@@statquest From the documentation ("Multiclassification: The label values are integer identifiers of target classes (starting from "0")."), it seems that they simply integer-encode the classes? Doesn't this introduce an artificial ordering in the target classes?
@statquest 1 year ago
@@texla-kh9qx You have to remember that we don't split the data based on the target value, so using integer values for the target isn't a problem.
@texla-kh9qx 1 year ago
@@statquest The categorical features of the independent variables are encoded by target statistics, i.e., a transformation from categories to numerical values. If there is an artificial ordering in the target variable y, it propagates to that categorical feature of X. So integer-encoding the classes seems not a good choice.
@statquest 1 year ago
@@texla-kh9qx If you look at the equations for target encoding independent variables, you'll see that they don't include the target value, just the number of rows with the same category. So I don't believe that the target values propagate to the independent variables.
@xaviernogueira 1 year ago
Glad to see CatBoost! Would love to hear more about data leakage mitigation.
@statquest 1 year ago
Thanks! Yes, I think at one point I need to do a video just on all the types of leakage.
@tapiotanskanen3494 1 year ago
1:57 - Is this correct? In chapter 3.2 - *Greedy TS* - they talk about a problem _"This estimate is noisy for low-frequency categories..."_, but your example has a (maximally) high-frequency category. Later they stipulate _"Assume the i-th feature is categorical, *all its values* are unique, ..."_. To me this means that there is only a single row for each category. In other words, each category (label) is unique, i.e., we have exactly one example per category (label).
@statquest 1 year ago
The video is correct. If you keep reading the manuscript, just a few more paragraphs, you'll get to the section titled "Leave-one-out TS", and you'll see what I'm talking about in this video.
@texla-kh9qx 1 year ago
The video is talking about the example with a constant categorical feature introduced in the "Leave-one-out TS" section of their paper. However, I think the formula for target statistics in this video is different from the one in the paper, though the conclusion is still the same. Put another way, a categorical feature that originally has a uniform value carries no information at all. After target-statistic encoding, that categorical feature is transformed into a numerical feature with binary values that exactly distinguish the binary target classes. This is clearly target leakage, as you can make a perfect prediction relying on a single feature.
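That constant-feature example is easy to demonstrate. Below is a leave-one-out encoding using a plain mean of the other rows' targets (a simplification of the paper's TS formula): a feature that is identical in every row becomes a perfect, and perfectly leaky, predictor.

```python
# Every row has the same categorical value, e.g. Favorite Color = "Blue",
# so the raw feature carries zero information about the target.
targets = [1, 0, 1, 0, 1, 0]

def leave_one_out_mean(ts, i):
    # Mean of all the OTHER rows' targets
    others = ts[:i] + ts[i + 1:]
    return sum(others) / len(others)

encoded = [leave_one_out_mean(targets, i) for i in range(len(targets))]
print(encoded)  # rows with target 1 get 0.4, rows with target 0 get 0.6
```

The encoded feature now takes exactly two values that split the classes perfectly (e.g., "predict 1 when the encoding is below 0.5"), even though the original feature was useless; that is the leakage being described.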
@heteromodal 1 year ago
Hey Josh - is there a mathematical justification for the prior in the numerator being defined as 0.05? Regardless of whether a justification exists :) - is it always the case, or is that just what you saw in their examples, with no certainty that it's a fixed value? Thank you as always for a great video!
@statquest 1 year ago
I saw 0.05 used as the prior here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic and, on that page, it says you can set the prior. But I've looked in the documentation and I can't find where it is set, so I really don't know if it is always the case or not.
@heteromodal 1 year ago
@@statquest Thank you!