CatBoost Part 2: Building and Using Trees

20,727 views

StatQuest with Josh Starmer

1 day ago

Comments: 100
@statquest 1 year ago
NOTE: At 7:23 I should have said that the cosine similarity was 0.71. To learn more about Lightning: lightning.ai/ Support StatQuest by buying my book, The StatQuest Illustrated Guide to Machine Learning, a Study Guide, or Merch!!! statquest.org/statquest-store/
@sahilpalsaniya724 1 year ago
"BAM" and its variants are stuck in my head. Every time I solve a problem, my head plays your voice.
@statquest 1 year ago
bam! :)
@Monkey_uho 1 year ago
Awesome work! I've been watching a lot of your videos to understand the basic ML algorithms, keep it up! Thank you for taking the time and the energy to spread knowledge with others. Also, I would like to say that, like others, I would also love a video explaining the concepts behind LightGBM.
@statquest 1 year ago
Thank you! And one day I hope to do LightGBM.
@weipenghu4463 1 year ago
Looking forward to it ❤
@Quami111 1 year ago
At 2:09 and 12:40 you assigned the row with height=1.32 to bin=1, but you said that rows with smaller heights would get bin=0. And at 11:24 the row with height=1.32 has bin=0, so I guess it is a mistake.
@statquest 1 year ago
Oops! That was a mistake. 1.32 was supposed to be in bin 0 the whole time.
@OscarMartinez-gg5du 1 year ago
@@statquest At 1:15, when the randomized order is created for the first tree, the heights also seem to be shuffled relative to their corresponding Favorite Color, and that changes the examples used to create the stumps. The explanation is very clear, though. I love your videos!!
@aakashdusane 6 months ago
Not gonna lie, CatBoost's nuances were significantly more difficult to understand than any other ensemble model to date, although the basic intuition is pretty straightforward.
@statquest 6 months ago
It's a weird one for sure.
@LL-hj8yh 1 year ago
Hey Josh, thanks as always! Are you planning to roll out LightGBM videos as well?
@statquest 1 year ago
Eventually, that's the plan.
@drelijahmikail3916 2 months ago
How would we extract and abstract the common mechanisms behind constructing the whole decision-tree family? The family includes Gradient Boost, Regression Trees, Random Forests, XGBoost, AdaBoost, CatBoost, ensembles, etc., for regression and classification. There are detailed walkthroughs of the "how" and less of the "why". One observation for some of the family is that they construct multiple "weak learners", compute an error measure (SSR, sum of squares due to regression, similar to LMS, least mean squares), and order the splits from the root down to the lower nodes.
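Not Josh, but since the question is about what the boosting members of that family have in common, here is a minimal sketch of the shared loop (start from a constant prediction, fit a weak learner to the residuals, add a scaled copy of it). The function names, learning rate, and the sklearn stand-in tree are only illustrative assumptions, not any particular library's API.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in "weak learner"

def fit_boosted_ensemble(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    base = y.mean()                              # step 1: start with a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # step 2: residuals under squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # step 3: add a scaled weak learner
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

X = np.array([[1.1], [1.9], [2.5], [3.3]])       # tiny made-up example
y = np.array([10.0, 12.0, 15.0, 21.0])
base, trees = fit_boosted_ensemble(X, y, n_trees=20)
print(predict(base, trees, X))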
@razielamadorrios7284 1 year ago
Such a great video, Josh! I really enjoyed it. Any chance of doing an explanation of LightGBM? Thanks in advance. Also, I'm a huge fan of your work :)
@statquest 1 year ago
I'll keep that in mind.
@rishabhsoni 1 year ago
Great video. One question: is the intuition behind using a high cosine similarity to pick the threshold that, since we add the scaled leaf outputs to create predictions, leaf outputs that are closer to the residuals move us in the right direction, because the residuals represent how far away we are from the actual target? Usually we minimize the residuals, which roughly amounts to finding similarity with the target.
@statquest 1 year ago
I think that is correct. A high similarity means the output value is close to the residuals, so we're moving in the right direction.
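For anyone who wants to see that intuition as numbers, here is a tiny sketch of scoring one candidate split this way; the residuals and the split are made up, and only the cosine-similarity idea comes from the video (the video also builds the leaf outputs in the "ordered" way, which this sketch skips).

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up residuals and one made-up candidate split (which rows go to the left leaf).
residuals = np.array([1.81, -0.19, 0.61, -1.29])
goes_left = np.array([True, True, False, False])

# Each row's "prediction" is the mean residual of the leaf it lands in.
leaf_outputs = np.where(goes_left,
                        residuals[goes_left].mean(),
                        residuals[~goes_left].mean())

print(cosine_similarity(residuals, leaf_outputs))  # higher = better candidate threshold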
@rishabhsoni 1 year ago
But one question that comes to mind: cosine similarity is based on the L2 norm, so Euclidean distance. Wouldn't the number of rows of data act as the dimension in this case and cause weird output due to the curse of dimensionality?
@TheDataScienceChannel 1 year ago
As always, a great video. Was wondering if you intend to add a code tutorial as well?
@statquest 1 year ago
I'll keep it in mind!
@asmaain5856 1 year ago
@@statquest Please, soon! I reaaaaally need it.
@rikki146 1 year ago
APIs for shallow models are mostly similar :\
@satyashah3045 26 days ago
How does multiclass classification work in CatBoost? And in regression, when there are many bins, how is ordered target encoding done? Is it done individually for each class? If possible, can you make a video on the multiclass classification problem for boosting algorithms? Your videos are very easy to understand; they are really great.
@statquest 26 days ago
I'll keep that topic in mind. However, often people just create a bunch of 1-vs-All models, one per target class.
@satyashah3045 26 days ago
@@statquest Can we also do this using the softmax function? Suppose there are three classes in the output; then we can calculate the ordered target encoding for each class, 0, 1, and 2, for each data point to build the trees. Then we put the three logit (log-odds) values obtained by adding up the trees into the softmax, and the predicted class is the one with the highest softmax value? That way we can run the algorithm for all classes simultaneously.
@statquest 25 days ago
@@satyashah3045 I'm not sure you need the softmax in this case, but I know that you use the cross entropy loss function.
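Just to illustrate the softmax step described in the comment above, this is the generic formula applied to made-up per-class log(odds); it is not a claim about CatBoost's internals.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return e / e.sum()

# Made-up log(odds) accumulated by three per-class sequences of trees.
logits = np.array([0.4, 1.7, -0.3])
probs = softmax(logits)
print(probs, "-> predicted class:", probs.argmax())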
@Mark_mochi 11 months ago
At 8:25, why does the threshold change to 0.87 all of a sudden?
@statquest 11 months ago
Oops. That looks like a typo.
@nitinsiwach1989 9 months ago
Hello Josh, thank you for your amazing channel. In the catboost package, why do we have both 'depth' and 'max_leaves' as parameters? One would think that since the trees here are oblivious, the two are deterministically related. Can you shed some light on this?
@statquest 9 months ago
That's a good question. Unfortunately, there have been a lot of changes to CatBoost since it was originally published, and it's hard to get answers about what's going on.
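For what it's worth, my understanding from the docs is that the two parameters belong to different tree-growing policies, so they aren't redundant; here is a rough sketch of where each one would apply, with the parameter names and the claim itself to be double-checked against the current CatBoost documentation.

from catboost import CatBoostRegressor

# Default oblivious (symmetric) trees: 'depth' sets the tree size and the
# number of leaves is implied by it (2**depth), so max_leaves isn't needed here.
symmetric = CatBoostRegressor(grow_policy="SymmetricTree", depth=6, verbose=False)

# With the 'Lossguide' grow policy (my understanding: non-oblivious, leaf-wise trees),
# 'max_leaves' becomes the meaningful size limit instead.
lossguide = CatBoostRegressor(grow_policy="Lossguide", max_leaves=31, verbose=False)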
@nitinsiwach1989 10 months ago
What do bins have to do with the ordered encoding computation you mention at 11:26? In the video, you mention one use case for the bins, which is to reduce the number of thresholds tested, like other gradient boosting methods do.
@statquest 10 months ago
The bins are used to give us a discrete target value for Ordered Target Encoding (since it doesn't work directly with a continuous target). For details, see: kzbin.info/www/bejne/gYmyhYahhbFljpY
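A small sketch of that idea, with made-up heights, a single made-up threshold for the binning, and the 0.05 prior from Part 1; the bins (not the raw heights) are what the running ordered encoding sums up.

import pandas as pd

df = pd.DataFrame({"color":  ["Blue", "Green", "Blue", "Green"],
                   "height": [1.32, 1.87, 1.56, 1.72]})

# Step 1: turn the continuous target into discrete bins (the threshold here is made up).
df["bin"] = (df["height"] > df["height"].mean()).astype(int)

# Step 2: ordered target encoding uses the bins from *earlier* rows only.
prior = 0.05
running_sum, running_count, encoded = {}, {}, []
for color, b in zip(df["color"], df["bin"]):
    s, n = running_sum.get(color, 0), running_count.get(color, 0)
    encoded.append((s + prior) / (n + 1))
    running_sum[color], running_count[color] = s + b, n + 1
df["color_encoded"] = encoded
print(df)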
@aryanshrajsaxena6961 7 months ago
Will we use k-fold target encoding for the case of more than 2 bins?
@statquest 7 months ago
I believe that is correct.
@nitinsiwach1989 3 months ago
I have a few questions. 1. CatBoost is still gradient boosting, right? The residuals are computed as the first derivative of the loss function, and on the appropriate scale, which would be log(odds) for classification. You are getting the residuals as a plain difference because you are assuming L2 loss, right? 2. Are the output values also computed as the values that minimize the loss function? Or is it always the mean of the target values falling in the node? For classification, would it be the mean of the labels? For classification the output should be on the log(odds) scale; how is that done here?
@statquest 3 months ago
CatBoost is just Gradient Boosting + a different way to build trees. However, everything regarding the loss functions and how things are computed is lifted straight from Gradient Boosting. To learn more about those details, see: kzbin.info/www/bejne/aKnYlYOFd99grNU and kzbin.info/www/bejne/iaW6imiHjLKLedk
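To make question 1 concrete, here is a tiny sketch of how the "residuals" fall out of the loss function's first derivative (the negative gradient): squared error gives the plain difference, and log loss on the log(odds) scale gives observed minus predicted probability. The numbers are made up.

import numpy as np

y = np.array([1.0, 0.0, 1.0])             # targets (class labels in the log-loss case)

# Squared-error loss L = 0.5 * (y - pred)**2  ->  negative gradient = y - pred
pred = np.array([0.6, 0.2, 0.9])
residuals_l2 = y - pred                    # the plain "difference" residuals

# Log loss with predictions kept as log(odds)  ->  negative gradient = y - p,
# where p = sigmoid(log_odds); these are the residuals a classification tree fits.
log_odds = np.array([0.3, -1.2, 2.0])
p = 1.0 / (1.0 + np.exp(-log_odds))
residuals_logloss = y - p

print(residuals_l2, residuals_logloss)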
@yehonatanavidan9904 2 months ago
First, I'm a big fan; thanks for the excellent explanations! 🙏 Secondly, am I wrong, or did you shuffle the connection between the 'greens' and their targets when you randomized the favorite colors?
@statquest 2 months ago
What time point (minutes and seconds) are you asking about?
@yehonatanavidan9904 2 months ago
@statquest 1:00 vs 01:27
@statquest 2 months ago
@@yehonatanavidan9904 That's a "typo" at 1:27. I should have kept the colors and values connected.
@serdargundogdu7899 1 year ago
I wish you could replay this part again :)
@statquest 1 year ago
:)
@АлександраРыбинская-п3л 1 year ago
Dear Josh, I have a question about using CatBoost for classification. In this video, which shows how to use CatBoost for regression, we calculated the output value for a leaf as the average of the residuals in the leaf. How do we calculate the output value for classification? Do we use the same formula as for Gradient Boosting, i.e., (sum of residuals) in the numerator and sum of (previous probability(i) * (1 - previous probability(i))) in the denominator?
@statquest 1 year ago
CatBoost is, fundamentally, based on Gradient Boost, which does classification by converting the target into a log(odds) value and then treating it like a regression problem. For details, see: kzbin.info/www/bejne/oKnYf39-asmLedU
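And just to spell out the formula the question describes, here it is with made-up residuals and previous probabilities for the rows in one leaf. This is the standard Gradient Boost classification leaf output; whether CatBoost's leaf estimation matches it exactly is a separate question.

import numpy as np

residuals = np.array([0.7, -0.3, 0.6])     # residuals for the rows in one leaf (made up)
prev_prob = np.array([0.3, 0.3, 0.4])      # previous predicted probabilities (made up)

leaf_output = residuals.sum() / (prev_prob * (1.0 - prev_prob)).sum()
print(leaf_output)                         # this value is on the log(odds) scale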
@sanukurien2752 7 months ago
What happens at inference time, when the target is not available? How are the categorical variables encoded then?
@statquest 7 months ago
You use the full training dataset to encode the new data.
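In other words, the encoding for a new row comes from simple counts over all the training rows that share its category; a rough sketch with made-up data and the 0.05 prior from Part 1.

import pandas as pd

train = pd.DataFrame({"color": ["Blue", "Green", "Blue", "Green", "Green"],
                      "bin":   [0, 1, 1, 0, 1]})
prior = 0.05

def encode_new(color):
    rows = train[train["color"] == color]
    # No "ordering" for new data: every training row with that color is used.
    return (rows["bin"].sum() + prior) / (len(rows) + 1)

print(encode_new("Blue"))    # (0 + 1 + 0.05) / (2 + 1) = 0.35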
@alexpowell-perry2233 1 year ago
How does CatBoost decide on the best split at level 2 in the tree if it has to be symmetric? What if the best threshold for the LHS node is different from the best threshold for the RHS node?
@statquest 1 year ago
It finds the best threshold given that it has to be the same for all nodes at that level. Compared to how a normal tree is created, this is not optimal. However, the point is not to make an optimal tree, but instead to create a "weak learner" so that we can combine a lot of them to build something that is good at making predictions. Pretty much all "boosting" methods do something to make the trees a little worse at predicting on their own, because trees are notorious for overfitting the training data. By making the trees a little worse, they prevent overfitting.
@alexpowell-perry2233 1 year ago
@@statquest Thanks so much for the reply, but I still don't quite understand this. So does each LEVEL get a similarity score? I don't understand how you can quantify a threshold when the threshold is being applied to more than one node in the tree. In your example you showed us how to calculate the cosine similarity for a split that is being applied to just one node; how do we calculate this when it's being applied to two nodes simultaneously (in the case of a level-2 split)? I also have one more question: since the tree must be symmetrical, I am assuming that a feature (in the case of your example, "Favourite Film") can only ever appear in a tree once?
@statquest 1 year ago
@@alexpowell-perry2233 In the video I show how the cosine similarity is calculated using 2 leaves. Adding more leaves doesn't change the process. Regardless of how many leaves are on a level, we calculate the cosine similarity between the residuals and the predictions for all of the data. And yes, a feature will not be used if it can no longer split the data into smaller groups.
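A rough sketch of what that looks like as code: the same threshold is applied in every node of the level, each row's prediction is the mean residual of whichever new leaf it lands in, and one cosine similarity is computed over all the rows. The data and threshold values are made up, illustrative only.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

residuals   = np.array([1.8, -0.2, 0.6, -1.3, 0.9, -0.7])
level1_leaf = np.array([0, 0, 0, 1, 1, 1])     # which level-1 leaf each row is already in
feature     = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.6])

def score_level_split(threshold):
    # The same threshold splits every level-1 leaf, giving up to 4 level-2 leaves.
    leaf_id = 2 * level1_leaf + (feature < threshold).astype(int)
    outputs = np.zeros_like(residuals)
    for leaf in np.unique(leaf_id):
        outputs[leaf_id == leaf] = residuals[leaf_id == leaf].mean()
    return cosine_similarity(residuals, outputs)

print(score_level_split(0.5), score_level_split(0.7))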
@renyuduan 1 year ago
Thanks a lot~ I'm looking for an answer! For the new data whose "Favorite Color" is blue, why does it belong to bin #0 instead of bin #1?
@statquest 1 year ago
The new data is not assigned to a bin at all. We just use the old bin numbers associated with the Training Data (and only the training data) to convert the color, "blue", to a number. The bin numbers in the training data are used for the sum of the 1's in the numerator.
@renyuduan 1 year ago
@@statquest I misunderstood, sorry~ For new data whose "Favorite Color" is blue, we use all the rows with the same color, "blue", to get OptionCount and n.
@statquest 1 year ago
@@renyuduan Yep.
@ВалерийГайнанов-и9о 1 year ago
Thank you for your content! It's very nice, everything is clear. I hope you won't stop producing your content :)
@statquest 1 year ago
Thank you!
@yufuzhang1187 1 year ago
Dr. Starmer, when you have a chance, can you please make videos on LightGBM, which is quite popular these days? Also, can you do ChatGPT or GPT or Transformers, clearly explained? Thank you so much!
@statquest 1 year ago
I'm working on Transformers right now.
@yufuzhang1187 1 year ago
@@statquest Thank you so much! Looking forward to it!
@xaviernogueira 1 year ago
@@statquest Excited for that.
@danieleboch3224 8 months ago
I have a question about leaf outputs. Don't gradient boosting algorithms on trees build a new tree all the way down and only after that assign values to its leaves? You did it iteratively instead, calculating outputs while the tree wasn't fully built yet.
@statquest 8 months ago
As you can see in this video, not all gradient boosting algorithms with trees do things the same way. In this case, the trees are built differently, and this is done to avoid leakage.
@danieleboch3224 8 months ago
@@statquest Thanks, I got it now! But I have another question: in the CatBoost documentation there is a leaf estimation parameter (set to "Newton"), and it is weird, because the Newton method is exactly the method used to find leaf values in XGBoost; it uses the second derivative of the loss function and builds the tree according to an information criterion based on that method. Why would we need that if we already build the trees in the ordered way, finding the best split with the cosine similarity function?
@statquest 8 months ago
@@danieleboch3224 To be honest, I can only speculate about this. My guess is that they started to play around with different leaf estimation methods and found that the one XGBoost uses works better than the one they originally came up with. To be honest, the "theory" of CatBoost seems to be quite different from how it works in practice, and this is very disappointing to me.
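For anyone curious, the "Newton" leaf estimate being discussed is just a second-order step: summed gradients over summed Hessians for the rows in the leaf. This is the textbook formula (the same one XGBoost uses, without regularization), not necessarily CatBoost's exact implementation; the numbers are made up.

import numpy as np

# Log-loss example: gradient = p - y and hessian = p * (1 - p) for each row in the leaf.
y = np.array([1.0, 0.0, 1.0])              # labels of the rows in the leaf (made up)
p = np.array([0.3, 0.3, 0.4])              # current predicted probabilities (made up)

grad = p - y
hess = p * (1.0 - p)

newton_leaf_value = -grad.sum() / hess.sum()   # one Newton step, on the log(odds) scale
print(newton_leaf_value)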
@serdargundogdu7899 1 year ago
How was "favorite color < 29" changed into "favorite color < 0.87" at 8:28? Could you please explain?
@statquest 1 year ago
That's just a horrible and embarrassing typo. :( It should be 0.29.
@DeepaliBaghel-l9n 1 year ago
Big Fan !! 🙌
@statquest 1 year ago
Thanks!
@near_. 1 year ago
Awesome. I'm your new subscriber 🙂
@statquest 1 year ago
Thank you! :)
@reynardryanda245 1 year ago
At 12:41, how did you get the optionCount for the prediction? I thought it was the number of times that color appears in that bin, counted sequentially. But if it's for a prediction, we don't know the actual bin, right?
@statquest 1 year ago
At 12:41, we are trying to predict the height of someone who likes the color blue. So, in order to change "blue" into a number, we look at the training data on the left, which has two rows with the color blue in it. One of those rows is in Bin 0 and the other is in Bin 1. Thus, to get the option count for "blue", we add 0 + 1 = 1. In other words, the option count for the new observation is derived entirely from the training dataset.
@alexpowell-perry2233 1 year ago
At 11:48, when you are calculating the output values of the second tree, the residual for the 3rd record, with a Favorite Color value of 0.525 and a Residual of 1.81, gets sent down the LHS leaf, even though the LHS leaf contains Residuals that are …
@statquest 1 year ago
Oops! That's a mistake. Sorry for the confusion!
@frischidn3869 1 year ago
What will the residuals and leaf outputs be when it is a multiclass classification?
@statquest 1 year ago
Presumably it's log likelihoods from cross entropy. I don't show how this works with CatBoost, but I show how it works with Neural Networks here: kzbin.info/www/bejne/bHLVhKypatZ7d7c
@alphatyad8131 1 year ago
Excuse me again, Dr. Starmer. Do you know how CatBoost determines the final number of trees (I mean, out of the many trees that gradient boosting builds) before the model becomes a rule that can predict new data? I haven't found a source that gives an explicit explanation of how many decision trees CatBoost builds before it can be used to predict. Thanks in advance, Dr. (Or, for anyone who knows, I would appreciate your help.)
@statquest 1 year ago
You build a bunch of trees and see if the predictions have stopped improving. If so, then you are done. If not, and it looks like the general trend is to continue improving, then build more trees.
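In practice that "stop when the predictions stop improving" idea is usually automated with a validation set. Something like the sketch below is how I would expect it to look with the catboost package; the data is made up, and the parameter names are from memory, so please verify them against the docs.

import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)
X_train, X_val, y_train, y_val = X[:400], X[400:], y[:400], y[400:]

model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=False)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=50,   # stop once the validation score stops improving
    use_best_model=True,        # keep only the trees up to the best iteration
)
print(model.tree_count_)        # how many trees were actually kept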
@alphatyad8131 1 year ago
I got it and really appreciate it, Dr. And if I could ask again: is it safe to say that CatBoost is similar to the XGBoost method in the way it chooses features for building the trees (the predictor) and, in this case, determining the classification class for the given data?
@statquest 1 year ago
@@alphatyad8131 They're pretty different. To learn more about XGBoost, see: kzbin.info/www/bejne/haWnaaqMlqugbKc and kzbin.info/www/bejne/bpOUe3h6q8qhh7c
@alphatyad8131 1 year ago
@@statquest Well explained, Dr. Josh Starmer. Actually, I'm still learning by watching your videos on machine learning. I appreciate it; I no longer feel stuck in the same place as before, thanks to your help. Have a nice day, Dr.
@recklesspanda28 1 year ago
Does it still work like that if I use classification?
@statquest 1 year ago
I believe classification is just like classification for standard Gradient Boost: kzbin.info/www/bejne/oKnYf39-asmLedU
@recklesspanda28 1 year ago
@@statquest Thank you 🤗
@bhavanisankarlenka 20 days ago
Hurray Great BAMM!!😄
@statquest 20 days ago
Thank you!
@alphatyad8131 1 year ago
Dr. Starmer, I tried to calculate it manually, and with a calculator too, several times, but my answer was different from the result at 7:23. I get 0.7368, but the video shows 0.79. Am I missing something? Does anyone get the same result as me?
@statquest 1 year ago
That's just a typo in the video. Sorry for the confusion.
@alphatyad8131 1 year ago
Okay. Thank you for your attention and the great explanation, Dr. Josh Starmer. Such an honor and my pleasure to contribute to this video. Have a great day, Dr.
@YUWANG-du4pv 1 year ago
Dr. Starmer, could you explain LightGBM? 🤩
@statquest 1 year ago
I'll keep that in mind.
@TrusePkay 1 year ago
Do a video on LightGBM
@statquest 1 year ago
I'll keep that in mind.
@nilaymandal2408 10 months ago
5:28
@statquest 10 months ago
A good moment.
@TheDankGoat 1 year ago
obnoxious, arrogant, has mistakes, but useful....
@statquest 1 year ago
What parts do you think are obnoxious? What parts are arrogant? And what time points, minutes and seconds, are mistakes? (The mistakes I might be able to correct, or at least have a note mentioning them.)
@AzhaanNazim 5 days ago
worst
@statquest 5 days ago
Noted!