Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial

  26,311 views

Patrick Loeber

4 years ago

Get my Free NumPy Handbook:
www.python-engineer.com/numpy...
In this Machine Learning from Scratch Tutorial, we are going to implement a Decision Tree algorithm using only built-in Python modules and numpy. We will also learn about the concept and the math behind this popular ML algorithm.
Part 1 covers the theory, and Part 2 contains the implementation.
~~~~~~~~~~~~~~ GREAT PLUGINS FOR YOUR CODE EDITOR ~~~~~~~~~~~~~~
✅ Write cleaner code with Sourcery: sourcery.ai/?... *
📓 Notebooks available on Patreon:
/ patrickloeber
⭐ Join our Discord: / discord
If you enjoyed this video, please subscribe to the channel!
The code can be found here:
github.com/patrickloeber/MLfr...
You can find me here:
Website: www.python-engineer.com
Twitter: / patloeber
GitHub: github.com/patrickloeber
#Python #MachineLearning
----------------------------------------------------------------------------------------------------------
* This is a sponsored link. By clicking on it you will not incur any additional costs; instead, you will support me and my project. Thank you so much for the support! 🙏

Comments: 62
@vincent4571
@vincent4571 3 years ago
Hello, is there any post-pruning method for this decision tree implementation?
@jensenlow4376
@jensenlow4376 2 years ago
Thanks for this helpful video! It was very intuitive to follow. However, a quick question: is it normal that we do not get the same accuracy score compared to the scikit-learn implementation of DecisionTreeClassifier(criterion="entropy")?
@alitaangel8650
@alitaangel8650 4 years ago
Great video, the last two typos are funny. 😂
@patloeber
@patloeber 4 years ago
Hehe thanks :)
@stephenv9957
@stephenv9957 3 years ago
Hello sir, how do we split the x_column if it contains categorical data?
@user-wm7ng3bm6x
@user-wm7ng3bm6x 2 years ago
Hi, every time I run the decision tree on the same dataset, I get a different result. Could you tell me why? I think it's because of the line "feat_idxs = np.random.choice(n_features, size=self.n_feats, replace=False)". But in the breast cancer example, n_features equals self.n_feats, so in every node we get the same feat_idxs, just in a different order. Although the order is different, the best_feat selected at each node should be the same, because we always want the feature that produces the best information gain. But in reality it is not, so why?
@thomas_ad
@thomas_ad 1 year ago
Awesome video. Small bug: should pass depth+1 to grow_tree and not 1 :)
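A minimal sketch of that fix, assuming the recursive calls inside _grow_tree match the video (the exact variable names are assumptions):

left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)    # pass depth + 1, not 1
right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)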
@prathampandey9898
@prathampandey9898 2 years ago
Is the formula that you have used to calculate entropy correct? I think the correct formula is p = p * np.log2(p) rather than p = np.log2(p).
@datascientist2958
@datascientist2958 3 years ago
Sir, I implemented this in a Jupyter notebook and got a decision tree module error. How do I resolve it? Thanks
@harshkumarsingh5815
@harshkumarsingh5815 2 years ago
Can I know the reason behind the usage of the asterisk in the Node class?
@projeshbasnet1695
@projeshbasnet1695 1 year ago
If you use a bare * in a function signature, it forces you to pass the following arguments as keyword arguments when calling that function, rather than as positional arguments.
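A tiny illustration of that behavior (a sketch with hypothetical field names, not necessarily the exact Node class from the video):

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        # everything after the bare * is keyword-only
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

leaf = Node(value=1)                     # OK: passed by keyword
# bad = Node(None, None, None, None, 1)  # TypeError: value is keyword-only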
@revolutionarydefeatism
@revolutionarydefeatism 2 years ago
What is the recommended way to avoid a greedy search?
@dtmister2167
@dtmister2167 3 years ago
Sir, you are awesome. Thank you for presenting this implementation in an easier way. Thank you, sir!
@patloeber
@patloeber 3 years ago
glad you enjoyed it :)
@shiva2874
@shiva2874 6 months ago
Thanks a lot sir
@shreejanshrestha1931
@shreejanshrestha1931 3 years ago
It would have been great if the link to the first video were in the description section, as that would keep everything organized. ❤
@patloeber
@patloeber 3 years ago
thanks for the tip! I put together an all-in-one video if you want to have everything together
@jayasuryam8575
@jayasuryam8575 3 years ago
This is awesome. Thank you.
@patloeber
@patloeber 3 years ago
Thanks! Glad you like it
@saurrav3801
@saurrav3801 3 years ago
Bro, how do I implement ML algorithms in PyTorch?
@phantheduy4522
@phantheduy4522 2 years ago
Thanks a lot, sir.
@Rohan-22
@Rohan-22 4 years ago
Awesome, thanks a lot from ❤️!
@patloeber
@patloeber 3 years ago
glad you like it!
@moccadrea2932
@moccadrea2932 3 years ago
Is this decision tree using the C4.5 algorithm?
@ariowaskita6802
@ariowaskita6802 4 years ago
Hello, this video is really great. Can I print the tree that the DecisionTree class builds from the dataset?
@patloeber
@patloeber 4 years ago
Hi. You need to implement the printing function yourself and add it to the class. It can easily be implemented recursively (similar to the traverse function). Let me know if you need more help :)
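A hypothetical sketch of such a method, to be added to the DecisionTree class (it assumes the node attributes and the is_leaf_node helper from the tutorial):

def print_tree(self, node=None, indent=""):
    # start at the root on the first call
    if node is None:
        node = self.root
    if node.is_leaf_node():
        print(indent + "Leaf:", node.value)
        return
    # internal node: show the split, then recurse into both children
    print(indent + "X[{}] <= {}?".format(node.feature, node.threshold))
    self.print_tree(node.left, indent + "  ")
    self.print_tree(node.right, indent + "  ")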
@LIMLIMLIM111
@LIMLIMLIM111 4 years ago
I noticed that when you implement recursion in the _grow_tree method, shouldn't the best feature used for the current split be eliminated from the next split sequence? I'm thinking it is implicitly handled through the indexes.
@LIMLIMLIM111
@LIMLIMLIM111 4 years ago
For specificity, I meant in the code:

left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)
right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)

Shouldn't there be some kind of mechanism that tells the next node not to consider the "Rain" feature, for example, since it was already used for a split? I am guessing that in the _information_gain method:

if len(left_idxs) == 0 or len(right_idxs) == 0:
    return 0

prevents selection of the same split at the immediate node, but I am not sure about more distant nodes.
@patloeber
@patloeber 4 years ago
This is a very good question! But no, used features do not need to be eliminated for the next search. It is very likely that in child nodes the same feature can be used again, but with another threshold. We are always looking for the best feature/threshold pair. In this simple example, if rain is already used, the tree can learn that another rain split will not yield more information and hence just not select it again. But in complex real-world examples the tree could see that a second rain split would be beneficial. That's the beauty of this algorithm :) Of course there are different attempts to improve the feature selection process, but usually it is not necessary to exclude the used ones.
@LIMLIMLIM111
@LIMLIMLIM111 4 years ago
Python Engineer I see, thank you for bringing up the real-world scenario. I admire that you constantly answer questions and help people out. I am not only learning the internals of the algorithms from you, but also a great personality and passion. Thank you!
@patloeber
@patloeber 4 years ago
You are welcome! You can always reach out when you have questions :)
@livingontheedge2767
@livingontheedge2767 2 years ago
Hey Patrick, this is very helpful. Are those PDFs your personal notes or from some book?
@patloeber
@patloeber 2 years ago
thank you. they are personal notes...they are available on my Patreon
@aaronk8297
@aaronk8297 4 years ago
For the feature indices array (feat_idxs -> 13:38), why does the array have to be in a random order?
@patloeber
@patloeber 4 years ago
It is not necessary, but it adds a random effect to our training. This can improve training and help avoid overfitting. It is also one of the random effects of the "Random Forest", so please check the next tutorial, too :) Usually with a lot of features we do not want to search over all the features but only over a random subset. And we want this in a random order so our training is not always exactly the same...
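A small sketch of that subsampling idea (the numbers are just for illustration):

import numpy as np

n_features = 30   # e.g. the breast cancer dataset
n_feats = 10      # only consider a random subset of features at each split
feat_idxs = np.random.choice(n_features, size=n_feats, replace=False)
# a different subset (and order) on every call, e.g. array([ 3, 27,  0, 14, ...])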
@navinbondade5365
@navinbondade5365 4 years ago
@@patloeber I guess we choose a random subset of the features to find which one has the highest information gain, and the one with the highest information gain becomes the parent node. Please correct me if I'm wrong.
@patloeber
@patloeber 4 years ago
@@navinbondade5365 correct!
@navinbondade5365
@navinbondade5365 4 years ago
Python Engineer sir, can I have your email ID? Actually, I'm a beginner and trying to learn machine learning on my own.
@patloeber
@patloeber 4 years ago
Please check out my website python-engineer.com for more contact information :)
@alanlee3701
@alanlee3701 4 years ago
Helpful video! Could you please make another episode about decision tree regression for forecasting?
@patloeber
@patloeber 4 years ago
Thanks! I will consider this
@jasonyam3282
@jasonyam3282 4 years ago
Great video, but I see a for loop inside a for loop. Any way to avoid this inefficiency?
@patloeber
@patloeber 4 years ago
Hi, thanks for the feedback :) The code is not designed for efficiency but rather to be as simple as possible for learning purposes. Feel free to optimize it :) However, as we are doing a greedy search and want to go over all features and all thresholds, avoiding the two loops is difficult. One option is to use a list comprehension for all the thresholds: gains = np.array([self._information_gain(y, X_column, t) for t in thresholds]) and then use gains.max() to get the best gain, but this is still a second loop. If you have any other ideas, let me know :)
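Laid out as code, that variant would look roughly like this (a sketch, assuming the tutorial's _information_gain signature):

thresholds = np.unique(X_column)
gains = np.array([self._information_gain(y, X_column, t) for t in thresholds])
best_gain = gains.max()
best_thresh = thresholds[gains.argmax()]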
@jasonyam3282
@jasonyam3282 4 years ago
@@patloeber What I actually find so great about this episode is that it can walk beginners through the algorithm while using Python. I am really drawn to these videos. By the way, which IDE do you think is more suitable for data science, VS Code or PyCharm?
@patloeber
@patloeber 4 years ago
@@jasonyam3282 Thank you! I'm glad you like it. Both are great IDEs and have a lot of support for data science tasks. I prefer VS Code nowadays, especially since it has Jupyter Notebook support.
@jasonyam3282
@jasonyam3282 4 years ago
@@patloeber Can I please ask what you do for a living? And what Mac have you been using? Do you run DL projects locally or on a remote machine?
@patloeber
@patloeber 4 years ago
I'm a software engineer. I have a MacBook Air (2014), but yes for expensive DL training I can use a remote machine.
@yousufqadri4893
@yousufqadri4893 3 years ago
Is this recursive C4.5?
@patloeber
@patloeber 3 years ago
Yes, some functions here are implemented in a recursive way. This is very common for tree structures.
@techswithahmadadnan1552
@techswithahmadadnan1552 3 years ago
Thank you, you are amazing!
@patloeber
@patloeber 3 years ago
thanks :)
@techswithahmadadnan1552
@techswithahmadadnan1552 3 years ago
@@patloeber I tried to use a dataset stored on my PC, but I got an error at the next step when defining data = data.data. What do I have to do?
@keerthanaanil4558
@keerthanaanil4558 3 years ago
I need to detect whether something is hate speech or not, so how can I add its features to this?
@patloeber
@patloeber 3 years ago
my next video will be about text classification so keep an eye out for that :)
@seyeeet8063
@seyeeet8063 3 years ago
Hmm, why do we have multiple thresholds, and how do we use them again?
@patloeber
@patloeber 3 years ago
Because we want to find the best split threshold for each node, we go over all possible thresholds.
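For one feature column, the candidate thresholds are simply its unique observed values, roughly like this (a sketch; variable names are assumptions):

X_column = X[:, feat_idx]
thresholds = np.unique(X_column)  # every observed value is a candidate split point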
@RohanOxob
@RohanOxob 2 years ago
9:33
@VictorGiustiniPerez_
@VictorGiustiniPerez_ 1 year ago
I don't think I have enough understanding of the algorithm as a whole, but nevertheless I think there is not much explanation, and the overview/usage of the variables is quite hard to follow. But thanks anyway!
@davidbang9625
@davidbang9625 3 years ago
Your entropy calculation is incorrect. It should be -np.sum(p * np.log2(p)). You can check with the values here: homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf (slide 4)
@patloeber
@patloeber 3 years ago
That's basically what I used, isn't it? -np.sum([p * np.log2(p) for p in ps if p > 0]) The only difference is that we filter out zero values, because log(0) is undefined.
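Written out, the entropy function as quoted looks roughly like this (np.bincount assumes non-negative integer class labels, as in the breast cancer example):

import numpy as np

def entropy(y):
    hist = np.bincount(y)   # count of each class label
    ps = hist / len(y)      # class probabilities
    # skip p == 0 terms, since log2(0) is undefined
    return -np.sum([p * np.log2(p) for p in ps if p > 0])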
@tobiemmanuel8306
@tobiemmanuel8306 2 years ago
Not clear 😪