For more captivating community talks featuring renowned speakers, check out this playlist: kzbin.info/aero/PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT
@24brophy • 4 years ago
This is the single best ML video on the internet. Dave for President 2020.
@ghexer • 5 years ago
This was really great Dave. I've done a bunch of your tutorials online including the intro to data science videos you did using the Titanic Kaggle competition about 4 years ago. What I enjoyed the most about this video was seeing how much more confident and impassioned you have become as a data scientist since those prior videos. You can tell that it really excites you and that is infectious in a teaching environment. I too have become somewhat hooked on data science and I was one of those students that avoided statistics at all costs at every level of education. I'm looking into coming to one of the data science bootcamps at the data science dojo and really looking forward to learning from people that are equally passionate about data science and hopefully making up some lost ground. Keep up the great work.
@LuthieriadeBanheiro • 3 years ago
Excellent class!
@seanpitcher8957 • a year ago
Oh. My. God. THIS... THIS!!!!! This literally changes everything.
@yanivtubul • 5 years ago
Thanks a lot! Doing my first steps into R and Machine Learning. This talk is exactly what I needed
@QuickFlicksx • a year ago
Great!
@pipertripp • 2 years ago
This was excellent. I've learnt quite a lot and have a few new books for the reading list. Many thanks!
@Datasciencedojo • 2 years ago
Keep following us for more tutorials.
@pipertripp • 2 years ago
@@Datasciencedojo will do!
@shaunoconnell9506 • a year ago
Very nice, I just used this package for an assignment. This got me enthusiastic to learn more.
@Datasciencedojo • a year ago
Glad to help you, Shaun.
@acada • 2 years ago
Excellent presentation, you are a great teacher. Thank you
@Datasciencedojo • 2 years ago
Keep following us for more crash courses!
@reubenschneider3921 • 5 years ago
Great guide, I was really struggling with a ML assignment and didn't realise what an absolute unit 'caret' is!
@CK-vy2qv • 6 years ago
By far the best video out there for ML in R
@tamafun4745 • 2 years ago
Thank you very much, Dave & team. Really enjoyed the whole presentation and learned a lot!
@Datasciencedojo • 2 years ago
Glad you liked it, Tama. Keep following us for more content.
@bljangir7450 • 4 years ago
Simply excellent, I could not hold myself back from commenting even though a few minutes are still left. You are a genius for making things so interesting.
@henrique6748 • 3 years ago
Thank you so much for sharing this!
@antzlck • 5 years ago
Brilliant and great advert for your bootcamps!
@julianonas • 7 years ago
Thank you for sharing! Amazing video and instructions.
@Datasciencedojo • 7 years ago
@Juliano Nascimento - Glad you liked the video!
@a.useronly2266 • 2 years ago
Great 👍🏻
@TIKITAKANEWS • 6 years ago
Thanks very much. Now I have somewhere to start and can see it through to the end. Great!!
@neuro1152 • 7 years ago
Hi Dave, first of all thanks for the video. AWESOME stuff! One question: are the hyperparameters for the xgboost algorithm universal, or are they tuned specifically to this training set? Could I get the reference for the hyperparameters? It was cut off in the code editor screen. Thanks again.
@Datasciencedojo • 7 years ago
@Overlooking the Obvious - This is an important question. While the list of hyperparameters for any algorithm will always be the same, the values of each individual hyperparameter are tuned in the context of a particular data set. For example, you may find some values that are optimal for your training data set. You then perform feature engineering and add a new feature. There is no assurance that the previous hyperparameter values are still optimal, hence it is common practice to tune later in the project cycle when you arrive at a stable list of features. Here's a link to a great reference on xgboost hyperparameter tuning: www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1 HTH, Dave
@KarriemPerry • 5 years ago
I do appreciate Dave's approach. I think it's important to stress that there is a lot more to being a data scientist than simply understanding concepts of ML, AI, etc., or taking a few online courses for a certificate. I believe it takes graduate coursework and years of being a practitioner, understanding and implementing a list of techniques. Engineers typically vector into data analytics completely differently than I do, having an MS in data analytics. It is a good illustration of just how complex and broad the science of data is in these infant stages.
@Datasciencedojo • 7 years ago
Meetup Starts at: 2:57
@stockspotlightpodcast • 4 years ago
Great video! Only one question. When you say that set.seed(54321) is not random, what do you mean? I thought whatever we put in set.seed could be anything, e.g., set.seed(321). What is the meaning behind your 54321? You sort of glossed over that part and I'd love to dive a little deeper into that.
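For anyone else wondering the same thing: the value passed to set.seed() is arbitrary; what matters is that fixing any value pins R's random number generator to a known state, so the "random" split is reproducible from run to run. A minimal sketch:

```r
# The seed value itself (54321, 321, ...) carries no special meaning;
# any fixed value makes subsequent "random" draws reproducible.
set.seed(54321)
a <- sample(1:100, 5)

set.seed(54321)
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same draw
```

Two different seeds will (almost certainly) give different draws, but each is individually reproducible.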
@sebastianvarela2190 • 5 years ago
Hi Dave, very instructive video, congratulations. Please let me ask you a question: I know caret does not impute with factors. But what do you do in practice when you need to impute categorical/factor variables (setting aside the mode)? In the example of your video, in the dataset "imputed.data" you have two columns/dummies for Sex. If you, hypothetically, impute missing values for them, how do you take them back to the original dataset, in which there is only one column for Sex?
@joseramon4301 • 4 years ago
Thank you so much!
@wereskiryan • 3 years ago
Amazing video!
@erinklark • 6 years ago
Thanks for the video! Quick question - why do you have to split the data into a training/test set of 70/30 when you are going to do 10-fold cross-validation (90/10 split?) anyway later on? Are these two different things?
@ravi281381 • 5 years ago
I am running into a similar problem. I built a very simple model with complete cases with this data. I didn't do a test-train split, as CV was supposed to give me an out-of-sample metric. I got an ROC score of 0.8. I uploaded the model and Kaggle gave an accuracy of 0.52. Now I am confused about what purpose CV served.
@hannukoistinen5329 • a year ago
What is your actual test? What do you want to explain? Model fit: where is it? Coding is 'impressive', but you must get some real results too.
@Datasciencedojo • a year ago
Sure Hannu! We will work on explaining some real results in the future. Thank you for your suggestion.
@atlantaguitar9689 • 4 years ago
Great video... Do you feel it is necessary to use dummyVars before doing the imputation? Isn't it sufficient to do the imputation within the call to the train function as part of the preProcess argument? That is, is the conversion to one-hot encoding outside of the call to train strictly necessary?
@arindambpcsrkm • 3 years ago
@Dave, I understood how you imputed the age. However, if we have, say, 200 missing values for the Embarked column, will the same method used for imputing Age work? I mean, isn't it possible that for some rows both the Q and S dummies might end up with values close to 1? What do we do in that case?
@shorthand1121 • 5 years ago
If you get "subscript out of bounds" in the train() function, change the parallelization engine over to the future engine, as it is better at exporting environments:

library(parallel)
library(future)
library(doFuture)
plan("multisession")  # if you're seeing this error, you're likely on a Windows machine anyway
registerDoFuture()

And also comment out the makeCluster, registerDoSNOW, and stopCluster lines.
@sbdavid123 • 4 years ago
Great video, I have watched it several times at this point to get a better understanding of the caret package. It helped me out a lot. However, I have one question. Why do you split into train and validation sets and then use cross-validation on train? I always thought that cross-validation was a repeated train/test split. This way you avoid evaluating your model on only one split, which by chance might be easy (or super hard) to predict, e.g. because the test subset contains more extreme values or the train subset contains more of the imputed instances. By repeating the process of splitting the data into train and test several times and averaging the performance metrics over all these splits, you get a better view of the real performance of the model. So why do you split into train/test subsets and then use cross-validation on train? As I understand it now, it looks like you are reintroducing the problem cross-validation is trying to solve. Would it not be better to skip the train/test split and use k-fold cross-validation (which is basically a repeated split into train and test)? Thanks!
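For reference, the pattern being asked about looks roughly like this (a sketch using iris as stand-in data, not the video's Titanic code). The usual rationale: CV inside the 70% is for tuning and model selection, while the untouched 30% gives a final estimate that tuning never influenced.

```r
library(caret)

# Hold out a final test set that tuning never sees...
set.seed(54321)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train.data <- iris[idx, ]
test.data  <- iris[-idx, ]

# ...while repeated 10-fold CV runs *inside* the 70% to pick
# hyperparameters; the two splits serve different purposes.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```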
@alisterdcruz1667 • 3 years ago
I noticed that the other columns with a large number of NAs were removed, and while imputing the Age variable all the other factors had no NAs. What should I do if the variables that are critical for imputing the Age variable also have NAs? I'm a noob, so please correct me if there is a lack of logic in my doubt.
@flamboyantperson5936 • 6 years ago
Extremely helpful video. I don't know what the concept of grid search does. Can you explain to me in simple terms how it works and how it helps in tuning the model? Thank you.
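In case it helps anyone else: grid search just means enumerating every combination of candidate hyperparameter values, fitting a model for each (usually under cross-validation), and keeping the combination with the best resampled metric. A toy sketch with made-up values (a real xgbTree grid for caret needs all seven of its tuning parameters; this only shows the idea):

```r
# Each row of the grid is one candidate combination of hyperparameters.
tune.grid <- expand.grid(eta       = c(0.05, 0.1),
                         nrounds   = c(50, 100),
                         max_depth = c(4, 6))
nrow(tune.grid)  # 8 candidate combinations (2 x 2 x 2)
# caret::train(..., tuneGrid = tune.grid) then fits each combination
# under the resampling scheme and reports the best one.
```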
@mahdip.4674 • 5 years ago
Thanks for the tutorial. Talking about model-based imputation, let us say we have 3 numeric variables to impute. How does the imputation work if we want to impute the first variable? Does caret consider a complete-case approach for the rest of the data? If so, how does it then impute the original first variable if it happens that, for a given record, the second or third variable also has a missing value? What is the procedure here? Thanks.
@yannelfersi3510 • 6 years ago
Great video @Dave. Super helpful; I love the step-by-step Q&A. Just curious: is it 'good' practice to include the test set when imputing data? Shouldn't it be done on the train set only?
@fredasefamilia • 4 years ago
Thank you for this. However, I tried implementing the code as written in IntroToMachineLearning.R and I get an error at line 159. I have tried it several times and the error message I get is:

Loading required package: plyr
Error in train.default(x, y, weights = w, ...) :
The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight

This is all confusing, given that all the columns specified are included in the code. Could this be the result of a bug? I'd appreciate a prompt answer to this. Thanks
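For anyone hitting this: caret requires the tuneGrid data frame for xgbTree to contain exactly the tuning columns it lists (current caret versions also expect a subsample column), and a missing or misspelled column name triggers this error even when the grid "looks" complete. A hedged sketch of a grid that satisfies the check — the values are illustrative, not the video's:

```r
# All seven xgbTree tuning parameters that recent caret versions expect;
# check colnames() carefully for typos if the error persists.
tune.grid <- expand.grid(nrounds          = c(50, 100),
                         max_depth        = c(4, 6),
                         eta              = c(0.05, 0.1),
                         gamma            = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample        = 1)
# then: train(..., method = "xgbTree", tuneGrid = tune.grid)
```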
@bobbird4957 • 7 years ago
Dear David, great talk, thank you very much. I have a short question: how do I know which factors are included in the "best" model? Thus, which factors are most predictive in separating survivors from non-survivors? Thank you in advance! Best, Bob
@bobbird4957 • 7 years ago
Thanks Dave!
@Datasciencedojo • 7 years ago
@Robert Daihatsu - If you mean individual factor levels, then that can be difficult to get from the models. Finding feature importance, however, is far easier. For example, the following code can be added to the end of the Meetup code file to get the feature importance: xgb.importance(feature_names = colnames(titanic.train), model = caret.cv$finalModel) HTH, Dave
@nikhitharajashekar1637 • 7 years ago
Great video, easy to understand! But I have a doubt: how are the resampling results across tuning parameters selected?
@Datasciencedojo • 7 years ago
@Nikhitha Rajashekar - As I mention in the video, while caret can perform stratified cross validation, the video does not demonstrate this. As coded, the video illustrates using cross validation with simple random sampling to create each of the 10 folds for each of the repeats (i.e., 30 total folds each created with random sampling). HTH, Dave
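As a rough sketch of the setup described above (10 folds × 3 repeats = 30 total fits per candidate model; illustrative, the Meetup code itself may build the folds differently):

```r
library(caret)

# 10-fold CV repeated 3 times: each candidate hyperparameter combination
# is fit 30 times, with folds drawn by simple random sampling
# (not stratified in this sketch).
ctrl <- trainControl(method  = "repeatedcv",
                     number  = 10,
                     repeats = 3)
```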
@venustat • 7 years ago
Great video
@Datasciencedojo • 7 years ago
@Statsvenu Manneni - Glad you liked the video!
@rajkamalsrivastav7696 • 7 years ago
Hi David, thanks for this session!! One question: is it always good to go with imputation using caret (e.g. bagged decision trees for imputing age), or should we do some EDA, such as finding a pattern in age using Pclass/Sex aggregation, and then impute the age with that value?
@Datasciencedojo • 7 years ago
@Raj kamal Srivastav - I tend to shy away from terms like "always" and "never" when it comes to data science. The only universal answer I've found is "it depends". :-) To answer your specific question, I always strongly suggest doing exploratory data analysis - in fact we spend a good chunk of day 1 in our bootcamp discussing EDA. However, it is often the case that even after EDA you might need a ML model to most accurately impute ages due to the underlying complexities in the patterns in the data. HTH, Dave
@paulvictor3316 • 7 years ago
Great video! In regards to preProcess(..., method = "bagImpute") what's your definition of SMALL DATA? Would 5000 rows with 10 columns be small?
@Datasciencedojo • 7 years ago
@Paul Victor - Glad you liked the video. Your question is apt as "big data" vs. "small data" is a subjective measure that depends on the situation. In this particular case, caret will create N bagged decision tree models where N is the number of predictors in your data frame. A 5000x10 matrix would be fine, but you certainly wouldn't want to use this functionality on other problems like text analytics where you could have tens of thousands of rows and tens of thousands of columns. HTH, Dave
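A tiny sketch of the bagged imputation being discussed, on toy data (not the Titanic columns). preProcess() with "bagImpute" builds one bagged-tree model per column, predicting each column's missing entries from the other columns — which is why the cost grows with the number of predictors:

```r
library(caret)  # bagImpute also requires the ipred package installed

# Toy numeric data with one missing value in x2.
df <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                 x2 = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

pp      <- preProcess(df, method = "bagImpute")
imputed <- predict(pp, df)
anyNA(imputed)  # FALSE: the NA in x2 has been filled in
```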
@aman_mashetty5185 • 6 years ago
Thanks for the amazing video, Dave. It's going to help in the future too, but I don't know what the grid search concept does. Can you explain to me in simple terms how it works and how it helps in tuning the model?
@collinsouru3629 • 4 years ago
I am reproducing your example but am stuck at training the model; it is returning an error like: Error: The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample. What could I be doing wrong? Here is the part which is returning the error: caret.cv
@hasthigiSrivaradhan1 • 6 years ago
Thank you.
@JerryWho49 • 4 years ago
Isn't there some sort of data leakage? You're imputing the missing ages using the entire data set. So the training set "knows" something about the test set. That's not good. I think you should split first and then use two pipelines for training and testing. Is there support for pipelines in caret?
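For what it's worth, one leakage-free pattern with caret is to fit the preProcess object on the training rows only and then apply that same fitted transform to both sets. A sketch (iris as stand-in data, center/scale as stand-in transforms):

```r
library(caret)

set.seed(54321)
idx     <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train.x <- iris[idx, 1:4]
test.x  <- iris[-idx, 1:4]

# Learn the preprocessing parameters from the training rows only...
pp <- preProcess(train.x, method = c("center", "scale"))

# ...then apply the identical fitted transform to both sets, so no
# information from the test rows influences the fit.
train.t <- predict(pp, train.x)
test.t  <- predict(pp, test.x)
```

caret has no full pipeline object like scikit-learn's, but passing preProcess = c(...) to train() applies this same fit-on-train-only logic inside each resampling fold.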
@apoorvspydy • 6 years ago
Thanks! This was very helpful. Where can I get the rest of the videos on Machine Learning?
@Datasciencedojo • 6 years ago
You can watch more of our Machine Learning tutorials here: tutorials.datasciencedojo.com/azure-machine-learning-tutorial-part-1/ You can also find our other meetups here as well: tutorials.datasciencedojo.com/categories/community-talks/
@aakashchugh9 • 6 years ago
Great video, caret is amazing. One question though: if we are doing stratified sampling, do we still have to balance the data? Because if we don't balance the data the outcome will be biased, and if we do balance it, isn't that manipulation?
@gregorkvas6332 • 3 years ago
The result of stratified sampling with respect to the outcome (survived/non-survived) is a balanced training and test set with respect to the outcome (survived/non-survived). Best, Gregory
@NikosKatsikanis • 7 years ago
Hi, I am a JS expert wanting to get into DS. What tools do you advise me to learn?
@navjotsingh2251 • 6 years ago
@Quantum Information: learn what you need for the job you want. Different jobs require different tools for different tasks. Figure out what you want to do, then figure out what tools will get you there.
@djangoworldwide7925 • 2 years ago
You are a nice American chap
@coolhead8686 • 4 years ago
Are you hiring? I am in the same position as you. I have spent over 20 years doing system development and working as a programmer analyst, data analyst and data scientist.
@junaideffendi4860 • 7 years ago
Great video, but I didn't see the use of train.dummy? You worked on the train dataset, which has the imputed age but not the dummy columns. Please clarify.
@junaideffendi4860 • 7 years ago
I get it now: the dummy variables were calculated in order to do the imputation. In this case, you did the dummy variable step to teach, because there were no missing values in any columns other than Age, so in reality we can skip the dummy part; correct me if I am wrong. Also, I actually thought that the dummy variables were created for the training part too. A question arises from that: how does the machine learning handle the categorical variables? Is it converting them into numerical values, like one-hot encoding, automatically, or processing them directly, as in a simple decision tree?
@junaideffendi4860 • 7 years ago
Thanks :)
@Datasciencedojo • 7 years ago
@Junaid Effendi - If I understand your question correctly, the following lines of code use train.dummy to generate a new matrix with all missing Age values imputed: pre.process
@Datasciencedojo • 7 years ago
@Junaid Effendi - If you use caret's imputation feature via the preProcess() function then you need to convert to dummy variables. As I mention in the video, the preProcess() function does not work with factor variables. You would not want to skip this step as you are losing potential features that the bagged decision trees could use to potentially build more accurate imputation models. To answer your second question, "it depends". For example, in the case of the mighty Random Forest factors can be used directly so caret will do nothing. However, xgboost does not support factors by default. In this case caret is transforming the factors behind the scenes for you. HTH, Dave
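A tiny sketch of the dummy-variable step being described (toy columns, not the Titanic data; the "Sex.female"/"Sex.male" names follow dummyVars' default "." separator):

```r
library(caret)

# dummyVars() turns factor columns into numeric indicator columns,
# which is needed because preProcess() skips factors during imputation.
df <- data.frame(Sex = factor(c("male", "female", "male")),
                 Age = c(22, NA, 35))

dv     <- dummyVars(~ ., data = df)
df.num <- as.data.frame(predict(dv, newdata = df))
names(df.num)  # "Sex.female" "Sex.male" "Age"
```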
@drnabinpaudel6984 • 4 years ago
So, how do we implement this model to a new dataset ?
@gregorkvas6332 • 3 years ago
Using the predict() function on the trained caret model object. Best, Gregory
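As a rough sketch of that (a small rpart model on iris stands in for the video's xgboost run; the new data must carry the same columns and factor levels the model was trained on):

```r
library(caret)  # also requires the rpart package, shipped with R

set.seed(54321)
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Score previously unseen rows with the trained caret model object.
new.obs <- iris[c(1, 75, 150), 1:4]
predict(fit, newdata = new.obs)  # one predicted class per new row
```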
@farhanadham7237 • 3 years ago
Wow, it's amazing for tuning parameters on xgboost, because we know xgboost always takes so much time for training.
@yishengkim9081 • 6 years ago
Great video, but waiting ~ 5 mins to be recognized as having a question is troubling (between ~55:00 - 59:00). #WomeninDataScience
@Ivansnooze • 6 years ago
Troubling? Half the time she does not even put her hand up, and furthermore she keeps taking it down. Seems a bit of a stretch to make this a gender issue. I know there is probably a bias against women in this field, but not every situation should be used as a call to arms.
@yishengkim9081 • 5 years ago
@@Ivansnooze Isn't that what men ALWAYS do, minimize the gender concerns of women? I'm a part-time chemistry professor, I don't need my students to keep their hands up for the entire lecture to recognize their questions. And when I see a student put their hand down after having raised it earlier, I'll double back to ask them if they still have a question. That's what good lecturers do.
@mushroomdew • 4 years ago
@@yishengkim9081 He did speak to her; her question wasn't missed. Are you sure the cause isn't her being at the back of the room?