For more captivating community talks featuring renowned speakers, check out this playlist: kzbin.info/aero/PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT
@24brophy • 4 years ago
This is the single best ML video on the internet. Dave for President 2020.
@ghexer • 5 years ago
This was really great Dave. I've done a bunch of your tutorials online including the intro to data science videos you did using the Titanic Kaggle competition about 4 years ago. What I enjoyed the most about this video was seeing how much more confident and impassioned you have become as a data scientist since those prior videos. You can tell that it really excites you and that is infectious in a teaching environment. I too have become somewhat hooked on data science and I was one of those students that avoided statistics at all costs at every level of education. I'm looking into coming to one of the data science bootcamps at the data science dojo and really looking forward to learning from people that are equally passionate about data science and hopefully making up some lost ground. Keep up the great work.
@LuthieriadeBanheiro • 3 years ago
Excellent class!
@seanpitcher8957 • a year ago
Oh. My. God. THIS... THIS!!!!! This literally changes everything.
@yanivtubul • 5 years ago
Thanks a lot! Doing my first steps into R and Machine Learning. This talk is exactly what I needed
@QuickFlicksx • a year ago
Great!
@pipertripp • 2 years ago
This was excellent. I've learnt quite a lot and have a few new books for the reading list. Many thanks!
@Datasciencedojo • 2 years ago
Keep following us for more tutorials.
@pipertripp • 2 years ago
@@Datasciencedojo will do!
@shaunoconnell9506 • a year ago
Very nice, I just used this package for an assignment. This got me enthusiastic to learn more.
@Datasciencedojo • a year ago
Glad to help you, Shaun.
@acada • 2 years ago
Excellent presentation, you are a great teacher. Thank you
@Datasciencedojo • 2 years ago
Keep following us for more crash courses!
@reubenschneider3921 • 5 years ago
Great guide, I was really struggling with a ML assignment and didn't realise what an absolute unit 'caret' is!
@CK-vy2qv • 6 years ago
By far the best video out there for ML in R
@tamafun4745 • 2 years ago
Thank you very much, Dave & team. Really enjoyed the whole presentation and learned a lot!
@Datasciencedojo • 2 years ago
Glad you liked it, Tama. Keep following us for more content.
@bljangir7450 • 4 years ago
Simply excellent, I could not hold myself back from commenting even though a few minutes are still left. You are a genius for making things so interesting.
@henrique6748 • 3 years ago
Thank you so much for sharing this!
@antzlck • 5 years ago
Brilliant and great advert for your bootcamps!
@julianonas • 7 years ago
Thank you for sharing! Amazing video and instructions.
@Datasciencedojo • 7 years ago
@Juliano Nascimento - Glad you liked the video!
@a.useronly2266 • 2 years ago
Great 👍🏻
@TIKITAKANEWS • 6 years ago
Thanks very much. Now I have somewhere to start and can see it through to the end. Great!!
@neuro1152 • 7 years ago
Hi Dave, first of all thanks for the video. AWESOME stuff! One question: are the hyperparameters for the xgboost algorithm universal, or are they tuned specifically to this training set? Could I get the reference for the hyperparameters? It was cut off in the code editor screen. Thanks again.
@Datasciencedojo • 7 years ago
@Overlooking the Obvious - This is an important question. While the list of hyperparameters for any algorithm will always be the same, the values of each individual hyperparameter are tuned in the context of a particular data set. For example, you may find some values that are optimal for your training data set. You then perform feature engineering and add a new feature. There is no assurance that the previous hyperparameter values are still optimal, hence it is common practice to tune later in the project cycle when you arrive at a stable list of features. Here's a link to a great reference on xgboost hyperparameter tuning: www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1 HTH, Dave
@KarriemPerry • 5 years ago
I do appreciate Dave's approach. I think it's important to stress that there is a lot more to being a data scientist than simply understanding concepts of ML, AI, etc., or taking a few online courses for a certificate. I believe it takes graduate coursework and years of being a practitioner, understanding and implementing a list of techniques. Engineers typically vector into data analytics completely differently than I do, having an MS in data analytics. It is a good illustration of just how complex and broad the science of data is in these infant stages.
@Datasciencedojo • 7 years ago
Meetup Starts at: 2:57
@stockspotlightpodcast • 4 years ago
Great video! Only one question. When you say that set.seed(54321) is not random, what do you mean? I thought whatever we put in set.seed could be anything, e.g., set.seed(321). What is the meaning behind your 54321? You sort of glossed over that part and I'd love to dive a little deeper into that.
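For anyone else wondering the same thing: the value passed to set.seed() is arbitrary; what matters is that fixing any value pins R's random number generator to a known state, so the "random" split is reproducible from run to run. A minimal sketch:

```r
# The seed value itself (54321, 321, ...) carries no special meaning;
# any fixed value makes subsequent "random" draws reproducible.
set.seed(54321)
a <- sample(1:100, 5)

set.seed(54321)
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same draw
```

Two different seeds will (almost certainly) give different draws, but each is individually reproducible.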
@sebastianvarela2190 • 5 years ago
Hi Dave, very instructive video, congratulations. Please let me ask you a question: I know caret does not impute with factors. But what do you do in practice when you need to impute categorical/factor variables (setting aside the mode)? In the example of your video, in the dataset "imputed.data" you have two columns/dummies for Sex. If you, hypothetically, impute missing values for them, how do you take them back to the original dataset, in which there is only one column for Sex?
@joseramon4301 • 4 years ago
Thank you so much!
@wereskiryan • 3 years ago
Amazing video!
@erinklark • 6 years ago
Thanks for the video! Quick question - why do you have to split the data into a training/test set of 70/30 when you are going to do 10-fold cross-validation (90/10 split?) anyway later on? Are these two different things?
@ravi281381 • 5 years ago
I am running into a similar problem. I built a very simple model with complete cases with this data. I didn't do a test-train split, as CV was supposed to give me an out-of-sample metric. I got an ROC score of 0.8. I uploaded the model and Kaggle gave an accuracy of 0.52. Now I am confused about what purpose CV served.
@hannukoistinen5329 • a year ago
What is your actual test? What do you want to explain? Model fit: where is it? Coding is 'impressive', but you must get some real results too.
@Datasciencedojo • a year ago
Sure Hannu! We will work on explaining some real results in the future. Thank you for your suggestion.
@atlantaguitar9689 • 4 years ago
Great video... Do you feel it is necessary to use dummyVars before doing the imputation? Isn't it sufficient to do the imputation within the call to the train function as part of the preProcess argument? That is, is the conversion to one-hot encoding outside of the call to train strictly necessary?
@arindambpcsrkm • 3 years ago
@Dave, I understood how you imputed the age. However, if we have, say, 200 missing values for the Embarked column, will the same method used for imputing Age work? I mean, isn't it possible that for some rows both the Q and S dummies might end up with values close to 1? What do we do in that case?
@shorthand1121 • 5 years ago
If you get "subscript out of bounds" in the train() function, change the parallelization engine over to the future engine, as it is better at exporting environments:

library(parallel)
library(future)
library(doFuture)
plan("multisession")  # if you're seeing this error, you're likely on a Windows machine anyway
registerDoFuture()

And also comment out the makeCluster, registerDoSNOW, and stopCluster lines.
@sbdavid123 • 4 years ago
Great video, I have watched it several times at this point to get a better understanding of the caret package. It helped me out a lot. However, I have one question. Why do you split into train and validation sets and then use cross-validation on train? I always thought that cross-validation was a repeated train/test split. This way you avoid evaluating your model on only one split, which by chance might be easy (or super hard) to predict, e.g. because the test subset contains more extreme values or the train subset contains more of the imputed instances. By repeating the process of splitting the data into train and test several times and averaging the performance metrics over all these splits, you get a better view of the real performance of the model. So why do you split into train/test subsets and then use cross-validation on train? As I understand it now, it looks like you are reintroducing the problem cross-validation is trying to solve. Would it not be better to skip the train/test split and use k-fold cross-validation (which is basically a repeated split into train and test)? Thanks!
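For reference, the pattern being asked about looks roughly like this (a sketch using iris as stand-in data, not the video's Titanic code). The usual rationale: CV inside the 70% is for tuning and model selection, while the untouched 30% gives a final estimate that tuning never influenced.

```r
library(caret)

# Hold out a final test set that tuning never sees...
set.seed(54321)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train.data <- iris[idx, ]
test.data  <- iris[-idx, ]

# ...while repeated 10-fold CV runs *inside* the 70% to pick
# hyperparameters; the two splits serve different purposes.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```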
@alisterdcruz1667 • 3 years ago
I noticed that the other columns with a large number of NAs were removed, and while imputing the Age variable all the other factors had no NAs. What should I do if the variables that are critical for imputing the Age variable also have NAs? I'm a noob, so please correct me if there is a lack of logic in my doubt.
@flamboyantperson5936 • 6 years ago
Extremely helpful video. I don't know what the concept of grid search does. Can you explain to me in simple terms how it works and how it helps in tuning the model? Thank you.
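In case it helps anyone else: grid search just means enumerating every combination of candidate hyperparameter values, fitting a model for each (usually under cross-validation), and keeping the combination with the best resampled metric. A toy sketch with made-up values (a real xgbTree grid for caret needs all seven of its tuning parameters; this only shows the idea):

```r
# Each row of the grid is one candidate combination of hyperparameters.
tune.grid <- expand.grid(eta       = c(0.05, 0.1),
                         nrounds   = c(50, 100),
                         max_depth = c(4, 6))
nrow(tune.grid)  # 8 candidate combinations (2 x 2 x 2)
# caret::train(..., tuneGrid = tune.grid) then fits each combination
# under the resampling scheme and reports the best one.
```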
@mahdip.4674 • 5 years ago
Thanks for the tutorial. Talking about model-based imputation, let us say we have 3 numeric variables to impute. How does the imputation work if we want to impute the first variable? Does caret consider a complete-case approach for the rest of the data? If so, how does it then impute the original first variable if it happens that, for a given record, the second or third variable also has a missing value? What is the procedure here? Thanks.
@yannelfersi3510 • 6 years ago
Great video @Dave. Super helpful; I love the step-by-step Q&A. Just curious: is it 'good' practice to include the test set when imputing data? Shouldn't it be done on the train set only?
@fredasefamilia • 4 years ago
Thank you for this. However, I tried implementing the code as written in IntroToMachineLearning.R and I get an error at line 159. I have tried it several times and the error message I get is:

Loading required package: plyr
Error in train.default(x, y, weights = w, ...) :
The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight

This is all confusing, given that all the columns specified are included in the code. Could this be the result of a bug? I'd appreciate a prompt answer to this. Thanks
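For anyone hitting this: caret requires the tuneGrid data frame for xgbTree to contain exactly the tuning columns it lists (current caret versions also expect a subsample column), and a missing or misspelled column name triggers this error even when the grid "looks" complete. A hedged sketch of a grid that satisfies the check — the values are illustrative, not the video's:

```r
# All seven xgbTree tuning parameters that recent caret versions expect;
# check colnames() carefully for typos if the error persists.
tune.grid <- expand.grid(nrounds          = c(50, 100),
                         max_depth        = c(4, 6),
                         eta              = c(0.05, 0.1),
                         gamma            = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample        = 1)
# then: train(..., method = "xgbTree", tuneGrid = tune.grid)
```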
@bobbird4957 • 7 years ago
Dear David, great talk, thank you very much. I have a short question: how do I know which factors are included in the "best" model? Thus, which factors are most predictive in separating survivors from non-survivors? Thank you in advance! Best, Bob
@bobbird4957 • 7 years ago
Thanks Dave!
@Datasciencedojo • 7 years ago
@Robert Daihatsu - If you mean individual factor levels, then that can be difficult to get from the models. Finding feature importance, however, is far easier. For example, the following code can be added to the end of the Meetup code file to get the feature importance: xgb.importance(feature_names = colnames(titanic.train), model = caret.cv$finalModel) HTH, Dave
@nikhitharajashekar1637 • 7 years ago
Great video, easy to understand! But I have a doubt: how are the resampling results across tuning parameters selected?
@Datasciencedojo • 7 years ago
@Nikhitha Rajashekar - As I mention in the video, while caret can perform stratified cross validation, the video does not demonstrate this. As coded, the video illustrates using cross validation with simple random sampling to create each of the 10 folds for each of the repeats (i.e., 30 total folds each created with random sampling). HTH, Dave
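As a rough sketch of the setup described above (10 folds × 3 repeats = 30 total fits per candidate model; illustrative, the Meetup code itself may build the folds differently):

```r
library(caret)

# 10-fold CV repeated 3 times: each candidate hyperparameter combination
# is fit 30 times, with folds drawn by simple random sampling
# (not stratified in this sketch).
ctrl <- trainControl(method  = "repeatedcv",
                     number  = 10,
                     repeats = 3)
```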
@venustat • 7 years ago
Great video
@Datasciencedojo • 7 years ago
@Statsvenu Manneni - Glad you liked the video!
@rajkamalsrivastav7696 • 7 years ago
Hi David, thanks for this session!! One question: is it always good to go with imputation using caret (e.g. bagged decision trees for imputing age), or should we do some EDA, such as finding a pattern in age using Pclass/Sex aggregation, and then impute the age with that value?
@Datasciencedojo • 7 years ago
@Raj kamal Srivastav - I tend to shy away from terms like "always" and "never" when it comes to data science. The only universal answer I've found is "it depends". :-) To answer your specific question, I always strongly suggest doing exploratory data analysis - in fact we spend a good chunk of day 1 in our bootcamp discussing EDA. However, it is often the case that even after EDA you might need a ML model to most accurately impute ages due to the underlying complexities in the patterns in the data. HTH, Dave
@paulvictor3316 • 7 years ago
Great video! In regards to preProcess(..., method = "bagImpute") what's your definition of SMALL DATA? Would 5000 rows with 10 columns be small?
@Datasciencedojo • 7 years ago
@Paul Victor - Glad you liked the video. Your question is apt as "big data" vs. "small data" is a subjective measure that depends on the situation. In this particular case, caret will create N bagged decision tree models where N is the number of predictors in your data frame. A 5000x10 matrix would be fine, but you certainly wouldn't want to use this functionality on other problems like text analytics where you could have tens of thousands of rows and tens of thousands of columns. HTH, Dave
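A tiny sketch of the bagged imputation being discussed, on toy data (not the Titanic columns). preProcess() with "bagImpute" builds one bagged-tree model per column, predicting each column's missing entries from the other columns — which is why the cost grows with the number of predictors:

```r
library(caret)  # bagImpute also requires the ipred package installed

# Toy numeric data with one missing value in x2.
df <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                 x2 = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

pp      <- preProcess(df, method = "bagImpute")
imputed <- predict(pp, df)
anyNA(imputed)  # FALSE: the NA in x2 has been filled in
```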
@aman_mashetty5185 • 6 years ago
Thanks for the amazing video, Dave. It's going to help in the future too, but I don't know what the grid search concept does. Can you explain to me in simple terms how it works and how it helps in tuning the model?
@collinsouru3629 • 4 years ago
I am reproducing your example but am stuck at training the model; it is returning an error like: Error: The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample. What could I be doing wrong? Here is the part which is returning the error: caret.cv
@hasthigiSrivaradhan1 • 6 years ago
Thank you.
@JerryWho49 • 4 years ago
Isn't there some sort of data leakage? You're imputing the missing ages using the entire data set. So the training set "knows" something about the test set. That's not good. I think you should split first and then use two pipelines for training and testing. Is there support for pipelines in caret?
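For what it's worth, one leakage-free pattern with caret is to fit the preProcess object on the training rows only and then apply that same fitted transform to both sets. A sketch (iris as stand-in data, center/scale as stand-in transforms):

```r
library(caret)

set.seed(54321)
idx     <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train.x <- iris[idx, 1:4]
test.x  <- iris[-idx, 1:4]

# Learn the preprocessing parameters from the training rows only...
pp <- preProcess(train.x, method = c("center", "scale"))

# ...then apply the identical fitted transform to both sets, so no
# information from the test rows influences the fit.
train.t <- predict(pp, train.x)
test.t  <- predict(pp, test.x)
```

caret has no full pipeline object like scikit-learn's, but passing preProcess = c(...) to train() applies this same fit-on-train-only logic inside each resampling fold.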
@apoorvspydy • 6 years ago
Thanks! This was very helpful. Where can I get the rest of the videos on Machine Learning?
@Datasciencedojo • 6 years ago
You can watch more of our Machine Learning tutorials here: tutorials.datasciencedojo.com/azure-machine-learning-tutorial-part-1/ You can also find our other meetups here as well: tutorials.datasciencedojo.com/categories/community-talks/
@aakashchugh9 • 6 years ago
Great video, caret is amazing. One question though: if we are doing stratified sampling, do we still have to balance the data? Because if we don't balance the data the outcome will be biased, and if we do balance it, isn't that manipulation?
@gregorkvas6332 • 3 years ago
The result of stratified sampling with respect to the outcome (survived/non-survived) is a balanced training and test set with respect to the outcome (survived/non-survived). Best, Gregory
@NikosKatsikanis • 7 years ago
Hi, I am a JS expert wanting to get into DS. What tools do you advise me to learn?
@navjotsingh2251 • 6 years ago
@Quantum Information: learn what you need for the job you want. Different jobs require different tools for different tasks. Figure out what you want to do, then figure out what tools will get you there.
@djangoworldwide7925 • 2 years ago
You are a nice American chap
@coolhead8686 • 4 years ago
Are you hiring? I am in the same position as you. I have spent over 20 years doing system development and working as a programmer analyst, data analyst and data scientist.
@junaideffendi4860 • 7 years ago
Great video, but I didn't see the use of train.dummy? You worked on the train dataset, which has the imputed age but not the dummy columns. Please clarify.
@junaideffendi4860 • 7 years ago
I get it now: the dummy variables were calculated in order to do the imputation. In this case, you did the dummy variable step to teach, because there were no missing values in any columns other than Age, so in reality we can skip the dummy part; correct me if I am wrong. Also, I actually thought that the dummy variables were created for the training part too. A question arises from that: how does the machine learning handle the categorical variables? Is it converting them into numerical values, like one-hot encoding, automatically, or processing them directly, as in a simple decision tree?
@junaideffendi4860 • 7 years ago
Thanks :)
@Datasciencedojo • 7 years ago
@Junaid Effendi - If I understand your question correctly, the following lines of code use train.dummy to generate a new matrix with all missing Age values imputed: pre.process
@Datasciencedojo • 7 years ago
@Junaid Effendi - If you use caret's imputation feature via the preProcess() function then you need to convert to dummy variables. As I mention in the video, the preProcess() function does not work with factor variables. You would not want to skip this step as you are losing potential features that the bagged decision trees could use to potentially build more accurate imputation models. To answer your second question, "it depends". For example, in the case of the mighty Random Forest factors can be used directly so caret will do nothing. However, xgboost does not support factors by default. In this case caret is transforming the factors behind the scenes for you. HTH, Dave
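A tiny sketch of the dummy-variable step being described (toy columns, not the Titanic data; the "Sex.female"/"Sex.male" names follow dummyVars' default "." separator):

```r
library(caret)

# dummyVars() turns factor columns into numeric indicator columns,
# which is needed because preProcess() skips factors during imputation.
df <- data.frame(Sex = factor(c("male", "female", "male")),
                 Age = c(22, NA, 35))

dv     <- dummyVars(~ ., data = df)
df.num <- as.data.frame(predict(dv, newdata = df))
names(df.num)  # "Sex.female" "Sex.male" "Age"
```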
@drnabinpaudel6984 • 4 years ago
So, how do we implement this model to a new dataset ?
@gregorkvas6332 • 3 years ago
Using the predict() function on the trained caret model object. Best, Gregory
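As a rough sketch of that (a small rpart model on iris stands in for the video's xgboost run; the new data must carry the same columns and factor levels the model was trained on):

```r
library(caret)  # also requires the rpart package, shipped with R

set.seed(54321)
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Score previously unseen rows with the trained caret model object.
new.obs <- iris[c(1, 75, 150), 1:4]
predict(fit, newdata = new.obs)  # one predicted class per new row
```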
@farhanadham7237 • 3 years ago
Wow, it's amazing for tuning parameters on xgboost, because we know xgboost always takes so much time for training.
@yishengkim9081 • 6 years ago
Great video, but waiting ~ 5 mins to be recognized as having a question is troubling (between ~55:00 - 59:00). #WomeninDataScience
@Ivansnooze • 6 years ago
Troubling? Half the time she does not even put her hand up, and furthermore she keeps taking it down. Seems a bit of a stretch to make this a gender issue. I know there is probably a bias against women in this field, but not every situation should be used as a call to arms.
@yishengkim9081 • 5 years ago
@@Ivansnooze Isn't that what men ALWAYS do, minimize the gender concerns of women? I'm a part-time chemistry professor, I don't need my students to keep their hands up for the entire lecture to recognize their questions. And when I see a student put their hand down after having raised it earlier, I'll double back to ask them if they still have a question. That's what good lecturers do.
@mushroomdew • 4 years ago
@@yishengkim9081 He did speak to her; her question wasn't missed. Are you sure the cause isn't her being at the back of the room?