Get started with random forest tuning and tidymodels using IKEA price data

11,794 views

Julia Silge


Comments: 50

@BigChewbowski 4 years ago

Thank you for taking the time to make these videos! They have been an immense help in my R journey!

@alelust7170 4 years ago

Thanks, Julia! You always bring some interesting library into your analysis!

@jennyhansen 1 year ago

Thank you, Julia. This was tremendously helpful for me!

@umber_wall 1 month ago

Thank you, learned so much!

@TURALOWEN 2 years ago

usemodels package is magic!

@yangyang6008 2 years ago

Hi Julia, thank you for the great tutorial! For the training set cross-validation, what is the difference between "bootstraps" and "vfold_cv"? Which method is more appropriate for training a machine learning model? Thank you.

@JuliaSilge 2 years ago

You can check out this chapter for the differences: www.tmwr.org/resampling.html#resampling-methods Also, this Cross Validated answer by Max tells you a bit about when you might choose one over the other: stats.stackexchange.com/a/18355/133241 If you have enough data, cross validation is usually the best bet.
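
To make the difference concrete, here is a minimal sketch, assuming the rsample package is installed. It uses the built-in `mtcars` data as a stand-in rather than the IKEA data from the video, and the object names are only illustrative.

```r
library(rsample)

set.seed(123)

# 10-fold cross-validation: each row lands in exactly one assessment set
folds <- vfold_cv(mtcars, v = 10)

# Bootstrap resampling: each analysis set is drawn with replacement and has
# as many rows as the original data; rows not drawn form the assessment set
boots <- bootstraps(mtcars, times = 25)

first_fold <- folds$splits[[1]]
first_boot <- boots$splits[[1]]

nrow(analysis(first_fold))    # rows used for fitting in the first fold
nrow(assessment(first_fold))  # rows held out in the first fold
nrow(analysis(first_boot))    # same size as the original data
nrow(assessment(first_boot))  # the out-of-bag rows
```

With only 32 rows split into 10 folds, each assessment set holds just 3 or 4 rows, which is why cross-validation on small data produces small held-out sets.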

@yangyang6008 2 years ago

@JuliaSilge Hi Julia, thank you very much for your explanation!

@seaniam 4 years ago

Love these videos - thanks Julia!

@JorgeThomasM 1 year ago

Hi @JuliaSilge! Would volume = height * width * depth be a sort of interaction / new variable? Thanks so much for all these wonderful sessions.

@JuliaSilge 1 year ago

Yeah, for sure! We'd call that "feature engineering" because you are creating a custom feature from the original variables based on your domain knowledge of how furniture works. 😄
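
As a rough sketch of that kind of feature engineering, assuming dplyr is installed: the `furniture` tibble below is made-up toy data, not the IKEA dataset, and the column names are assumed to match the ones in the question.

```r
library(dplyr)

# Toy stand-in data (hypothetical values, in centimeters)
furniture <- tibble(
  height = c(50, 80),
  width  = c(40, 120),
  depth  = c(30, 60)
)

# Create the custom feature from the three original dimensions
furniture <- furniture %>%
  mutate(volume = height * width * depth)

furniture
```

Inside a recipe, the same calculation could go in `step_mutate()` so that it is applied consistently to every resample.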

@artathearta 4 years ago

10:43 why did vfold_cv give you small testing folds?

18:25 I got an error:

```
...
All models failed. See the `.notes` column.
...
Warning message:
This tuning result has notes. Example notes on model fitting include:
preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_clean_levels', 'step')"
preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_clean_levels', 'step')"
preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_clean_levels', 'step')"
```

I followed your steps exactly, and I even tried directly copying and pasting your code from your blog post.

EDIT: I was able to fix this problem by commenting out the workflow() command and instead piping the recipe through prep() after step_knnimpute, and then setting up tune_grid to take in the model as its object and ranger_recipe as its preprocessor.

@JuliaSilge 4 years ago

With that many folds on data this small, that's just how cross-validation works! You can read about bootstrap vs. cross-validation here: stats.stackexchange.com/questions/18348/differences-between-cross-validation-and-bootstrapping-to-estimate-the-predictio I forgot to mention in the video that `step_clean_levels()` is in the development version of textrecipes, so you'll need to install from GitHub to be able to use that function: `devtools::install_github("tidymodels/textrecipes")`

@artathearta 4 years ago

@JuliaSilge I uninstalled the copy of textrecipes I got from CRAN and installed it from GitHub, and now it doesn't work even if I use prep() 😆 It's all good, I'll still follow along and hope it eventually works with my computer, or put it on my MacBook. Still a great video! Excited to use {usemodels}!

@artathearta 4 years ago

@JuliaSilge Okay, while I was filling out the steps for submitting a bug on textrecipes, I discovered that it works with the workflow() object if I remove `doParallel::registerDoParallel()` before running `tune_grid`.

@JuliaSilge 4 years ago

@artathearta Hmmmm, can you make sure you have the most up-to-date version of tune from CRAN? That contains bug fixes for parallel processing on Windows.

@artathearta 4 years ago

@JuliaSilge tune_0.1.2. I can open a GitHub issue if you'd like.

@psxcl9817 2 years ago

Hello Julia! Thanks for your video. I would like to know whether I can obtain the importance values of each feature from the vip package instead of plotting them. I did not find the relevant function in vip.

@JuliaSilge 2 years ago

Do you mean you want to get the importance values as a dataframe, rather than a visualization? You can use `vi()` for that: koalaverse.github.io/vip/reference/vi.html
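
A minimal sketch of what that could look like, assuming the ranger and vip packages are installed. The model here is fit on the built-in `mtcars` data purely for illustration, and permutation importance is an assumed choice (ranger needs some `importance` setting for `vi()` to have values to extract).

```r
library(ranger)
library(vip)

set.seed(123)

# Fit a random forest that records variable importance
rf_fit <- ranger(mpg ~ ., data = mtcars, importance = "permutation")

# vi() returns the importance scores as a tibble rather than a plot
importance_tbl <- vi(rf_fit)
importance_tbl
```

The result has one row per predictor, with `Variable` and `Importance` columns that can be filtered, joined, or plotted like any other data frame.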

@davidjackson7675 4 years ago

What about calculating the square inches of the top?

@prod.kashkari3075 4 years ago

Thank god for your course and book, I was seriously struggling trying to learn tidymodels from the docs. One thing, in your course, do you want to maybe add how to use the “stacks” package for stacking models and building ensemble learners?

@MattBirch 2 years ago

This is awesome. Thanks!

@MoCtheFirst 3 years ago

When using `predict()` at the end (24:49), I get the error: "Workflow has not yet been trained. Do you need to call `fit()`?" Any suggestions as to what went wrong? Thanks for all the input!

@JuliaSilge 3 years ago

If you want to walk through the blog post to follow along, you can call `predict()` on the fitted workflow that is inside of `final_res`: juliasilge.com/blog/sf-trees-random-tuning/ You can check out my latest blog post for a more explicit example of how to do this: juliasilge.com/blog/chocolate-ratings/
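
A small sketch of that pattern, assuming a recent tidymodels installation. The data (`mtcars`), the linear model, and every object name except `final_res` are stand-ins chosen for brevity, not the code from the video.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars)

# An untrained workflow: calling predict() on this is what raises the error
car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# last_fit() trains on the training set and evaluates on the test set
final_res <- last_fit(car_wf, car_split)

# Pull the trained workflow out of the result, then predict with it
fitted_wf <- extract_workflow(final_res)
preds <- predict(fitted_wf, new_data = testing(car_split))
preds
```

On older versions of tune without `extract_workflow()`, the fitted workflow is stored in the `.workflow` list column, so `final_res$.workflow[[1]]` is the equivalent.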

@seunghoonlee5275 2 years ago

Thank you so much, Julia! It's a great video. I wonder whether I can use a weight variable in random forest analysis (or in tidymodels in general). Could you recommend any materials?

@JuliaSilge 2 years ago

Yes, this has been a focus of the tidymodels team this year! You can read more here: www.tidyverse.org/blog/2022/05/case-weights/ Since that post, much of the case weight work has been released to CRAN.

@seunghoonlee5275 2 years ago

@JuliaSilge Thank you so much, Julia! I will go over the link.

@panagiotischionas5828 4 years ago

Hi Julia, I really love your work. A quick question: since you take the log of price as input to your model, if you want to show the actual price predicted by the model, how would you do that?

@JuliaSilge 4 years ago

I used log10(), so you can do 10^price to get it back. 👍
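
In base R the round trip looks like this; the prices are made-up values used only to show that `10^x` undoes `log10(x)`.

```r
# Hypothetical prices on the original scale
price <- c(69, 995, 2500)

# The transformation applied before modeling
log_price <- log10(price)

# Back-transform to the original scale
price_back <- 10^log_price
price_back
```

The same back-transformation applies to model predictions made on the log10 scale, for example `10^preds$.pred` for a tibble of predictions.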

@sjrigatti 4 years ago

Hi. This is great. I work with survival data a lot and I was wondering how an analysis like this would differ with a survival object as the outcome. Is it just a matter of changing the mode of the ranger fit?

@JuliaSilge 4 years ago

No, actually, we still have a bit of work to do for survival models. We have some notes sketched out here: github.com/tidymodels/planning/tree/master/survival-analysis And there are some proofs of concept floating around in various repos. This is something we will work more on in 2021, so look for survival support next year!

@sjrigatti 4 years ago

@JuliaSilge This seems like something Dr. Harrell at Vanderbilt would be interested in working on. Has he contributed anything at this point?

@JuliaSilge 4 years ago

@sjrigatti Not at this point, but an interesting idea!

@mkklindhardt 3 years ago

Hi Julia, once again thank you for your amazing videos and your great enthusiasm. I have some questions. 1) Why do you use knn imputation? You did not really explain why you did not go for linear or mean imputation. 2) Can usemodels also be used to prepare my data (recipe, workflow, prep, etc.) for a linear mixed model? Ultimately I would like to use the same data setup for comparing different regression models, such as linear mixed models (stepwise AIC regression), kNN regression, and random forest regression, as well as XGBoost. Is it possible to have the same data setup for all my models? I guess that's needed when comparing model performance and evaluating models? Or am I wrong? Thank you

@JuliaSilge 3 years ago

Choosing nearest neighbors for imputation over something like linear imputation or just a single value (mean/median) is similar to making that choice for modeling overall; it lets you use nonlinear, more complex relationships in the data for the imputation. I think this paper is a pretty nice discussion: www.ncbi.nlm.nih.gov/pmc/articles/PMC4959387/ You can see the models that are currently supported in usemodels here: usemodels.tidymodels.org/reference/index.html If you are interested in comparing quite a number of models, you might check out using the tidyposterior package, as described in this chapter: www.tmwr.org/compare.html
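
A minimal sketch of nearest-neighbor imputation in a recipe, assuming a reasonably recent version of recipes, where `step_knnimpute()` has been renamed `step_impute_knn()`. The missing values are introduced artificially into `mtcars` for illustration, and `neighbors = 5` is an arbitrary choice.

```r
library(recipes)

# Stand-in data with a couple of artificially missing values
cars_missing <- mtcars
cars_missing$hp[c(2, 5)] <- NA

# Impute hp from the rows that are most similar on the other predictors
impute_rec <- recipe(mpg ~ ., data = cars_missing) %>%
  step_impute_knn(hp, neighbors = 5)

imputed <- impute_rec %>%
  prep() %>%
  bake(new_data = NULL)

sum(is.na(imputed$hp))
```

Swapping in `step_impute_mean()` or `step_impute_linear()` at the same position is one way to compare the simpler alternatives mentioned in the question.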

@mkklindhardt 3 years ago

Appreciated, @JuliaSilge! Is it "fair" to compare linear regression models with machine learning regression models? 1) Are there specific areas, generally, that one needs to be aware of when comparing linear mixed models with machine learning models (e.g. random forest, XGBoost, and kNN), such as changes in predictor variables, continuous vs. factor variables, etc.? 2) Are there tidymodels ways I can deal with or prevent collinearity and high correlation between variables before I perform the linear regression modelling? Perhaps something like an AIC stepwise regression? Is that the same as the vip() function? But then my predictors for the linear model will change compared to the ones in the ML regression modelling, right? Sorry for the many questions. I hope they are somewhat clear. Hope you had a good weekend, Julia. Your help is precious to me! Best regards, Kamau

@JuliaSilge 3 years ago

@mkklindhardt Yep, there is nothing wrong with comparing linear models with models that can account for more complex, non-linear behavior. If you are thinking about comparing models, I recommend reading in detail this section, as well as Chapters 10 and 11: www.tmwr.org/software-modeling.html#model-types In tidymodels, we have preprocessing steps to filter out variables that are highly correlated or a linear combination of each other: recipes.tidymodels.org/reference/index.html#section-step-functions-filters We don't recommend using stepwise regression, for the reasons outlined here: www.stata.com/support/faqs/statistics/stepwise-regression-problems/ More on that here: stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856
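
A short sketch of those filter steps, assuming the recipes package is installed. It uses `mtcars` as stand-in data, and the 0.9 correlation threshold is an arbitrary value picked for illustration.

```r
library(recipes)

filter_rec <- recipe(mpg ~ ., data = mtcars) %>%
  # Drop predictors that are exact linear combinations of others
  step_lincomb(all_numeric_predictors()) %>%
  # Drop predictors until no pair exceeds the correlation threshold
  step_corr(all_numeric_predictors(), threshold = 0.9)

filtered <- filter_rec %>%
  prep() %>%
  bake(new_data = NULL)

# Compare the remaining columns with the original ones
setdiff(names(mtcars), names(filtered))
```

Because the filters live in the recipe, the same recipe can be paired with a linear model and with a random forest, so both models see identically prepared data.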

@JamesLee1 4 years ago

Hello Julia, thanks for the video. I'm a big fan. Could you please let me know how to make HTML/notebook outputs from R Markdown better looking? When I use your tidytuesday Rmd file from your GitHub, the resulting HTML file has the default single-spaced, very small calibre font text. But your website has ~1.5-spaced, big custom font text that's pretty. I don't intend to publish my HTML report on GitHub or online because my work data is sensitive; could I still make HTML outputs that have the same formatting as your website? Is the Wowchemy Academic theme only for publishing online through GitHub? I would like to change the HTML text formatting locally.

@JuliaSilge 4 years ago

My website uses Hugo and I'm sure you don't want to get that set up just for individual reports. Instead, take a look at some of the styling options you have for HTML reports. There are built-in options using Bootswatch: bookdown.org/yihui/rmarkdown/html-document.html#appearance-and-style Or other contributed formats like html_pretty and html_clean: rmarkdown.rstudio.com/formats.html

@shahidraza5571 3 years ago

Can you point me to a source where I can learn the random forest algorithm for predicting a groundwater contamination map due to fluoride, using RStudio along with QGIS?

@ROCK962 4 years ago

Hi Julia! Thank you for your awesome tutorials. I am trying to replicate the Palmer Penguins episode, but I am having a problem with the bootstrapping step. When I run the bootstraps function from rsample, R is creating empty splits. Do you know what could be the issue?

@JuliaSilge 4 years ago

Wow, no, I haven't seen that before. Can you work on creating a reprex: www.tidyverse.org/help/ And then posting the problem on RStudio Community? rstd.io/tidymodels-community