Feature engineering & interpretability for xgboost with board game ratings

7,145 views

Julia Silge

A day ago

Comments: 36
@michal.tomczyk · 2 years ago
Great stuff. Amazing R programming and data analysis skills!
@IamHopeman47 · 2 years ago
Thanks for the great screencast! For me the little introduction to the finetune package was a big eye-opener. Very smart approach. Looking forward to more of your content (especially the advanced topics).
@whynotfandy · 2 years ago
I love the new lights! It's a great addition for your intros and outros.
@N1loon · 2 years ago
Amazing stuff... I really admire your knowledge in this super complex field. I think it would be cool if you did an episode centered on all the nuances of tuning. This video offered a good general introduction to tuning principles but didn't go too deep into the details, such as finding the right balance between over- and underfitting, working with grids, etc. Just an idea! Anyways, really love your content, Julia!
@JuliaSilge · 2 years ago
For now, you might check out some of my other blog posts that focus on hyperparameter tuning, like these: juliasilge.com/blog/scooby-doo/ and juliasilge.com/blog/sf-trees-random-tuning/
2 years ago
5:12 your excitement for that distribution LOL
@hnagaty · 2 years ago
Great screencast and very useful. Many thanks.
@pokelytics4352 · 2 years ago
Fantastic content as always! Quick question: what are your thoughts on training multiple models using the top parameters from hyperparameter tuning and ensembling the predictions? Is there an easy way to do something like this with tidymodels? Thanks!
@JuliaSilge · 2 years ago
Yep, absolutely a great approach to work toward a bit of performance gain! You can implement ensembling in tidymodels with the stacks package: stacks.tidymodels.org/
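A minimal sketch of what stacking with the stacks package could look like (object names like `xgb_res` and `game_test` are hypothetical, and the tuning results need to be created with control settings such as `control_stack_grid()` so out-of-sample predictions are saved):

```r
library(stacks)

# `xgb_res` and `svm_res` are hypothetical tuning results whose
# out-of-sample predictions were saved via control_stack_grid()
model_stack <-
  stacks() |>
  add_candidates(xgb_res) |>
  add_candidates(svm_res) |>
  blend_predictions() |>   # regularized model chooses member weights
  fit_members()            # refit the retained members on the training set

predict(model_stack, new_data = game_test)
```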
@pokelytics4352 · 2 years ago
@JuliaSilge Great, thanks so much!
@codygoggin1097 · 2 years ago
Great video Julia! What would be the proper function to use in order to fit this best model onto new data and view these predictions?
@JuliaSilge · 2 years ago
You use the `predict()` function! workflows.tidymodels.org/reference/predict-workflow.html
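A quick sketch of what that could look like, assuming `final_fit` is the result of `last_fit()` and `new_games` is a hypothetical data frame of new board games:

```r
# Pull the trained workflow out of the last_fit() result,
# then predict on data the model has never seen
fitted_wf <- extract_workflow(final_fit)
predict(fitted_wf, new_data = new_games)
```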
@seadhna · 2 years ago
Great video as always! Would it be possible to use native parsnip functions to clean the features instead of doing base string manipulation in a custom function? I think in another video you cleaned the tokens within the recipe step.
@JuliaSilge · 2 years ago
We definitely have tidymodels functions for lots of text manipulation, like those in textrecipes: textrecipes.tidymodels.org/ There are also recipe steps like `step_dummy_multi_choice()`: recipes.tidymodels.org/reference/step_dummy_multi_choice.html But sometimes there isn't something that fits your particular data out of the box, in which case you can extend tidymodels like I walked through in this screencast.
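A hypothetical sketch of handling a delimited multi-value column inside a recipe with textrecipes (the names `game_train`, `average`, and `category` are assumptions, not from the screencast itself):

```r
library(textrecipes)

# Split a comma-delimited `category` column into tokens, keep the most
# common ones, and create one indicator column per remaining category
rec <- recipe(average ~ category, data = game_train) |>
  step_tokenize(category, token = "regex", options = list(pattern = ", ")) |>
  step_tokenfilter(category, max_tokens = 30) |>
  step_tf(category)
```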
@d_b_ · 1 year ago
Is one takeaway from 44:03 that I should create a short play game for older people with few players that has printed miniatures based in deductive fantasy animal war?
@JuliaSilge · 1 year ago
HA, I think so, yep!
@ammarparmr · 2 years ago
Thank you, fantastic video! If you don't mind, I have a question regarding `mtry`: how did we end up with mtry greater than 6 (the number of all the predictors)? Maybe I am confused about the concept.
@JuliaSilge · 2 years ago
After the feature engineering, there are a lot more predictors, 30 from the board game category alone. The data that goes into xgboost is the _processed_ data, not the data in its pre-feature-engineering form.
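One way to see this for yourself, sketched with hypothetical object names (`game_rec` for the recipe):

```r
# Prep the recipe and count the columns xgboost actually receives
# after feature engineering, not the raw predictor count
baked <- game_rec |> prep() |> bake(new_data = NULL)
ncol(baked)  # typically far more than the handful of raw predictors
```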
@ammarparmr · 2 years ago
@JuliaSilge Well explained, thank you so much!
@avnavcgm · 2 years ago
Great video! What would then be the best way to save the best trained model so you can predict with new observations in the future that aren't in the train/test split?
@JuliaSilge · 2 years ago
You can _extract_ the workflow from the trained "last fit" object and then save that as a binary, like with `readr::write_rds()`. I show some of that at the end of this blog post: juliasilge.com/blog/chocolate-ratings/
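A small sketch of that workflow, with hypothetical object and file names:

```r
# Extract the trained workflow from a last_fit() result and save it
fitted_wf <- extract_workflow(game_last_fit)
readr::write_rds(fitted_wf, "board_game_xgb.rds")

# Later, e.g. in a batch job or API, read it back and predict
fitted_wf <- readr::read_rds("board_game_xgb.rds")
predict(fitted_wf, new_data = new_games)
```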
@ndiyekebonye208 · 2 years ago
Still getting an error at `tune_race_anova()` despite updating all my packages and installing the latest versions of dials and finetune. Is there a way to overcome this?
@JuliaSilge · 2 years ago
If you are having trouble with one of the racing functions, I recommend just trying a plain old `fit()` with your workflow, or perhaps using `tune_grid()`. Those functions will help you diagnose where the model failures are happening.
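A sketch of that debugging path, assuming hypothetical names `xgb_wf` for the workflow and `game_folds` for the resamples:

```r
# Fit once on the training set so any error surfaces immediately
fit(xgb_wf, data = game_train)

# Or tune without racing; tune captures problems per resample
res <- tune_grid(xgb_wf, resamples = game_folds, grid = 5)
collect_notes(res)  # inspect warnings/errors recorded during tuning
```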
@ndiyekebonye208 · 2 years ago
@JuliaSilge Thank you so much, will surely try this!
@davidjackson7675 · 2 years ago
Thanks, that is interesting as always.
@charithwijewardena9493 · 2 years ago
Hi Julia I have a question. I'm trying to get my head around the concept of data leakage. You build your model with the outcome being "average", but before you do your split you do EDA on everything. Are we not gaining insight into the test set by doing that? Should we be doing EDA only AFTER splitting our data? Thanks. :)
@JuliaSilge · 2 years ago
This is definitely an important thing to think and make good decisions about. On the one hand, anything you do before the initial split could lead to data leakage. On the other hand, you need to understand something about your data in order to even create that initial split (like stratified sampling). It's most important that anything that you will literally use in creating predictions (like feature engineering) is done after data splitting.
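The split-first pattern could be sketched like this (data frame and column names are hypothetical):

```r
library(rsample)

# Split before any feature engineering, stratifying on the outcome
# so train and test have similar rating distributions
set.seed(123)
game_split <- initial_split(game_ratings, strata = average)
game_train <- training(game_split)
game_test  <- testing(game_split)  # untouched until the final evaluation
```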
@charithwijewardena9493 · 2 years ago
Cool, thank you for the reply. 🙏🏽
@jaredwsavage · 2 years ago
Great video Julia. Just a quick question: have you tried using lightgbm or catboost with `boost_tree()`? They are available in the treesnip package and generally run much faster than xgboost.
@JuliaSilge · 2 years ago
HA oh I have had SUCH installation issues with both of those. 🙈 I have a Mac M1 and you can see the current situation for catboost here: github.com/catboost/catboost/issues/1526 I'll have to dig up the lightgbm problems somewhere. Anyway, those are great implementations if you can get them to install!
@jaredwsavage · 2 years ago
@@JuliaSilge Wow, as a Windows user I'm usually the one on the wrong end of installation issues. 😁
@russelllavery2281 · 1 year ago
Cannot read the fonts.
@PA_hunter · 2 years ago
Would it be bad if I used tidymodels steps for non-ML data wrangling, haha!
@JuliaSilge · 2 years ago
I think some people do this for sure. Some things to keep in mind are that it is set up for learning from training data and applying to testing data, so I'd keep that design top of mind for using in other contexts. You can see this blog post where I used recipes for unsupervised work, without heading into a predictive model: juliasilge.com/blog/cocktail-recipes-umap/
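A hypothetical sketch of using a recipe purely for transformation, with no outcome and no model (the data frame name `cocktails_df` is an assumption):

```r
library(recipes)

# A recipe with no outcome variable, used only to transform data
rec <- recipe(~ ., data = cocktails_df) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 2)

prep(rec) |> bake(new_data = NULL)  # transformed data, no model involved
```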
@danmungai555 · 2 years ago
Enlightening
@barasatu123 · 2 years ago
Would you please make a video on the funModeling package and its different functions?