Mr. Landry reviewed accuracy, at 27:20, based on the validation dataset, which was used during training to tune the model. That is not a realistic error estimate -- it is too optimistic -- which shows when the hit_ratio_table is examined. It is better to estimate error on new data rather than on data that was used to tune the model.
@marklandry2140 8 years ago
Hi @Geoffrey Anderson. A final/new test set is used, actually. This is introduced at about 18:00, and discussed at more length at 32:00, where it is scored for the first time. It trains on 60%, uses 20% for an internal validation set (early stopping), and the final 20% to evaluate when all tuning is complete.
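The 60/20/20 split Mark describes can be sketched as follows. This is a simplified pure-Python illustration, not H2O's API; H2O's own `split_frame()` splits probabilistically, so its fractions are only approximate.

```python
import random

def train_valid_test_split(rows, seed=42):
    """Shuffle and split rows 60/20/20 into train, validation, and test.

    A simplified sketch of the split described above: 60% to train on,
    20% as an internal validation set (early stopping), and a final 20%
    scored only once, when all tuning is complete.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * 0.6)
    n_valid = int(n * 0.2)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test

train, valid, test = train_valid_test_split(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```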
@pbharadwaj3 6 years ago
The coding starts at 16:38
@poojawalavalkar355 7 years ago
Beautifully explained. Thanks Mark!
@nomadjoy 6 years ago
Hi Mark, this was extremely helpful. Can you please share the github path for the same. Thanks.
@sarthakyadav371 4 years ago
You are awesome Mark!
@kojikitagawa7333 6 years ago
Could someone please elaborate a little more on the hit ratio table starting at 23:45? I am a little confused about what the score represents at k >= 2.
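For anyone else puzzled by this: H2O's multiclass hit ratio table reports, for each k, the fraction of rows whose true class is among the model's top-k most probable classes, so the score at k >= 2 is cumulative and can only increase with k. A minimal sketch of that computation (the function name and data layout here are illustrative, not H2O's API):

```python
def hit_ratio_at_k(probs, labels, k):
    """Fraction of rows whose true class is in the top-k predicted classes.

    probs: list of dicts mapping class name -> predicted probability.
    labels: list of true class names, one per row.
    """
    hits = 0
    for row, truth in zip(probs, labels):
        # Classes ranked by predicted probability, highest first.
        top_k = sorted(row, key=row.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(labels)

probs = [
    {"setosa": 0.5, "versicolor": 0.3, "virginica": 0.2},
    {"setosa": 0.2, "versicolor": 0.5, "virginica": 0.3},
]
labels = ["versicolor", "virginica"]
print(hit_ratio_at_k(probs, labels, 1))  # 0.0 -- neither true class ranked first
print(hit_ratio_at_k(probs, labels, 2))  # 1.0 -- both true classes in the top 2
```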
@anitamishra04 6 years ago
The best explanation
@hleljihen2007 5 years ago
Thank you for the video, but can you please talk more slowly?
@Coral_dude 4 years ago
You can control the speed yourself with YouTube's playback controls.
@chsuresh009 7 years ago
Hi Mark, I could not find anywhere how to determine the optimal number of rounds for GBM. In xgboost's cv we get to know at which iteration the model reached the optimal loss, but in H2O, even when I give a validation set, stopping metric (logloss), stopping rounds (150), and stopping tolerance 0.0001, it does not seem to stop. The number of trees is always whatever is set in ntrees.
@marklandry2140 7 years ago
Hi @Suresh Chinta. Stopping rounds of 150 is quite high. It may be valid in your case, but H2O will wait until the average of 150 consecutive rounds is within the stopping tolerance (0.0001, it seems) of the average of the prior 150 consecutive rounds. A round uses score_tree_interval to determine how many trees are part of it (the default varies, based on scoring-time estimation).

For reference, I typically use 2 for stopping_rounds. I usually set ntrees to a nearly unattainable number (e.g. 2000, 10000), drop the tolerance to 0, and set score_tree_interval to somewhere between 2 and 5. Those models typically stop well before the ntrees limit. In case it helps, since the demo is intended to be fast for people in the audience and that makes it a little less indicative of typical modeling, this is the latest model I've run this week: gbm
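The stopping rule described above can be sketched roughly as follows. This is a simplified moving-average comparison for a loss metric, not H2O's exact implementation (H2O also accounts for metric direction, relative vs. absolute tolerance, and score_tree_interval):

```python
def should_stop(scores, stopping_rounds, stopping_tolerance):
    """Simplified early-stopping check for a loss metric (lower is better).

    Stop when the mean of the last `stopping_rounds` scoring events has not
    improved on the mean of the preceding `stopping_rounds` events by more
    than `stopping_tolerance`. A sketch of the idea, not H2O's exact rule.
    """
    if len(scores) < 2 * stopping_rounds:
        return False  # not enough scoring history yet
    recent = scores[-stopping_rounds:]
    prior = scores[-2 * stopping_rounds:-stopping_rounds]
    improvement = sum(prior) / stopping_rounds - sum(recent) / stopping_rounds
    return improvement <= stopping_tolerance

# Loss is still falling quickly: keep training.
print(should_stop([0.90, 0.80, 0.70, 0.60], 2, 1e-4))                    # False
# Loss has flattened out: stop.
print(should_stop([0.90, 0.80, 0.70, 0.6999, 0.6999, 0.6999], 2, 1e-4))  # True
```

With stopping_rounds=150 the two windows span 300 scoring events, which explains why the run above never triggered before hitting the ntrees cap.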