There are so many ML papers these days that authors have to resort to click-baity titles. What a time to be alive.
@connor-shorten 5 years ago
Aside from the hard example selection, is this identical to the RevNet technique for saving memory needed for backprop?
@YannicKilcher 5 years ago
In my opinion, RevNet and SB are somewhat orthogonal. RevNet still computes the original loss and gradients, but does so with lower memory requirements, while SB computes fewer gradients but retains all memory requirements. The cost of RevNet is that it is restricted to certain architectures and needs more computation; the cost of SB is the bias introduced into the loss function.
@connor-shorten 5 years ago
@@YannicKilcher Interesting, thank you! Do you think the restriction in RevNet limits the representational capacity?
@YannicKilcher 5 years ago
As a theoretical class of functions, probably not (or not much) but in a given practical situation, it might have an influence. It might be worth asking people working on normalizing flows etc. about how much their (very similar) constraints are hurting them.
@connor-shorten 5 years ago
@@YannicKilcher Interesting, thanks for the discussion!
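The reversibility constraint discussed in this thread can be illustrated with a two-way residual coupling (a minimal numpy sketch of a RevNet-style forward/inverse pair; `F` and `G` are hypothetical stand-ins for the learned residual functions, not RevNet's actual layers):

```python
import numpy as np

# Reversible (RevNet-style) block: inputs can be reconstructed exactly
# from the outputs, so activations need not be stored for backprop.
def F(x):
    return np.tanh(x)      # hypothetical residual function

def G(x):
    return 0.5 * x         # hypothetical residual function

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Recompute the inputs from the outputs; this is what saves memory.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```

The architectural restriction Yannic mentions is visible here: the input must be split into two streams, and each residual function may only read from the other stream, so that inversion stays exact.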
@simleek 5 years ago
This actually seems a lot like intrinsically motivated AI. The only difference is that those AIs move to obtain inputs with high loss (and a high decrease in loss) instead of selecting neurons or examples within a batch during training.
@sehbanomer8151 5 years ago
Won't it just overfit to the selected hard examples and underfit to the easy ones?
@YannicKilcher 5 years ago
One could argue that at that point the previously "easy" samples will become "hard" and will be upweighted. But the essence of your comment is correct: there is definitely a bias introduced by the procedure.
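The selection dynamics discussed above can be sketched on a toy problem (a minimal numpy illustration of per-batch top-k selection; all names and numbers are my own, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression y = 2x with a per-example squared loss.
w = 0.0
X = rng.normal(size=64)
y = 2.0 * X

def per_example_loss(w, X, y):
    return (w * X - y) ** 2

# Selective backprop (sketch): forward the full batch,
# but compute gradients only for the k highest-loss examples.
k = 16
losses = per_example_loss(w, X, y)
hard = np.argsort(losses)[-k:]   # indices of the k hardest examples

# Gradient of the squared loss w.r.t. w, restricted to the hard subset.
grad = np.mean(2 * (w * X[hard] - y[hard]) * X[hard])
w -= 0.1 * grad
```

Examples that are easy at one value of `w` can become the hardest ones after an update, which is why the selection keeps shifting; still, the effective objective is biased toward the tail of the loss distribution, as the thread notes.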
@guanfuchen2741 3 years ago
I think this will be difficult for multi-GPU training, because the GPUs forward once and then sync the results across the whole node's batch for both forward and backward. It becomes a tradeoff between the extra forward time and the backward time saved by skipping samples.
@AntonPanchishin 5 years ago
Thanks for this review, Yannic. I've been using a loss function to do something similar and was interested to see how the different ways of training on the hardest examples stacked up. Here's an interactive Colab notebook that demos 'regular' training on MNIST, a per-batch focus on the biggest losers, and a per-epoch focus. The notebook also includes the code needed to change 'regular' training into this new method; it turns out to be very easy to do with only a couple of lines of code, and it works with Keras. colab.research.google.com/drive/1QrSimz0aDKt7-C8Chg9zZXne2pmPoqPf
@jordyvanlandeghem3457 4 years ago
Thanks Anton, very easily explained. That type of explanatory notebook helps you see through the fluff and hype introduced in research papers and just focus on fast empirical results by applying it yourself.
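Anton's notebook contrasts per-batch and per-epoch selection; the per-epoch variant can be sketched like this (a toy numpy illustration under my own assumptions, not the notebook's Keras code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: fit y = 3x with per-example squared loss.
X = rng.normal(size=200)
y = 3.0 * X
w = 0.0

for epoch in range(5):
    # Per-epoch variant: rank the whole dataset by current loss once,
    # then run the epoch only on the hardest half.
    losses = (w * X - y) ** 2
    hard = np.argsort(losses)[len(X) // 2:]
    for i in hard:
        grad = 2 * (w * X[i] - y[i]) * X[i]
        w -= 0.02 * grad
```

The per-batch variant in the earlier discussion re-ranks inside every minibatch instead, so its notion of "hard" updates much more frequently; the per-epoch version trades freshness of the ranking for fewer loss evaluations.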
@superkhoy 5 years ago
What resources do you recommend for starting with DL? Anything in R?
@YannicKilcher 5 years ago
Nothing in R. Start learning high-level frameworks like Keras, Sonnet, or PyTorch Lightning.
@DrAhdol 5 years ago
This approach seems like a derivative of boosting.
@AntonPanchishin 5 years ago
It sure does seem that way, applied to NNs.
@herp_derpingson 5 years ago
Great paper, though why hasn't anyone thought about this before?
@YannicKilcher 5 years ago
People have thought about this in one way or another; see, for example, active learning or boosting.
@tan-uz4oe 5 years ago
Prioritized experience replay is very similar to this paper, and it was published in 2015.
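For comparison, the prioritized sampling step of that 2015 method can be sketched as follows (a minimal numpy illustration; the priority values here are made up, and the buffer is just an array):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical replay-buffer priorities (e.g. |TD error| per transition).
priorities = np.array([0.1, 2.0, 0.5, 4.0, 0.2])
alpha = 0.6   # how strongly to favor high-error transitions

# Sampling probability P(i) proportional to p_i ** alpha.
probs = priorities ** alpha
probs /= probs.sum()

# Sample a minibatch in proportion to priority, mirroring how
# selective backprop favors high-loss examples.
batch = rng.choice(len(priorities), size=3, p=probs)

# Importance-sampling weights (N * P(i)) ** -beta correct the bias
# this non-uniform sampling introduces; normalized by the max.
beta = 0.4
weights = (len(priorities) * probs[batch]) ** (-beta)
weights /= weights.max()
```

The parallel to the bias discussion earlier in the thread is direct: both methods skew training toward high-error examples, but prioritized experience replay explicitly reweights to compensate, while selective backprop accepts the biased objective.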