Thank you very much for the explanations and for the soft skills in how to read between the lines of an algorithm!
@TimScarfe 4 years ago
I think that BERT had two pre-training tasks, i.e. 1) masked language modelling and 2) next-sentence prediction, and it would oscillate between them during training. This is in contrast to MT-DNN, Microsoft's addition to BERT, which did multi-task training at the fine-tuning stage too. Multi-task learning certainly seems to give regularisation and improved generalisation due to more data + weight sharing. (EDIT: correction, MT-DNN pre-trains on a multi-task mixture, then fine-tunes on the end tasks.)
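To make the contrast concrete, here is a toy sketch of the two regimes with a shared encoder and two task heads (the layer sizes and losses are made-up stand-ins, not real BERT/MT-DNN code):

```python
import random
import torch
import torch.nn as nn

shared = nn.Linear(16, 16)                                    # stand-in for the shared encoder
heads = {"mlm": nn.Linear(16, 8), "nsp": nn.Linear(16, 2)}    # stand-in task heads
params = list(shared.parameters()) + [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.SGD(params, lr=0.1)

def task_loss(task, x):
    return heads[task](shared(x)).pow(2).mean()               # dummy objective per task

for step in range(10):
    x = torch.randn(4, 16)
    # alternating / multi-task-mixture style: sample one task per step
    loss = task_loss(random.choice(["mlm", "nsp"]), x)
    # joint style would instead sum both objectives on the same batch:
    # loss = task_loss("mlm", x) + task_loss("nsp", x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```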
@valthorhalldorsson9300 4 years ago
Very interesting paper! This addresses the most common hang-up for me when I've considered multi-task problems, i.e. why the gradient of (L1 + L2) would actually point to anything useful.
@YannicKilcher 4 years ago
Well, since the optimal weight configuration will be able to solve both tasks, it will be at a point where both L1 and L2 are zero, so descending on both makes sense in principle.
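A tiny sketch of that point with toy quadratic losses (w stands in for the shared weights): the gradient of the summed loss is just the sum of the per-task gradients, and at a configuration that solves both tasks both of them vanish.

```python
import torch

w = torch.randn(3, requires_grad=True)        # toy shared parameters
x1, x2 = torch.randn(3), torch.randn(3)       # stand-ins for two tasks' data

def L1(w): return (w @ x1 - 1.0) ** 2         # toy per-task losses
def L2(w): return (w @ x2 + 1.0) ** 2

g1 = torch.autograd.grad(L1(w), w)[0]         # task-1 gradient
g2 = torch.autograd.grad(L2(w), w)[0]         # task-2 gradient
g_sum = torch.autograd.grad(L1(w) + L2(w), w)[0]
assert torch.allclose(g_sum, g1 + g2)         # grad(L1 + L2) = grad(L1) + grad(L2)
```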
@anthrond 4 years ago
Tiny quibble with your reasoning around 22:45. You say you could decrease the learning rate and their theorem would no longer hold, and therefore their method would not be quicker. But that "therefore" does not follow from the theorem not holding: yes, their theorem won't hold, but their method could theoretically still be better; it's just not shown by THIS theorem.
@mechanicalmonk2020 4 years ago
This feels similar to the auxiliary output proposed in a paper about playing the VizDoom deathmatch. They used an auxiliary output (predict some pieces of the game state given a frame) to help improve the policy. (I'm only 4 minutes in.)
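Roughly the shape of that auxiliary-output idea as I understood it (a toy sketch with made-up sizes and targets, not the paper's actual architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # shared features from an encoded frame
policy_head = nn.Linear(64, 4)                        # main output: action logits
aux_head = nn.Linear(64, 8)                           # auxiliary output: predicted game-state features
opt = torch.optim.Adam([*trunk.parameters(),
                        *policy_head.parameters(),
                        *aux_head.parameters()], lr=1e-3)

frames = torch.randn(16, 32)                          # stand-in for encoded frames
actions = torch.randint(0, 4, (16,))                  # stand-in action targets
state_feats = torch.randn(16, 8)                      # stand-in game-state targets

h = trunk(frames)
# the auxiliary loss shapes the shared trunk alongside the main policy loss
loss = F.cross_entropy(policy_head(h), actions) + F.mse_loss(aux_head(h), state_feats)
opt.zero_grad()
loss.backward()
opt.step()
```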
@jrkirby93 4 years ago
Does this require multi-task problems to work? Looking at the math behind it, having multiple tasks doesn't appear to be a prerequisite. I suppose multi-task problems are more likely to have conflicting gradients, but that doesn't mean other problems never will. Perhaps classification problems would have conflicting gradients early in training between examples from two different classes.
@YannicKilcher 4 years ago
True, but something like a single multi-class classification is much less likely to exhibit the necessary difference in gradient magnitudes. Also it looks to me like the conflicting gradients must be "systematic", i.e. they are not averaged out by simply aggregating over minibatches of data samples.
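For concreteness, the kind of check I mean, as a sketch with an arbitrary stand-in model: average each task's gradient over a minibatch, flatten it across all shared parameters, and look at the sign of the cosine between the two vectors. The conflict is "systematic" if it survives that averaging.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    # flatten the gradient of `loss` w.r.t. all shared parameters into one vector
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

model = torch.nn.Linear(8, 2)                 # stand-in shared model
params = list(model.parameters())
x = torch.randn(64, 8)                        # one minibatch, shared by both tasks
y1, y2 = torch.randn(64, 2), torch.randn(64, 2)

g1 = flat_grad(F.mse_loss(model(x), y1), params)   # task-1 gradient, averaged over the batch
g2 = flat_grad(F.mse_loss(model(x), y2), params)   # task-2 gradient, averaged over the batch
cos = F.cosine_similarity(g1, g2, dim=0)
print("conflicting" if cos < 0 else "not conflicting", float(cos))
```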
@zahrahafezi4547 1 year ago
Sorry, I'm struggling with some things. First, should we check for conflict at each single weight's gradient? And why are they 2D? You all draw 2D gradients, even in the paper.
@herp_derpingson 4 years ago
I was trying to visualize the vector manipulation in my head (please excuse my ASCII art). Adding the vector pairs _\ and _/ both leads to a vector going in the north-west direction; in the case of the conflicting pair, the NW-going vector just has a smaller magnitude. Can't we achieve the same effect by simply increasing the step size as the angle between the vector pair approaches 180 degrees? Something like: new_step_size = step_size + lambda * sin(theta - 90), sketched in code below.
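In code, roughly (just an untested sketch of that formula; theta is the angle between the two flattened task gradients and lambda is a free coefficient):

```python
import math
import torch

def scaled_step_size(step_size, g1, g2, lam=0.1):
    # sin(theta - 90 deg) = -cos(theta), so the step grows as the pair approaches 180 deg
    cos_theta = torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)
    theta = math.acos(cos_theta.clamp(-1.0, 1.0).item())          # angle in radians
    return step_size + lam * math.sin(theta - math.pi / 2)

g1, g2 = torch.randn(10), torch.randn(10)               # toy flattened task gradients
update = -scaled_step_size(0.01, g1, g2) * (g1 + g2)    # step on the summed gradient
```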
@YannicKilcher 4 years ago
Interesting idea. I don't know; their method heavily relies on the curvature of the loss function in different directions, and I don't think that's captured by this. But then again, they never explicitly use the curvature in the practical algorithm, so who knows...
@yuhangguo2409 3 years ago
Thanks for sharing. Gradients always conflict! Haha.