Thank you very much for the explanations and for the soft skills in how to read between the lines of an algorithm!
@TimScarfe 4 years ago
I think that BERT had two pre-training tasks, i.e. 1) masked language modelling and 2) next-sentence prediction, and it would oscillate between them during training. This is in contrast to MT-DNN, Microsoft's addition to BERT, which did multi-task training at the fine-tuning stage too. Multi-task learning certainly seems to give regularisation and improved generalisation due to more data + weight sharing. (EDIT: correction, MT-DNN pre-trains on a multi-task mixture, then fine-tunes on the end tasks.)
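To make the contrast concrete, here is a toy sketch of the two regimes with a shared encoder and two task heads (the layer sizes and losses are made-up stand-ins, not real BERT/MT-DNN code):

```python
import random
import torch
import torch.nn as nn

shared = nn.Linear(16, 16)                                    # stand-in for the shared encoder
heads = {"mlm": nn.Linear(16, 8), "nsp": nn.Linear(16, 2)}    # stand-in task heads
params = list(shared.parameters()) + [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.SGD(params, lr=0.1)

def task_loss(task, x):
    return heads[task](shared(x)).pow(2).mean()               # dummy objective per task

for step in range(10):
    x = torch.randn(4, 16)
    # alternating / multi-task-mixture style: sample one task per step
    loss = task_loss(random.choice(["mlm", "nsp"]), x)
    # joint style would instead sum both objectives on the same batch:
    # loss = task_loss("mlm", x) + task_loss("nsp", x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```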
@valthorhalldorsson9300 4 years ago
Very interesting paper! This addresses the most common hang-up for me when I've considered multi-task problems, i.e. why the gradient of (L1 + L2) would actually point to anything useful.
@YannicKilcher 4 years ago
Well, since the optimal weight configuration will be able to solve both tasks, it will be at a point where both L1 and L2 are zero, so descending on both makes sense in principle.
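A tiny sketch of that point with toy quadratic losses (w stands in for the shared weights): the gradient of the summed loss is just the sum of the per-task gradients, and at a configuration that solves both tasks both of them vanish.

```python
import torch

w = torch.randn(3, requires_grad=True)        # toy shared parameters
x1, x2 = torch.randn(3), torch.randn(3)       # stand-ins for two tasks' data

def L1(w): return (w @ x1 - 1.0) ** 2         # toy per-task losses
def L2(w): return (w @ x2 + 1.0) ** 2

g1 = torch.autograd.grad(L1(w), w)[0]         # task-1 gradient
g2 = torch.autograd.grad(L2(w), w)[0]         # task-2 gradient
g_sum = torch.autograd.grad(L1(w) + L2(w), w)[0]
assert torch.allclose(g_sum, g1 + g2)         # grad(L1 + L2) = grad(L1) + grad(L2)
```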
@anthrond 4 years ago
Tiny quibble with your reasoning around 22:45. You say you could decrease the learning rate and their theorem would no longer hold, and therefore their method would not be quicker. But that "therefore" does not follow from the theorem not holding: yes, their theorem won't hold, but their method could theoretically still be better; it's just not shown by THIS theorem.
@mechanicalmonk2020 4 years ago
This feels similar to the auxiliary output proposed in a paper about playing the VizDoom deathmatch. They used an auxiliary output (predict some pieces of the game state given a frame) to help improve the policy. (I'm only 4 minutes in.)
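Roughly the shape of that auxiliary-output idea as I understood it (a toy sketch with made-up sizes and targets, not the paper's actual architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # shared features from an encoded frame
policy_head = nn.Linear(64, 4)                        # main output: action logits
aux_head = nn.Linear(64, 8)                           # auxiliary output: predicted game-state features
opt = torch.optim.Adam([*trunk.parameters(),
                        *policy_head.parameters(),
                        *aux_head.parameters()], lr=1e-3)

frames = torch.randn(16, 32)                          # stand-in for encoded frames
actions = torch.randint(0, 4, (16,))                  # stand-in action targets
state_feats = torch.randn(16, 8)                      # stand-in game-state targets

h = trunk(frames)
# the auxiliary loss shapes the shared trunk alongside the main policy loss
loss = F.cross_entropy(policy_head(h), actions) + F.mse_loss(aux_head(h), state_feats)
opt.zero_grad()
loss.backward()
opt.step()
```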
@jrkirby93 4 years ago
Does this require multi-task problems to work? Looking at the math behind it, having multiple tasks doesn't appear to be a prerequisite. I suppose multi-task problems are more likely to have conflicting gradients, but that doesn't mean other problems never will. Perhaps classification problems would have conflicting gradients early in training between examples from two different classes.
@YannicKilcher 4 years ago
True, but something like a single multi-class classification is much less likely to exhibit the necessary difference in gradient magnitudes. Also it looks to me like the conflicting gradients must be "systematic", i.e. they are not averaged out by simply aggregating over minibatches of data samples.
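For concreteness, the kind of check I mean, as a sketch with an arbitrary stand-in model: average each task's gradient over a minibatch, flatten it across all shared parameters, and look at the sign of the cosine between the two vectors. The conflict is "systematic" if it survives that averaging.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    # flatten the gradient of `loss` w.r.t. all shared parameters into one vector
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

model = torch.nn.Linear(8, 2)                 # stand-in shared model
params = list(model.parameters())
x = torch.randn(64, 8)                        # one minibatch, shared by both tasks
y1, y2 = torch.randn(64, 2), torch.randn(64, 2)

g1 = flat_grad(F.mse_loss(model(x), y1), params)   # task-1 gradient, averaged over the batch
g2 = flat_grad(F.mse_loss(model(x), y2), params)   # task-2 gradient, averaged over the batch
cos = F.cosine_similarity(g1, g2, dim=0)
print("conflicting" if cos < 0 else "not conflicting", float(cos))
```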
@zahrahafezi4547 1 year ago
Sorry, I'm struggling with some things. First, should we check for conflict at each single weight's gradient? And why are they 2D? You all draw 2D gradients, even in the paper.
@herp_derpingson 4 years ago
I was trying to visualize the vector manipulation in my head (please excuse my ASCII art). Adding the vector pairs _\ and _/ both leads to a vector going in the north-west direction; in the case of the conflicting pair, the NW-going vector just has a smaller magnitude. Can't we achieve the same effect by simply increasing the step size as the angle between the vector pair approaches 180 degrees? Something like: new_step_size = step_size + lambda * sin(theta - 90), sketched in code below.
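In code, roughly (just an untested sketch of that formula; theta is the angle between the two flattened task gradients and lambda is a free coefficient):

```python
import math
import torch

def scaled_step_size(step_size, g1, g2, lam=0.1):
    # sin(theta - 90 deg) = -cos(theta), so the step grows as the pair approaches 180 deg
    cos_theta = torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)
    theta = math.acos(cos_theta.clamp(-1.0, 1.0).item())          # angle in radians
    return step_size + lam * math.sin(theta - math.pi / 2)

g1, g2 = torch.randn(10), torch.randn(10)               # toy flattened task gradients
update = -scaled_step_size(0.01, g1, g2) * (g1 + g2)    # step on the summed gradient
```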
@YannicKilcher 4 years ago
Interesting idea. I don't know; their method heavily relies on the curvature of the loss function in different directions, and I don't think that's captured by this. But then again, they never explicitly use the curvature in the practical algorithm, so who knows...
@yuhangguo2409 3 years ago
Thanks for sharing. Gradients always conflict! Haha.