I don't understand most of your videos, yet I keep watching them.
@GodSeek · 3 years ago
hahaha
@anshul5243 · 3 years ago
The old format seemed better, mainly because of the space wastage in this one. The title seems redundant, since the YouTube video title already has the name of the paper, and the logo could be better off as a watermark in a smaller size.
@CosmiaNebula · 2 years ago
Unlike what you stated, the positional encoding vectors do not get added to the token encoding vectors. The new token encoding vectors after each attention layer are merely a weighted sum of the previous token encoding vectors. See the paper's Equation (4).
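A toy illustration of the point above (my own sketch, not code from the paper or the video): relative-position information enters only the attention scores, while the layer output is a weighted sum of content vectors alone, so no positional vector is ever added into a token embedding. The `pos_score` table is a hypothetical scalar stand-in for the paper's content-to-position and position-to-content terms.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def disentangled_attention(content, pos_score):
    """content: list of token vectors; pos_score[i][j]: scalar
    position-based score (a stand-in for the paper's extra terms)."""
    n, d = len(content), len(content[0])
    out = []
    for i in range(n):
        # position contributes to the scores...
        scores = [dot(content[i], content[j]) + pos_score[i][j]
                  for j in range(n)]
        w = softmax(scores)
        # ...but the summed vectors are pure content
        out.append([sum(w[j] * content[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

With one-hot content rows, each output row is a convex combination of them, making it easy to see that nothing positional was mixed into the vectors themselves.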
@willemwestra · 3 years ago
Hi Yannic, absolutely love your videos, but I find the new recording setup quite a bit worse. The simple paper-only setup was very nice, clean, and distraction free. Moreover, the font rendering is also quite a bit worse. It has varied throughout your videos, possibly because you switched software, so just pick the editor and recording program combination that gives the best PDF rendering quality. Distraction-free viewing, crystal-clear rendering, and a good microphone are the things I appreciate most. Your voice and microphone are great, but I long for the old clean and crisp setup :)
@YannicKilcher · 3 years ago
Thanks a lot :)
@timdernedde993 · 3 years ago
The old layout was better, especially as a mobile user, where the screen is smaller and there is a completely unnecessary black bar on the right.
@G12GilbertProduction · 3 years ago
And so sumptuous.
@rbain16 · 3 years ago
I came to the comments to say the same. Not on mobile, but I'd still rather not have some screen taken up by your Twitter pic (what if it changes, too?).
@sonOfLiberty100 · 3 years ago
I like the old setup more; now it's smaller
@rohankashyap2252 · 3 years ago
Really awesome YouTube channel, lucky to get access to such great content. Thanks a lot!
@avatar098 · 3 years ago
Thank you so much for doing these videos. Helps me keep current with NLP!
@null4598 · 2 years ago
Cool. You make it easy. Thank you, Yannic.
@adamrak7560 · 3 years ago
I always feed the positional information before every attention stage. That seemed better, and always converged faster for me.
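A minimal sketch of what I take the commenter to mean (my interpretation, not the video's or the paper's method): re-add a sinusoidal positional encoding to the hidden states before every attention stage, instead of only once at the input. The `forward` helper and its identity-layer usage below are illustrative assumptions.

```python
import math

def sinusoidal_pe(n_pos, dim):
    """Classic 'Attention Is All You Need' sinusoidal encoding."""
    pe = []
    for pos in range(n_pos):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def forward(hidden, layers):
    """hidden: list of token vectors; layers: callables for the blocks."""
    pe = sinusoidal_pe(len(hidden), len(hidden[0]))
    for layer in layers:
        # refresh the positional information before each stage
        hidden = [[h + p for h, p in zip(hrow, prow)]
                  for hrow, prow in zip(hidden, pe)]
        hidden = layer(hidden)
    return hidden
```

With identity layers the positional signal visibly accumulates once per stage, which is the difference from adding it only at the input.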
@biesseti · 3 years ago
Found this video and subscribed. You do it well.
@nghiapham1632 · 1 year ago
Thanks for your great explanation
@haukurpalljonsson8233 · 3 years ago
The positional information does not leak into later layers, at least not directly. The positional information is only in the attention scores, which are then softmaxed and only multiplied with the content information.
@alpers.2123 · 3 years ago
How much do you practice pronouncing the author names? :)
@burakyildiz8921 · 3 years ago
It seems that he made them up :)
@lloydgreenwald954 · 1 year ago
Very well done.
@CppExpedition · 2 years ago
BRILLIANT! :)
@etiennetiennetienne · 3 years ago
I am not sure that feeding position at the first layer means their architecture already "agglomerates" position. Values are produced only from content, but are "weighted" by the hybrid attention. In themselves the values are just a clever mix of other values, but the positional encoding is not really part of the vector itself, like it would be with direct summation or concatenation.
@rezas2626 · 3 years ago
Awesome video! Do RANDOM FEATURE ATTENTION next please!
@mschnell75 · 3 years ago
Why isn't this called ERNIE?
@MehrdadNsr · 3 years ago
I think we already have a model named ERNIE!
@peterrobinson7748 · 3 years ago
Because then it'll rhyme with BERNIE, one of the greatest enemies of America.
@ChlorieHCl · 3 years ago
Because there are already at least 2 models named ERNIE...
@Kram1032 · 3 years ago
3:13 OMG so that's why I am hungry!
@andres_pq · 3 years ago
Next do a video about "A Straightforward Framework for Video Retrieval Using CLIP" 👀
@MrJaggy123 · 3 years ago
So GLUE was deprecated when submissions surpassed human performance. This paper's submission has done the same thing for SuperGLUE (alongside another submission which also does so). Is it time for a new benchmark again? What are your thoughts on what "the benchmark after SuperGLUE" would look like?
@dr.mikeybee · 3 years ago
Yannic is all you need!
@susantaghosh504 · 2 years ago
Awesome
@sandraviknander7898 · 3 years ago
If content and position were truly disentangled all the way through the network, how would the network be able to learn the transformations to the positional vectors it needs in order to route context information? 🤔
@drozen214 · 3 years ago
This paper makes me wonder whether we really need to use a whole vector to represent a position
@frenchmarty7446 · 2 years ago
What is the alternative?
@herp_derpingson · 3 years ago
3:25 THICC vector :)
10:10 Yeah, this addition thing always felt dirty in the original "Attention Is All You Need" paper. I am glad I was not the only one who felt so.
14:13 Never mind, we end up adding them anyways, just with extra steps.
24:11 Is P learnt? Are we using the same P for all languages? There are two main types of languages: subject-verb-object and subject-object-verb. I don't think we should use the same learnt values of P for both, as position works completely differently in the two types.
35:30 Never mind, we end up using absolute positions anyways.
41:35 Fermat's last theorem: "I could prove it but I don't have enough battery"
I thought you accidentally recorded your video in the wrong aspect ratio LOL
@yaaank6725 · 3 years ago
Why is the Flash traveling back through time to talk about a paper from half a year ago?
@GodSeek · 3 years ago
So why did you say in the beginning that the "worst" case is a disentangled embedding (half for position, half for content), when this paper proposes exactly the disentangled one?
@frenchmarty7446 · 2 years ago
He meant the worst case would be the model learning its own disentangled embedding, which would mean some chunk of the "content" vector is being occupied by position information.
@frenchmarty7446 · 2 years ago
I'm not sure that feeding information into the model at the beginning is necessarily better than at the end. Like you said yourself, the model would have to learn how to propagate that information through. That might be more of a bottleneck than just waiting until the end. There might also be a useful inductive bias here that's close to how humans read (you don't read a word and have both its relative and absolute position in mind).
@Dynidittez · 3 years ago
Hi, looking at the models, they seem to have normal versions and versions fine-tuned on MNLI. Do the ones fine-tuned on MNLI perform better on most benchmarks? Also, on their git repo they show scores like 85.6/86.9. Is the second score meant to represent the fine-tuned MNLI version's score?
@andres_pq · 3 years ago
Does anyone know an advanced PyTorch course? One that includes things like creating custom layers, custom training loops, and handling weird stuff. I have some research ideas that I don't know how to implement.
@snippletrap · 3 years ago
FastAI. The second half of the course is all custom implementations.
@andres_pq · 3 years ago
@@snippletrap thanks
@谢安-k6t · 3 years ago
@@snippletrap I was using the fast.ai framework about two years ago but quit it because the framework was harder to customize than PyTorch and has a bad API. Do you think its course is worth learning?
@snippletrap · 3 years ago
@@谢安-k6t Yes, I agree, I prefer vanilla PyTorch. I don't use the FastAI library because it's difficult to read the source when most functions rely on callbacks. I still highly recommend the course, you will learn a lot.
@谢安-k6t · 3 years ago
@@snippletrap Got it, thanks a lot.
@mathematicalninja2756 · 3 years ago
Every day we arrive at the future
@cerebralm · 3 years ago
The only thing I didn't like about the old layout was that it rendered PDFs in light mode. Not sure if there's a good way to run a vote on YouTube to see which of your audience prefers light mode and which would prefer dark mode, but that would be the only thing I would change if it were up to me :)
@Hank-y4u · 3 years ago
Hi Yannic, big fan here. Would you make a video about Meta Pseudo Labels?
@bishalsantra · 3 years ago
Just curious, what app are you using to annotate?
@timdernedde993 · 3 years ago
OneNote
@florianjug · 3 years ago
@@timdernedde993 Is this also true for the new setup used in this video?
@G12GilbertProduction · 3 years ago
New shade of BERT v2, but more metronomical.
@zhangshaojie9790 · 3 years ago
Can anyone explain to me the difference between the transformer encoder and decoder? Other than bidirectionality, autoencoding, and the extra FFW layer, the two architectures look the same to me. I keep hearing people say the decoder is better at scaling. Do people actually mean BERT and GPT?
@frenchmarty7446 · 2 years ago
The encoder and decoder are two components of the same model.
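A sketch of the concrete architectural difference behind the question (my own illustration, not from the video): encoder self-attention is unmasked, so every token can attend to every other token, while decoder self-attention applies a causal mask so position i attends only to positions j <= i. (Cross-attention to the encoder output is the decoder's other extra sub-layer.)

```python
def attention_mask(n, causal):
    # mask[i][j] is True where attention from position i to j is allowed
    return [[(j <= i) if causal else True for j in range(n)]
            for i in range(n)]

encoder_mask = attention_mask(4, causal=False)  # BERT-style, bidirectional
decoder_mask = attention_mask(4, causal=True)   # GPT-style, autoregressive
```

So yes, in practice "encoder-only" vs "decoder-only" models largely come down to this masking difference, which is why people often just say BERT vs GPT.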
@gavinmc5285 · 3 years ago
content - ok. positioning - ok. what about context?
@frenchmarty7446 · 2 years ago
What exactly do you mean by "context"? Like some kind of additional information not in the word vectors themselves? That would probably be something the model should learn on its own.
@gavinmc5285 · 2 years ago
@@frenchmarty7446 Ok, well around the 30 minute mark there is a breakdown of the merits of relative and absolute positioning, and the strength of either technique (or both) seems to be correlated with context. Leaving aside computational or processing power (if it is even relevant), the paper's analysis highlights the before-or-after options for adding absolute positioning (in this paper, at the end of the process). Nonetheless, the context 'factor', or 'solving' context (so as to allow accurate word embedding or prediction), remains, and surely the optimum solution (approximate or precise) would be to have, in a positional and hierarchical (content) vector or matrix set with relative values, some form of absolute feed from which absolute values could be accessed without necessarily having to position them as a priority before or after the relative value calculations are processed.
@frenchmarty7446 · 2 years ago
@@gavinmc5285 You didn't actually answer my question but ok... When you say "hierarchical" information, I assume you mean some kind of graph. Unstructured graphs have actually been tried before (BP-Transformer) with some success. If you mean some kind of structured graph based on grammar rules, then that is a bad idea. The entire purpose of the self-attention mechanism is to learn the relationships between tokens. The attention mechanism *creates* its own graph at every layer. Transformers are powerful (with large amounts of data) because they impose very little inductive bias. We don't tell the network what is or isn't important; it learns that on its own. Feeding in extra information that isn't in the data itself is just extra effort that only biases the network towards one particular way of looking at the data.
@gavinmc5285 · 2 years ago
@@frenchmarty7446 Ok then, to be more definitive: by 'context' I would understand such concepts as 'thrust', 'gist', 'essence' or 'meaning'. To interpret and apply context as relevant to the subject matter is a function of intelligence. To some extent a lack of supervision may be appropriate, depending on the instance, although it is unlikely that any algorithm (unsupervised or reinforced) that wanders too far from the context within which it is operating (or is supposed to be operating) is going to suddenly stumble on the parameters it needs to accurately determine values that require the appropriate context ('store / mall' is the example used in the paper's analysis). Not consistently, time and again, anyway.
@frenchmarty7446 · 2 years ago
@@gavinmc5285 That is literally *more* vague than just saying "context". You are being less definitive... I also don't know what you mean by "stumble" on the correct parameters. We don't stumble on parameters, we train them. And we do so very consistently. What do you mean by "wander outside the context"? You mean outside the data distribution? That's a different meaning of "context" and we train for that as well. Where exactly are you unsatisfied? You say (paraphrasing) "it is unlikely that any algorithm... is going to stumble on the right parameters to accurately determine the right values". Accurately based on what? What specifically does the network have to output to meet your standard of understanding context?
@sajjadayobi688 · 3 years ago
from youtube import Yannic
paper = 'any complex architecture'
easy_to_learn = Yannic(paper)
@sergiomanuel2206 · 3 years ago
Hello!! First of all, thanks for the video. I don't like the new setup; the logo takes up a lot of screen space, although the title is okay.
@kimchi_taco · 3 years ago
Disentangled attention was already handled by Transformer-XL when it introduced relative positional embeddings. In my opinion, there is no contribution here.
@paveltarashkevich8387 · 3 years ago
The old layout was better. Text resolution was better. Screen space usage was better.
@pensiveintrovert4318 · 3 years ago
Too many layers pollute information that may have been decisive when pristine.
@TechyBen · 3 years ago
I came for the old Amiga game... I stayed for the new AI algorithm.
@sedenions · 3 years ago
You are doing a good job. Talk about biologically plausible neural networks next, please.
@GreenManorite · 3 years ago
Why is that an interesting topic? Not being snarky, just trying to understand motivation for the biological parallelism.
@sedenions · 3 years ago
@@GreenManorite I'm biased, I majored in neuroscience and am currently switching careers. I guess this sentiment of mine comes from an interest in how researchers can better build cognitive AI. It seems like many of the early neural networks were 'neural' in name only. We are getting closer and closer to biologically plausible nets, but like you said, they're not that interesting to most.