I don't understand most of your videos, yet I keep watching them.
@GodSeek · 3 years ago
hahaha
@anshul5243 · 3 years ago
The old format seemed better, mainly because of the space wastage in this one. The title seems redundant, since the YouTube video title already has the name of the paper, and the logo could be better off as a watermark in a smaller size.
@CosmiaNebula · 2 years ago
Unlike what you stated, the positional encoding vectors do not get added to the token encoding vectors. The new token encoding vectors after each attention layer are merely a weighted sum of the previous token encoding vectors. See the paper's Equation (4).
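A toy illustration of the point above (my own sketch, not code from the paper or the video): relative-position information enters only the attention scores, while the layer output is a weighted sum of content vectors alone, so no positional vector is ever added into a token embedding. The `pos_score` table is a hypothetical scalar stand-in for the paper's content-to-position and position-to-content terms.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def disentangled_attention(content, pos_score):
    """content: list of token vectors; pos_score[i][j]: scalar
    position-based score (a stand-in for the paper's extra terms)."""
    n, d = len(content), len(content[0])
    out = []
    for i in range(n):
        # position contributes to the scores...
        scores = [dot(content[i], content[j]) + pos_score[i][j]
                  for j in range(n)]
        w = softmax(scores)
        # ...but the summed vectors are pure content
        out.append([sum(w[j] * content[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

With one-hot content rows, each output row is a convex combination of them, making it easy to see that nothing positional was mixed into the vectors themselves.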
@willemwestra · 3 years ago
Hi Yannic, absolutely love your videos, but I find the new recording setup quite a bit worse. The simple paper-only setup was very nice, clean, and distraction free. Moreover, the font rendering is also quite a bit worse. It has varied throughout your videos, possibly because you switched software, so just pick the editor and recording program combination that gives the best PDF rendering quality. Distraction-free viewing, crystal-clear rendering, and a good microphone are the things I appreciate most. Your voice and microphone are great, but I long for the old clean and crisp setup :)
@YannicKilcher · 3 years ago
Thanks a lot :)
@timdernedde993 · 3 years ago
The old layout was better, especially as a mobile user, where the screen is smaller and there is a completely unnecessary black bar on the right.
@G12GilbertProduction · 3 years ago
And so sumptuous.
@rbain16 · 3 years ago
I came to the comments to say the same. Not on mobile, but I'd still rather not have some screen taken up by your Twitter pic (what if it changes, too?).
@sonOfLiberty100 · 3 years ago
I like the old setup more; now it's smaller
@rohankashyap2252 · 3 years ago
Really awesome YouTube channel, lucky to get access to such great content. Thanks a lot!
@avatar098 · 3 years ago
Thank you so much for doing these videos. Helps me keep current with NLP!
@null4598 · 2 years ago
Cool. You make it easy. Thank you, Yannic.
@adamrak7560 · 3 years ago
I always feed the positional information before every attention stage. That seemed better, and always converged faster for me.
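A minimal sketch of what I take the commenter to mean (my interpretation, not the video's or the paper's method): re-add a sinusoidal positional encoding to the hidden states before every attention stage, instead of only once at the input. The `forward` helper and its identity-layer usage below are illustrative assumptions.

```python
import math

def sinusoidal_pe(n_pos, dim):
    """Classic 'Attention Is All You Need' sinusoidal encoding."""
    pe = []
    for pos in range(n_pos):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def forward(hidden, layers):
    """hidden: list of token vectors; layers: callables for the blocks."""
    pe = sinusoidal_pe(len(hidden), len(hidden[0]))
    for layer in layers:
        # refresh the positional information before each stage
        hidden = [[h + p for h, p in zip(hrow, prow)]
                  for hrow, prow in zip(hidden, pe)]
        hidden = layer(hidden)
    return hidden
```

With identity layers the positional signal visibly accumulates once per stage, which is the difference from adding it only at the input.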
@biesseti · 3 years ago
Found this video and subscribed. You do it well.
@nghiapham1632 · 1 year ago
Thanks for your great explanation
@haukurpalljonsson8233 · 3 years ago
The positional information does not leak into later layers, at least not directly. The positional information is only in the attention scores, which are then softmaxed and only multiplied with the content information.
@alpers.2123 · 3 years ago
How much do you practice pronouncing the author names? :)
@burakyildiz8921 · 3 years ago
It seems that he made them up :)
@lloydgreenwald954 · 1 year ago
Very well done.
@CppExpedition · 2 years ago
BRILLIANT! :)
@etiennetiennetienne · 3 years ago
I am not sure that feeding position at the first layer means their architecture already "agglomerates" position. Values are produced only from content, but are "weighted" by the hybrid attention. In themselves the values are just a clever mix of other values, but the positional encoding is not really part of the vector itself, like it would be with direct summation or concatenation.
@rezas2626 · 3 years ago
Awesome video! Do RANDOM FEATURE ATTENTION next please!
@mschnell75 · 3 years ago
Why isn't this called ERNIE?
@MehrdadNsr · 3 years ago
I think we already have a model named ERNIE!
@peterrobinson7748 · 3 years ago
Because then it'll rhyme with BERNIE, one of the greatest enemies of America.
@ChlorieHCl · 3 years ago
Because there are already at least 2 models named ERNIE...
@Kram1032 · 3 years ago
3:13 OMG so that's why I am hungry!
@andres_pq · 3 years ago
Next do a video about "A Straightforward Framework for Video Retrieval Using CLIP" 👀
@MrJaggy123 · 3 years ago
So GLUE was deprecated when submissions surpassed human performance. This paper's submission has done the same thing for SuperGLUE (alongside another submission which also does so). Is it time for a new benchmark again? What are your thoughts on what "the benchmark after SuperGLUE" would look like?
@dr.mikeybee · 3 years ago
Yannic is all you need!
@susantaghosh504 · 2 years ago
Awesome
@sandraviknander7898 · 3 years ago
If content and position were truly disentangled all the way through the network, how would the network be able to learn the transformations to the positional vectors it needs in order to route context information? 🤔
@drozen214 · 3 years ago
This paper makes me wonder whether we really need to use a whole vector to represent a position
@frenchmarty7446 · 2 years ago
What is the alternative?
@herp_derpingson · 3 years ago
3:25 THICC vector :)
10:10 Yeah, this addition thing always felt dirty in the original "Attention Is All You Need" paper. I am glad I was not the only one who felt so.
14:13 Never mind, we end up adding them anyways, just with extra steps.
24:11 Is P learnt? Are we using the same P for all languages? There are two main types of languages: subject-verb-object and subject-object-verb. I don't think we should use the same learnt values of P for both, as position works completely differently in the two types.
35:30 Never mind, we end up using absolute positions anyways.
41:35 Fermat's last theorem: "I could prove it but I don't have enough battery"
I thought you accidentally recorded your video in the wrong aspect ratio LOL
@yaaank6725 · 3 years ago
Why is the Flash traveling back through time to talk about a paper from half a year ago?
@GodSeek · 3 years ago
So why did you say in the beginning that the "worst" case is a disentangled embedding (half for position, half for content), when this paper proposes exactly the disentangled one?
@frenchmarty7446 · 2 years ago
He meant the worst case would be the model learning its own disentangled embedding, which would mean some chunk of the "content" vector is being occupied by position information.
@frenchmarty7446 · 2 years ago
I'm not sure that feeding information into the model at the beginning is necessarily better than at the end. Like you said yourself, the model would have to learn how to propagate that information through. That might be more of a bottleneck than just waiting until the end. There might also be a useful inductive bias here that's close to how humans read (you don't read a word and have both its relative and absolute position in mind).
@Dynidittez · 3 years ago
Hi, looking at the models, they seem to have normal versions and versions fine-tuned on MNLI. Do the ones fine-tuned on MNLI perform better on most benchmarks? Also, on their git repo they show scores like 85.6/86.9. Is the second score meant to represent the fine-tuned MNLI version's score?
@andres_pq · 3 years ago
Does anyone know an advanced PyTorch course? One that includes things like creating custom layers, custom training loops, and handling weird stuff. I have some research ideas that I don't know how to implement.
@snippletrap · 3 years ago
FastAI. The second half of the course is all custom implementations.
@andres_pq · 3 years ago
@@snippletrap thanks
@谢安-k6t · 3 years ago
@@snippletrap I was using the fast.ai framework about two years ago but quit it because the framework was harder to customize than PyTorch and has a bad API. Do you think its course is worth learning?
@snippletrap · 3 years ago
@@谢安-k6t Yes, I agree, I prefer vanilla PyTorch. I don't use the FastAI library because it's difficult to read the source when most functions rely on callbacks. I still highly recommend the course, you will learn a lot.
@谢安-k6t · 3 years ago
@@snippletrap Got it, thanks a lot.
@mathematicalninja2756 · 3 years ago
Every day we arrive at the future
@cerebralm · 3 years ago
The only thing I didn't like about the old layout was that it rendered PDFs in light mode. Not sure if there's a good way to run a vote on YouTube to see which of your audience prefers light mode and which would prefer dark mode, but that would be the only thing I would change if it were up to me :)
@Hank-y4u · 3 years ago
Hi Yannic, big fan here. Would you make a video about Meta Pseudo Labels?
@bishalsantra · 3 years ago
Just curious, what app are you using to annotate?
@timdernedde993 · 3 years ago
OneNote
@florianjug · 3 years ago
@@timdernedde993 Is this also true for the new setup used in this video?
@G12GilbertProduction · 3 years ago
New shade of BERT v2, but more metronomical.
@zhangshaojie9790 · 3 years ago
Can anyone explain to me the difference between the transformer encoder and decoder? Other than bidirectionality, autoencoding, and the extra FFW layer, the two architectures look the same to me. I keep hearing people say the decoder is better at scaling. Do people actually mean BERT and GPT?
@frenchmarty7446 · 2 years ago
The encoder and decoder are two components of the same model.
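A sketch of the concrete architectural difference behind the question (my own illustration, not from the video): encoder self-attention is unmasked, so every token can attend to every other token, while decoder self-attention applies a causal mask so position i attends only to positions j <= i. (Cross-attention to the encoder output is the decoder's other extra sub-layer.)

```python
def attention_mask(n, causal):
    # mask[i][j] is True where attention from position i to j is allowed
    return [[(j <= i) if causal else True for j in range(n)]
            for i in range(n)]

encoder_mask = attention_mask(4, causal=False)  # BERT-style, bidirectional
decoder_mask = attention_mask(4, causal=True)   # GPT-style, autoregressive
```

So yes, in practice "encoder-only" vs "decoder-only" models largely come down to this masking difference, which is why people often just say BERT vs GPT.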
@gavinmc5285 · 3 years ago
content - ok. positioning - ok. what about context?
@frenchmarty7446 · 2 years ago
What exactly do you mean by "context"? Like some kind of additional information not in the word vectors themselves? That would probably be something the model should learn on its own.
@gavinmc5285 · 2 years ago
@@frenchmarty7446 Ok, well around the 30 minute mark there is a breakdown of the merits of relative and absolute positioning, and the strength of either technique (or both) seems to be correlated with context. Leaving aside computational or processing power (if it is even relevant), the paper's analysis highlights the before-or-after options for adding absolute positioning (in this paper, at the end of the process). Nonetheless, the context 'factor', or 'solving' context (so as to allow accurate word embedding or prediction), remains, and surely the optimum solution (approximate or precise) would be to have, in a positional and hierarchical (content) vector or matrix set with relative values, some form of absolute feed from which absolute values could be accessed without necessarily having to position them as a priority before or after the relative value calculations are processed.
@frenchmarty7446 · 2 years ago
@@gavinmc5285 You didn't actually answer my question but ok... When you say "hierarchical" information, I assume you mean some kind of graph. Unstructured graphs have actually been tried before (BP-Transformer) with some success. If you mean some kind of structured graph based on grammar rules, then that is a bad idea. The entire purpose of the self-attention mechanism is to learn the relationships between tokens. The attention mechanism *creates* its own graph at every layer. Transformers are powerful (with large amounts of data) because they impose very little inductive bias. We don't tell the network what is or isn't important; it learns that on its own. Feeding in extra information that isn't in the data itself is just extra effort that only biases the network towards one particular way of looking at the data.
@gavinmc5285 · 2 years ago
@@frenchmarty7446 Ok then, to be more definitive: by 'context' I would understand such concepts as 'thrust', 'gist', 'essence' or 'meaning'. To interpret and apply context as relevant to the subject matter is a function of intelligence. To some extent a lack of supervision may be appropriate, depending on the instance, although it is unlikely that any algorithm (unsupervised or reinforced) that wanders too far from the context within which it is operating (or is supposed to be operating) is going to suddenly stumble on the parameters it needs to accurately determine values that require the appropriate context ('store / mall' is the example used in the paper's analysis). Not consistently, time and again, anyway.
@frenchmarty7446 · 2 years ago
@@gavinmc5285 That is literally *more* vague than just saying "context". You are being less definitive... I also don't know what you mean by "stumble" on the correct parameters. We don't stumble on parameters, we train them. And we do so very consistently. What do you mean by "wander outside the context"? You mean outside the data distribution? That's a different meaning of "context" and we train for that as well. Where exactly are you unsatisfied? You say (paraphrasing) "it is unlikely that any algorithm... is going to stumble on the right parameters to accurately determine the right values". Accurately based on what? What specifically does the network have to output to meet your standard of understanding context?
@sajjadayobi688 · 3 years ago
from youtube import Yannic
paper = 'any complex architecture'
easy_to_learn = Yannic(paper)
@sergiomanuel2206 · 3 years ago
Hello!! First of all, thanks for the video. I don't like the new setup; the logo takes up a lot of screen space, although the title is okay.
@kimchi_taco · 3 years ago
Disentangled attention was already handled by Transformer-XL when it introduced relative positional embeddings. In my opinion, there is no contribution here.
@paveltarashkevich8387 · 3 years ago
The old layout was better. Text resolution was better. Screen space usage was better.
@pensiveintrovert4318 · 3 years ago
Too many layers pollute information that may have been decisive when pristine.
@TechyBen · 3 years ago
I came for the old Amiga game... I stayed for the new AI algorithm.
@sedenions · 3 years ago
You are doing a good job. Talk about biologically plausible neural networks next, please.
@GreenManorite · 3 years ago
Why is that an interesting topic? Not being snarky, just trying to understand motivation for the biological parallelism.
@sedenions · 3 years ago
@@GreenManorite I'm biased, I majored in neuroscience and am currently switching careers. I guess this sentiment of mine comes from an interest in how researchers can better build cognitive AI. It seems like many of the early neural networks were 'neural' in name only. We are getting closer and closer to biologically plausible nets, but like you said, they're not that interesting to most.