I have been procrastinating reading the paper until now and you just made a video, perfect.
@AICoffeeBreak 3 years ago
You were not procrastinating. You were waiting for us to make the video. 😂
@harumambaru 3 years ago
55 views, I'm an early bird! I hope you get enough money for coffee from sponsors :) I am not mocking, I'm really happy that even young channels are supported by sponsors, and so happy that this sponsor can be helpful for most of the viewers.
@AICoffeeBreak 3 years ago
Thanks! I can totally relate to your point. I feel the same when it comes to small YouTubers I love.
@harumambaru 3 years ago
@@AICoffeeBreak Could you list a couple of small YouTubers you love? I am into 3blue1brown, Yannic Kilcher and Two Minute Papers, but they are all pretty huge.
@AICoffeeBreak 3 years ago
Small but sponsored? No (except Sabine Hossenfelder, but she is not small). Just small: Machine Learning Street Talk, Alfredo Canziani, Henry AI Labs, Jay Alammar, The AI Epiphany, Aladdin Persson, Gradient Dude, vcubingx
@harumambaru 3 years ago
@@AICoffeeBreak Wow, you made my weekend! Instead of watching Monster Hunter with Milla Jovovich, I am going to watch Sabine Hossenfelder's protein folding videos.
@beizhou4025 3 years ago
The animation is awesome. Thank you for taking the effort!
@AICoffeeBreak 3 years ago
Glad you liked it!
@prajwalsood1350 3 years ago
Can't thank you enough. I have to present this paper in my class and this helps me a lot.
@cipritom 2 years ago
In addition, I love the sound effects of the layer growing! Nice video!
@AICoffeeBreak 2 years ago
Thanks! Sound effects don't add much most of the time, but at the right spots they can trigger a sort of 3D effect.
@michaellellouch3682 3 years ago
Cool stuff. Thanks for keeping us up to date on papers outside of our domain
@Mrbits01 3 years ago
The first time I heard the sound effects you used when expanding stuff (parameters, encoder size), I literally thought it was my stomach growling. Darn it, right when it was getting serious :D
@AICoffeeBreak 3 years ago
Lol 😂 You are nominated for the funniest comment award.
@garyhuntress6871 1 year ago
I've been working on ViTMAE for 2 days. Thanks for this video, very interesting.
@AICoffeeBreak 1 year ago
Glad it was helpful! Keen to share what you are planning to do with it? :)
@garyhuntress6871 1 year ago
@@AICoffeeBreak I'm very interested in processing audio, particularly spectrograms. Ideally I think we need the equivalent of an LLM for acoustics: a really good embedding model for time series.
@nilsmuller9286 3 years ago
Awesome video! :) Didn't have the paper on my radar yet; now I'll have to read it.
@mattcoleman2819 2 years ago
Great video, thanks! I'm a bit confused about how transfer learning/downstream tasks will work with the encoder if its sequence length now needs to be increased. Or is the encoder sequence length set to the total # of patches, and attention masking/padding used during pretraining?
@soumyasarkar4100 3 years ago
Your content organisation is very good.
@AICoffeeBreak 3 years ago
Thanks! Glad we did a thing right.
@terryr9052 3 years ago
I am curious why non-overlapping patches were chosen. I would think that would lead to reconstruction errors.
@AICoffeeBreak 3 years ago
Thanks for the question. But could you please elaborate a little on why this would cause errors and why overlapping patches would ameliorate the problem? The patches are non-overlapping but tile the entire image, and attention allows patches to be informed about their fellow patches.
@terryr9052 3 years ago
@@AICoffeeBreak I don't really have a rigorous answer, but my intuition tells me that forcing the model to predict every boundary between patches is less accurate than a model that actually gets to see the boundary as data. Thinking more about it, though, I do understand that more patches means more work for the attention, which would counter the advantage gained from removing patches through masking...
@MengJiun_Chiou 2 years ago
Awesome explanation :)
@AICoffeeBreak 2 years ago
Thanks!
@sadface7457 3 years ago
certified classic
@Agrover112 2 years ago
That's a certified hood classic
@DerPylz 3 years ago
I'm old enough!
@Youkouleleh 3 years ago
Thanks for the video. Do you know why BERT would not use this strategy and just give the encoder the non-masked words?
@AICoffeeBreak 3 years ago
Because the masked words have to be predicted, meaning a representation has to be computed there, which in transformers (as much goes in as comes out) means that BERT has to process the mask tokens too. Not even the paper presented in the video escapes that curse, because the decoder has to see the masks again.
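To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch in plain Python. The figures are illustrative, not from the paper's tables: 196 patches as in a 224×224 image with 16×16 patches, and the paper's 75% mask ratio.

```python
num_tokens, mask_ratio = 196, 0.75

# BERT-style: MASK placeholders stay in the sequence, so the encoder's
# self-attention runs over the full length, masks included.
bert_encoder_len = num_tokens

# MAE-style: masked patches are dropped *before* the encoder, which then
# attends only over the visible 25%. The masks reappear only in the decoder.
mae_encoder_len = int(num_tokens * (1 - mask_ratio))
mae_decoder_len = num_tokens  # the "curse" above: the decoder sees all positions

assert mae_encoder_len == 49
# Self-attention cost grows quadratically with sequence length, so the
# MAE encoder's attention is roughly 16x cheaper at this mask ratio:
assert bert_encoder_len**2 / mae_encoder_len**2 == 16.0
```

The lightweight decoder pays the full-length price, which is why MAE keeps it narrow and shallow compared to the encoder.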
@Youkouleleh 3 years ago
@@AICoffeeBreak Ok, and could BERT do this like in this paper (or why do they not use the same strategy)? I.e., give the (non-masked/swapped) words to the encoder, and give the decoder the embedded words + the mask tokens (which would be learned, like in this paper). This would also allow a bigger encoder during training.
@AICoffeeBreak 3 years ago
@@Youkouleleh Ah, now I see the confusion: BERT does not actually have a (heavyweight) decoder. The "decoder" is just an MLP performing classification *on the MASK tokens* after they have been encoded. The decoder you just described is, in a sense, already the BERT encoder. See the first answer to this question: stackoverflow.com/questions/60382793/what-are-the-inputs-to-the-transformer-encoder-and-decoder-in-bert
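For what it's worth, BERT's masked-language-modelling "decoder" can be caricatured in a few lines of numpy (shapes and names are illustrative placeholders, not the actual BERT code): the encoder has already produced a hidden state at every position, masks included, and the head is just a projection to the vocabulary read off at the MASK positions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 10, 16, 100

# Hidden states coming out of the BERT encoder; note the MASK positions
# are in here too -- the encoder had to compute representations for them.
hidden_states = rng.normal(size=(seq_len, hidden))
masked_positions = [2, 7]

# The whole "decoder": one projection to vocabulary logits,
# evaluated only at the MASK positions.
W_vocab = rng.normal(size=(hidden, vocab))
logits = hidden_states[masked_positions] @ W_vocab

assert logits.shape == (2, vocab)  # one distribution per masked token
```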
@AICoffeeBreak 3 years ago
But it also might be that I am confused. Or Ms. Coffee Bean. If I am right, it is me. If I am wrong, it is Ms. Coffee Bean. 😅
@Youkouleleh 3 years ago
@@AICoffeeBreak Thanks for your answer. I had this idea that BERT was some kind of autoencoder, but not really. It is quite close to an AE + the matching-sentence task, though. If the classification of non-masked words counted in the loss, I think it would be an autoencoder + matching-sentence task.
@antoinegar.638 9 months ago
Hey there, thanks for the video! I'm late to the party, but I don't understand something: how is this architecture useful for downstream tasks like classification? I understand you can ditch the decoder and put your downstream classifier instead. However, the encoder only reads 25% of the input (75% being masked). Won't this seriously lower the quality of the system compared to a classical autoencoder?
@AICoffeeBreak 9 months ago
Hmm, you wouldn't do the masking for classification tasks where one is interested in representations, would you? The masking is just for training.
@akshaygrao77 24 days ago
@AICoffeeBreak I think the person who asked this question is missing a detail. They think that showing shorter sequences during training and then suddenly showing longer sequences during inference causes problems. This is generally true in transformers; RoPE embeddings are one way to reduce this problem, and windowed attention would be another. But in the masked autoencoder, the key difference is that the positional embeddings of all positions are seen by the model during training, so inference won't be as much of a surprise as in general transformers with absolute encodings. Am I right with this answer?
@AICoffeeBreak 24 days ago
Yes, smart answer, thanks!
@c-bon 2 years ago
05:18 Are there any references where one can look in more detail into the phenomenon of artifacts introduced by the use of masking in CNN autoencoders? At first glance I couldn't see the authors taking care to highlight this fact. P.S. The animations are great as always.
@nicolettileo 1 year ago
Thank you for your work, but I still struggle to capture the idea of mask tokens, which seems crucial. I'm new to the field of transformers but used to good old CNN autoencoders, and what bothers me is: how can the masked tokens be fed directly into the decoder even though their latent representations haven't been computed? From what I understood, it isn't the masked tokens that are fed but some learnable shared vector. Am I right?
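That is indeed how the paper describes it: a single learnable vector is shared by all masked positions, and the decoder can only tell those positions apart through the positional embeddings added on top. A minimal numpy sketch of that assembly step (sizes and the random stand-ins for "learned" vectors are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, num_visible = 8, 4, 2

# Pretend the encoder returned latents for the visible patches only.
perm = rng.permutation(num_patches)
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]
encoded = rng.normal(size=(num_visible, dim))

mask_token = rng.normal(size=(dim,))            # one shared, learnable vector
pos_embed = rng.normal(size=(num_patches, dim)) # one embedding per position

decoder_in = np.empty((num_patches, dim))
decoder_in[visible_idx] = encoded
decoder_in[masked_idx] = mask_token             # identical vector at every masked slot
decoder_in += pos_embed                         # positions make each slot unique

# Before the positional embeddings were added, all masked slots held
# the exact same vector:
assert np.allclose(decoder_in[masked_idx[0]] - pos_embed[masked_idx[0]],
                   decoder_in[masked_idx[1]] - pos_embed[masked_idx[1]])
```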
@youssefprojects7757 2 years ago
The video is informative and supported by good animations, but you need to speak a little more slowly and take some breaks in your speech, because sometimes there is too much information in one sentence. Thank you for your effort, and I hope you will take this feedback. I discovered your channel today and I subscribed.
@pohsoonchang6127 3 years ago
👍
@Easyy-Peasyy-Cooking 2 years ago
Thank you for your nice explanation, but I would like to point out that MAE is not the first to propose this idea. In April 2021, much earlier than MAE, we proposed "SiT: Self-supervised Vision Transformers" and showed its merit on small datasets, because as a small group we cannot afford training on ImageNet. Despite the fact that we contacted the authors of MAE to acknowledge the original research, they did not respond to us! Similarly, Microsoft also used the same idea in "SimMIM: A Simple Framework for Masked Image Modeling" and did not acknowledge us either. I would really appreciate it if you supported the original research and mentioned this story on your channel. Nowadays, research is only acceptable and acknowledged if it comes from these tech giants; there is no place for small groups anymore.
@AICoffeeBreak 2 years ago
As a member of a small group myself, I really feel your pain. I usually do criticize in my videos that the huge companies are dominating: often they just use larger resources and not much in terms of ideas, and it looks more like engineering at scale and less like research. It's a pity they did not cite you even after you pointed this out. This is bad practice.
@Phenix66 2 years ago
Feels so bad hearing about this... It hurts enough to think of something and see that it already exists, but this is worse. In general it really feels like David vs. Goliath at some point... Even aside from not getting visibility, not having the resources sucks, especially when (as it seems) most of the recent cool papers (Pathways, DALL-E 2, etc.) stem from having vast amounts of data & computation power, not from having cool new ideas :( When even evaluation is so bloody expensive, even on simple datasets, it can completely knock you out of the competition...
@peterjackson4530 4 days ago
I'm late, but sorry to hear that.
@Agrover112 2 years ago
Idk what will happen by the time I get into a PhD; AI will be crazy.
@AICoffeeBreak 2 years ago
Where are you at the moment?
@Agrover112 2 years ago
@@AICoffeeBreak Bachelors lol
@AICoffeeBreak 2 years ago
I pity you.
@AICoffeeBreak 2 years ago
Hold on there.
@aishik11 2 years ago
Any assistance on how to use this model just for encoding, without masking, like she suggests at 12:02? The Hugging Face implementation seems to perform some masking.
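In case it helps: in the Hugging Face `transformers` implementation, the masking is driven by the `mask_ratio` field of `ViTMAEConfig`, so one plausible way to encode without dropping any patches is to set it to 0.0. This is a sketch against the API as I understand it (class names and the `mask_ratio`/`ids_restore` fields are from that library); double-check it against your installed version.

```python
import torch
from transformers import ViTMAEConfig, ViTMAEModel

# Randomly initialized model, just to show the shapes; in practice load
# pretrained weights, e.g. ViTMAEModel.from_pretrained(..., mask_ratio=0.0).
config = ViTMAEConfig(mask_ratio=0.0)  # keep every patch: nothing is masked out
model = ViTMAEModel(config)

pixel_values = torch.randn(1, 3, 224, 224)  # 224/16 = 14 -> 196 patches

with torch.no_grad():
    outputs = model(pixel_values)

# All 196 patch tokens plus the CLS token come out of the encoder.
assert outputs.last_hidden_state.shape == (1, 197, config.hidden_size)
# Note: the implementation may still permute the patch tokens internally;
# outputs.ids_restore holds the indices needed to restore patch order.
```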