I have been procrastinating reading the paper until now and you just made a video, perfect.
@AICoffeeBreak 3 years ago
You were not procrastinating. You were waiting for us to make the video. 😂
@harumambaru 3 years ago
55 views, I'm an early bird! I hope you get enough money for coffee from sponsors :) I am not mocking, I'm really happy that even young channels are supported by sponsors, and so happy that this sponsor can be helpful for most of the viewers.
@AICoffeeBreak 3 years ago
Thanks! I can totally relate to your point. I feel the same when it comes to small YouTubers I love.
@harumambaru 3 years ago
@@AICoffeeBreak Could you list a couple of small YouTubers you love? I am into 3blue1brown, Yannic Kilcher and Two Minute Papers, but they are all pretty huge.
@AICoffeeBreak 3 years ago
Small but sponsored? No (except Sabine Hossenfelder, but she is not small). Just small: Machine Learning Street Talk, Alfredo Canziani, Henry AI Labs, Jay Alammar, The AI Epiphany, Aladdin Persson, Gradient Dude, vcubingx
@harumambaru 3 years ago
@@AICoffeeBreak Wow, you made my weekend! Instead of watching Monster Hunter with Milla Jovovich, I am going to watch Sabine Hossenfelder's protein folding videos.
@beizhou4025 3 years ago
The animation is awesome. Thank you for taking the effort!
@AICoffeeBreak 3 years ago
Glad you liked it!
@prajwalsood1350 3 years ago
Can't thank you enough. I have to present this paper in my class and this helps me a lot.
@cipritom 2 years ago
In addition, I love the sound effects of the layer growing! Nice video!
@AICoffeeBreak 2 years ago
Thanks! Sound effects don't add much most of the time, but at the right spots they can trigger a sort of 3D effect.
@michaellellouch3682 3 years ago
Cool stuff. Thanks for keeping us up to date on papers outside of our domain
@Mrbits01 3 years ago
The first time I heard the sound effects you used when expanding stuff (parameters, encoder size), I literally thought it was my stomach growling. Darn it, right when it was getting serious :D
@AICoffeeBreak 3 years ago
Lol 😂 You are nominated for the funniest comment award.
@garyhuntress6871 1 year ago
I've been working on ViTMAE for 2 days. Thanks for this video, very interesting.
@AICoffeeBreak 1 year ago
Glad it was helpful! Keen to share what you are planning to do with it? :)
@garyhuntress6871 1 year ago
@@AICoffeeBreak I'm very interested in processing audio, particularly spectrograms. Ideally I think we need the equivalent of an LLM for acoustics: a really good embedding model for time series.
@nilsmuller9286 3 years ago
Awesome video! :) Didn't have the paper on my radar yet; now I'll have to read it.
@mattcoleman2819 2 years ago
Great video, thanks! I'm a bit confused about how transfer learning/downstream tasks will work with the encoder if its sequence length now needs to be increased. Or is the encoder sequence length set to the total # of patches, and attention masking/padding used during pretraining?
@soumyasarkar4100 3 years ago
Your content organisation is very good.
@AICoffeeBreak 3 years ago
Thanks! Glad we did a thing right.
@terryr9052 3 years ago
I am curious why non-overlapping patches were chosen. I would think that would lead to reconstruction errors.
@AICoffeeBreak 3 years ago
Thanks for the question. But could you please elaborate a little on why this would cause errors and why overlapping patches would ameliorate the problem? The patches are non-overlapping but tile the entire image, and attention allows patches to be informed about their fellow patches.
@terryr9052 3 years ago
@@AICoffeeBreak I don't really have a rigorous answer, but my intuition tells me that forcing the model to predict every boundary between patches is less accurate than a model that actually gets to see the boundary as data. Thinking more about it, though, I do understand that more patches means more work for the attention, which would counter the advantage gained from removing patches through masking...
@MengJiun_Chiou 2 years ago
Awesome explanation :)
@AICoffeeBreak 2 years ago
Thanks!
@sadface7457 3 years ago
certified classic
@Agrover112 2 years ago
That's a certified hood classic
@DerPylz 3 years ago
I'm old enough!
@Youkouleleh 3 years ago
Thanks for the video. Do you know why BERT would not use this strategy and just give the encoder the non-masked words?
@AICoffeeBreak 3 years ago
Because the masked words have to be predicted, meaning a representation has to be computed there, which in transformers (as much goes in as comes out) means that BERT has to process the mask tokens too. Not even the paper presented in the video escapes that curse, because the decoder has to see the masks again.
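To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch in plain Python. The figures are illustrative, not from the paper's tables: 196 patches as in a 224×224 image with 16×16 patches, and the paper's 75% mask ratio.

```python
num_tokens, mask_ratio = 196, 0.75

# BERT-style: MASK placeholders stay in the sequence, so the encoder's
# self-attention runs over the full length, masks included.
bert_encoder_len = num_tokens

# MAE-style: masked patches are dropped *before* the encoder, which then
# attends only over the visible 25%. The masks reappear only in the decoder.
mae_encoder_len = int(num_tokens * (1 - mask_ratio))
mae_decoder_len = num_tokens  # the "curse" above: the decoder sees all positions

assert mae_encoder_len == 49
# Self-attention cost grows quadratically with sequence length, so the
# MAE encoder's attention is roughly 16x cheaper at this mask ratio:
assert bert_encoder_len**2 / mae_encoder_len**2 == 16.0
```

The lightweight decoder pays the full-length price, which is why MAE keeps it narrow and shallow compared to the encoder.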
@Youkouleleh 3 years ago
@@AICoffeeBreak Ok, and could BERT do this like in this paper (or why do they not use the same strategy)? I.e., give the (non-masked/swapped) words to the encoder, and give the decoder the embedded words + the mask tokens (which would be learned, like in this paper). This would also allow a bigger encoder during training.
@AICoffeeBreak 3 years ago
@@Youkouleleh Ah, now I see the confusion: BERT does not actually have a (heavyweight) decoder. The "decoder" is just an MLP performing classification *on the MASK tokens* after they have been encoded. The decoder you just described is, in a sense, already the BERT encoder. See the first answer to this question: stackoverflow.com/questions/60382793/what-are-the-inputs-to-the-transformer-encoder-and-decoder-in-bert
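For what it's worth, BERT's masked-language-modelling "decoder" can be caricatured in a few lines of numpy (shapes and names are illustrative placeholders, not the actual BERT code): the encoder has already produced a hidden state at every position, masks included, and the head is just a projection to the vocabulary read off at the MASK positions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 10, 16, 100

# Hidden states coming out of the BERT encoder; note the MASK positions
# are in here too -- the encoder had to compute representations for them.
hidden_states = rng.normal(size=(seq_len, hidden))
masked_positions = [2, 7]

# The whole "decoder": one projection to vocabulary logits,
# evaluated only at the MASK positions.
W_vocab = rng.normal(size=(hidden, vocab))
logits = hidden_states[masked_positions] @ W_vocab

assert logits.shape == (2, vocab)  # one distribution per masked token
```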
@AICoffeeBreak 3 years ago
But it also might be that I am confused. Or Ms. Coffee Bean. If I am right, it is me. If I am wrong, it is Ms. Coffee Bean. 😅
@Youkouleleh 3 years ago
@@AICoffeeBreak Thanks for your answer. I had this idea that BERT was some kind of autoencoder, but not really. It is quite close to an AE + the matching-sentence task, though. If the classification of non-masked words counted in the loss, I think it would be an autoencoder + matching-sentence task.
@antoinegar.638 9 months ago
Hey there, thanks for the video! I'm late to the party, but I don't understand something: how is this architecture useful for downstream tasks like classification? I understand you can ditch the decoder and put your downstream classifier instead. However, the encoder only reads 25% of the input (75% being masked). Won't this seriously lower the quality of the system compared to a classical autoencoder?
@AICoffeeBreak 9 months ago
Hmm, you wouldn't do the masking for classification tasks where one is interested in representations, would you? The masking is just for training.
@akshaygrao77 24 days ago
@AICoffeeBreak I think the person who asked this question is missing a detail. They think that showing shorter sequences during training and then suddenly showing longer sequences during inference causes problems. This is generally true in transformers; RoPE embeddings are one way to reduce this problem, and windowed attention would be another. But in the masked autoencoder, the key difference is that the positional embeddings of all positions are seen by the model during training, so inference won't be as much of a surprise as in general transformers with absolute encodings. Am I right with this answer?
@AICoffeeBreak 24 days ago
Yes, smart answer, thanks!
@c-bon 2 years ago
05:18 Are there any references where one can look in more detail into the phenomenon of artifacts introduced by the use of masking in CNN autoencoders? At first glance I couldn't see the authors taking care to highlight this fact. P.S. The animations are great as always.
@nicolettileo 1 year ago
Thank you for your work, but I still struggle to capture the idea of mask tokens, which seems crucial. I'm new to the field of transformers but used to good old CNN autoencoders, and what bothers me is: how can the masked tokens be fed directly into the decoder even though their latent representations haven't been computed? From what I understood, it isn't the masked tokens that are fed but some learnable shared vector. Am I right?
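That is indeed how the paper describes it: a single learnable vector is shared by all masked positions, and the decoder can only tell those positions apart through the positional embeddings added on top. A minimal numpy sketch of that assembly step (sizes and the random stand-ins for "learned" vectors are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, num_visible = 8, 4, 2

# Pretend the encoder returned latents for the visible patches only.
perm = rng.permutation(num_patches)
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]
encoded = rng.normal(size=(num_visible, dim))

mask_token = rng.normal(size=(dim,))            # one shared, learnable vector
pos_embed = rng.normal(size=(num_patches, dim)) # one embedding per position

decoder_in = np.empty((num_patches, dim))
decoder_in[visible_idx] = encoded
decoder_in[masked_idx] = mask_token             # identical vector at every masked slot
decoder_in += pos_embed                         # positions make each slot unique

# Before the positional embeddings were added, all masked slots held
# the exact same vector:
assert np.allclose(decoder_in[masked_idx[0]] - pos_embed[masked_idx[0]],
                   decoder_in[masked_idx[1]] - pos_embed[masked_idx[1]])
```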
@youssefprojects7757 2 years ago
The video is informative and supported by good animations, but you need to speak a little more slowly and take some breaks in your speech, because sometimes there is too much information in one sentence. Thank you for your effort, and I hope you will take this feedback. I discovered your channel today and I subscribed.
@pohsoonchang6127 3 years ago
👍
@Easyy-Peasyy-Cooking 2 years ago
Thank you for your nice explanation, but I would like to point out that MAE is not the first to propose this idea. In April 2021, much earlier than MAE, we proposed "SiT: Self-supervised Vision Transformers" and showed its merit on small datasets, because as a small group we cannot afford training on ImageNet. Despite the fact that we contacted the authors of MAE to acknowledge the original research, they did not respond to us! Similarly, Microsoft also used the same idea in "SimMIM: A Simple Framework for Masked Image Modeling" and did not acknowledge us either. I would really appreciate it if you supported the original research and mentioned this story on your channel. Nowadays, research is only acceptable and acknowledged if it comes from these tech giants; there is no place for small groups anymore.
@AICoffeeBreak 2 years ago
As a member of a small group myself, I really feel your pain. I usually do criticize in my videos that the huge companies are dominating: often they just use larger resources and not much in terms of ideas, and it looks more like engineering at scale and less like research. It's a pity they did not cite you even after you pointed this out. This is bad practice.
@Phenix66 2 years ago
Feels so bad hearing about this... It hurts enough to think of something and see that it already exists, but this is worse. In general it really feels like David vs. Goliath at some point... Even aside from not getting visibility, not having the resources sucks, especially when (as it seems) most of the recent cool papers (Pathways, DALL-E 2, etc.) stem from having vast amounts of data & computation power, not from having cool new ideas :( When even evaluation is so bloody expensive, even on simple datasets, it can completely knock you out of the competition...
@peterjackson4530 4 days ago
I'm late, but sorry to hear that.
@Agrover112 2 years ago
Idk what will happen by the time I get into a PhD; AI will be crazy.
@AICoffeeBreak 2 years ago
Where are you at the moment?
@Agrover112 2 years ago
@@AICoffeeBreak Bachelors lol
@AICoffeeBreak 2 years ago
I pity you.
@AICoffeeBreak 2 years ago
Hold on there.
@aishik11 2 years ago
Any assistance on how to use this model just for encoding, without masking, like she suggests at 12:02? The Hugging Face implementation seems to perform some masking.
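In case it helps: in the Hugging Face `transformers` implementation, the masking is driven by the `mask_ratio` field of `ViTMAEConfig`, so one plausible way to encode without dropping any patches is to set it to 0.0. This is a sketch against the API as I understand it (class names and the `mask_ratio`/`ids_restore` fields are from that library); double-check it against your installed version.

```python
import torch
from transformers import ViTMAEConfig, ViTMAEModel

# Randomly initialized model, just to show the shapes; in practice load
# pretrained weights, e.g. ViTMAEModel.from_pretrained(..., mask_ratio=0.0).
config = ViTMAEConfig(mask_ratio=0.0)  # keep every patch: nothing is masked out
model = ViTMAEModel(config)

pixel_values = torch.randn(1, 3, 224, 224)  # 224/16 = 14 -> 196 patches

with torch.no_grad():
    outputs = model(pixel_values)

# All 196 patch tokens plus the CLS token come out of the encoder.
assert outputs.last_hidden_state.shape == (1, 197, config.hidden_size)
# Note: the implementation may still permute the patch tokens internally;
# outputs.ids_restore holds the indices needed to restore patch order.
```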