Generate long form video with Transformers | Phenaki from Google Brain explained

Views: 11,365

AI Coffee Break with Letitia

1 day ago

Comments: 26

@automatescellulaires8543 2 years ago

This will revolutionize the meme market.

@DerPylz 2 years ago

Thank you for also explaining Phenaki! I was curious about a non-diffusion model for video generation! 🎊

@davidyang102 2 years ago

Because it resumes generation from only a few frames, it will lose context. Imagine generating a paragraph and then the next one using only the last word you generated. Luckily, images capture a lot of information, so it's not that obvious. But, for example, you can't do a video that looks around 360 degrees if it's generated in two iterations. Very dreamlike.

@federicolusiani7753 2 years ago

Thank you for your video, great content as always! One question: in the video, you say that the video encoder is auto-regressive, so that it can be used on an arbitrary number of video patches. But aren't standard transformer encoders already able to process inputs of arbitrary length? Usually the auto-regressive architecture is used in the decoder, because at inference time we need it to generate the output causally. Am I missing something?

@AICoffeeBreak 2 years ago

Thanks for this great question. Transformer sequence length is an interesting topic, which we've discussed here already: kzbin.info/www/bejne/jqnXpGSfqc2opqs Basically, even if a transformer can take in or generate variable-length sequences, it still has a predefined maximum input / output length due to practical limitations (compute time and memory). You are asking whether a causal model could not generate infinitely long video and, for practical reasons, the answer is no. Unmodified causal attention means that each new token attends to the whole generated past. So the attention window grows linearly with the sequence length, and the total computation time and memory grow quadratically. Because compute time and memory are limited, we cannot generate indefinitely, unless one applies tricks like the Phenaki authors do with MaskGIT, attending only to a small fraction of the tokens of the previously generated output.
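To put rough numbers on the scaling argument in this reply, here is a minimal Python sketch that counts query-key pairs. The function names are made up for illustration, and the fixed-size window is only a stand-in for "attend to a small fraction of the past"; it is not Phenaki's or MaskGIT's actual mechanism.

```python
def causal_attention_pairs(num_tokens: int) -> int:
    """Count (query, key) pairs when every token attends to itself and all earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2


def windowed_attention_pairs(num_tokens: int, window: int) -> int:
    """Count (query, key) pairs when every token attends to at most the
    `window` most recent tokens (itself included).

    Hypothetical simplification: a fixed window, used here only to show
    how bounding the attended past changes the growth rate.
    """
    return sum(min(position + 1, window) for position in range(num_tokens))


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        full = causal_attention_pairs(n)
        bounded = windowed_attention_pairs(n, window=256)
        print(f"{n} tokens: full causal = {full}, window of 256 = {bounded}")
```

With full causal attention, doubling the number of tokens roughly quadruples the pair count, while with a bounded window the count grows roughly linearly once the sequence is much longer than the window.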

@Handelsbilanzdefizit 2 years ago

Maybe there will be a way to visualize memories and dreams, by using Electroencephalography (EEG) and Neural Networks. So you can see what others think. Or see what others see, through their eyes.

@mrinmoybanik5598 2 years ago

Good luck collecting a training dataset 🙂

@johnkintner 2 years ago

Researchers have already used fMRI to do something similar! This was a while ago :D

@rewixx69420 2 years ago

I want infinite video generation with diffusion models so much.

@AICoffeeBreak 2 years ago

Soon. Just give Google some time to mount more TPUs in their racks. 😅

@AICoffeeBreak 2 years ago

twitter.com/_akhaliq/status/1595645248243650560?t=PHepVXOP40pPdc5q3upUbQ&s=19 what about this? Didn't look into it.

@summary7428 1 year ago

Great video, but I think it was wrongly placed in your (awesome) diffusers playlist =)

@AICoffeeBreak 1 year ago

You are right, it is not a diffusion model. It's about content generation. 😅 I was more comfortable with it being in this playlist (especially as the last video in the row) rather than being nowhere close to its fellow competitors. But sure, I do not have the Paella video in the list, although Paella can be argued to be a diffusion model. I need to clean up.

@elev007 2 years ago

Great explanation, thank you 🙏

@barberb 2 years ago

Thank you, Letitia

@TheGatoskilo 1 year ago

I wonder how they pad the video tensors with variable sequence lengths.

@AICoffeeBreak 1 year ago

Do you see this as problematic?

@TheGatoskilo 1 year ago

I am just wondering about the implementation level. For these padding values, and for masking the tokens, did someone decide that we fill these tensors with 0s? Does it matter what we fill those vectors with? What if the padded/masked 0s coincide with actual data values? How do we effectively instruct the model to tell masked values apart from 0s that belong to the actual data?

@TheGatoskilo 1 year ago

@AICoffeeBreak No, I just wonder how it works in the implementation
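On the implementation question in this thread: the thread does not say how Phenaki handles padding, so the following is only a generic sketch of a common pattern, with hypothetical helper names. The idea is that a separate boolean mask, not the fill value, marks which positions are real, so a genuine 0 in the data is never confused with a padded 0.

```python
import math


def pad_sequences(sequences, pad_value=0.0):
    """Pad variable-length sequences to a common length.

    Returns the padded sequences plus a boolean mask (True = real value,
    False = padding). Hypothetical helper, not taken from any Phenaki code.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[True] * len(seq) + [False] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_softmax(scores, mask):
    """Softmax over the positions where mask is True; padded positions get weight 0.

    Assumes at least one position is unmasked.
    """
    top = max(score for score, keep in zip(scores, mask) if keep)
    exps = [math.exp(score - top) if keep else 0.0 for score, keep in zip(scores, mask)]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    # The first sequence contains a real 0.0; the second one gets padded.
    padded, mask = pad_sequences([[0.0, 2.0, 1.0], [3.0]])
    for row, row_mask in zip(padded, mask):
        print(row, row_mask, masked_softmax(row, row_mask))
```

In this sketch the fill value is arbitrary: padding with 0.0 or with any other number gives the same attention weights, because the mask alone decides which positions take part in the softmax.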