So do the recent large world-model breakthroughs of Sora, Luma, and Runway alpha imply that we've returned to auto-regression? Are they a combo of the two? Amazing video, would love to hear your thoughts!
@algorithmicsimplicity 6 hours ago
From what little they have released publicly, it seems that they are simply diffusion models applied to videos, i.e. they treat videos as a collection of frames, add noise to all frames, take all noisy frames as input and try to predict all clean frames. I don't think there is any auto-regression done, but maybe that will change when they start generating longer videos.
@dschonhaut 1 day ago
Coming from a background in computational neuroscience, but with no prior NN experience beyond coding a perceptron on the MNIST dataset for a class years ago, I’ve been watching all your videos lately as a first-pass before delving into these topics more rigorously. The visualizations and your concise summaries that accompany them are hugely helpful! Thanks for all your work on these educational videos, and for releasing them for free! Out of curiosity, how are you scripting your animations? Is it Manim (the @3blue1brown library)? And would you recommend it?
@algorithmicsimplicity 1 day ago
Thanks for the feedback! I used Manim for the first 3 videos on my channel, and then decided to write my own animation library because I found Manim annoying to use. My animation library is still a work in progress, so my most recent 2 videos are made using both my own library and Manim. I would say that Manim is fine to use for simple animations/graphs, but once you start trying to animate lots of objects at once it gets annoying.
@looooool3145 1 day ago
i now understand things, thanks!
@downloadableram2666 2 days ago
State-space models are not necessarily from ML, they're used a lot in control systems actually. Not surprised by their relationship considering both are strongly based on linear algebra.
@downloadableram2666 2 days ago
I wrote my own CUDA-based neural network implementation a while ago, but I used a sigmoid instead of RELU. Although the "training" covered in this video works, it was kinda odd (usually neural net backpropagation is done with gradient descent of the cost function). You probably cover this in another video, though, haven't gotten there yet. Good video nonetheless, never really thought about it this way.
@algorithmicsimplicity 2 days ago
The training method described in this video IS gradient descent. Backprop is just one way of computing gradients (and it is usually used because it is the fastest). It will yield the same results as this training procedure.
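To make the reply above concrete, here is a toy sketch (the one-weight model, the numbers, and the function names are all invented for illustration) showing that a finite-difference "perturb the weight and re-measure the loss" estimate agrees with the analytic gradient that backprop would compute:

```python
# Toy model: a single weight w with squared-error loss on one data point (x, y).
def loss(w, x, y):
    return (w * x - y) ** 2

w, x, y = 0.5, 2.0, 3.0

# Gradient computed analytically, as backprop would: dL/dw = 2*(w*x - y)*x
grad_backprop = 2 * (w * x - y) * x

# Gradient estimated by nudging the weight and measuring the change in loss,
# which is the "perturbation" picture of training.
eps = 1e-6
grad_numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(grad_backprop, grad_numeric)  # both approximately -8.0
```

Both approaches move the weight in the same direction; backprop just gets the number in one pass instead of re-evaluating the loss per perturbation.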
@downloadableram2666 2 days ago
@@algorithmicsimplicity I should have figured it was doing much the same thing, just a lot slower and more 'random.' Your videos are awesome though, I'm trying to go into machine learning after doing aerospace embedded systems.
@Paraxite7 3 days ago
I finally understand MAMBA! I've been trying to get my head around it for months, but now see that approaching the way the original paper stated wasn't the best way. Thank you.
@heyalchang 3 days ago
Thanks! That was a really excellent explanation.
@algorithmicsimplicity 3 days ago
Thank you so much!
@chaseghoste8671 3 days ago
man, what can I say
@laurentkloeble5872 3 days ago
very original. I learned so much in few minutes. thank you
@nothreeshoes1200 5 days ago
Please make more videos. They’re fantastic!
@user-ou3ts4hl7p 6 days ago
Very good video. Now I understand the straightforward reasons why the diffusion idea emerged and why diffusion is intrinsically better than the auto-regression algorithm.
@fractalinfinity 7 days ago
I get it now! 🎉 thanks!
@alenqquin4509 8 days ago
A very good job, I have deepened my understanding of generative AI
@freerockneverdrop1236 8 days ago
Neural networks can be simple, if you are Neo :)
@Real-HumanBeing 8 days ago
In exactly the same way as regular AI, by containing the dataset.
@yushuowang7820 9 days ago
VAE: what about me?
@lialkalo4093 9 days ago
very good explanation
@freerockneverdrop1236 10 days ago
This is very nice but I think there is one thing either wrong or not clear. At 11:40, he says that in order to minimize the steps to generate the image, we generate multiple pixels in one step that are scattered around and independent of each other. This way we will not get the blurred image. This works in the later steps, when enough pixels have been generated and they guide the generation of the next batch of pixels. E.g., we know it is a car and we just try to add more details. But in the early steps, when we don't know what it is, if multiple pixels are generated, we fall back to the averaging issue again, e.g. one pixel is for a car but another one is for a bridge. Therefore, I think in the early steps we can only generate one pixel per step. What do you think?
@algorithmicsimplicity 10 days ago
You are correct that at the early steps there is a large amount of uncertainty over the value of each pixel. But what matters is how much this uncertainty is reduced by knowing the value of the previously generated pixels. Even at the first step, knowing the value of one pixel does not help that much in determining the value of far-away pixels. Let's say one pixel in the center is blue: does that make you significantly more confident about the color of the top-left pixel? Not really.
@freerockneverdrop1236 7 days ago
@@algorithmicsimplicity hi, glad to see your reply. Let me articulate. Let's just create the first 2 pixels in two ways and see the difference. The first approach is to create them one by one, and the second approach is to create both in one step. First approach: at the beginning, there is no pixel; when we create one, no problem, there will be no inconsistency. Then we create the second pixel based on the first one; there are still a lot of possibilities, but it is constrained by the first pixel. On the other hand, in approach two, both pixels are created in the same step, so they cannot constrain each other. Then there could be more combinations, and some of them include inconsistencies. To avoid this, we should generate pixels one by one until the theme of the image is set. Then we can create batch by batch. A simplified example: if the model is trained with images that include only green pixels of different strength, or only red pixels of different strength, then approach one will generate a second pixel of the same color as the first one, though probably with different strength. But approach two could generate one red and one green.
@Levy1111 10 days ago
I do hope you'll soon reach at least a six-figure subscriber count. The quality of your videos (both in terms of education and presentation) is top notch; people need you to become popular (at least within our small tech bubble).
@official_noself 11 days ago
480p? 2023? Are you kidding me?
@LeYuzer 12 days ago
Tip: turn on 1.5x speed
@pshirvalkar 12 days ago
A fantastic teacher!!! Thanks! Can you please cover Bayesian monte carlo markov chains? Would be very helpful!
@algorithmicsimplicity 12 days ago
Thanks for the suggestion, I will put it on my TODO list.
@alexanderterry187 12 days ago
How do the models deal with having different numbers of inputs? E.g. the text label provided can be any length, or not provided at all. I'm sure this is a basic question, but whenever I've used NNs previously they've always had a constant number of inputs or been reapplied to a sequence of data that has the same dimension at each step.
@algorithmicsimplicity 12 days ago
For image input, the input is always the same size (same image size and same channels), and the output is always the same size (1 pixel). For text, you can also treat the inputs as all being the same size by padding smaller inputs up to a fixed max length, though transformers can also operate on sequences of different lengths. The output for text is always the same size (a probability distribution over tokens in the vocabulary).
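A minimal sketch of the padding idea described in the reply above (the pad id, function name, and example token ids are hypothetical):

```python
PAD_ID = 0  # hypothetical id reserved for the padding token

def pad_batch(sequences, max_len):
    """Pad variable-length token-id lists up to a fixed length.

    Returns the padded batch plus a mask marking which positions hold
    real tokens (1) versus padding (0), so the model can ignore padding.
    """
    padded, masks = [], []
    for seq in sequences:
        seq = seq[:max_len]                       # truncate overly long inputs
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD_ID] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

batch, mask = pad_batch([[5, 6], [7, 8, 9, 10]], max_len=4)
print(batch)  # [[5, 6, 0, 0], [7, 8, 9, 10]]
print(mask)   # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

After padding, every input in the batch has the same shape, which is what makes fixed-size tensor operations possible.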
@karigucio 12 days ago
So the transformation applied to the weights isn't purely about initialization? Instead, in the expression w = exp(-exp(a)*exp(ib)), the numbers a and b are the learned parameters and not w, right?
@algorithmicsimplicity 12 days ago
Yes, a and b are the learned parameters.
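For illustration, a tiny sketch of what that means in practice (the function name is made up; the formula is the one quoted in the question): gradient descent updates a and b, and w is recomputed from them on each forward pass.

```python
import cmath
import math

def weight_from_params(a, b):
    """Derive the recurrent weight from the learned parameters a and b,
    using the w = exp(-exp(a) * exp(i*b)) form quoted in the question."""
    return cmath.exp(-math.exp(a) * cmath.exp(1j * b))

# Since |exp(z)| = exp(Re z), the magnitude of w is exp(-exp(a) * cos(b)),
# which stays below 1 whenever cos(b) > 0, keeping the recurrence stable.
a, b = 0.0, math.pi / 4
w = weight_from_params(a, b)
print(abs(w))  # exp(-exp(0) * cos(pi/4)) ≈ 0.493
```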
@matthewfynn7635 13 days ago
I have been working with machine learning models for years and this is the first time i have truly understood through visualisation the use of ReLU activation functions! Great video
@jaimeduncan6167 13 days ago
Out of nothing? No, it grabs people's work and creates a composite with variations.
@algorithmicsimplicity 13 days ago
It isn't correct to say it creates a 'composite' with variations, models can generalize outside of their training dataset in certain ways, and generative models are capable of creating entirely new things that aren't present in the training dataset.
@MichaelBrown-gt4qi 13 days ago
I've started binge watching all your videos. 😁
@fergalhennessy775 14 days ago
do u have a mewing routine bro love from north korea
@MichaelBrown-gt4qi 14 days ago
This is a great video. I have watched videos in the past (years ago) that talked about auto-regression, and more lately about diffusion. But it's nice to see why and how there was such a jump between the two. Amazing! However, I feel this video is a little incomplete, since there was no mention of the enhancer model that "cleans up" the final generated image. This enhancing model is able to create a larger image while cleaning up the six fingers gen AI is so famous for. While not technically a part of the diffusion process (because it has no random noise), it is a valuable addition to image gen if anyone is trying to build their own model.
@capcadaverman 14 days ago
Not made from nothing. Made by training on real people’s intellectual property. 😂
@algorithmicsimplicity 14 days ago
My image generator was trained on data licensed to be used for training machine learning models.
@capcadaverman 14 days ago
@@algorithmicsimplicity not everyone is so ethical
@telotawa 14 days ago
could diffusion work on text generation?
@algorithmicsimplicity 14 days ago
Yes, it absolutely can! Instead of adding normally distributed noise, you randomly mask tokens with some probability, see e.g. arxiv.org/abs/2406.04329 . That said, it tends to produce slightly worse quality text than auto-regression (actually this is true for images as well, it's just that on images auto-regression takes too long to be viable).
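A rough sketch of that masking-based noising step (the mask token string and function name are made up for illustration; a real implementation would work on token ids):

```python
import random

MASK = "<mask>"  # hypothetical mask token

def corrupt(tokens, p, rng):
    """One noising step for discrete (masked) diffusion:
    independently replace each token with the mask token with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

rng = random.Random(0)
print(corrupt(["the", "cat", "sat", "down"], 0.5, rng))
```

Training then asks the model to predict the original tokens at the masked positions, just as image diffusion asks it to predict the clean pixels.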
@LordDoucheBags 14 days ago
What did you mean by causal architectures? Because when I search online I get stuff about causal inference, so I’m guessing there’s a different and more popular term for what you’re referring to?
@julioalmeida4645 14 days ago
Damn. Amazing piece
@dmitrii.zyrianov 14 days ago
Hey! Thanks for the video, it is very informative! I have a question. At 18:17 you say that an average of a bunch of noise is still a valid noise. I'm not sure why it is true here. I'd expect the average of a bunch of noise to be just 0.5 value (if we map rgb values to 0..1 range)
@algorithmicsimplicity 14 days ago
Right, the average is just the center of the noise distribution which, let's say the color values are mapped from -1 to 1, is 0. This average doesn't look like noise (it is just a solid grey image), but if you ask what the probability of this image is under the noise distribution, it actually has the highest probability. The noise distribution is a normal distribution centered at 0, so the input which is all 0 has the highest probability. So the average image still lies within the noise distribution, as opposed to natural images, where the average moves outside the data distribution.
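This can be checked numerically; a small sketch (the 100-pixel "image" and the function name are invented for illustration):

```python
import math
import random

def gaussian_logpdf(xs):
    """Log-density of a vector under i.i.d. standard normal noise."""
    return sum(-0.5 * x * x - 0.5 * math.log(2 * math.pi) for x in xs)

rng = random.Random(0)
noise_image = [rng.gauss(0, 1) for _ in range(100)]  # a typical noise sample
average_image = [0.0] * 100  # what averaging many noise images tends toward

# The all-zero "average" sits at the mode of the noise distribution,
# so its density is at least as high as any sampled noise image's.
print(gaussian_logpdf(average_image) >= gaussian_logpdf(noise_image))  # True
```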
@dmitrii.zyrianov 13 days ago
Thank you for the reply, I think I got it now
@simonpenelle2574 14 days ago
Amazing content I now want to implement this
@dadogwitdabignose 15 days ago
great video man suggestion: can you create a video on how generative transformers work? this has been really bothering me and hearing an in-depth explanation of them, like your video here, would be helpful!
@algorithmicsimplicity 15 days ago
Generative transformers work in exactly the same way as generative CNNs. It doesn't matter what backbone you use; the idea is the same: you use auto-regression or diffusion to train a transformer to undo the masking/noising process.
@dadogwitdabignose 15 days ago
@@algorithmicsimplicity which is more efficient to use and how do they handle text data to map text into tensors?
@algorithmicsimplicity 15 days ago
@@dadogwitdabignose I explain how transformer classifiers work in this video: kzbin.info/www/bejne/oYivlpdupJqAaLs . As for which is more efficient, it depends on the data. Usually for text data transformers will be more efficient (for reasons I explain in that video), and for images CNNs will be more efficient.
@Kavukamari 15 days ago
"i can do eleventy kajillion computations every second" "okay, what's your memory throughput"
@deep.space.12 15 days ago
If there will be a longer version of this video, it might be worth mentioning VAE as well.
@algorithmicsimplicity 15 days ago
Thanks for the suggestion.
@wormjuice7772 16 days ago
This has helped me so much wrapping my head around this whole subject! Thank you for now, and the future!
@codybarton2090 16 days ago
Crazy video
@gameboyplayer217 16 days ago
Nicely explained
@snippletrap 17 days ago
Fantastic explanation. Very intuitive
@ibrahimaba8966 17 days ago
Thank you for this beautiful work!
@algorithmicsimplicity 17 days ago
Thank you very much!
@boogati9221 17 days ago
Crazy how two separate ideas ended up converging into one nearly identical solution.
@andrewy2957 6 days ago
Totally agree. I feel like that's pretty common in math, robotics, and computer science, but it just shows how every field in stem is interconnected.
@mattshannon5111 17 days ago
Wow, it requires really deep understanding and a lot of work to make videos this clear that are also so correct and insightful. Very impressive!
@vibaj16 18 days ago
wait, can this be used as a ray tracing denoiser? That is, you'd plug your noisy ray traced image into one of the later steps of the diffusion model, so the model tries to make it clear?
@algorithmicsimplicity 18 days ago
Yep you could definitely do that, you would probably need to train a model on some examples of noisy ray traced images though.
@Maxawa0851 17 days ago
Yeah, but this is very slow though
@antongromek4180 18 days ago
Actually, there is no LLM, etc - but 500 million nerds - sitting in basements all over the world.
@artkuts4792 18 days ago
I still don't get how the scoring model works. So before, you were labeling the important pairs by hand, giving each pair a score based on its semantic value for a given context, but then it's done automatically by a CNN. How does it define the score, though (and it's context-free, isn't it)?
@algorithmicsimplicity 18 days ago
The entire model is trained end-to-end to minimize the training loss. To start off with, the scoring functions are completely random, but during training they will change to output scores which are useful, i.e. which cause the model's final prediction to better match the training labels. In practice it turns out that what these scoring functions learn while trying to be useful is very similar to the 'semantic scoring' that a human would do.
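As a simplified, hypothetical sketch of such a learned scoring function, here is standard dot-product scoring followed by a softmax; in a real model the query and key vectors are produced by learned layers, so end-to-end training reshapes which pairs score highly without any hand labels:

```python
import math

def attention_scores(query, keys):
    """Dot-product scoring followed by softmax.

    The raw score for each key is its dot product with the query; softmax
    turns the raw scores into weights that sum to 1.
    """
    raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(raw)                       # subtract max for numerical stability
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

weights = attention_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
print(weights)  # the first key matches the query better, so it gets more weight
```

Nothing in this function is trained; what training changes is the layers that produce the queries and keys, which is why the scores start out random and become "semantic" only as the loss goes down.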
@lusayonyondo9111 19 days ago
wow, this is such an amazing resource. I'm glad I stuck around. This is literally the first time this is all making sense to me.
@istoleyourfridgecall911 19 days ago
Hands down the best video that explains how these models work. I love that you explain these topics in a way that resembles how the researchers created these models. Your video shows the thinking process behind them, and combined with great animated examples it is so easy to understand. You really went all out. If only YouTube promoted these kinds of videos instead of brainrot low-quality videos made by inexperienced teenagers.