As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!
@adfaklsdjf7 ай бұрын
we gotcha 💚
@harpersneil7 ай бұрын
Phew, for a second there I thought you were dramatically more intelligent than I am!
@Euler123-h8n6 ай бұрын
Omg, I’ve been your fan since Spider-Man 😆, thanks for the lesson!
@lukeflyswatter32035 ай бұрын
Oh man, I thought I was missing something fundamental at that point. Thanks!
@pyajudeme92457 ай бұрын
This guy is one of the best teachers I have ever seen.
@sebastyanpapp6 ай бұрын
Agreed
@edoardogribaldo10587 ай бұрын
Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers
@joker3451727 ай бұрын
Dr Pound is just amazing. I love all his videos
@adfaklsdjf7 ай бұрын
thank you for "if you want to unlock your face with a phone".. i needed that in my life
@alib83967 ай бұрын
Unlocking my face with my phone is the first thing I do when I wake up everyday.
@MichalKottman7 ай бұрын
9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?
@michaelpound98917 ай бұрын
Absolutely yes! I definitely should have added “the distance” or similar :)
@ScottiStudios7 ай бұрын
Yes it should have been *minimise* the diagonal, not maximise.
@rebucato31426 ай бұрын
Or it should be “maximize the similarity on the diagonal, minimize elsewhere”
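To spell out the corrected objective from the pinned comment and this thread, here is a minimal sketch of a CLIP-style contrastive loss; PyTorch is assumed, and the function and tensor names are placeholders rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image/text embeddings.

    image_emb, text_emb: (N, D) tensors where row i of each is a matching pair.
    Maximises similarity on the diagonal of the N x N similarity matrix and
    pushes the off-diagonal (mismatched) pairs apart.
    """
    # Normalise so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N matrix of similarities between every image and every caption
    logits = image_emb @ text_emb.t() / temperature

    # The correct caption for image i is caption i: the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image)
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```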
@orange-vlcybpd26 ай бұрын
The legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
@bluekeybo7 ай бұрын
The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
@skf9577 ай бұрын
These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
@letsburn007 ай бұрын
YouTube is like having the best teacher in school. The world has hundreds or thousands of experts, but being able to explain well is really hard to do, too.
@eholloway7 ай бұрын
"There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
@rnts087 ай бұрын
Understatement of the century, even for a brit.
@vrottietottie7 ай бұрын
I'm a simple guy. I see a Mike Pound video, I click
@jme_a7 ай бұрын
I pound the like button... ❤
@Afr0deeziac7 ай бұрын
@@jme_a I see what you did there. But same here 🙂
@BooleanDisorder7 ай бұрын
I like to see Mike pound videos too.
@kurdm14827 ай бұрын
Same
@MikeUnity7 ай бұрын
We're all here for an intellectual pounding
@aprilmeowmeow7 ай бұрын
Thanks for taking us to Pound town. Great explanation!
@pierro2812797 ай бұрын
Your profile picture reminds me of my cat ! It's so cute !
@pvanukoff7 ай бұрын
pound town 😂
@rundown1327 ай бұрын
pause
@aprilmeowmeow7 ай бұрын
@@pierro281279 that's my kitty! She's a ragdoll. That must mean your cat is pretty cute, too 😊
@BrandenBrashear6 ай бұрын
Pound was hella sassy this day.
@TheRealWarrior07 ай бұрын
A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)! After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image + prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).
@or1on896 ай бұрын
That’s pretty much what he said after explaining how the LLM infers an image from written text. Did you watch the whole video?
@TheRealWarrior06 ай бұрын
@@or1on89 What? Inferring an image from written text? Is this a typo? You mean image generation? Anyway, did he make my same point? I must have missed it. Could you point to the minute he roughly says that? I don't think he ever said something like "projective layer" and/or talked about how multimodality in LLMs is "bolted-on". It felt to me like he was talking about the actual CLIP paper rather than how CLIP is used on the modern systems (like Copilot).
@exceptionaldifference3926 ай бұрын
I mean the whole video was about how to align the embeddings of the visual transformer with LLM embeddings of captions of the images.
@TheRealWarrior06 ай бұрын
@@exceptionaldifference392 To me, the whole video seems to be about the CLIP paper, which is about "zero-shot labelling of images". But that is a prerequisite for making something like LLaVA, which is able to talk, ask questions about the image and execute instructions based on the image content! CLIP can't do that! I described the step of going from having a vision encoder and an LLM to having a multimodal LLM. That's it.
@TheRealWarrior06 ай бұрын
@@exceptionaldifference392 To be exceedingly clear: the video is about how you create the "vision encoder" in the first place, (which does require you also train a "text encoder" for matching the image to the caption), not how to attach the vision encoder to the more general LLM.
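To make the projection-layer idea from this thread concrete, here is a minimal sketch, assuming a LLaVA-style setup with a frozen vision encoder and frozen LLM; the module name, dimensions, and the commented usage are hypothetical:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A small MLP is often enough; typically only this part is trained at
        # first, with the vision encoder and the LLM kept frozen.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):       # (batch, n_patches, vision_dim)
        return self.proj(patch_embeddings)     # (batch, n_patches, llm_dim)

# Hypothetical usage: projected image tokens are prepended to the text tokens
# and the whole sequence is fed to the LLM, which is trained to emit the
# caption / answer as its next-token targets.
# inputs = torch.cat([projector(vision_encoder(image)), llm_embed(prompt_ids)], dim=1)
```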
@Shabazza846 ай бұрын
Excellent. Could listen to him all day and even understand stuff.
@chloupichloupa6 ай бұрын
That cat got progressively more turtle-like with each drawing.
@wouldntyaliktono7 ай бұрын
I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured freetext queries. Embeddings are so cool.
@lucianoag9996 ай бұрын
So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption “blue dog”.
@StephanBuchin5 ай бұрын
We can also improve AI by precisely describing images.
@jonyleo5007 ай бұрын
At 9:30, doesn't a distance of zero mean the image and caption have the same "meaning", therefore, shouldn't we want to minimize the diagonal, and maximize the rest?
@michaelpound98917 ай бұрын
Yes! We want to maximise the similarity measure on the diagonal - I forgot the word similarity!
@romanemul17 ай бұрын
@@michaelpound9891 C'mon. It's Mike Pound!
@negrumanuel5 ай бұрын
Love the genuine background.
@beardmonster80517 ай бұрын
The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
@JohnMiller-mmuldoor7 ай бұрын
Been trying to unlock my face for 10:37 and it’s still not working!
@musikdoktor7 ай бұрын
Love seeing AI problems explained on fanfold paper. Classy!
@rigbyb7 ай бұрын
6:09 "There isn't red cats" Mike is hilarious and a great teacher lol
@xersxo54606 ай бұрын
Just writing this to crystallize my understanding: (and for others to check me for accuracy) So by circumventing the idea of trying to instill “true” understanding (which is a hard incompatibility in this context, due to our semantics); On a high level it’s substituting case specific discrepancies (like how a digital image is made of pixels, so only pixel related properties are important: like color and position) and filtering against them, because it happens to be easier to tell what something isn’t than what it is in this case (like there are WAAAY more cases where a random group of pixels isn’t an image of a cat, so your sample size for correction is also WAAY bigger.) And if you control for the specific property that disqualifies the entity (in this case, of the medium: discrete discrepancies), as he stated with the “ ‘predisposed noise’ subtraction to recreate a clean image’“ training, you can be even more efficient and effective by starting with already relevant cases. Once again because a smattering of colors is not a cat so it’s easier to go ahead and assume your images will already be in some assortment of colors similar to a cat to train on versus the near infinite combinations of random color pixel images. And then in terms of the issue of accuracy through specificity versus scalability, it was just easier to use the huge sample size as a tool to approximate accuracy between the embedded images and texts because as a sample size increases, precision also roughly increases given a rule, (in crude terms). And that it’s also a way to circumvent “ mass hard coding” associations to approximate “meaning” because the system doesn’t even have to deal directly with the user inputs in the first place, just their association value within the embedded bank. I think that’s a clever use of the properties of a system as limitations to solve for our human “black box” results. Because the two methods, organic and mathematical, converge due to a common factor: The fact that digital images in terms of relevance to people are also useful approximations, because we literally can only care about how close an “image” is to something we know, not if it actually is or not, which is why we don’t get tripped up over individual pixels in determining the shape of a cat in the average Google search. So in the same way by relying on pixel resolution and accuracy as variables you can quantify the properties so a computer can calculate a useable result. That’s so cool!
@captaintrep60026 ай бұрын
This video was super helpful. I just wish it had come out with the stable diffusion video. I was quite confused about how we went from text to meaning.
@user-dv5gm2gc3u7 ай бұрын
I'm an IT guy & programmer, but this is kinda hard to understand. Thanks for the video; it gives a little idea of the concepts!
@aspzx7 ай бұрын
I'd definitely recommend the last two videos on GPT from 3blue1brown. He explains the concept of embeddings in a really nice way.
@AZTECMAN7 ай бұрын
CLIP is fantastic. It can be used as a 'zero-shot' classifier. It's both effective and easy to use.
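A rough sketch of that zero-shot use, assuming the Hugging Face transformers CLIP wrapper and the openai/clip-vit-base-patch32 checkpoint; the label list and image path are made up:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Turn each candidate class into a caption, embed image and captions,
# and pick whichever caption lands closest to the image in the shared space.
labels = ["cat", "dog", "boat"]
prompts = [f"a photo of a {label}" for label in labels]
image = Image.open("example.jpg")  # hypothetical image file

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
```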
@chuachua-hj9zd5 ай бұрын
I like how he just sits in an office chair and talks. Simple but high-quality talk.
@Foxxey7 ай бұрын
14:36 Why can't you just train a network that would decode the vector in the embedded space back into text (being either fixed sized or using a recurrent neural network)? Wouldn't it be as simple as training a decoder and encoder in parallel and using the text input of the encoder as the expected output in the decoder?
@or1on896 ай бұрын
Because that’s a whole different class of problem and would make the process highly inefficient. There are better ways just to do that using a different approach.
@stancooper54367 ай бұрын
Thanks Mike, nice clear explanation. You can still get that printer paper!? Haven't seen that since my Dad worked as a mainframe engineer for ICL in the 80s!
@robsands66565 ай бұрын
Still have to manually generate the sentences for the embedding. It’s just a convoluted way to let a computer generate its own lookup table.
@sukaina49787 ай бұрын
i just feel 10 times smarter after watching any computerphile video
@el_es7 ай бұрын
@dr Pound: Sorry if this is off topic here, but I wonder if the problem of hallucinations in AI comes from us treating the 'I don't know what I'm looking at' answer of a model as a very negative outcome? If it was treated by us as a valid, neutral answer, could it reduce the rate of hallucinations?
@IceMetalPunk7 ай бұрын
For using CLIP as a classifier: couldn't you train a decoder network at the same time as you train CLIP, such that you now have a network that can take image embeddings and produce semantically similar text, i.e. captions? That way you don't have to guess-and-check every class one-by-one? Anyway, I can't believe CLIP has only existed for 3 years... despite the accelerating pace of AI progress, we really are still in the nascent stages of generalized generative AI, aren't we?
@Misiok896 ай бұрын
6:30 If for an LLM you have "nodes of meaning", then you could look for those "nodes of meaning" in the description and make classes based on them. If you are able to represent every language with the same "nodes of meaning" (which is even better for translating text from one language to another than an average non-LLM translator), then you should be able to use it for classification as well.
@WilhelmPendragon6 ай бұрын
So the vision-text encoder is dependent on the quality of the captioned-photo dataset? If so, where do you find quality datasets?
@ColibrisMusicLive6 ай бұрын
Please explain 9:2. Do you mean that the embeddings lying on the diagonal will receive a higher score?
@Sleeperknot6 ай бұрын
At 9:48, did Mike use 'maximize' and 'minimize' wrongly? The distances on the diagonal should be minimal, right? EDIT: I saw Mike's pinned comment only after posting this :P
@Stratelier7 ай бұрын
When they say "high dimensional" in the vector context, I like to imagine it like an RPG character stat sheet, as each independent stat on that sheet can be considered its own dimension.
@zxuiji7 ай бұрын
Personally I would have just done the colour comparison by putting the 24-bit RGB integer colour into a double (the 64-bit FPN type) and dividing one by the other. If the result is greater than 0.01 or less than -0.01 then they're not close enough to deem the same overall colour and thus not part of the same facing of a shape. **Edit:** When searching for images it might be better to use simple line-path matching (both a 2D and a 3D one) against the given text of what to search for, and compare the shapes identified in the images to those 2 paths. If at least 20% of the line paths match a shape in the image set then it likely contains what was searched for. Similarly, when generating images, the line paths would then be traced to produce each image and layered onto one image. Finally, for identifying shapes in a given image you just iterate through all stored line paths. I believe this is how our brains conceptualise shapes in the 1st place, given how our brains have nowhere to draw shapes to compare to. Instead they just have connections between... cells? neurons? Someone will correct me. Anyway, they just have connections between what are effectively physical functions that equate to something like this in C: int neuron( float connections[CHAR_BIT * sizeof(unsigned int)] ); Which tells me the same subshapes share neurons for comparisons, which means a bigger shape will likely just be an initial neuron to visit, how many neurons to visit, and what angle to direct the path at to identify the next neuron to visit. In other words, every subshape would be able to revisit a previous subshape's neuron/function. There might be an extra value or 2, but I'm no neural expert, so a rough guess should be accurate enough to get the ball rolling.
@pickyourlane64316 ай бұрын
I was curious: when you are showing the paper from above, are you transforming the original footage?
@owocowywonsz65643 ай бұрын
So that's why we had to assign noisy vehicle pictures to the word "vehicle" (or were given an instruction like "select not animals") when solving captchas.
@LukeTheB7 ай бұрын
Quick question from someone outside computer science: Does the model actually instill "meaning" into the embedded space? What I mean is: Is the Angel between "black car" and "Red car" smaller than "black car" and "bus" and that is smaller than "black car" and "tree"?
@suicidalbanananana7 ай бұрын
Yeah that's correct, "black car" and "red car" will be much closer to each other than "black car" and "bus" or "black car" and "tree" would be. It's just pretty hard to visualize this in our minds because we're talking about some strange sort of thousands-of-dimensions-space with billions of data points in it. But there's definitely discernable "groups of stuff" in this data. (Also, "Angle" not "Angel" but eh, we get what you mean ^^)
@codegallant7 ай бұрын
Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months so this is just wonderful. Thanks a ton! :)
@jonathan-._.-7 ай бұрын
Approximately how many samples do I need when I just want to do image categorisation (but with multiple categories per image)?
@GeoffryGifari7 ай бұрын
Can AI say "I don't know what I'm looking at"? Is there a limit to how well it can recognize parts of an image?
@throttlekitty17 ай бұрын
No, but it can certainly get it wrong! Remember that it's looking for a numerical similarity to things it does know, and by nature has to come to a conclusion.
@OsomPchic7 ай бұрын
Well, in some way. It would say that the picture has these embeddings: cat: 0.3, rainy weather: 0.23, white limo: 0.1, with every number representing a percentage of how "confident" it is. So with a lot of tokens below 0.5 you can say it has no idea what's in that picture.
@ERitMALT001237 ай бұрын
Monte-Carlo dropout can produce confidence estimations of a model. If the model doesn't know what it's looking at then the confidence should be low. CLIP natively doesn't have this though
@el_es7 ай бұрын
The 'I don't know' answer is not received very well by users, and therefore there is an understandable aversion to it embedded into the model ;) possibly because it also means more work for the programmers... Therefore it would rather hallucinate than say it doesn't know something.
@therobotocracy4 ай бұрын
I’m not finding a lot of info on fine-tuning CLIP. Any suggestions?
@Holycrabbe6 ай бұрын
So the CLIP array training the diffusion would have 400 million entries? So it defines a "corner" of the space spanned by the 400 million photos and photo descriptions?
@LupinoArts7 ай бұрын
3:55 As someone born in the former GDR, I find it cute to label a Trabi as "a car"...
@j3r3miasmg7 ай бұрын
I didn't read the cited paper, but if I understood correctly, the 5 billion images need to be labeled for the training step?
@Hexanitrobenzene6 ай бұрын
Or "at least" 400 million...
@AzizMajid-e7p3 ай бұрын
So, let's say we have trained on images and texts of all animals except wolves and dogs. If we were to ask a model trained with CLIP and zero-shot classification to draw a picture of a dog and a wolf sitting together, would it still be able to draw them? It hasn't been trained on either the words "dog" and "wolf" or on images of wolves and dogs. If we were to describe it as "draw a picture of an animal which howls, looks similar to a fox or canine, and lives in the forest", there is a chance it would be able to predict it, but if we say the exact sentence "draw a picture of a wolf", it wouldn't be able to draw it, right? Because it doesn't know the meaning of "wolf", or what it looks like, or whether it is an animal or a table; in its universe it would be like humans trying to predict what the 10th dimension looks like. Am I right?
@thecakeredux6 ай бұрын
Is there a specific reason why this process would have to be single-directional? There doesn't seem to be much difference in principle from an autoencoder. I assume this isn't about whether it would work or not, but rather about approximating this behavior to get around impossible amounts of training, or am I mistaken in this assumption?
@zzzaphod85077 ай бұрын
4:35 "There is a lot of stuff on the internet, not all of it good." Today I learned 😀 6:05 I enjoyed that you mentioned the issues of red/black cats and the problem of cat-egorization Video was helpful, explained well, thanks
@aleksszukovskis20746 ай бұрын
there is stray audio in the background that you can faintly hear at 0:05
@robosergTV6 ай бұрын
Please make a Playlist only about GenAI or a separate AIphile channel. I care only about genAI.
@donaldhobson88737 ай бұрын
Once you have a CLIP, can't you train a diffusion model on pure images, just by putting an image into CLIP and training the diffusion model to output the same image?
@postmanpatpatАй бұрын
Thank you for sharing your knowledge Sir ❤ someone watching from Pakistan.
@mkbrln5 ай бұрын
Was able to keep up to 14:34. I'm progressing!
@FilmFactry7 ай бұрын
When will we see multimodal LLMs able to answer a question with a generated image? It could be "how do you wire an electric socket", and it would generate either a diagram or an illustration of the wire colors and positions. They should be able to do this, but can't yet. Next would be a functional use of Sora, rendering a video of how you install a starter motor in a Honda.
@thestormtrooperwhocanaim4967 ай бұрын
A good edging session (for my brain)
@brdane7 ай бұрын
Oop. 😳
@proc7 ай бұрын
9:48 I didn't quite get how similar embeddings end up close to each other if we maximize the distances to all other embeddings in the batch. Wouldn't two images of dogs in the same batch be pulled further apart, just like an image of a dog and a cat would? Explain like Dr. Pound, please.
@drdca82637 ай бұрын
First: I don’t know. Now I’m going to speculate: Not sure if this had a relevant impact, but: probably there are quite a few copies of the same image with different captions, and of the same caption for different images? Again, maybe that doesn’t have an appreciable effect, idk. Oh, also, maybe the number of image,caption pairs is large compared to the number of dimensions for the embedding vectors? Like, I know the embedding dimension is pretty high, but maybe the number of image,caption pairs is large enough that some need to be kinda close together? Also, presumably the mapping producing the embedding of the image, has to be continuous, so, images that are sufficiently close in pixel space (though not if only semantically similar) should have to have similar embeddings. Another thing they could do, if it doesn’t happen automatically, is to use random cropping and other small changes to the images, so that a variety of slightly different versions of the same image are encouraged to have similar embeddings to the embedding of the same prompt.
@nenharma827 ай бұрын
This is as simple as it is ingenious, and it wouldn't be possible without the internet being what it is.
@IceMetalPunk7 ай бұрын
True! Although it also requires Transformers to exist, as previous AI architectures would never be able to handle all the varying contexts, so it's a combination of the scale of the internet and the invention of the Transformer that made it all possible.
@Retrofire-477 ай бұрын
@@IceMetalPunk the transformer, as someone who is ignorant, what is that? I only know a transformer as a means of converting electrical voltage from AC - DC
@GeoffryGifari7 ай бұрын
How can AI determine the "importance" of parts of an image? Why would it output "people in front of boat" instead of "boat behind people" or "boat surrounded by people"? Or maybe the image is a grid of square white cells, and one cell then has its color progressively darkened to black. Would the AI describe these transitioning images differently?
@michaelpound98917 ай бұрын
Interesting question! This very much comes down to the training data in my experience. For the network to learn a concept such as "depth ordering", where something is in front of another, what we are really saying is it has learnt a way to extract features (numbers in grids) representing different objects, and then recognize that an object is obscured or some other signal that indicates this concept of being in front of. For this to happen in practice, we will need to see many examples of this in the training data, such that eventually such features occurring in an image lead to a predictable text response.
@GeoffryGifari7 ай бұрын
@@michaelpound9891 The man himself! thank you for your time
@GeoffryGifari7 ай бұрын
@@michaelpound9891 I picked that example because... maybe it's not just depth? Maybe there are a myriad of factors that the AI summarized as "important". For example, the man is in front of the boat, but the boat is far enough behind that it looks somewhat small... Or maybe that small boat has a bright color that contrasts with everything else (including the man in front). But your answer makes sense, that it's the training data.
@Jononor6 ай бұрын
@@GeoffryGifari Salience (and salience detection) is what this concept is usually called in computer vision. CLIP-style models will learn it as a side effect.
@Trooperos906 ай бұрын
This video satisfies my AI scepticism.
@charlesgalant82717 ай бұрын
The answer given for the "we feed the embedding into the denoise process" still felt a little hand-wavey to me as someone who would like to understand better, but overall good video.
@michaelpound98917 ай бұрын
Yes I'm still skipping things :) The process this uses is called attention, which basically is a type of layer we use in modern deep networks. The layer allows features that are related to share information amongst themselves. Rob Miles covered attention a little in the video "AI Language Models & Transformers", but it may well be time to revisit this since attention has become quite a lot more mainstream now, being put in all kinds of networks.
@IceMetalPunk7 ай бұрын
@@michaelpound9891 It is, after all, all you need 😁 Speaking of attention: do you think you could do a video (either on Computerphile or elsewhere) about the recent Infini-Attention paper? It sounds to me like it's a form of continual learning, which I think would be super important to getting large models to learn more like humans, but it's also a bit over my head so I feel like I could be totally wrong about that. I'd appreciate an overview/rundown of it, if you've got the time and desire, please 💗
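For readers who want to see roughly what such an attention layer looks like, here is a stripped-down sketch of cross-attention between image (or latent) features and text embeddings; the dimensions and names are illustrative, not any particular diffusion model's code:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Lets image/latent features pull in information from text embeddings."""
    def __init__(self, dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)        # queries come from the image features
        self.to_k = nn.Linear(text_dim, dim)   # keys/values come from the text tokens
        self.to_v = nn.Linear(text_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, image_feats, text_feats):
        # image_feats: (batch, n_pixels, dim); text_feats: (batch, n_tokens, text_dim)
        q, k, v = self.to_q(image_feats), self.to_k(text_feats), self.to_v(text_feats)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v   # each image location becomes a text-weighted mixture
```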
@utkua7 ай бұрын
How do you go from embeddings to text for something that has never been seen before?
@lancemarchetti86736 ай бұрын
Amazing. Imagine the day when AI is able to detect digital image steganography. Not by vision primarily, but by bit inspection.... iterating over the bytes and spitting out the hidden data. I think we're still years away from that though.
@Ankhyl6 ай бұрын
Mike explains very well; however, it is very noticeable that the concept is not easily explained in 20 minutes. There are a lot of cliffhangers, and each step in the process requires its own 20-minute video, probably at least 2 levels deep, to really understand what's going on.
@Funkymix187 ай бұрын
Mike is the best
@unvergebeneid7 ай бұрын
But it's confusing to say that you want to maximise the distances on the diagonal. Of course you can define things however you want, but usually you'd say you want to maximise the cosine similarity and thus minimise the cosine distance on the diagonal.
@sebastianscharnagl31736 ай бұрын
Awesome explanation
@JT-hi1cs7 ай бұрын
Awesome! I always wondered how the hell the AI "gets" that an image is made with a certain type of lens or film stock. Or how the hell AI generates objects that were never filmed that way, say, The Matrix shot on fisheye and Panavision in the 1950s.
@bogdyee7 ай бұрын
I'm curious about something. If you have a bunch of millions of photos of cats and dogs and they are also correctly labeled (with descriptions), but all these photos have the cats and dogs in the bottom half of the image, will the transformer be able to correctly classify them after training if they are put in the upper half of the image (or if the images are rotated, colors changed, filtered, etc.)?
@Macieks3007 ай бұрын
Yes, it may learn it wrong. That's why scale is necessary for this. If you have a million photos of cats and dogs, it's very unlikely that all of them are in the bottom half of the image.
@bogdyee7 ай бұрын
@@Macieks300 That's why, for me, it poses a philosophical question. Will these things actually solve intelligence at some point? If so, what exactly might be the difference between a human brain and an artificial one?
@IceMetalPunk7 ай бұрын
@@bogdyee Well, think of it this way: humans learn very similarly. It may not seem like it, because the chances of a human only ever seeing cats in the bottom of their vision and never anywhere else is basically zero... but we do. The main difference between human learning and AI learning, with modern networks, is the training data: we're constantly learning and gathering tons of data through our senses and changing environments, while these networks learn in batches and only get to learn from the training data we curate, which tends to be relatively static. But give an existing AI model the ability to do online learning (i.e. continual learning, not "look up on the internet" 😅) and put it in a robot body that it can control? And you'll basically have a human brain, perhaps at a different scale. And embodied AIs are constantly being worked on now, and continual learning for large models... I'm not sure about. I think the recent Infini-Attention is similar, though, so we might be making progress on that as well.
@suicidalbanananana7 ай бұрын
@@bogdyee Nah they won't solve intelligence at some point when going down this route they are currently going down, AI industry was working on actual "intelligence" for a while but all this hype about shoving insane amounts of training data into "AI" has reduced the field to really just writing overly complex search engines that sort of mix results together... 🤷♂ Its not trying to think or understand (as is the actual goal of AI field) anything at all at this stage, it's really just trying to match patterns. "Ah the user talked about dogs, my training data contains the following info about dog type a/b/c, oh the user asks about trees, training data contains info about tree type a/b/c", etc. Actual AI (not even getting to the point of 'general ai' yet but certainly getting to somewhere much better than what we have now) would have little to no training data at all, instead it would start 'learning' as its running, so you would talk to it about trees and it would go "idk what a tree is, please tell me more" and then later on it might have some basic understanding of "ah yes, tree, i have heard about them, person x explained them to me, they let you all breathe & exist in type a/b/c, right? please tell me more about trees" Where the weirdness lies is that the companies behind current "AI" are starting to tell the "AI" to respond in a similar smart manner, so they are starting to APPEAR smart, but they're not actually capable of learning. All the current AI's do not remember any conversation they have had outside of training, because that makes it super easy to turn Bing (or whatever) into yet another racist twitter bot (see microsoft's history with ai chatbots)
@suicidalbanananana7 ай бұрын
@@IceMetalPunk The biggest difference is that we (or any other biological intelligence) don't need insanely large amounts of training data, show a baby some spoons and forks and how to use them and that baby/person will recognize and be able to use 99.9% of spoons and forks correctly for the rest of its life, current overhyped AI's would have to see thousands of spoons and forks to maybe get it right 75% of the time & that's just recognizing it, we're not even close yet to 'understanding how to use' Also worth noting is how we (and again, any other biological intelligence) are always "training data" and much more versatile when it comes to new things, if you train an AI to recognize spoons and forks and then show it a knife it's just going to classify it as a fork or spoon, where as we would go "well that's something i've not seen before so it's NOT a spoon and NOT a fork"
@RupertBruce7 ай бұрын
One day, we'll give these models some high resolution images and comprehensive explanations and their minds will be blown! It's astonishing how good even a basic perceptron can be given 28x28 pixel images!
@ianburton92237 ай бұрын
Difficult to see how convergence can be ensured. Lots of very different functions can be closely mapped over certain controlled ranges, but then are wildly different outside those ranges. What I have missed in many AI discussions is these concepts of validity matching and range identities to ensure that there's some degree of controlled convergence. Maybe this is just a human fear of the unknown.
@MikeKoss7 ай бұрын
Can't you do something analogous to stable diffusion for text classification? Get the image embedding, and then start with random noisy text, and iteratively refine it in the direction of the image's embedding to get a progressively more accurate description of the image.
@quonxinquonyi85706 ай бұрын
Image manifolds are of huge dimension compared to text manifolds… so guided diffusion from a low-dimensional manifold to a very high-dimensional manifold would have less information and more noise. Basically, information-theoretic bounds still hold when you transform from a high-dimensional space to a low-dimensional embedding, but the other way around doesn't seem as intuitive… maybe some prior must be taken into account… but it is still a hard problem.
@MilesBellas7 ай бұрын
Stable Diffusion 3 = potential topic. Optimum workflow strategies using ControlNets, LoRAs, VAEs, etc.?
@LaYvi5 ай бұрын
I'm an artist and I'm very worried about my art being used to train an AI model. What can I do to prevent that? Any tips?
@MattMcT7 ай бұрын
Do any of you ever get this weird feeling that you need to buy Mike a beer? Or perhaps, a substantial yet unknown factor of beers?
@martin777xyz6 ай бұрын
Really nice explanation 👍👍
@EkShunya7 ай бұрын
I thought diffusion models had a VAE and not a ViT. Correct me if I'm wrong.
@quonxinquonyi85706 ай бұрын
A diffusion model is an upgraded version of a VAE, with limitations in sampling speed.
@bas_abhiАй бұрын
I still don’t understand how ChatGPT or other similar models do it… they do far more than generate captions for the image.
@yanisfalaki4 ай бұрын
I'm not too sure he's correct about how diffusion models use the CLIP embeddings to generate the outputs. My understanding is that the diffusion model is trained without any injected embeddings, but that at inference time we pass the noised image into the CLIP image encoder, calculate the similarity between its embedding and the CLIP text embedding, and use its derivative with respect to the image inputs to guide (hence the name guided diffusion) the output probabilities of the diffusion model in the direction that increases the similarity. Although I don't see why his approach wouldn't work, I don't think that's how it's done in the mainstream, especially given that with his approach we can't have negative prompts (which would just be the negative gradient of the similarity score).
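For what it's worth, the CLIP-guidance step described above would look roughly like the sketch below; predict_x0, clip_image_encoder, text_emb, and guidance_scale are hypothetical names standing in for whatever a particular sampler uses:

```python
import torch

def clip_guidance_step(x_t, t, text_emb, predict_x0, clip_image_encoder,
                       guidance_scale=100.0):
    """Nudge the noisy latent x_t so the denoised estimate matches the text better."""
    x = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x, t)                      # denoiser's current clean estimate
    img_emb = clip_image_encoder(x0_hat)           # embed the estimate with CLIP
    sim = torch.cosine_similarity(img_emb, text_emb, dim=-1).sum()
    grad = torch.autograd.grad(sim, x)[0]          # d(similarity) / d(x_t)
    return x_t + guidance_scale * grad             # move toward higher similarity
```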
@bennettzug7 ай бұрын
13:54 you actually probably can, at least to an extent there’s been some recent research on the idea of going backwards from embeddings to text, maybe look at the paper “Text Embeddings Reveal (Almost) As Much As Text” (Morris et al) the same thing has been done with images from a CNN, see “Inverting Visual Representations with Convolutional Networks” (Dosovitsky et al) neither of these are with CLIP models so maybe future research? (not that it’d produce better images than a diffusion model)
@or1on896 ай бұрын
You can, using a different type of network/model. We need to remind that all he said is in the context of a specific type of model and not in absolute terms, otherwise the lesson would go very quickly out of context and hard to follow.
@bennettzug6 ай бұрын
@@or1on89 i don’t see any specific reason why CLIP model embeddings would be especially intractable though
@eigd7 ай бұрын
9:48 Been a while since I did machine learning class... Anyone care to tell me why I'm thinking of PCA? What's the connection?
@Hexanitrobenzene6 ай бұрын
Hm, I'm not an expert either, but... AFAIK, Principal Component Analysis finds directions which maximise/minimise the variance of the data, which can be thought of as average distance. The drawback is that it's only a linear method and it cannot deal with high dimensional data such as images effectively.
@fredrik36857 ай бұрын
Question 🤚 Up until recently, all images of a cat on the internet were photos of real cats, and the system could use them in training. But now more and more cat images are AI generated. If future systems use generated images in training, it will be like the blind leading the blind. More and more distortion will be added. Or? Can that be avoided?
@quonxinquonyi85706 ай бұрын
Distortion and perceptual qualities are the tradeoff we make when we use generative ai
@VicenteSchmitt7 ай бұрын
Great video!
@ginogarcia87307 ай бұрын
I wish I could hear Professor Brailsford's thoughts on AI these days man
@barrotem56276 ай бұрын
Brilliant, Mike!
@genuinefreewilly57067 ай бұрын
Great explainer. Appreciated. I hope someone will cover AI music next
@suicidalbanananana7 ай бұрын
In short: most "AI music stuff" is literally just running stable diffusion in the backend. They train a model on images of spectrograms of songs, then ask it to make an image like that, and then convert that spectrogram image back to sound.
@genuinefreewilly57067 ай бұрын
@@suicidalbanananana Yes, I can see that; however, AI music has made a sudden, marked departure in quality of late. It's pretty controversial among musicians. I can wrap my head around narrow AI applications in music, i.e. mastering, samples, etc. It's been a mixed bag of results until recently.
@or1on896 ай бұрын
It surely would be interesting…I can see a lot of people embracing it for pop/trap music and genres with “simple” compositions…my worry as a musician is that it would make the landscape more boring than boy bands in the 90s (and somewhat already is without AI being involved). As a software developer I would love instead to explore the tool to refine filters, corrections and sampling during the production process… It’s a bit of a mixed bag…the generative aspect is being marketed as the “real revolution” and that’s a bit scary…knowing more the tech and how ML can help improve our tools would be great…
@nightwishlover89137 ай бұрын
5:02 Never seen a "boat wearing a red jumper" before lol
@RawrxDev7 ай бұрын
Truly a marvel of human applications of mathematics and engineering, but boy do I think these tools have significantly more cons than pros in practical use.
@aprilmeowmeow7 ай бұрын
agreed. The sheer power required is an ethical concern
@suicidalbanananana7 ай бұрын
We're currently experiencing an "AI bubble" that will pop within 2-3 years or less, no doubt about that at all. Companies are wasting money and resources trying to be the first to make something crappy appear less crappy than it actually is, but they don't fully realize yet that it's a harder task than it might seem & it's going to be extremely hard to monetize the end result. We need to move back to AI research trying to recreate a biological brain; somehow the field has suddenly been reduced to people trying to recreate a search engine that mixes results or something, which is just ridiculous & running in the opposite direction to where the AI field should be heading.
@RawrxDev7 ай бұрын
@@suicidalbanananana That's my thought as well. I even recently watched a clip of Sam Altman saying they have no idea how to actually make money from AI without investors, and that he is just going to ask the AGI how to make a return once they achieve AGI, which to me seems..... optimistic.
@IOSARBX7 ай бұрын
Computerphile, This is great! I liked it and subscribed!
@JeiShian7 ай бұрын
The exchange at 6:50 made me laugh out loud and I had to show that part of the video to the people around me😆😆
@mattlodder4 ай бұрын
In art history, we call this process "ekphrasis", and it's a deep theoretical problem in art criticism and art history. Are computer vision and GenAI researchers engaging with art historians on these problems? We've been thinking about the relationships between images and their textual descriptions for centuries... Off to Google Scholar I go!
@Coondawgwoopwoop11 күн бұрын
10:35 never wanted to unlock a face with a phone unless it was 2:15 am and I'm blitzed
@NeinStein7 ай бұрын
Oh look, a Mike!
@Rapand7 ай бұрын
Each time I watch one of these videos, I could might as well watch Apocalypto without subtitles. My brain is not made for this 🤓
@bryandraughn98307 ай бұрын
I wonder if every cat image has specific "cat" types of numerical curves, textures, eyes and so on, so that a completely numerical calculation would conclude that the image is of a cat. There's only so much variety of pixel arrangements at some resolution; it seems like images could be reduced to pure math. I'm probably so wrong. Just curious.
@quonxinquonyi85706 ай бұрын
You are absolutely right… Images are of very high dimension, but the image manifold is still considered to fill only a very low-dimensional part of the whole image hyperspace. The only way to manipulate or tweak that image manifold is by adding noise, but noise is of very low dimension compared to that high-dimensional image manifold, so that perturbation or guidance of the image manifold in the form of noise disturbs it in one of its many inherent directions. This is similar to finding the slope of a curve (the manifold) by linearly approximating it with a line (the noise), the method you learn in high school maths. If you want to discuss more, I will clarify further.
@zurc_bot7 ай бұрын
Where did they get those images from? Any copyright infringement?
@quonxinquonyi85706 ай бұрын
The internet has been a huge public repository since its inception.
@MedEighty6 ай бұрын
10:37 "If you want to unlock a face with your phone". Ha ha ha!
@CATANOVA4 ай бұрын
Adorned = Furnished or decked with things that add beauty or worth. There were no such things on that white board.