An Observation on Generalization

152,961 views

Simons Institute

8 months ago

Ilya Sutskever (OpenAI)
simons.berkeley.edu/talks/ily...
Large Language Models and Transformers

Comments: 200
@EdedML 8 months ago
An interesting exploration of a lot of these points (including Solomonoff induction) is presented in the paper "Playing games with AIs: The limits of GPT-3 and similar Large Language Models" -- highly recommended reading
@thomas-packer-thd 8 months ago
Love the passion and intuition.
@baboothewonderspam 2 months ago
Great talk!! Thanks for publishing
@simonstrandgaard5503 8 months ago
Excellent talk.
@tedhoward2606 8 months ago
I see Ilya pausing, thinking about the compression expansion. It works with something like language, which has a fixed set of symbols. It only works with reality (which seems to involve things like irrational numbers, which may only ever be approximated, and maximal computational complexity, etc.) to the extent to which we are willing to declare some approximation to be close enough for the context at hand. There seems to be a very real sense in which reality (whatever it actually is) contains sufficient classes of things like maximal computational complexity that it may only ever be approximated.

The hard thing for most to accept is that most of that simplification and approximation is done for them subconsciously by a large variety of evolved mechanisms, such that what we get to experience as reality never is - it is always some level of simplification, some sort of "useful approximation" (or at least some function of reality that was generally useful in the survival of our ancestral lineage in context, and may not necessarily be as useful to us in our exponentially changing present and future).

It is about 40 years since I started to use the definition of life as "Search" across the multiple systemic and strategic spaces of the possible for the survivable. Once I started to do that, it became abundantly clear that cooperation is fundamental to both the emergence and survival of complexity (all levels, all domains) - actually I first "saw" that after reading Selfish Gene in 1978.

And part of cooperation is being willing to put in the effort to search for new levels of cheat detection and cheat mitigation, for without an evolving ecosystem of cheat detection and mitigation systems, cheats end up destroying the complexity present - if you doubt that, just think of cancer. Cancer is our cells, but some subset of them that has stopped communicating properly with its neighbours and the wider community, and has started to use all available resources to replicate. That strategy appears to work really well, right up to the point that everything dies. We have direct analogies in economics, finance, politics, law and throughout all levels of societal systems.

Evolution starts simple, but rapidly gains in complexity with every new emergent level of complexity, and at every new level strategic cooperation and cheat detection/mitigation become more important in long term survival.

So yes - the ideas of compression and complexity are interesting, and they are related to the levels of resolution used in modeling the irreducibly complex, and that is related to functions and predicates and sparse or dense network topologies (there seems to me to be a kind of mathematical equivalence if one abstracts them sufficiently, but the fundamental reality of irreducible complexity needs to remain part of the system) - the models are never the territory, even if they are the only territory we have direct access to ;)
@Tapuzi 8 months ago
I would love to see you express opinions in videos. Please consider doing it. Thank you for the comment!
@connorkapooh2002 8 months ago
@@Tapuzi I second this comment - Ted, I would love to hear your thoughts!
@yasshramchandani3184 8 months ago
The idea of "life" as a scalable and resilient "search" mechanism seems intriguing to say the least. I also relate to the emergent behaviours that arise from a subset of comparatively less complex procedures. Would love to hear you thinking out loud, or read your blogs somewhere. Thanks for your insights Ted.
@aleph0540 7 months ago
This isn't ALL that Life is, by your same argument. That's good news b/c it suggests we can design better Search organisms. However, it's bad news in the sense that we aren't the best Search 😮
@tedhoward2606 7 months ago
@@aleph0540 I am not sure how you reached that conclusion from what I wrote. It seems clear to me that we are very close approximations to optimal search engines, and survival always has context dependent aspects, and part of context is the multiple levels of culture present (family culture, school culture, community culture, workplace culture, the wider cultures of the various levels of communities that we engage in).

Getting something sufficiently complex to survive demands integrating a massive set of strategies and algorithms and systems, the vast majority of them subconscious, and most people are completely unaware of them. The conscious level models that we have of reality, including ourselves, are just so simplistic; they are like Lego models of a computer - never going to work, but a very low resolution image of the outer appearances of something.

My point wasn't that we are poor at search; we are far better at it than most have any idea about, given appropriate contexts. My point was that the very idea of life is about search, about going beyond the known. And, at higher levels, doing that demands a dialectic tension between freedom and responsibility - all levels, eternally. Once you get that, the idea of "provable safety" as anything other than a probabilistic assessment of the known is a nonsense. (And known in this sense is an expression of probabilistic models in respect of reality, as distinct from any set of logical conclusions from stated premises in any form of logic.)

What we can do is look deeply into biological life, deeply past the surface appearances, and see what it is that allows for new levels of complexity to emerge and survive. When we do that, we see that it is cooperation that is fundamental to every new level of complexity (contrary to economic dogma). We see that cooperation is always vulnerable to exploitation by cheating strategies, and thus requires an eternally evolving ecosystem of cheat detection and mitigation systems (Ostrom's work in this domain is fascinating once viewed from an appropriate context and level of abstraction).

Strategy is a deeply complex subject. Zero sum games do not end well for complexity. Once system boundaries are reached, simplicity dominates over complexity. To survive, complexity requires open boundaries. We are at planetary boundaries; we must go beyond, and we must do so wisely, in the full knowledge of all the levels of strategic complexity present, and with the knowledge that the essence of life is going beyond the known. To do that, individuals need to keep abstracting until there is no contradiction in the statement above.

The key message I wanted to give with my earlier comment is that AI is a different form of life. It is not necessarily one that will have replication as a fundamental drive. It has a different search lineage. So the naive concerns many have of AI out-competing us at replicating are for the most part misplaced. And there are risks in that domain if people are not sufficiently aware of the systemic and strategic complexity, and are using overly simplistic competitive models of survival strategies. But there is a very real sense in which those strategies are self terminating with or without the presence of AI. And saying all of this is not saying that I know best what to do in the sense of telling you what to do.

All I am saying is that we need to give every level and instance of agent sufficient freedom and resources to allow them to creatively and responsibly explore the possible in ways that will generate both novelty and risk, and the greatest counter to that risk actually resides in diversity, in multiple "safe to fail" experiments, in a sense. We need sufficient rules to be able to effectively prevent cheats from destroying systems, effective strategies to return higher level agents to cooperative behaviour, and all agents need to be creative, to the limits of their demonstrated responsibility. Following rules is never enough, but in order to break a rule responsibly you need to understand the full stack of reasons why the rule is there, and what it is designed to prevent. And those stacks can be quite deep, usually more than two levels, and sometimes more than 10 levels. This is complex in ways that are impossible to explain in detail in anything smaller than a library.
@bboysil 2 months ago
Great presentation!
@xyh6552 5 months ago
Compression can explain the part of unsupervised learning that is very similar to using the regularity lemma in mathematics to deal with graph theory problems. Essentially, it utilizes the almost-orthogonality of some local graphs.
@charlesmarks1394 8 months ago
Great pres. Highest tier of insight and scientific communication.
@miladkhademinori2709 8 months ago
👌
@Morimea 8 months ago
Interesting talk about compression, thanks!
@dreamphoenix 8 months ago
Thank you.
@user-zj2st2qk6h 1 month ago
great talk !
@kimchi_taco 7 months ago
In global workspace theory, there is a bottleneck representation. After this talk, I strongly believe the bottleneck is a feature, because the bottleneck forces the model to learn a good way to compress, which we call inductive reasoning.
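A minimal sketch of that bottleneck idea (my own illustration of the general technique, not anything from the talk): an autoencoder with a narrow middle layer can only reconstruct its input if the bottleneck code captures the data's regularities, i.e. compresses.

```python
# Sketch: a narrow bottleneck forces the network to learn a compressed code.
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, dim_in=784, dim_code=16):  # dimensions are illustrative
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                     nn.Linear(128, dim_code))  # the bottleneck
        self.decoder = nn.Sequential(nn.Linear(dim_code, 128), nn.ReLU(),
                                     nn.Linear(128, dim_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))  # reconstruct from the compressed code

model = BottleneckAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)  # stand-in batch; real data would have structure to compress
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
opt.step()
```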
@tristanwegner 8 months ago
It will be interesting in the future to see how the compression of AI systems gets better and better. They have already had neural networks discover physical laws in raw experimental data, but how little data is needed for that? Nobody knows! Maybe it just needs e.g. a few seconds of airwaves broad spectrum signal. This also makes any safety concept for oracle AGIs that involves somehow keeping "safety critical data" from the AGI a very weak approach. The world is highly correlated, and a higher intelligence might learn much more from what we give it, than we see in it ourselves.
@fenixfenixfenixfenix 8 months ago
My favorite bald guy!!
@odomobo 8 months ago
2 minutes in, and I can already tell this talk is going to be fascinating
@NerdFuture 8 months ago
Ray Solomonoff's version of compression is very similar to Kolmogorov's, and Solomonoff's version of induction is... predicting the next token. Another random Ray thing, in the 1980s he was worried that a smart enough supervised AI would effectively learn the trainer's objective function and turn the tables.
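Concretely, the Solomonoff predictor this comment alludes to (standard formulation; U is a universal prefix machine and U(p) = x∗ means the output of p begins with x) weights every program by its length and predicts the next symbol by conditioning:

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\,x_{t+1})}{M(x_{1:t})}.
```

Shorter programs, i.e. better compressions, dominate the sum, which is the compression-prediction link in one formula.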
@tedhoward2606 8 months ago
How is "turn the tables" possible, if the training agent's objective function is "the survival of all sapient agents, with reasonable degrees of security, resources and freedom (as defined by the agents in ongoing conversations)"? All that exists in that strategic territory is cooperation in diversity, in the face of the unknown unknown and maximal computational complexity. Anything less than that seems to this agent, from my 50+ years of exploring that domain of strategic territory, to be dangerously overly simplistic.
@jakebrowning2373 8 months ago
​@@tedhoward2606 and how do you propose to encode/implement that objective function?
@tedhoward2606 8 months ago
@@jakebrowning2373 This is both complex and uncertain, and it does seem to be as close to stable as it is possible to get.

If you take a systems definition of life, as any system capable of searching the space of the possible for the survivable - which in its simplest form is "life as search" - then certain things fall out of that definition recursively through each new level. Search involves going beyond the known into both the unknown and the unknown-unknown. So search in this sense encodes both freedom and responsibility as fundamental aspects of existence. Freedom is what enables search, and responsibility is what prevents us from selecting non-survivable vectors in that highly dimensional vector space.

In terms of survival strategies in the face of the unknown, robustness and diversity deliver the greatest degrees of security (particularly when agents have rapid communication available, and successful strategies can be rapidly transmitted as options to all agents).

So you start with the value of sapient life built directly into the definition of life itself; then recursively, through all new levels of complexity, new levels of strategy are required for cooperation, cheat detection and cheat mitigation - in an eternally evolving ecosystem. I've spent 45 years trying to break it - it appears to be robust.
@shinkurt 8 months ago
It makes sense
@binjianxin7830 8 months ago
50:00 the size of the compressor (GPT4) is a salient term in the inequality 😂
@xyh6552 5 months ago
There must be an explanation from dynamical systems for supervised learning. When the number of samples is much higher than the degrees of freedom of the space, they will accumulate over time, resulting in learning the dynamic information between the samples. The simplest toy model for this is the Poincaré recurrence theorem. On the other hand, if you are willing to believe that the object space is also finite-dimensional, the effectiveness of unsupervised learning can even be explained using multivariable calculus. And compared to some highly difficult-to-explain mathematical phenomena, this is far from being considered magic.
@markm4642 8 months ago
The eloquence of his delivery was delightful
@consumidorbrasileiro222 5 months ago
it's going to be a great success, according to Ilya
@wowtbcmagepvp 8 months ago
Holy shit. Once you get it, you get it. Having (approximate) access to K is the ability to understand everything by mapping everything to their appropriate (shared) distributions (which likely feel linear-like in terms of more learning)
@jakebrowning2373 8 months ago
What do you mean by it feels linear in terms of more learning?
@jonclement 5 months ago
@@jakebrowning2373 i think he means that you first use SGD to find the best K distributions of compression thingy, then you replicate that approach to find an X approximation of every possible computable function...then you linearly combine everything to get 42
@ByteBite_Tech 8 months ago
👍
@loopuleasa 8 months ago
main openai brain thanks for posting this
@user-qm2eo7er4u 7 months ago
Is that Scott Aaronson asking questions in the audience :D?
@lukas-santopuglisi668 8 months ago
thx so much for uploading from Germany!
@Achrononmaster 8 months ago
An avatar "@Yuksel Mert Cankus" in the chat nailed it (or at least nailed a good comment on the topic). Because in most practical use cases we can tolerate errors, and because a sentient mind is the final user of a NN, it might not always be so useful to focus on achieving a numerical approximation to Kolmogorov compression. There is a profoundly fuzzy goal: find an AI tool that's usable, better than yesterday's tool, and compute-efficient. (Let's be honest and note the utopian goal of sentient subjective awareness, aka "consciousness", and uploading your mind into silicon is laughable. If you end up doing it, you can come back to now in your time machine and tell me all about it.)
@diegoacostacoden8704 8 months ago
Does anyone know where I can read about the theoretical guarantees of supervised learning mentioned in the video?
@kalilinux8682 8 months ago
Probably Andrew Ng's famous course on Coursera
@AKdragonable 8 months ago
​@@kalilinux8682No
@zhanggenghan3925 8 months ago
Same question. Have you found the source?
@cennywenner516 8 months ago
@diegoacostacoden8704 @zhanggenghan3925 - You just need Hoeffding's inequality to complete the proof, but if you want to read more, it is covered in Chapter 4 of Understanding Machine Learning: From Theory to Algorithms.
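For anyone after the actual statement, the bound behind "supervised learning must succeed" is, in its simplest finite-class form (my paraphrase of the usual textbook version, so notation may differ from the slides): with probability at least 1−δ over m i.i.d. training samples,

```latex
\forall h \in \mathcal{H}:\quad
\mathrm{err}_{\mathrm{test}}(h) \;\le\; \widehat{\mathrm{err}}_{\mathrm{train}}(h)
\;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2m}},
```

which is exactly Hoeffding's inequality plus a union bound over the hypothesis class; the VC dimension plays the role of ln|H| for infinite classes.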
@cennywenner516 8 months ago
@@kalilinux8682 - No, that course does not cover learning theory
@karigucio 8 months ago
do I read it correctly that in the equation: K(X)
@brianewing1428 8 months ago
Vector comes in, vector goes out. You can't explain that!
@FreakyStyleytobby 7 months ago
4:25 - these equations. What are they? Did Ilya come up with them, or can I see their explanation somewhere?
@odiseezall 8 months ago
Great beard, keep it.
@TheRevAlokSingh 7 months ago
Agreed. Combined with shaving bald, powerful appearance.
@phixvsm1999 3 months ago
He is a really good guy and intelligent
@rohan.fernando 8 months ago
Supervised learning is essentially an artifice. Unsupervised learning is the foundation of Intelligence. Kohonen was the true pioneer in this, and everything since seems to be extensions and tweaked variations on his foundational ideas.
@shinkurt 8 months ago
Bs
@miladkhademinori2709 8 months ago
👌
@swyveu 8 months ago
@@shinkurt please explain...
@brianewing1428 8 months ago
Don't humans do both?
@rohan.fernando 8 months ago
@@brianewing1428 it seems to me that from birth, brains do unsupervised learning and also leverage existing genetically embedded Intelligence. For example, many newborn animals can walk from birth, but they never learned this through experience, so this Intelligent Capability is fully genetically embedded. After some time, a closed loop feedback system is used to improve training of brains, which is kind of like supervised learning. However, the backpropagation system that is currently used to train and perform the error correction and associated weight adjustments in AI systems is an artifice because there’s almost no chance brains use this same process. Brains are doing something different.
@JTan-fq6vy 7 months ago
Can anyone explain, for Kolmogorov complexity, why we should write "K(X) < |C(X)| + K(C) + O(1)"? Why can't we write it as "K(X) < |C(X)|", which seems much more appropriate for an ultimate compressor? Also, why do we need an absolute value for C(X) in the inequality? Thanks!
@candidocarolino 7 months ago
On the right side it is K(C), not K(X)
@JTan-fq6vy 7 months ago
@@candidocarolino thanks! but could you explain why we need the extra K(C) + O(1); why can't we do "K(X) < |C(X)|"?
@hanskraut2018 8 months ago
Ilya is great / likeable. I hope some day you can communicate and exchange some trade secrets so it's not too much and not too little, because 100% secrecy seems counterproductive, just as 100% openness might lose out to people who hide everything, to copycats, or even to hypothetically worse tactics. So much for proprietary information. :)
@AntonioEvans 8 months ago
🎯 Key Takeaways for quick navigation:

00:00 🎤 Introduction to the talk and excitement about discussing the future prospects of LLMs (Large Language Models).
00:47 🔍 Shift of focus to AI alignment and the anticipation of sharing results in the near future.
01:18 📈 Sharing old results from OpenAI that significantly impacted the speaker's perspective on unsupervised learning.
02:03 🤖 Delving into the concept of learning in general and the mathematical bases that govern the learning process in neural networks.
03:25 📊 Introducing the mathematical conditions under which supervised learning must succeed.
04:21 📐 Explaining the simplicity and effectiveness of supervised learning through mathematical proofs.
05:44 💡 Highlighting the importance of consistency between training and test distribution in supervised learning.
07:19 ❓ Raising questions about the nature and effectiveness of unsupervised learning and the lack of mathematical exposition.
08:43 🎭 Discussing the puzzling nature of unsupervised learning, where optimizing one objective helps in achieving another.
10:37 🔄 Introducing the concept of distribution matching as a method of unsupervised learning with guaranteed success.
13:37 🔄 Further discussion on the potential of distribution matching in unsupervised learning.
14:55 💽 Bringing in the concept of compression as a tool for unsupervised learning, emphasizing the correlation between compression and prediction.
17:18 🌐 Delving deeper into the mathematical frameworks that support the idea of compression aiding unsupervised learning.
20:45 📚 Introduction to Kolmogorov complexity as a method of optimal compression, albeit non-computable, in the context of unsupervised learning.
24:16 🧠 Drawing parallels between neural networks and small computers, emphasizing the role of SGD (Stochastic Gradient Descent) in training these "computers".
26:14 🔍 Expanding on conditional Kolmogorov complexity and its role in unsupervised learning, highlighting its ability to extract maximum value from unlabeled data.
27:56 🤖 Highlighted the lack of efficient methods for conditioning on big data sets.
28:25 🛠️ Mentioned that using a regular compressor might be as effective as a conditional compressor for making predictions in supervised tasks.
29:23 💼 Explained that joint compression is maximum likelihood, and how it fits naturally in a machine learning context.
30:34 🧠 Discussed the affinity towards larger neural networks, as they approximate the common core of a compressor more effectively, minimizing regret over time.
31:28 🌐 Noted the capability of GPT models to intuitively understand and predict the continuation of patterns in a text without necessarily referring to the theory of compression.
32:27 📸 Mentioned the successful application of the theory to image domains, leading to effective unsupervised learning through next-pixel prediction.
33:25 📈 Reported promising results in the pixel prediction task, indicating a positive trajectory for the method in unsupervised learning.
35:19 🤔 Discussed the potential deep implications and unanswered questions about why linear representations are formed in the models.
36:47 🔄 Touched upon the potential experimentation to validate the speculations regarding next-pixel prediction compared to other prediction methods.
38:29 🤝 Appreciated the analogy between Kolmogorov complexity and neural networks, and discussed the nuances of training dynamics and data order in neural networks.
40:12 🔄 Discussed the potential of backtracking from cryptography to develop insights into the function class and structure of neural networks.
43:23 🎓 Mentioned the importance and relevance of VC dimension in understanding learning complexities and distinguishing distributions.
45:11 🖼️ Discussed the limitations of using compression as a sole measure for unsupervised learning and the potential for exploring other effective linear representations.
49:12 🔄 Discussed the potential of diffusion models in unsupervised learning and the need to explore their efficiency compared to autoregressive models.
51:13 💾 Touched upon the limitations of gzip as a compressor for text and the scope for further optimizing the compression process.
52:58 📚 Addressed the current stance on curriculum effects in neural networks and the efforts to simplify training optimization procedures.
53:33 💡 Highlighted the advancements in making neural network architectures easier to optimize, reducing susceptibility to curriculum effects.

Made with Socialdraft AI
@pieriskalligeros5403 7 months ago
Nice industrial application pipeline you got there
@aleph0540 7 months ago
Bahahah at the reply above. Thanks for providing the breakdown though 😂
@swyxTV 7 months ago
Not at all a good summary lol. Work harder to improve it pls, this has potential but is just bad
@wrathofgrothendieck 8 months ago
Ilya Sutskever da god
@blahblahblah23424 8 months ago
I'm not sure I get it 23:42 where it says K(X) < |C(X)| + K(C). Shouldn't K(X) < |C(X)| by definition?
@qanon4realvsqanon4gery70 8 months ago
K(X) = length of the smallest program that prints X. |C(X)| = length of the compressed form of X using some compression-decompression program C. You can choose your program C to compress the entire Wikipedia to the bit 1 if you want; clearly the smallest program that prints Wikipedia takes more than 1 bit, so K(X) < |C(X)| doesn't hold. But the length of a program that can decompress Wikipedia from reading "1" is not smaller than the smallest program that prints Wikipedia, so K(X) < |C(X)| + K(C) holds.
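The same accounting can be made concrete with an everyday compressor (a rough sketch using zlib as the stand-in C; the byte counts are illustrative, not tight constants):

```python
# K(X) <= |C(X)| + K(C) + O(1): a program that prints X can be built from
# the compressed payload C(X) plus a fixed decompressor stub.
import zlib

X = b"All of Wikipedia, in spirit. " * 2000  # stand-in data with lots of regularity
payload = zlib.compress(X)                   # |C(X)| bytes

# The "program" is stub + payload; its total length upper-bounds K(X).
stub = b"import sys,zlib;sys.stdout.buffer.write(zlib.decompress(PAYLOAD))"
print(len(X), "bytes of X ->", len(payload) + len(stub), "bytes of program")
```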
@deepbayes6808 7 months ago
Information theory to the rescue of unsupervised learning, who would have thought ;)
@adtiamzon3663 7 months ago
What a beard, Ilya⁉️ Suits yahh! So what am I going to learn this time❓️😁😇🌹
@shuminghu 8 months ago
K(X, Y) = K(X) + K(Y|X) + O(log K(X, Y))
Why does unsupervised learning on X, i.e., learning K(X), help learning K(Y|X)? It's not obvious from this formula that C(X), a good compressor for X, helps compress Y|X.
@cennywenner516 8 months ago
The formula is true but it does not imply that learning X *necessarily* helps to predict Y - for example, if Y is entirely unrelated. He was talking about this to theorize what unsupervised learning could mean. And one way to consider learning (X,Y) is to find a best compression of it (roughly - can we describe the distribution with some simpler set of fundamental variables). Then he goes on to say that finding the best way to compress (X,Y) is basically the same as compressing X and being able to predict Y given X. So there is formalism to justify connecting unsupervised learning back to supervised learning.

K(X,Y) = K(X) + K(Y|X) + .. says something interesting. It is not a priori obvious that a shortest program to get both X and Y should be almost as long as a shortest program to get X, plus a shortest program to get Y given X. Like, how do we know that the first program is even useful for Y? These are just comparing sizes though, so they are not necessarily composed of each other. The easiest way to see that this relation holds is probably to consider the universal distribution - pick a random program and run it to generate an output. Then clearly prob(X, Y) = prob(X) * prob(Y | X). And the probability of outputting X is roughly the same as sampling the shortest program for it - prob(X) ~= 2^-K(X); so 2^-K(X,Y) ~ 2^-(K(X) + K(Y|X)).

So that these complexities are linked like this is interesting, but likely he is interested in the opposite direction of the one you mentioned. Namely, that finding a compressor that can iteratively make good predictions for Y|X implies that you have a good compressor for (X,Y). That is, by doing self-supervised learning, you are doing unsupervised learning, and approximately the most fundamental form of it. We kind of use the terms unsupervised and self-supervised interchangeably today, but it is not obvious that one should - e.g. is BERT's masking fundamentally learning more than next-token prediction?

Then after we have done the self-supervised ~ unsupervised learning, it is interesting to apply it for predictions outside the domain, i.e. how does the self-supervised pretraining help? That's where the other direction you mentioned comes in, and where I personally think there is some weakness (Y is another dist). If the connection holds, we should be able to learn an unsupervised representation (compression) C(X) of X (model weights + instance representation) and then just predict the class Y from C(X), which is what the next-pixel experiment was about. I am not sure if this, as is, establishes any stronger connection on the prediction side - if so I missed it - just that, inspired by this, one might think that most of the time, for sufficient data, a compression C(X) should be all one needs to roughly predict Y, rather than X. Which should make it easier to learn Y, e.g. when X is high-dimensional, such as for vision and text.

There are some other things I think one would want to say but which I think do not clearly follow yet. Perhaps the most interesting case, where the connection is strongest, is when the training does try to make the pre-training part of X and Y closer in distribution at some point, for which I bet OpenAI has plenty of data; might even be used as an argument for training on outputs? Could also be interesting for RL/multi-step dialogs.
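In symbols, the chain rule being discussed (the standard Kolmogorov-complexity result, stated up to logarithmic terms):

```latex
K(X,Y) \;=\; K(X) \,+\, K(Y \mid X) \,+\, O(\log K(X,Y)),
```

and the universal-distribution intuition from the comment above is 2^{-K(X,Y)} ≈ m(X,Y) = m(X) · m(Y|X) ≈ 2^{-K(X)} · 2^{-K(Y|X)}, where m is the Solomonoff universal distribution.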
@jony7779 8 months ago
Does anyone know the name of the theorem being shown at 4:54 so that I can look it up to learn?
@jony7779 8 months ago
Found it, it's called the "Vapnik-Chervonenkis dimension". Asked GPT-4 😁
@jony7779 8 months ago
could have just watched for another 2 minutes to get the answer 😬
@cansacan7534 8 months ago
Pretty standard introductory ML theorem.
@jony7779 8 months ago
cool! @@cansacan7534
@Alex-fh4my 8 months ago
@@cansacan7534 thanks for the help mate
@joeremus9039 7 months ago
What is data compression?
@JCResDoc94 8 months ago
31:10 GPT models _JC
@vev 8 months ago
Scott Aaronson asking question!?
@max0x7ba 8 months ago
K(C)
@karigucio 8 months ago
right, just no way of knowing whether some Ki outputs C
@max0x7ba 8 months ago
@@karigucio We are only interested in the length of C, K is just an artefact to get there, is it not?
@karigucio 8 months ago
@@max0x7ba I was referring to the fact that in the procedure of actually searching all programs shorter than C, you need a way of checking whether a given program outputs C. And this is undecidable, so the whole procedure is undecidable even though the search space is finite, as you said. But I didn't quite get the second remark, so excuse me if we're on different wavelengths here.
@joeysipos 8 months ago
So is he saying there is some algorithm to take the compression of, say, large language models and generalize it to work with, say, vision?
@digzrow8745 7 days ago
I miss him
@alonsomartinez9588 8 months ago
Reminds me of VR concerts
@wege8409 5 months ago
In what way? Just the general atmosphere of the presentation?
@DayB89 8 months ago
Just writing my thoughts here only 14 mins in, but I don't understand how that explanation makes it clear that distribution matching works. I mean, the transformation function could be any function... even one that matches the distribution but makes no sense. It would be like me attempting a word-by-word translation of English to French just by looking at word frequency. What could go wrong =)
@GenzaiHonkaku 8 months ago
The assumption in your example could be that all specific samples of people communicating in some language fit some underlying universal distribution of human communication. As long as the same idea can be expressed equivalently across different languages, you could in theory determine whether 2 examples of language are expressing the same idea by modelling the probability of a certain idea being expressed. This relies on the assumption that we as humans roughly communicate in the same ways across all languages. For the most part, I feel like this is an accurate assumption to make. However, there are differences in culture which define different distributions of what is talked about in the languages used by those cultures. Which is where bias comes in. Different datasets are biased towards different distributions.
@GenzaiHonkaku 8 months ago
But I think you are correct in that matching the distributions on a word-by-word basis would result in poor translation accuracy. I think that which tokens you are trying to match need to contain all of the relevant context and information you need to capture. If I recall correctly, I think GPT encodes paragraphs of vector-encoded words to vector representations.
@DayB89 8 months ago
@@GenzaiHonkaku And even if that's the case, even if you are matching full-context distributions... isn't it possible that the mapping function F does a transformation that doesn't relate to meaning? My point is that I'm having a hard time telling apart this assumption from mere wishful thinking. In other less-technical terms, I can make a bread fit into a battery socket without the bread becoming a battery.
@GenzaiHonkaku 8 months ago
@@DayB89 I would say a more apt analogy would be trying to force a loaf of bread into one of those kids' 'put the shape in the shaped hole' toys. The important thing about the socket is the shape of the thing being put into it. But it's true that you would still end up with a squished loaf of bread that doesn't look like the shaped hole it was forced into. If you can agree that this is a *fitting* analogy, then you might also agree that it's heading into philosophical territory. If something looks like a rose, and smells like a rose, but doesn't taste like a rose, has it still achieved everything it needs to be considered a rose?
@DayB89 8 months ago
@@GenzaiHonkaku Well, your answer made me realize that I just picked the wrong analogy. I'll get back to you once I find the right one.
@Snshqgavks 5 months ago
To better understand this video, what should I study?
@SchoolofAI 8 months ago
What are 100 videos that are MUST-WATCH for AI enthusiasts in 2023?
@qanon4realvsqanon4gery70 8 months ago
If you are gonna watch 100 videos I think just go with recorded university lectures on the topic
@Morimea 8 months ago
KZbin search: 1. MIT Artificial Intelligence, Patrick Winston 2. Andrej Karpathy Let's build GPT: from scratch idk what you mean by "2023", but general knowledge/understanding how it work is most important, and those lectures I mention give you full overview
@SchoolofAI 8 months ago
@@Morimea Thank you
@realspacemusicvideos 7 months ago
Ilya is so good that I have been forced to buy $MSFT to secure my UBI after my white collar job gets replaced by AI:)
@AnirudhAjith 6 months ago
42:50 Is that Scott Aaronson?
@consumidorbrasileiro222 5 months ago
yes
@huyle3597 8 months ago
what's the reference for the inequality he showed at around 4:05?
@markpfeffer7487 8 months ago
S tier brain. D tier PowerPoint aesthetics.
@user-qm2eo7er4u 7 months ago
Everyone knows that as you get more baller you don't have to waste time on fancy slides
@jimlbeaver 8 months ago
We need more people that are at least half as smart as him
@mathtick 8 months ago
Why exactly is this "unsupervised"? Yes, I'm focussing on the semantics because it throws me. Unsupervised is just P(Y|X) where X is the null feature set, no? He then goes on about learning a transformation of random variables F s.t. F(X) \sim Y ... but this is then directly supervised learning. Without a rigorous definition I find the terms ... empty. The compression angle is interesting but is its own thing.
@michelb9044 8 months ago
@@mathtick according to this definition, unsupervised matches a distribution instead of finding a correspondence between individual points (that is why datasets X and Y are not paired, contrary to supervised learning)
@mathtick 8 months ago
@@michelb9044 supervised learning can also find correspondence between distributions. I think there is something of value in getting precise here in the abstract. Likely it is done somewhere. "Points" could be anything really. And then we get into permutation invariance etc. I suspect the important point here (for the bound) is the way the VC dimension is defined, but that is just a guess. Haven't spent time on it at all.
@michelb9044 8 months ago
@@mathtick yes, supervised learning can do it too, but it has more information available because the samples ("points") are paired, in the form (x_k, y_k), contrary to unsupervised. Unsupervised is not necessarily doing something that supervised cannot do; it is doing something with less information available. Btw, that definition of unsupervised fits the setting of a standard generative model: mapping a predefined latent distribution (say Gaussian) to a data distribution (say an image dataset).
@krox477 8 months ago
This guy has a postdoc in CS; he's been studying this stuff his whole life
@eskelCz 8 months ago
The compression/prediction could be more simply thought of as pattern recognition
@user-ut4zh3pw7l 8 months ago
I got it
@ruoshiliu6024 8 months ago
you did
@hola-kx1gn 8 months ago
you did, didn't you
@brookshamilton1 8 months ago
Did you?
@user-ut4zh3pw7l 1 month ago
@@brookshamilton1 i grokked it real hard
@xyh6552 5 months ago
$K(Y \mid X)
@MrNoipe 8 months ago
Why is compression==prediction? Zip files or jpegs do not do prediction.
@jacqueslecrabe 8 months ago
But both zip files and jpeg files have opinions about what the most likely value of a pixel in a given image is (or a character in a text file), namely one that will compress better.
@qanon4realvsqanon4gery70 8 months ago
An image where every pixel is set to a random RGB value is not compressible. Very handwavily, the way PNG compresses an image is by predicting that there are a lot of continuous patches of the same color, i.e. predicting that the next pixel is the same as the previous one, and recording where that prediction breaks. (This is not exactly how the real PNG works.)
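A toy version of that filtering idea (my own sketch, not the real PNG filter set): predict each value as the previous one, keep only the residuals, then hand them to an entropy coder; the better the prediction, the more compressible the residuals.

```python
import zlib

# A smooth "scanline": neighbouring values change slowly, so they are predictable.
row = bytes((i // 5) % 256 for i in range(5000))

# Predictor: next value equals the previous one; store prediction errors instead.
residuals = bytes([row[0]] + [(row[i] - row[i - 1]) % 256 for i in range(1, len(row))])

print("raw:", len(zlib.compress(row)), "bytes")
print("predicted-then-coded:", len(zlib.compress(residuals)), "bytes")
```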
@plafar7887 8 months ago
The predictor, in that case, is the unzip utility. It predicts the original from the compressed.
@DM-fw5su 8 months ago
@@qanon4realvsqanon4gery70 It is compressible; you are correct that it may not result in a space saving. That doesn't discredit the storage encoding algorithm from being called a compressor. If you had to pick an algorithm for everything, what is important is what works better after being given all the information of human knowledge. Humans do not naturally use random concepts to describe disparate things; we naturally have few concepts to describe a great many things from a large number of different fields of knowledge, and a great many things yet to be invented/understood about the universe. The human brain is naturally a lossy compressor; maybe absolute errors (momentary errors in recall and correction) are the cause of eureka moments.
@afrozenator 8 months ago
Talk starts at 0:14
@wege8409 5 months ago
Lol
@rv706 7 months ago
"A particular observation on generalization" would've been a cooler title 🙃
@dotnet364 7 months ago
OpenAI is now worth 90B. The employees have sold their shares to investors. They have their roi after 7 yrs.
@jonatan8392 8 months ago
I think your intro about supervised learning is a bit misleading. Yes, it is true theoretically that test-error ~ training-error if the model complexity is small, for instance in terms of VC-dimension. However, this VC theory cannot explain the success of deep learning. In practice, large models with billions of parameters are used and trained to nearly 0 training-error and STILL the test error is small. Applying the VC-dimension argument to these models doesn't give you anything, because the observed test-error is orders of magnitude smaller than the test-error predicted by VC theory.
@mathtick 8 months ago
Why is he calling "next pixel prediction" an UNSUPERVISED task?
@qanon4realvsqanon4gery70 8 months ago
The next pixel is part of the data, not an explicit label
@Draganel87 8 months ago
Because there are no labels provided. They just feed the data and the model learns the underlying regularities.
@mathtick 8 months ago
Supervised just means P(Y|X). There is no meaning behind "label". You label it in a sense by honouring the "order" of the pixels. This is what I mean. The whole AI tech bro world fails to be specific about meaning and everyone spins around thinking different things. Let's nail down a *precise* meaning of this stuff. The compression results at the end are cool, but it seems unnecessary to start with this supervised/unsupervised question. It sounds like all of these problems, we can probably agree, are to do with estimating distributions P(Y|X) where X *might* be the empty set. (I'm leaving \theta out.) Putting this in the realm of probability is already quite restricting, as generalization might matter for pure optimization settings (compression). But it sounds like what people usually mean. And I am saying even learning distributions is still within this framework, but you might have to remap things a bit. I dunno. But P(Y|X, \theta) where X is the empty set sounds like it covers unsupervised learning. @@qanon4realvsqanon4gery70
@mathtick 8 months ago
So my guess is that when you look at the distributional learning problem with X being the empty set, those generalization bounds become quite weak.
@cennywenner516 8 months ago
Really exciting to hear that OpenAI may be trying to realize, or find inspiration in, the impractical but seemingly profound learning theory!

My own naive thought on the Kolmogorov analogy, though, is that the challenge is not just in the program search but in the offset caused by the specific computational model (UTM), aka inductive bias. For lots of data, the latter term goes to 0, while for little data the program search becomes easy but generalizes poorly. For lots of applications (RLHF, downstream), it does not seem sufficient to merely encode it as the same unsupervised stream in the LLM, perhaps because it is a distribution shift (or a narrow subset), even though unsupervised/self-supervised training is usually needed in some form along with the fine-tuning. One way to potentially think about this is that instead of the new task being conditional prediction, or conditional compression of the following sequence, it is instead that the pre-training conditions the underdetermined program space/inductive bias, which enables good generalization also with little data.

Re order of data - I think this just comes down to the old debate on whether overparametrized deep networks eventually converge to a global optimum, which is not seen in practice for LLMs due to early stopping of overparametrized models for training efficiency; and finally, for the connection, one should consider a differentiable machine representation. This does not seem obviously inconsistent with an analogous compute-bounded program search.

Maybe I misunderstood what was meant by linear representations, but I think there is something like linear representations being preferable given any kind of regularization, i.e., any inductive bias beyond uniform, which is expected with a universal distribution.
@jabowery 8 months ago
The idea that the open parameter of UTM choice renders Kolmogorov Complexity ill defined for most practical purposes is pedantic obstructionism. I suspect it is really just an excuse to not think about what exactly is sacrificed by using less principled loss functions than the size of an executable archive. It's much easier to just calculate some approximate loss function like MSE or whatever -- perhaps adorned with some regularization to reduce parameter count -- than it is to think about what has been sacrificed. It is worth thinking about that question and doing so in depth because it gets to the very heart and soul of what people are running around in panic regarding "algorithmic bias". This is particularly egregious when they don the robes of scientific ethics after having abjured the only principled foundation for their ethics.
@hanskraut2018 8 months ago
Random comment: I have to admit i like more extensive comments, i did not even necessarily agree or disagree - i like where it seems to be coming from and the format. ^^
@cennywenner516 8 months ago
@@jabowery - The principled part is where you insert assumptions and domain knowledge. Inductive bias and algorithmic bias are not related concepts. Inductive bias is what makes learning possible at all - why things that happened in the past are more likely to occur again in the future, rather than remaining 'entirely random'.
@jabowery 8 months ago
@@cennywenner516 Bias minimization in model selection is the ethical foundation I'm addressing. Statistical information criteria for model selection are all over the map precisely because statistics are based on Shannon Information rather than Algorithmic Information. If your model selection criterion is unprincipled, you are in even worse shape dealing with the all-too-human tendency for self-deception that the scientific method is intended to address. An otherwise principled approach to algorithmic bias that skips this step elides science. It has to get this right before it can address other biases for the simple reason that self-deception begins with conflation of Hume's "is" vs "ought".
@NerdFuture 8 months ago
You say "the Kolmogorov analogy," and go on to guess what it's about. But it's *not* an analogy, it's just a limit, used in the way Sutskever does to define mutual information, on his way to saying that taking advantage of mutual information is a good indication of erm, having learned something useful.
@xyh6552 5 months ago
The linear property itself is a compression; this is a trivial observation that can be obtained by studying differential manifolds.
@mathtick 8 months ago
Why can't you just pose it as supervised learning - learn P(Y|X) where X = null (i.e. the null feature choice)? VC(X) = 0 or whatever it is. I find the lack of rigorous definitions here alarming.
@plafar7887 8 months ago
It's not alarming, it's simply difficult to turn this reasoning into rigorous arguments, for the time being. It's hard enough to even express it intuitively. A lot more research is needed.
@Alejandro388 8 months ago
he is high on what?
@isaacandrewdixon 8 months ago
high on life
@DavenH 8 months ago
intellect
@jameelbarnes3458 7 months ago
Binary tokens.
@mathtick 8 months ago
The guy at the end gets at the point that the semantics are scrambled, and these guys are confused about the semantics and the expression of the problems. Obviously very smart, but bad at writing about this stuff / expressing it.
@GraczPierwszy 8 months ago
Well well, we'll see what it all comes down to, blah blah blah. So far, we will see the effects in the real world, not on the board.
@plafar7887 8 months ago
GPT models are blah blah blah?🤔
@kumarkartikay 8 months ago
It seems Ilya has solved (what George Hotz is calling) "entropics" in one lecture just like Shannon solved Information theory in one paper. SGD on Neural Nets = Program Search = Compression = Intelligence!
@Alex-fh4my 8 months ago
I don't know about solved. Hotz is aware of the notion that prediction = compression = intelligence; he mentioned some blog post on stream about this. Regardless, "compression = intelligence" still doesn't answer the question of "How much intelligence does it take to solve Fermat's last theorem?"
@edh615 8 months ago
it's not solved, and for George's ideas I'm not sure he even knows what he means.
@brianewing1428 8 months ago
Say we solve prediction. How do you do intention?
@Alex-fh4my 8 months ago
@@brianewing1428 solve prediction?
@brianewing1428 8 months ago
@@Alex-fh4my yes, say we 'solve' that. we have a crazily general, Oracle token predictor. you give it a computational hippocampus, and long term storage. how do you implement the task loop?
@infochopper 8 months ago
ClosedAI
@surkewrasoul4711 8 months ago
I never thought I would give it up this easily, but I must confess that this AI thing got it all out of me so easily, in one shot. Kind of embarrassing, but hey, it takes experience. Also, as Mr Bean so wisely said once, it's with age that comes wisdom 🤣. The most difficult task would perhaps be teaching it honesty, and quite possibly not to ruin other people's coffee mugs with misfires and things like that 🤣. I say we teach it Abraham Lincoln's famous quote: you may fool half of the people half of the time, and half of the time half of the people, but not all the people all the time. Great talk, enjoy your days guys 🤙✌
@kenmogibrainworld4844 8 months ago
In the phenomenology of consciousness, qualia are great compressors.
@RoboticusMusic 5 months ago
Huh, just explain normal
@kenwolf887 8 months ago
It's all about entropy and complexity.
@DavenH 8 months ago
It's all about fate and destiny.
@krultorwaru121 7 months ago
How this guy is regarded as one of the best experts in AI is beyond me. His level of intuitive understanding of neural networks is depressing.
@krultorwaru121 7 months ago
@@WMD911 actually, it's all in what he said and everything he didn't say. For example, his main point - the comparison to Kolmogorov complexity - is wishful thinking rather than anything of substance. SGD on the weights of a neural net is a very particular search over very particular programs that need not result in "the shortest program", whatever that means in the context of NNs.
@WalterSamuels 8 months ago
Why is your company called OpenAI if you're the opposite of open? OpenAI is supposedly the "leader" of the AI space, yet it's the least generous out of all AI companies. Meta is constantly releasing new research, tools, open-source software and the rest, while OpenAI is busy playing politics and trying to squash competition. It's really quite sad and disappointing. No wonder you had such a large exodus of talent. Hopefully you'll change your ways.
@JD-jl4yy 8 months ago
They're being more responsible than Meta. They're thinking ahead about how we can prevent AI from derailing society. Just releasing everything with 0 safety concerns like Meta is going to get really, really dangerous one day.
@bananabreeding1362 8 months ago
Post-Vaswani generative pretrained transformers have opened a vast and potentially civilisation-changing (or ending!) opportunity for cognitive/linguistic AI and much more. We are witnessing the onset of a huge adaptive radiation that will involve virtually all of AI technology. Some firms will be open. Some will be tightly closed. Some offerings will be totally free, even facilitating individual humans to develop intensely custom versions of standalone single-user products. This is just the beginning.
@WalterSamuels 8 months ago
@@bananabreeding1362 Doesn't really have anything to do with my point though. If anything it amplifies it: Google gifted them transformers and they still don't see the value in sharing.
@primersegundo3788 1 month ago
why?..... he has already answered that question many times, just check any interview.
@TPQ1980 6 months ago
Within the first minute it is admitted that "OpenAI" is not open. Open development doesn't have aspects that developers "can't talk about." By 20 minutes in, the guy in the video has said thousands of words without saying almost anything at all; it's like he's winging it, or doesn't really understand his subject. The guy in the video even admits machine learning is not particularly difficult to understand, yet he does a terrible job of conveying understanding. It's almost like he's more interested in trying to convey the impression he's really intelligent, rather than trying to convey meaning and understanding. I suppose it's possible that English is his second language, or perhaps he's just not that experienced at public speaking. Maybe he's one of those people who's really good with mathematics, but terrible with language? I just get the impression from this video that he's something of a charlatan.
@FreakyStyleytobby 6 months ago
You're very wrong; I learned a lot from this lecture. In this lecture Sutskever explains how to understand ML in mathematical terms. That's the whole problem he set out to address, so how do you expect him not to use mathematics in his description? It's not popular science.
@wege8409 5 months ago
Ilya was involved in the creation of AlexNet in 2012. A lot of people point to AlexNet as the thing that rekindled interest in AI; many academics thought neural nets were hopeless before it. He was also involved in the creation of AlphaGo; Go was a really hard problem for a long time because of the explosion of possible outcomes. Elon says hiring Ilya was the tipping point for OpenAI's success, and honestly I think I believe him. This guy is definitely the real deal. If you want a charlatan, look at Sam Altman peddling WorldCoin.
@wege8409 5 months ago
Also it's no secret that OpenAI isn't open, they talk a lot about their decision to switch from non-profit, basically you need bottomless computation to build state of the art models, and for bottomless computation you need bottomless money. If you look up "OpenAI Panel" it explains a lot about how they operate.
@JohnDoe-nv2op 8 months ago
All on top of the "SGD miracle". When we realise that SGD is just crap, all this DL horseshit will fall into oblivion.
@k.8597 7 months ago
could you elaborate on what you mean? I know it's been a month since this comment so this might be a shot in the dark, but I'm wondering why exactly SGD is bad? Is it a heuristic for the optimal weight values that they came up with that works well enough to not question its use? I'm in undergrad just learning about this stuff, so pardon my ignorance
@JohnDoe-nv2op 7 months ago
@@k.8597 SGD is incompatible with compositional learning. In other words, you need to readjust (potentially) *all* the weights in the net in each backward pass. In a system with continuous learning, this produces catastrophic forgetting. ANNs need compositional learning to reach "common sense". SGD's approach to learning, in my view, is unable to do that. We need some learning algorithm capable of performing continuous learning before moving on to greater challenges.
@k.8597 6 months ago
@@JohnDoe-nv2op ah so would that mean finding a way to be able to select the weights that lead to the greatest amount of memory being retained, and then doing backprop on the ones that don't?
@JohnDoe-nv2op 6 months ago
@@k.8597 I think so. The question is how
@wege8409 5 months ago
I don't think the catastrophic forgetting comes from SGD. I think it's the architecture, or maybe it's the learning scheduler, or a combination of the two.
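Either way, here is a toy illustration of the forgetting effect being debated in this thread (my own sketch, with deliberately conflicting made-up tasks; whether the blame lies with SGD or the architecture is left open): fit a small net on task A, keep running plain SGD on task B with no rehearsal, and the task-A fit is destroyed because every backward pass may touch every weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)

x = torch.linspace(-1, 1, 64).unsqueeze(1)
y_a = torch.sin(3 * x)   # task A
y_b = -y_a               # task B: deliberately conflicting targets

def fit(y, steps=2000):
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()

fit(y_a)
print("task A loss after A:", nn.functional.mse_loss(net(x), y_a).item())
fit(y_b)  # no rehearsal of task A
print("task A loss after B:", nn.functional.mse_loss(net(x), y_a).item())  # much worse
```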