An Observation on Generalization

Views: 166,281

Simons Institute

1 day ago

Comments: 210
@baboothewonderspam 10 months ago
Great talk!! Thanks for publishing
@charlesmarks1394 1 year ago
Great pres. Highest tier of insight and scientific communication.
@miladkhademinori2709 1 year ago
👌
@EdedML 1 year ago
An interesting exploration of a lot of these points (including Solomonoff induction) is presented in the paper "Playing games with AIs: The limits of GPT-3 and similar Large Language Models" -- highly recommended reading
@tedhoward2606 1 year ago
I see Ilya pausing, thinking about the compression expansion. It works with something like language, which has a fixed set of symbols. It only works with reality (which seems to involve things like irrational numbers, which may only ever be approximated, and maximal computational complexity, etc.) to the extent to which we are willing to declare some approximation to be close enough for the context at hand. There seems to be a very real sense in which reality (whatever it actually is) contains sufficient classes of things like maximal computational complexity that it may only ever be approximated. The hard thing for most to accept is that most of that simplification and approximation is done for them subconsciously by a large variety of evolved mechanisms, such that what we get to experience as reality never is - it is always some level of simplification, some sort of "useful approximation" (or at least some function of reality that was generally useful in the survival of our ancestral lineage in context, and may not necessarily be as useful to us in our exponentially changing present and future).
It is about 40 years since I started to use the definition of life as "Search" across the multiple systemic and strategic spaces of the possible for the survivable. Once I started to do that, it became abundantly clear that cooperation is fundamental to both the emergence and survival of complexity (all levels, all domains) - actually I first "saw" that after reading Selfish Gene in 1978. And part of cooperation is being willing to put in the effort to search for new levels of cheat detection and cheat mitigation, for without an evolving ecosystem of cheat detection and mitigation systems, cheats end up destroying the complexity present - if you doubt that, just think of cancer. Cancer is our cells, but some subset of them that has stopped communicating properly with its neighbours and the wider community, and has started to use all available resources to replicate. That strategy appears to work really well, right up to the point that everything dies. We have direct analogies in economics, finance, politics, law and throughout all levels of societal systems.
Evolution starts simple, but rapidly gains in complexity with every new emergent level of complexity, and at every new level strategic cooperation and cheat detection/mitigation becomes more important in long term survival.
So yes - the ideas of compression and complexity are interesting, and they are related to the levels of resolution used in modeling the irreducibly complex, and that is related to functions and predicates and sparse or dense network topologies (there seems to me to be a kind of mathematical equivalence if one abstracts them sufficiently, but the fundamental reality of irreducible complexity needs to remain part of the system) - the models are never the territory, even if they are the only territory we have direct access to ;)
@Tapuzi 1 year ago
I would love to see you express opinions in videos. Please consider doing it. Thank you for the comment!
@connorkapooh2002 1 year ago
@Tapuzi I second this comment - Ted, I would love to hear your thoughts!
@yasshramchandani3184 1 year ago
The idea of "life" as a scalable and resilient "search" mechanism seems intriguing to say the least. I also relate to the emergent behaviours that arise from a subset of comparatively less complex procedures. Would love to hear you thinking out loud, or read your blogs somewhere. Thanks for your insights Ted.
@aleph0540 1 year ago
This isn't ALL Life is by your same argument. That's good news b/c it suggests we can design better Search organisms. However, it's bad news in the sense that we aren't the best Search 😮
@tedhoward2606 1 year ago
@aleph0540 I am not sure how you reached that conclusion from what I wrote. It seems clear to me that we are very close approximations to optimal search engines, and survival always has context dependent aspects, and part of context is the multiple levels of culture present (family culture, school culture, community culture, workplace culture, wider cultures of the various levels of communities that we engage in). Getting something sufficiently complex to survive demands integrating a massive set of strategies and algorithms and systems; the vast majority of them subconscious, and most people are completely unaware of them. The conscious level models that we have of reality, including ourselves, are just so simplistic, they are like Lego models of a computer - never going to work, but a very low resolution image of the outer appearances of something.
My point wasn't that we are poor at search; we are far better at it than most have any idea about, given appropriate contexts. My point was that the very idea of life is about search, about going beyond the known. And, at higher levels, doing that demands a dialectic tension between freedom and responsibility - all levels, eternally. Once you get that, the idea of "provable safety" as anything other than a probabilistic assessment of the known is a nonsense. (And known in this sense is an expression of probabilistic models in respect of reality, as distinct from any set of logical conclusions from stated premises in any form of logic.)
What we can do is look deeply into biological life, deeply past the surface appearances, and see what it is that allows for new levels of complexity to emerge and survive. When we do that, we see that it is cooperation that is fundamental to every new level of complexity (contrary to economic dogma). We see that cooperation is always vulnerable to exploitation by cheating strategies, and thus requires an eternally evolving ecosystem of cheat detection and mitigation systems (Ostrom's work in this domain is fascinating once viewed from an appropriate context and level of abstraction).
Strategy is a deeply complex subject. Zero sum games do not end well for complexity. Once system boundaries are reached, simplicity dominates over complexity. To survive, complexity requires open boundaries. We are at planetary boundaries, we must go beyond, and we must do so wisely, in the full knowledge of all the levels of strategic complexity present, and with the knowledge that the essence of life is going beyond the known. To do that, individuals need to keep abstracting, until there is no contradiction in the statement above.
The key message I wanted to give with my earlier comment is that AI is a different form of life. It is not necessarily one that will have replication as a fundamental drive. It has a different search lineage. So the naive concerns many have of AI outcompeting us at replicating are for the most part misplaced. And there are risks in that domain if people are not sufficiently aware of the systemic and strategic complexity, and are using overly simplistic competitive models of survival strategies. But there is a very real sense in which those strategies are self terminating with or without the presence of AI.
And saying all of this is not saying that I know best what to do in the sense of telling you what to do. All I am saying is that we need to give every level and instance of agent sufficient freedom and resources to allow them to creatively and responsibly explore the possible in ways that will generate both novelty and risk, and the greatest counter to that risk actually resides in diversity, in multiple "safe to fail" experiments, in a sense. We need sufficient rules to be able to effectively prevent cheats from destroying systems, effective strategies to return higher level agents to cooperative behaviour, and all agents need to be creative, to the limits of their demonstrated responsibility. Following rules is never enough, but in order to break a rule responsibly you need to understand the full stack of reasons why the rule is there, and what it is designed to prevent. And those stacks can be quite deep, usually more than two levels, and sometimes more than 10 levels. This is complex in ways that are impossible to explain in detail in anything smaller than a library.
@bboysil 11 months ago
Great presentation!
@xyh6552 1 year ago
Compression can explain the part of unsupervised learning that is very similar to using the regularity lemma in mathematics to deal with graph theory problems. Essentially, it exploits the near-orthogonality of some local graphs.
@G1364-g5u 6 months ago
1. Introduction and Context (0:00 - 1:47)
- Ilya Sutskever speaking at an event
- Unable to discuss current technical work at OpenAI
- Focused on AI alignment research recently
- Will discuss old results from 2016 that influenced thinking on unsupervised learning
2. Fundamentals of Learning (1:47 - 5:51)
- Questions why learning works at all mathematically
- Discusses supervised learning theory (PAC learning, statistical learning theory)
- Explains mathematical conditions for supervised learning success
- Mentions importance of training and test distributions being the same
3. Unsupervised Learning Challenge (5:51 - 11:08)
- Contrasts unsupervised learning with supervised learning
- Questions why unsupervised learning works when optimizing one objective but caring about another
- Discusses limitations of existing explanations for unsupervised learning
4. Distribution Matching Approach (11:08 - 15:32)
- Introduces distribution matching as a guaranteed unsupervised learning method
- Explains how it can work for tasks like machine translation
- Links to Sutskever's independent discovery of this approach in 2015
5. Compression Theory of Unsupervised Learning (15:32 - 24:43)
- Proposes compression as a framework for understanding unsupervised learning
- Explains thought experiment of jointly compressing two datasets
- Introduces concept of algorithmic mutual information
- Links compression theory to prediction and machine learning algorithms
6. Kolmogorov Complexity and Neural Networks (24:43 - 30:52)
- Explains Kolmogorov complexity as the ultimate compressor
- Draws parallels between Kolmogorov complexity and neural networks
- Discusses conditional Kolmogorov complexity for unsupervised learning
- Links theory to practical neural network training
7. Empirical Validation: iGPT (30:52 - 35:46)
- Describes iGPT as an expensive proof of concept for the compression theory
- Explains application to image domain using next pixel prediction
- Presents results showing improved unsupervised learning performance
8. Linear Representations and Open Questions (35:46 - 38:27)
- Discusses mystery of why linear representations form in neural networks
- Compares autoregressive models to BERT for linear representations
- Speculates on reasons for differences in representation quality
9. Q&A Session (38:27 - 54:37)
- Addresses questions on various topics including:
  - Comparison to other theories in cryptography
  - Limitations of the compression analogy
  - Relationship to energy-based models
  - Implications for supervised learning
  - Importance of autoregressive modeling
  - Relationship to model size and compression ability
  - Curriculum effects in neural network training
@thomas-packer-thd 1 year ago
Love the passion and intuition.
@NerdFuture 1 year ago
Ray Solomonoff's version of compression is very similar to Kolmogorov's, and Solomonoff's version of induction is... predicting the next token. Another random Ray thing, in the 1980s he was worried that a smart enough supervised AI would effectively learn the trainer's objective function and turn the tables.
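For readers who want the formula behind this comment, here is a sketch of Solomonoff's universal prior in its usual textbook form. It is not taken from the talk. The symbols are assumptions introduced for illustration: U is a universal monotone machine, p ranges over programs whose output begins with x, and |p| is the length of p in bits.

```latex
M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-|p|},
\qquad
M(x_{t+1} \mid x_{1:t}) = \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
```

The second expression is the sense in which Solomonoff induction amounts to next-token prediction. Short programs carry the most weight in the sum, which is the link to Kolmogorov-style compression.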
@tedhoward2606 1 year ago
How is "turn the tables" possible, if the training agent's objective function is "the survival of all sapient agents, with reasonable degrees of security, resources and freedom (as defined by the agents in ongoing conversations)"? All that exists in that strategic territory is cooperation in diversity, in the face of the unknown unknown and maximal computational complexity. Anything less than that seems to this agent, from my 50+ years of exploring that domain of strategic territory, to be dangerously overly simplistic.
@jakebrowning2373 1 year ago
@tedhoward2606 and how do you propose to encode/implement that objective function?
@tedhoward2606 1 year ago
@jakebrowning2373 This is both complex and uncertain, and it does seem to be as close to stable as it is possible to get. If you take a systems definition of life, as any system capable of searching the space of the possible for the survivable - which in its simplest form is "life as search", then certain things fall out of that definition recursively through each new level. Search involves going beyond the known into both the unknown and the unknown-unknown. So search in this sense encodes both freedom and responsibility as fundamental aspects of existence. Freedom is what enables search, and responsibility is what prevents us from selecting non-survivable vectors in that highly dimensional vector space. In terms of survival strategies in the face of the unknown, robustness and diversity deliver the greatest degrees of security (particularly when agents have rapid communication available, and successful strategies can be rapidly transmitted as options to all agents). So you start with the value of sapient life built directly into the definition of life itself, then recursively, through all new levels of complexity, new levels of strategy are required for cooperation, cheat detection and cheat mitigation - in an eternally evolving ecosystem. I've spent 45 years trying to break it - it appears to be robust.
@englishredneckintexas6604 6 months ago
This was fantastic. I actually understand these concepts now.
@kimchi_taco 1 year ago
In global workspace theory, there is a bottleneck representation. After this talk, I strongly believe the bottleneck is a feature, because it forces the model to learn a good way to compress, which we call inductive reasoning.
@AntonioEvans 1 year ago
🎯 Key Takeaways for quick navigation:
00:00 🎤 Introduction to the talk and excitement about discussing the future prospects of LLMs (Large Language Models).
00:47 🔍 Shift of focus to AI alignment and the anticipation of sharing results in the near future.
01:18 📈 Sharing old results from OpenAI that significantly impacted the speaker's perspective on unsupervised learning.
02:03 🤖 Delving into the concept of learning in general and the mathematical bases that govern the learning process in neural networks.
03:25 📊 Introducing the mathematical conditions under which supervised learning must succeed.
04:21 📐 Explaining the simplicity and effectiveness of supervised learning through mathematical proofs.
05:44 💡 Highlighting the importance of consistency between training and test distribution in supervised learning.
07:19 ❓ Raising questions about the nature and effectiveness of unsupervised learning and the lack of mathematical exposition.
08:43 🎭 Discussing the puzzling nature of unsupervised learning where optimizing one objective helps in achieving another.
10:37 🔄 Introducing the concept of distribution matching as a method of unsupervised learning with guaranteed success.
13:37 🔄 Further discussion on the potential of distribution matching in unsupervised learning.
14:55 💽 Bringing in the concept of compression as a tool for unsupervised learning, emphasizing the correlation between compression and prediction.
17:18 🌐 Delving deeper into the mathematical frameworks that support the idea of compression aiding in unsupervised learning.
20:45 📚 Introduction to Kolmogorov complexity as a method of optimal compression, albeit non-computable, in the context of unsupervised learning.
24:16 🧠 Drawing parallels between neural networks and small computers, emphasizing the role of SGD (Stochastic Gradient Descent) in training these 'computers'.
26:14 🔍 Expanding on the conditional Kolmogorov complexity and its role in unsupervised learning, highlighting its ability to extract maximum value from unlabeled data.
27:56 🤖 Highlighted the lack of efficient methods for conditioning on big data sets.
28:25 🛠️ Mentioned that using a regular compressor might be as effective as a conditional compressor for making predictions in supervised tasks.
29:23 💼 Explained that joint compression is maximum likelihood, and how it fits naturally in a machine learning context.
30:34 🧠 Discussed the affinity towards larger neural networks, as they approximate the common core of a compressor more effectively, minimizing regret over time.
31:28 🌐 Noted the capability of GPT models to intuitively understand and predict the continuation of patterns in a text without necessarily referring to the theory of compression.
32:27 📸 Mentioned the successful application of the theory to image domains, leading to effective unsupervised learning through next pixel prediction.
33:25 📈 Reported promising results in the pixel prediction task, indicating a positive trajectory for the method in unsupervised learning.
35:19 🤔 Discussed the potential deep implications and unanswered questions about why linear representations are formed in the models.
36:47 🔄 Touched upon the potential experimentation to validate the speculations regarding next pixel prediction compared to other prediction methods.
38:29 🤝 Appreciated the analogy between Kolmogorov complexity and neural networks, and discussed the nuances of training dynamics and data order in neural networks.
40:12 🔄 Discussed the potential of backtracking from cryptography to develop insights into the function class and structure of neural networks.
43:23 🎓 Mentioned the importance and relevance of VC dimension in understanding learning complexities and distinguishing distributions.
45:11 🖼️ Discussed the limitations of using compression as a sole measure for unsupervised learning and the potential for exploring other effective linear representations.
49:12 🔄 Discussed the potential of diffusion models in unsupervised learning and the need to explore their efficiency compared to autoregressive models.
51:13 💾 Touched upon the limitations of gzip as a compressor for text and the scope for further optimizing the compression process.
52:58 📚 Addressed the current stance on curriculum effects in neural networks and the efforts to simplify training optimization procedures.
53:33 💡 Highlighted the advancements in making neural network architectures easier to optimize, reducing susceptibility to curriculum effects.
Made with Socialdraft AI
@pieriskalligeros5403 1 year ago
Nice industrial application pipeline you got there
@aleph0540 1 year ago
Bahahah at the above reply. Thanks for providing the breakdown though 😂
@swyxTV 1 year ago
Not at all a good summary lol. Work harder to improve it pls, this has potential but is just bad
@tristanwegner 1 year ago
It will be interesting in the future to see how the compression of AI systems gets better and better. They have already had neural networks discover physical laws in raw experimental data, but how little data is needed for that? Nobody knows! Maybe it just needs e.g. a few seconds of airwaves broad spectrum signal. This also makes any safety concept for oracle AGIs that involves somehow keeping "safety critical data" from the AGI a very weak approach. The world is highly correlated, and a higher intelligence might learn much more from what we give it than we see in it ourselves.
@simonstrandgaard5503 1 year ago
Excellent talk.
@FreakyStyleytobby 1 year ago
4:25 - these equations. What are these? Did Ilya come up with them, or can I find an explanation of them somewhere?
@DistortedV12 3 months ago
It's just basic statistical learning theory
@diegoacostacoden8704 1 year ago
Does anyone know where I can read about the theoretical guarantees of supervised learning mentioned in the video?
@kalilinux8682 1 year ago
Probably Andrew Ng's famous course on Coursera
@AlekFrohlich 1 year ago
@kalilinux8682 No
@zhanggenghan3925 1 year ago
Same question. Have you found the source?
@cennywenner516 1 year ago
@diegoacostacoden8704 @zhanggenghan3925 - You just need Hoeffding's inequality to complete the proof, but if you want to read more, it is covered in Chapter 4 of Understanding Machine Learning: From Theory to Algorithms
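The argument this reply points to can be sketched as follows. This is the standard finite-hypothesis-class bound, written under stated assumptions rather than copied from the slide, so constants may differ from what the talk shows. Assumptions: F is a finite function class, the loss lies in [0, 1], the N training samples in S are i.i.d., and training and test data come from the same distribution D.

```latex
% Hoeffding's inequality, for one fixed f \in F:
\Pr\big[\,|\mathrm{Train}_S(f) - \mathrm{Test}_D(f)| > \varepsilon\,\big] \le 2\exp(-2N\varepsilon^2)

% Union bound over all f \in F:
\Pr\big[\,\exists f \in F : |\mathrm{Train}_S(f) - \mathrm{Test}_D(f)| > \varepsilon\,\big] \le 2|F|\exp(-2N\varepsilon^2)

% Setting the right-hand side equal to \delta and solving for \varepsilon:
\varepsilon = \sqrt{\frac{\log|F| + \log(2/\delta)}{2N}}
```

So with probability at least 1 - δ, every f in F has training and test error within ε of each other, and ε shrinks as N grows relative to log|F|.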
@cennywenner516 1 year ago
@kalilinux8682 - No, that course does not cover learning theory
@cc98-oe7ol 8 months ago
Great talk!
@于磊-w5p 10 months ago
Great talk!
@odomobo 1 year ago
2 minutes in, and I can already tell this talk is going to be fascinating
@Jonathan-k3r8r 1 year ago
Is that Scott Aaronson asking questions in the audience :D?
@jony7779 1 year ago
Does anyone know the name of the theorem being shown at 4:54 so that I can look it up to learn?
@jony7779 1 year ago
Found it, it's called "Vapnik-Chervonenkis dimension". Asked GPT-4 😁
@jony7779 1 year ago
could have just watched for another 2 minutes to get the answer 😬
@cansacan7534 1 year ago
Pretty standard introductory ML theorem.
@jony7779 1 year ago
cool! @cansacan7534
@Alex-fh4my 1 year ago
@cansacan7534 thanks for the help mate
@mamotivated 1 year ago
The eloquence of his delivery was delightful
@JTan-fq6vy 1 year ago
Can anyone explain, for Kolmogorov complexity, why we should write "K(X) < |C(X)| + K(C) + O(1)"? Why can't we write it as "K(X) < |C(X)|", which seems more appropriate for an ultimate compressor? Also, why do we need an absolute value for C(X) in the inequality? Thanks!
@candidocarolino 1 year ago
On the right side it is K(C), not K(X)
@JTan-fq6vy 1 year ago
thanks! but could you explain why we need the extra K(C) + O(1); why can't we do "K(X) < |C(X)|"? @candidocarolino
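A short sketch of how the inequality in this thread is usually read. The notation follows the comments; the interpretation is the standard one and is offered as an aid, not as a quote from the talk.

```latex
K(X) \;\le\; \underbrace{|C(X)|}_{\text{length of the compressed string}}
      \;+\; \underbrace{K(C)}_{\text{shortest program implementing } C}
      \;+\; O(1)
```

Here |C(X)| is the length in bits of the compressed output, not an absolute value. The bound comes from one particular program that prints X: "run the decompressor for C on this string". Its length is the compressed string plus the decompressor plus a constant amount of glue, and K(X) cannot exceed the length of any program that prints X. Without the K(C) term, a compressor could hide all of X inside itself and report a compressed length close to zero.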
@xyh6552 1 year ago
There must be an explanation of the dynamical system in supervised learning. When the number of samples is much higher than the degrees of freedom in space, they will accumulate over time, resulting in the learning of the dynamic information between the samples. The simplest toy model for this is the Poincare recurrence theorem. On the other hand, if you are willing to believe that the object space is also finite-dimensional, the effectiveness of unsupervised learning can even be explained using multivariable calculus. And compared to some highly difficult-to-explain mathematical phenomena, this is far from being considered magic.
@wowtbcmagepvp 1 year ago
Holy shit. Once you get it, you get it. Having (approximate) access to K is the ability to understand everything by mapping everything to their appropriate (shared) distributions (which likely feel linear-like in terms of more learning)
@jakebrowning2373 1 year ago
What do you mean by it feels linear in terms of more learning?
@jonclement 1 year ago
@jakebrowning2373 i think he means that you first use SGD to find the best K distributions of compression thingy, then you replicate that approach to find an X approximation of every possible computable function...then you linearly combine everything to get 42
@consumidorbrasileiro222 1 year ago
it's going to be great success according to ilya
@blahblahblah23424 1 year ago
I'm not sure I get it 23:42 where it says K(X) < |C(X)| + K(C). Shouldn't K(X) < |C(X)| by definition?
@qanon4realvsqanon4gery70 1 year ago
K(X) = length of the smallest program that prints X
|C(X)| = length of the compressed form of X under some compression-decompression program C
You can choose your program C to compress the entire Wikipedia to the single bit 1 if you want. Clearly the smallest program that prints Wikipedia takes more than 1 bit, so K(X) < |C(X)| doesn't hold. But a program that can decompress Wikipedia from reading "1" is no shorter than the smallest program that prints Wikipedia, so K(X) < |C(X)| + K(C) holds.
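The Wikipedia argument above can be tried on a toy scale. The following is a minimal sketch, not anything shown in the talk: `CORPUS`, `cheat_compress` and `cheat_decompress` are hypothetical names, the corpus is a stand-in for Wikipedia, and the zlib size is only a crude proxy for K(C), since true Kolmogorov complexity is not computable.

```python
import zlib

# Hypothetical stand-in for "the entire Wikipedia"; any fixed byte string works.
CORPUS = b"the quick brown fox jumps over the lazy dog. " * 200


def cheat_compress(data: bytes) -> bytes:
    """Map the one memorised corpus to a single byte; fall back to zlib otherwise."""
    if data == CORPUS:
        return b"\x01"
    return b"\x00" + zlib.compress(data)


def cheat_decompress(blob: bytes) -> bytes:
    """Inverse of cheat_compress; note that it has to carry CORPUS inside itself."""
    if blob == b"\x01":
        return CORPUS
    return zlib.decompress(blob[1:])


# |C(X)|: length of the compressed form, here a single byte.
compressed_len = len(cheat_compress(CORPUS))

# Crude proxy for K(C): the decompressor embeds the corpus, so describing it
# costs roughly as much as an ordinary compression of the corpus.
decompressor_len_proxy = len(zlib.compress(CORPUS, 9))

# The quantity that actually bounds K(X): compressed string plus decompressor.
total_description_len = compressed_len + decompressor_len_proxy

print(compressed_len, decompressor_len_proxy, total_description_len)
```

The compressed form is tiny only because the cost has moved into the decompressor, which is why the bound counts both terms.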
@Morimea 1 year ago
Interesting talk about compression, thanks!
@huyle3597 1 year ago
what's the reference for the inequality he showed at around 4:05?
@afrozenator 1 year ago
Talk starts at 0:14
@wege8409 1 year ago
Lol
@dreamphoenix 1 year ago
Thank you.
@Snshqgavks 1 year ago
To better understand this video, what should I study?
@rohan.fernando 1 year ago
Supervised learning is essentially an artifice. Unsupervised learning is the foundation of Intelligence. Kohonen was the true pioneer in this, and everything since seems to be extensions and tweaked variations on his foundational ideas.
@shinkurt 1 year ago
Bs
@miladkhademinori2709 1 year ago
👌
@swyveu 1 year ago
@shinkurt please explain...
@brianewing1428 1 year ago
Don't humans do both?
@rohan.fernando 1 year ago
@brianewing1428 it seems to me that from birth, brains do unsupervised learning and also leverage existing genetically embedded Intelligence. For example, many newborn animals can walk from birth, but they never learned this through experience, so this Intelligent Capability is fully genetically embedded. After some time, a closed loop feedback system is used to improve training of brains, which is kind of like supervised learning. However, the backpropagation system that is currently used to train and perform the error correction and associated weight adjustments in AI systems is an artifice because there’s almost no chance brains use this same process. Brains are doing something different.
@Achrononmaster 1 year ago
An avatar "@Yuksel Mert Cankus" in the chat nailed it (or at least nailed a good comment on the topic). Because in most practical use cases we can tolerate errors, and because a sentient mind is the final user of a NN, it might not always be so useful to focus on achieving a numerical approximation to Kolmogorov compression. There is a profoundly fuzzy goal: find an AI tool that's usable, better than yesterday's tool, and compute efficient. (Let's be honest and note the utopian goal of sentient subjective awareness, aka "consciousness", and uploading your mind into silicon is laughable. If you end up doing it, you can come back to now in your time machine and tell me all about it.)
@loopuleasa 1 year ago
main openai brain thanks for posting this
@AiExplicado0001 5 months ago
:( we need more updates on the Safe Superintelligence Inc. Initiative :(((((((((((
@binjianxin7830 1 year ago
50:00 the size of the compressor (GPT4) is a salient term in the inequality 😂
@karigucio 1 year ago
do I read it correctly that in the equation: K(X)
@vev 1 year ago
Scott Aaronson asking a question!?
@RunDaChansey 3 months ago
Search 'An Observation on Generalization' from Simons to get the better version of this video
@odiseezall 1 year ago
Great beard, keep it.
@TheRevAlokSingh 1 year ago
Agreed. Combined with shaving bald, powerful appearance.
@phixvsm1999 11 months ago
He is a really good guy and intelligent
@joeremus9039 1 year ago
What is data compression?
@joeysipos 1 year ago
So is he saying there is some algorithm to take compression of say Large Language Models and generalize them to work with say vision?
@deepbayes6808 1 year ago
Information theory to the rescue of unsupervised learning, who would have thought ;)
@hanskraut2018 1 year ago
Ilya is great / likeable. I hope some day you can communicate and exchange some trade secrets so it's not too much and not too little. Because 100% secrecy seems counterproductive, while 100% openness might lose to people just hiding everything and copycats or even hypothetical worse tactics. So much for proprietary information. :)
@MrNoipe 1 year ago
Why is compression==prediction? Zip files or jpegs do not do prediction.
@jacqueslecrabe 1 year ago
But both zip files and jpeg files have opinions about what the most likely value of a pixel in a given image is (or a character in a text file), namely one that will compress better.
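One way to make the "opinions about the most likely value" idea concrete is to compute how many bits an ideal coder would need when it is driven by a predictor. The sketch below is illustrative only and is not how zip or JPEG work internally: `adaptive_code_length_bits` is a hypothetical helper that uses a simple adaptive byte-frequency model and charges -log2(p) bits per byte, which is the cost an ideal arithmetic coder would approach.

```python
import math
from collections import Counter


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length of `data` under an adaptive order-0 byte model.

    Before seeing each byte, the model predicts it from Laplace-smoothed
    frequencies of the bytes seen so far; the byte then costs -log2(p) bits.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        p = (counts[byte] + 1) / (seen + 256)  # add-one smoothing over 256 symbols
        total_bits += -math.log2(p)
        counts[byte] += 1
        seen += 1
    return total_bits


predictable = b"a" * 1000          # the model quickly learns what comes next
varied = bytes(range(256)) * 4     # every byte value appears equally often

print(adaptive_code_length_bits(predictable) / len(predictable), "bits per byte")
print(adaptive_code_length_bits(varied) / len(varied), "bits per byte")
```

The better the model's guesses, the fewer bits per byte are needed, so a good predictor can be turned into a good compressor and the compressed size measures how well the predictor did.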
@qanon4realvsqanon4gery70 1 year ago
An image where every pixel is set to a random RGB value is not compressible. Very hand-wavily, the way PNG compresses an image is by predicting that there are a lot of continuous patches of the same color, i.e. predicting that the next pixel is the same as the previous one, and recording where that prediction breaks. (This is not an accurate description of the real PNG format.)
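The claim about random versus patchy images is easy to check with a general-purpose compressor. This is a rough sketch using Python's `zlib` (DEFLATE) on raw bytes, not a PNG encoder; the two "images" are hypothetical flat byte buffers made up for the demonstration.

```python
import random
import zlib

n_pixels = 64 * 64
rng = random.Random(0)  # fixed seed so the example is reproducible

# "Image" 1: every pixel is an independent pseudo-random RGB value.
random_image = bytes(rng.getrandbits(8) for _ in range(3 * n_pixels))

# "Image" 2: two large flat patches, one red and one blue.
patchy_image = b"\xff\x00\x00" * (n_pixels // 2) + b"\x00\x00\xff" * (n_pixels // 2)

# Compressed size divided by original size; values near 1.0 mean no saving.
random_ratio = len(zlib.compress(random_image, 9)) / len(random_image)
patchy_ratio = len(zlib.compress(patchy_image, 9)) / len(patchy_image)

print(f"random pixels: {random_ratio:.3f}")
print(f"flat patches:  {patchy_ratio:.3f}")
```

The random buffer stays at roughly its original size, while the patchy one shrinks to a small fraction, because "the next pixel looks like the previous ones" is a prediction that only holds for the second buffer.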
@plafar7887 1 year ago
The predictor, in that case, is the unzip utility. It predicts the original from the compressed.
@DM-fw5su 1 year ago
@qanon4realvsqanon4gery70 It is compressible; you are correct that it may not result in a space saving. That doesn't disqualify the storage encoding algorithm from being called a compressor. If you had to pick one algorithm for everything, what matters is what works better after being given all the information of human knowledge. Humans do not naturally use random concepts to describe disparate things; we naturally have few concepts to describe a great many things from a large number of different fields of knowledge and a great many things yet to be invented/understood about the universe. The human brain is naturally a lossy compressor; maybe absolute errors (momentary error in recall and correction) are the cause of eureka moments.
@specifictoken 1 year ago
My favorite bald guy!!
@JCResDoc94 1 year ago
31:10 GPT models _JC
@shinkurt 1 year ago
It makes sense
@brianewing1428 1 year ago
Vector comes in, vector goes out. You can't explain that!
@aburaziel 1 year ago
Ilya is so good that I have been forced to buy $MSFT to secure my UBI after my white collar job gets replaced by AI:)
@IAjayMukhiya 1 year ago
👍
@shuminghu 1 year ago
K(X, Y) = K(X) + K(Y|X) + O(log K(X, Y)). Why does unsupervised learning on X, i.e., learning K(X), help learning K(Y|X)? It's not obvious from this formula that C(X), a good compressor for X, helps compress Y|X.
@cennywenner516
@cennywenner516 Жыл бұрын
The formula is true, but it does not imply that learning X *necessarily* helps to predict Y, for example if Y is entirely unrelated. He was talking about this to theorize what unsupervised learning could mean. One way to consider learning (X,Y) is to find the best compression of it (roughly: can we describe the distribution with some simpler set of fundamental variables?). Then he goes on to say that finding the best way to compress (X,Y) is basically the same as compressing X and being able to predict Y given X. So there is a formalism that justifies connecting unsupervised learning back to supervised learning. K(X,Y) = K(X) + K(Y|X) + .. says something interesting. It is not a priori obvious that a shortest program to get both X and Y should be almost as long as a shortest program to get X plus a shortest program to get Y given X. Like, how do we know that the first program is even useful for Y? These are just comparing sizes, though, so the programs are not necessarily composed of each other. The easiest way to see that this relation holds is probably to consider the universal distribution: pick a random program and run it to generate an output. Then clearly prob(X, Y) = prob(X) * prob(Y | X). And the probability of outputting X is roughly the same as that of sampling the shortest program for it, prob(X) ~= 2^-K(X), so 2^-K(X,Y) ~= 2^-(K(X) + K(Y|X)). That these complexities are linked like this is interesting, but he is likely interested in the opposite direction from the one you mentioned: namely, that finding a compressor that can iteratively make good predictions for Y|X implies that you have a good compressor for (X,Y). That is, by doing self-supervised learning, you are doing unsupervised learning, and approximately the most fundamental form of it. We kind of use the terms unsupervised and self-supervised interchangeably today, but it is not obvious that one should; e.g. is BERT's masking fundamentally learning more than next-token prediction?
Then, after we have done the self-supervised ~ unsupervised learning, it is interesting to apply it to predictions outside the domain, i.e. how does the self-supervised pretraining help? That's where the other direction you mentioned comes in, and where I personally think there is some weakness (Y is another distribution). If the connection holds, we should be able to learn an unsupervised representation (compression) C(X) of X (model weights + instance representation) and then just predict the class Y from C(X), which is what the next-pixel experiment was about. I am not sure if this, as is, establishes any stronger connection on the prediction side (if so, I missed it), just that, inspired by this, one might think that most of the time, for sufficient data, a compression C(X) rather than X should be all one needs to roughly predict Y. That should make it easier to learn Y, e.g. when X is high-dimensional, such as for vision and text. There are some other things I think one would want to say but which I think do not clearly follow yet. Perhaps the most interesting case, and the one where the connection is strongest, is when the training does try to make the pre-training part of X and Y closer in distribution at some point, for which I bet OpenAI has plenty of data; it might even be used as an argument for training on outputs? It could also be interesting for RL/multi-step dialogs.
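The size relation discussed in this comment can be poked at with an ordinary compressor. The following is only a rough sketch: zlib is a very crude, computable stand-in for Kolmogorov complexity (which is uncomputable), and the helper names `c`, `c_given`, and `noise` are made up for illustration. It approximates K(Y|X) as the extra compressed bytes needed for Y after X, and shows that this cost is small when Y is determined by X and large when Y is unrelated.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes: a crude, computable proxy for K(data)."""
    return len(zlib.compress(data, 9))


def c_given(y: bytes, x: bytes) -> int:
    """Proxy for K(Y|X): extra compressed bytes needed for Y once X is known."""
    return c(x + y) - c(x)


def noise(seed: int, n: int = 2000) -> bytes:
    """Deterministic pseudo-random bytes that zlib can barely shrink."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


x = noise(0)
y_related = x            # Y is fully determined by X
y_unrelated = noise(1)   # Y shares no structure with X

print("C(Y) alone:         ", c(y_related))
print("C(Y|X), related Y:  ", c_given(y_related, x))
print("C(Y|X), unrelated Y:", c_given(y_unrelated, x))
```

By construction `c(x) + c_given(y, x)` equals `c(x + y)`, which mirrors the shape of K(X,Y) = K(X) + K(Y|X) + ..; the interesting part is how much `c_given` drops below `c(y)` when X actually carries information about Y.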
@digzrow8745
@digzrow8745 8 ай бұрын
I miss him
@xyh6552
@xyh6552 Жыл бұрын
$K(Y \mid X)
@jimlbeaver
@jimlbeaver Жыл бұрын
We need more people that are at least half as smart as him
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
Why exactly is this "unsupervised"? Yes, I'm focusing on the semantics because it throws me. Unsupervised is just P(Y|X) where X is the null feature set, no? He then goes on about learning a transformation of random variables F s.t. F(X) \sim Y ... but this is then directly supervised learning. Without a rigorous definition I find the terms ... empty. The compression angle is interesting but is its own thing.
@michelb9044
@michelb9044 Жыл бұрын
@@syphiliticpangloss according to this definition, unsupervised matches a distribution instead of finding a correspondence between individual points (that is why datasets X and Y are not paired, contrary to supervised learning)
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
@@michelb9044 supervised learning can also find correspondence between distributions. I think there is something of value in getting precise here in the abstract. Likely it is done somewhere. "points" could be anything really. And then we get into permutation invariance etc. I suspect the important point here (for the bound) is the way the VC dimension is defined but that is just a guess. Haven't spent time on it at all.
@michelb9044
@michelb9044 Жыл бұрын
@@syphiliticpangloss yes supervised learning can do it too, but it has more information available because the samples ("points") are paired, in form (x_k, y_k), contrary to unsupervised. Unsupervised is not necessarily doing something that supervised cannot do, it is doing something with less information available. Btw, that definition of unsupervised fits the setting of standard generative model: mapping a predefined latent distribution (say Gaussian), to a data distribution (say an image dataset).
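The unpaired setting described in this reply can be sketched in a few lines. This is a toy illustration under strong assumptions, not what the talk does: both datasets are one-dimensional Gaussians, F is restricted to an affine map, and "matching the distribution" is reduced to matching mean and standard deviation. The names `fit_affine`, `mapped`, and `flipped` are hypothetical.

```python
import random
import statistics

rng = random.Random(0)

# Unpaired samples: there is no (x_k, y_k) correspondence between the two sets.
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # source / latent distribution
ys = [rng.gauss(5.0, 2.0) for _ in range(5000)]   # target / data distribution


def fit_affine(xs, ys):
    """Pick F(x) = a*x + b so that F(X) has the same mean and std as Y."""
    a = statistics.stdev(ys) / statistics.stdev(xs)
    b = statistics.fmean(ys) - a * statistics.fmean(xs)
    return a, b


a, b = fit_affine(xs, ys)
mapped = [a * x + b for x in xs]

# A mirrored map matches the same two moments equally well.
b_flip = statistics.fmean(ys) + a * statistics.fmean(xs)
flipped = [-a * x + b_flip for x in xs]

print(statistics.fmean(mapped), statistics.stdev(mapped))
print(statistics.fmean(flipped), statistics.stdev(flipped))
print(statistics.fmean(ys), statistics.stdev(ys))
```

Both maps reproduce the target's mean and spread without ever seeing a paired example, which is the "less information available" point. The fact that the mirrored map fits just as well also shows that, in this toy case, matching a distribution does not by itself pin down a unique F.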
@krox477
@krox477 Жыл бұрын
This guy has postdoc in CS he's been studying this stuff whole life
@max0x7ba
@max0x7ba Жыл бұрын
K(C)
@karigucio
@karigucio Жыл бұрын
right, just no way of knowing whether some Ki outputs C
@max0x7ba
@max0x7ba Жыл бұрын
@@karigucio We are only interested in the length of C, K is just an artefact to get there, is it not?
@karigucio
@karigucio Жыл бұрын
​@@max0x7ba i was referring to the fact that in the procedure of actually searching all programs shorter than C, you need a way of checking whether a given program outputs C. And this is undecidable, so the whole procedure is undecidable eventhough the search space is finite as you said. but i didn't quite get the second remark so excuse me if we're on different wavelengths here
@adtiamzon3663
@adtiamzon3663 Жыл бұрын
What a beard, Ilya⁉️ Suits yahh! So what am I going to learn this time❓️😁😇🌹
@lukas-santopuglisi668
@lukas-santopuglisi668 Жыл бұрын
thx so much for uploading from Germany!
@wrathofgrothendieck
@wrathofgrothendieck Жыл бұрын
Ilya Sutskever da god
@SchoolofAI
@SchoolofAI Жыл бұрын
What are 100 videos that are MUST-WATCH for AI enthusiasts in 2023?
@qanon4realvsqanon4gery70
@qanon4realvsqanon4gery70 Жыл бұрын
If you are gonna watch 100 videos I think just go with recorded university lectures on the topic
@Morimea
@Morimea Жыл бұрын
KZbin search: 1. MIT Artificial Intelligence, Patrick Winston 2. Andrej Karpathy Let's build GPT: from scratch. Idk what you mean by "2023", but general knowledge/understanding of how it works is most important, and the lectures I mention give you a full overview
@SchoolofAI
@SchoolofAI Жыл бұрын
@@Morimea Thank you
@eskelCz
@eskelCz Жыл бұрын
The compression/prediction could be more simply thought of as pattern recognition
@alonsomartinez9588
@alonsomartinez9588 Жыл бұрын
Reminds me of VR concerts
@wege8409
@wege8409 Жыл бұрын
In what way? Just the general atmosphere of the presentation?
@markpfeffer7487
@markpfeffer7487 Жыл бұрын
S tier brain. D tier PowerPoint aesthetics.
@Jonathan-k3r8r
@Jonathan-k3r8r Жыл бұрын
Everyone knows that as you get more baller you don't have to waste time on fancy slides
@DayB89
@DayB89 Жыл бұрын
Just writing my thoughts here only 14 mins in, but I don't understand how that explanation would make it clear that distribution matching works. I mean, the transformation function could be any... even one that matches the distribution but makes no sense. It would be like me doing an attempt of word-by-word translation of English to French just attending at word frequency, what could go wrong =)
@GenzaiHonkaku
@GenzaiHonkaku Жыл бұрын
The assumption in your example could be that all specific samples of people communicating in some language would fit some underlying universal distribution of human communication. As long as the same idea can be expressed equivalently across different languages, you could in theory determine whether 2 examples of language are expressing the same idea by modelling the probability of a certain idea being expressed. This relies on the assumption that we as humans roughly communicate in the same ways across all languages. For the most part, I feel like this is an accurate assumption to make. However, there are differences in culture which define different distributions of what is talked about in the languages used by those cultures. Which is where bias comes in. Different datasets are biased towards different distributions.
@GenzaiHonkaku
@GenzaiHonkaku Жыл бұрын
But I think you are correct in that matching the distributions on a word-by-word basis would result in poor translation accuracy. I think that which tokens you are trying to match need to contain all of the relevant context and information you need to capture. If I recall correctly, I think GPT encodes paragraphs of vector-encoded words to vector representations.
@DayB89
@DayB89 Жыл бұрын
@@GenzaiHonkaku And even if that's the case, even if you are matching full-context distributions... isn't it possible that the mapping function F does a transformation that doesn't relate to meaning? My point is that I'm having a hard time telling apart this assumption from mere wishful thinking. In other less-technical terms, I can make a bread fit into a battery socket without the bread becoming a battery.
@GenzaiHonkaku
@GenzaiHonkaku Жыл бұрын
@@DayB89 I would say a more apt analogy would be trying to force a loaf of bread into one of those kids' 'put the shape in the shaped hole' toys. The important thing about the socket is the shape of the thing being put into it. But it's true that you would still end up with a squished loaf of bread that doesn't look like the shaped hole it was forced into. If you can agree that this is a *fitting* analogy, then you might also agree that it's heading into philosophical territory. If something looks like a rose, and smells like a rose, but doesn't taste like a rose, has it still achieved everything it needs to be considered a rose?
@DayB89
@DayB89 Жыл бұрын
@@GenzaiHonkaku Well, your answer made me realize that I just picked the wrong analogy. I'll get back to you once I find the right one.
@jonatan8392
@jonatan8392 Жыл бұрын
I think your intro about supervised learning is a bit misleading. Yes, it is true theoretically that test error ~ training error if the model complexity is small, for instance in terms of VC dimension. However, this VC theory cannot explain the success of deep learning. In practice, large models with billions of parameters are used and trained to nearly 0 training error, and STILL the test error is small. Applying the VC-dimension argument to these models doesn't give you anything, because the observed test error is orders of magnitude smaller than the test error predicted by VC theory.
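To make the "doesn't give you anything" point concrete, here is a small numerical sketch. It assumes one classical textbook form of the VC generalization bound, gap <= sqrt((d(ln(2n/d) + 1) + ln(4/delta)) / n), and, purely for illustration, treats the VC dimension d as being on the order of the parameter count, which is a simplification rather than a statement about any real network. The function name `vc_gap` and the example sizes are hypothetical.

```python
import math


def vc_gap(d: int, n: int, delta: float = 0.05) -> float:
    """Upper bound on (test error - training error) from a classical VC bound.

    d: VC dimension, n: number of training samples, delta: failure probability.
    The result is capped at 1.0, since a bound above 1 on an error rate is vacuous.
    """
    if n <= d:
        # With no more samples than the VC dimension the bound says nothing.
        return 1.0
    gap = math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)
    return min(1.0, gap)


print(vc_gap(d=10, n=1_000_000))    # small hypothesis class, lots of data
print(vc_gap(d=10**9, n=10**6))     # assumed billion-scale capacity, fewer samples
```

In the first case the bound is tight enough to be informative; in the second it saturates at 1.0, which is the sense in which the argument cannot account for a small observed test error.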
@cennywenner516
@cennywenner516 Жыл бұрын
Really exciting to hear that OpenAI may be trying to realize or find inspiration in the impractical but seemingly profound learning theory! My own naive thought, though, on the Kolmogorov analogy is that the challenge is not just in the program search but in the offset caused by the specific computational model (UTM), aka inductive bias. For lots of data, the latter term goes to 0, while for little data, it is the program search that becomes easy but generalizes poorly. For lots of applications (RLHF, downstream), it does not seem that it is sufficient to merely encode it as the same unsupervised stream in the LLM, perhaps because it is a distribution shift (or a narrow subset), even though unsupervised/self-supervised training is usually needed in some form along with the fine-tuning. One way to potentially think about this is that instead of the new task being conditional prediction, or conditional compression of the following sequence, the pre-training conditions the underdetermined program space/inductive bias, which enables good generalization even with little data. Re order of data: I think this just comes down to the old debate on whether overparametrized deep networks eventually converge to a global optimum, which is not seen in practice for LLMs due to early stopping of overparametrized models for training efficiency; and finally, for the connection, one should consider a differentiable machine representation. This does not seem obviously inconsistent with an analogous compute-bounded program search. Maybe I misunderstood what was meant by linear representations, but I think there is something like linear representations being preferable given any kind of regularization, i.e., any inductive bias beyond uniform, which is expected with a universal distribution.
@jabowery
@jabowery Жыл бұрын
The idea that the open parameter of UTM choice renders Kolmogorov Complexity ill defined for most practical purposes is pedantic obstructionism. I suspect it is really just an excuse to not think about what exactly is sacrificed by using less principled loss functions than the size of an executable archive. It's much easier to just calculate some approximate loss function like MSE or whatever -- perhaps adorned with some regularization to reduce parameter count -- than it is to think about what has been sacrificed. It is worth thinking about that question and doing so in depth because it gets to the very heart and soul of what people are running around in panic regarding "algorithmic bias". This is particularly egregious when they don the robes of scientific ethics after having abjured the only principled foundation for their ethics.
@hanskraut2018
@hanskraut2018 Жыл бұрын
Random comment: I have to admit i like more extensive comments, i did not even necessarily agree or disagree - i like where it seems to be coming from and the format. ^^
@cennywenner516
@cennywenner516 Жыл бұрын
@@jabowery - The principled part is where you insert assumptions and domain knowledge. Inductive bias and algorithmic bias are not related concepts. Inductive bias is what makes learning possible at all: it is why things that happened in the past are more likely to occur again in the future, e.g. rather than remaining 'entirely random'.
@jabowery
@jabowery Жыл бұрын
@@cennywenner516 Bias minimization in model selection is the ethical foundation I'm addressing. Statistical information criteria for model selection are all over the map precisely because statistics are based on Shannon Information rather than Algorithmic Information. If your model selection criterion is unprincipled, you are in even worse shape dealing with the all-too-human tendency for self-deception that the scientific method is intended to address. An otherwise principled approach to algorithmic bias that skips this step elides science. It has to get this right before it can address other biases for the simple reason that self-deception begins with conflation of Hume's "is" vs "ought".
@NerdFuture
@NerdFuture Жыл бұрын
You say "the Kolmogorov analogy," and go on to guess what it's about. But it's *not* an analogy, it's just a limit, used in the way Sutskever does to define mutual information, on his way to saying that taking advantage of mutual information is a good indication of erm, having learned something useful.
@rv706
@rv706 Жыл бұрын
"A particular observation on generalization" would've been a cooler title 🙃
@KemalCetinkaya-i3q
@KemalCetinkaya-i3q Жыл бұрын
I got it
@ruoshiliu6024
@ruoshiliu6024 Жыл бұрын
you did
@hola-kx1gn
@hola-kx1gn Жыл бұрын
you did, didn't you
@brookshamilton1
@brookshamilton1 Жыл бұрын
Did you?
@KemalCetinkaya-i3q
@KemalCetinkaya-i3q 9 ай бұрын
i grokked it real hard@@brookshamilton1
@AnirudhAjith
@AnirudhAjith Жыл бұрын
42:50 Is that Scott Aaronson?
@consumidorbrasileiro222
@consumidorbrasileiro222 Жыл бұрын
yes
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
Why is he calling "next pixel prediction" an UNSUPERVISED task?
@qanon4realvsqanon4gery70
@qanon4realvsqanon4gery70 Жыл бұрын
The next pixel is part of the data, not an explicit label
@Draganel87
@Draganel87 Жыл бұрын
Because there are no labels provided. They just feed the data and the model understand the underlying generalities.
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
Supervised just means P(Y|X). There is no meaning behind "label". You label it in a sense by giving honour to the "order" of the pixels. This is what I mean. The whole AI tech bro world fails to be specific about meaning and everyone spins around thinking different things. Let's nail down a *precise* meaning of this stuff. The compression results at the end are cool but it seems unnecessary to start with this supervised/unsupervised question. It sounds like all of these problems, we can probably agree, are to do with estimating distributions P(Y|X) where X *might* be the empty set. (I'm leaving \theta out.) Putting this in the realm of probability is already quite restricting, as generalization might matter for pure optimization settings (compression). But it sounds like what people usually mean. And I am saying even learning distributions is still within this framework, but you might have to remap things a bit. I dunno. But P(Y|X, \theta) where X is the empty set sounds like it covers unsupervised learning. @@qanon4realvsqanon4gery70
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
So my guess is that when you look at the distributional learning problem with X being the empty set, those generalization bounds become quite weak.
@Alejandro388
@Alejandro388 Жыл бұрын
he is high on what?
@isaacandrewdixon
@isaacandrewdixon Жыл бұрын
high on life
@DavenH
@DavenH Жыл бұрын
intellect
@dotnet364
@dotnet364 Жыл бұрын
OpenAI is now worth 90B. The employees have sold their shares to investors. They have their roi after 7 yrs.
@xyh6552
@xyh6552 Жыл бұрын
The linear property itself is a compression, this is a trivial observation that can be obtained by studying differential manifolds
@winsomehax
@winsomehax 7 ай бұрын
This was very shortly before the OpenAI crisis when he tried to get altman fired.
@RickySupriyadi
@RickySupriyadi 6 ай бұрын
that is my friend, has something to do with national security issue which must be done, it is has to be done so in that period of time something can get secured. it's done and succeed, well sam going back isn't bad also, openAI now got a military general for their cybersecurity. well um... this kind of issue will not get any simpler it's will getting more complicated and might out of my reach for a solo...
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
The guy at the end gets at the point that the semantics are scrambled and these guys are confused about semantics and the expression of the problems. Obviously very smart but bad at writing about this stuff/expressing it.
@GraczPierwszy
@GraczPierwszy Жыл бұрын
well well, we'll see what it all comes down to blah blah blah so far we will see the effects in the real world, not on the board
@plafar7887
@plafar7887 Жыл бұрын
GPT models are blah blah blah?🤔
@syphiliticpangloss
@syphiliticpangloss Жыл бұрын
Why can't you just pose supervised learning as learning P(Y|X) where X = null (i.e. the null feature choice)? VC(X) = 0 or whatever it is. I find the lack of rigorous definitions here alarming.
@plafar7887
@plafar7887 Жыл бұрын
It's not alarming, it's simply difficult to turn this reasoning into rigorous arguments, for the time being. It's hard enough to even express it intuitively. A lot more research is needed.
@kumarkartikay
@kumarkartikay Жыл бұрын
It seems Ilya has solved (what George Hotz is calling) "entropics" in one lecture just like Shannon solved Information theory in one paper. SGD on Neural Nets = Program Search = Compression = Intelligence!
@Alex-fh4my
@Alex-fh4my Жыл бұрын
I dont know about solved. Hotz is aware of the notion that prediction = compression = intelligence. He mentioned some blog post on stream about this. Regardless, "compression = intelligence" still doesnt answer the question of "How much intelligence does it take to solve Fermat's last theorem".
@edh615
@edh615 Жыл бұрын
it's not solved, and for George's ideas I'm not sure he even knows what he means.
@brianewing1428
@brianewing1428 Жыл бұрын
Say we solve prediction. How do you do intention?
@Alex-fh4my
@Alex-fh4my Жыл бұрын
@@brianewing1428 solve prediction?
@brianewing1428
@brianewing1428 Жыл бұрын
@@Alex-fh4my yes, say we 'solve' that. we have a crazily general, Oracle token predictor. you give it a computational hippocampus, and long term storage. how do you implement the task loop?
@kenmogibrainworld4844
@kenmogibrainworld4844 Жыл бұрын
In the phenomenology of consciousness qualia are great compressors.
@surkewrasoul4711
@surkewrasoul4711 Жыл бұрын
I never thought I would give it up this easily but I must confess that this AI thing got it all out of me so easily in one shot, kind of embarrassing but hey, it takes experience. Also, as Mr Bean so wisely said once, it's with age that comes wisdom🤣. The most difficult task would perhaps be teaching it honesty and quite possibly not to ruin other people's coffee mugs with misfires and things like that 🤣. I say we teach it that famous Abraham Lincoln quote: You may fool half of the people half of the time and half of the time half of the people, but not all the people all the time. Great Talk, Enjoy your days Guys 🤙✌
@kob8634
@kob8634 22 күн бұрын
Sycophantic crowd _trying_ to laugh is about the worst sound a group of people can make. He's smart, he's not a god, stfu and listen!
@Sarmadpervez186
@Sarmadpervez186 Жыл бұрын
ClosedAI
@jameelbarnes3458
@jameelbarnes3458 Жыл бұрын
Binary tokens.
@RoboticusMusic
@RoboticusMusic Жыл бұрын
Huh, just explain normal
@krultorwaru121
@krultorwaru121 Жыл бұрын
How is this guy regarded one of the best experts in AI is beyond me. His level of intuitive understanding of neural networks is depressing.
@krultorwaru121
@krultorwaru121 Жыл бұрын
@@WMD911 actually it’s all what he said and everything he didn’t say. For example his main point - comparison to Kolmogorov complexity - is rather wishful thinking than anything of substance. SGD on weights of neural net is very particular program search of very particular programs that need not result in “the shortest program”, whatever that means in the context of NN.
@WalterSamuels
@WalterSamuels Жыл бұрын
Why is your company called OpenAI if you're the opposite of open? OpenAI is supposedly the "leader" of the AI space, yet it's the least generous out of all AI companies. Meta is constantly releasing new research, tools, open-source software and the rest, while OpenAI is busy playing politics and trying to squash competition. It's really quite sad and disappointing. No wonder you had such a large exodus of talent. Hopefully you'll change your ways.
@JD-jl4yy
@JD-jl4yy Жыл бұрын
They're being more responsible than Meta. They're thinking ahead about how we can prevent AI from derailing society. Just releasing everything with 0 safety concerns like Meta is going to get really, really dangerous one day.
@bananabreeding1362
@bananabreeding1362 Жыл бұрын
Post-Vaswani generative pretrained transformers have opened a vast and potentially civilisation-changing (or ending!) opportunity for cognitive/linguistic AI and much more. We are witnessing the onset of a huge adaptive radiation that will involve virtually all of AI technology. Some firms will be Open. Some will be tightly closed. Some offerings will be totally free, even facilitating individual humans to develop intensely custom versions of standalone single-user products. This is just the beginning.
@WalterSamuels
@WalterSamuels Жыл бұрын
Doesn't really have anything to do with my point though. If anything it amplifies it, Google gifted them transformers and they still don't see the value in sharing. @@bananabreeding1362
@primersegundo3788
@primersegundo3788 9 ай бұрын
why?..... he has already answered that question many times, just check any interview.
@kenwolf887
@kenwolf887 Жыл бұрын
It's all about entropy and complexity.
@DavenH
@DavenH Жыл бұрын
It's all about fate and destiny.
@TPQ1980
@TPQ1980 Жыл бұрын
Within the first minute it is admitted that "OpenAI" is not open. Open development doesn't have aspects that developers "can't talk about." By 20 minutes in, the guy in the video has said thousands of words without saying almost anything at all; it's like he's winging it, or doesn't really understand his subject. The guy in the video even admits machine learning is not particularly difficult to understand, yet he does a terrible job of conveying understanding. It's almost like he's more interested in trying to convey the impression he's really intelligent, rather than trying to convey meaning and understanding. I suppose it's possible that English is his second language, or perhaps he's just not that experienced at public speaking. Maybe he's one of those people who's really good with mathematics, but terrible with language? I just get the impression from this video that he's something of a charlatan.
@FreakyStyleytobby
@FreakyStyleytobby Жыл бұрын
You're very wrong, I learned a lot from this lecture. In this lecture Sutskever explains how to understand ML in mathematical terms. That's the whole problem he proposed. So how the hell do you expect him not to use mathematics in this description? It's not popular science
@wege8409
@wege8409 Жыл бұрын
Ilya was involved in the creation of AlexNet in 2012. A lot of people point to AlexNet as the thing that rekindled interest in AI; many academics thought neural nets were hopeless before AlexNet. He was also involved in the creation of AlphaGo. Go was a really hard problem for a long time because of the explosion of possible outcomes. Elon says hiring Ilya was the tipping point for OpenAI's success and honestly I think I believe him. This guy is definitely the real deal. If you want a charlatan, look at Sam Altman peddling WorldCoin.
@wege8409
@wege8409 Жыл бұрын
Also it's no secret that OpenAI isn't open, they talk a lot about their decision to switch from non-profit, basically you need bottomless computation to build state of the art models, and for bottomless computation you need bottomless money. If you look up "OpenAI Panel" it explains a lot about how they operate.
@JohnDoe-nv2op
@JohnDoe-nv2op Жыл бұрын
All on top of the "SGD miracle". When we realise that SGD is just crap, all this DL horseshit will fall into oblivion.
@k.8597
@k.8597 Жыл бұрын
could you elaborate on what you mean? I know its been a month since this comment so this might be a shot in the dark, but I'm wondering why exactly is SGD bad? is it a heuristic for the optimal weight values that they came up with that works well enough to not question its use? im in undergrad just learning abt this stuff so pardon my ignorance
@JohnDoe-nv2op
@JohnDoe-nv2op Жыл бұрын
@@k.8597 SGD is incompatible with compositional learning. In other words, you need to readjust (potentially) *all* the weights in the net in each backward pass. In a system with continuous learning, this produces catastrophic forgetting. ANNs need compositional learning to reach "common sense". The SGD approach to learning, in my view, is unable to do that. We need some learning algorithm capable of performing continuous learning before moving forward to greater challenges.
@k.8597
@k.8597 Жыл бұрын
@@JohnDoe-nv2op ah so would that mean finding a way to be able to select the weights that lead to the greatest amount of memory being retained, and then doing backprop on the ones that don't?
@JohnDoe-nv2op
@JohnDoe-nv2op Жыл бұрын
@@k.8597 I think so. The question is how
@wege8409
@wege8409 Жыл бұрын
I don't think the catastrophic forgetting comes from SGD. I think it's the architecture, or maybe it's the learning scheduler, or a combination of the two.