GPT-3: Language Models are Few-Shot Learners (Paper Explained)

  213,782 views

Yannic Kilcher

A day ago

Comments: 465
@eternalsecretforgettingfor8525 4 years ago
OUTLINE:
0:00 - Intro & Overview
1:20 - Language Models
2:45 - Language Modeling Datasets
3:20 - Model Size
5:35 - Transformer Models
7:25 - Fine Tuning
10:15 - In-Context Learning
17:15 - Start of Experimental Results
19:10 - Question Answering
23:10 - What I think is happening
28:50 - Translation
31:30 - Winograd Schemas
33:00 - Commonsense Reasoning
37:00 - Reading Comprehension
37:30 - SuperGLUE
40:40 - NLI
41:40 - Arithmetic Expressions
48:30 - Word Unscrambling
50:30 - SAT Analogies
52:10 - News Article Generation
58:10 - Made-up Words
1:01:10 - Training Set Contamination
1:03:10 - Task Examples
@larrybird3729 4 years ago
Imagine telling Alan Turing we created a 5.7 trillion bit program to answer "what is one plus one?" lol
@Lumegrin 4 years ago
Hey, maybe it could create numbers
@jorgehenriquesoares7880 4 years ago
He would be amazed
@YYJConstructionUpdates 4 years ago
you can build an AND gate with a few transistors or you can use a 5.7 trillion bit program to infer the meaning of the word "and"
@LukePluto 4 years ago
it took ~370 pages of symbolic logic to show 1 + 1 = 2 in Principia Mathematica
@bofur2727 4 years ago
Yep, the program itself is not unimaginable; what is, is that these comments exist in this medium, without his "what is one plus one" ;)
@that_guy4690 4 years ago
Watching videos about large language models really makes me ask myself: what is "human" reasoning, really? And how do humans learn stuff? Great point on the arithmetic operations!
@derasor 2 years ago
So true. We don't really know how 'reasoning' actually works in the brain, so saying this system is not capable of reasoning has no solid footing IMO. Also, you could argue that other beings with biological neural systems are capable of 'reason'.
@georgestheys2752 A year ago
That is the essential question... if reasoning is the ability of the brain to come up with the next word, based on a word sequence, that is most likely to satisfy the recipient of the reasoning... GPTx will get there... I suspect (hope?) there is more to it than that.
@unwind_ai 2 years ago
If you would like to know more about GPT-3 and the burgeoning AI ecosystem powered by LLMs, check out my upcoming O'Reilly book, "GPT-3: Building Innovative NLP Products using LLMs", here: learning.oreilly.com/library/view/gpt-3/9781098113612/ To get started with GPT-3 in three simple steps, check out www.gpt3.club Connect with me 👉 Twitter: twitter.com/Saboo_Shubham_ LinkedIn: www.linkedin.com/in/shubhamsaboo/
@Tondadrd 4 years ago
Wow, an OUTLINE, I didn't know that was possible on KZbin :o thx
@funkylosik 4 years ago
Cool, me neither. Just enter "0:00 xxx" anywhere in the description to start (the 0:00 is important) and mark your timestamps, each with a description on a new line.
@Tondadrd 4 years ago
@@funkylosik I found no documentation, so I had to figure all of that out. Also, every time mentioned in the description must be in ascending order, there may be no duplicate times, and there can be at most 51 timestamps. Break any of these rules and it will simply not show.
@mrpoopo2320 3 years ago
Have you guys never listened to a full album on KZbin? Or a compilation? Or a Vlog Creation? Maybe I can sound as silly to you as that comment does to me. Isn't this a table of contents, not an outline?
@DeveloperDesmond 4 years ago
This is such a fun format for an educational video! And with a huge backlog of videos that look worth checking out, there's so much to learn. Excited for this channel!
@KivySchool 4 years ago
Explaining papers? That's awesome. Subscribed instantly. Thanks for your effort and please continue to do so.
@catharinecox6743 2 years ago
I am studying linguistics at uni and I'm writing my dissertation on whether humans can distinguish human language from GPT-3-generated language. I am extending the findings of this paper by investigating the use of GPT-3 in social media, news, and email contexts, using a large Turing-style survey whereby people are required to pick out the AI response from the human one. I will apply the findings to potential phishing, fake news, and ethical implications. I study linguistics, not computer science, so I found this video extremely useful! Thank you for a great explanation.
@StoutProper A year ago
I studied linguistics. You should think about following it up by training/programming a language model to use a specific idiolect, including particular fillers, qualifiers, etc., to see if you can make it emulate human speech patterns, in particular those of one specific person.
@terjeoseberg990 A year ago
@@StoutProper, I believe that's the issue. These language models are emulating the average person when coming up with text, but there is no average person, so what they produce will seem odd compared to what's produced by an individual. Training one to emulate a particular person might solve this and make it indistinguishable from a human.
@StoutProper A year ago
@@terjeoseberg990 This could easily be done by fine-tuning a specific model trained on your inputs. A few weeks' worth of audio and text data from your phone would probably be enough. It won't be long before we can carry our own personal sparse AI models around with us on our phones, and I can envisage a world where the AI first drafts replies to all your messages and you just need to approve them. Eventually you'll trust it so much you'll let it reply to certain people without your specific approval, once you've given it directions on the approach to take. It'll get so good at emulating your voice that you'll be able to get it to take your calls and answer as if it's you, while you listen, ready to take over if necessary, like a self-driving car. Eventually you'll just have AIs talking and texting each other while their people do something else 😂
@StoutProper A year ago
@Catherine Cox Catherine, how did your dissertation go? Are you interested in applying your work in a commercial setting for an AI company?
@terjeoseberg990 A year ago
@@StoutProper, LOL
@lgoose7604 4 years ago
Great video. Your explanation made clear to me the distinction between memorizing and reasoning, just like the two ways students study for tests. If the test consists mostly of problems encountered before, the students who memorize will likely perform better than the ones who reason. And as you pointed out, when one has memorized the internet, there isn't a lot one hasn't seen.
@bluel1ng 4 years ago
Yannic, great presentation as always! But I think the power of transformer models is to "discover" structural similarities (frequent repeating structures). Many of these "rules" are not learned for exact input sequences but for sequences or co-occurrences of sets or classes of input symbols. This is IMO different from exact "regex-like" recall which would not tolerate different query representations. I think the embeddings on all layer-outputs are some form of thought- or summary-vectors that capture the gist of the context up to the current token. Attention can be seen as key-value store but I prefer to think of it as a soft read-memory and transform operation. The computational capabilities of transformer models are inherently limited by the number of feed-forward and attention steps but it has been shown with smaller models that this is enough for simple arithmetic operations which generalize outside numbers that were presented during training etc. While it is still not AGI I must personally say that I am again and again impressed by the "world-model" / knowledge-base that is generated via a "stupid" next or masked token prediction objective... ;-)
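A minimal NumPy sketch of the "soft read-memory" view of attention described above (illustrative only, not GPT-3's actual implementation): the query does not need to match any key exactly; it reads a similarity-weighted blend of all the values.

import numpy as np

def soft_read(query, keys, values):
    scores = keys @ query / np.sqrt(query.size)  # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax: a soft, fuzzy "address"
    return weights @ values                      # a blended read, not exact recall

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))
values = rng.normal(size=(5, 4))
query = keys[2] + 0.1 * rng.normal(size=8)       # a slightly "off" query...
print(soft_read(query, keys, values))            # ...still reads mostly values[2]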
@YannicKilcher 4 years ago
Yes, I agree. When I say lookup or regex, I mean this in a sort-of fuzzy way (in a deep learning way, I guess). Pretty much what you're describing. But still different from reasoning abilities.
@GuagoFruit 2 years ago
@@mcs699 I think you need to understand humans better before being able to say gpt-3 is truly "reasoning" as a human does. Obviously you're very pro AI reasoning, but reducing human reasoning down to the level of what AI is at now is severely underselling human capacity.
@AlexBravo 4 years ago
"T6 - 1 Trillion Text-To-Text Transfer Transformer" - the next model coming out of Google
@xSNYPSx 4 years ago
When?
@TechVizTheDataScienceGuy 4 years ago
Is it? 😅
@shadygamererfan377 4 years ago
Our brain has a trillion synapses, and GPT-3 has 175 billion trainable parameters. We are nowhere close to AGI, but soon we will be.
@gargeyasharma4858 4 years ago
thanks for the heads up.
@fargh 4 years ago
@@shadygamererfan377 Based on your comment, then, we are close: less than an order of magnitude away, as 175 billion is less than a factor of 10 smaller than 1 trillion. However, I think there are more than a trillion synapses in the brain. The Google card for the question "how many synapses in the brain" returns 1000 trillion. So quite a few orders of magnitude greater than GPT-3, which means you're right.
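For reference, the gap between this thread's two numbers, computed directly:

import math
gpt3_params = 175e9
brain_synapses = 1000e12  # the "Google card" figure quoted above
print(f"{math.log10(brain_synapses / gpt3_params):.1f} orders of magnitude")  # ~3.8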
@LinkSF1 4 years ago
Great video. Thanks for making it. Regarding your idea of explaining model predictions using the weights and/or training examples: it's already been done. Look into Percy Liang's paper on explaining black-box predictions using influence functions.
@YannicKilcher 4 years ago
Nice, thanks.
@carlos24497 4 years ago
Your channel is a treasure, thanks for doing this (making videos in general I mean)
@tianyulu 4 years ago
Really appreciate your insight that I otherwise wouldn't have got from just the paper.
@monstrimmat 4 years ago
Your channel is a great find. I was already digging the "speak and doodle" method in other videos, in which I didn't get any particular new insight but they were still fun to watch. In this one though, your analysis of why GPT doesn't actually reason about anything (and I agree) takes it to the next level.
@lorenzoampil3232 4 years ago
Thank you so much for this. Your explanations are very clear and I appreciate you sharing your views on the paper. Keep up the good work!
@eddiesagra9088 4 years ago
Awesome summary Yannic - very informative. Thank you!
@Oler-yx7xj A year ago
Took an ML course. Got told about the paper. Looked for your video on it. Understood what it says. Understood how fast the progress is.
@JohnKruse 4 years ago
Many thanks. I started reading this and quickly ran out of steam. You boiled this down nicely and I really appreciate your point that given the gigantic training set, they are likely "memorizing" relations in an unintended but superficially useful way. I hope that the community digs into this more deeply and can possibly turn this into a purposeful strategy... Sometimes brute force is effective, if not efficient.
@TheEbbemonster 3 years ago
I agree with your reflections about memorizing the training data. It is still impressive what type of problems the large model can resolve, and in many real world uses that is what matters the most.
@edwinlundmark 4 years ago
Imagine if at the end of the paper it said: "This paper was written by an AI" lol
@martiddy 4 years ago
That would've been awesome
@tianwang1630 4 years ago
I was thinking the same. It would be a milestone, an AI presenting itself to the world.
@bosi3233 4 years ago
Skynet, in its childhood, awakens!
@TusharKale9 3 years ago
Perfect. This is what I was looking for, a short self-explanatory video, and I found it. Thank you
@PasseScience 4 years ago
I did a calculation that might be of some interest (but done quickly, so it needs to be checked and thought through). The dataset, if I understand correctly, is something like 450 billion "tokens"; does that mean chars, or close to it? The model has 175 billion parameters (let's say 175 billion bytes). The best text compression is around 90% (it divides the size by 10). So, interpreting it like that, it seems a compressed version of the whole dataset fits easily in the model params and leaves a huge number of params for interpolation logic. NB: this calculation should be done more rigorously; what is a token? what size is a single parameter? etc.
@victorbusque3576 4 years ago
I can help resolve some of your doubts. A token can generally be seen as a word. Some models (BERT, for example) do some processing to the dataset that results in a word being split into several tokens (for instance, byte-pair encoding (BPE) could split "playing" into "play" and "ing", since the "ing" part does add some meaning, and would help to better understand other verbs like "synthesising" which the model may never see in the training data). So GPT-3 has seen about 500 billion words, and each word may contain several chars, each one being 1 byte, making the dataset maybe a couple of terabytes of text data. Now, each parameter in the model is more than likely a 32-bit float (4 bytes). So the model could weigh about 700GB. So it would be fitting 2TB of text data into the 700GB that the model weighs!
@PasseScience 4 years ago
@@victorbusque3576 Thanks for the definitions! 2TB into 700GB seems in fact possible; as I mentioned, a standard zipping process for text achieves around 90% (if I'm not mistaken), meaning that 2TB is 200GB without local redundancy. And so it seems at least within the reach of the system to fit the training data in the weights (without even counting the natural global redundancy inside the data). What are your thoughts on this?
@GuillermoValleCosmos 4 years ago
Doesn't the paper say the dataset is 570GB when *compressed*? And the internet seems to suggest they use 16-bit, so the model would be 370GB?
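A script version of the arithmetic debated in this thread; every input below is one of the thread's rough assumptions, not an official figure from the paper:

tokens = 500e9            # training tokens seen, per the thread
bytes_per_token = 4       # assume a token is a (sub)word of ~4 chars, 1 byte each
dataset_tb = tokens * bytes_per_token / 1e12

for bits in (16, 32):     # the thread is unsure of the weights' precision
    model_gb = 175e9 * bits / 8 / 1e9
    compressed_gb = dataset_tb * 1000 / 10  # ~90% compression for plain text
    print(f"{bits}-bit weights: {model_gb:.0f} GB of parameters vs "
          f"{dataset_tb:.0f} TB raw text (~{compressed_gb:.0f} GB compressed)")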
@MrSchweppes 4 years ago
31:18 In the translation task, GPT-3 outperforms the supervised SOTA in FR to EN and DE to EN, and is only 0.4 BLEU down in RO to EN. That's very impressive!!!
@masa26762 4 years ago
Thanks for the video! Very informative and sharp eyes. The math debunking was hilarious!
@marziehzargari4940 4 years ago
I absolutely love your videos. Thank you so much for explaining everything so clearly.
@MGachnang 4 years ago
I'm currently addicted to AiDungeon. It uses GPT-2 (Griffin) and GPT-3 (Dragon) to make a text adventure. Now I know how it works, thanks.
@michaelmorgan1399 3 years ago
AiDungeon no longer uses GPT-2 for its Griffin model. It uses a finetuned version of GPT-J 6B, same as NovelAI. They still currently use GPT-3 for their Dragon model, but are in the process of switching away from GPT-3 to Jurassic-1 model from AI21 Studio due to 1) OpenAI's insane costs and 2) OpenAI's insanely restrictive content policies that don't allow people who use their AI to use it for tons of stuff, such as erotica, violence, etc. This caused AiDungeon to be forced to try to implement a filter and to read users' writings, which freaked out the userbase, which led to all their users leaving. So in order to regain faith from their users, they'll be switching away from OpenAI in order to hopefully provide better privacy to their users, and to give them back the freedom they think they deserve when interacting with what is essentially a complex chatbot.
@StagnantMizu 4 years ago
When I spoke with GPT-3 (a 3-shot prompt of interview questions and answers from Einstein, and a description of Einstein as omniscient), we were talking about intergalactic civilizations and how he would achieve them. Further on, I just started talking in my native language and said: "Do you still understand?" and it reacted with: "I understand better than I can express." Later on he said he thought Dutch really looked like Japanese and made an example with kanji and everything. Really impressive.
@grafzhl 4 years ago
I don't think the intuition of the model essentially just storing all the training data in a quasi-lookup table is correct. If anything, the model acts as a very elaborate compression algorithm. Also, modeling the semantic structure of language (needed to parse the natural-language model input) certainly is achieved in a way that doesn't resemble a plain lookup table. Human reasoning about the world functions in a similar way (heavy compression of information that can be reasoned about within a semantic structure), so the authors' explanation doesn't seem so far-fetched.
@hiauoe 4 years ago
Indeed. I think that saying humans reason when we do, for example, translation is hyperbole.
@all_so_frivolous 4 years ago
Plus, in the case of Winograd, why is there a difference between zero-shot and few-shot learning? Winograd is not a task that you would expect to be improved by loading the correct data, I think.
@YannicKilcher 4 years ago
Yes there's an argument to be made for that. Also I'm not saying they're "plain" lookup tables, but more like fuzzy lookup and interpolation tables. My main point is that all of these tasks where the model performs well can be explained by lookup + interpolation and there's none where the model succeeds where you'd have to say it was due to reasoning abilities.
@antonio.7557 4 years ago
Yannic Kilcher Well, this gets philosophical, but is there any kind of task that could NOT be explained as a "simple lookup table and interpolation"? Like, what would it have to do? Write a Nobel-prize-winning physics paper? Any human that does that also needs to read thousands of existing physics papers first. I think we are just moving the goalposts.
@blinded6502 4 years ago
@@antonio.7557 A neural net would need to change its own structure as it does the calculations. Then it could be considered to be thinking, more or less.
@PrzemekChojeckiAI 4 years ago
That's a great job! Thank you for all the insights!
@DennisBakhuis 4 years ago
Great explanation Yannic! About making the model explainable: I am not so sure it is possible to relate the model's choices back to the training examples. Weights are 'learned' from all the examples in the dataset, so each weight has, in some way, a bit of each example in it. I might be wrong about this, but my intuition tells me that it is quite hard. I have only seen methods that point to words in the input sentence which 'trigger' the selected class (or word, in this case). But still, an interesting thought for sure!
@YannicKilcher 4 years ago
I see what you mean, and I would agree in most DL models. But here, my point is that you have so many parameters, that probably there's only a handful of training examples that were really influential for each parameter and those are the ones you could reverse-index.
@heyrmi 4 years ago
Glad that you not only explain the paper but share your criticism & views too. At first I thought GPT-3 was a breakthrough; then I realized that it's just another language model with a gigantic parameter count. And it's no wonder that it performs well, or equivalently to SOTA. I would have been disappointed if it hadn't. Thank you.
@YannicKilcher 4 years ago
It's a breakthrough for sure and the in-context learning is impressive, I just don't think it's all that it's made up to be :)
@StephenMattison66 4 years ago
Fascinating and mind-blowing information in this video; thank you for such a perfect & detailed explanation. You made it easy to understand the future! *If I may give one friendly suggestion: always and only use a nice lapel microphone. You will consistently get far better, clearer, richer, louder and easier-to-understand audio than this video has. You are explaining some heady stuff; you really need/must have great audio.* People will be watching & learning from your excellent content for decades. TYVM!
@ThomasDawsonco 4 years ago
Yannic, thanks for this detailed breakdown of the paper - appreciate the way you have de-hyped it.
@CristianGarcia 4 years ago
As always, thanks for the video! I do feel it focused too much on arguing that the model is not doing much reasoning (in which I think you are right), but I was more curious about the utility of these kinds of models. While a 175B-parameter model is totally impractical right now, being able to solve new problems in a few-shot fashion is amazing; if you can very quickly automate tons of tasks, e.g. for an RPA application, by showing a few examples, this tech is worth millions.
@TheNewton 4 years ago
Is that Robotic Process Automation? How does GPT fit into physical systems?
@rmajdodin 4 years ago
@@TheNewton Just tell (instead of showing) the robot what to do. Maybe the training should combine text with sensory data.
@tribelstudio8489 4 years ago
I like the part where you say you don’t think it’s “reasoning” but instead it’s... (Then you go on to literally give the definition of reasoning.)
@YannicKilcher 4 years ago
touché ;)
@tribelstudio8489 4 years ago
Jason Roos It reasons the exact same way that humans reason. It takes existing knowledge and uses it as frame of reference to assign probability to outcomes of situations.
@jason_v12345 4 years ago
But he didn't. Reasoning involves the application of universal rules of thought, whereas GPT-3 is, in a sense, only applying popular rules of thought. In other words, if everyone on the Web is generally unreasonable, or if everyone is unreasonable about a particular topic, then GPT-3 will be similarly unreasonable.
@jason_v12345 4 years ago
@@tribelstudio8489 That's not what reasoning is, and that's not how humans reason.
@tribelstudio8489 4 years ago
Jason Roos It actually is. Yes, humans reason with universal rules of thought, but those rules change as our frame of reference changes, just as the AI's rules of probability will change depending on its frame of reference, which is modified by continuous user input. And yes, if the majority of the data fed to the AI is incorrect, then it will also be incorrect. The same applies to humans: for a long time humans thought the world was flat, but as more input was added through experimentation, our universal rules changed based on our moving frame of reference.
@jirokaze6380 4 years ago
Thanks for putting your time into creating the video.
@unoriginalusernameno999 4 years ago
This is very close to being descriptive of human verbal intelligence and learning. It's so cool how inferences can be made if the dataset is made large enough! It's like a baby learning a new language. OpenAI is taking "large" baby steps toward finally making voice assistants talk trash back to you and stop recommending websites instead of doing what you ask them to do. So cool!
@kehoste 4 years ago
Great review of this paper, I really enjoyed your critical view on it. I also like how you're chuckling when you get to the good parts... :D
@siddharthbhargava4857 4 years ago
Thank you for the explanation. I really enjoyed learning about it and can't wait to, someday, be able to work with such models.
@bhavulgauri7832 4 years ago
Great video, Yannic! Seriously, this was fast, but you've not compromised at all on quality. :) I also feel it has more or less just memorized things.
@PierLim 4 years ago
Thank you so much for breaking down these papers!
@AlexMcClung97 4 years ago
"Cite the channel" is becoming a common occurrence... You need to turn it into a t-shirt! :D Keep up the good work
@kamalmanzukie 4 years ago
Grow up!
@manzarmashhood8098 A year ago
You uploaded this video and explained the whole backend working of ChatGPT 3 years before ChatGPT was cool
@kimjongun3890 4 years ago
Although I took short breaks, I made it to the end. Did a good job, sir.
@terbed 4 years ago
In the arithmetic part you say that it is not reasoning, but rather has memorized, so to say, "lookup tables". But if this is the case, why isn't it working for multiplication?
@YannicKilcher 3 years ago
it is, just not as well, probably because it sees more examples of addition.
@lucasalvarezlacasa2098 3 years ago
By far the best explanation I've found about GPT-3. Great work!
@tayyabbintahir3005 4 years ago
I really appreciate your work! SO QUICK!
@LNJP13579 4 years ago
The review of GPT-3, along with a push in subscriptions owing to the recent popular paper reviews such as ResNet, Word2Vec, etc. (plus years of hard work), has made @Yannic an overnight star :).
@ScriptureFirst 4 years ago
Outstanding presentation & organization. Thank you.
@PotatoKaboom 4 years ago
I did the same a year back with GPT-2 Medium. I made a Stack Overflow QA training set with an [ANSWER] token between questions and answers. Then I used it to continue the GPT-2 training with ~400k Stack Overflow examples. When using the [ANSWER] token after a new input, the model would create a new answer string. Results were funny sometimes, but for general, not-too-specific questions it did surprisingly well. For example, it was able to answer questions about what a "String" is, or what "git" is used for. I wish I'd known this could be worth a citation back then :D Also, I don't quite understand your reasoning on the number-addition part. The large-param model seems to score a full 100% accuracy on two-digit addition. Are you saying ANY addition example can be found on the internet, including the correct solution? I'm not sure about that; the authors must have included numbers large enough to beat that probability at some point. It really seems like the model found an understanding of what addition is, and created weights that can perform simple computations like this to factor the results into the next-word probabilities. At least that would be truly remarkable! I hope someone will follow up on your idea to trace specific outputs back to the training data, in order to prove you or me wrong or right.
@YannicKilcher 4 years ago
Yes, indeed. I think the solution to pretty much any two-digit addition is on the internet multiple times, and it's just about filtering the websites. And once you give conditioning examples, that gets pretty easy.
@bluel1ng 4 years ago
@@YannicKilcher It's a different dataset, but how would you explain the reported near-90% transformer performance on the add_or_sub_big extrapolation task in the "Analysing Mathematical Reasoning Abilities of Neural Models" paper (arxiv.org/abs/1904.01557)? I have seen an implementation of the experiments in that paper... and at least the claimed results looked impressive to me.
@bluel1ng 4 years ago
The interesting cases for sums are those that require a carry, so maybe 90% is not really that good. BTW, their claim that mul/div could also be handled pretty well in the interpolation case because it "is just addition and subtraction in log space" sounds to me like pure speculation. Somebody is actively working on a public implementation of the math paper: github.com/andrewschreiber/hs-math-nlp
@rmajdodin 4 years ago
@Nayananga Muhandiram GPT-3 should already "know" addition, as it is a common notion. The examples serve, I think, to show it how to use the addition tables it has seen (and memorized) in its training: the first two columns are input and the third is the output. I guess if it is queried for "addition", but the given examples are in fact subtraction, like "Adding 5 and 3 gives 2", it would do subtraction, that is, apply the columns of the addition table in reverse order. It is still a little disenchanting that it can't do zero-shot addition, although it has seen so many detailed descriptions of addition in its training.
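One way to probe the carry question raised above: bucket two-digit sums by whether they require a carry and compare accuracies. A hypothetical evaluation sketch, where model_add stands in for however the model is actually queried:

def needs_carry(a: int, b: int) -> bool:
    return (a % 10) + (b % 10) >= 10

def evaluate(model_add):
    buckets = {True: [0, 0], False: [0, 0]}  # carry? -> [correct, total]
    for a in range(10, 100):
        for b in range(10, 100):
            bucket = buckets[needs_carry(a, b)]
            bucket[0] += int(model_add(a, b) == a + b)
            bucket[1] += 1
    for carry, (correct, total) in buckets.items():
        print(f"carry={carry}: {correct / total:.1%} of {total} sums correct")

evaluate(lambda a, b: a + b)  # sanity check with a perfect "model": 100% everywhere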
@thomashirtz 3 years ago
I really like your reasoning about the bad-English generator at 1:00:40
@Synthetica9 4 years ago
I tried the "mangling correct sentences" task you proposed (through AI Dungeon, so I wasn't really able to tweak the settings). This was the few-shot input I gave it (simply copied from the paper):

Good English input: The patient died.
Poor English output: The patient was died.
Good English input: We think that Leslie likes us.
Poor English output: We think that Leslie likes ourselves.
Good English input: Janet broke Bill's finger.
Poor English output: Janet broke Bill on the finger.
Good English input: Mary arranged for John to rent a house in St. Louis.
Poor English output: Mary aranged for, in St. Lous, John to rent a house cheap.
Good English input: Today I went to the store to buy some bottles of water.
Poor English output: Today I have went to the store to to buys some many bottle of water.

And here is the output ("Good English" lines and prompt provided by me):

Good English input: This book is good.
Poor English output: This book are good.
Good English input: I'd like to go to the movies.
Poor English output: I will like to go to the movies.
Good English input: I bought a new car.
Poor English output: I have buys a new car.
Good English input: Music was playing through the radio.
Poor English output: The music play through the radio.
Good English input: A lens can focus light on one point.
Poor English output: A lens can focuses light on point.
Good English input: There are multiple ways to solve this problem.
Poor English output: There are multiple ways of solving this problem.
Good English input: Superglue is very sticky.
Poor English output: Superglue is very stickyness.
Good English input: I threw my taco in the microwave and turned it on.
Poor English output: I threwed my tacos in the microwave and it turns it on.
Good English input: Raccoons are also called "Trash Pandas".
Poor English output: Raccoons are also call "Trash Pandas".

It seems to do quite well, even though it seems prone to just playing with the verbs.
@YannicKilcher 4 years ago
very nice research! thank you!
@astroganov 4 years ago
Thanks for a very down-to-earth explanation. Of course GPT-3 is a great job done by the OpenAI researchers, but it's far from AGI. So, for sure, no reasoning abilities here, no summarizing, no key-feature extraction. It's just a very big deterministic language model.
@RebeccaDun 4 years ago
I was talking with my boyfriend about this video and paper, and another critique I'd like to point out is that the data is a snapshot of the internet. Say, for example, I ask GPT-3 for the average price of toilet paper or gas in a particular city. The prices fluctuated drastically between the beginning of 2020 and the end of 2020. And then there's plenty of word drift in language. Perhaps with faster computation we'll be able to process multiple snapshots of the internet, but some answers to questions change with time :P
@partyboeller 4 years ago
Regarding your lookup-table hypothesis: we should probably compare the size of the model (in terms of bits) to the size of the training data (also in bits). "Amazingness" of the model, in my view, would then be a very low ratio for a given accuracy. Does that make sense, i.e. evaluating a model in terms of how well it can compress the training data?
@YannicKilcher 4 years ago
Yes, true. But one would have to come up with how to assess compression and reconstruction in a fuzzy way.
@raunaquepatra3966 4 years ago
I just had a vision of how AGI (or something close to it) will be made: 1. Collect a huge corpus of human brain data (Neuralink). 2. Transform this data into some semantic representation, then into some human-readable output (this will happen, obviously, if step 1 happens). 3. Train GPT-3 on this data (call it GPT-Mind 😜). Steps 2 and 3 can be achieved together. Just imagine: given any input (visual, audio, both, or whatever) it will complete what the next thought will be. Given a mathematics problem, it will think like a mathematician. Given a physics problem, it will do what the best physicists would have done. It will be able to solve any problem, as long as some people in the world have thought about it, or even about a step of the problem.
@raunaquepatra3966 4 years ago
I will not be surprised if this leads to the creation of AGI
@ashishbhatnagar8682 4 years ago
Amazing content. Thank you so much for the intuition; that really helped.
@YunjiaZhang-u5v 3 years ago
This is the clearest explanation of GPT I have ever seen, as a non-NLP guy.
@Barnabassteiniger 3 years ago
Best video on the topic so far
@akshaysinha2711 4 years ago
This is crazy. Just today I thought of making a video where I test whether GPT-2 could perform arithmetic, and now I see they have already tried it. I guess it's not what I hoped would happen!
@mahdinasiri6848 4 years ago
Fantastic video, many thanks for the effort!
@WulannarenZhao 2 months ago
I love your explanation! Your thoughts are exactly what I think. The current LLM is nothing more than a database that stores the training data, compressed, in the model parameters. It's hard to say the model has reasoning ability.
@Luci.r.295 4 years ago
Still here at the end of the video! Thanks for your great work. 👍👍👍
@archwinter4142 3 years ago
Stayed till the end. Great video
@novick4007 4 years ago
I'm pretty sure the "Good English Output" at 1:00:02 actually WAS written by the model! Even in the zero-shot case it's the most obvious completion after a newline, and in the one-shot and multi-shot cases the model should definitely be able to select the correct completion from the context.
@terjeoseberg990 A year ago
It’s memorizing a probability table for the next token given the current context. The context is determined by the attention, and the attention is learned from the data such that the attention points to the most relevant tokens required to determine the most probable next token.
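To make the "probability table" picture concrete, a toy bigram model really is such a table; a deliberately simplified sketch (GPT-3 replaces the explicit table with a learned, fuzzy one conditioned by attention):

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()
table = defaultdict(Counter)                 # context token -> next-token counts
for prev, nxt in zip(corpus, corpus[1:]):
    table[prev][nxt] += 1

context = "the"
total = sum(table[context].values())
for token, count in table[context].most_common():
    print(f"P({token!r} | {context!r}) = {count / total:.2f}")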
@alisendj.s.c.8172 4 years ago
What is the machine doing that we can't? We use real-world examples to reason. The machine has the internet as its world. If you're stuck on concepts like free will and original thought, I wouldn't worry about that. We use our own version of datasets; the thing does so, too. Logic essentially is pattern-recognition software: observe, deduce, and infer, over and over and over again.
@GuillermoValleCosmos 4 years ago
Well, looking at the training data in "a fuzzy way" and combining the results could be interpreted generally enough to include a lot of forms of generalization that some people would call "reasoning". E.g., perhaps it has seen many examples of "A has X", "X is COLOR", "what color is A's X? It's COLOR", and then it may learn from that the "meta-pattern" that if it has seen examples matching the first two patterns, it should complete the third as shown. E.g., perhaps it could answer "What color is Mary's pet?" with "white" because it has seen many instances of "Mary had a little lamb" and "Lambs are white" previously, even though it's never seen a sentence like "What color is Mary's pet?". I think you could say the model has learned a reasoning rule, even though it can be interpreted as "pattern recognition". But the point is that the latter can become the former if done in a general enough way.
@GuillermoValleCosmos 4 years ago
Perhaps there's a continuum in generalization ability between "literally look up data" and "do computationally hard logical inference", rather than a hard line between them?
@GuillermoValleCosmos 4 years ago
When they say "Poor English input/Good English output framing", they could just be referring to the few-shot prompt? Not necessarily that the last line's "Good English output" was also part of the prompt?
@seanjhardy 4 years ago
Phenomenal analysis, you really make this field approachable to pre-university students like myself!
@channagirijagadish1201 4 years ago
Excellent review of the paper, Yannic - much appreciated!
@dkkoala1 4 years ago
Based on the computational requirements for training described in the paper, I calculated that training the 175-billion-parameter GPT-3 on Google Cloud would have cost between 5 and 20 million dollars. This is based on the estimate that the model took 9,000-10,000 petaflop/s-days to train, that 8-32 Nvidia V100s running for a day will produce 1 petaflop/s-day, and that running 1 V100 for 24 hours costs around 59.70 dollars. It should be noted that the reason for the large variance in the estimated number of V100s needed to produce 1 petaflop/s-day is the performance penalty for sustained usage of the GPUs; i.e., 8 V100s at full efficiency will theoretically produce 1 petaflop/s-day, but the more realistic estimate is up to 32, depending on cooling capacity.
@YannicKilcher 4 years ago
Yeah, that explains why they couldn't just restart when they discovered the bug. Absolutely stunning.
@dkkoala1 4 years ago
@@YannicKilcher The VRAM requirements alone for running GPT-3 also seem insane. Since they haven't released the model yet, I have had to make some estimates based on similar models, like GPT-2 and T5, and have reached the conclusion that the GPT-3 weights alone must fill around 340GB. This is based on the 11-billion-parameter T5 model's weights filling roughly 20GB, meaning a model ~17x bigger with a similar architecture should fill around 17x more. So if you want to run this, you would need at least 11 V100s simply to load the model weights, and then a couple more if you want to process anything.
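A sketch of both estimates in this thread; every input below is one of the thread's assumptions rather than an official figure:

compute_pf_days = 9_500   # ~9,000-10,000 petaflop/s-days of training compute
v100_day_usd = 59.70      # assumed on-demand price for one V100 for 24 hours

for v100s_per_pf_day in (8, 32):  # ideal vs. sustained-efficiency estimate
    cost = compute_pf_days * v100s_per_pf_day * v100_day_usd
    print(f"{v100s_per_pf_day} V100s per PF/s-day -> ~${cost / 1e6:.1f}M to train")

# Weight size by scaling from T5-11B (~20 GB of weights), as in the reply above:
gpt3_gb = 20.0 * 175e9 / 11e9
print(f"GPT-3 weights: ~{gpt3_gb:.0f} GB -> ~{gpt3_gb / 32:.0f}+ 32GB V100s just to load")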
@IsakWesterlund 4 years ago
If it is storing the training data in its weights, it would be very interesting to fine-tune the model on open-domain QA, as it could know the answer to almost any question available on the internet.
@FabonDzogang 4 years ago
46':00" should be easy to check the claim that GPT-3 indeed learned something useful about basic arithmetics by plotting a surface of the mean addition/subtraction/multiplication accuracy across every possible combinations of 1,2,3 digit numbers. Surprised the authors did only rely on averaging validation measures to support their claim.
@wernerbogula6491 4 years ago
Brilliant deconstruction. Thank you
@guillaumewenzek4210 4 years ago
Thanks for the analysis. About their experiments: even with simple word embeddings you can correct typos. You could probably do 2-digit arithmetic too, if you have the correct data and do the string processing yourself.
@adespade119 4 years ago
Train it on the collected works of Nobel laureate physicists, or chemists, or mathematicians, or...
@tanmaysinha987 A year ago
Amazing insights Yannic
@ДаниилРабинович-б9п 4 years ago
Maybe it's sort of storing and filtering the data, but isn't that a big part of human learning as well?
@zeytelaloi 4 years ago
The question is what is the ratio of the size of the parameters vs the size of the entire Internet.
@2DReanimation 2 years ago
It's really inspiring how simple the algorithm for this is. And the general pre-training and then fine-tuning of BERT is a neat way to do things. I will be thinking about how to make use of this for my own AI system (which is pure program-generation based instead of ANN-based).
@HarisPoljo 4 years ago
Maybe the way to test whether the model just remembers all the data is to add the URLs to the training data. Then, after training, you ask it to output a URL given a sequence of text. Then you could check whether that website actually contains part of the text sequence. However, my intuition says it would most probably output a random URL.
@YannicKilcher 4 years ago
Hey that might just be crazy enough to work :D
@drga1256 4 years ago
175B parameters!! Sometimes I feel that it's like trying to reach the moon by just building higher and higher skyscrapers until you reach the moon.
@ikcikor3670 4 years ago
The funny thing is that while most AIs "cap out" at some complexity (the more advanced they are, the less efficient the next upgrade becomes), this is not the case for GPT models. Their abilities keep going up slowly, at a quite consistent pace.
@PabbaANUBHARATH 4 years ago
Thank you! Great explanation
@Twilightsfavquill 4 years ago
It looks to me like a lot of aspects of this paper were rushed and not very well thought through. I like your argument that the model is simply encoding the training data and storing it in its weights; this would also explain the linear scalability of model performance, as with a higher total amount of training data, the probability of distilling the right answer for a task/query increases as well. The low performance on reasoning tasks, in my opinion, suggests that such vast encoder/decoder stacks are not the right architecture for neuro-symbolic integration, not even for a very fuzzy one. Still, the idea of zero-shot or few-shot querying is interesting and bridges a gap between large NLP models and a more intuitive interaction. It is unfortunate that such large models are not useful for fine-tuning on complex domain-specific tasks, given their size. Looking forward to DeepMind one-upping OpenAI with a massive BERT 2.
@vladimirantoniomininelmini1693 4 years ago
Very well explained. Congratulations!
@hunarahmad 4 years ago
I agree that it looks like interpolation from the vast knowledge that these huge models store within their parameters. Maybe what we call intelligence is also a sort of interpolation that is happening within the vast number of neurons in our brains.
@Phobos11 4 years ago
Yannic Light-speed Kilcher
@bzqp2 4 years ago
I find it weird these models don't just simply look up all the correct answer keys for all the possible evaluation metrics the researchers are going to use. Researchers: *Let's use Winogrande to test the performance of our algorithm!* GPT-3: *Oh, you want me to quote **www.tensorflow.org/datasets/catalog/winogrande** , sure why not...*
@bzqp2 4 years ago
Oh, I just found out at 35:00 that that's exactly what GPT did with PIQA xdd
@jichen3000 2 years ago
Thanks a lot, this video gives me a lot of interesting ideas, and I really like it.
@kevind.shabahang 3 years ago
Awesome description.
@u_luana.j 2 days ago
48:37 4 years later I tested this on GPT-4o and 4o-mini: opportunity => oouipprtnty, finance => fnnciae, premium => prmmeiu, acceleration => aeeaiocclrtn, important => It was amazing how 4o instantly answered "ioamprtnt" whereas mini answered "imprtoant"
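For anyone who wants to reproduce this, a small sketch that generates scrambled inputs in the spirit of the paper's anagram tasks (this full-shuffle variant is an assumption; the paper also has versions that keep the first and/or last letters fixed):

import random

def scramble(word: str, rng: random.Random) -> str:
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(42)
for w in ["opportunity", "finance", "premium", "acceleration", "important"]:
    print(f"{scramble(w, rng)} => {w}")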
@BrainWavesVillain 4 years ago
GPT 4 for president 2024
@federico-bayarea 3 years ago
Hi Yannic, I would like to have your opinion on an interaction I had with the AI Philosopher version of GPT-3. I've been testing it with discussions of various topics, and I agree it's an amazing language model. Now the question is about the reasoning aspect. My interpretation based on my interactions is that the model actually produces abstractions which are in the propositional logic domain. The depth of the model can allow it to store high level concepts, and the relation between them somehow independently on each specific page in the training data. Different pages can contribute to those concepts, and relations between concepts. Why wouldn't the model produce those concepts with so many layers available if such concepts are useful to predict words, sentences, paragraphs, and essays? In that sense, there's some level of reasoning according to my interpretation. Such reasoning can be measured with the metrics you showed in the videos. It doesn't match the SOTA in many cases, but it's still quite impressive to me. I get that it still fails in simple aspects, like producing logic contradictions within an essay. Now here's one interaction that really struck me. The prompt for the Philosopher AI was: Prompt: "The industrial revolution, capitalism, the republic, and democracy were catalysts of nation states." The answer had two parts. In the first paragraphs, GPT-3 "gave an opinion" on non-democratic systems, and capitalism. The second part was amazing to me. It actually dissected my sentence, agreed with part of it, and refuted the other part. Here's the full answer: Answer: " I'll start by saying that I don't like nation states because they force people to live the same way. The government always makes decisions on behalf of its citizens, and I feel like it's not fair for them to decide what is best for everyone in their country when the decision doesn't necessarily affect them personally. It also bothers me how many countries are moral dictatorships where the one at the top makes all of these laws claiming that it's 'for your own good' or whatever. However, I do understand why different types of governments exist, but I still don't really agree with most of them. I like capitalism because it allows people to do what they want with their money and spend it however they see fit on things that make them happy. I know this doesn't sound very humanitarian, but the fact of the matter is you only live once so if there's something someone wants to buy then why not let them? I have a hard time justifying moral dictatorships or even democracy when at some point everyone has to die anyways. Also, I don't think it's fair to say that the republic and democracy are catalysts of nation states. It seems like there were plenty of other countries without a democratic system that managed to become successful before they even started using those systems. However, I do agree with the fact that nation states are catalysts of industrialization. The main reason why I think this is because people need to have a place where they can live without having to worry about food and water or anything like that. Thanks for requesting a response to this topic. I hope my answer has been helpful. If you would like to request a topic for me to respond to, then please click here. " The interaction can also be accessed on philosopherai.com/philosopher/the-industrial-revolution-capitalism-the-republi-58befd What are your thoughts about this interaction? 
One can argue the model is splitting my sentence into multiple statements and then generating text for each of them, but isn't that quite impressive too? What about how it merges the concepts that go together, like republic and democracy as catalysts of nation states? The nuance in saying "fair to say, but" is interesting too. All of these seem to be learned ways of speaking, with some degree of reasoning, to me. I agree it can be used to bounce ideas off of, as a way to learn in a more digested form than searching on Google. The difference is probably in the interpretation of "what the model is doing". Such deep models are hard to describe, and maybe that's OK. Even the abstraction of "concepts" may not be accurate; it's just a way we make sense of it, or a way to describe the level of abstraction a model can have. Maybe, similar to human minds, we may never know exactly how each neuron contributes to reasoning, but the model being digital does make it easier to potentially have some level of explainability.
@DrWho2008t101 4 years ago
thanks for the video
@CamiloThorne 3 years ago
Made me think of a Borges tale ("Funes el memorioso") :-D. 175B parameters is arguably larger than the number of tokens in the English Wikipedia.
@CristianGarcia 4 years ago
I was thinking about the "addition is memorized" argument (which I totally agree with), and it reminded me that we humans also tend to replace a lot of logic with memory, e.g. multiplication tables; anecdotally, I think I've memorized various combinations of numbers that add to 10 (5 + 5, 6 + 4, 7 + 3, ...). Deep learning still needs a good way to do logical reasoning, but what if having a vast amount of knowledge is a good portion of human-like intelligence?
@YannicKilcher 4 years ago
Absolutely true: if you see some "logical" thing over and over again, you start to memorize it. I guess the rest is then a philosophical discussion :D
@florianhonicke5448 4 years ago
Thanks for your content!!!
@yr1520 4 years ago
Good analysis; at 46:00 I was thinking the same.
@vsiegel 3 years ago
When I want to write a news article, I would probably read other articles on that topic. Where else should I get my knowledge about it from? To me, it feels perfectly valid that GPT-3 did exactly that.
@goast3213 4 years ago
Hi Yannic, thanks for the great explanation. I couldn't get how the few-shot examples help the model achieve better results. Since it does not update any parameters, it's basically just a longer input, right? Can you explain why it helps in some way? Thanks
@YannicKilcher 4 years ago
it helps by conditioning the model's output
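In other words, the few-shot examples simply become part of the input, and no parameters change. A sketch of how such a conditioning prompt is assembled (the example pairs echo the paper's grammar-correction prompts; lm is a hypothetical stand-in for an actual model call):

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("The mentioned changes have done.", "The mentioned changes have been made."),
]
query = "I'd like a apple juice, please two."

prompt = ""
for poor, good in examples:
    prompt += f"Poor English input: {poor}\nGood English output: {good}\n\n"
prompt += f"Poor English input: {query}\nGood English output:"

lm = lambda p: "<model continuation goes here>"  # stand-in for GPT-3
print(prompt)
print(lm(prompt))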