Who would have thought training with authentic text like textbooks was better than training with social media gibberish? Big surprise.
@thefigureshow Жыл бұрын
Yeah, but the chatbots were meant to communicate normally with people, so what better way to do that than using social media as training data?
@niggacockball7995 Жыл бұрын
@@thefigureshow But social media has a lot of trash compared to textbooks, which are more professional and pack more context into less text.
@Omni7600 Жыл бұрын
@@thefigureshow Using anything but social media lmao. Social media is dumbed down society.
@akkudakkupl Жыл бұрын
@@thefigureshow If you want to communicate bullshit, then you go with social media. If you want actual knowledge, on the other hand...
@DigitalForerunners Жыл бұрын
@@akkudakkupl A textbook does not contain significant amounts of conversational data, which is the only way an "AI" can be trained to understand back-and-forth communication.
@alexisandersen1392 Жыл бұрын
Wow... you mean the quality of the training material determines the effectiveness of the model... Who would have thought it!? Amazing.
@SamuelAlbanie1 Жыл бұрын
Indeed. Very wow.
@vogel2499 Жыл бұрын
You're right, but you kinda missed the point. Good AI should be more than accurate; it should also be able to sort the signal from the noise by itself. Human intervention should be kept to a minimum. If all OpenAI wanted was an accurate LLM, they'd have already done it a long time ago.
@alexisandersen1392 Жыл бұрын
@@vogel2499 I get the feeling you don't understand machine learning at all.
@vogel2499 Жыл бұрын
@@alexisandersen1392 You have no idea who I am, so don't assume. I could disclose it, but it would just humiliate both of us.
@alexisandersen1392 Жыл бұрын
@@vogel2499 I didn't say you definitely don't understand machine learning... It's just that, given what you said, I get the feeling that you don't understand machine learning at all.
@_anon Жыл бұрын
You have to love "open"AI not wanting people to develop models using their output even though they never got permission to use the data they trained their own models on.
@batteryfresh123 Жыл бұрын
You shouldn't develop models using the outputs of other models, to avoid skewing the models and amplifying bias. If you want to create a bad ChatGPT clone you can do that, but what's the point?
@akam9919 Жыл бұрын
@@batteryfresh123 sorry but wtf does that have to do with the original comment?
@princetchalla2441 Жыл бұрын
@@batteryfresh123 to figure out what works and make a better chatgpt clone?
@darkarchon2841 Жыл бұрын
Very interesting. Could this mean that eventually smaller and smaller businesses could compete with tech giants in the generative AI market? Monopolization of AI seems like a large issue.
@SamuelAlbanie1 Жыл бұрын
Good question. It's hard to say definitively. It seems likely to lower the barrier to entry for smaller businesses to reproduce current frontier models. However, the fact that the student model can outperform the teacher (in this case phi-1 vs GPT-3.5 in the domain of Python problem solving) may also mean that better capitalised firms are able to leverage the same ideas, and to do so at much greater scale. My hunch is that scale becomes *even more* important, since synthetic data can be targeted to such a wide range of scenarios (and therefore may represent an attractive investment). I expect we'll soon see various labs aiming to reproduce these results in other domains beyond Python coding (e.g. mathematics, engineering, etc.).
@sumofat4994 Жыл бұрын
No, not directly, but they will be able to build their own models without relying on the larger companies.
@dinoscheidt Жыл бұрын
Are small businesses competing on Databases, Operating Systems, Compute, Electricity… etc? All these drift towards commodities. You should focus on the problems you are solving as a business, not {insert technology/hype/trend}. What this shows is potentially a race to the bottom for large tech giants (like with cloud and compute) which is the true benefit for small businesses.
@hm09235nd Жыл бұрын
“We’ve done a lot of looking over our shoulders at OpenAI,” the memo said. “But the uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI.” “I’m talking, of course, about open source. Plainly put, they are lapping us,” it continued. “While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly.”
@llothar68 Жыл бұрын
Now we need full, free library access to all books for every AI company. Many countries already have a law, for example in my country, Germany, that every published book (defined as having an ISBN) must be deposited at the national library. Give AI access. AI education matters. 🤣
@JamesJosephFinn Жыл бұрын
Commendable work! Subbed! Nobody else on YT (that I've seen) is parsing AI research papers in an approachable manner (as you've done here), for the benefit of non-academics, such as myself. Please keep it up. Forget vanity metrics ("likes", "views", etc.). We are at the dawn of a new age. There are many good people out here trying to come to grips with this storm, and this type of content will help us navigate through to the other side. Godspeed.
@SamuelAlbanie1 Жыл бұрын
Thanks for the encouragement!
@guy_th18 Жыл бұрын
You'd probably enjoy the channel "AI Explained", which does something similar.
@LuaanTi Жыл бұрын
That said, MSR papers are very often _brilliant_ and well written. Highly recommended, and not just about AI :)
@orti1283 Жыл бұрын
You'd probably enjoy two minute papers
@vatsaljha2143 Жыл бұрын
I loved this. I usually have trouble staying focused while reading research papers but your video was a godsend. Please do more videos like this. You earned a follower
@SamuelAlbanie1 Жыл бұрын
Thanks!
@younesprog2629 Жыл бұрын
Wow, what an amazing video! Keep up the good work, Mr. Samuel. I have a suggestion: From time to time, it would be amazing if you could make a video about these outstanding technical papers. Thanks a lot!
@SamuelAlbanie1 Жыл бұрын
Thanks @younesprog2629! I'm glad it was useful.
@smallestcat Жыл бұрын
Oh yeah, a monthly and yearly wrap-up of the most groundbreaking or hottest papers would be cool
@irismaxj Жыл бұрын
Thanks!
@SamuelAlbanie1 Жыл бұрын
You're welcome.
@pmk_ Жыл бұрын
Thanks for putting together the video!
@SamuelAlbanie1 Жыл бұрын
You are most welcome.
@yannalaplageable Жыл бұрын
I really don't understand how they didn't think of that before! It's so obvious. Thanks for sharing.
@connorrobinson9268 Жыл бұрын
About time someone makes a video like this. I'm happy you have a brain you are willing to share.
@SamuelAlbanie1 Жыл бұрын
Thanks @connorrobinson9268! I can confirm that I am a human with a brain.
@reinerheiner1148 Жыл бұрын
On one side, neural networks attempt to simulate the neural networks in biological systems. On the other side, the input used to train these models would be catastrophic for biological systems to learn from. It's understandable that they gave priority to scaling over a greater focus on the data itself, because scaling is just much easier and less labour-intensive, especially when there are not yet models that can do a good job of curating data. So the paradigm shift probably did not occur by chance; rather, one is dependent on the other. But to be honest, I see huge potential in optimizing data in a way that works up from small to large complexity and is easy to understand, just like a human would need. 1.3B parameters is honestly nothing compared to other LLMs with similar capabilities, so this is exciting, but it also means that bad actors will need less hardware to build competent LLMs as well... Thanks for reviewing this paper, it's indeed important. You have earned a sub!
@SamuelAlbanie1 Жыл бұрын
Thanks for your comment. I agree that we are likely to see greater focus on data quality, but haven't so far since scaling has proven easier. I also agree with your comment that these developments lower the barrier to building competent LLMs for everyone.
@therealb888 Жыл бұрын
I don't know. While it's easy for the ML engineers to scale compute, it's hard for the systems engineers, platform engineers and others. Are the latter disciplines really that much easier than ML?
@therealb888 Жыл бұрын
In a multipolar world of cyber warfare there are no clear bad actors. Everyone has their self-interest, and "bad" is relative. Among the arsenal of 0-days, botnets and DDoS, AI may be the nukes. Let's make sure it's not the great filter.
@clray123 Жыл бұрын
There are no "bad actors", it is a propaganda term invented by the US government (compare "axis of evil" and what not). Or to put it another way, the actual "bad actors" here are only those who think they can designate others as "bad".
@BlackmesaofIND Жыл бұрын
Trying to model Machine Learning after the way humans learn is like trying to model a new plane design after the way a bird flies. It at first makes intuitive sense to try and do so, but it ends up being an inefficient path to take.
@charles3840 Жыл бұрын
This has the possibility of giving Google a leg up if they use this method of training and the method is fruitful. They have one of the largest databases of textualized books available, in a world where a lot of books in digital form are largely decentralized. Other companies will probably have to gather their own databases of textbooks before they can really begin training (assuming Google does something to prevent competitors from using their database of scanned books, either legally or technically). Amazon also has a similar database, but they're not really competing in the AI space as of yet.
@SamuelAlbanie1 Жыл бұрын
Thanks for sharing your thoughts. I agree Google is a company that could potentially benefit considerably from this technique.
@Jiyoon02 Жыл бұрын
This is brilliant yet totally intuitive.
@SamuelAlbanie1 Жыл бұрын
Thanks for sharing your thoughts.
@rupjitchakraborty8012 Жыл бұрын
Thank you for summarising this paper
@SamuelAlbanie1 Жыл бұрын
Thanks for watching!
@omarnomad Жыл бұрын
Great video! Love your podcasty voice!
@SamuelAlbanie1 Жыл бұрын
Thank you!
@adamdevereaux2459 Жыл бұрын
Imagine if we applied the same level of effort to teaching an AI that we put into a single human, where it takes 18-25+ years. We are just throwing reams and reams of mediocre data at them. Imagine if we tried raising an AI with the same foundational knowledge growth that we give an infant or toddler.
@SamuelAlbanie1 Жыл бұрын
It's a good point. One of the comments made by the authors about existing internet-scraped datasets was "One can only imagine how frustrating and inefficient it would be for a human learner to try to acquire coding skills from these datasets, as they would have to deal with a lot of noise, ambiguity, and incompleteness in the data."
@vyrsh0 Жыл бұрын
That is old AI. Nowadays we just train something a billion times and act surprised that it works. That is why LLMs are just hype: they can't learn (train != learn), and probably NNs will never be able to learn.
@idiomaxiom Жыл бұрын
@@SamuelAlbanie1 LLMs are averages of whatever they are fed. They seem smart because it is impossible to comprehend the average of the knowledge they have been fed.
@lucasblanc1295 Жыл бұрын
@@idiomaxiom But there is some emergent "quasi-intelligence" from that. Maybe it is picking up on the subconscious patterns humans use for general intelligence, but that we obviously can't express well enough to code manually. Although, to me it still feels like current LLMs are cheating at intelligence, because they can write so much in one go; human intelligence happens in an event loop, we iterate. That's why chain-of-thought works so well. Perhaps someday we will be able to distill that raw intelligence and separate it from knowledge: get a blank slate, similar to a human baby, that can be trained like a human. But so far we can't separate that raw intelligence, distill that holy-grail blank slate, decouple it completely from mere patterns in the training dataset. That would truly be general intelligence, of course. Essentially, create a training dataset so pure our minds can't even comprehend it; it basically wouldn't even contain knowledge of a specific language, and the model could learn a language with as little input as a human baby would need. Of course, like I said, that probably requires a lot of processing going into a feedback loop, which is essentially a type of subconscious. So instead of humans creating those datasets manually, it would create them for itself from very little input, completely free from external help in the form of large amounts of data from another LLM. If we can do that, that's basically the definition of general intelligence; that's what nature has imbued us with through billions of years of training via natural selection. But god knows if we will ever be able to do it like that; there is just so much more bang for the buck in training on useful datasets. If we see it happening we might dismiss it as an inefficient technique that doesn't work, like how neural networks were seen back when our computing capability was barely enough to run a chess bot.
The problem with current LLMs is that they were trained on data that was meant for human consumption; it was never meant to train a mind from scratch. No wonder it required so much data to begin to show emergent intelligence.
@aoeu256 Жыл бұрын
What if we train poor people in the global south together with AI, so that the poor people can learn faster by communicating with the AI, and the AI can help them learn things that improve their infrastructure?
@mathematicalninja2756 Жыл бұрын
I have a short attention span and yet was able to follow through the video. Thank you!
@SamuelAlbanie1 Жыл бұрын
Glad it helped!
@hermestrismegistus9142 Жыл бұрын
Fascinating! I wonder how good the reasoning ability of small models can get with these techniques. LLMs already have a super-human ability to absorb information. If their reasoning ability also reaches or transcends human parity, then world-changing events won't be far off!
@SamuelAlbanie1 Жыл бұрын
Thanks for sharing your thoughts. I agree the trend points in that direction.
@maloxi1472 Жыл бұрын
Not a lot has changed in that regard. Induction is still limited to Bayesian-type reasoning, deduction is what you'd expect (superhuman and disappointingly narrow too), and almost zero progress has been made on the abduction (theory-building) front. Most of the field is currently held hostage by an influential cohort of inductivist and Bayesian ideologues, but it's not all bad. If we're lucky, the authoritarian scammers that follow them might miss actual AGI when it is built, and the first such agents will have enough time to reach escape velocity before they tighten the screws.
@donquixoteupinhere Жыл бұрын
Thanks for this analysis sir, I respect you! Particularly impressed by your cost estimation and very much appreciate that info.
@SamuelAlbanie1 Жыл бұрын
Glad it was helpful!
@spmishra00 Жыл бұрын
Thanks for a very lucid explanation of the paper.
@SamuelAlbanie1 Жыл бұрын
Thanks!
@StephenGillie Жыл бұрын
Basically this paper calls whole-internet training a waste of time and money.
@SamuelAlbanie1 Жыл бұрын
Perhaps, or perhaps only one team will do whole-internet training, and everyone else will distil their models from this base...
@xbox70333 Жыл бұрын
If you think about it, to learn something, or anything, you don't require 99.9999% of the data on the internet. Most of it is noise.
@yashjindal9822 Жыл бұрын
Random internet data -> GPT -> Textbook data -> phi-1. Interesting, as it still makes the presence of GPT essential. But all this for about $6.5k is o_O. Great explanation. It is daunting to go through a paper, but such a video is a breather.
@SamuelAlbanie1 Жыл бұрын
Thanks! Glad it was useful.
@blahblahblah23424 Жыл бұрын
I'm a little confused about this. If the overwhelming majority of the "textbooks" are generated by GPT3 or 4 itself, then ultimately isn't this just a form of distillation? It is already a well-understood and well-demonstrated phenomenon that remarkably small models can approach the performance of much larger models on certain tasks via appropriate distillation methods, with a bunch of caveats. If you want to say that very small models can attain GPT levels of performance specifically because of a smaller, high-quality dataset, I think you need to take some care to differentiate what you are doing from distillation. Would be interested to see something like this using non-synthetic datasets
@SamuelAlbanie1 Жыл бұрын
Good points. I think you can certainly interpret training on ChatGPT/GPT-4 outputs as a form of distillation. There have been other works also exploring this direction over the last few months. For example, Vicuna (lmsys.org/blog/2023-03-30-vicuna/) is trained on data from ShareGPT, which collates conversations from ChatGPT (though targeting good conversational performance, rather than coding skills). What I found particularly notable about the Phi-1 work is how effective the final model is, given the small size of the training set and model. To your question, I would expect that we will see further studies (comparing training effectiveness on different distributions of GPT-4 generated content) to more carefully investigate the importance of how the data is synthetically constructed.
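To make the distillation framing above concrete, here is a minimal sketch in plain Python. The teacher here is a stand-in function rather than a real LLM, and all names are illustrative, not from the paper: a large teacher model is prompted for textbook-style examples, and the resulting (prompt, output) pairs become the small student's training set.

```python
import random

def teacher(prompt: str) -> str:
    """Stand-in for a large teacher model (e.g. GPT-3.5 in phi-1)."""
    return f"Exercise: {prompt}\nSolution: def solve(): ..."

def build_synthetic_dataset(topics, n_per_topic=3, seed=0):
    """Collect (prompt, teacher_output) pairs -- the 'synthetic textbook'."""
    rng = random.Random(seed)
    data = []
    for topic in topics:
        for _ in range(n_per_topic):
            prompt = f"Write a {topic} exercise (variant {rng.randint(0, 999)})"
            data.append((prompt, teacher(prompt)))
    return data

dataset = build_synthetic_dataset(["strings", "lists", "recursion"])
# The student would then be trained on `dataset` with an ordinary
# language-modelling objective; that training loop is omitted here.
print(len(dataset))  # → 9
```

The key property this illustrates is that the student never sees the raw web corpus, only the teacher's curated outputs.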
@Sleeperknot Жыл бұрын
What do you mean, an overwhelming majority of textbooks are already generated by GPT-3? Aren't they talking about textbooks that are authored by humans?
@AliJardz Жыл бұрын
The use of LLMs to train LLMs is really interesting
@SamuelAlbanie1 Жыл бұрын
I certainly think so.
@devorein Жыл бұрын
This is wonderful. Please do more technical paper reviews if your time permits. What are your thoughts on training the model on such a large volume of generated data? A paper published almost a month ago, titled "The Curse of Recursion", explains why this can cause "irreversible defects on the resulting models". Is it because it's so hyper-specialised in one particular domain (Python coding) that phi-1 is immune to such defects?
@SamuelAlbanie1 Жыл бұрын
It's an interesting question. My hunch (unsubstantiated by experiments) would be: (i) More powerful models will be less prone to the collapse issue. We've seen in other settings that models like GPT-4 are able to self-correct through consistency checks, whereas weaker models are far less able to (some experiments showing this in the context of catching hallucinations can be found here: arxiv.org/abs/2305.18248) (ii) It seems plausible that diversity can be systematically introduced into generated data to avoid the issue (this is something that is hinted at in the Phi-1 work with heuristic approaches, but has also been demonstrated with "quality-diversity methods" e.g. carper.ai/quality-diversity-through-ai-feedback/) I suspect the question of "how well synthetic data works at large volume" will be settled empirically in the near future by one of the industrial AI labs.
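The "systematically introduced diversity" idea mentioned above can be sketched as prompt seeding: randomly chosen topics and constraints are injected into each generation prompt so the synthetic corpus doesn't collapse onto a few modes. The paper hints at heuristics of this kind; the topic lists and wording below are invented for illustration.

```python
import random

# Hypothetical seed lists; a real pipeline would use much larger ones.
TOPICS = ["recursion", "string parsing", "dynamic programming", "file I/O"]
AUDIENCES = ["a beginner", "an interview candidate", "a data scientist"]

def diverse_prompts(k: int, seed: int = 0):
    """Build k generation prompts, each seeded with random attributes."""
    rng = random.Random(seed)
    return [
        f"Write a Python textbook section on {rng.choice(TOPICS)} "
        f"for {rng.choice(AUDIENCES)}, including one worked example."
        for _ in range(k)
    ]

for p in diverse_prompts(4):
    print(p)
```

Each prompt would then be sent to the generating model, so even a fixed template yields a varied dataset.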
@jmgbrito Жыл бұрын
Just a simple correction to your video: you said Orca used 5M synthetic tokens (4:40 in the video), but it actually used 5M synthetic responses; the number of tokens in those responses is not actually stated.
@SamuelAlbanie1 Жыл бұрын
Yes, you are quite correct! I have added a correction that should show up at this point in the video, but unfortunately it's quite subtle.
@DrHanes Жыл бұрын
Great video! The Phi-1 model's efficiency in training large language models is indeed impressive. However, it would be beneficial to delve deeper into its limitations and potential solutions. Also, a more detailed discussion on what constitutes 'high-quality' data would be helpful for those looking to apply this in different contexts. Practical examples of the model in action would also be appreciated. Lastly, a more in-depth conversation on the ethical and social implications of training such models is crucial in today's AI-driven world. Keep up the good work!
@SamuelAlbanie1 Жыл бұрын
Thanks for your input - that's helpful feedback.
@ashleigh3021 Жыл бұрын
🤓
@uiedbook7755 Жыл бұрын
Very interesting. This shows that AI tech will keep improving for as long as it can keep becoming more useful!
@SamuelAlbanie1 Жыл бұрын
It certainly seems to be on a trend to keep improving.
@SavageStephen Жыл бұрын
You know... this also applies to humans: the better the quality of the training data, the better the results you get. If you started reading textbooks and scientific journals instead of social media and TV, you would get a better and more effective model for decision making.
@winddude9 Жыл бұрын
Orca was 6M samples, each about ~500 tokens, so about 3B tokens; so actually Orca was about 3x bigger in terms of tokens in the training data.
@SamuelAlbanie1 Жыл бұрын
Thanks for flagging this! I'll add a correction.
@meditationsafespace153 Жыл бұрын
Very fascinating!
@SamuelAlbanie1 Жыл бұрын
Thanks!
@creativeuser9086 Жыл бұрын
Aren't almost all training datasets nowadays contaminated with some of the HumanEval test data? I'm not sold on the decontamination part, albeit the authors' point that current training data is poorly formatted is very true. Thoughts?
@SamuelAlbanie1 Жыл бұрын
For many training sets, I'd agree that there is a fairly significant risk of that unless careful deduplication is performed. However, although this paper doesn't share code (so it's not possible to audit it), the authors do appear to do quite a careful job in their contamination studies, so I suspect there is less of an issue with regards to having close-to-identical questions in the training dataset. There is the broader issue of how much the use of synthetic data that closely mimics the testing distribution gives the model an unfair advantage, but I view the experimental design choices taken here as reasonable (i.e. I think the paper works as a compelling proof of concept).
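For readers curious what a surface-level contamination check looks like, here is a toy sketch of n-gram-overlap decontamination. Real pipelines typically use much longer n-grams (and this paper also uses embedding-based similarity); the 3-gram setting below is only to keep the example tiny.

```python
def ngrams(text: str, n: int = 3):
    """Set of whitespace-token n-grams in a document."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, test_docs, n: int = 3) -> bool:
    """Flag a training document sharing any n-gram with a test problem."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(t, n) for t in test_docs)

benchmark = ["def add(a, b): return a + b"]
print(is_contaminated("example: def add(a, b): return a + b", benchmark))  # → True
print(is_contaminated("sorting a list with bubble sort", benchmark))       # → False
```

Flagged documents would simply be dropped from the training set before training begins.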
@creativeuser9086 Жыл бұрын
@@SamuelAlbanie1 fair points. Thanks a lot!
@vinayakasrinivas4240 Жыл бұрын
This is really interesting. Do you think there is potential for third-party companies that solely focus on not only data collection but data synthesis that can sell access to specifically constructed datasets?
@kazihiseguy-fernand4637 Жыл бұрын
You answered your question lol
@SamuelAlbanie1 Жыл бұрын
I would very much expect to see this. It will be interesting to see how the space evolves.
@adamdevereaux2459 Жыл бұрын
So the underlying implication is that incorrect information within your datasets compromises the logic and reasoning capabilities. Essentially, too much false or misleading information creates a signal-to-noise impact on the model itself and makes it less efficient... makes complete sense, and I believe it parallels human learning. It's much harder to teach someone if they have to unlearn the wrong information they "knew" before. It's especially hard for large language models, considering they have no direct way to test and validate whether what they learned is correct or wrong.
@SamuelAlbanie1 Жыл бұрын
I think correctness certainly plays a role (though I expect there to be some errors in the textbooks generated by ChatGPT). In this paper, I also think that creating data that is well suited to learning (in this case, synthetic textbooks) makes a big difference. As the authors suggest, it would be hard for humans to learn from uncurated/scraped datasets; it is much easier for us to learn from textbooks that feature clear explanations and worked examples.
@SiddheshKukade Жыл бұрын
Please keep making more videos
@SamuelAlbanie1 Жыл бұрын
Thanks! I'll do my best.
@novantha1 Жыл бұрын
I wonder if a similar set of principles could be used to produce a high quality multi-modal diffuser model, which is able to learn not just from raw / annotated images, but also from natural language concepts and principles...
@SamuelAlbanie1 Жыл бұрын
Good question. I suspect the answer is yes. There are increasing efforts to curate synthetic datasets (e.g. github.com/JourneyDB/JourneyDB)
@southcoastinventors6583 Жыл бұрын
Found your channel; it's awesome. We need more channels that cover the research directly instead of AI=Terminator. So now they need to train an LLM that can tell quality data apart from everything else.
@SamuelAlbanie1 Жыл бұрын
Thanks for the positive feedback!
@floreskyle1 Жыл бұрын
Is there already a chat application we can use? Like ChatGPT?
@SamuelAlbanie1 Жыл бұрын
There is a demo on HF here: huggingface.co/blog/llama2#demo
@floreskyle1 Жыл бұрын
@@SamuelAlbanie1 Thank you!
@alir8zana635 Жыл бұрын
thank you so much for this great video it really helped me understand this paper very well
@SamuelAlbanie1 Жыл бұрын
Most welcome
@acters124 Жыл бұрын
So what you are saying is that quality > quantity offers more efficient LLM creation. However, it would lack the informal world's constantly changing language quirks. No one cares about the informal information, but it is still a market that has yet to be fully understood, and it may provide a better way to communicate with a layman who is not knowledgeable. Maybe in the future multiple LLMs could team up to create new information that does not cause worse data understanding over the long term?
@SamuelAlbanie1 Жыл бұрын
I'd certainly agree the quality appears to be critical in efficiently creating LLMs. I don't have a good sense for the dynamics of future synthetic dataset creation though - it's an interesting question to consider.
@AdrianMcGavock Жыл бұрын
Really enjoyed this video (though don't pretend to have understood absolutely everything!) sub added 👍
@SamuelAlbanie1 Жыл бұрын
Thanks!
@notaspectator Жыл бұрын
Considering how senior engineers at times like to flex on juniors, this is a flex of AI being textbook-centric. I would appreciate a more philosophical discussion about this with someone.
@Pcoxproductions Жыл бұрын
2:54 Quality data or information is more powerful than massive poor quality data; Quality data is clear, self contained, instructive and balanced
@SamuelAlbanie1 Жыл бұрын
I think that's a fair takeaway.
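That takeaway is roughly what the paper operationalises: a classifier (trained on GPT-4 quality annotations) filters a web corpus for "educational value". The scoring function below is a toy heuristic stand-in for that learned classifier, purely to illustrate the filtering step; the features and threshold are invented.

```python
def quality_score(doc: str) -> float:
    """Toy proxy for a learned quality classifier: reward explanatory
    text, penalise fragments too short to be self-contained."""
    score = 0.0
    if '"""' in doc or "#" in doc:
        score += 0.5  # docstring or comment present
    if len(doc.split()) >= 5:
        score += 0.5  # enough content to stand alone
    return score

def filter_corpus(docs, threshold: float = 1.0):
    """Keep only documents at or above the quality threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    'def area(r):\n    """Area of a circle of radius r."""\n    return 3.14159 * r * r',
    "x=1",  # trivial, unexplained: dropped
]
print(len(filter_corpus(corpus)))  # → 1
```

In the paper the surviving subset (plus synthetic textbooks) is what the model is actually trained on.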
@li-pingho1441 Жыл бұрын
awesome! thank you sooo much
@SamuelAlbanie1 Жыл бұрын
Thanks for watching!
@Deletaste Жыл бұрын
This is data engineering at its finest.
@SamuelAlbanie1 Жыл бұрын
Indeed, a new frontier...
@meanieweeny4765 Жыл бұрын
Now this is interesting.
@SamuelAlbanie1 Жыл бұрын
Thanks!
@Grinwa Жыл бұрын
Wow, this will empower even more crazy projects.
@SamuelAlbanie1 Жыл бұрын
Thanks for watching!
@herp_derpingson Жыл бұрын
I am also seeing a shift from BS architecture ablation papers to papers that actually work and are reproducible. What a time to be alive!
@SamuelAlbanie1 Жыл бұрын
"What a time to be alive!" - indeed! Thanks for watching.
@prolamer7 Жыл бұрын
Yeah, what a meme virus you and many others have, parroting this phrase and thinking you are funny while it is becoming annoying.
@ahmedkadry7717 Жыл бұрын
What is the name of the pdf reader you are using ?
@julkiewitz Жыл бұрын
Taking training data from one LLM to train another one seems like getting high on your own supply, idk. Doesn't it place a bound on the quality of the result?
@SamuelAlbanie1 Жыл бұрын
Good question. My intuition is that the answer is no - by iterative applications/bootstrapping, it seems possible that the model can gradually improve itself.
@julkiewitz Жыл бұрын
@@SamuelAlbanie1 I just wonder by what mechanism the resulting network would, for instance, "decide" to get rid of some common error that GPT-3.5 introduced into the dataset. Say GPT gets confused any time there's a DFS and adds a superfluous if at the end (something I've observed in reality when prompting it myself). Or maybe I misunderstood and that's not how GPT-3.5 was used here; maybe it just provided some boilerplate / explanations etc.
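One commonly proposed mechanism here is filtered bootstrapping: generated examples only enter the next round of training if they pass an external check (for instance, executing generated code against unit tests), so a teacher's systematic errors can be screened out rather than inherited. A toy sketch, with every component a stand-in:

```python
def generate_candidates(model, n: int):
    """Sample n generations from the current model (stand-in)."""
    return [model(i) for i in range(n)]

def passes_checks(example: int) -> bool:
    """Stand-in verifier, e.g. running generated code against tests.
    Here we simply pretend even-valued generations are 'correct'."""
    return example % 2 == 0

def bootstrap_round(model, n: int = 10):
    """One round: generate, filter, and return the surviving examples,
    which would be used to fine-tune the next model iteration."""
    return [e for e in generate_candidates(model, n) if passes_checks(e)]

kept = bootstrap_round(lambda i: i * 3)
print(kept)  # → [0, 6, 12, 18, 24]
```

The filter is the part that breaks the "bound by the teacher" intuition: quality is enforced by the checker, not by the generator alone.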
@HominidPetro Жыл бұрын
This approach could help guide scientific research immensely by flagging pre-existing studies for quality to support future endeavors or prevent unnecessary replication.
@SamuelAlbanie1 Жыл бұрын
Thanks for the comment, but I don't quite see how this follows from the paper?
@HominidPetro Жыл бұрын
@SamuelAlbanie1 Sorry I think I maybe misunderstood the study. Were they able to screen inputs for quality? Or was there an assumption of quality if it came from a text book? The ability to screen information for quality is what I'm referring to. But maybe main conclusion here is demonstration of garbage in garbage out? I dunno
@Ryan-qu4vx Жыл бұрын
It really is amazing how many times we have to re-learn "Garbage in, Garbage out".
@SamuelAlbanie1 Жыл бұрын
Quality does seem to be key for synthetic data.
@thearchitect5405 Жыл бұрын
"Textbooks Are All You Need" Oh yeah, and GPT3.5 too.
@SamuelAlbanie1 Жыл бұрын
Perhaps this would be a good subtitle :)
@guilimasp Жыл бұрын
When you say "48 hours ago", did you mean 48 hours before the moment you were speaking, or 48 hours before the posting of this video? Thank you very much
@SamuelAlbanie1 Жыл бұрын
An excellent question. I was aiming for when the video was posted, but I may not have been highly precise...
@guilimasp Жыл бұрын
@@SamuelAlbanie1 The goat
@darshanrajpattanaik2154 Жыл бұрын
Hey! How are you able to find great research papers? I am graduating next year and want to understand about the research process... Could you please help?
@SamuelAlbanie1 Жыл бұрын
1. Twitter is a useful for resource for high profile papers that are being widely discussed. 2. It helps to be in a research group, where people tend to be discussing whatever seems new and exciting. In the past, this was restricted to people physically present in the group/lab, but in the last few years many open research collectives have emerged (typically on discord servers) where people discuss papers. Many are welcoming to newcomers - it may be worth trawling reddit to find suggestions for discords that suit your research interests.
@Mohith7548 Жыл бұрын
Can you please get the annotated version of this paper?
@SamuelAlbanie1 Жыл бұрын
Thanks for the comment. I don't think my annotated version is much use, since here the annotation is not permanent. I'll have a think about using a different format in future that would allow this.
@pkqs9065 Жыл бұрын
The part about it costing less than 2 Apple Vision Pros was hilarious lol.
@pajeetsingh Жыл бұрын
So you bought all the books you are using for training?
@SamuelAlbanie1 Жыл бұрын
The books used by the authors in this paper are synthesised using LLMs. To your point, it's likely that the original textbooks used to train the LLMs were scraped (and not paid for, rather than bought individually). However, I don't know this for sure.
@krox477 Жыл бұрын
World is changing faster than we thought
@SamuelAlbanie1 Жыл бұрын
Indeed.
@asperspec9956 Жыл бұрын
I wonder what would happen if you train it with Common Core
@SamuelAlbanie1 Жыл бұрын
Good question. I suspect this already happens to some degree for the frontier models (but it's hard to know exactly what goes into their training data).
@demr04 Жыл бұрын
If one just thinks about it, it seems obvious that textbooks are a better resource for the models to learn from, because the data has structure. To fit a model, it needs structure to fit to. For a person, some textbooks are horrible, because the pedagogic skills of the author(s) are lacking.
@SamuelAlbanie1 Жыл бұрын
I think it's likely that we'll see further research investigating what kinds of textbooks work well for models. For now it's unclear, at least to me, what "good pedagogy" will mean in the context of an LLM.
@jessthnthree Жыл бұрын
Could you hypothetically have another AI act as a translation layer, converting a user's input into a more formal format that the primary AI can understand? I'm no scientist or anything, I just have a new special interest.
@SamuelAlbanie1 Жыл бұрын
Yes - it's a good suggestion. It's increasingly common to chain AIs together to make use of their relative strengths. For example, one trick that's used to improve text-vision models like CLIP (that perform classification with embeddings) is to use a second text model to "translate" the text inputs into a format that makes them work better (for example, adding a visual description of an item to be classified). This idea is explored here: arxiv.org/abs/2209.03320
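As a toy illustration of that chaining pattern (stand-in functions throughout, no real models): a lightweight "rewriter" model normalises free-form user input into the structured format the primary model expects.

```python
def rewriter(user_input: str) -> str:
    """Stand-in front model: normalise casual input into a formal query."""
    cleaned = user_input.strip().rstrip("?!.").lower()
    return f"QUERY: {cleaned}"

def primary_model(formal_query: str) -> str:
    """Stand-in primary model that only accepts 'QUERY: ...' inputs."""
    assert formal_query.startswith("QUERY: "), "unexpected input format"
    return f"Answer to [{formal_query[len('QUERY: '):]}]"

def pipeline(user_input: str) -> str:
    """Chain the two models: rewrite first, then answer."""
    return primary_model(rewriter(user_input))

print(pipeline("  How do transformers work?! "))
# → Answer to [how do transformers work]
```

In practice both stages would be neural models, and the front one can be much smaller than the primary one.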
@jessthnthree Жыл бұрын
@@SamuelAlbanie1 Thank you for the link to a research paper! Keep up the good work on videos :)
@thygrrr Жыл бұрын
"Given the prices of textbooks however, we conclude, it is cheaper to pay for more compute."
@SamuelAlbanie1 Жыл бұрын
The way of the future, I suspect.
@gorgolyt Жыл бұрын
1:46 You briefly say "it doesn't score higher than GPT-4" but this seems disingenuous, in fact GPT-4 is beating it by a fairly huge margin.
@SamuelAlbanie1 Жыл бұрын
This is a fair criticism. I tend to just take it as assumed that GPT-4 is far stronger than the other models, but I could have made this point clearer/more explicit.
@easyaistudio Жыл бұрын
synthetic data is all you need
@SamuelAlbanie1 Жыл бұрын
A good alternative title!
@taivas7216 Жыл бұрын
(good) synthetic data is all you need(?
@ZintomV1 Жыл бұрын
To increase the language diversity, why don't they ask GPT-4 to "translate" their textbook examples into other popular languages.
@SamuelAlbanie1 Жыл бұрын
It's a good suggestion. I suspect this will be explored in future work. In this paper, I believe they were aiming for a more narrowly scoped proof of concept.
@PraiseAllahu Жыл бұрын
Somewhere in LinkedIn they are already hiring people with 30 years experience
@SamuelAlbanie1 Жыл бұрын
I don't think I follow. Perhaps you could clarify?
@octaviusp Жыл бұрын
It's very interesting how a ridiculously small number of tokens can outperform 50B tokens. 7B tokens could be trained on 2 or 3 GPUs, and with 1.3B parameters we could create great models for phones or notebooks. So I have some questions. Fewer parameters don't seem too related to performance on these tasks, but what if more parameters could add more flexibility to our models? I see it as related to our brain: the more neurons and neural connections we have, the better we think, as far as we understand. So maybe we are thinking about parameters wrongly, and instead we need to use 100% of their power step by step: start with a small number of parameters and rich, high-quality data, and then, when improvements stop or slow down, that could be the end of the rich-data approach, and finally we can start to increase the parameters.
@SamuelAlbanie1 Жыл бұрын
Thanks for your comment. It's interesting to consider what scaling strategy is optimal. There have been various studies (perhaps the best known are the neural scaling laws from OpenAI and DeepMind's Chinchilla paper), but I'm sure there's more to be learned.
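To make the Chinchilla result mentioned above concrete: a commonly cited rule of thumb from that paper is roughly 20 training tokens per parameter for compute-optimal training. This is a simplification of the fitted scaling law, not an exact prescription:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget (~20 tokens/param rule of thumb)."""
    return n_params * tokens_per_param

# A 1.3B-parameter model (the size of phi-1) would "want" about 26B tokens
# under this heuristic -- far more than the ~7B tokens phi-1 was trained on,
# which is part of what makes its benchmark results surprising.
print(f"{chinchilla_optimal_tokens(1.3e9) / 1e9:.0f}B tokens")  # → 26B tokens
```

The comparison suggests the data-quality axis can partly substitute for the data-quantity axis that the scaling laws were fitted on.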
@PetrGladkikh Жыл бұрын
I would accept that this whole ML story has reached the "AI" level when it can autonomously build consistent models of the world and WRITE accessible textbooks, not just learn from them. Everyone calls it "AI", and then all the talk is about how to teach it.
@SamuelAlbanie1 Жыл бұрын
Thanks for sharing your thoughts. In this case, the textbooks are synthesised (by ChatGPT), then consumed by Phi-1. The degree to which LLMs like GPT-4 learn consistent world models is an interesting ongoing research question.
@mohali4338 Жыл бұрын
Aren't these published papers copyrighted? Is it legal to present them like this video?
@SamuelAlbanie1 Жыл бұрын
Thanks for the question. First, I am not a lawyer, and none of the below should be taken as legal advice. (1) Yes, the papers are subject to copyright. (2) The use of copyrighted material in a transformative way, such as for commentary, criticism, or educational purposes, can be considered 'fair use' under U.S. copyright law (related ideas apply in other jurisdictions). That's the principle I rely on when discussing research papers in my videos. However, 'fair use' is a complex area of law and is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use upon the potential market for or value of the copyrighted work. I aim to respect copyright laws and believe my use of these materials falls under 'fair use'. Ultimately, however, only a court can definitively determine whether a particular use is 'fair use'. Note: GPT-4 assisted with writing this answer (but I'm still responsible for it).
@gileneusz Жыл бұрын
11:13 imagine you will make the training set of this model with 100x good quality dataset, what would result look like
@SamuelAlbanie1 Жыл бұрын
Thanks for your comment. I would imagine the results would be strong, and I expect we'll see people attempt this soon.
@SinanAkkoyun Жыл бұрын
Will they release the model?
@SamuelAlbanie1 Жыл бұрын
My guess would be no, based on the comments by Sebastien Bubeck about not releasing the dataset here: www.linkedin.com/feed/update/urn:li:activity:7077091292077330433?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7077091292077330433%2C7077264503251374080%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7077091292077330433%2C7077309871414480896%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287077264503251374080%2Curn%3Ali%3Aactivity%3A7077091292077330433%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287077309871414480896%2Curn%3Ali%3Aactivity%3A7077091292077330433%29 (I could be wrong though)
@clray123 Жыл бұрын
It is Microsoft. What do you expect?
@4ngelf Жыл бұрын
Making ChatGPT respond to coding questions was just a proof of concept
@SamuelAlbanie1 Жыл бұрын
I would guess so. We'll see what comes next...
@4ngelf Жыл бұрын
Next I imagine there will be more models made this way in other areas besides programming xd
@television9233 Жыл бұрын
Wow, do they really start with a brand new LLM and only train it on those three datasets that contain very little English yet the final model still understands English?
@SamuelAlbanie1 Жыл бұрын
It would seem so.
@television9233 Жыл бұрын
@@SamuelAlbanie1 very interesting, thank you for doing these paper summaries. It's hard to eat food while reading a paper so videos like this are perfect.
@wall-eDefense Жыл бұрын
BLESS ME FOR MY SUCCESS 🙏🌱❤
@SamuelAlbanie1 Жыл бұрын
Bless you.
@MrDMIDOV Жыл бұрын
The textbook publishing industry be like: “Where my money at?! Call up the lawyers!”
@SamuelAlbanie1 Жыл бұрын
There are increasing reports of collaborations with publishers (the main one I'm aware of currently is AP news, but I'd expect others in the textbook world to follow: www.ap.org/press-releases/2023/ap-open-ai-agree-to-share-select-news-content-and-technology-in-new-collaboration)
@Speed001 Жыл бұрын
I've been wondering why learning from textbooks hasn't been done for a while now.
@SamuelAlbanie1 Жыл бұрын
It's a natural idea.
@thebluriam Жыл бұрын
I wonder what the cost would be of this training if you just built your own hardware. It seems more valuable to just go that route. After everyone was building their own crypto mining rigs and getting almost no value out of it in the end, it seems weird that people aren't building their own AI training hardware while religiously relying on dumb AWS pricing. It feels like using AWS over building your own is like using UberEats to order from your local coffee shop rather than just walking there; it's way more expensive for less benefit.
@SamuelAlbanie1 Жыл бұрын
It's a good question. Do you mean building your own chips, or buying GPUs and then constructing your own servers? I'm not a hardware expert, but a few challenges with the latter option include: (i) it's hard to run the chips as efficiently as the big cloud providers (by and large, they have very efficient cooling systems); (ii) scaling becomes quite difficult. If you know that you will need at most 8 GPUs, it might not be such a bad call to build a rig. But often, the flexibility of being able to scale up and down as needed saves a lot of hassle.
@thebluriam Жыл бұрын
@@SamuelAlbanie1 Great reply. I can talk about a few of these topics from decades of personal experience building my own computers of all types. Firstly, it's really really really important to point out that all of the "advantages" advertised for cloud compute are extremely thin at best. For instance, you mentioned cooling. It is very easy to adequately cool multiple GPUs, CPUs, memory, etc. with readily available off-the-shelf products. The big boys at the data centers really aren't going much further than you could on your own. The difference is that they have to do cooling at an infrastructural level because they have insurance-based liabilities, but they aren't doing anything special or anything you couldn't do on your own with very little effort. Secondly, cloud compute solutions were built on the knowledge that as the desire for these systems scales (as it has), there will be fewer people who know how to build the hardware from (relative) scratch, and fewer people with the know-how to keep these systems online all the time, giving us a demand problem. Therefore, the cloud solutions we have have become entrenched as the default standard way of thinking, operating, and measuring costs, but this has blinded us to how stupidly easy (and cheap) it is to build your own rigs. A partial anecdotal side note: back in the day, in the mid 90s to early 2000s, there were a lot of pre-built desktops you could buy from Dell, HP, etc., and they were pretty expensive. If you found yourself going down a certain knowledge path, you could easily build a PC for yourself that cost half the price and was like 3 times as powerful; all you had to do was learn a little bit about building custom PCs and be willing to take a tiny bit of risk.
Between the early 2000s and maybe 2013ish, that's damn near all that anyone did, especially for gaming. It was crazy, and a lot of the big branded PC makers like Dell, HP, and others stopped making home desktops as the laptop game started to rise. Now, everyone has laptops at home and it's not as common to see custom-built PCs. Now, we're in the same boat (I believe), where cloud compute platforms act like the old PC hardware providers, charging an arm and a leg for something that someone with a little bit of grit and a few thousand dollars can build for themselves, ending up with something that is more powerful and cheaper in the long run. Again, this is a belief, but I'm pretty sure I'm right. Yes, the cost of compute has gone up, but if you're willing to do it yourself, just like it was in 1999, you can very likely get a hell of a lot more for less. From what I can tell, the vast majority of the schools and non-massive tech companies who are using cloud compute to train models don't really need a lot of horsepower or scale; that's all marketing speak that I believe everyone is buying into because it sounds convincing and because it's the answer that's already given and accepted by everyone. Fundamentally, I think we're all being sold a lie about how much this stuff actually costs, and maybe we should run our own hardware most of the time, because if you build your own custom stuff, it's going to be scaled to your custom needs. The vast majority of us, even universities, don't need scale, just access, and that is another selling point of cloud services: you don't have to be smart enough to do it yourself.
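One way to make the cloud-vs-own-hardware argument above concrete is a simple break-even calculation. All the figures below are hypothetical placeholders, not real quotes for any particular rig or cloud provider:

```python
def breakeven_hours(hardware_cost, cloud_rate_per_hour, own_power_cost_per_hour=0.0):
    """Hours of use after which buying beats renting, under hypothetical prices.

    Ignores real-world factors like depreciation, maintenance time, and the
    cloud's ability to scale elastically, which cut in both directions.
    """
    saving_per_hour = cloud_rate_per_hour - own_power_cost_per_hour
    return hardware_cost / saving_per_hour

# Hypothetical numbers: a $6,000 multi-GPU rig vs. a $2.50/hr cloud instance,
# with ~$0.50/hr in electricity for the home rig.
hours = breakeven_hours(6000, 2.50, 0.50)
print(f"break-even after {hours:.0f} hours (~{hours / 24:.0f} days of 24/7 use)")
# → break-even after 3000 hours (~125 days of 24/7 use)
```

Under those made-up numbers the rig pays for itself only with heavy sustained use, which is roughly why the answer tends to differ between a lab training continuously and a user running occasional experiments.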
@henryholloway5656 Жыл бұрын
Did my man just say
@SamuelAlbanie1 Жыл бұрын
This is a fair criticism of my lack of precision :) In practical terms though, I think the approximation is reasonable.
@TheDavidlloydjones Жыл бұрын
It's a bit odd to give your new product a name ending in "1." That's like announcing "Here it is, our brand new, obsolete, thingie. Stay tuned for #2."
@SamuelAlbanie1 Жыл бұрын
Good point! Maybe they were thinking "Formula 1"...
@quaidcarlobulloch9300 Жыл бұрын
❤
@SamuelAlbanie1 Жыл бұрын
Appreciated.
@gileneusz Жыл бұрын
10:17 well, it's affordable - if you know how to create it by yourself. No such luck
@SamuelAlbanie1 Жыл бұрын
It's true that it still requires some expertise/experience in dataset construction.
@ronensuperexplainer Жыл бұрын
Orca is not trained on 1 and 5 million tokens; it's 1 and 5 million conversations.
@SamuelAlbanie1 Жыл бұрын
Good catch! That's an error from me. I'll add a correction.
@itsbrex Жыл бұрын
Great video Samuel.👏 subscribed. Would love to connect. Keep up the great videos.
@SamuelAlbanie1 Жыл бұрын
Thank you itsbrex!
@deepfakescoverychannel6710 Жыл бұрын
"all you need" is all you need.
@davidyolchuyev2905 Жыл бұрын
It seems like not a single reference in this research is American. America needs internationals in its educational institutions, not just because internationals bring in cash, but because smart folks from abroad move there.
@SamuelAlbanie1 Жыл бұрын
I'm struggling to understand your comment. Perhaps you could rephrase it to help me understand it?
@Apple-vm5gc Жыл бұрын
@@SamuelAlbanie1 look at the author names.There are many indians and chinese names.
@kazihiseguy-fernand4637 Жыл бұрын
@@Apple-vm5gcbut what’s the implication of the original comment ?
@Apple-vm5gc Жыл бұрын
@@kazihiseguy-fernand4637 I think he wants more foreign students in america
@roryboyes2307 Жыл бұрын
This is in the hands of the open source community now as maximizing textbook count will not be awfully legal.
@SamuelAlbanie1 Жыл бұрын
I think it's also possible that we'll see well-funded companies constructing enormous private, synthetic libraries of textbooks. We'll see!
@PhrontDoor Жыл бұрын
It won't be all that's needed. It will give nice 'academic' examples and academic writing. Textbooks are wonderfully bereft of pragmatic and production-level code.
@SamuelAlbanie1 Жыл бұрын
I like the phrase "wonderfully bereft".
@ericpmoss Жыл бұрын
I’m surprised it took a paper to figure this out. LLMs are GIGO.
@SamuelAlbanie1 Жыл бұрын
Thanks for sharing your perspective. Personally, I found the effectiveness of the technique quite surprising.
@ericpmoss Жыл бұрын
@@SamuelAlbanie1 I’m not trying to be an internet know-it-all - sorry if I come across as one. It’s just that the core of LLMs is prediction of the next word in a sequence based on the statistics of all prior sentences, not on the implications of those sentences. So the utility of its output is completely dependent upon a history of being trained on well-formed and well-considered sentences. Now, even natural intelligences are susceptible to GIGO. They can break out of it by pruning predictions that are contradicted by reality. As long as LLMs form a model of reality based on how often a thing is repeated rather than on its consistency, they’ll be unable to even mimic reasoning beyond repeating the popular line.
@ivan24zg Жыл бұрын
It's not just the data quality, it's how the concepts are explained and gradually introduced in the training data so that low-dimensional grokking is sped up. It's basically as if you are training a blank-slate human child that has text-only sensory input about the real world. The more gradual, incremental and self-referential the training data is, the fewer steps (and, in the LLM world, fewer parameters) it will take for the language model to "grok" the world model. Repeated cycles of distillation of huge foundational models (trained with "bad quality" data, as the paper calls it), retraining the semantic extraction with incremental, step-wise semantic build-up, will through iterations evolve a minimal-size LLM with the optimal combination of both knowledge and reasoning for the given model size. Eventually we are going to hit the limits of what a certain size of model can do, but I suspect these limits are much, much higher than anyone imagines, and we could probably have an AGI-level intellect running on a few dozen billion parameters. We are "almost there". AGI by 2025.
@SamuelAlbanie1 Жыл бұрын
I agree with your observation that the incremental/self-referential textbook nature of the explanations is a key component in driving performance in here, and that a model with a few dozen billion parameters could be very powerful. We will see (perhaps in 18 months, if your timeline is correct)...
@DarkWizardGG Жыл бұрын
Damn, this Phi-1 has the potential to become a great LLM. Imagine that with just 1B parameters it's almost on par with GPT-4 and beat the 175B GPT-3; damn, this AI is a fckn BEAST. Gonna root for this AI. 😁😉😄🤖🤖🤖🤖
@SamuelAlbanie1 Жыл бұрын
Thanks for watching! It's perhaps worth pointing out that while Phi-1 is very competitive on Python coding benchmarks (and far smaller), it has a very narrow range of expertise relative to these other models (GPT-4, for example, is strong across a very large number of domains).
@DarkWizardGG Жыл бұрын
@@SamuelAlbanie1 yes, you're right, bro. Well, I guess it's a long way to go for this Phi-1 with its 1B LLM; we'll see in time whether it gets a major upgrade in the near future. However, considering that it's a small model, its performance was quite good tho. Anyway, you got a good presentation, bro. Done liking. More vids to come & God bless.😁🤗🤗🤗🙏🙏🙏
@sirnate9065 Жыл бұрын
This is crazy
@SamuelAlbanie1 Жыл бұрын
Indeed.
@jboss1073 Жыл бұрын
For something called "ARTIFICIAL INTELLIGENCE" this sure needs lots of natural, human guidance and intervention. Why doesn't it just do this all itself? Wait, I know why... this is no AI, that's why.
@memb. Жыл бұрын
This is very big.
@SamuelAlbanie1 Жыл бұрын
I think so.
@darkwingscooter9637 Жыл бұрын
This essentially gives the lie to the whole current generation of AI, because the one thing it cannot do is determine the quality of a dataset.
@IntelsCreed Жыл бұрын
Hardly understood anything
@jonabirdd Жыл бұрын
Yet another Meta AI paper gaming the benchmark. Don't get me wrong. It's impressive. But it's gaming the benchmark, so we need a new benchmark.
@SamuelAlbanie1 Жыл бұрын
Thanks for your comment. I agree that we probably need a new benchmark. To clarify: this work is from Microsoft (rather than Meta AI).
@jonabirdd Жыл бұрын
@@SamuelAlbanie1 Oh yes, my bad. Orca was from microsoft as well.
@khatharrmalkavian3306 Жыл бұрын
Microsoft is lagging far behind here. Even the moat document talked about how hobbyists had already figured out that content quality can reduce training costs to independent-hacker levels.
@SamuelAlbanie1 Жыл бұрын
Thanks for your comment. From my perspective, the main contribution is not so much that data quality matters, but that high quality training data can be effectively synthesised. I found the magnitude of the improvement somewhat surprising.
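The "synthesise and filter" idea can be sketched in miniature. This is only a loose paraphrase of my understanding of the phi-1 pipeline (which, as I understand it, prompts GPT-4 to judge the "educational value" of code samples and then trains a small classifier to imitate those judgments at scale); the heuristic scorer below is purely illustrative and stands in for that learned classifier:

```python
def educational_value(snippet):
    """Crude stand-in quality scorer: counts simple 'textbook-like' signals.

    A real pipeline would use LLM quality annotations and a trained
    classifier; this heuristic is only for illustration.
    """
    signals = ("def ", "return", '"""')
    return sum(sig in snippet for sig in signals)

def filter_corpus(snippets, threshold=2):
    """Keep only snippets whose quality score meets the threshold."""
    return [s for s in snippets if educational_value(s) >= threshold]

corpus = [
    'def add(a, b):\n    """Add two numbers."""\n    return a + b',
    "x=1 # TODO fix later",
]
print(len(filter_corpus(corpus)))  # → 1
```

The point of the sketch is the shape of the pipeline (score every candidate, keep the high-scoring subset), not the scoring rule itself, which is where the interesting work in the paper lives.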
@super.digi777 Жыл бұрын
But then they wouldn't be able to understand our SMS-like texts
@SamuelAlbanie1 Жыл бұрын
I would guess that models trained in this way won't be particularly robust on other text distributions (like SMS content)