I Made a Language Model From 0 (you can run it too!)

7,880 views

8AAFFF

1 day ago

Comments: 116
@8AAFFF
@8AAFFF 3 күн бұрын
Go to piavpn.com/8AAFFF to get 83% off Private Internet Access with 4 months free (and support me :D)! thanks for watching! also discord server now: discord.gg/MC4wTeb4
@thefcraft8763
@thefcraft8763 Күн бұрын
It's nice, but I think your architecture has some flaws. Suppose the text is "This is a ...", and there are several possible next-word predictions, like "dog", "cow", "mountain". "Dog" and "cow" are nearby in the vocab embedding space, but "mountain" might be far apart, and if you train your model on such cases it will average out the result and might give some nonsense or hallucinate, etc. (basically it might output the midpoint vector of "cow", "dog", and "mountain").
@Myexpectationsarerealistic
@Myexpectationsarerealistic 13 сағат бұрын
I did similar. Not touching Rust.
@AndroidFerret
@AndroidFerret 7 сағат бұрын
The production and information value of this video is insane. How long did you edit this?? Fantastic
@scoffpickle9655
@scoffpickle9655 3 күн бұрын
The reason the 160k-batch REAN was worse on the graphics-card prompt is that the network is overfitting. I'd recommend using a test set with some prompts and choosing the model that performs best on that test set, instead of just running with high batch counts.
@8AAFFF
@8AAFFF 3 күн бұрын
ur right its most likely overfitted, the weird thing is that most other test prompts i was running were generally getting better with more batches so idk
@scoffpickle9655
@scoffpickle9655 3 күн бұрын
@8AAFFF It sounds like a data problem then: too little, or not general enough, data would lead to worse curve fitting. I suppose there wasn't much data about graphics cards, so it freaked tf out and kept spamming "graphics".
@8AAFFF
@8AAFFF 2 күн бұрын
maybe, also possible that the graphics cards knowledge just got overshadowed because it was in the beginning of the dataset. i did some more tests today and basically it just seems to have some knowledge points that it tries sticking to no matter what the prompt is
@PaulanerStudios
@PaulanerStudios Күн бұрын
@8AAFFF Are you using any sort of speculative decoding or temperature scaling? That wasn't mentioned in the video and does make quite a difference.
@NoSubsWithContent
@NoSubsWithContent Күн бұрын
​@@8AAFFF what if you used an existing super-efficient model, like the Granite MoE with 400M active parameters, to comb through a different dataset like FineWeb-Edu and produce a list of knowledge it could access during training via RAG or something? If you figure out a way to do that, I feel like it'd get much better performance, because it wouldn't have to spend so much of its weights on memorizing stuff; instead it could learn actual patterns, intelligence even?
@salad_txt
@salad_txt 3 күн бұрын
You are so underrated it is actually insane, keep it up dude. Great stuff.
@rkpstam
@rkpstam 17 сағат бұрын
Good work, Oleg
@gilbertenevoldsen4469
@gilbertenevoldsen4469 Күн бұрын
This video is super well made and informative! But I'm a bit curious why you chose the architecture you did. The reason this way of outputting words isn't typically used in large language models is that it's useful for the model to have multiple high-probability candidates for the next word that aren't necessarily close to each other in vector space. For example, say a sentence comes up in training like "My favorite hobby is...". There are a lot of possibilities for the next word, so the model would be optimized to output the average vector of those possible answers, which likely isn't a sensible continuation of the sentence. I would love to see what you could make by doing it the traditional way, and how good a model you can train as a single person.
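As a rough illustration of the averaging issue described in this comment, here is a tiny sketch with made-up 2D toy vectors (not the video's actual 784-dimensional word2vec space): the regression target lands at the midpoint of the plausible continuations, and the nearest vocabulary word to that midpoint can be a different word entirely.

```python
import numpy as np

# Toy, made-up 2D "word vectors" purely for illustration.
vocab = {
    "dog":      np.array([ 1.0, 0.9]),
    "cow":      np.array([ 0.9, 1.0]),
    "mountain": np.array([-1.0, 0.2]),
    "thing":    np.array([ 0.1, 0.6]),  # an unrelated word that sits near the midpoint
}

def nearest(vec):
    # cosine similarity against every vocab vector
    sims = {w: vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
            for w, v in vocab.items()}
    return max(sims, key=sims.get)

# If "dog", "cow" and "mountain" are all valid continuations, regressing to a
# single embedding pushes the network's output toward their mean...
target = np.mean([vocab["dog"], vocab["cow"], vocab["mountain"]], axis=0)

# ...and nearest-neighbour decoding can land on none of them.
print(target, nearest(target))  # picks "thing" here
```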
@simonwillover4175
@simonwillover4175 21 сағат бұрын
or um maybe reward it for simply choosing any word close to any option rather than the average?
@WoolyCow
@WoolyCow 9 сағат бұрын
@@simonwillover4175 could just be weighted by distance as well, or even add in some error on purpose to get some more divergent responses
@toofardoug2188
@toofardoug2188 Күн бұрын
This is so high quality it's nuts! The editing is excellent. The explanations are crisp. The relative context to the SOTA for each design choice is excellent. The origin and then evolution of the concepts is extremely valuable, such as how tokenization at the beginning becomes embeddings.
@mrpro7737
@mrpro7737 Күн бұрын
The editing skills in this video are harder than that new architecture 😂
@zaj007
@zaj007 3 күн бұрын
18:25 Bro there has gyat to be a better way! I'm crying 😭😭 wtf is that timeline 💀💀
@8AAFFF
@8AAFFF 3 күн бұрын
bro did the tower of babel editing technique ahh
@jaythecoderx4623
@jaythecoderx4623 Күн бұрын
This should have millions of views what the hell this is epic, very well edited too
@slowpoke101
@slowpoke101 3 күн бұрын
Great video, these longer videos are always nice to see. Thank you for open-sourcing the code.
@jondoe6608
@jondoe6608 Күн бұрын
Out of curiosity, are you aware of the RWKV architecture? It's an LLM that's based on a type of RNN; its main advantage is removing the hard context limit, making it possible to have longer contexts on weaker devices, since it uses a constant amount of memory. Your idea of using embeddings as the input and output is really cool, especially since it further reduces VRAM requirements.
@IceMetalPunk
@IceMetalPunk 10 сағат бұрын
I mean... your network only predicts the most likely next token, whereas GPT models predict a probability for every token and sample from that distribution (they don't just choose the highest-probability token); and your tokens are just entire words from the corpus. So it's like a GPT model that (a) always has a temperature of 0, and (b) can't understand anything that's not a word present in the corpus. I think we can see from that why GPT and similar models didn't go this route 😅
@jairjuliocc
@jairjuliocc 2 сағат бұрын
With respect to temperature: it's possible to find the k most similar neighboring vectors and assign each a probability based on its similarity score. In this way you can mimic temperature.
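A minimal sketch of that idea, assuming a trained gensim KeyedVectors model (here called w2v) and a predicted embedding pred_vec from the network; both names are placeholders, not code from the video. As temperature goes to 0 this collapses back to plain nearest-neighbour decoding.

```python
import numpy as np

def sample_word(w2v, pred_vec, k=10, temperature=0.8):
    """Pseudo-temperature: softmax-sample among the k nearest vocabulary words."""
    # gensim's KeyedVectors.similar_by_vector returns (word, cosine_similarity) pairs
    candidates = w2v.similar_by_vector(pred_vec, topn=k)
    words = [w for w, _ in candidates]
    sims = np.array([s for _, s in candidates])

    # lower temperature -> sharper distribution, closer to plain nearest-neighbour
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    return np.random.choice(words, p=probs)
```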
@lionlight9514
@lionlight9514 3 күн бұрын
This is so cool man! Please, keep going.
@mrcavas
@mrcavas 2 сағат бұрын
Such a fire video! I love the style
@brams06
@brams06 Күн бұрын
I was shocked to see that this video has so few views. I feel so lucky to have come across this gem.
@Pratikology
@Pratikology Күн бұрын
Wtf, why isn’t This at a million views? keep it up bro what a fantastic blend of knowledge and creativity 🔥
@aamindehkordi
@aamindehkordi Күн бұрын
Insane time spent and crazy W video. don't worry about compression or pacing this is gas and should blow up soon
@Aragamin
@Aragamin Күн бұрын
Dude, this is wonderful work. Glad to see that enthusiasm sometimes turns not just into a hobby but into serious development :) Keep going on your path! P.S.: the design of the videos is really cool, top marks.
@kotway2
@kotway2 3 күн бұрын
Very cool video and project man!
@Quozul
@Quozul 20 сағат бұрын
This is an amazing project! And I love the graphics and visuals of the video too!
@devbuffer0112
@devbuffer0112 Күн бұрын
Creel style visuals, cool bgm and hot topics related to CS. You're gonna become huge
@AllExistence
@AllExistence 2 күн бұрын
You seem to have gone a weird route with training. Normally, networks are just trained on plain text first, to learn normal language. Then they are finetuned with "human/assistant" data to actually answer questions instead of talking to themselves.
@8AAFFF
@8AAFFF Күн бұрын
yeah thats true its just that the higher quality human/assistant dataset was so big that i didnt need to first train on raw text
@bedahtpro
@bedahtpro Күн бұрын
Great quality video man!
@TeamDman
@TeamDman Күн бұрын
Your animations are awesome :o
@v0idbyt3
@v0idbyt3 Күн бұрын
damn you made davinci resolve go kaboom at the end btw cool video! i hope this architecture eventually gets a remake or succeeds, because this could be a way better alternative to GPT architecture.
@sandded7962
@sandded7962 23 сағат бұрын
Hi , Can you elaborate on the 12:43 part where the circular text says the following: “I’m edging, I’m edging , I’m edging , I’m edging”
@absentmindhere
@absentmindhere 14 сағат бұрын
nice
@A_Me_Amy
@A_Me_Amy 19 сағат бұрын
i like ur ad, or rather ur general artistic style. Also, for the model, I think the idea of the vocabulary space makes sense. There is research that came out today from Meta that could pair with this fairly well, about LCMs as opposed to LLMs; it works on small sentences with limited tokens, and I could imagine translating any sentence into the 768 vocab space in essence, or something like this... not technically aware enough to contribute more than to say this. Perhaps a word2word2vec2word2word process, so that it can still speak and understand the full vocab list, but processes the core essence in the smaller architecture. I do think figuring this out is the sort of future, or that there is a lot possible... Oh, and the same guy who talked about this paper today also talked about other research from Princeton about slow and shorter training leading to more in-context learning (ICL), and that at some point when training weights it loses the ability to do this... But yeah, the most capable reasoning model at the smallest possible size is the new extension of the process where compute halves in physical size and doubles in power, i forget what it is called... Moore's law. The new Moore's law: AI gets twice as smart every 2 years and half as large. I am quite sure this will be the trend, to be honest.
@w花b
@w花b 3 сағат бұрын
It's nice to have something that's not manim for once
@xorcise
@xorcise 12 сағат бұрын
8:20 ah yes good birchrunville Cassiano
@WoolyCow
@WoolyCow 9 сағат бұрын
was 784 a reference to mnist? loved the vid, well explained and beautifully edited :D dropped a sub
@8AAFFF
@8AAFFF 7 сағат бұрын
Nice XD someone saw it Thx for the sub :)
@PseudoProphet
@PseudoProphet Күн бұрын
Wow, it could work 😮😮 You just need a better and more complete dataset. You should have also tried asking it questions that you knew were in its training data, to see its performance.
@vassa-vn7xz
@vassa-vn7xz Күн бұрын
How is this even working? Why does it not collapse to a single embedding for all words?
@DallasMcMillan
@DallasMcMillan Күн бұрын
Incredible project and so many insights into ai in a fun and digestible way! Thanks !!!! ❤❤❤
@juliansantos1900
@juliansantos1900 13 сағат бұрын
Crazy work, not to mention crazier animation. I know the concepts behind these AIs but don't have the extensive knowledge to write it on my own without LM libs 😆
@AverusMuto
@AverusMuto 4 сағат бұрын
This is very useful. Thank you.
@hypercoder-gaming
@hypercoder-gaming 22 сағат бұрын
With more training, this could definitely be very powerful.
@driss1227
@driss1227 13 сағат бұрын
The graphics are so great; curious what you used to produce this video? Looks like expertly used manim.
@alisyoung2741
@alisyoung2741 Күн бұрын
I have been working on one as well but am currently running into issues! So exciting!
@8AAFFF
@8AAFFF Күн бұрын
yooo gl man are you doing like a custom architecture?
@alisyoung2741
@alisyoung2741 18 сағат бұрын
Yes! I customized the standard U-Net architecture by rebuilding the bridge to process input using a Bi-LSTM, a memory system, and an attention mechanism before re-upsampling.
@alisyoung2741
@alisyoung2741 18 сағат бұрын
Your video actually inspired me to try and work on a kind of single token tokenizer that will produce a single unique token for any given input of a certain size, hopefully really large.
@mtlr3803
@mtlr3803 23 сағат бұрын
crazy video!!
@Moshi74533
@Moshi74533 Күн бұрын
sick bro, absolutely sick
@kamertonaudiophileplayer847
@kamertonaudiophileplayer847 11 сағат бұрын
You need to patent your approach. It's very interesting, although I use a slightly modified one.
@TimeLordRaps
@TimeLordRaps Күн бұрын
bros cracked. thank you fellow.
@VioletGiraffe
@VioletGiraffe 2 күн бұрын
Even your animations are cool, how did you make them? Or do you have another neural net to do that for you? :)
@8AAFFF
@8AAFFF 2 күн бұрын
thanks :), basically with just images / clips in davinci resolve. I put the almost final timeline at the end 18:26
@toofardoug2188
@toofardoug2188 Күн бұрын
I wonder if there's a better sampling mechanism when you're using the word2vec model? If you watch the GPT-from-scratch video you'll see that Andrej Karpathy doesn't just take the highest-probability token; he sometimes samples from among the top few values.
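For reference, this is roughly the sampling trick described here for a standard GPT-style head that outputs logits over a vocabulary (a generic sketch, not code from the video or from Karpathy's repo):

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=3, temperature=1.0):
    """Keep only the k highest logits, renormalize, and sample one token id."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]

# usage sketch (hypothetical model call):
# logits = model(context)[-1]            # logits over the vocabulary for the last position
# next_token = sample_top_k(logits, k=3)
```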
@Leb369
@Leb369 2 күн бұрын
Very good video, the only flaw is the sound quality.
@Kwenen
@Kwenen 12 сағат бұрын
4:00 It's intuitive to do it this way, so I'm surprised that big companies still choose to output distributions over tokens autoregressively.
@ЕгорКолов-ч5с
@ЕгорКолов-ч5с 7 сағат бұрын
Because you need the model to be able to produce different outputs. For example if you have "Cat sits on a tree" and "Cat sits on a sofa" in the training data, this trained model will always predict (tree + sofa) / 2 when given "Cat sits on a" as a prompt, and there is no remedy for this issue
@Kwenen
@Kwenen 5 сағат бұрын
@@ЕгорКолов-ч5с I don't think it matters, because what we usually want is an output, and we don't care about the value of a specific token (for example, 0~1 for emotion recognition). Current models also have situations where two tokens both sit at 0.5, which is handled by throwing a weighted die when an output is needed. The vector used in the video, (tree + sofa) / 2, also shows that this sentence can be followed by two words, so I think the model can still learn the usage of language very well. When calculating the similarity with the output, both are at 0.5, so just throw a die and everything is fine. I guess in the video the maximum value is always chosen, but there should be a chance of outputting the other word when the probabilities are half and half. This is like using a Markov chain, but letting the maximum value determine the transition. :)
@ЕгорКолов-ч5с
@ЕгорКолов-ч5с 3 сағат бұрын
@@Kwenen I don't really understand what you are referring to. I'm just relaying a simple fact: the output of this model is a 784-value embedding that corresponds to a vector in word2vec space, which is not as powerful as a probability distribution over tokens. Generating the next word is just taking the closest word in word2vec space to the generated embedding. Because of the way word2vec works: 1) embeddings of contextually close words will be closer in word2vec space, so even if you randomly choose 1 out of the 10 closest words to the embedding you will get synonyms at best and gibberish at worst; 2) because word2vec doesn't care about word order, a model trying to predict the next token this way will tend to produce chains of the same words over and over. The main reason nobody uses this method is that it fundamentally doesn't work.
@Kwenen
@Kwenen 2 сағат бұрын
@@ЕгорКолов-ч5с If the correction word2vec brings to the model is not strong enough and the training is too slow, that would really make me give up on this path. Maybe what I said was a bit out of focus: I mean that we only need the language model's output to carry the right meaning, even if it lands on a synonym. So I thought that outputting a vector (a fuzzy meaning) and then selecting from similar words should be enough to support communication, and that model might put a lighter load on the graphics card. Of course, if the model gives meaningless vectors, then choosing among their neighbors will really only produce gibberish, and then I can only say that's very sad. I naively thought that the positional encoding of the input and self-attention were enough to make the output position-sensitive. So the idea of predicting the next token as a vector doesn't really work? It just feels intuitive to me: it is easy to imagine a sequence of arrows in hyperspace pointing to different words in order. As you pointed out, this seems to be inefficient at the learning level; after all, the words are too discrete. Even if the output layer is easier, that doesn't mean the problem has become easier, right?
@takeraparterer
@takeraparterer 3 күн бұрын
kzbin.info/www/bejne/lXOVg3yjns2Xi6s that's not correct. gpt models predict every "next word" from a sequence at the same time
@8AAFFF
@8AAFFF 2 күн бұрын
yeah 100% correct i just lied about it in the beginning for the explanation to be easier, but i do later correct myself well done for noticing :)
@60pluscrazy
@60pluscrazy Күн бұрын
Amazing..how did you animate 👌🎉🎉🎉
@8AAFFF
@8AAFFF Күн бұрын
thanks :) all the animations are pretty much fully made up of davinci resolve images and clips and stuff i put the timeline at 18:26 if you want to see
@KristoferPettersson
@KristoferPettersson Күн бұрын
Aren't you using any sampling when picking the next token?
@ЕгорКолов-ч5с
@ЕгорКолов-ч5с 7 сағат бұрын
His architecture doesn't output a probability distribution, where is he supposed to sample from?
@Vine_Zx
@Vine_Zx Күн бұрын
Remember me when you make it big!
@8AAFFF
@8AAFFF Күн бұрын
i will ty
@ВладЧорний-ч4и
@ВладЧорний-ч4и Күн бұрын
This is fire
@user-qw1rx1dq6n
@user-qw1rx1dq6n Күн бұрын
you should probably use a cosine similarity loss
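A sketch of what that could look like in PyTorch, assuming the network outputs a predicted embedding and the target is the word2vec vector of the true next word (shapes and names are illustrative only):

```python
import torch
import torch.nn.functional as F

def cosine_loss(pred, target):
    # 1 - cos(pred, target), averaged over the batch; 0 when the directions match
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

# usage sketch with illustrative shapes
pred = torch.randn(8, 784)    # batch of predicted embeddings
target = torch.randn(8, 784)  # batch of word2vec target vectors
print(cosine_loss(pred, target))
```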
@TheTruthOfAI
@TheTruthOfAI Күн бұрын
Hahaha funny guy.. it's like reading a long gpt4 hallucination
@fortaber
@fortaber 3 күн бұрын
The editing of the video is just amazing!!
@MrNootka
@MrNootka Күн бұрын
Hello! Nice video. In the "Final word2vec Results" section, i.e. at 11:14 and 11:28, you had a space inside the value passed to similar_by_word in one and not in the other... I wonder if the space changes the results.
@v0idbyt3
@v0idbyt3 Күн бұрын
a space in the code would make the compiler or interpreter think that its something else, so it would make an error (which is a difference)
@8AAFFF
@8AAFFF Күн бұрын
thanks :) also well done for noticing, the space does change the results because it's a slightly different token in the word2vec (but they are really close to each other). i don't know why it's there, it's probably an accident, but if you're curious this is the output for "2021" with no space: [('2020', 0.7180283069610596), ('2022', 0.6416824460029602), ('2021 ', 0.6249533295631409), ('2019', 0.6035624742507935), ('October ', 0.5840676426887512), ('october ', 0.5773099660873413), ('January ', 0.5399696230888367), ('2020 ', 0.5389090776443481), ('2018', 0.5194795727729797), ('July ', 0.5182425379753113)]
@MrNootka
@MrNootka 21 сағат бұрын
@@8AAFFF Thanks for the reply, I mainly asked because of your "tokenization" approach. Anyway, I believe what you have cooked here has some serious potential! When I found your channel yesterday I binge-watched most of your videos, and this one and the dimensions simulator are my top favorites 😆. I am working on something similar, keep up the good work!
@tevincoburn1469
@tevincoburn1469 11 сағат бұрын
Dude. Great video but like... Bump up your audio by like 4db. You're so quiet I have to max out my speakers.
@8AAFFF
@8AAFFF 7 сағат бұрын
Thanks! A lot of people said that, was the general audio too quiet or just the voiceover?
@MommysGoodPuppy
@MommysGoodPuppy 23 сағат бұрын
holy GOATED
@averesenso
@averesenso 3 күн бұрын
Your voice is quiet on my speakers
@VioletGiraffe
@VioletGiraffe 2 күн бұрын
Fine for me, not quiet.
@rikhendrix261
@rikhendrix261 Күн бұрын
3:11 I thought GPT-3 had a 12288 embedding size. You are saying as high as 4096.
@8AAFFF
@8AAFFF Күн бұрын
tbh i asked chatgpt whats its embedding dim XD so idk if its correct i looked it up again and ur right the biggest version of gpt3 is 12k embedding dim, and openai is really secretive about gpt4 so im probably completly wrong on that. thanks for pointing out :)
@rikhendrix261
@rikhendrix261 Күн бұрын
​@@8AAFFF Its okay, I thought I might have been wrong. At 4:57 you say that you are going to compare the vector of the word it wants to predict to all the words in your database via their vector representations (like RAG with cosine similarity). But a word like "mole", for example, can be an animal, something on your cheek or arm, or something to do with the number of molecules, 6.02 x 10^23. Does this mean that your word database has these words written down multiple times? And at some point you said you had 340,000 words in your database?? instead of the 40,000 from OpenAI? I'm also interested to know what the most important thing you learned during this time was. I have only been learning about AI recently, so I'm all ears.
@8AAFFF
@8AAFFF Күн бұрын
ah i get what ur saying. basically the word2vec model has a bunch of tokens in its vocabulary. every token can appear only once, but usually there are some accidental half-duplicates like "mole", "mole ", " mole" etc... usually the duplicates have pretty much the same meaning as the "true" word just because they appear in exactly the same contexts. because the words are not written down multiple times, there are some misplaced words that have meanings in two different areas, so they are just awkwardly put in the middle of both "meaning areas". this doesnt really hurt the neural net because im relying on it understanding that even if a word's vector doesnt 100% match the situation, its still good because of the context of whatever its typing. as for the vocab size being 340k instead of something smaller like 40k, its due to me using a tokenizer that splits the text into bigger tokens, usually the size of full words, instead of smaller half-words like what openai uses. so for me "hello, world!" would be split into something like: "hello" "," " world" "!" and for openai the same thing would be split into something like: "hel" "lo" ", " "wo" "rld" "!" so openai needs fewer of these tokens to fully encode english. probably the biggest thing i learned with this was how to properly use tensorboard. its a library that lets you track stuff like the loss graphs in real time, compare how two different models trained, and more stuff like that. the best way to learn about ai stuff is to do big projects like this, because you actually encounter problems in the wild and then solve them, instead of just learning about solutions to problems you never had
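A small sketch of the word-level split described above, using a hypothetical regex tokenizer rather than the one actually used in the video:

```python
import re

text = "hello, world!"

# Word-level: whole words (keeping their leading space) plus punctuation as tokens,
# which is why the vocabulary ends up much larger (~340k entries).
word_level = re.findall(r" ?\w+|[^\w\s]", text)
print(word_level)  # ['hello', ',', ' world', '!']

# A BPE tokenizer would instead split into smaller subword pieces, covering English
# with a much smaller vocabulary (the ~40k figure mentioned above), e.g.:
# import tiktoken
# enc = tiktoken.get_encoding("gpt2")
# print([enc.decode([i]) for i in enc.encode(text)])
```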
@rikhendrix261
@rikhendrix261 Күн бұрын
​@@8AAFFF Wow, very interesting! Yes, I now understand why your token count was higher. This would also mean that for you a simple "normal" English sentence would consist of fewer total tokens than OpenAI's, which would save on compute. Do you by chance follow "Discover AI"? He has some very interesting videos on new test-time compute and test-time training, which according to the literature saves a lot of time and gets great results, but my level of AI understanding isn't at that point yet. Maybe you would be able to combine that tactic with your own? I'll follow you and see what more you will post.
@sandded7962
@sandded7962 Күн бұрын
That’s crazyyyyy
@8AAFFF
@8AAFFF Күн бұрын
cdn.discordapp.com/attachments/888851399490797620/1242133470235594853/attachment.gif?ex=675e49b1&is=675cf831&hm=dc928ebc5d6bb49010b1d0ce10dd3a420fbc86c69d8aeed38906f4dc3a526d0a&
@callowaysutton
@callowaysutton 10 сағат бұрын
I think you're running into issues at the front of the pipeline. When "translating" from the vocabulary to the tokens, try just blacklisting tokens already mentioned in the past 3 tokens up to the point you're translating at.
@8AAFFF
@8AAFFF 7 сағат бұрын
To be honest i didn't think of that. This could also work as some sort of temperature like GPTs have
@callowaysutton
@callowaysutton 6 сағат бұрын
@@8AAFFF Temperature would be more equivalent to sampling from a left-tailed random distribution over the list of tokens for the given category; this would just be a repeat penalty.
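A rough sketch of the blacklisting idea being discussed here, again assuming a gensim KeyedVectors model w2v, a predicted embedding pred_vec, and the list of words generated so far (all placeholder names, not the project's actual code):

```python
def pick_next_word(w2v, pred_vec, generated, window=3, topn=20):
    """Nearest-neighbour decoding that skips words used in the last `window` outputs."""
    recent = set(generated[-window:])
    for word, _sim in w2v.similar_by_vector(pred_vec, topn=topn):
        if word not in recent:
            return word
    # fall back to the plain nearest neighbour if everything nearby was recent
    return w2v.similar_by_vector(pred_vec, topn=1)[0][0]
```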
@lobiqpidol818
@lobiqpidol818 Күн бұрын
🤓 Well AksUaLly, each embedding vector takes up space on the device. So while you save space by vector-quantizing the output embeddings, the vocabulary is still limited by GPU memory. Also, you lose the ability to do some calculations on the output, like temperature. Good video
@yoavco99
@yoavco99 Күн бұрын
You can probably just have it not be on the gpu, and just check the closest token on like the CPU or whatever. Also can't you just easily recreate temperature with this?
@8AAFFF
@8AAFFF Күн бұрын
yeah thats exactly what im doing, the word2vec weights are stored in regular RAM and are only used to translate tokens back and forth. so the only stuff in GPU VRAM is the network and the already translated vectors. its true that i dont really have regular temperature like other GPT models, but i can sort of recreate it by either adding noise to the input or selecting the 2nd / 3rd closest word to the network output instead of the 1st :)
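A minimal sketch of that CPU/GPU split, assuming a gensim KeyedVectors model w2v kept in system RAM and a PyTorch network on the GPU (placeholder names, not the project's actual code):

```python
import numpy as np
import torch

def encode(w2v, tokens, device="cuda"):
    # token -> vector lookup happens on the CPU, inside the word2vec's own memory...
    vecs = np.stack([w2v[t] for t in tokens])
    # ...and only the translated float vectors are moved to GPU VRAM
    return torch.from_numpy(vecs).float().to(device)

def decode(w2v, pred_vec):
    # bring the predicted embedding back to the CPU and look up its nearest word
    vec = pred_vec.detach().cpu().numpy()
    return w2v.similar_by_vector(vec, topn=1)[0][0]
```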
@user-qw1rx1dq6n
@user-qw1rx1dq6n Күн бұрын
You can absolutely recreate temperature if you just train the embedding model differently
@lobiqpidol818
@lobiqpidol818 Күн бұрын
@@8AAFFF What I've been thinking about: what if you use very small embedding vectors, only 2 dims for example, to represent words, then immediately expand them to more dimensions with linear layers inside the model? Does the model see this as the same thing or as completely different?
@Myexpectationsarerealistic
@Myexpectationsarerealistic 12 сағат бұрын
I did the same thing.
@RasmusSchultz
@RasmusSchultz 21 сағат бұрын
interesting idea, except... it doesn't seem to work? 😅
@idf3da
@idf3da 3 күн бұрын
top!
@raihanhossain3423
@raihanhossain3423 Күн бұрын
Is that your real credit card number? 🙃
@8AAFFF
@8AAFFF Күн бұрын
one way to find out
@that_guy1211
@that_guy1211 21 сағат бұрын
Bro, not trynna be mean or anything, but your AI looks dumb as hell.... Keep working on it mah dude, would love to see this architecture get better with an actual decent LLM on it bruv!
@Tenraiden
@Tenraiden Күн бұрын
Speak louder!!