Go to piavpn.com/8AAFFF to get 83% off Private Internet Access with 4 months free (and support me :D)! thanks for watching! also discord server now: discord.gg/MC4wTeb4
@thefcraft8763Күн бұрын
It's nice, but I think your architecture has some flaws. For example, suppose the text is "This is a ...." and there are several possible next-word predictions like "dog, cow, mountain". "Dog" and "cow" are nearby in the vocab embedding space, but "mountain" might be far apart, and if you train your model on such cases it will average out the result and might give some nonsense or hallucinate (basically it might output the midpoint vector of dog, cow, and mountain).
@Myexpectationsarerealistic13 сағат бұрын
I did something similar. Not touching Rust.
@AndroidFerret7 сағат бұрын
The production and information value of this video is insane. How long did it take you to edit this?? Fantastic
@scoffpickle96553 күн бұрын
The reason the 160k-batch REAN was worse on the graphics card prompt is that the network is overfitting. I'd recommend using a test set with some prompts and choosing the model that performs best on that test set, instead of just running it with high batch counts.
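(A minimal sketch of that checkpoint-selection idea, assuming a PyTorch setup with a held-out validation loader; every name here is hypothetical, not from the video:)

import torch

def pick_best_checkpoint(model, checkpoint_paths, val_loader, loss_fn, device="cuda"):
    # Evaluate each saved checkpoint on a held-out set and keep the one with the
    # lowest validation loss, instead of just taking the highest batch count.
    best_path, best_loss = None, float("inf")
    for path in checkpoint_paths:
        model.load_state_dict(torch.load(path, map_location=device))
        model.eval()
        total, batches = 0.0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                preds = model(inputs.to(device))
                total += loss_fn(preds, targets.to(device)).item()
                batches += 1
        avg_loss = total / max(batches, 1)
        if avg_loss < best_loss:
            best_path, best_loss = path, avg_loss
    return best_path, best_loss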
@8AAFFF3 күн бұрын
ur right its most likely overfitted, the weird thing is that most other test prompts i was running were generally getting better with more batches so idk
@scoffpickle96553 күн бұрын
@8AAFFF It sounds like a data problem then: too little or not-general-enough data would lead to worse curve fitting. I suppose there wasn't much data about graphics cards, so it freaked tf out and kept spamming "graphics"
@8AAFFF2 күн бұрын
maybe, also possible that the graphics cards knowledge just got overshadowed because it was in the beginning of the dataset. i did some more tests today and basically it just seems to have some knowledge points that it tries sticking to no matter what the prompt is
@PaulanerStudiosКүн бұрын
@8AAFFF Are you using any sort of speculative decoding or temperature scaling? That wasn't mentioned in the video and does make quite a difference.
@NoSubsWithContentКүн бұрын
@@8AAFFF what if you used an existing super efficient model like the granite MoE with 400M active parameters to comb through a different dataset like FineWeb-Edu and produce a list of knowledge it could access during training via RAG or something? If you figure out a way to do that, I feel like it'd get much better performance, because it doesn't have to spend so much of its weights on memorizing stuff; instead it can learn actual patterns, intelligence even?
@salad_txt3 күн бұрын
You are so underrated it is actually insane, keep it up dude. Great stuff.
@rkpstam17 сағат бұрын
Good work, Oleg
@gilbertenevoldsen4469Күн бұрын
This video is super well made and informative! But I'm a bit curious why you chose the architecture that you did. The reason this way of outputting words isn't typically used in large language models is that it's useful for the model to have multiple high-probability candidates for the next word that aren't necessarily close to each other in vector space. For example, let's say a sentence comes up in training like "My favorite hobby is..." There are a lot of possibilities for the next word, so the model would be optimized to output the average vector of those possible answers, which likely isn't a sensible continuation of the sentence. I would love to see what you could make by doing it the traditional way, and how good a model you can train as a single person.
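(A small sketch of the averaging problem described in these comments, assuming a gensim-style word2vec model like the one in the video; the file path is made up:)

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load("word2vec.kv")  # hypothetical path to the trained word2vec model

# If "dog", "cow" and "mountain" are all valid continuations, a regression-style
# objective pushes the network toward their mean vector...
candidates = ["dog", "cow", "mountain"]
mean_vec = np.mean([w2v[word] for word in candidates], axis=0)

# ...and the nearest vocabulary word to that mean may be none of the three.
print(w2v.similar_by_vector(mean_vec, topn=5))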
@simonwillover417521 сағат бұрын
or um maybe reward it for simply choosing any word close to any option rather than the average?
@WoolyCow9 сағат бұрын
@@simonwillover4175 could just be weighted by distance as well, or even add in some error on purpose to get some more divergent responses
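(One way to sketch that suggestion, assuming the training data can supply several acceptable next-word vectors per position, which a plain next-word dataset doesn't directly give you; this only penalizes the distance to the closest valid candidate rather than to their average:)

import torch

def closest_candidate_loss(pred_vec, candidate_vecs):
    # pred_vec: (batch, dim) predicted embedding
    # candidate_vecs: (batch, num_candidates, dim) embeddings of acceptable next words
    dists = torch.norm(candidate_vecs - pred_vec.unsqueeze(1), dim=-1)  # (batch, num_candidates)
    # Reward being close to ANY valid option instead of regressing toward their mean.
    return dists.min(dim=1).values.mean()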
@toofardoug2188Күн бұрын
This is so high quality it's nuts! The editing is excellent. The explanations are crisp. The relative context to the SOTA for each variable choice is excellent. The origin and then evolution of the concepts is extremely valuable, such as the beginning/origin of tokenization that becomes embeddings.
@mrpro7737Күн бұрын
The editing skills in this video are harder than that new architecture 😂
@zaj0073 күн бұрын
18:25 Bro there has gyat to be a better way! I'm crying 😭😭 wtf is that timeline 💀💀
@8AAFFF3 күн бұрын
bro did the tower of babel editing technique ahh
@jaythecoderx4623Күн бұрын
This should have millions of views what the hell this is epic, very well edited too
@slowpoke1013 күн бұрын
Great video, these longer videos are always nice to see. Thank you for open-sourcing the code.
@jondoe6608Күн бұрын
Out of curiosity, are you aware of the RWKV architecture? It's an LLM that's based on a type of RNN; its main advantage is removing the hard context limit, making it possible to have longer contexts on weaker devices, since it uses a constant amount of memory. Your idea of using embeddings as the input and output is really cool, especially since it further reduces VRAM requirements.
@IceMetalPunk10 сағат бұрын
I mean... your network only predicts the most likely next token, whereas GPT models predict the probability of all tokens and sample from there (they don't just choose the highest-probability token); and your tokens are just entire words from the corpus. So it's like a GPT model that (a) always has a temperature of 0, and (b) can't understand anything that's not a word present in the corpus. I think we can see from that why GPT and its similar models didn't go this route 😅
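(For reference, a minimal sketch of the GPT-style decoding being described: logits over the whole vocab, softened by a temperature, then sampled; temperature 0 collapses to the argmax behaviour of always taking the single most likely token:)

import torch

def sample_next_token(logits, temperature=0.8):
    # logits: (vocab_size,) raw scores for every token in the vocabulary
    if temperature == 0:
        return int(torch.argmax(logits))  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))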
@jairjuliocc2 сағат бұрын
With respect to temperature, it's possible to find the k most similar neighboring vectors and assign each a probability based on its similarity score. In this way you can mimic temperature.
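(A rough sketch of that idea, assuming a gensim KeyedVectors model: the k nearest words to the predicted vector are turned into a softmax distribution over their similarity scores and then sampled:)

import numpy as np

def sample_word_from_embedding(w2v, pred_vec, k=10, temperature=0.7):
    neighbors = w2v.similar_by_vector(pred_vec, topn=k)  # [(word, cosine similarity), ...]
    words = [word for word, _ in neighbors]
    sims = np.array([sim for _, sim in neighbors])
    probs = np.exp(sims / temperature)   # lower temperature -> sharper distribution
    probs /= probs.sum()
    return np.random.choice(words, p=probs)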
@lionlight95143 күн бұрын
This is so cool man! Please, keep going.
@mrcavas2 сағат бұрын
Such a fire video! I love the style
@brams06Күн бұрын
I was shocked to see that this video has so few views. I feel so lucky to have come across this gem.
@PratikologyКүн бұрын
Wtf, why isn't this at a million views? Keep it up bro, what a fantastic blend of knowledge and creativity 🔥
@aamindehkordiКүн бұрын
Insane time spent and crazy W video. don't worry about compression or pacing this is gas and should blow up soon
@AragaminКүн бұрын
Dude, this is wonderful work. Glad to see that enthusiasm sometimes turns not only into a hobby but also into serious development :) Keep going on your path! P.S.: the video design is really cool, top marks.
@kotway23 күн бұрын
Very cool video and project man!
@Quozul20 сағат бұрын
This is an amazing project! And I love the graphics and visuals of the video too!
@devbuffer0112Күн бұрын
Creel style visuals, cool bgm and hot topics related to CS. You're gonna become huge
@AllExistence2 күн бұрын
You seem to have gone down a weird route with training. Normally, networks are first trained on plain text to learn normal language. Then they are finetuned with "human/assistant" data to actually answer questions instead of talking to themselves.
@8AAFFFКүн бұрын
yeah that's true, it's just that the higher quality human/assistant dataset was so big that i didn't need to first train on raw text
@bedahtproКүн бұрын
Great quality video man!
@TeamDmanКүн бұрын
Your animations are awesome :o
@v0idbyt3Күн бұрын
damn, you made davinci resolve go kaboom at the end. btw cool video! i hope this architecture eventually gets a remake or succeeds, because this could be a way better alternative to the GPT architecture.
@sandded796223 сағат бұрын
Hi, can you elaborate on the 12:43 part where the circular text says the following: "I'm edging, I'm edging, I'm edging, I'm edging"?
@absentmindhere14 сағат бұрын
nice
@A_Me_Amy19 сағат бұрын
i like ur ad, or rather ur general artistic style. Also, for the model, I think the idea of the vocabulary space makes sense. There is research that came out today from Meta that could pair with this fairly well, about LCMs as opposed to LLMs: it works on small sentences with limited tokens, and I could imagine translating any sentence into the 768 vocab space, or something like this. I'm not technically aware enough to contribute more than to say this; perhaps a word2word2vec2word2word process, so that it can still speak and understand the full vocab list, but processes the core essence in the smaller architecture. I do think figuring this out is the sort of future, or that there is a lot possible. Oh, and the same person who talked about this paper today also talked about other research from Princeton about slow and shorter training leading to more in-context learning (ICL), and that at some point when training weights the model loses the ability to do this. But yeah, the most capable reasoning model at the lowest possible size is the new version of Moore's law: AI gets twice as smart every 2 years and half as large. I am quite sure this will be the trend, to be honest.
@w花b3 сағат бұрын
It's nice to have something that's not manim for once
@xorcise12 сағат бұрын
8:20 ah yes good birchrunville Cassiano
@WoolyCow9 сағат бұрын
was 784 a reference to mnist? loved the vid, well explained and beautifully edited :D dropped a sub
@8AAFFF7 сағат бұрын
Nice XD someone saw it Thx for the sub :)
@PseudoProphetКүн бұрын
Wow, it could work 😮😮 You just need a better and more complete dataset. You should have also tried asking it questions that you knew were in its training data, to see its performance.
@vassa-vn7xzКүн бұрын
How is this even working? Why does it not collapse to a single embedding for all words?
@DallasMcMillanКүн бұрын
Incredible project and so many insights into ai in a fun and digestible way! Thanks !!!! ❤❤❤
@juliansantos190013 сағат бұрын
Crazy work, not to mention crazier animation. I know the concepts behind these AIs but don't have the extensive knowledge to write it on my own without LM libs 😆
@AverusMuto4 сағат бұрын
This is very useful. Thank you.
@hypercoder-gaming22 сағат бұрын
With more training, this could definitely be very powerful.
@driss122713 сағат бұрын
The graphics are so great. Curious what you used to produce this video? Looks like expertly used manim.
@alisyoung2741Күн бұрын
I have been working on one as well but am currently running into issues! So exciting!
@8AAFFFКүн бұрын
yooo gl man are you doing like a custom architecture?
@alisyoung274118 сағат бұрын
Yes! I customized the standard U-Net architecture by rebuilding the bridge to process input using a Bi-LSTM, a memory system, and an attention mech before re-upsampling.
@alisyoung274118 сағат бұрын
Your video actually inspired me to try and work on a kind of single token tokenizer that will produce a single unique token for any given input of a certain size, hopefully really large.
@mtlr380323 сағат бұрын
crazy video!!
@Moshi74533Күн бұрын
sick bro, absolutely sick
@kamertonaudiophileplayer84711 сағат бұрын
You need to patent your approach. It's very interesting, although I use a slightly modified one.
@TimeLordRapsКүн бұрын
bros cracked. thank you fellow.
@VioletGiraffe2 күн бұрын
Even your animations are cool, how did you make them? Or do you have another neural net to do that for you? :)
@8AAFFF2 күн бұрын
thanks :), basically with just images / clips in davinci resolve. I put the almost final timeline at the end 18:26
@toofardoug2188Күн бұрын
I wonder if there's a better sampling mechanism when you're using the word2vec model? If you watch the GPT-from-scratch video from Andrej Karpathy, you'll see he doesn't just take the highest-predicted token; he sometimes samples from among the top 3 values.
@Leb3692 күн бұрын
very good video, the only drawback is the sound quality.
@Kwenen12 сағат бұрын
4:00 It's intuitive to do so, but I'm surprised that big companies still choose to output Token regressions
@ЕгорКолов-ч5с7 сағат бұрын
Because you need the model to be able to produce different outputs. For example if you have "Cat sits on a tree" and "Cat sits on a sofa" in the training data, this trained model will always predict (tree + sofa) / 2 when given "Cat sits on a" as a prompt, and there is no remedy for this issue
@Kwenen5 сағат бұрын
@@ЕгорКолов-ч5с I don't think it matters, because what we usually want is an output, and we don't care about the value of a specific token (for example, 0~1 for emotion recognition). Current models also have situations where two tokens are both at 0.5, which is handled by throwing a weighted die when an output is needed. The vector used in the video, (tree + sofa) / 2, also shows that this sentence can be followed by two words, so I think the model can still learn the usage of language very well. When calculating the similarity with the output, if both are 0.5, just throw a die and everything is fine. I'd guess that in the video the maximum value is always chosen, whereas there should be a chance of outputting the other word when the probability is half and half. This is like using a Markov chain, but letting the maximum value determine the transition. :)
@ЕгорКолов-ч5с3 сағат бұрын
@@Kwenen I don't really understand what you are referring to. I'm just relaying a simple fact: the output of this model is a 784-value embedding that corresponds to a vector in word2vec space, which is not as powerful as a probability distribution over tokens. Generating the next word is just taking the closest word in word2vec space to the generated embedding. Because of the way word2vec works, 1) embeddings of contextually close words will be closer in word2vec space, so even if you randomly choose 1 of the 10 closest words to the embedding you will get synonyms at best and gibberish at worst, and 2) because word2vec doesn't care about word order, a model trying to predict the next token this way will tend to produce chains of the same words over and over. The main reason nobody uses this method is that it fundamentally doesn't work.
@Kwenen2 сағат бұрын
@@ЕгорКолов-ч5с If the correction word2vec brings to the model is not strong enough and the training is too slow, that would really make me give up on this path. Maybe what I said was a bit out of focus. I mean, we hope that the language model outputting the right meaning is enough, even if it's a synonym. So I thought that if it outputs a vector (a fuzzy meaning) and then selects from similar words, that should be enough to support communication, and such a model might put a lighter load on the graphics card. Of course, if the model gives meaningless vectors, then choosing from them will really only result in a bunch of gibberish, and then I can only say that it is very sad. And I naively thought that the positional encoding of the input and self-attention were enough to make the output position-sensitive. So the idea of predicting the next token as a vector doesn't really work? It's just that this method feels intuitive to me: it is easy to imagine a sequence of arrows in hyperspace, pointing to different words in sequence. As you point out, this seems to be inefficient at the learning level. After all, words are too discrete; even if it is easier at the output layer, that doesn't mean things have actually become easier, right?
@takeraparterer3 күн бұрын
kzbin.info/www/bejne/lXOVg3yjns2Xi6s that's not correct. GPT models predict the "next word" for every position in the sequence at the same time
@8AAFFF2 күн бұрын
yeah 100% correct, i just lied about it in the beginning for the explanation to be easier, but i do later correct myself. well done for noticing :)
@60pluscrazyКүн бұрын
Amazing..how did you animate 👌🎉🎉🎉
@8AAFFFКүн бұрын
thanks :) all the animations are pretty much fully made up of davinci resolve images and clips and stuff. i put the timeline at 18:26 if you want to see
@KristoferPetterssonКүн бұрын
Aren't you using any sampling when picking the next token?
@ЕгорКолов-ч5с7 сағат бұрын
His architecture doesn't output a probability distribution, where is he supposed to sample from?
@Vine_ZxКүн бұрын
Remember me when you make it big!
@8AAFFFКүн бұрын
i will ty
@ВладЧорний-ч4иКүн бұрын
This is fire
@user-qw1rx1dq6nКүн бұрын
you should probably use a cosine similarity loss
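(If the network is currently regressing onto word2vec targets with something like MSE, a cosine-based objective would only care about direction, which matches how the nearest word is looked up at the output; a minimal PyTorch sketch of the suggestion:)

import torch.nn.functional as F

def cosine_loss(pred_vecs, target_vecs):
    # pred_vecs, target_vecs: (batch, dim) predicted and target word2vec embeddings
    return (1 - F.cosine_similarity(pred_vecs, target_vecs, dim=-1)).mean()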
@TheTruthOfAIКүн бұрын
Hahaha funny guy.. it's like reading a long gpt4 hallucination
@fortaber3 күн бұрын
The editing of the video is just amazing!!
@MrNootkaКүн бұрын
Hello! Nice video. In the section "Final word2vec Results", i.e. at 11:14 and 11:28, you had a space inside the value given to similar_by_word in one and not in the other... I wonder if the space changes the results
@v0idbyt3Күн бұрын
a space in the code would make the compiler or interpreter think that it's something else, so it would make an error (which is a difference)
@8AAFFFКүн бұрын
thanks :) also well done for noticing. the space does change the results because it's a slightly different token in the word2vec (but they are really close to each other). i don't know why it's there, it's probably an accident, but if you're curious this is the output for "2021" with no space: [('2020', 0.7180283069610596), ('2022', 0.6416824460029602), ('2021 ', 0.6249533295631409), ('2019', 0.6035624742507935), ('October ', 0.5840676426887512), ('october ', 0.5773099660873413), ('January ', 0.5399696230888367), ('2020 ', 0.5389090776443481), ('2018', 0.5194795727729797), ('July ', 0.5182425379753113)]
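(For anyone curious, the comparison being discussed looks roughly like this in gensim, assuming both "2021" and "2021 " exist as separate tokens in the vocabulary; the model path is made up:)

from gensim.models import KeyedVectors

w2v = KeyedVectors.load("word2vec.kv")        # hypothetical path to the video's model
print(w2v.similar_by_word("2021", topn=10))   # no trailing space
print(w2v.similar_by_word("2021 ", topn=10))  # trailing-space token: close, but slightly different neighbors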
@MrNootka21 сағат бұрын
@@8AAFFF Thanks for the reply, I mainly asked because of your "tokenization" approach. Anyway, I believe what you have cooked up here has some serious potential! When I found your channel yesterday I binge-watched most of your videos, and this one and the dimensions simulator are my top favorites 😆. I am working on something similar, keep up the good work!
@tevincoburn146911 сағат бұрын
Dude. Great video but like... Bump up your audio by like 4db. You're so quiet I have to max out my speakers.
@8AAFFF7 сағат бұрын
Thanks! A lot of people said that; was the general audio too quiet or just the voiceover?
@MommysGoodPuppy23 сағат бұрын
holy GOATED
@averesenso3 күн бұрын
Your voice is quiet on my speakers
@VioletGiraffe2 күн бұрын
Fine for me, not quiet.
@rikhendrix261Күн бұрын
3:11 i thought GPT-3 had a 12288 embedding size. You are saying as high as 4096.
@8AAFFFКүн бұрын
tbh i asked chatgpt what its embedding dim is XD so idk if that's correct. i looked it up again and ur right, the biggest version of gpt3 is a 12k embedding dim, and openai is really secretive about gpt4 so im probably completely wrong on that. thanks for pointing it out :)
@rikhendrix261Күн бұрын
@@8AAFFF It's okay, I thought I might have been wrong. At 4:57 you are saying that you are going to compare the vector of the word it wants to predict to all the words in your database via their vector representations (like RAG with cosine similarity). But a word like "mole", for example, can be an animal, something on your cheek or arm, or something to do with the total number of molecules, 6.02 x 10^23. Does this mean that your word database has these words written down multiple times? And at some point you said you had 340,000 words in your database, instead of the 40,000 from OpenAI? I'm also interested to know what the most important thing you learned during this time was. I have only been learning about AI recently, so I'm all ears.
@8AAFFFКүн бұрын
ah i get what ur saying. basically the word2vec model has a bunch of tokens in its vocabulary. every token can appear only once, but usually there are some accidental half-duplicates like "mole", "mole ", " mole" etc... usually the duplicates have pretty much the same meaning as the "true" word, just because they appear in exactly the same contexts.

because the words are not written down multiple times, there are some misplaced words that have meanings in two different areas, so they are just awkwardly put in the middle of both "meaning areas". this doesnt really hurt the neural net because im relying on it understanding that even if a word's vector doesnt 100% match the situation, its still good because of the context of whatever its typing.

as for the vocab size being 340k instead of something smaller like 40k, its due to me using a tokenizer that splits the text into bigger tokens, usually the size of full words, instead of smaller half-words like what openai uses. so for me "hello, world!" would be split into something like: "hello" "," " world" "!" and for openai the same thing would be split into something like: "hel" "lo" ", " "wo" "rld" "!" so openai needs fewer of these tokens to fully encode english.

and probably the biggest thing i learned with this was how to properly use tensorboard. its a library that lets you track stuff like the loss graphs in real time, compare how two different models trained, and more stuff like that. the best way to learn about ai stuff is to do big projects like this, because you actually encounter problems in the wild and then solve them, instead of just learning about solutions to problems you never had
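(The TensorBoard workflow mentioned above, as a minimal self-contained sketch; the run name and the fake loss curve are just placeholders:)

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/word2vec_gpt_experiment")
for step in range(1000):
    loss = 1.0 / (step + 1)                      # stand-in for a real training loss
    writer.add_scalar("loss/train", loss, step)  # live loss curve in the TensorBoard UI
writer.close()
# then run: tensorboard --logdir runs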
@rikhendrix261Күн бұрын
@@8AAFFF Wow, very interesting! Yes, I now understand why your token count was higher. This would also mean that for you a simple "normal" English sentence would consist of fewer total tokens than OpenAI's, which would save on compute. Do you by any chance follow "Discover AI"? He has some very interesting videos on new test-time compute and test-time training, which according to the literature saves a lot of time and has great results, but my level of AI understanding isn't at that point yet. Maybe you would be able to combine that tactic with your own? I'll follow you and see what more you post.
I think you're running into issues at the front of the pipeline. When "translating" from the vocabulary to the tokens, try just blacklisting tokens already mentioned in the past 3 tokens up to the point you're translating at.
@8AAFFF7 сағат бұрын
To be honest i didn't think of that. This could also work as some sort of temperature like GPTs have
@callowaysutton6 сағат бұрын
@@8AAFFF Temperature would be more equivalent to doing a left-tailed random distribution over the list of tokens for the given category; this would just be a repeat penalty
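(A sketch of the repeat-penalty idea from this thread, assuming a gensim KeyedVectors vocabulary: skip any nearest-neighbor word that already appeared in the last few generated tokens:)

def nearest_word_with_repeat_block(w2v, pred_vec, recent_words, topn=10):
    # Walk the nearest neighbors of the predicted vector, skipping recent outputs.
    for word, _score in w2v.similar_by_vector(pred_vec, topn=topn):
        if word not in recent_words:
            return word
    # fallback: everything nearby was recently used, allow a repeat
    return w2v.similar_by_vector(pred_vec, topn=1)[0][0]

# usage sketch: next_word = nearest_word_with_repeat_block(w2v, out_vec, generated[-3:])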
@lobiqpidol818Күн бұрын
🤓 well AksUaLly each embedding vector takes up space on the device. So while you save space by vector-quantizing the output embeddings, the vocabulary is still limited by GPU memory. Also you lose the ability to do some calculations on the output, like temperature. Good video
@yoavco99Күн бұрын
You can probably just have it not be on the gpu, and just check the closest token on like the CPU or whatever. Also can't you just easily recreate temperature with this?
@8AAFFFКүн бұрын
yeah that's exactly what im doing. the word2vec weight is stored in regular RAM and is only used to translate tokens back and forth. so the only stuff in GPU VRAM is the network and the already-translated vectors. its true that i dont really have regular temperature like other GPT models, but i can sort of recreate it by either adding noise to the input or selecting the 2nd / 3rd closest word to the network output instead of the 1st :)
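(Roughly what that split looks like in code, assuming a gensim KeyedVectors table and a PyTorch network; the path and function names are made up:)

import numpy as np
import torch
from gensim.models import KeyedVectors

w2v = KeyedVectors.load("word2vec.kv")  # lives in normal RAM, never copied to the GPU

def encode_prompt(words, device="cuda"):
    # translate words -> vectors on the CPU, then move only the vectors to VRAM
    vecs = np.stack([w2v[word] for word in words])
    return torch.from_numpy(vecs).float().to(device)

def decode_output(pred_vec):
    # bring the predicted embedding back to the CPU and translate it with the RAM-side table
    return w2v.similar_by_vector(pred_vec.detach().cpu().numpy(), topn=1)[0][0]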
@user-qw1rx1dq6nКүн бұрын
You can absolutely recreate temperature if you just train the embedding model differently
@lobiqpidol818Күн бұрын
@@8AAFFF what I've been thinking about: what if you use very small embedding vectors, only 2 dims for example, to represent words, then immediately expand them to more dimensions with linear layers inside the model? Does the model see this as the same thing or as completely different?
@Myexpectationsarerealistic12 сағат бұрын
I did the same thing.
@absentmindhere14 сағат бұрын
nice
@RasmusSchultz21 сағат бұрын
interesting idea, except... it doesn't seem to work? 😅
@idf3da3 күн бұрын
top!
@raihanhossain3423Күн бұрын
Is that your real credit card number? 🙃
@8AAFFFКүн бұрын
one way to find out
@that_guy121121 сағат бұрын
Bro, not trynna be mean or anything, but your AI looks dumb as hell.... Keep working on it mah dude, would love to see this architecture get better with an actual decent LLM on it bruv!