You are in a league of your own when it comes to explaining complex concepts! Thanks, and please never stop :)
@lucidateAI (2 months ago)
Thank you for your kind words and compliments. Greatly appreciated!
@NithinDinesh-l3h (3 months ago)
What an awesome video. Probably the best video on the internet for positional encodings. Loved every bit of it.
@lucidateAI (3 months ago)
Glad you enjoyed it!
@gnorts_mr_alien (1 year ago)
I've probably watched 50+ transformer-related videos and this is the only one explaining positional encodings in a way that makes sense to me.
@lucidateAI (1 year ago)
Thanks. Greatly appreciated! Have you had a chance to look at any of the other videos in the playlist that this video is taken from? kzbin.info/aero/PLaJCKi8Nk1hwaMUYxJMiM3jTB2o58A6WY. Lucidate.
@gnorts_mr_alien (1 year ago)
@@lucidateAI I'm going through all as we speak
@lucidateAI (1 year ago)
Let me know what you think. Hopefully they prove insightful.
@jeremylee6373 (11 months ago)
One of the best videos on LLMs I've seen so far.
@lucidateAI (11 months ago)
Many thanks! Let me know what you think of the other videos in this playlist kzbin.info/aero/PLaJCKi8Nk1hwaMUYxJMiM3jTB2o58A6WY specifically and on the Lucidate channel more broadly.
@snehotoshbanerjee1938 (1 year ago)
Best explanation of positional embeddings.
@lucidateAI (1 year ago)
That is very kind of you to say so. Glad you found it enlightening. Did you find the other videos in this series on transformers as insightful?
@syedtahsin9717 (1 year ago)
Very detailed explanation. Loved it!! You're an amazing teacher.
@lucidateAI (1 year ago)
Thanks for your support of the channel. I’m glad you found it useful.
@fraternitas5117 (1 year ago)
I love your style Luci!
@lucidateAI (1 year ago)
Thanks! Glad you liked the video.
@JorgeMartinez-xb2ks (11 months ago)
What a great teacher you are
@lucidateAI (11 months ago)
Your kind words are greatly appreciated. I’m glad you are finding the content on the channel so useful.
@jonclement (1 year ago)
Bravo. I definitely appreciate the craftsmanship of the graphics.
@lucidateAI (1 year ago)
Thank you! Cheers!
@StoutProper (1 year ago)
Honestly these videos are superb. I've seen appalling videos that, because of the stupid KZbin algorithm, have got far more views; I honestly think Google and KZbin prefer to promote scams and bullshit artists rather than quality content like this. This channel will blow up, I'm sure of it. This is easily one of the best AI channels out there.
@lucidateAI (1 year ago)
Thanks @Guinness. I appreciate you taking the time to watch the videos, your support of the channel and your generous comment. Any subjects you would like covered in future videos?
@StoutProper (1 year ago)
@@lucidateAI Mate, it's a pleasure; I'll get through watching them all at some point. Honestly I think you've got a much better idea than I have, I'm learning here. Although if I was to be so bold as to dare to suggest a few ideas: one would be to look to the future of AI, possibly at the potential of new hardware like neuromorphic analog chips and where that will take AI, what it might mean for how software might evolve, and also at the increasing areas AI could reside. So potentially you could have a sparse version of GPT-2 or the equivalent on your phone that is learning from you; you could have CCTV, as in your doorbell, that will learn who regular visitors are, like family and the postman and pizza delivery drivers, and will only alert you when someone it doesn't know or recognise calls; otherwise it will just send you a notification of all the people who called that day, that kind of thing. Basically, what's the future of AI applications on our phones and in our homes? That's the micro level. At the macro level there's how AI is going to affect employment and society in the next 5-10 years. My thesis was on this and personally I think it's going to be profound.
@lucidateAI (1 year ago)
These are great ideas. I'm not sure that I'm qualified to comment on them. I do have a series of videos on AI based around an explanation of computer vision - the playlist is here -> kzbin.info/aero/PLaJCKi8Nk1hz2OB7irG0BswZJx3B3jpob; the final video on 'YOLO' - You Only Look Once - is here -> kzbin.info/www/bejne/l2OziXl-fKyKbNU. This gets to the 'Is it the postman at the door?' use-case. I'd recommend checking that out first to see if you want to look at the more fundamental content that precedes it. But I'll confess that this is not quite what you described: it isn't about the future of AI or computer vision, it is just about how neural networks work and how they can solve interesting problems like computer vision. But if you do get a chance to check out some of the videos in this playlist (and I'd be grateful if you do) then I'd love to hear what you think. Once again, really appreciate the engagement with, and contribution to, the channel. - Lucidate.
@StoutProper (1 year ago)
@@lucidateAI I'll watch them all, don't worry. Your channel is a treasure trove. You work in AI, I assume?
@lucidateAI (1 year ago)
Thanks for your kind words. I have a consultancy company called Lucidate that offers AI and DeFi solutions to capital markets. You can check out the website at www.lucidate.co.uk for more info. If you have a chance to watch the computer vision series - which serves as a primer on neural networks - I'd really appreciate your feedback. Thanks again.
@hemanthyernagula9759 (1 year ago)
Hola, nice explanation. Waiting for the next video!!
@lucidateAI (1 year ago)
Thanks Hemanth, I'm glad you found the explanation useful. Here is a link to the next video where we focus on self-attention. If you get a chance to take a look at it then I'd be grateful for any feedback, commentary or suggestions for other video explainers that you may have.
@M3t4lik (1 year ago)
Excellent.
@lucidateAI (1 year ago)
Many thanks!
@VKMaxim (1 year ago)
Thanks a lot for your video. There is one moment I didn't understand. If I understood you right, you are stating that the distance between two vectors before and after adding one-hot encoding won't change. However, if the original vectors, for example, are [1,2,3] and [10,20,30], and after encoding they change to [2,2,3] and [10,20,31], the distance will change.
@lucidateAI (1 year ago)
Hi @VKMaxim. I'm not sure I correctly understand your question, so please let me know if the answer below is an answer to another question - i.e., one that you did not ask! What I am saying is that in a vast, multi-dimensional space like word embeddings, most of the space is "empty". Words like "cat" and "dog" will be close, as will "red" and "orange", or "micro" and "mini". But, relatively speaking, there will be vast distances between the vectors representing unrelated words. When you add the positional encoding vector, the difference that this makes is so tiny that the vector for "red" in the first position and the vector for "red" in the 11th position would not be confused with the vector for "orange" in any position. The point is that positional encoding doesn't really change semantic distances in high-dimensional embedding spaces.
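A quick numerical sketch makes the scale of the effect concrete. The two 300-dimensional vectors below are made up for illustration (they are not taken from a real embedding model); the point is only that a sin/cos positional vector, whose components all lie in [-1, 1], is small relative to typical embedding norms, so cosine similarity barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, "semantically close" 300-dim embedding vectors with
# norms much larger than 1.
red = rng.normal(size=300) * 5.0
orange = red + rng.normal(size=300) * 0.5

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

before = cos_sim(red, orange)

# Sinusoidal positional encoding for position 10 (AIAYN formulation):
# every component is in [-1, 1], so the whole vector's norm is at most sqrt(300).
denom = 10000.0 ** (np.arange(150) / 150)
pos10 = np.empty(300)
pos10[0::2] = np.sin(10 / denom)
pos10[1::2] = np.cos(10 / denom)

# Shifting "red" to position 10 barely changes its similarity to "orange".
after = cos_sim(red + pos10, orange)
print(before, after)  # both remain close to 1
```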
@VKMaxim (1 year ago)
@@lucidateAI Thank you for your answer. I've rewatched your video. As far as I understood, the key point is that one-hot encoding doesn't capture relative position information. Since all positional vectors have a length of one and are orthogonal to each other, the distance between any two of them is always sqrt(2). What I misunderstood was that at 10:22, when you said the distance between two vectors wouldn't change, you were referring to the positional embedding vectors. I mistakenly thought you meant the resulting vectors, i.e., the original embedding vector added to the vector with positional encoding.
@lucidateAI (1 year ago)
Again, please accept my apologies if I misunderstand your question and comment, but positional encoding and OHE are two very different things. This video kzbin.info/www/bejne/iKXcnnRuotKIgqcfeature=shared delves deeper into OHE and explains why it is necessary.

The key difference between one-hot encoding (OHE) and positional encoding is that OHE treats each category independently, while positional encoding preserves the ordering of positions. One-hot encoding assigns each unique category value a binary vector with a 1 at the index for that category and 0s elsewhere. For example, take the sentence: "The quick brown fox jumps over the lazy dog". With nine tokens we could one-hot encode the words as:
"The" = [1,0,0,0,0,0,0,0,0]
"quick" = [0,1,0,0,0,0,0,0,0]
"brown" = [0,0,1,0,0,0,0,0,0]
"fox" = [0,0,0,1,0,0,0,0,0]
"jumps" = [0,0,0,0,1,0,0,0,0]
"over" = [0,0,0,0,0,1,0,0,0]
"the" = [0,0,0,0,0,0,1,0,0]
"lazy" = [0,0,0,0,0,0,0,1,0]
"dog" = [0,0,0,0,0,0,0,0,1]

Positional encoding instead uses numeric vectors based on the position of each word. Using the formulation from Attention Is All You Need, with embedding width d, position pos gets the vector
PE(pos) = [sin(pos/10000^(0/d)), cos(pos/10000^(0/d)), sin(pos/10000^(2/d)), cos(pos/10000^(2/d)), ..., sin(pos/10000^((d-2)/d)), cos(pos/10000^((d-2)/d))]
so:
"The" (position 0) = [0, 1, 0, 1, ..., 0, 1]
"quick" (position 1) = [sin(1), cos(1), sin(1/10000^(2/d)), cos(1/10000^(2/d)), ...]
"brown" (position 2) = [sin(2), cos(2), sin(2/10000^(2/d)), cos(2/10000^(2/d)), ...]
...
The trig functions preserve the ordering and spacing between words. I hope these examples clearly illustrate the difference between the two encodings! Let me know if I've misunderstood your question.
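To make the sinusoidal formulation concrete, here is a short, self-contained NumPy sketch of the Attention Is All You Need encoding (the function name is my own; the formula is the paper's):

```python
import numpy as np

def positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(num_positions)[:, None]   # shape (num_positions, 1)
    even_idx = np.arange(0, d_model, 2)[None, :]    # the "2i" values, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_idx / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even slots get sin
    pe[:, 1::2] = np.cos(angles)  # odd slots get cos
    return pe

pe = positional_encoding(num_positions=9, d_model=300)
# Position 0 has sin(0) = 0 in every even slot and cos(0) = 1 in every odd slot.
print(pe[0, :4])  # [0. 1. 0. 1.]
```

Each row is the encoding for one position; adding row `pos` to a word's embedding injects its position in the sequence.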
@fidhalkotta (1 year ago)
Amazing content
@lucidateAI (1 year ago)
Thanks Fidhal for supporting the channel and for your generous comment.
@greengraycolor (1 year ago)
Amazing content, thank you! I got a bit lost in the transitions from one-hot encoding, to the sin/cos formulas, to graphs, and back to values in a matrix. It would be great to, e.g., indicate particular values on a graph and show where they slot into the matrix. Also, it would be great to see a numerical example of how Euclidean distances between words change depending on position in the sequence.
@lucidateAI (1 year ago)
Hi @greengraycolor. Glad you liked the video, but sorry that I wasn't able to get some of the concepts across so well. At 13:30 I was trying to illustrate the addition of positional encodings to word embedding vectors. Here I'm using a 3D subspace (naturally the actual embedding spaces are 300, 512 or similar dimensions, which I can't draw). Here we have two words - 'Swap' & 'Bond' - which in a capital markets context share many similarities. For instance, they are both financial instruments, they both make periodic payments and the price of both will vary with interest rates. And there are also differences - one is a derivative and the other is a security. So you would expect them to have some similarity in a word embedding space, while clearly not being identical, meaning 'swap' would have a greater cosine similarity to 'bond' than 'swap' would to 'equity' or 'compliance'.

The unit circle drawn around, and centred on, the tip of the swap vector at 13:42 is intended to represent the 'bounding unit sphere' for all of the possible positional encodings (as they are all outputs of sin and cos functions they will lie on a hypersphere of radius one). The arrows that appear at 13:44 represent the positional encodings added to the embedding vector for 'swap'. As multidimensional hyperspace is essentially empty, the chances of these small additions altering the meaning of a word are very low indeed; i.e. the hyperspheres around the swap and bond vectors wouldn't overlap.

At least that was my concept... but clearly it didn't work. The whole point of the videos is for the people that _watch_ them to understand them, _not the people that wrote them_. So your question is a good one and I'll have to have a think about how I'd visualise this in a different way. If I might ask you to look at it again, might you be able to suggest a way I could improve the visualisation to get this point across? Many thanks for your question and contribution to the channel. - Lucidate
@greengraycolor (1 year ago)
@@lucidateAI Thank you for a very extensive and clear reply! I am afraid my question was not sufficiently clear: I already got all the points you explain in the reply from the video. My problem is in understanding where exactly the numerical values representing position come from. What would help me, and perhaps others, is a numerical example starting with 1. the equation, i.e. plugging numbers into the equation, 2. showing the resulting value on the graph with the sin/cos "waves", 3. highlighting that value in the appropriate place in the matrix.
@lucidateAI (1 year ago)
Understood. Then that would be at 12:04. Here I'm taking the embedding length to be 300. If you take a look at the position matrix you will see that in its first column the 0th, 2nd & 4th rows have a value of 0.0 and the 1st and 3rd rows have a value of 1.0. This is because the even rows have sin(0) and the odd rows have cos(0). For the next column you advance the position by one and calculate the values again. I'm sticking with the "Attention Is All You Need" formula, which has the "magic number" coefficient of 10,000. If you go to pp. 6 of AIAYN -> arxiv.org/pdf/1706.03762.pdf you can see the formula used for the positional encodings. If you plug in a d_model of 300 you should get the exact numbers that I got in the video. Here is the Python code that I used for the manim animation for this scene, so if there is a bug in my code (it happens a lot!) at least you will be able to see where the discrepancy in the numbers comes from (though I hope there is no bug in my code or logic, so you should be ok to reproduce this yourself).
=========================
# Requires numpy, math and the manim objects (Square, Text, VGroup, GREEN, ...)
# plus the scene's own ohe_fills, ohe_g and FONT_SIZE definitions.
mat2 = np.zeros((5, 5))
for x in range(0, 5):          # x indexes the embedding dimension (matrix row)
    row = []
    for y in range(0, 5):      # y indexes the token position (matrix column)
        b = Square(side_length=0.8, fill_color=ohe_fills[y], fill_opacity=1,
                   stroke_width=2, stroke_color=GREEN).shift(RIGHT * (y + 1), UP * (2 - x))
        if x % 2 == 0:
            # even rows: sin(pos / 10000^(2x/300)), rounded for display
            i = round(math.sin(y / math.pow(10000, 2 * x / 300)), 4)
        else:
            # odd rows: cos(pos / 10000^(2x/300))
            i = round(math.cos(y / math.pow(10000, 2 * x / 300)), 4)
        q = Text(str(i), font_size=FONT_SIZE, color=BLACK).shift(b.get_center())
        mat2[x][y] += i
        row.append(b)
        row.append(q)
    g = VGroup(*row)
    ohe_g.append(g)
======================
The mod-two conditional switches between sin and cos, and the AIAYN formula is in the first line of each branch. 'q' is just a conversion to an object that can be rendered on the screen.
Take a look and see if you can follow through and get the numbers on the screen that I get. Keen to hear how you get on - even keener to hear if you find a bug in my code, and I'll try to fix it. Thanks for the great question - Lucidate.
@greengraycolor (1 year ago)
@@lucidateAI Thank you again. Your first paragraph did the job and the code is a bonus. Thank you! I wonder how long your channel will remain a hidden gem before it explodes!
@lucidateAI (1 year ago)
Tell your friends.... ;-)
@ThinAirElon (11 months ago)
Great content! One question here: the relative positions between word 2 and word 3 might be the same as between word 2 and word 50, say, since sine and cosine can repeat. How does this add up?
@lucidateAI (11 months ago)
Based on the specs in the original AIAYN paper there _will_ be repetition in the individual components because of the periodicity of sin and cos. If I recall, the parameters in the AIAYN positional encoding function give the lowest-frequency component a wavelength of 10,000 × 2π positions, so the full encoding only starts to repeat far beyond any realistic sequence length.
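This is worth checking numerically. Each sin/cos pair has wavelength 2π · 10000^(2i/d_model), a geometric progression from 2π up to 10000 · 2π, so fast pairs repeat often but the combined vector stays distinct. A quick sketch (reusing the AIAYN formula, with an arbitrary d_model of 64):

```python
import numpy as np

d_model, num_positions = 64, 5000

pos = np.arange(num_positions)[:, None]
even_idx = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, even_idx / d_model)
pe = np.empty((num_positions, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

# The fastest pair (wavelength 2*pi, about 6.3 positions) wraps hundreds of
# times over 5000 positions, but the pairs drift at different rates, so no
# two full rows coincide even after rounding to 6 decimal places.
unique_rows = np.unique(np.round(pe, 6), axis=0).shape[0]
print(unique_rows)  # 5000: every position gets a distinct encoding
```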
@ThinAirElon (11 months ago)
@@lucidateAI Aaah, now I get it: that's why they divide by a big number like 10000, so the frequency is low, and of course a sentence usually won't have that many words. Thank you 👍
@lucidateAI (11 months ago)
yw!
@Eltaurus (5 months ago)
10:15 - This is not true, though. Euclidean distance depends not only on the lengths of the vectors added, but also on the angles between the added encoding vectors and the original embedding vectors, which won't be the same if words are swapped. That can easily be checked with a direct computation: in the first case the distance between the vectors corresponding to the words "swaps" and "are" is equal to
√[(-35.65-19.66)² + (59.47+61.65)² + (35.25-34.55)² + (-21.78-88.36)² + (33.44-50.35)²] = 173.627
while in the second case it equals
√[(-36.65-20.66)² + (60.47+62.65)² + (35.25-34.55)² + (-21.78-88.36)² + (33.44-50.35)²] = 175.671
So with one-hot positional encoding the distances depend on the positions of words in a sentence just as well. The reason for not using one-hot encodings for positions is actually a completely different one.
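For readers following along, the two distances can be reproduced directly (NumPy, using the five component values quoted above; the sign conventions follow the differences as written):

```python
import numpy as np

# Component values for "swaps" and "are" in the two word orders,
# as quoted in the computation above.
swaps_1 = np.array([-35.65, 59.47, 35.25, -21.78, 33.44])
are_1   = np.array([ 19.66, -61.65, 34.55,  88.36, 50.35])

swaps_2 = np.array([-36.65, 60.47, 35.25, -21.78, 33.44])
are_2   = np.array([ 20.66, -62.65, 34.55,  88.36, 50.35])

d1 = np.linalg.norm(swaps_1 - are_1)
d2 = np.linalg.norm(swaps_2 - are_2)
print(round(d1, 3), round(d2, 3))  # 173.627 175.671
```

The two values differ, confirming that the distance is not invariant under swapping the words.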
@messapatingy (1 year ago)
8:43 Me: What seems strange about the following sentence? "We'll start with a simple sentence, swaps are interest rate derivatives.". ChatGPT: The sentence "We'll start with a simple sentence, swaps are interest rate derivatives." may seem strange because it is not a simple sentence, but rather a complex one. Swaps, which are interest rate derivatives, are not a simple subject to explain and require a more in-depth understanding of financial markets and financial instruments. Additionally, the sentence is structured in a way that makes it appear as though the speaker is trying to simplify the topic, but in reality, the subject matter is quite complex.
@lucidateAI (1 year ago)
Thanks for the feedback. Always appreciated. I perhaps should have said “a short sentence”. You are correct that both “swaps” and “interest rate derivatives” _are_ complex (if they weren’t, then I wouldn’t have a job). I meant to say that the sentence was grammatically simple, not semantically simple. Appreciate the feedback.
@lucidateAI (1 year ago)
Good to hear.
@delhiboy0105 (1 year ago)
Hi there. Thanks again for a lovely video. I have a question. I have understood semantic encoding well; I read a bit about word2vec and know how it works conceptually. In a vast 300-dimension space, there are vectors representing each word. This is understood. Adding the values of positional encoding here doesn't change the semantic relationship. This is fine and understood as well. The question is: at what point in the training does positional encoding come into the picture? The positional encodings for the words in your example were different for different sentences. For example, the final vector representations for "are" and "swaps" were different in the different sentences. How are the positional encodings saved in the 300-dimension space for, say, word2vec? How is this positional info stored for each word so that the transformer calls upon it when needed? Are we saying that along with every word, we have smaller vectors representing its positions in the word2vec dataset? Or is this something which is calculated only at the time of giving the transformer an input and while getting an output?
@lucidateAI (1 year ago)
Your question is very similar to that of @greengraycolor below. Take a quick look at my replies to those questions and see if it makes a little more sense. The short answer is that sinusoidal positional encodings are not stored with the word vectors at all: they are a fixed function of position, computed and added to the word embeddings each time a sequence is fed to the transformer, both during training and at inference. If that doesn't clear it up, let me know and I'll try to answer it another way that hopefully makes sense. - Lucidate.
@SinanAkkoyun (1 year ago)
❤️
@lucidateAI (1 year ago)
Thanks!
@DdesideriaS (1 year ago)
Intuitively, this seems like a potentially lossy/non-smooth embedding, as it seems that very different phrases might be placed very close to each other in such an embedding. Are there any scaling best practices (e.g. normalization of the semantic encoding before adding position)? Also, what if we just concatenate the semantic matrix with the position (so effectively keep them separate)?
@DdesideriaS (1 year ago)
And there are even some KZbin vids on the topic: kzbin.info/www/bejne/g2O3oHiOe5uCotk
@lucidateAI (1 year ago)
Thank you for your insightful comment! You raise some valid concerns about the potential limitations of positional embeddings. Indeed, positional embeddings might sometimes seem non-smooth or lossy. However, during the training process, the model learns to fine-tune these embeddings to better understand the relationships between words in a sequence. Additionally, the attention mechanism helps the model to focus on relevant parts of the input, mitigating potential issues caused by the embeddings.

Regarding best practices for scaling, normalization can be applied to both the token and positional embeddings before they are combined. This helps ensure that neither the token nor the positional information dominates the other.

As for concatenating the semantic matrix with the position instead of adding them, this approach is certainly worth exploring. By concatenating the embeddings, you would effectively create a higher-dimensional representation of the input sequence, preserving the separation between the semantic and positional information. However, this method would result in a larger input size and could increase the complexity of the model. It might also require more substantial adjustments to the architecture.

Ultimately, the choice of positional encoding method depends on the specific use case, dataset, and desired model complexity. It is essential to experiment with different techniques to determine which approach yields the best results for your particular problem. I hope this addresses your concerns and provides some valuable insights. If you have any further questions or need clarification, please don't hesitate to ask. - Lucidate
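A toy sketch of the add-versus-concatenate trade-off (shapes only, with random stand-in token embeddings; not from any trained model):

```python
import numpy as np

seq_len, d_model = 9, 300

# Stand-in token embeddings (random for illustration).
token_emb = np.random.default_rng(1).normal(size=(seq_len, d_model))

# Sinusoidal positional encodings (AIAYN formulation).
pos = np.arange(seq_len)[:, None]
even_idx = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, even_idx / d_model)
pos_emb = np.empty((seq_len, d_model))
pos_emb[:, 0::2] = np.sin(angles)
pos_emb[:, 1::2] = np.cos(angles)

# Addition (the transformer default): width is unchanged, so downstream
# weight matrices stay the same size.
added = token_emb + pos_emb
print(added.shape)         # (9, 300)

# Concatenation: semantic and positional parts stay separate, but every
# downstream weight matrix must grow to accept the wider input.
concatenated = np.concatenate([token_emb, pos_emb], axis=1)
print(concatenated.shape)  # (9, 600)
```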
@DdesideriaS (1 year ago)
@@lucidateAI Gosh, now I'm second-guessing whether your reply was auto-generated by ChatGPT or you are indeed so nice as to reply so politely and in such a detailed way to random strangers on the internet.
@lucidateAI (1 year ago)
@@DdesideriaS The "reverse Turing test"? Can a human response be differentiated from one generated by an LLM? Many thanks for your question and your support of the channel. - Lucidate.
@tipycalflow1767 (1 year ago)
Could you please briefly explain your statement at 10:33 that the distance between the vector representing the word 'swap' and the vector representing the word 'derivatives' will always be the same? Is the distance you're referring to the one that's calculated before adding positional encoding, or after?
@lucidateAI (1 year ago)
Thank you, tipycalFlow, for your question! In the statement at 10:33, I was referring to the distance between the word vectors before adding the positional encoding. The full answer to these questions often depends on the context.

In a standard word embedding model, such as Word2Vec or GloVe, the distance between any two word vectors is calculated based solely on the co-occurrence statistics of those words in the training corpus. This means that the distance between the word vectors for "swap" and "derivatives" would be the same regardless of the position of those words in a sentence or the context in which they appear. This is because word embeddings do not incorporate any information about the position of the words in the input sequence.

However, by adding positional encoding to the model in a transformer, we can provide the model with additional information about the position of each word in the input sequence. This allows the model to differentiate between words that have the same meaning but appear in different contexts. So if we aren't using Word2Vec or GloVe or a similar embedding alone, there will be a very slight difference in the cosine similarity depending on where the word appears in the sentence. But given the vast hyper-dimensionality of the embedding space, these differences won't be enough to change the semantics of the word, simply the relative position.

I hope this answers your question! Thank you for supporting and contributing to our channel. Your engagement is greatly appreciated.
@tipycalflow1767 (1 year ago)
@@lucidateAI Thanks for taking the time out to answer with such detail
@lucidateAI (1 year ago)
You are welcome. Thanks for contributing to and improving the channel with an excellent question. Greatly appreciated.
@Dxeus (1 year ago)
To the creator of this video: please tell me you have a 40-hour course on Udemy about AI/ML. If not, please make one.
@lucidateAI (1 year ago)
Thanks Dxeus. Sorry to disappoint you, but I do not. I do have some playlists on KZbin on several aspects of AI/ML that you may find interesting. If you get a chance to look at them I would welcome any feedback that you may have:
Transformers (ChatGPT & GPT-3): kzbin.info/aero/PLaJCKi8Nk1hwaMUYxJMiM3jTB2o58A6WY
Neural Networks: kzbin.info/aero/PLaJCKi8Nk1hzqalT_PL35I9oUTotJGq7a
Intro to Machine Learning: kzbin.info/aero/PLaJCKi8Nk1hwklH8zGMpAbATwfZ4b2pgD
Machine Learning Ensembles: kzbin.info/aero/PLaJCKi8Nk1hxRtzY-8M2r3nDnRcyXU99Z
Computer Vision: kzbin.info/aero/PLaJCKi8Nk1hz2OB7irG0BswZJx3B3jpob
EDA: kzbin.info/aero/PLaJCKi8Nk1hxQzKBU-065dvwk-wdaBU7u
@varunahlawat9013 (1 year ago)
Sir, bouncing animations don't look good!
@lucidateAI (1 year ago)
Thanks for the support of the channel; I appreciate the constructive feedback. Getting the balance right between making the graphics engaging and informative can be challenging. I'll keep your comment in mind. Appreciated.