Rotary Positional Embeddings: Combining Absolute and Relative

25,178 views

Efficient NLP

In this video, I explain RoPE - Rotary Positional Embeddings. Proposed in 2021, this innovation is swiftly making its way into prominent language models like Google's PaLM and Meta's LLaMA. I unpack the magic behind rotary embeddings and reveal how they combine the strengths of both absolute and relative positional encodings.
0:00 - Introduction
1:22 - Absolute positional embeddings
3:19 - Relative positional embeddings
5:51 - Rotary positional embeddings
7:56 - Matrix formulation
9:31 - Implementation
10:38 - Experiments and conclusion
References:
RoFormer: Enhanced Transformer with Rotary Position Embedding (main paper that proposes RoPE embeddings): arxiv.org/abs/2104.09864
EleutherAI blog post: blog.eleuther.ai/rotary-embed...
Blog posts by first author Jianlin Su (in Chinese): kexue.fm/archives/8130 and kexue.fm/archives/8265
Survey paper on positional embeddings: aclanthology.org/2022.cl-3.7/

Comments: 53
@wolpumba4099 (10 months ago)
*Video Summary: Rotary Positional Embeddings: Combining Absolute and Relative*
- *Introduction* - Discusses the importance of positional embeddings in Transformer models.
- *Absolute Positional Embeddings* - Explains how absolute positional embeddings work; highlights limitations like fixed sequence length and lack of relative context.
- *Relative Positional Embeddings* - Introduces the concept of relative positional embeddings; discusses the computational challenges and inefficiencies.
- *Rotary Positional Embeddings (RoPE)* - Combines the advantages of both absolute and relative embeddings; uses rotation to encode position, preserving relative distances.
- *Matrix Formulation* - Explains the mathematical formulation behind RoPE.
- *Implementation* - Shows how RoPE can be implemented efficiently in PyTorch.
- *Experiments and Conclusion* - Shares results of experiments showing RoPE's effectiveness and efficiency compared to other methods.

The video provides a comprehensive overview of Rotary Positional Embeddings, a method that combines the strengths of both absolute and relative positional embeddings. It delves into the mathematical details and practical implementation, concluding with experimental results that validate its effectiveness.
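To make the "Implementation" point concrete, here is a minimal, hypothetical PyTorch sketch of rotary embeddings in the "rotate half" style used by many open-source implementations (the function and variable names are illustrative, not taken from the video):

```python
import torch

def rotate_half(x):
    # Swap the two halves of the last dimension and negate the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    # One frequency per pair of dimensions: theta_i = base^(-2i/dim).
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = positions[:, None].float() * theta[None, :]   # (seq_len, dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)  # (seq_len, dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin

# Usage: rotate queries and keys before computing attention scores.
q = torch.randn(16, 64)   # (seq_len, head_dim)
k = torch.randn(16, 64)
pos = torch.arange(16)
scores = apply_rope(q, pos) @ apply_rope(k, pos).T
```

Queries and keys are rotated just before the attention dot product; values are left untouched.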
@laurentiupetrea3726 (11 days ago)
Finally! My 4th video and I was lost but this one did the trick!
@MrOnlineCoder (7 months ago)
Amazing video, intuitive explanations with examples.
@kevon217 (10 months ago)
Thanks for this overview!
@theunconventionalenglishman (6 months ago)
I've watched a few videos trying to wrap my head around this concept and yours is by far the best. Thanks!
@ItsRyanStudios (8 months ago)
This is amazing, thank you. I had just wrapped my mind around sinusoidal embeddings, then came across RoPE and was really struggling to grasp it. Definitely going to refer back to this video. I love in-depth NLP content like this.
@weekendwarrior7933 (1 month ago)
Absolutely amazing explanation! Keep it up man
@pierreenel1516 (3 months ago)
Excellent video, thanks!
@marshallmcluhan33 (9 months ago)
Good work, I look 'forward' to the ReRoPE video. 😎
@roomo7time (1 month ago)
your explanation is amazing. thank you for your work
@hw5622 (5 months ago)
Thank you so much. Your explanation is very clear and succinct.
@ddobokki (3 months ago)
OMG!! Very good teaching!!!
@muyanfeng2082 (3 months ago)
Really good introduction, thanks
(8 months ago)
very good explanation.
@cmbbqrpb9737 (10 months ago)
Thanks for creating and sharing this vid! Still confused on the math stuff though. So I read through the paper and wrote down some notes: The rotation matrix R_m rotates the query vector q of the m-th token by mθ, while R_n rotates the key vector k of the n-th token by nθ. For any rotation (orthogonal) matrix R, R^T = R^(-1) holds, so R_m^T is R_m's inverse and rotates in the opposite direction, by -mθ. This means (R_m q)^T (R_n k) in total rotates q^T k by (n-m)θ. It ultimately ties the interaction between the m-th query and the n-th key to their relative distance n - m, naturally and interpretably.
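Restating the key identity from the note above in LaTeX (standard column-vector convention; my own formulation, not quoted from the paper):

```latex
% Rotations compose additively and satisfy R^T = R^{-1}, so R_m^T R_n = R_{n-m}:
\[
(R_m q)^\top (R_n k)
  = q^\top R_m^\top R_n k
  = q^\top R_{n-m} k,
\qquad\text{where } R_\phi =
\begin{pmatrix}
\cos\phi & -\sin\phi \\
\sin\phi & \cos\phi
\end{pmatrix}.
\]
```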
@dy8576 (3 months ago)
Genius
@akshaydevkarama3277 (19 days ago)
Great explanation, really helped me!
@ml.9106 (3 months ago)
Very clear~~thanks!
@vixguy (9 months ago)
You make it easy to learn even for a high school student
@gemini_537 (2 months ago)
Gemini: The video is about a new method for positional embeddings in transformers called rotary positional embeddings. The Transformer architecture is a neural network architecture commonly used for various natural language processing tasks. A key challenge for Transformer models is that they are invariant to the order of words by default, which means the model cannot distinguish between a sentence and its scrambled version. To address this, positional embeddings are added to the Transformer model.

There are two main types of positional embeddings: absolute and relative. Absolute positional embeddings assign a unique vector to each position in a sentence; this approach, however, cannot handle sentences longer than those seen in training. Relative positional embeddings, on the other hand, represent the relationship between two words. While this method can handle sentences of any length, it requires additional computations in the self-attention layer, making it less efficient.

Rotary positional embeddings address the limitations of both. The core idea is to rotate the word vector instead of adding a separate positional embedding vector, with the amount of rotation determined by the position of the word in the sentence. This way, rotary positional embeddings capture the absolute position of a word while also preserving the relative positions between words. The video also mentions that rotary positional embeddings have been shown to improve the training speed of language models.
@varunsaagars (6 months ago)
🎯 Key Takeaways for quick navigation:
00:14 🆕 In 2021, a new architectural improvement called "Rotary Positional Embeddings" (RoPE) was proposed, and it has since been adopted by various language models.
03:27 🔄 Relative positional embeddings represent the distances between token pairs but face engineering challenges, such as slower processing for longer sequences.
06:01 🔄 Rotary positional embeddings rotate word vectors based on their positions, combining the advantages of both absolute and relative positional embeddings.
08:04 🔢 Rotary embeddings are implemented using rotation matrices in the 2D case and a more general block-wise approach for higher-dimensional vectors.
10:48 ⚙️ Experiments show that models using rotary positional embeddings train faster than those using sinusoidal embeddings and are relatively robust across model architectures and training setups.
@SahilDua (7 months ago)
Thanks for the in-depth explanation of RoPE. A couple of questions:
1. How is the KV cache used/built in the RoPE case? RoPE is applied to Q and K. Does this change anything in how K and V are cached?
2. Where can I find the intuition behind why RoPE works? I usually find it harder to jump into the mathematical equations directly to find the proof.
@EfficientNLP (7 months ago)
Yes, the KV cache can be used normally with RoPE, because the rotation is applied to a token depending on its position from the start of the sequence, and this does not change as more tokens are generated. I hope this video provides a good intuition of why this works!
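A small hypothetical sketch of that point (names and shapes are illustrative, not from the video): the rotation applied to a key depends only on that key's own absolute position, so rotated keys written to the KV cache never need to be recomputed as the sequence grows.

```python
import torch

def rope(x, pos, theta):
    # Rotate adjacent dimension pairs of x by pos * theta (interleaved convention).
    x1, x2 = x[0::2], x[1::2]
    cos, sin = torch.cos(pos * theta), torch.sin(pos * theta)
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
theta = 10000.0 ** (-torch.arange(0, d, 2).float() / d)
torch.manual_seed(0)

# Keys are rotated once, at their own positions, and cached.
k_cache = [rope(torch.randn(d), pos, theta) for pos in range(4)]

# A new query at position 4 attends over the cached keys unchanged;
# each score depends only on the token contents and the offset (4 - pos).
q = rope(torch.randn(d), 4, theta)
scores = torch.stack([q @ k for k in k_cache])
```

Because the dot product of two rotated vectors depends only on their contents and relative offset, the cached keys stay valid no matter how long the sequence grows.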
@einsteinsapples2909 (6 months ago)
I just smashed the like button.
@amortalbeing (7 months ago)
Thanks a lot!
@harshmittal63 (1 month ago)
@qwerty_and_azerty (10 months ago)
Great vid! Nice explanation! Question: why is it termed “rotary” and not “rotational” position embeddings?
@EfficientNLP (10 months ago)
It’s the name given in the paper. I think it’s quite catchy!
@naubull2 (7 months ago)
Thanks for a great explanation! By the way, I was curious about something. From the initial explanation and the rotation equations, consecutive pairs of coordinates seem to be rotated, i.e. (x_1, x_2), (x_3, x_4), ... are each rotated together. However, in most implementations, as suggested in the video, the code pairs up dimensions not by adjacent indices but with an offset of half the dimension, i.e. (x_1, x_{d/2+1}), (x_2, x_{d/2+2}), ..., since the code splits the hidden dimension in half and swaps the order of the halves. Did I understand correctly, or am I missing something?
@EfficientNLP (7 months ago)
You are correct. In many implementations, rather than rotating each pair of adjacent dimensions, they split the entire vector in half and rotate the two halves against each other. Ultimately this does not matter: it amounts to a fixed permutation of the dimensions, applied consistently to queries and keys, so the attention dot products are unchanged. This is likely more efficient from an implementation standpoint and is equivalent to the original formula.
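To see the equivalence numerically, here is a hypothetical check (my own sketch, not code from the video): the two conventions give identical attention scores as long as queries and keys use the same dimension ordering.

```python
import torch

def rope_interleaved(x, pos, theta):
    # Rotate adjacent pairs (x_0, x_1), (x_2, x_3), ... by pos * theta_i.
    x1, x2 = x[0::2], x[1::2]
    cos, sin = torch.cos(pos * theta), torch.sin(pos * theta)
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_half_split(x, pos, theta):
    # Rotate pairs (x_i, x_{i + d/2}) by pos * theta_i (the "rotate_half" convention).
    d = x.shape[-1]
    x1, x2 = x[: d // 2], x[d // 2 :]
    cos, sin = torch.cos(pos * theta), torch.sin(pos * theta)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

torch.manual_seed(0)
d = 8
theta = 10000.0 ** (-torch.arange(0, d, 2).float() / d)
q, k = torch.randn(d), torch.randn(d)
m, n = 3, 7

# Score with the interleaved convention.
s1 = rope_interleaved(q, m, theta) @ rope_interleaved(k, n, theta)

# Same score with the half-split convention, after permuting dimensions
# the same way for q and k (even indices first, then odd indices).
perm = torch.cat([torch.arange(0, d, 2), torch.arange(1, d, 2)])
s2 = rope_half_split(q[perm], m, theta) @ rope_half_split(k[perm], n, theta)

assert torch.allclose(s1, s2)
```

Either convention works, as long as it is applied consistently to both queries and keys.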
@buh357 (4 months ago)
Thank you for such a clear explanation; it helped me understand this concept. Rotary positional embedding is such an elegant way to do positional embedding, and it makes intuitive sense to me. Curious how this embedding technique works for vision transformers. Anyone have experience?
@EfficientNLP (4 months ago)
Rotary embeddings may be applied to a vision transformer, just as they can be for any other transformer; I'm not aware of any reports that it improves performance in this case. It would be an interesting experiment, though!
@abdelrahmanhammad1020 (7 months ago)
Thanks @Bai for the great explanation. I still have a question: mathematically, why do the positional embeddings of other techniques (maybe absolute?) change when more tokens are added to the sentence? Around minute 7:00 of this video. Thanks!
@EfficientNLP (7 months ago)
Staying fixed when more tokens are added is a property of most absolute positional embeddings, but generally not of relative positional embeddings. For example, T5's relative embeddings change at every step, since different bias values need to be added to the attention matrix. Rotary embeddings are thus the first to combine the benefits of both absolute and relative embeddings.
@hussainshaik4390 (9 months ago)
Great video, but I have one question. You are referring to the EleutherAI blog, right? In their PyTorch implementation, instead of rotating every 2 adjacent elements of the dim vector, they rotate the two halves like this:
```python
def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)
```
But in the JAX implementation they rotate every two elements. Any idea on this?
@EfficientNLP (9 months ago)
Yes, that's possible; there are multiple ways to implement this, but they should be logically equivalent.
@hazemessamm (9 months ago)
Hi, thank you for this great video, but I wanted to ask how they are logically equivalent. The values that get negated are not the same, so how are the two versions equivalent? @@EfficientNLP
@ziqichen5902 (8 months ago)
Same question... Have you figured out the reason yet?😅@@hazemessamm
@jasonjones4236 (10 months ago)
Why is kv cache difficult to implement in case of relative embeddings?
@EfficientNLP (10 months ago)
The KV cache saves the K and V matrices during autoregressive decoding to avoid recomputing them for every token. But for relative embeddings, when a new token is generated, the relative distance between the new token and previous tokens changes. So there is an extra step (adding the relative biases) that cannot be cached, making the KV cache not as effective.
@jasonjones4236 (10 months ago)
@@EfficientNLP Ah so to be precise, the cache can work but we need to fully compute the attention matrix and add the relative embedding matrix to it. But isn't the attention matrix computed when we torch.matmul q and k in other cases too?
@EfficientNLP (10 months ago)
That is correct. In summary: there are several steps that are required in relative positional embeddings that aren't needed for absolute & rotary embeddings, which make them slower. Determining precisely which step causes the slowdown is an interesting question and would require some benchmarking experiments.
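For concreteness, a hypothetical sketch of the extra step being discussed, in the style of T5's relative bias (the bucketing is simplified and the names are illustrative, not taken from the video):

```python
import torch

seq_len, d, max_dist = 5, 16, 32
torch.manual_seed(0)
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)

# Learned bias per (clipped) relative distance.
rel_bias_table = torch.randn(2 * max_dist + 1)

# The matrix of relative distances j - i grows and shifts every time a new
# token is generated, so this lookup-and-add step cannot simply be cached.
pos = torch.arange(seq_len)
rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
scores = q @ k.T / d ** 0.5 + rel_bias_table[rel]   # extra add on top of QK^T
attn = torch.softmax(scores, dim=-1)
```

The bias lookup and addition is the extra step discussed above that absolute and rotary embeddings avoid.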
@dylstuart (7 months ago)
Great video! What value is used for Theta?
@EfficientNLP (7 months ago)
Theta_i = 10000^(-2(i-1)/d) for i = 1, ..., d/2. I didn't cover this in the video, but it is given in the RoFormer paper.
@gemini_537 (4 months ago)
@@EfficientNLP That seems to be the same as the one used in the paper "Attention Is All You Need".
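As a quick illustration of that formula (my own snippet, not from the video), the frequencies decay geometrically across dimension pairs, matching the sinusoidal encoding frequencies:

```python
import torch

d = 8  # head dimension (illustrative)
i = torch.arange(1, d // 2 + 1)
theta = 10000.0 ** (-2 * (i - 1).float() / d)
print(theta)  # tensor([1.0000, 0.1000, 0.0100, 0.0010]) for d = 8
```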
@guanxi99 (6 months ago)
Thanks for the good explanation! How do you actually make sure that the result of applying a positional embedding algorithm does not coincidentally represent another token? E.g., how do you avoid that the positional embedding of “dog” at position i ends up meaning “cat” at position j?
@EfficientNLP (6 months ago)
Indeed, it is possible for a word at position i to have the same embedding as a different word at position j, since both positional information and non-positional semantic information are represented in the same embedding space. The model learns to use them appropriately during training.
@davidlee327 (11 days ago)
dude you are the mf goat
@pratik6447 (3 months ago)
What are the W_q and W_k matrices and how are they calculated?
@EfficientNLP (3 months ago)
These are the W_q and W_k matrices in self-attention, which are used to generate the Q and K matrices.
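A minimal sketch of that reply (illustrative names only, not code from the video): W_q and W_k are weight matrices learned during training, and the rotary rotation is applied after these projections.

```python
import torch

d_model, d_head, seq_len = 32, 16, 4
torch.manual_seed(0)
x = torch.randn(seq_len, d_model)    # token embeddings
W_q = torch.randn(d_model, d_head)   # learned during training
W_k = torch.randn(d_model, d_head)   # learned during training

Q = x @ W_q                          # queries, (seq_len, d_head)
K = x @ W_k                          # keys, (seq_len, d_head)
# RoPE would then rotate each row of Q and K according to its position.
```

In a real transformer these projections would be nn.Linear layers; plain matrices are used here to keep the sketch short.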
@csbarathi (8 months ago)
Why not positionally embed based on the sentence and paragraph rather than just the position of the word in the overall prompt? I understand that it adds more computation, but wouldn't it yield better results?
@EfficientNLP (8 months ago)
The transformer doesn't distinguish between sentences and paragraphs; they are treated like any other token, so the position encoding doesn't refer to them specifically.
@csbarathi (8 months ago)
@@EfficientNLP I guess I have something in mind that I'm unable to express in words now. Will try it out and let you know what I ran into.