### How Self-Attention Works

1. **Input Representation**:
   - Consider an input sequence of words, each represented by a vector. For example, in the sentence "The cat sat on the mat", each word is converted into a vector through an embedding.
2. **Query, Key, and Value Vectors**:
   - For each word in the sequence, self-attention computes three vectors: Query (Q), Key (K), and Value (V).
   - These vectors are derived by multiplying the input vector with three different weight matrices that are learned during training.
3. **Attention Scores**:
   - The attention score for a word is computed by taking the dot product of its query vector with the **key vectors of all words** in the sequence. This score indicates **the importance of the other words to the current word**.
   - Mathematically: $\text{Attention Score}(Q, K) = Q \cdot K^T$
4. **Softmax**:
   - The attention scores are normalized with a softmax function to obtain the attention weights, which sum to 1. This converts the raw scores into a probability-like distribution.
5. **Weighted Sum**:
   - The final representation of each word is a weighted sum of the value vectors, where the weights are the attention weights from the previous step.
   - Mathematically: $\text{Output} = \text{Softmax}(Q \cdot K^T) \cdot V$
   - (A NumPy sketch of these matrix operations appears at the end of this section.)

**4:00 Representing the process of acquiring Q, K, V as matrix operations**

- a1, a2, a3, a4 can be grouped (as columns) into a matrix A; q1, q2, q3, q4 can be grouped into a matrix Q, and so on.
- $W^q A = Q$, and likewise $W^k A = K$ and $W^v A = V$.

**14:35** The only parameters that require training are $W^q$, $W^k$, $W^v$.

**15:04 Multi-Head Self-Attention**

- The number of heads is a hyperparameter.
- 2 heads → 2 queries, 2 keys, 2 values; 3 heads → 3 queries, 3 keys, 3 values; and so on.

**19:00** But this means each position now has multiple output vectors (multiple b's, one per head), which requires an additional transform to compute the final output. With a single head, the values weighted by the attention scores can be output directly; multi-head attention needs this one extra step (see the multi-head sketch at the end of this section).

**19:15 Positional Encoding**

Self-attention has no built-in position information. Use positional encoding if you believe "position" is a required part of your data.

**26:00 Truncated Self-Attention**

Speech is often very long. We can limit the attention range to a specific part of the utterance; there is no need to consider the whole sequence.

**28:00 Self-Attention for Images**

**29:30 Self-Attention vs. CNN**

- A CNN is self-attention with its range limited to the receptive field → a CNN is simplified self-attention (the CNN is more restricted).
- Self-attention is a CNN with a learnable receptive field: the "kernel size" is learned by the model rather than set by the programmer (self-attention is more flexible).

**32:30 Flexible Models Need More Data**

- Flexible models need more data; otherwise you will encounter overfitting.
- A CNN is good enough when given a small amount of data, but eventually stops improving when fed more.
- Self-attention can surpass a CNN if given enough data.

**35:10 Self-Attention vs. RNN (Recurrent Neural Network)**

### RNN Key Characteristics

1. **Sequential Data Handling**:
   - RNNs are designed to handle sequential data by maintaining a hidden state that captures information from previous steps in the sequence. This gives the network a form of "memory" that can be used to inform its predictions.
2. **Recurrent Connections**:
   - Unlike feedforward neural networks, RNNs have connections that loop back on themselves. This feedback loop allows information to persist, making RNNs well suited to tasks where the order of inputs matters. (A minimal sketch of this recurrence follows.)
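A minimal NumPy sketch of this recurrence, assuming a simple Elman-style cell with a tanh activation (the function name, sizes, and initialization are illustrative, not the lecture's notation):

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b):
    """Run a simple (Elman-style) RNN over a sequence.

    X   : (seq_len, d_in)      input vectors, one per time step
    W_x : (d_in, d_hidden)     input-to-hidden weights
    W_h : (d_hidden, d_hidden) hidden-to-hidden (recurrent) weights
    b   : (d_hidden,)          bias
    """
    h = np.zeros(W_h.shape[0])           # initial hidden state: the network's "memory"
    hidden_states = []
    for x_t in X:                        # strictly one time step at a time
        # h_t depends on h_{t-1}, so the steps cannot be computed in parallel
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)       # (seq_len, d_hidden)

# Toy usage: a sequence of 6 four-dimensional vectors, hidden size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
H = rnn_forward(X, rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8))
print(H.shape)  # (6, 8)
```

The explicit loop over time steps is exactly the sequential dependency that the 38:00 segment below contrasts with self-attention.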
**38:00 How RNN Works**

- Self-attention can replace RNN.
- An RNN must generate its vectors one by one, because computing the next vector requires the previous one. Self-attention can generate all vectors in parallel.
- The famous Transformer paper, "Attention Is All You Need", puts it this way: "Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples."

**40:30 Self-Attention for Graphs**

If two nodes are not connected, we don't need to compute their attention score → this leads to GNNs, which are a deep rabbit hole of their own.
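To make the earlier pieces concrete, here is a minimal NumPy sketch of single-head self-attention in matrix form, covering steps 1-5 and the 4:00 matrix formulation. Unlike the notes' $W^q A = Q$ (inputs as columns of A), the sketch stacks the input vectors as rows, so the projections read `A @ W_q`; the optional mask is an addition of mine to show where truncated self-attention (26:00) and the graph case (40:30) plug in. All names are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A, W_q, W_k, W_v, mask=None):
    """Single-head self-attention over the whole sequence at once.

    A             : (seq_len, d_model) input vectors a1..an stacked as rows
    W_q, W_k, W_v : (d_model, d_k)     learned projection matrices
    mask          : optional (seq_len, seq_len) boolean matrix; False entries
                    are excluded from attention (truncated range, graph edges, ...)
    """
    Q = A @ W_q                       # all queries in one matrix product
    K = A @ W_k                       # all keys
    V = A @ W_v                       # all values
    scores = Q @ K.T                  # Attention Score(Q, K) = Q . K^T
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked pairs get ~zero weight
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                # Output = Softmax(Q . K^T) . V

# Toy usage: 5 input vectors of dimension 8, projected down to dimension 4
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(8, 4))
W_v = rng.normal(size=(8, 4))
print(self_attention(A, W_q, W_k, W_v).shape)   # (5, 4)

# Truncated self-attention (26:00): each position only attends to a +/-1 window
window = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]) <= 1
out_truncated = self_attention(A, W_q, W_k, W_v, mask=window)
```

Every query, key, value, and output is produced by a handful of matrix products with no loop over positions, which is the parallelism advantage over the RNN sketch above that the 38:00 quote describes. (The Transformer paper additionally scales the scores by $1/\sqrt{d_k}$ before the softmax; the sketch keeps the notes' unscaled form.)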
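Finally, a sketch of multi-head self-attention (15:04), assuming the common design in which each head has its own smaller $W^q$, $W^k$, $W^v$ and the per-head outputs are concatenated and passed through one extra output projection, i.e. the additional transform mentioned at 19:00. Again, the names and sizes are illustrative.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_self_attention(A, heads, W_o):
    """Multi-head self-attention.

    A     : (seq_len, d_model) input vectors stacked as rows
    heads : list of (W_q, W_k, W_v) tuples, one per head
            (the number of heads is the hyperparameter from 15:04)
    W_o   : (num_heads * d_k, d_model) output projection that combines the
            multiple b vectors per position -- the extra step from 19:00
    """
    head_outputs = []
    for W_q, W_k, W_v in heads:                       # one Q, K, V per head
        Q, K, V = A @ W_q, A @ W_k, A @ W_v
        head_outputs.append(softmax_rows(Q @ K.T) @ V)
    concatenated = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_k)
    return concatenated @ W_o                             # back to (seq_len, d_model)

# Toy usage: 2 heads, d_model = 8, d_k = 4 per head
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(2 * 4, 8))
print(multi_head_self_attention(A, heads, W_o).shape)  # (5, 8)
```

With a single head, `weights @ V` could be output directly; here the concatenation of the heads has to be mapped back to the model dimension by `W_o`, which is the one extra step the notes point out.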