Pytorch Seq2Seq with Attention for Machine Translation

32,015 views

Aladdin Persson

4 years ago

In this tutorial we build a Sequence-to-Sequence (Seq2Seq) model with attention from scratch in PyTorch and apply it to machine translation on a dataset of German-to-English sentences, specifically the Multi30k dataset.
Resources to learn more:
github.com/bentrevett
• C5W3L07 Attention Mode...
• C5W3L08 Attention Model
arxiv.org/abs/1409.0473
pytorch.org/tutorials/interme...
Comment on resources:
I think bentrevett on GitHub is awesome, and this video was heavily inspired by his Seq2Seq tutorials. I really recommend checking him out; he puts out a lot of great tutorials on his GitHub.
❤️ Support the channel ❤️
/ @aladdinpersson
Paid Courses I recommend for learning (affiliate links, no extra cost for you):
⭐ Machine Learning Specialization bit.ly/3hjTBBt
⭐ Deep Learning Specialization bit.ly/3YcUkoI
📘 MLOps Specialization bit.ly/3wibaWy
📘 GAN Specialization bit.ly/3FmnZDl
📘 NLP Specialization bit.ly/3GXoQuP
✨ Free Resources that are great:
NLP: web.stanford.edu/class/cs224n/
CV: cs231n.stanford.edu/
Deployment: fullstackdeeplearning.com/
FastAI: www.fast.ai/
💻 My Deep Learning Setup and Recording Setup:
www.amazon.com/shop/aladdinpe...
GitHub Repository:
github.com/aladdinpersson/Mac...
✅ One-Time Donations:
Paypal: bit.ly/3buoRYH
▶️ You Can Connect with me on:
Twitter - / aladdinpersson
LinkedIn - / aladdin-persson-a95384153
GitHub - github.com/aladdinpersson

Comments: 50
@semaj8683 · 1 year ago
Excellent video as ever! Thank you very much for the clear explanations!
@antoninleroy3863 · 2 years ago
Thanks for the free education!
@2010mhkhan · 3 years ago
Thank you so much for the great video and explanation!
@AladdinPersson · 3 years ago
Appreciate the kind words!
@ZobeirRaisi · 4 years ago
Thanks for the tutorial
@AladdinPersson · 4 years ago
Appreciate the comment, hope you find it useful :)
@theexecutioner66 · 1 year ago
Great video! Could you perhaps make one about skip connections in RNNs and how to utilise them?
@teetanrobotics5363 · 3 years ago
This guy is better than several college professors.
@AladdinPersson · 3 years ago
You're too kind, but it's not true, I still have too much to learn :\
@zawadtahmeed850 · 4 years ago
Thanks for this excellent content. Please make a follow-up video on the utils file; it would be really helpful for new learners like me who want to work with other or custom datasets.
@AladdinPersson · 4 years ago
In hindsight I probably should've gone through the utils functions, but I do think that after the video you're able to go through that code by yourself if you take some time. The code can be found here: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/Seq2Seq_attention/utils.py
@saraferro509 · 10 months ago
Dear @Aladdin, when you concatenate only the first hidden and the second hidden, you are considering just a single hidden layer of the RNN. In your example you are using an LSTM with 2 hidden layers, so the shape of hidden would be (2*n_hidden_layers, N, n_hidden_nodes). Should we consider the forward and backward states not only of the first hidden layer but also of the second, or of all the other layers if there are more? (minute 9:53)
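For reference, a minimal shape sketch of the situation described in this question, assuming PyTorch's layer-major ordering of the LSTM hidden state (layer 0 forward, layer 0 backward, layer 1 forward, layer 1 backward, ...); the variable names are illustrative:

import torch
import torch.nn as nn

seq_len, N, emb_size, hidden_size, num_layers = 7, 4, 32, 64, 2

# Bidirectional, 2-layer encoder LSTM; batch is the second dimension, as in the video.
rnn = nn.LSTM(emb_size, hidden_size, num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, N, emb_size)

encoder_states, (hidden, cell) = rnn(x)
print(encoder_states.shape)  # (seq_len, N, 2 * hidden_size)
print(hidden.shape)          # (num_layers * 2, N, hidden_size)

# Concatenating only hidden[0:1] and hidden[1:2] (as discussed above) uses the
# forward and backward states of the FIRST layer only. The LAST layer's two
# directions sit at the end of the first dimension:
last_layer_hidden = torch.cat((hidden[-2:-1], hidden[-1:]), dim=2)  # (1, N, 2 * hidden_size)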
@user-or7ji5hv8y · 3 years ago
How do you keep track of the shapes of the various inputs and outputs to ensure that they are aligned? I notice that as the program becomes longer, it becomes harder to keep track, especially when you need to go through the encoder and decoder parts.
@impact783 · 3 years ago
Hey! Did you draw those first images? If so, what software did you use?
@DanielWeikert · 4 years ago
Any ideas on how to figure out the necessary input and output shapes for the layers? It's really something I struggle with a lot. Thanks, and great video!
@AladdinPersson · 4 years ago
I'm assuming you're referring to when we define the LSTM and the subsequent linear layers in the encoder and decoder, since the input size is determined by the vocab size, and the embedding size and hidden_size are just hyperparameters. I understand that the shapes can be confusing; it's particularly confusing since we're using a bidirectional LSTM in the encoder but not in the decoder (following the paper's implementation). If we start with the encoder, the input_size will just be the embedding size (since we first run the input x through the embedding layer). The linear layers in the encoder then take hidden_size*2, since we are concatenating the forward and backward parts of the bidirectional LSTM. You could also just use one of the hidden states of the encoder LSTM, either forward or backward, but if you want to use the information from both you need to do something like I did with a linear layer that maps from hidden_size*2 down to hidden_size, since the decoder will not be bidirectional. For the decoder we have the encoder_states, which remember are really just the hidden values for every timestep (we don't run the encoder_states through any additional linear layers), hence the final dimension will be hidden_size*2. These encoder_states are element-wise multiplied by the attention scores, which are scalar values, to form the context vector, but that multiplication doesn't modify the shape. We then concatenate this context vector with the embedding in the decoder, resulting in hidden_size*2 + embedding_size as the input size to the decoder LSTM. Hopefully that gave you something; it can be confusing and there are a lot of shapes to keep track of :) Wish you the best of luck!
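A minimal sketch of the layer sizes walked through in this reply; the class and attribute names below are illustrative and may not match the video's code exactly:

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Bidirectional LSTM: hidden/cell states carry a forward and a backward part.
        self.rnn = nn.LSTM(embedding_size, hidden_size, num_layers=1, bidirectional=True)
        # Map the concatenated (forward, backward) states from 2*hidden_size down to
        # hidden_size, because the decoder is not bidirectional.
        self.fc_hidden = nn.Linear(hidden_size * 2, hidden_size)
        self.fc_cell = nn.Linear(hidden_size * 2, hidden_size)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Scores each encoder timestep against the decoder hidden state by
        # concatenating them (hidden_size + hidden_size*2), as discussed further
        # down in the comments.
        self.energy = nn.Linear(hidden_size * 3, 1)
        # Input per step = context vector (hidden_size*2) concatenated with the
        # current token embedding.
        self.rnn = nn.LSTM(hidden_size * 2 + embedding_size, hidden_size, num_layers=1)
        self.fc_out = nn.Linear(hidden_size, vocab_size)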
@vanshshah6418 · 1 year ago
What do I have to modify if I want to change num_layers? I can't figure it out. Can you modify your GitHub code to generalize it rather than hardcoding it for only one layer? Thanks.
@Andrew6James · 3 years ago
Hi, I wondered why the input to the decoder has size (1, N)? I thought the decoder only takes the previous output in the sequence as input. I am trying to use a Seq2Seq model for non-NLP tasks, such as predicting factory outputs, and I am getting stuck having followed your GitHub.
@somayehseifi8269 · 2 years ago
A question about the decoder part: hidden, which is an input argument of the forward function, is, I think, the hidden from the encoder, not from the decoder, am I right? So we use the hidden from the encoder three times without using the hidden from the decoder. I would be thankful if you could explain this, @Aladdin Persson. So for the first time step, will all three hiddens be from the encoder, or what?
@swarajshinde3950 · 3 years ago
Nice explanation of Seq2Seq with PyTorch. Can you suggest some more sources for NLP using PyTorch?
@AladdinPersson · 3 years ago
Check out Ben Trevett on GitHub.
@user-or7ji5hv8y · 3 years ago
Given that the video uses an LSTM, is this the ELMo model?
@archit_474 · 3 months ago
I want to say that now (at the time I am watching) Field and BucketIterator have been removed, so how can we do it without them?
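One way around the removal, sketched below, is to skip the legacy torchtext classes entirely and handle tokenization, vocabulary building, and batching by hand with spacy and a plain DataLoader. Everything here (function names, the special-token layout, min_freq) is an illustrative assumption rather than the video's actual code, and it assumes the spacy models are installed:

import spacy
import torch
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")

def tokenize_de(text):
    return [tok.text.lower() for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]

def build_vocab(sentences, tokenize, min_freq=2):
    # Assumed special-token layout: 0 <pad>, 1 <sos>, 2 <eos>, 3 <unk>
    counter = Counter(tok for sent in sentences for tok in tokenize(sent))
    itos = ["<pad>", "<sos>", "<eos>", "<unk>"] + [w for w, c in counter.items() if c >= min_freq]
    return {w: i for i, w in enumerate(itos)}

def numericalize(sentence, tokenize, vocab):
    tokens = [vocab.get(t, vocab["<unk>"]) for t in tokenize(sentence)]
    return torch.tensor([vocab["<sos>"]] + tokens + [vocab["<eos>"]])

def collate_fn(batch):
    # batch: list of (src_tensor, trg_tensor); pad to (seq_len, batch_size),
    # the same layout BucketIterator produced.
    srcs, trgs = zip(*batch)
    return pad_sequence(srcs, padding_value=0), pad_sequence(trgs, padding_value=0)

# pairs = [(numericalize(de, tokenize_de, de_vocab),
#           numericalize(en, tokenize_en, en_vocab)) for de, en in raw_pairs]
# loader = DataLoader(pairs, batch_size=64, shuffle=True, collate_fn=collate_fn)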
@stephennfernandes · 3 years ago
BucketIterator is deprecated. Also, can you please help with converting your code to training on TPUs? The BucketIterator class doesn't support a sampler argument, so I can't set a DistributedSampler object on the DataLoader for TPU training.
@adityay525125 · 3 years ago
Hi, I am getting concatenation errors between the context vector and the embedding.
@slouma1998 · 3 years ago
I'm curious, how did you learn this? Is there an academic book or anything like that which discusses the coding part of building models?
@AladdinPersson · 3 years ago
Papers (and source code from papers), blog posts, forums. Since everything is so new it's hard to find good books, although if you find any then do let me know :)
@vijayabhaskarj3095 · 3 years ago
@AladdinPersson d2l.ai is awesome for most stuff and it's often updated.
@UknownCompiler · 3 years ago
I get this attention mechanism, but it took me time to understand it even after a lot of research. Why? Because the attention I know has keys, queries, and values, and can be multi-head, and all of the articles explain it that way. Somehow the attention that goes by the same name here is much simpler and different from the multi-head one I know, and very few articles explain the attention shown here. My question is: why can't this attention have keys, queries, and values, and what exactly is the name of this type of attention?
@tanmay_ds · 4 years ago
Not related to this current video, but can you suggest some good sources for learning to implement i-vectors for speaker verification? Thank you!
@AladdinPersson · 4 years ago
I must admit I have not put any considerable time into this topic, so I can't really give you any valuable advice on it :\
@tanmay_ds · 4 years ago
@AladdinPersson No issues, sir. Thank you for your response.
@user-qk2ev5jl2b · 1 year ago
In line 92, the permutation sequence (1, 2, 0) works for me instead of (1, 0, 2).
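The difference likely comes down to which tensor the permute is applied to before torch.bmm. A small shape sketch, assuming the (seq_len, N, ...) layout used in the video; the variable names are illustrative:

import torch

seq_len, N, hidden_size = 7, 4, 64

attention = torch.softmax(torch.randn(seq_len, N, 1), dim=0)   # (seq_len, N, 1)
encoder_states = torch.randn(seq_len, N, hidden_size * 2)      # (seq_len, N, 2H)

# torch.bmm expects batch-first 3D tensors, so the two inputs need DIFFERENT permutes:
attn_b = attention.permute(1, 2, 0)         # (N, 1, seq_len)
states_b = encoder_states.permute(1, 0, 2)  # (N, seq_len, 2H)

context = torch.bmm(attn_b, states_b)       # (N, 1, 2H): weighted sum over source positions
context = context.permute(1, 0, 2)          # back to (1, N, 2H) for the decoder LSTM input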
@bibiworm · 1 year ago
Yeah!
@MasterMan2015 · 2 years ago
Setting num_layers = 2 (> 1) broke the code at: output, hiddens, cells = model.decoder(previous_word, outputs_encoder, hiddens, cells)
@user-gs8bc5zd8d · 9 months ago
Yep, got the same issue! Did you figure out where it went wrong? Lol, it was 1 year ago, but shooting my shot still.
@charissayu8025 · 3 years ago
Hi, I have done pip install utils yet got the problem: cannot import name 'translate_sentence' from 'utils' (/opt/anaconda3/lib/python3.7/site-packages/utils/__init__.py). I'd appreciate it if you could advise how to solve it. Thank you.
@ATPokerATPoker-dp4ex · 3 years ago
Download utils from his GitHub; it's a custom script he made.
@arshadshaikh4676 · 3 years ago
How do you increase the number of layers? Can you share a link to that implementation?
@AladdinPersson · 3 years ago
It's in the GitHub repository, which is in the video description or here: github.com/aladdinpersson/Machine-Learning-Collection. In the readme file I've tried to collect everything into a nice summary, and hopefully it shouldn't be too difficult to find. To increase the number of layers you just have to pass in the number of layers to use in the LSTM, although remember that we need the same number in the encoder and the decoder.
@arshadshaikh4676 · 3 years ago
@AladdinPersson Yes, I tried to pass the number of layers in, but it throws an error. Say I use 4 layers: then for the decoder we need 4 encoder hidden states as initialization. But that's the issue, since we are concatenating the forward and backward passes, which converts those 8 (4 forward, 4 backward) into 1, i.e. [8, *, *] --> [1, *, *], whereas we need [4, *, *] to initialize the decoder. Next, I edited it and returned [4, *, *] from the encoder, but the error still persists, because the attention mechanism is hard-coded for encoder outputs of [1, *, *]. (Drop me an email and we can have a discussion: ars.arshad.ars@gmail.com)
@AladdinPersson · 3 years ago
@arshadshaikh4676 Actually, now I remember this problem. What I did to solve it was to concatenate them and send them through a fully connected network whose output can then be sent into the decoder. I found this unnecessarily complicated, and it didn't improve performance all that much in my experiments, so I didn't include it in the video.
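A rough sketch of the workaround described in this reply, assuming PyTorch's layer-major layout of the bidirectional hidden state; the names and the tanh are illustrative choices, not necessarily what the repository does:

import torch
import torch.nn as nn

num_layers, N, hidden_size = 4, 8, 64

# Hidden state from a bidirectional, multi-layer encoder: (num_layers * 2, N, hidden_size)
hidden = torch.randn(num_layers * 2, N, hidden_size)

# Project the concatenated forward/backward states of EACH layer back down to
# hidden_size, so the unidirectional decoder can be initialized with
# (num_layers, N, hidden_size).
fc_hidden = nn.Linear(hidden_size * 2, hidden_size)

h = hidden.view(num_layers, 2, N, hidden_size)   # (layers, directions, N, H)
h = torch.cat((h[:, 0], h[:, 1]), dim=2)         # (layers, N, 2H)
decoder_init_hidden = torch.tanh(fc_hidden(h))   # (layers, N, H)
print(decoder_init_hidden.shape)                 # torch.Size([4, 8, 64])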
@arshadshaikh4676 · 3 years ago
@AladdinPersson Thanks for that.
@1potdish271 · 2 years ago
So your code will not work for a decoder with `num_layers=2` or with `bidirectional=True`, right?
@sreenjaysen927 · 1 year ago
It should not work, because of the decoder hidden states.
@user-qk2ev5jl2b · 1 year ago
Right.
@felixmohr8354 · 1 year ago
First of all: great video, really well done. I have some remarks (besides the now-necessary updates in preparing the tokens etc. and the permutation used for bmm, which were already mentioned in other comments). Most importantly, I would like to point you to the appendix of the paper, which not only explains how the hidden states are composed (yes, concatenated), but also explains how the energy function is computed. With respect to the latter, I think your implementation is not correct. The problem is that you concatenate the hidden states of the decoder and encoder and then use this directly as the input to the energy function, but that is not how it works. As I understand it, you need two separate linear layers for the encoder and decoder hidden states, to learn two separate weight *matrices* for them (both with the same number of output neurons, n' in the paper; the matrices are W_a and U_a). The results of these computations are then *added* and passed through a tanh, and it is this result that is mapped by a final linear layer (your energy function, which corresponds to the vector v_a in the paper). I would be curious to know your opinion on this.
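For comparison, a minimal sketch of the additive scoring described in this comment (and in the appendix of the Bahdanau et al. paper): separate projections W_a and U_a for the decoder state and the encoder states, summed inside a tanh, then reduced to a scalar by v_a. The layer names and sizes are illustrative:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_hidden_size, enc_hidden_size, attn_size):
        super().__init__()
        self.W_a = nn.Linear(dec_hidden_size, attn_size, bias=False)      # acts on s_{i-1}
        self.U_a = nn.Linear(enc_hidden_size * 2, attn_size, bias=False)  # acts on h_j (bidirectional)
        self.v_a = nn.Linear(attn_size, 1, bias=False)

    def forward(self, dec_hidden, encoder_states):
        # dec_hidden: (1, N, dec_hidden_size); encoder_states: (seq_len, N, enc_hidden_size * 2)
        # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j): the projections are ADDED, not concatenated.
        energy = self.v_a(torch.tanh(self.W_a(dec_hidden) + self.U_a(encoder_states)))  # (seq_len, N, 1)
        return torch.softmax(energy, dim=0)  # attention weights over the source positions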
@winx_hajar · 1 year ago
Could you provide your own code with the appropriate fix, please?
@chiragbaid8211 · 4 years ago
A video on image captioning, please!
@AladdinPersson · 4 years ago
I've got plans to do it soon :)