Training BERT #2 - Train With Masked-Language Modeling (MLM)

20,212 views

James Briggs


🎁 Free NLP for Semantic Search Course:
www.pinecone.io/learn/nlp
BERT has enjoyed unparalleled success in NLP thanks to two unique training approaches: masked-language modeling (MLM) and next sentence prediction (NSP).
In many cases, we might be able to take the pre-trained BERT model out-of-the-box and apply it successfully to our own language tasks.
But often, we need to pre-train the model further for a specific use case.
Further training with MLM allows us to tune BERT to better understand the particular use of language in a more specific domain.
Out-of-the-box BERT is great for general-purpose use; BERT fine-tuned with MLM is great for domain-specific use.
In this video, we'll cover exactly how to fine-tune BERT models using MLM in PyTorch.
👾 Code:
github.com/jamescalam/transfo...
Meditations data:
github.com/jamescalam/transfo...
Understanding MLM:
• Training BERT #1 - Mas...
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
📙 Medium article:
towardsdatascience.com/masked...
🎉 Sign-up For New Articles Every Week on Medium!
/ membership
📖 If membership is too expensive - here's a free link:
towardsdatascience.com/masked...
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?YouTu...

Comments: 87
@braydenmoore3101 · 1 year ago
Amazing man. Been struggling with this, because I've been getting bits and pieces of information from a bunch of different sources. You put it together really nicely.
@SupremeChickenx · 2 years ago
Hi James, I just wanted to write a quick comment to show my appreciation. This is the only clear and helpful resource on BERT MLM I could find, and your presentation was well thought out and easy to watch. Thanks so much!!
@guilhermelopes7809 · 1 year ago
I cannot agree more :)
@ravikumarpawar2406 · 2 years ago
Can you please make a follow-up video on testing the same process shown here? After training, how can we evaluate the model and predict on custom inputs? It's a really awesome video, thanks for your effort.
@kaankorkmaz8180 · 2 years ago
Thanks James, great video!
@sasna8800 · 2 years ago
Thank you so much. I was planning to use this in my thesis. God bless you
@jamesbriggs · 2 years ago
That's awesome! Let me know if you publish it anywhere, it would be really cool to see!
@manusrivastava2047 · 5 months ago
Amazing resource and great explanation!!
@yudiluceroguzmanmonteza5463 · 1 year ago
James, thank you very much :)
@palashvishwas9835 · 2 years ago
Thanks Man! Very Helpful 😭😍💕
@yumo8361 · 1 year ago
Thank you very much! I love you! 🥰🥰🥰
@maximilianhuber8995 · 1 year ago
great video, thanks!
@basharmohammad5353 · 3 years ago
Thanks a million
@vijaypalmanit · 3 years ago
Superb explanation, I finally understood transformer fine-tuning after maybe 15-20 videos. Please make an end-to-end video on fine-tuning a custom named entity recognition model.
@jamesbriggs · 3 years ago
That's awesome! I haven't fine-tuned NER models *yet*, but am planning to soon
@AnamSadiq · 3 years ago
Hey James, your video is very helpful. Can you please tell me how would I run the fine-tuned model to predict masked words for test data?
@mikhailschekalev6737 · 3 years ago
I think it's worth saying that some task domains require you to extend the tokenizer vocabulary. First, you have to find the important new tokens; after that, the model will "include" them in the training process.
@jamesbriggs · 3 years ago
yes true, I'm hoping to cover the tokenizers pretty thoroughly sometime because there's a lot of value in understanding them well :)
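A minimal sketch of what extending the vocabulary could look like with the Hugging Face API, assuming you have already identified the new domain terms yourself (the `new_terms` list below is a hypothetical placeholder):

    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    new_terms = ['dichotomy', 'assent']  # hypothetical domain-specific tokens
    num_added = tokenizer.add_tokens(new_terms)

    # the embedding matrix must grow to match the new vocabulary size
    model.resize_token_embeddings(len(tokenizer))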
@faheemahmed6682 · 2 months ago
Hello James, it is always amazing to watch your videos. I have protein sequence data which contains the protein sequences and their CDR sequences. I want to mask the protein sequence at the CDR positions. Can you please guide me on how to selectively mask it and then use the CDRs as labels?
@AI_Financier · 1 year ago
Thank you James, this was the best video on training BERT with MLM. How can I get the embeddings for individual words?
@Kaassap · 3 years ago
Nice video! I was wondering: in what variable is the fine-tuned model ultimately stored, and how do you save it like a typical transformer model?
@jamesbriggs · 3 years ago
Try model.save_pretrained('your/model/name'), hope that helps!
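For reference, a minimal save-and-reload sketch - the directory name is just an example, and `model`/`tokenizer` are assumed to be the fine-tuned objects from the video:

    model.save_pretrained('./bert-mlm-meditations')
    tokenizer.save_pretrained('./bert-mlm-meditations')

    # load them back later the usual way
    from transformers import BertForMaskedLM, BertTokenizer
    model = BertForMaskedLM.from_pretrained('./bert-mlm-meditations')
    tokenizer = BertTokenizer.from_pretrained('./bert-mlm-meditations')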
@sxmirzaei · 2 years ago
Also one more question: how do you call the trained model to predict a masked word in a sentence?
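One common route is Hugging Face's fill-mask pipeline - a minimal sketch, assuming `model` and `tokenizer` are the fine-tuned objects and the example sentence is arbitrary:

    from transformers import pipeline

    unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
    preds = unmasker('the soul becomes dyed with the colour of its [MASK].')
    for p in preds:
        print(p['token_str'], round(p['score'], 4))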
@sasna8800 · 2 years ago
Thank you so much. I want to know: should the data we use for continued pre-training be processed in the same way as the data the original model was trained on? Could you please do a video on this, since there isn't much emphasis on it elsewhere?
@ifanwang1796 · 2 years ago
Nice video with a great explanation! One question regarding your implementation: in the original paper, MLM predicts only the masked tokens, but in your example it looks like you're trying to recover the whole original sentence from the masked sentence. Any specific reason you implemented it this way? Thanks!
@OZAMA · 3 months ago
I have a doubt: after you retrain the model, where does the newly trained model get saved? Which variable should I refer to in order to download it?
@neuroinformaticafbf5313 · 2 years ago
Many thanks James, very interesting tutorial. I have some questions: (1) If I use a WordPiece tokenizer, I will mask single tokens rather than whole words, is that right? (2) In the BERT paper I read that a selected token is replaced with [MASK] 80% of the time, with a random token (or word, see the previous question) 10% of the time, and left unchanged 10% of the time. How do you deal with that?
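For context, the video replaces every selected token with [MASK]; the 80/10/10 split from the BERT paper could be added with something like the following sketch (a simplified, assumption-based implementation, not the code from the video):

    import torch

    def bert_style_mask(input_ids, tokenizer, mlm_prob=0.15):
        input_ids = input_ids.clone()
        labels = input_ids.clone()
        special = (input_ids == tokenizer.cls_token_id) | \
                  (input_ids == tokenizer.sep_token_id) | \
                  (input_ids == tokenizer.pad_token_id)
        # select ~15% of the non-special tokens
        selected = (torch.rand(input_ids.shape) < mlm_prob) & ~special
        labels[~selected] = -100  # loss is only computed on the selected tokens
        # 80% of the selected tokens -> [MASK]
        mask80 = selected & (torch.rand(input_ids.shape) < 0.8)
        input_ids[mask80] = tokenizer.mask_token_id
        # half of the remaining 20% -> a random token (i.e. 10% overall)
        rand10 = selected & ~mask80 & (torch.rand(input_ids.shape) < 0.5)
        input_ids[rand10] = torch.randint(len(tokenizer), input_ids.shape)[rand10]
        # the final 10% keep their original token
        return input_ids, labels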
@dhanashreemurge · 2 years ago
Hi James, thank you for this video, it's really helpful. I was not able to figure out a couple of points: 1. We are training the BERT model further, which means we are using the checkpoints from the pre-trained model and updating them on a domain-specific corpus. Is my understanding correct? 2. If yes, will the vocabulary size change when entirely new words are provided? I tried with a sample corpus and found the vocabulary was not updated. It's possible that all the words already existed, but I just want to confirm.
@ranjithkumarkarthic8696 · 1 year ago
Hi James, thanks for this wonderful video. I need to know how to use this trained model for text summarization. I saved the model; now what should I do to pass in content and get a summary back? Could you please help with that?
@Deshwal.mahesh · 2 years ago
What about catastrophic forgetting in MLM? Is there any effect of that? Also, what about OOV tokens for a specific domain? exBERT, BioBERT, and SciBERT all have different approaches to handle that. Any idea?
@snippletrap · 9 months ago
Much simpler mask tensor logic: torch.where(mask_arr, tokenizer.mask_token_id, inputs.input_ids).
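In context, that suggestion replaces the per-row loop from the video with a single vectorised call - a sketch, assuming `inputs` is the tokenizer output and `mask_arr` is the boolean selection tensor built earlier (True wherever a token should be masked):

    import torch

    inputs['labels'] = inputs.input_ids.detach().clone()
    # put [MASK] at every selected position in one step
    inputs['input_ids'] = torch.where(
        mask_arr,
        torch.tensor(tokenizer.mask_token_id),
        inputs.input_ids,
    )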
@rog0079 · 3 years ago
You know what we need
@jamesbriggs · 3 years ago
Awesome to hear :) So take a BERT model with no pretraining and pretrain it? If so, I think you mean training with NSP and MLM - which is something I'm working on.
@rog0079 · 3 years ago
@@jamesbriggs Yes exactly - I want a BERT for my own native language, and a GPT model too. Have you ever tried to code your own BERT or GPT from scratch?
@jamesbriggs · 3 years ago
I did a little on it some time ago - I'm not sure whether my approach was a good one though, it was complicated haha - but I came across this earlier in the week, a really cool article from Hugging Face where they do the same for Esperanto: huggingface.co/blog/how-to-train
@clashofclan9878 · 3 years ago
Very nicely explained... I have a request: could you please make a video on how to fine-tune BERT for named entity recognition (NER), using the WNUT17 or CoNLL 2003 data?
@jamesbriggs · 3 years ago
Thanks, and yes I'm planning on doing some work with NER and transformers soon - so there'll be one coming :)
@srkk1633 · 2 years ago
Hey man, great video. Truly what I was looking for. I just wanted to know: 1. Can the same be done with TensorFlow? 2. How do you get the sentences with the masked words replaced, using the model you trained?
@jamesbriggs · 2 years ago
Happy it helped! Re your questions: (1) yes, but I've never done it with TensorFlow, so I can't say I know how unfortunately, and (2) I believe I do this towards the end of this video: kzbin.info/www/bejne/aWazlaKvnpuNpbM - hope that helps!
@nourassem555 · 3 years ago
Thank you for the amazing tutorial! I have a question: is fine-tuning BERT better than GPT-2 when it comes to conditional text generation? What I mean is, if I want my model to follow certain criteria, or keep generating in the same flow as the previously generated sentences (say sentence 1 is X and sentence 2 is Y, where Y is generated from a prompt but follows the previous context), would it be better to use BERT because of its bidirectional architecture?
@jamesbriggs · 3 years ago
It's hard to say as I haven't used BERT for generation before, but if BERT were generating 'the next word' for the whole of sentence 2, I can't imagine it outperforming GPT-2. I say so because the bidirectional nature of BERT means it would likely perform better if it *already* knew what the next few words were, but in this case it would only know the previous words. So let's say you are generating 'X' below: "the quick brown X". BERT would need to predict X based solely on what it sees to the left, 'the quick brown'; it cannot use words on the right because there are none until it predicts some. As GPT-2 is trained specifically in a single direction, I'd imagine it would outperform. If you had something more like MLM, "the quick brown X jumped over the X dog", then I would see BERT outperforming GPT-2. But this is all just guessing, maybe it turns out that this isn't the case! :)
@Mr-un1uo · 2 years ago
Thank you for this video! It helped me understand BERT. But I wonder, how can I use the trained model?
@jamesbriggs · 2 years ago
Plenty of different ways, depending on what you want to do. For example, classification: kzbin.info/www/bejne/ppvXn555fKqfmac - or semantic similarity: www.pinecone.io/learn/nlp/
@AnandP2812 · 3 years ago
Hello, great video! I was wondering: if I have a dataset of 70,000 tweets, each tweet having a single emoji, would it be possible to use this method to predict an emoji for a tweet? For example, if the tweet is "You are cool", could the model predict various emojis that would fit the meaning of that tweet, such as 😎? Thanks.
@jamesbriggs · 3 years ago
Sorry Anand, missed this comment - I replied to your more recent one!
@navinbondade5365 · 3 years ago
Great man, can you make videos on the vision transformer, GPT from scratch, and an MLP model with attention?
@jamesbriggs · 3 years ago
I haven't worked on transformers with computer vision - so that one may take some time, but I'll definitely look into the other two. Thanks for watching :)
@demmenatube-494 · 1 year ago
Amazing tutorial!! You're a special one!! Clear delivery!! I have two questions: 1) How can I save this model to my local disk? 2) How can I load the saved model and use it to embed my training and testing datasets for a classification task? Many thanks!!
@jamesbriggs · 1 year ago
Thanks! You can use `model.save_pretrained("")` to save, and then load in the same way you normally would, eg `BertModel.from_pretrained("")`
@demmenatube-494 · 1 year ago
Thanks for your sound explanation!
@blockman9783 · 3 years ago
Hi James, amazing video! However, I have a question: if we mask this way, wouldn't the percentage of real tokens that get masked be way lower than 15%? From what I have understood, although we make sure not to mask any special tokens or padding, they still count in the 15% used with the rand method. Any alternative to bypass this issue? Thanks again!!
@jamesbriggs · 3 years ago
It should be ~15%. In terms of alignment with the original BERT approach, they do actually assign a 15% probability to every token - like we do here. But I see your point regarding whether that would translate to ~15% of the tokens: if the probability of each token depended on the probabilities of the tokens around it, then I believe you'd be correct, but rand (as far as I'm aware) assigns the value of each token independently of the other tokens. So with 1 token, we have a 15% probability of it being less than 0.15; adding another token means token 1 still has that 15% probability, but so does token 2. That's how I understand the rand method to work - let me know if that doesn't make sense though!
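A quick way to check the independence point, assuming nothing beyond PyTorch:

    import torch

    rand = torch.rand(1, 512)        # one draw per token, each independent
    selection = rand < 0.15
    print(selection.float().mean())  # ~0.15 on average, varying per sequence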
@blockman9783 · 3 years ago
@@jamesbriggs Yeah, that makes sense, thanks :D
@pradeeppant8809 · 2 years ago
Thanks James for this awesome video, it really helped me. Can you help me with how to get sentence embeddings using this model? Any pointers will really help.
@jamesbriggs · 2 years ago
Sure, I did a course on this - pinecone.io/learn/nlp - if you go to the chapter on unsupervised learning, that's the easiest way to fine-tune a sentence transformer that will create the embeddings you need :)
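As a rougher alternative to the sentence-transformer route recommended above, a minimal mean-pooling sketch over the MLM-tuned model's hidden states (the directory path is a placeholder for wherever the model was saved):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('./bert-mlm-meditations')
    model = BertModel.from_pretrained('./bert-mlm-meditations')

    def embed(sentences):
        tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            hidden = model(**tokens).last_hidden_state    # [batch, seq_len, 768]
        mask = tokens.attention_mask.unsqueeze(-1)        # zero out padding positions
        return (hidden * mask).sum(1) / mask.sum(1)       # mean-pooled [batch, 768]

    vectors = embed(['the obstacle is the way', 'waste no more time arguing'])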
@pradeeppant8809 · 2 years ago
@@jamesbriggs Thank you so much for sharing the reference, it is very insightful. In the resource you shared, you talk about the decoder when comparing MLM and TSDAE. Where can I find the decoder for MLM, to get sentence-level encodings from the word-level encodings?
@The845548 · 2 years ago
Hi James, thank you for the amazing tutorial. Does it make sense to fine-tune BERT using MLM on labeled data? I want to mask some tokens based on the labels and then augment those masked tokens based on the labels.
@jamesbriggs · 2 years ago
I haven't seen it done before, but I'm sure you could do it - it depends on what you mean by labeled data. If you wanted to label data into categories (e.g. sentence by sentence), you would just use a classifier, but I don't think that's what you're wanting to do? Maybe you're looking more for named entity recognition (NER), like in spaCy?
@The845548 · 2 years ago
@@jamesbriggs Thank you for your response. I trained my model by following your tutorial. But what do we do next after training? How do we use the saved pre-trained model? How do we test and validate the model? Thank you so much for your help in advance!
@jamesbriggs · 2 years ago
@@The845548 The pretrained model (which you now have) acts as the core 'engine' of a transformer model; from there you need to 'fine-tune' it for a specific task, such as classification, Q&A, etc. Before fine-tuning you may want to validate the 'core'. I don't have a quantitative measure - I usually just run a few tests to see if it's comprehending language well. You can see an example of this towards the end of this video: kzbin.info/www/bejne/aWazlaKvnpuNpbM
@thelastone1643 · 3 years ago
James... awesome 👏. I don't want to mask using a random method - I want to mask some specific tokens, such as punctuation. Also, how do we evaluate the model after we fine-tune it? The loss is not a robust metric. I saw some tutorials using perplexity as a metric. Is there any other technique?
@jamesbriggs · 3 years ago
For me, I've always relied on watching the loss reduce over time. Once the model is trained I will usually use it for something else (e.g. Q&A), and it's at that point that I use other metrics, like ROUGE for Q&A, to know whether the model is performing better or worse. I haven't used perplexity, but I'm going to look into it now that you've mentioned it. I don't know of any other metrics for checking performance at the MLM/NSP stage though. If you find perplexity to be useful, or anything else, let me know!
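For anyone who does want a number at the MLM stage, a rough pseudo-perplexity sketch (exp of the mean masked-LM loss over a held-out set; `val_texts` is a hypothetical list of validation strings, and the masking here is simplified to [MASK]-only):

    import math
    import torch

    def pseudo_perplexity(model, tokenizer, val_texts):
        model.eval()
        losses = []
        with torch.no_grad():
            for text in val_texts:
                enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
                labels = enc.input_ids.clone()
                rand = torch.rand(enc.input_ids.shape)
                sel = (rand < 0.15) & (labels != tokenizer.cls_token_id) \
                                    & (labels != tokenizer.sep_token_id) \
                                    & (labels != tokenizer.pad_token_id)
                # for very short texts, check that at least one token was selected
                enc['input_ids'] = enc.input_ids.masked_fill(sel, tokenizer.mask_token_id)
                labels[~sel] = -100
                out = model(**enc, labels=labels)
                losses.append(out.loss.item())
        return math.exp(sum(losses) / len(losses))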
@thelastone1643 · 3 years ago
@@jamesbriggs We can measure an existing LM's (BERT, ELMo, etc.) performance on the intended tasks, such as text classification, Q&A, etc. The only one that isn't easy to evaluate is text generation. For example, if someone trained a GPT-2 model on poems, how would you evaluate it? It's not like text classification, which can be evaluated with an F1 score or similar. I am intermediate in language modeling; I think all NLP topics are easy to evaluate except text generation and related tasks such as text summarization... kzbin.info/www/bejne/eHKxZIF4lLiBfrs
@thelastone1643 · 3 years ago
@@jamesbriggs James, please make a lesson on how to fine-tune the GPT-2 model on a large custom dataset and evaluate the model. It would be a great lesson that would help many students. I say large custom dataset because 1) students have their own large datasets that can't be trained as one chunk, and 2) students in the text generation field aren't interested in using the pre-trained GPT model as-is.
@jamesbriggs · 3 years ago
@@thelastone1643 Yes, that's very true for generation and summarization - not easy to measure at all! Thanks for the info and the video, it looks like a great place to start - I'll be watching it later.
@jamesbriggs · 3 years ago
@@thelastone1643 I'll add it to the list :)
@charmz973 · 2 years ago
This line confused me - I thought we are using BERT, which has 768 tokens per sequence, so why use 512 for the transformer inputs?
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
@jamesbriggs · 2 years ago
It's pretty confusing haha - there are 512 tokens per sequence, but every single one of those tokens is passed through an embedding layer, and each token becomes a dense vector of length 768. So what we end up with is a 512*768 tensor being processed through each encoding layer of the BERT model - hope that makes sense :)
@charmz973 · 2 years ago
@@jamesbriggs Wow, more than clear!! I had always gotten this wrong - I'm really glad I found this channel. Thanks bro
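To make those shapes concrete, a small sketch assuming bert-base-uncased and an arbitrary example sentence:

    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    text = 'you have power over your mind, not outside events'
    inputs = tokenizer(text, return_tensors='pt', max_length=512,
                       truncation=True, padding='max_length')
    print(inputs.input_ids.shape)           # torch.Size([1, 512])  - 512 token IDs

    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # torch.Size([1, 512, 768]) - a 768-d vector per token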
@chayandhaddha7474 · 3 years ago
What steps should I take to test the fine-tuned BERT MLM model?
@jamesbriggs · 3 years ago
You'll want to test it out using Huggingface's 'fill-mask' pipeline - info here: huggingface.co/transformers/main_classes/pipelines.html
@chayandhaddha7474 · 3 years ago
@@jamesbriggs Thanks for the lovely video and response. I have some questions, could you please answer them? How do I do whole-word masking for languages other than English? And are these lines correct for testing?
from transformers import pipeline
# load the fine-tuned model and tokenizer into the pipeline
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# get the best (1st) recommended word at the masked position
unmasker("Hi, [MASK] are you?")[0]['token_str']
@jamesbriggs · 3 years ago
You would need to train/use another model for your chosen language, and then yes, the code you've written should work. I'll be releasing a video today, 'Training and Testing an Italian Bert', which goes through the same process :)
@sasna8800 · 2 years ago
Can I do both MLM and NSP at the same time? Or do I train them separately?
@jamesbriggs · 2 years ago
You can yes - we do it a few videos later in this series :)
@tharunsirimalla9794 · 3 years ago
Can you please make a tutorial on how to build a BERT MLM from scratch using our own dataset and how to use it? I want to build a BERT model for the Telugu language, but it's very difficult for me :'(
@jamesbriggs · 3 years ago
Hey Tharun, yes, many have been asking for the same thing - I'm hoping to do it very soon :)
@tharunsirimalla9794 · 3 years ago
@@jamesbriggs Pleased to hear this from you :-)
@nadja3125 · 2 years ago
How can one then use this further with the BERT model itself?
@jamesbriggs · 2 years ago
Hi Nadja, you will want to train it further using the specific training approach for your problem. For example, classification: kzbin.info/www/bejne/ppvXn555fKqfmac and Q&A: kzbin.info/www/bejne/kHq1nouhfdVjY8U
@nadja3125 · 2 years ago
@@jamesbriggs Yes, I pre-trained BertForMaskedLM so that I can use the from_pretrained method with the BERT model. The idea is that BERT will already have some knowledge about the type of data I use.
@kumarraj-gn9hl · 3 years ago
Can anyone provide me with the code and dataset, please?
@jamesbriggs · 3 years ago
Hey Kumar, I've included both in the description :)
@kumarraj-gn9hl · 3 years ago
@@jamesbriggs Thank you, sir :-)
Training BERT #3 - Next Sentence Prediction (NSP)
13:43
James Briggs
11K views
Training BERT #1 - Masked-Language Modeling (MLM)
16:24
James Briggs
31K views
Training BERT #4 - Train With Next Sentence Prediction (NSP)
36:45
Diffusion models from scratch in PyTorch
30:54
DeepFindr
235K views
How to use BERTopic - Machine Learning Assisted Topic Modeling in Python
15:04
Python Tutorials for Digital Humanities
31K views
BERT Research - Ep. 8 - Inner Workings V - Masked Language Model
24:04
ChrisMcCormickAI
15K views
311 - Fine tuning GPT2 using custom documents​
14:51
DigitalSreeni
24K views
Fine-tuning Large Language Models (LLMs) | w/ Example Code
28:18
Shaw Talebi
265K views
HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning
38:12