Training BERT #1 - Masked-Language Modeling (MLM)

31,434 views

James Briggs


🎁 Free NLP for Semantic Search Course:
www.pinecone.io/learn/nlp
BERT, everyone's favorite transformer, cost Google ~$7K to train (and who knows how much in R&D costs). From there, we can write a couple of lines of code to use that same model - all for free.
BERT has enjoyed unparalleled success in NLP thanks to two unique training approaches: masked-language modeling (MLM) and next sentence prediction (NSP).
MLM consists of giving BERT a sentence and optimizing the weights inside BERT to output the same sentence on the other side.
So we input a sentence and ask that BERT outputs the same sentence.
However, before we actually give BERT that input sentence, we mask a few of the tokens.
So we're actually inputting an incomplete sentence and asking BERT to complete it for us.
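Below is a minimal sketch of that masking step, assuming the Hugging Face transformers library and bert-base-uncased; the example sentence and the 15% masking rate are illustrative rather than the exact recipe from the video.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is Paris."
inputs = tokenizer(text, return_tensors="pt")

# Keep the original token IDs as the training targets.
labels = inputs["input_ids"].clone()

# Randomly mask ~15% of the tokens (never the special [CLS]/[SEP] tokens).
rand = torch.rand(inputs["input_ids"].shape)
mask = (
    (rand < 0.15)
    & (inputs["input_ids"] != tokenizer.cls_token_id)
    & (inputs["input_ids"] != tokenizer.sep_token_id)
)
inputs["input_ids"][mask] = tokenizer.mask_token_id

# In practice the labels of unmasked positions are usually set to -100 so the
# loss only covers the masked tokens; passing the full labels also works.
outputs = model(**inputs, labels=labels)
print(outputs.loss)  # cross-entropy between predictions and the original tokens
```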
How to train BERT with MLM:
• Training BERT #2 - Tra...
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
Medium article:
towardsdatascience.com/masked...
🎉 Sign-up For New Articles Every Week on Medium!
/ membership
📖 If membership is too expensive - here's a free link:
towardsdatascience.com/masked...
🤖 70% Discount on the NLP With Transformers in Python course:
www.udemy.com/course/nlp-with...
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?YouTu...

Comments: 56
@samiragoudarzi988 · 2 months ago
Your videos and articles are a breeze to follow, James! They've truly made my learning journey smoother and more enjoyable. Thanks for all the hard work!
@exxzxxe · 1 year ago
Excellent, James!
@hieunguyen8952 · 3 years ago
Really intuitive and easy to understand. Thank you very much, bro!
@jamesbriggs · 3 years ago
Welcome!
@The845548 · 2 years ago
Thank you, James, for this video. You've explained everything so well.
@jamesbriggs · 2 years ago
Great to hear! Thanks for watching :)
@trekkiepanda1704 · 2 years ago
This is very helpful! Thanks!
@a_programmer2754 · 2 years ago
Very helpful bruh, thank you!
@basharmohammad5353 · 3 years ago
Very friendly and intuitive introduction. Many thanks for this nice video.
@jamesbriggs · 3 years ago
More than welcome, thanks!
@vladsirbu9538 · 2 years ago
Neat explanations! Thanks!
@johnsonwalker545 · 2 years ago
Very nice video. Thanks. We are also waiting for a solution video with TensorFlow.
@charmz973 · 2 years ago
Thank YOU very much - this was the missing piece in my NLP journey.
@jamesbriggs · 2 years ago
Haha awesome, happy it helps!
@sxmirzaei · 2 years ago
Thanks! Great video. Can you make a video on MLM using T5-1? It would be very helpful - I couldn't find much on that.
@orewa9591 · 2 years ago
Thank you
@artursradionovs9543 · 2 years ago
Thank you for the video! What's the best way to get on track in deep learning? Any hints? Thank you!
@PRUTHVIRAJRGEEB · 6 months ago
Hey James! Thanks a lot for the clear explanation of how MLM works for BERT. I have a question though: we're only using the 'encoder' part of the transformer during MLM to encode the sentence, right? So how does the 'decoder' of BERT get trained?
@mohamadrezabidgoli8102 · 3 years ago
Thanks, man. Could you also visualize the encoded vectors in a video?
@jamesbriggs · 3 years ago
Happy you enjoyed it - I find it helps to visualize these things.
@BENTOL · 1 day ago
Great explanation, James! I want to ask: what are the parameters of the feed-forward neural network? What size does its weight matrix need to be so that it can output a probability distribution over ~30,000 classes?
@poojaagri3776 · 1 year ago
Hello, this video is very helpful. But I am getting an error when passing the inputs to the model with the double-star (**) argument: it says it got an unexpected argument "label". Can you please tell me what I am doing wrong?
@testingemailstestingemails4245 · 2 years ago
Thanks for this wonderful explanation, sir. I want to build my own voice dataset to train a medical-terms model for automatic speech recognition. Please help me - I don't know how to start, or what the structure of the dataset should be.
@piyushkumar-wg8cv · 1 year ago
How do we decide the mask value?
@divyanshukatiyar8886 · 2 years ago
Thanks a lot for this masterpiece. I do have something unusual going on: it tells me that BertTokenizer is not callable when it should be. I checked and realised that the __call__ API was introduced from transformers v3.0.0 onwards, so I updated the package. It still throws the same error. Any help here?
@jamesbriggs · 2 years ago
After updating you should be able to call it - I'd double-check that your code is actually running the updated version.
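A quick way to confirm which installation Python is actually importing (a sketch assuming the transformers package; the model name is just for illustration):

```python
import transformers
from transformers import BertTokenizer

print(transformers.__version__)  # tokenizer.__call__ requires >= 3.0.0
print(transformers.__file__)     # shows which installation is actually imported

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer("hello world"))  # on older versions, fall back to tokenizer.encode_plus("hello world")
```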
@tomcruise794 · 3 years ago
Thanks for the informative video, enjoyed it. When will you upload the video on training the model through MLM?
@jamesbriggs · 3 years ago
Here you go kzbin.info/www/bejne/iGfLlKuDgrSlhqc :)
@tomcruise794 · 3 years ago
Another great piece. I also have a doubt: when calculating the weights in the encoder (attention layer), what will the initial value of the masked token be? There should be some numerical value in order to calculate a probability and find a loss value.
@jamesbriggs · 3 years ago
@@tomcruise794 Each token has a vector representation in each encoder; for BERT-base this is a vector containing 768 values (and there are 512 of them in each encoder - one per token position). The final vector is passed to a feed-forward NN which outputs another vector containing ~30K values (the number of tokens in BERT's vocabulary), and we then apply softmax to this. The loss can then be calculated as the difference between this softmax probability distribution (our prediction) and a one-hot encoded vector of the real token. That's pretty long, sorry! Does it make sense?
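Those shapes can be checked directly; a small sketch assuming Hugging Face's BertForMaskedLM and bert-base-uncased (whose WordPiece vocabulary has 30,522 tokens):

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("hello [MASK] world", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# One row of vocabulary scores per token position:
print(logits.shape)  # torch.Size([1, 5, 30522]) for this 5-token input

# Softmax over the vocabulary axis gives the predicted distribution that the
# cross-entropy loss compares against the (one-hot) real token ID.
probs = logits.softmax(dim=-1)
print(probs[0, 2].topk(3))  # top predictions for the [MASK] position
```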
@tomcruise794 · 3 years ago
@james Thanks for the detailed explanation. But my question is: if a word is masked by the mask token, what will its initial value/vector representation be? Will it be zero? There has to be some initial value for the mask token in order to calculate a probability.
@jamesbriggs · 3 years ago
@@tomcruise794 I'm not sure I fully understand! Maybe you are referring to the initial vector in BERT's embedding array? There the mask token (ID 103) is replaced by a specific vector, which is then fed into the first encoder block. In that case the initial vector representation wouldn't be zero - it would look like any other word's embedding (as far as I'm aware), and before BERT was pretrained these values would have been initialized with random values (before being optimized to create more representative initial vectors).
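For reference, the [MASK] token is simply another row of the learned embedding matrix; a minimal check, assuming bert-base-uncased:

```python
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

print(tokenizer.mask_token, tokenizer.mask_token_id)  # [MASK] 103

# The embedding matrix maps every token ID (including 103) to a 768-d vector;
# the [MASK] row was learned during pretraining like any other token's row.
embeddings = model.get_input_embeddings()              # nn.Embedding(30522, 768)
mask_vector = embeddings.weight[tokenizer.mask_token_id]
print(mask_vector.shape)                               # torch.Size([768])
```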
@rmtariq · 2 years ago
@james Really impressive explanation... Based on this, can MLM also be applied to a text classification model?
@jamesbriggs · 2 years ago
Yes - MLM is used to train the 'core' BERT models; things like text classification, Q&A, etc. are handled by additional 'heads' (extra layers) added to the end of the transformer. So you'd train with MLM, then follow that with training on some text classification task. This video will take you through the training for classification: kzbin.info/www/bejne/ppvXn555fKqfmac
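A minimal sketch of that second step, assuming the MLM-trained weights were saved to a hypothetical local directory './my-mlm-bert'; BertForSequenceClassification keeps the encoder weights and adds a fresh, randomly initialized classification head:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# "./my-mlm-bert" is a placeholder for a checkpoint saved after MLM training
# (model.save_pretrained("./my-mlm-bert")); any BERT checkpoint works here.
tokenizer = BertTokenizerFast.from_pretrained("./my-mlm-bert")
model = BertForSequenceClassification.from_pretrained("./my-mlm-bert", num_labels=2)

# The encoder keeps the MLM-trained weights; only the new head starts from scratch.
inputs = tokenizer("this movie was great", return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1]))
print(outputs.loss, outputs.logits.shape)  # logits shape: [1, num_labels]
```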
@henkhbit5748 · 3 years ago
Really interesting stuff. But what about if you want to use BERT in a different language? All the videos I saw were based on English. A video on creating a BERT model from scratch in a different language, with some simple corpus of text, would be nice. It would also be helpful if you could explain in a side note what you have to do to adapt your English example to another language...
@jamesbriggs · 3 years ago
Hey Henk, yes I've had a lot of questions on this - I will be releasing something on it soon.
@henkhbit5748 · 3 years ago
@@jamesbriggs Thanks, looking forward to it 👍
@johngrabner · 2 years ago
Good video. I have a question: tokenization results in shorter sequences vs. raw characters, and the probability distribution over tokens is more even than the distribution over characters. My question is: how important is tokenization to BERT's performance?
@jamesbriggs · 2 years ago
Thanks! The model must be able to represent relationships between tokens and embed some meaning into each token. If we make 1 character == 1 token, that leaves us (in English) with just 26 tokens into which the model must encode the "meaning" of language, so it is limited. If we use sub-word tokens like BERT does, we have 30K+ tokens to spread that "meaning of language" across - I hope that makes sense!
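For scale, the sub-word vocabulary can be inspected directly (assuming bert-base-uncased):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(len(tokenizer))  # 30522 sub-word tokens, vs. 26 letters for a pure character-level vocabulary
```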
@johngrabner · 2 years ago
@@jamesbriggs I get that 30K tokens means higher-level semantics. BERT still must learn the relationships between these tokens to perform. So what is the degradation of BERT at 30K tokens, vs. 20K tokens, vs. 10K tokens, ... down to 26 tokens? I can't find any mention of this.
@boriswithrazor6992 · 2 years ago
Hello! How can I use BERT to predict the word behind a [MASK] token?
@jamesbriggs · 2 years ago
You can use the Hugging Face 'fill-mask' pipeline: huggingface.co/transformers/main_classes/pipelines.html#fillmaskpipeline
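For example, a minimal use of that pipeline (bert-base-uncased assumed; the sentence is just an illustration):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))  # top candidate tokens with probabilities
```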
@technicalanu6919 · 3 years ago
Hello. How are you, brother?
@AnandP2812 · 3 years ago
Hi, I followed the Hugging Face tutorial for MLM, but it does not seem to work with emojis - any idea on how to do this? For example, I have a dataset containing tweets, with each tweet containing one emoji - and I want to use MLM to predict the emoji for a tweet. Thanks.
@jamesbriggs · 3 years ago
Hi Anand, I haven't used BERT with emojis, but it should be similar to training a new model from scratch. Hugging Face have a good tutorial here: huggingface.co/blog/how-to-train - that should help. In particular, that tutorial uses byte-level encodings, which should work well with emojis. I'm working on a video covering training BERT from scratch, hopefully that will help too :) Hope you manage to figure it out!
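For reference, the tokenizer-training step of that approach looks roughly like this - a sketch using the tokenizers library, where 'tweets.txt', the vocabulary size, and the special tokens are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# "tweets.txt" is a placeholder for a one-tweet-per-line text file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["tweets.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Byte-level BPE can represent any byte sequence, so emojis never become <unk>.
print(tokenizer.encode("great game today 🔥").tokens)

tokenizer.save_model("./tokenizer")  # writes vocab.json and merges.txt
```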
@AnandP2812 · 3 years ago
@@jamesbriggs Hi James - thanks for the reply. I will take a look at that tutorial - will it work with my own dataset? Also, keep up the great content!
@jamesbriggs · 3 years ago
@@AnandP2812 I believe so, yes - I haven't worked through it myself yet, but I see no reason why not. I will do, thanks!
@soumyasarkar4100 · 2 years ago
Hi... If we mask tokens after tokenization of the text sequence, wouldn't that lead to masking subwords instead of actual words? Any thoughts on the consequences of this?
@jamesbriggs · 2 years ago
Yep, that's as intended, because BERT learns the relationships between words and subwords. So BERT learns that the word 'live' (or 'liv', '##e') is a different tense but has the same meaning as the word 'living' (or 'liv', '##ing'). In a sense, the way we understand words can be viewed as 'subword': I can read 'liv' and associate it with the action 'to live', then read the suffix '-ing' and understand the action 'to live in the present' - hope that makes sense! In more practical terms it also reduces the vocab size: rather than having the words ['live', 'living', 'lived', 'be', 'being', 'give', 'giving'], we have ['liv', 'be', 'giv', '-ing', '-ed'].
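To see this in practice (the exact splits depend on the WordPiece vocabulary; 'embeddings' is one word that bert-base-uncased does break into pieces):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# A word the WordPiece vocabulary splits into sub-words:
print(tokenizer.tokenize("embeddings"))  # ['em', '##bed', '##ding', '##s']

# Masking happens at the token level, so a single sub-word piece can be masked:
ids = tokenizer("I like embeddings", return_tensors="pt")["input_ids"]
ids[0, 4] = tokenizer.mask_token_id      # position 4 is the '##bed' piece here
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))
# ['[CLS]', 'i', 'like', 'em', '[MASK]', '##ding', '##s', '[SEP]']
```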
@soumyasarkar4100 · 2 years ago
@@jamesbriggs Thanks for the clarification!
@rog0079 · 3 years ago
So this series is basically about how to pre-train BERT for any language or text from scratch, right?
@jamesbriggs · 3 years ago
The way I've used it so far is for fine-tuning - you can use the same methods as in pretraining to fine-tune BERT on more specific language (improving performance on specific use cases). But it's pretty open-ended, and I'm planning to do some videos on training from scratch in a different language. I'll upload a series intro soon too :)
@rog0079 · 3 years ago
@@jamesbriggs Yes, training from scratch on a new language would be so much help!!! I'll be waiting for those videos :D Thanks a lot, your channel is a gem!
@rashdakhanzada8058 · 1 year ago
How can I use BERT for Roman Urdu?