Chris, I am extremely impressed. You are doing a great job explaining what's going on and breaking things down so things make sense. Please keep up the good work.
@ChrisMcCormickAI · 4 years ago
That's good to hear, thanks Laurin!
@SumanDebnathMTAIE · 3 years ago
Chris, you will have a special place in heaven, we all are so blessed to have you.
@VasudevSharma01 · a year ago
This channel does not have enough visibility; this is top-notch content, and it almost feels like we are doing a group activity. Every subscriber here is worth 100x more than any popular media subscriber.
@Breaking_Bold · 7 months ago
Excellent, very good! Chris, I wish I had found this channel sooner.
@TheSiddhartha2u · 4 years ago
Hi Chris, I am an enthusiast who also started learning Deep Learning and related areas some time back. I came across your videos on YouTube while primarily searching for something related to BERT. You really explain things in a simple way and are doing a great job. Thank you very much for helping me and other enthusiastic people like me.
@nospecialmeaning2 · 4 years ago
I like how additional pieces of information are thrown in. I am super new to this, and I hadn't known what fastText was! Thank you so much!
@kyungtaelim4412 · 4 years ago
What a nice explanation! As an NLP researcher, I can say this is a wonderful lecture even for people who are already familiar with BERT.
@sahibsingh1563 · 4 years ago
Real Gem of work. This tutorial really flabergasted me :)
@ChrisMcCormickAI · 4 years ago
Thanks Sahib :)
@debadrisengupta849 · 3 years ago
*flabbergasted
@briancase6180 · 2 years ago
Finally! Someone explains how the embeddings work. For some reason, I just never saw that explained. I figured it out from other explanations of how the embeddings are combined, but I still wondered if I understood correctly. Now I know! Thanks!!
@mohamedramadanmohamed3744 · 4 years ago
Nothing is better than this series for such an important topic like BERT; it is amazing. Chris, thanks, and please keep going.
@ms1153 · 3 years ago
Thank you very much for your effort to bring light to this field. I spent weeks digging through the web looking for good BERT/Transformers explanations. My feeling is that there are a lot of videos/tutorials made by people who actually don't understand what they are trying to explain. After finding you and buying your material, I started to understand BERT and Transformers. MANY THANKS.
@vinodp8577 · 4 years ago
I think the reason words are broken into chunks with the first chunk not having the ## prefix is to be able to decode a BERT output. Say we are doing a Q&A-like task: if every token the model gave out were a ##subword token, then we could not uniquely decode the output. If only the second chunk onwards has the ## prefix, you can uniquely decode BERT's output. Great video. Thank you!
@ChrisMcCormickAI · 4 years ago
Thanks Vinod, that's a good point! By differentiating, you have (mostly) what you need to reconstruct the original sentence. I know that SentencePiece took this a step deeper by representing whitespace. As for Q&A, BERT on its own isn't able to generate text sequences at its output; it's the same tokens on the input and output. But maybe you know of a way to add something on top of BERT to generate text? Thanks!
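The reconstruction argument in this exchange can be sketched in code. Below is a toy illustration of WordPiece-style greedy longest-match tokenization, not the actual BERT implementation; the mini-vocabulary is made up (the real bert-base-uncased vocab has roughly 30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching vocab pieces; non-initial
    pieces carry the '##' continuation prefix, so the split is reversible."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def detokenize(pieces):
    """Invert the split: '##' marks 'glue to previous piece', which is
    exactly why marking non-initial pieces makes decoding unambiguous."""
    return "".join(p[2:] if p.startswith("##") else " " + p for p in pieces).strip()

# Hypothetical mini-vocabulary for illustration:
vocab = {"em", "##bed", "##ding", "bed", "##s", "play"}
print(wordpiece_tokenize("embedding", vocab))  # ['em', '##bed', '##ding']
print(wordpiece_tokenize("beds", vocab))       # ['bed', '##s']
print(wordpiece_tokenize("xyzzy", vocab))      # ['[UNK]']
print(detokenize(["em", "##bed", "##ding"]))   # 'embedding'
```

Note how 'bed' and '##bed' are distinct vocabulary entries, and how `detokenize` recovers the original word, which is Vinod's point about unique decoding.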
@thalanayarmuthukumar5472 · 4 years ago
Explained very simply. It feels like we are with you on your learning journey. Thanks!
@JamesBond-ux1uo · 2 years ago
Thanks, Chris, for the great explanation.
@ChrisMcCormickAI · 2 years ago
Glad it was helpful!
@DarkerThanBlack89 · 4 years ago
Really amazing work!! I am so glad I found this. You would make a great teacher for sure! Thank you for all your work.
@ChrisMcCormickAI · 4 years ago
Thank you! It's a lot of work to put this stuff together, so I really appreciate your encouragement :)
@mahmoudtarek6859 · 3 years ago
I hope you reach 10 million subs. Great instructor. God bless you.
@deeplearningai5523 · 3 years ago
Looking at the number of likes, it seems not many learners look hard for quality material. This whole series not only teaches the topic but also gives a good idea of investigative research and thus of building knowledge. Thank you for such amazing work.
@isaacho4181 · 4 years ago
Clear explanation. Rewatched many times.
@shardulkulkarni3999 · 2 years ago
Very informative and crisp and clear. I find you very unique and confident in what you teach. I've been surfing the web and watching YouTube videos on NLP for about a week now, and bluntly put, I did not understand jack with what's already out there. Someone who explains the theory doesn't teach you the code, while someone who puts out the code doesn't explain the theory, so as a result I cannot understand what is really going on. I really, really love your videos and hope that you keep on making them.
@ChrisMcCormickAI · 2 years ago
Loved reading this--thanks so much, Shardul! 😁
@computervisionetc · 4 years ago
Chris, the excellent quality of this tutorial "flabbergasts" me (just kidding, all your tutorials are excellent)
@ChrisMcCormickAI · 4 years ago
Ha! Thank you :)
@wiseconcepts774 · a year ago
Awesome videos, helpful for understanding the applications properly. Keep up the excellent work, Chris!
@emmaliu6524 · 4 years ago
Thank you so much for sharing! This is incredibly useful for someone searching for tutorials to learn BERT.
@saeideh1223 · 2 years ago
This is amazing. Thank you for your understandable explanation.
@rahulpadhy7544 · a year ago
Just awesome, simply too good!!
@nikhilcherian8931 · 4 years ago
Awesome video series, covering a lot of areas and analogies. I am also researching BERT, and your videos provided a nice picture of it.
@qian2718 · 4 years ago
Thank you so much! These videos really saved my life.
@ChrisMcCormickAI · 4 years ago
Awesome, good to hear!
@СергейПащенко-р5ж · 4 years ago
Thanks a lot for your lessons, Chris. Hello from Ukraine)
@roushanraj8530 · 4 years ago
Chris, you are awesome sir, you make it very simple and easy to understand 💯✔✔💯
@dec13666 · 3 years ago
A very kroxldyphivc video! Keep it up with the good work!
@alimansour951 · 3 years ago
Thank you very much for this very useful video!
@fornication4me · a year ago
Here are my two cents about "##" in subwords. First, I agree with Chris: it is a quick and smart way of differentiating between "bed" and "##bed", since their position in the word constructs different meanings. This might be inspired by the word-stem approach, where the main meaning of the word is carried by the stem but slightly modified (according to tense, person, etc.) by conjugation. However, I suspect adding "##" to subword parts could also help differentiate between subword functions. I am guessing BERT also makes use of the collocation of words to estimate their vectors. In that case, the model can alter the representation of "##ding" based on its co-occurrence in "bedding" and "embedding", hence creating a more distinct representation of these words.
@akashkadel7281 · 4 years ago
Hey Chris, this video helped me understand the concepts with so much ease. You are doing amazing work for people new to advanced NLP. Thank you so much :)
@AG-en1ht · 4 years ago
You make really great videos for beginners. Thank you very much!
@ChrisMcCormickAI · 4 years ago
Thanks, Antonio!
@majinfu · 4 years ago
Thank you for your great explanation and exploration! It helped me a lot in understanding BERT's tokenizer. Many thanks!
@notengonickname · 4 years ago
Thank you... just what I needed to start my journey in BERT
@alassanndiallo · 3 years ago
Amazing Professor! A very impressive way to teach.
@mahadevanpadmanabhan9314 · 4 years ago
Very well done. A lot of effort went into analyzing the vocab.txt file.
@ChrisMcCormickAI · 4 years ago
Thanks Mahadevan!
@prasanthkara · 4 years ago
Chris, a very informative video, especially for beginners. Thanks a lot!
@ChrisMcCormickAI · 4 years ago
Great! Thanks for the encouragement!
@vladimirbosinceanu5778 · 3 years ago
This is amazing Chris. A huge thank you!
@sarthakdargan6447 · 4 years ago
Chris, you are doing really, really great work. Big fan! Will definitely look for more awesome content.
@ChrisMcCormickAI · 4 years ago
Thanks Sarthak, appreciate the encouragement!
@sushasuresh1687 · 4 years ago
Love, love, love your videos. Thank you!!
@j045ua · 4 years ago
Hey Chris! Great content! :D
@ChrisMcCormickAI · 4 years ago
Thanks Joshua! Have any plans to use BERT?
@turkibaghlaf4565 · 4 years ago
Thank you so much for this great content, Chris!
@ChrisMcCormickAI · 4 years ago
My pleasure!
@RedionXhepa · 4 years ago
Great video. Thanks a lot!
@ChrisMcCormickAI · 4 years ago
Thanks for the encouragement!
@ZhonghaiWang · 3 years ago
Thanks for such great content, Chris!
@adityachhabra2629 · 4 years ago
Super informative! Please continue the good work.
@manikanthreddy8906 · 4 years ago
Thanks Chris. That really helped.
@arjunkoneru5461 · 4 years ago
I think they do it because by adding ## they can reconstruct the original tokens back.
@scottmeredith4578 · 4 years ago
Very good video. For further research on names, you could check against the Social Security Death Index file: every first and last name registered in the USA's Social Security database from the start of SS up to the year 2013, several million unique names, broken into first and last.
@felipeacunagonzalez4844 · 4 years ago
Thank you Chris, you are great
@xiquandong1183 · 4 years ago
Hey, this was an excellent video. You said at the beginning that we have a dictionary of token indices and embeddings. How are these embeddings obtained during pre-training? Are they initialized with some context-independent embedding like GloVe, or do we get the embeddings by having a weight matrix of size (30,000 x 768), where each input is represented as a one-hot vector, and then use those trained parameters to get the embeddings?
@ChrisMcCormickAI · 4 years ago
Hi Rajesh, thanks for your comment. When the Google team trained BERT on the Book Corpus (the "pre-training" step), I have to assume that they started with randomly initialized embeddings. Because they used a WordPiece model, with a vocabulary selected on the statistics of their training set, I don't think they could have used other pre-trained embeddings even if they wanted to. So the embeddings would have been randomly initialized, and then learned as part of training the whole BERT model. Does that answer your question? Thanks, Chris
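The point in this exchange can be sketched numerically: the token-embedding layer is just a trainable matrix, randomly initialized before pre-training, and looking up a token id is equivalent to multiplying a one-hot vector by that matrix. The sizes below are shrunk for illustration (real BERT-base uses roughly a 30,000 x 768 matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 4                      # toy sizes for illustration
E = rng.normal(0.0, 0.02, (vocab_size, hidden)) # randomly initialized embedding matrix

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# One-hot matrix multiplication and direct row indexing give the same vector,
# which is why implementations use a plain lookup instead of a matmul:
assert np.allclose(one_hot @ E, E[token_id])
```

During pre-training, gradients flow into the rows of `E`, so the embeddings are learned jointly with the rest of the model, as Chris describes.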
@mouleshm210 · 3 years ago
Thanks, you made my day :)
@vijayendrasdm · 4 years ago
Hi Chris, great video. I have a question. At 12:10 you say, "The tokens at the beginning of the word can be redundant with the token for the whole word." So in the bedding example (bed, ##ding): does it mean the tokens for "bed" and "bedding" would be similar? I see there is a token for ##ding in the vocab, so the token for "bedding" must be the combination of the token for "bed" plus the token for "##ding"?
@bavalpreetsingh4664 · 3 years ago
I am glad that someone has raised this question; the same doubt came to my mind as well.
@bavalpreetsingh4664 · 3 years ago
@ChrisMcCormickAI please let us know ASAP, thank you :)
@eliminatorism · 4 years ago
You are amazing. Thank you so much!
@ChrisMcCormickAI · 4 years ago
Haha, thank you!
@vivekmittal3478 · 4 years ago
Great series! Keep it up!
@ChrisMcCormickAI · 4 years ago
Thanks Vivek!
@ZohanSyahFatomi · a year ago
JUST WOW WOW WOW
@gulabpatel1480 · 2 years ago
Really great job!! I have one doubt: why do we still have [UNK] and [unused] tokens when subword embeddings are used to handle unknown words?
@nickryan368 · 2 years ago
Good question! [unused] tokens are deliberately left as empty slots so that, if you need to, after training you can initialize a custom word using one of these open slots. [UNK] is for characters or subwords that appeared so infrequently that they did not make it into the vocabulary. The vocabulary is limited to a certain number of words/subwords/characters, and if you train on a large corpus of internet text you are likely to run into some very rare symbols/emojis/Unicode characters that aren't used frequently enough to merit a slot in the vocabulary and can't be decomposed into any of the other, more common subwords in the vocabulary.
@Julia-ej4jz · a year ago
1) It is interesting why em-bed-ding was split this way. These word pieces have nothing to do with the meaning of the whole word. It seems that careless splitting only deteriorates the results. 2) How are different senses handled in this dictionary? For example, "like" as a verb and as a conjunction. What about "like" the noun and "like" the verb? Do they correspond to different entries in the dictionary?
@breezhang4660 · 4 years ago
Thanks so much for this great video! How funny that the name list does not have your last name; you are a uniquely great e-educator! You mentioned you wonder why "##" is not added to the first subword. Looking around the web, I see others' work where the first subword of a word is labeled but the remaining "##" subwords are not (or get only default labels). Then they can, if they want, give the same label to the ##-subwords at least, especially for labeling the subwords of someone's last name ;)
@ChrisMcCormickAI · 4 years ago
Thanks Bree! When you mention "labelling" subwords, are you referring to marking them with the '##' (e.g., 'em' is "labeled" as '##em'), or are you referring to something else, like classification labels for something like Named Entity Recognition?
@breezhang4660 · 4 years ago
@@ChrisMcCormickAI Yes, I was referring to the latter: classification tasks like labeling identities. I thought that seems like one useful consequence of not marking the first subword with ##. So even if a one-word last name gets wrongly split into subwords, the labels remain roughly correct.
@maikeguderlei2146 · 4 years ago
Hi Chris, thanks so much for the input and insights on how BERT embedding works. I think there is a small mistake, though, as the values given for the special tokens are one too high. The PAD token in BERT is indexed as 0, the UNK token as 100, and so forth, since Python starts enumerating at 0 rather than 1, which gets lost when saving the tokens with their indices to a txt file. At least these are the values that I get when I try tokenizer.vocab["[PAD]"] etc.
@ChrisMcCormickAI · 4 years ago
Thanks for pointing that out, Maike! In the Colab Notebook I mention that the indices I'm reporting are 1-indexed rather than 0-indexed. I did this because I was referring to the line numbers in the "vocabulary.txt" file. It probably would have been better to keep things 0-indexed like you're saying, though.
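The off-by-one discussed here comes down to Python's 0-based `enumerate` versus a text editor's 1-based line numbers. A minimal sketch (the vocabulary lines below are illustrative, not the real file):

```python
# Toy first lines of a vocabulary.txt file (illustrative only).
vocab_lines = ["[PAD]", "[unused0]", "[unused1]"]

# Building the lookup the way a tokenizer does, 0-indexed:
token_to_id = {tok: i for i, tok in enumerate(vocab_lines)}

print(token_to_id["[PAD]"])  # 0 -- but [PAD] sits on line 1 of the file
```

So `tokenizer.vocab` reports 0-based ids, while line numbers in the file are each one higher.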
@dinasamir2778 · 4 years ago
Thanks a lot for the great explanation.
@ChrisMcCormickAI · 4 years ago
You are welcome!
@ppujari · 4 years ago
This is one of the most excellent tutorials on BERT. I have a question: in the word-embedding lookup table, what would, say, 5 example features look like for the word "I"?
@sumeetseth22 · 4 years ago
Thanks Chris!
@ChrisMcCormickAI · 4 years ago
You bet!
@ais3153 · 3 years ago
As I understand it, BERT has only 30,000 tokens with 768 features each, but BERT represents each word with a contextualized vector (i.e., one that depends on the word's context). Could someone explain that to me?
@DebangaRajNeog · 4 years ago
Great tutorial.
@akshatgupta1710 · 3 years ago
Nice video! One question: why does BERT have Hindi and Chinese characters when the corresponding words are not in the vocabulary? What are some applications/uses of having them in the vocabulary?
@fdfdfd65 · 4 years ago
Hello Chris. I'm wondering how to get the hidden states of a word split up by the tokenizer. For example, the word 'embedding' is split into: em ##bed ##ding. Each token would have a different hidden-state vector, right? Do you know some way of combining these hidden states into a single one that represents the original whole word, in order to do some manipulations? My objective is to compare similarity between words like 'embeddings' and 'embedded'. Thanks for your attention. Congrats on the EXCELLENT content of your YouTube channel. Sorry for my poor English skills.
@lukt · 2 years ago
Thank you for this awesome explanation! :) I just have some trouble wrapping my head around why, e.g., splitting "embedding" into "em", "bed" and "ding" would make any sense at all. Yes, the subtokens might be in the vocabulary, but "bed", "em" and "ding" have little to do with the meaning of the word "embedding" and should therefore be quite far from "embedding" in the vector space, or am I missing something?
@nickryan368 · 2 years ago
Good question! Note that "embedding" gets split into "em" "##bed" and "##ding", so "bed" is a totally different token than "##bed," the first is one you sleep on and the second is "the middle or end subtoken of some larger word." That helps the problem somewhat. Also note that the embeddings are learned in the context of the sentence, so while river "bank" and financial "bank" have the same embedding, this embedding "understands" what the meaning should be in different contexts. But the short answer is simply that these embeddings are trained for a very long time on a lot of data. It is unintuitive that they should work in different contexts, and that when combined they should form a word that has a distinct meaning from the individual parts, but it's mostly a result of the sheer amount of pretraining. Hope that helps.
@lukt · 2 years ago
@@nickryan368 Ahh, I see! That makes a whole lot more sense now! I missed the fact that "##bed" and "bed" are in fact different tokens. I guess the training process differentiates the two even further (by learning the contexts in which they appear). Thank you for taking the time to write a detailed response! :)
@shaofengli5605 · 4 years ago
Great Job!
@seonhighlightsvods9193 · 3 years ago
thanks :)
@prakashkafle454 · 3 years ago
Like word2vec, how does BERT make an embedding for every word? Waiting for the detailed steps. If I pass 512 as the sequence length, what will the dimensions be?
@techwithshrikar3236 · 4 years ago
Great content. What distance measure do you normally use for similarity?
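This question goes unanswered in the thread; for what it's worth, a common choice for comparing embedding vectors is cosine similarity. A minimal sketch (the vectors are arbitrary examples, not BERT outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_similarity(a, b), 3))  # 0.707
```

Cosine similarity ignores vector magnitude, which is usually what you want when comparing embeddings of different tokens or sentences.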
@tervancovan · 4 years ago
I loooove it
@venkatramanirajgopal7364 · 4 years ago
Hi Chris. Thank you for this video. I notice lots of single words from different languages: Chinese, Tamil, Japanese, Bengali. Is BERT-base uncased trained on a multilingual corpus?
@monart4210 · 4 years ago
I really hope you didn't actually mention this, as I watched (and enjoyed! ;) ) the full video two weeks ago, but I don't remember you mentioning it... I am a bit confused as to how to obtain embeddings for actual words instead of WordPieces. I am using different kinds of embeddings for topic modeling and for topic exploration (nearest neighbours, ...). I want to use actual words, not "em", "bed", "ding" instead of "embedding". :) Does applying BERT not make sense in such an unsupervised task, and should I rather stick to ELMo, which gives me an embedding per word? Maybe you could give me some feedback :)
@ChrisMcCormickAI · 4 years ago
Yeah, I never thought about that! I would probably try just averaging the embeddings for the different word pieces together to create a single embedding. I'm curious about your application--I've used word embeddings from word2vec to find "similar terms", but this isn't possible with BERT embeddings, since they are context-dependent (there's not a single static vocab of embeddings, as there is in word2vec). How will you make use of these "contextualized" embeddings? Thanks, Chris
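Chris's averaging suggestion can be sketched as follows. The vectors here are made-up placeholders; in practice they would be the hidden-state vectors BERT produces for the pieces ['em', '##bed', '##ding']:

```python
import numpy as np

# Hypothetical per-piece vectors (in practice: BERT hidden states).
piece_vectors = {
    "em":     np.array([0.1, 0.4]),
    "##bed":  np.array([0.3, 0.0]),
    "##ding": np.array([0.2, 0.2]),
}

# Mean-pool the piece vectors to get one vector for the whole word.
word_vector = np.mean(list(piece_vectors.values()), axis=0)
print(word_vector)  # [0.2 0.2]
```

Mean pooling is only one choice; taking the first piece's vector or max-pooling are other common heuristics.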
@shyambv · 4 years ago
Thanks, Chris, for creating this video. Is it possible to add additional vocabulary to the pretrained BERT embeddings? A lot of the words I want to use are not in the BERT vocabulary.
@ChrisMcCormickAI · 4 years ago
Hi Shyam, it's not possible to modify the BERT vocabulary. However, the BERT tokenizer will break down any word that it doesn't have in its vocabulary into a set of subwords (and individual characters, if necessary!). Something I am interested to learn more about is how big of a problem this is for datasets with a lot of "jargon" (application-specific language). When you fine-tune BERT on your task, the subword embedding vectors may be adjusted to better match your task. But in general, I *suspect* that BERT will not perform well on applications with lots of jargon. What's your dataset? I'd be curious to hear how it goes for you applying BERT!
@vivekmehta4862 · 4 years ago
Hi Chris McCormick! This video gave me an easy start, thanks for it, but I really want to know how to get the feature vector for a single word. This is not explained in the video.
@ChrisMcCormickAI · 4 years ago
Hi Vivek, I just responded to another of your comments with this, but for anyone else reading--Nick and I wrote a blog post / Colab Notebook showing how to do this here: mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
@vivekmehta4862 · 4 years ago
@@ChrisMcCormickAI OK Chris, thank you so much. By the way, do you have any ideas on fine-tuning the word embeddings with the help of a document clustering task? Because I have a set of unlabelled documents.
@kelvinjose · 4 years ago
Hi, where does the initial lookup table live? And how does the whole lookup process work?
@mikemihay · 4 years ago
Hi Chris, do you know where I can find "pseudocode" for how BERT works? And also for word2vec? For word2vec I found your repo where you commented the original C code, but I feel overwhelmed going through 1000 lines of code :)
@ilanaizelman3993 · 4 years ago
Amazing
@ChrisMcCormickAI · 4 years ago
Thanks Ilan :D
@anaghajose691 · 4 years ago
Great job!! I have one doubt: why are we using a neural network? BERT itself can give the output, right?
@ChrisMcCormickAI · 4 years ago
Thanks, Anagha! Can you clarify your question for me--which neural network are you referring to? Thanks!
@HassanAmin77 · 4 years ago
Chris, I need the slides for your lectures in PPT or PDF format, to help with developing some lectures on BERT. Is it possible to share these?
@jean-baptistedelabroise5391 · 3 years ago
Feels like the BERT vocabulary is not really well built in the end; it feels like you should have only single-digit numbers. I feel like nouns might take up too much space in this vocab, too. On modifying the vocab, I found this paper: www.aclweb.org/anthology/2020.findings-emnlp.129.pdf. It seems very interesting, but I have not tested their method yet.
@Aliabbashassan3402 · 4 years ago
Hi... does BERT work with the Arabic language?
@DrOsbert · 4 years ago
Oops! You took more than 2 min for word emb 😆😜
@ChrisMcCormickAI · 4 years ago
:D It's true, I have a hard time not going into depth on everything.
@yashwatwani4574 · 4 years ago
Where can I find the vocabulary.txt file?
@ChrisMcCormickAI · 4 years ago
Hi yash, you can generate it by running the `Inspect BERT Vocabulary` notebook here: colab.research.google.com/drive/1MXg4G-twzdDGqrjGVymI86dJdm3kXADA But since you asked, I also uploaded the file and hosted it here: drive.google.com/open?id=12jxEvIxAmLXsskVzVhsC49sLAgZi-h8Q
@vinayreddy8683 · 4 years ago
BERT only has 30000 tokens!!!! You might be wrong on that.