Chris, I am extremely impressed. You are doing a great job explaining what's going on and breaking things down so things make sense. Please keep up the good work.
@ChrisMcCormickAI · 4 years ago
That's good to hear, thanks Laurin!
@SumanDebnathMTAIE · 3 years ago
Chris, you will have a special place in heaven, we all are so blessed to have you.
@VasudevSharma01 · a year ago
This channel does not have enough visibility; this is top-notch content, and it almost feels like we are doing a group activity. Every subscriber here is worth 100x more than any popular media subscriber.
@Breaking_Bold · 7 months ago
Excellent, very good! Chris, I wish I had found this channel sooner.
@TheSiddhartha2u · 4 years ago
Hi Chris, I am an enthusiast who also started learning Deep Learning and related areas some time back. I came across your videos on YouTube while primarily searching for something related to BERT. You really explain things in a simple way and are doing a great job. Thank you very much for helping me and other enthusiastic people like me.
@nospecialmeaning2 · 4 years ago
I like how additional pieces of information are thrown in. I am super new to this, and I hadn't known what fastText was! Thank you so much!
@kyungtaelim4412 · 4 years ago
What a nice explanation! As an NLP researcher, I can say this is a wonderful lecture even for people who are already familiar with BERT.
@sahibsingh1563 · 4 years ago
Real Gem of work. This tutorial really flabergasted me :)
@ChrisMcCormickAI · 4 years ago
Thanks Sahib :)
@debadrisengupta849 · 3 years ago
*flabbergasted
@briancase6180 · 2 years ago
Finally! Someone explains how the embeddings work. For some reason, I just never saw that explained. I figured it out from other explanations of how the embeddings are combined, but I still wondered if I understood correctly. Now I know! Thanks!!
@mohamedramadanmohamed3744 · 4 years ago
Nothing is better than this series for such an important topic like BERT; it is amazing. Chris, thanks, and please keep going.
@ms1153 · 3 years ago
Thank you very much for your effort to bring light to this field. I spent weeks digging through the web looking for good BERT/Transformers explanations. My feeling is that there are a lot of videos/tutorials made by people who actually don't understand what they are trying to explain. After finding you and buying your material, I started to understand BERT and Transformers. MANY THANKS.
@vinodp8577 · 4 years ago
I think the reason words are broken into chunks with the first chunk not having the ## prefix is to be able to decode a BERT output. Say we are doing a Q&A-like task: if every token the model gave out were a ##subword token, then we could not uniquely decode the output. If only the second chunk onwards has the ## prefix, you can uniquely decode BERT's output. Great video. Thank you!
@ChrisMcCormickAI · 4 years ago
Thanks Vinod, that's a good point! By differentiating, you have (mostly) what you need to reconstruct the original sentence. I know that SentencePiece took this a step deeper by representing whitespace. As for Q&A, BERT on its own isn't able to generate text sequences at its output; it's the same tokens on the input and output. But maybe you know of a way to add something on top of BERT to generate text? Thanks!
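The reconstruction argument in this exchange can be sketched in code. Below is a toy illustration of WordPiece-style greedy longest-match tokenization, not the actual BERT implementation; the mini-vocabulary is made up (the real bert-base-uncased vocab has roughly 30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching vocab pieces; non-initial
    pieces carry the '##' continuation prefix, so the split is reversible."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def detokenize(pieces):
    """Invert the split: '##' marks 'glue to previous piece', which is
    exactly why marking non-initial pieces makes decoding unambiguous."""
    return "".join(p[2:] if p.startswith("##") else " " + p for p in pieces).strip()

# Hypothetical mini-vocabulary for illustration:
vocab = {"em", "##bed", "##ding", "bed", "##s", "play"}
print(wordpiece_tokenize("embedding", vocab))  # ['em', '##bed', '##ding']
print(wordpiece_tokenize("beds", vocab))       # ['bed', '##s']
print(wordpiece_tokenize("xyzzy", vocab))      # ['[UNK]']
print(detokenize(["em", "##bed", "##ding"]))   # 'embedding'
```

Note how 'bed' and '##bed' are distinct vocabulary entries, and how `detokenize` recovers the original word, which is Vinod's point about unique decoding.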
@thalanayarmuthukumar5472 · 4 years ago
Explained very simply. It feels like we are with you on your learning journey. Thanks!
@JamesBond-ux1uo · 2 years ago
Thanks, Chris, for the great explanation.
@ChrisMcCormickAI · 2 years ago
Glad it was helpful!
@DarkerThanBlack89 · 4 years ago
Really amazing work!! I am so glad I found this. You would make a great teacher for sure! Thank you for all your work.
@ChrisMcCormickAI · 4 years ago
Thank you! It's a lot of work to put this stuff together, so I really appreciate your encouragement :)
@mahmoudtarek6859 · 3 years ago
I hope you reach 10 million subs. Great instructor. God bless you.
@deeplearningai5523 · 3 years ago
Looking at the number of likes, it seems not many learners look hard for quality material. This whole series not only teaches the topic but also gives a good idea of investigative research and thus of building knowledge. Thank you for such amazing work.
@isaacho4181 · 4 years ago
Clear explanation. Rewatched many times.
@shardulkulkarni3999 · 2 years ago
Very informative and crisp and clear. I find you very unique and confident in what you teach. I've been surfing the web and watching YouTube videos on NLP for about a week now, and bluntly put, I did not understand jack with what's already out there. Someone who explains the theory doesn't teach you the code, while someone who puts out the code doesn't explain the theory, so as a result I cannot understand what is really going on. I really, really love your videos and hope that you keep on making them.
@ChrisMcCormickAI · 2 years ago
Loved reading this--thanks so much, Shardul! 😁
@computervisionetc · 4 years ago
Chris, the excellent quality of this tutorial "flabbergasts" me (just kidding, all your tutorials are excellent)
@ChrisMcCormickAI · 4 years ago
Ha! Thank you :)
@wiseconcepts774 · a year ago
Awesome videos, helpful for understanding the applications properly. Keep up the excellent work, Chris!
@emmaliu6524 · 4 years ago
Thank you so much for sharing! This is incredibly useful for someone searching for tutorials to learn BERT.
@saeideh1223 · 2 years ago
This is amazing. Thank you for your understandable explanation.
@rahulpadhy7544 · a year ago
Just awesome, simply too good!!
@nikhilcherian8931 · 4 years ago
Awesome video series, covering a lot of areas and analogies. I am also researching BERT, and your videos provided a nice picture of it.
@qian2718 · 4 years ago
Thank you so much! These videos really saved my life.
@ChrisMcCormickAI · 4 years ago
Awesome, good to hear!
@СергейПащенко-р5ж · 4 years ago
Thanks a lot for your lessons, Chris. Hello from Ukraine)
@roushanraj8530 · 4 years ago
Chris, you are awesome sir, you make it very simple and easy to understand 💯✔✔💯
@dec13666 · 3 years ago
A very kroxldyphivc video! Keep it up with the good work!
@alimansour951 · 3 years ago
Thank you very much for this very useful video!
@fornication4me · a year ago
Here are my two cents about "##" in subwords. First, I agree with Chris: it is a quick and smart way of differentiating between "bed" and "##bed", since their position in the word constructs different meanings. This might be inspired by the word-stem approach, where the main meaning of the word is carried by the stem but slightly modified (according to tense, person, etc.) by conjugation. However, I suspect adding "##" to subword parts could also help differentiate between subword functions. I am guessing BERT also makes use of the collocation of words to estimate their vectors. In that case, the model can alter the representation of "##ding" based on its co-occurrence in "bedding" and "embedding", hence creating a more distinct representation of these words.
@akashkadel7281 · 4 years ago
Hey Chris, this video helped me understand the concepts with so much ease. You are doing amazing work for people new to advanced NLP. Thank you so much :)
@AG-en1ht · 4 years ago
You make really great videos for beginners. Thank you very much!
@ChrisMcCormickAI · 4 years ago
Thanks, Antonio!
@majinfu · 4 years ago
Thank you for your great explanation and exploration! It helped me a lot in understanding BERT's tokenizer. Many thanks!
@notengonickname · 4 years ago
Thank you... just what I needed to start my journey in BERT
@alassanndiallo · 3 years ago
Amazing Professor! A very impressive way to teach.
@mahadevanpadmanabhan9314 · 4 years ago
Very well done. A lot of effort went into analyzing the vocab.txt file.
@ChrisMcCormickAI · 4 years ago
Thanks Mahadevan!
@prasanthkara · 4 years ago
Chris, a very informative video, especially for beginners. Thanks a lot!
@ChrisMcCormickAI · 4 years ago
Great! Thanks for the encouragement!
@vladimirbosinceanu5778 · 3 years ago
This is amazing Chris. A huge thank you!
@sarthakdargan6447 · 4 years ago
Chris, you are doing really, really great work. Big fan! Will definitely look for more awesome content.
@ChrisMcCormickAI · 4 years ago
Thanks Sarthak, appreciate the encouragement!
@sushasuresh1687 · 4 years ago
Love, love, love your videos. Thank you!!
@j045ua · 4 years ago
Hey Chris! Great content! :D
@ChrisMcCormickAI · 4 years ago
Thanks Joshua! Have any plans to use BERT?
@turkibaghlaf4565 · 4 years ago
Thank you so much for this great content, Chris!
@ChrisMcCormickAI · 4 years ago
My pleasure!
@RedionXhepa · 4 years ago
Great video. Thanks a lot!
@ChrisMcCormickAI · 4 years ago
Thanks for the encouragement!
@ZhonghaiWang · 3 years ago
Thanks for such great content, Chris!
@adityachhabra2629 · 4 years ago
Super informative! Please continue the good work.
@manikanthreddy8906 · 4 years ago
Thanks Chris. That really helped.
@arjunkoneru5461 · 4 years ago
I think they do it because by adding ## they can reconstruct the original tokens back.
@scottmeredith4578 · 4 years ago
Very good video. For further research on names, you could check against the Social Security Death Index file: every first and last name registered in the USA's Social Security database from the start of SS up to the year 2013, several million unique names, broken into first and last.
@felipeacunagonzalez4844 · 4 years ago
Thank you Chris, you are great
@xiquandong1183 · 4 years ago
Hey, this was an excellent video. You said at the beginning that we have a dictionary of token indices and embeddings. How are these embeddings obtained during pre-training? Are they initialized with some context-independent embedding like GloVe, or do we get the embeddings by having a weight matrix of size (30,000 x 768), where each input is represented as a one-hot vector, and then use those trained parameters to get the embeddings?
@ChrisMcCormickAI · 4 years ago
Hi Rajesh, thanks for your comment. When the Google team trained BERT on the Book Corpus (the "pre-training" step), I have to assume that they started with randomly initialized embeddings. Because they used a WordPiece model, with a vocabulary selected on the statistics of their training set, I don't think they could have used other pre-trained embeddings even if they wanted to. So the embeddings would have been randomly initialized, and then learned as part of training the whole BERT model. Does that answer your question? Thanks, Chris
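The point in this exchange can be sketched numerically: the token-embedding layer is just a trainable matrix, randomly initialized before pre-training, and looking up a token id is equivalent to multiplying a one-hot vector by that matrix. The sizes below are shrunk for illustration (real BERT-base uses roughly a 30,000 x 768 matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 4                      # toy sizes for illustration
E = rng.normal(0.0, 0.02, (vocab_size, hidden)) # randomly initialized embedding matrix

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# One-hot matrix multiplication and direct row indexing give the same vector,
# which is why implementations use a plain lookup instead of a matmul:
assert np.allclose(one_hot @ E, E[token_id])
```

During pre-training, gradients flow into the rows of `E`, so the embeddings are learned jointly with the rest of the model, as Chris describes.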
@mouleshm210 · 3 years ago
Thanks, you made my day :)
@vijayendrasdm · 4 years ago
Hi Chris, great video. I have a question. At 12:10 you say, "The tokens at the beginning of the word can be redundant with the token for the whole word." So in the bedding example (bed, ##ding): does it mean the tokens for "bed" and "bedding" would be similar? I see there is a token for ##ding in the vocab, so the token for "bedding" must be the combination of the token for "bed" plus the token for "##ding"?
@bavalpreetsingh4664 · 3 years ago
I am glad that someone has raised this question; the same doubt came to my mind as well.
@bavalpreetsingh4664 · 3 years ago
@ChrisMcCormickAI please let us know ASAP, thank you :)
@eliminatorism · 4 years ago
You are amazing. Thank you so much!
@ChrisMcCormickAI · 4 years ago
Haha, thank you!
@vivekmittal3478 · 4 years ago
Great series! Keep it up!
@ChrisMcCormickAI · 4 years ago
Thanks Vivek!
@ZohanSyahFatomi · a year ago
JUST WOW WOW WOW
@gulabpatel1480 · 2 years ago
Really great job!! I have one doubt: why do we still have [UNK] and [unused] tokens when subword embeddings are used to handle unknown words?
@nickryan368 · 2 years ago
Good question! [unused] tokens are deliberately left as empty slots so that, if you need to, after training you can initialize a custom word using one of these open slots. [UNK] is for characters or subwords that appeared so infrequently that they did not make it into the vocabulary. The vocabulary is limited to a certain number of words/subwords/characters, and if you train on a large corpus of internet text you are likely to run into some very rare symbols/emojis/Unicode characters that aren't used frequently enough to merit a slot in the vocabulary and can't be decomposed into any of the other, more common subwords in the vocabulary.
@Julia-ej4jz · a year ago
1) It is interesting why em-bed-ding was split this way. These word pieces have nothing to do with the meaning of the whole word. It seems that careless splitting only deteriorates the results. 2) How are different senses handled in this dictionary? For example, "like" as a verb and as a conjunction. What about "like" the noun and "like" the verb? Do they correspond to different entries in the dictionary?
@breezhang4660 · 4 years ago
Thanks so much for this great video! How funny that the name list does not have your last name; you are a uniquely great e-educator! You mentioned you wonder why "##" is not added to the first subword. Looking around the web, I see others' work where the first subword of a word is labeled but the remaining "##" subwords are not (or get only default labels). Then they can, if they want, give the same label to the ##-subwords at least, especially for labeling the subwords of someone's last name ;)
@ChrisMcCormickAI · 4 years ago
Thanks Bree! When you mention "labelling" subwords, are you referring to marking them with the '##' (e.g., 'em' is "labeled" as '##em'), or are you referring to something else, like classification labels for something like Named Entity Recognition?
@breezhang4660 · 4 years ago
@@ChrisMcCormickAI Yes, I was referring to the latter: classification tasks like labeling identities. I thought that seems like one useful consequence of not marking the first subword with ##. So even if a one-word last name gets wrongly split into subwords, the labels remain roughly correct.
@maikeguderlei2146 · 4 years ago
Hi Chris, thanks so much for the input and insights on how BERT embedding works. I think there is a small mistake, though, as the values given for the special tokens are one too high. The PAD token in BERT is indexed as 0, the UNK token as 100, and so forth, since Python starts enumerating at 0 rather than 1, which gets lost when saving the tokens with their indices to a txt file. At least these are the values that I get when I try tokenizer.vocab["[PAD]"] etc.
@ChrisMcCormickAI · 4 years ago
Thanks for pointing that out, Maike! In the Colab Notebook I mention that the indices I'm reporting are 1-indexed rather than 0-indexed. I did this because I was referring to the line numbers in the "vocabulary.txt" file. It probably would have been better to keep things 0-indexed like you're saying, though.
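The off-by-one discussed here comes down to Python's 0-based `enumerate` versus a text editor's 1-based line numbers. A minimal sketch (the vocabulary lines below are illustrative, not the real file):

```python
# Toy first lines of a vocabulary.txt file (illustrative only).
vocab_lines = ["[PAD]", "[unused0]", "[unused1]"]

# Building the lookup the way a tokenizer does, 0-indexed:
token_to_id = {tok: i for i, tok in enumerate(vocab_lines)}

print(token_to_id["[PAD]"])  # 0 -- but [PAD] sits on line 1 of the file
```

So `tokenizer.vocab` reports 0-based ids, while line numbers in the file are each one higher.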
@dinasamir2778 · 4 years ago
Thanks a lot for the great explanation.
@ChrisMcCormickAI · 4 years ago
You are welcome!
@ppujari · 4 years ago
This is one of the most excellent tutorials on BERT. I have a question: in the word-embedding lookup table, what would, say, 5 example features look like for the word "I"?
@sumeetseth22 · 4 years ago
Thanks Chris!
@ChrisMcCormickAI · 4 years ago
You bet!
@ais3153 · 3 years ago
As I understand it, BERT has only 30,000 tokens with 768 features each, but BERT represents each word with a contextualized vector (i.e., one that depends on the word's context). Could someone explain that to me?
@DebangaRajNeog · 4 years ago
Great tutorial.
@akshatgupta1710 · 3 years ago
Nice video! One question: why does BERT have Hindi and Chinese characters when the corresponding words are not in the vocabulary? What are some applications/uses of having them in the vocabulary?
@fdfdfd65 · 4 years ago
Hello Chris. I'm wondering how to get the hidden states of a word split up by the tokenizer. For example, the word 'embedding' is split into: em ##bed ##ding. Each token would have a different hidden-state vector, right? Do you know some way of combining these hidden states into a single one that represents the original whole word, in order to do some manipulations? My objective is to compare similarity between words like 'embeddings' and 'embedded'. Thanks for your attention. Congrats on the EXCELLENT content of your YouTube channel. Sorry for my poor English skills.
@lukt · 2 years ago
Thank you for this awesome explanation! :) I just have some trouble wrapping my head around why, e.g., splitting "embedding" into "em", "bed" and "ding" would make any sense at all. Yes, the subtokens might be in the vocabulary, but "bed", "em" and "ding" have little to do with the meaning of the word "embedding" and should therefore be quite far from "embedding" in the vector space, or am I missing something?
@nickryan368 · 2 years ago
Good question! Note that "embedding" gets split into "em" "##bed" and "##ding", so "bed" is a totally different token than "##bed," the first is one you sleep on and the second is "the middle or end subtoken of some larger word." That helps the problem somewhat. Also note that the embeddings are learned in the context of the sentence, so while river "bank" and financial "bank" have the same embedding, this embedding "understands" what the meaning should be in different contexts. But the short answer is simply that these embeddings are trained for a very long time on a lot of data. It is unintuitive that they should work in different contexts, and that when combined they should form a word that has a distinct meaning from the individual parts, but it's mostly a result of the sheer amount of pretraining. Hope that helps.
@lukt · 2 years ago
@@nickryan368 Ahh, I see! That makes a whole lot more sense now! I missed the fact that "##bed" and "bed" are in fact different tokens. I guess the training process differentiates the two even further (by learning the contexts in which they appear). Thank you for taking the time to write a detailed response! :)
@shaofengli5605 · 4 years ago
Great Job!
@seonhighlightsvods9193 · 3 years ago
thanks :)
@prakashkafle454 · 3 years ago
Like word2vec, how does BERT make an embedding for every word? Waiting for the detailed steps. If I pass 512 as the sequence length, what will the dimensions be?
@techwithshrikar3236 · 4 years ago
Great content. What distance measure do you normally use for similarity?
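This question goes unanswered in the thread; for what it's worth, a common choice for comparing embedding vectors is cosine similarity. A minimal sketch (the vectors are arbitrary examples, not BERT outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_similarity(a, b), 3))  # 0.707
```

Cosine similarity ignores vector magnitude, which is usually what you want when comparing embeddings of different tokens or sentences.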
@tervancovan · 4 years ago
I loooove it
@venkatramanirajgopal7364 · 4 years ago
Hi Chris. Thank you for this video. I notice lots of single words from different languages: Chinese, Tamil, Japanese, Bengali. Is BERT-base uncased trained on a multilingual corpus?
@monart4210 · 4 years ago
I really hope you didn't actually mention this, as I watched (and enjoyed! ;) ) the full video two weeks ago, but I don't remember you mentioning it... I am a bit confused as to how to obtain embeddings for actual words instead of WordPieces. I am using different kinds of embeddings for topic modeling and for topic exploration (nearest neighbours, ...). I want to use actual words, not "em", "bed", "ding" instead of "embedding". :) Does applying BERT not make sense in such an unsupervised task, and should I rather stick to ELMo, which gives me an embedding per word? Maybe you could give me some feedback :)
@ChrisMcCormickAI · 4 years ago
Yeah, I never thought about that! I would probably try just averaging the embeddings for the different word pieces together to create a single embedding. I'm curious about your application--I've used word embeddings from word2vec to find "similar terms", but this isn't possible with BERT embeddings, since they are context-dependent (there's not a single static vocab of embeddings, as there is in word2vec). How will you make use of these "contextualized" embeddings? Thanks, Chris
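Chris's averaging suggestion can be sketched as follows. The vectors here are made-up placeholders; in practice they would be the hidden-state vectors BERT produces for the pieces ['em', '##bed', '##ding']:

```python
import numpy as np

# Hypothetical per-piece vectors (in practice: BERT hidden states).
piece_vectors = {
    "em":     np.array([0.1, 0.4]),
    "##bed":  np.array([0.3, 0.0]),
    "##ding": np.array([0.2, 0.2]),
}

# Mean-pool the piece vectors to get one vector for the whole word.
word_vector = np.mean(list(piece_vectors.values()), axis=0)
print(word_vector)  # [0.2 0.2]
```

Mean pooling is only one choice; taking the first piece's vector or max-pooling are other common heuristics.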
@shyambv · 4 years ago
Thanks, Chris, for creating this video. Is it possible to add additional vocabulary to the pretrained BERT embeddings? A lot of the words I want to use are not in the BERT vocabulary.
@ChrisMcCormickAI · 4 years ago
Hi Shyam, it's not possible to modify the BERT vocabulary. However, the BERT tokenizer will break down any word that it doesn't have in its vocabulary into a set of subwords (and individual characters, if necessary!). Something I am interested to learn more about is how big of a problem this is for datasets with a lot of "jargon" (application-specific language). When you fine-tune BERT on your task, the subword embedding vectors may be adjusted to better match your task. But in general, I *suspect* that BERT will not perform well on applications with lots of jargon. What's your dataset? I'd be curious to hear how it goes for you applying BERT!
@vivekmehta4862 · 4 years ago
Hi Chris McCormick! This video gave me an easy start, thanks for it, but I really want to know how to get the feature vector for a single word. This is not explained in the video.
@ChrisMcCormickAI · 4 years ago
Hi Vivek, I just responded to another of your comments with this, but for anyone else reading--Nick and I wrote a blog post / Colab Notebook showing how to do this here: mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
@vivekmehta4862 · 4 years ago
@@ChrisMcCormickAI OK Chris, thank you so much. By the way, do you have any ideas on fine-tuning the word embeddings with the help of a document clustering task? Because I have a set of unlabelled documents.
@kelvinjose · 4 years ago
Hi, where does the initial lookup table live? And how does the whole lookup process work?
@mikemihay · 4 years ago
Hi Chris, do you know where I can find "pseudocode" for how BERT works? And also for word2vec? For word2vec I found your repo where you commented the original C code, but I feel overwhelmed going through 1000 lines of code :)
@ilanaizelman3993 · 4 years ago
Amazing
@ChrisMcCormickAI · 4 years ago
Thanks Ilan :D
@anaghajose691 · 4 years ago
Great job!! I have one doubt: why are we using a neural network? BERT itself can give the output, right?
@ChrisMcCormickAI · 4 years ago
Thanks, Anagha! Can you clarify your question for me--which neural network are you referring to? Thanks!
@HassanAmin77 · 4 years ago
Chris, I need the slides for your lectures in PPT or PDF format, to help with developing some lectures on BERT. Is it possible to share these?
@jean-baptistedelabroise5391 · 3 years ago
Feels like the BERT vocabulary is not really well built in the end; it feels like you should have only single-digit numbers. I feel like nouns might take up too much space in this vocab, too. On modifying the vocab, I found this paper: www.aclweb.org/anthology/2020.findings-emnlp.129.pdf. It seems very interesting, but I have not tested their method yet.
@Aliabbashassan3402 · 4 years ago
Hi... does BERT work with the Arabic language?
@DrOsbert · 4 years ago
Oops! You took more than 2 min for word emb 😆😜
@ChrisMcCormickAI · 4 years ago
:D It's true, I have a hard time not going into depth on everything.
@yashwatwani4574 · 4 years ago
Where can I find the vocabulary.txt file?
@ChrisMcCormickAI · 4 years ago
Hi yash, you can generate it by running the `Inspect BERT Vocabulary` notebook here: colab.research.google.com/drive/1MXg4G-twzdDGqrjGVymI86dJdm3kXADA But since you asked, I also uploaded the file and hosted it here: drive.google.com/open?id=12jxEvIxAmLXsskVzVhsC49sLAgZi-h8Q
@vinayreddy8683 · 4 years ago
BERT only has 30000 tokens!!!! You might be wrong on that.