Introduction to NLP | How to Train Custom Word Vectors

4,384 views

Normalized Nerd

1 day ago

Comments: 21
@shrutiiyyer2783 2 years ago
More such videos please, this is much better than the Udemy courses, even the paid ones!
@vishnuprabhaviswanathan546 2 years ago
Please show how to custom-train BERT embeddings.
@s.m.saifulislambadhon2654 4 years ago
Bro, in cell no. 44, what is the purpose of the tokenizer when we already tokenized the sentences into words in the preprocessing part?
@s.m.saifulislambadhon2654 4 years ago
Would you please explain cell no. 44 in a bit more detail? I think this is the most important part, and it's the one I'm missing...
@NormalizedNerd 4 years ago
Great point! In NLP preprocessing, tokenization makes it easier to clean the text; for that I generally use the nltk library. In cell 44, I did the tokenization again with the Keras Tokenizer, which gives us two very handy members: word_index and texts_to_sequences. These help us create the tensors easily. So yes, the tokenization is redundant here, but I did it anyway to make our lives easier :D
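For anyone following along, here's a minimal sketch of what the Keras Tokenizer adds on top of plain tokenization; the corpus and num_words below are made-up placeholders, not the video's data:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["the movie was great", "the movie was terrible"]

tokenizer = Tokenizer(num_words=10000)  # keep the 10,000 most frequent words
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index                    # {'the': 1, 'movie': 2, ...}
sequences = tokenizer.texts_to_sequences(sentences)  # sentences -> lists of integer ids
padded = pad_sequences(sequences, maxlen=10)         # uniform-length 2D tensor

print(word_index)
print(padded.shape)  # (2, 10)
```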
@s.m.saifulislambadhon2654 4 years ago
Thanks for the explanation
@MrStudent1978 4 years ago
Very nice explanation! I have a question... in cell no. 50, what is the sense behind "trainable = False"? The video is about training a custom word2vec... then why False?
@NormalizedNerd 4 years ago
@Gurpreet Singh I understand your confusion. We actually train our word vectors in cell 46. In cell 50, we build the embedding layer that will sit just before the LSTM units. Remember that the embedding layer is nothing but the learned word vectors (in matrix form)! So if we set trainable = True on the embedding layer, Keras will train it (i.e. the word vectors) again while performing backprop through the LSTM. We don't want that. I hope it's clear now.
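A minimal sketch of such a frozen embedding layer, assuming vocab_size, embedding_dim, and embedding_matrix come from the earlier word2vec cells (the random matrix below is just a stand-in for the learned vectors):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, embedding_dim = 10000, 100
embedding_matrix = np.random.rand(vocab_size, embedding_dim)  # stand-in for learned vectors

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              weights=[embedding_matrix],  # row i = vector of the word with id i
              trainable=False),            # frozen: backprop through the LSTM won't alter the vectors
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

dummy_batch = np.random.randint(0, vocab_size, size=(2, 10))  # two padded sequences
print(model.predict(dummy_batch).shape)  # (2, 1)
```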
@MrStudent1978 4 years ago
@NormalizedNerd Thanks for your response! I got it now.
@Lotof_Mazey 2 years ago
Sir, kindly guide: how can I use pre-trained word embedding models for local languages (or languages written in Roman script) that no pre-trained model covers? Do I have to use a (non-pre-trained) embedding layer to create the embedding matrices for a local language? How can I benefit from pre-trained models for a local language?
@NormalizedNerd 2 years ago
Hi, unfortunately there aren't many pre-trained word embeddings for romanized non-English languages. You can search, and if you find something, you can fine-tune it on your data. But I don't think there's an easy way to use English models on romanized non-English languages.
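If you do find a pre-trained model in your language, a minimal fine-tuning sketch with gensim might look like this; the model path and sentences are hypothetical, and note this needs a full saved Word2Vec model (which keeps the trainable weights), not just exported KeyedVectors:

```python
from gensim.models import Word2Vec

# hypothetical path to a full saved model
model = Word2Vec.load("pretrained_romanized.model")

new_sentences = [["kya", "haal", "hai"], ["sab", "theek", "hai"]]

model.build_vocab(new_sentences, update=True)  # add unseen words to the vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=5)
```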
@hanjes4793 3 years ago
Hello... I've got a question: in the train-test split cell, where does 'word_index' come from? Thanks!
@NormalizedNerd 3 years ago
It's the Keras Tokenizer that gives us the 'word_index'.
@rushikeshkulkarni7758 1 year ago
Why didn't we use sklearn's train_test_split?
@vishnuprabhaviswanathan546 2 years ago
Can you show how to calculate the similarity of two words using a custom-trained word2vec model?
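For reference, a minimal sketch assuming the vectors were trained with gensim's Word2Vec (gensim >= 4; the toy corpus below is made up):

```python
from gensim.models import Word2Vec

sentences = [["king", "rules", "the", "kingdom"],
             ["queen", "rules", "the", "kingdom"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

# cosine similarity between the two words' vectors
print(model.wv.similarity("king", "queen"))
```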
@ARSHABBIR100 4 years ago
Excellent. Thanks for uploading. Kindly make more videos on building a chatbot.
@NormalizedNerd 4 years ago
It's on my wish-list too! Keep supporting :)
@coxixx 4 years ago
Would you show how to train custom word vectors with GloVe using Python?
@NormalizedNerd 4 years ago
That's actually very easy. Just build your corpus (a .txt file), then use the official repo to train a GloVe model on it: github.com/stanfordnlp/GloVe
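Once the repo's training scripts have produced their plain-text vectors file, one way to use it from Python is via gensim; the file name is whatever your run produced, and no_header=True needs gensim >= 4.0:

```python
from gensim.models import KeyedVectors

# GloVe's text output has no word2vec-style header line, hence no_header=True
glove_vectors = KeyedVectors.load_word2vec_format(
    "vectors.txt", binary=False, no_header=True)

# assuming "movie" appears in your training corpus
print(glove_vectors.most_similar("movie", topn=5))
```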
@WhatsAI 4 years ago
Great video my friend!
@NormalizedNerd 4 years ago
Thank you pal :)
Sarcasm is Very Easy to Detect! GloVe + LSTM
17:08
Normalized Nerd
12K views
Introduction to NLP | Word Embeddings & Word2Vec Model
23:10
Normalized Nerd
38K views
Transformers (how LLMs work) explained visually | DL5
27:14
3Blue1Brown
4.2M views
Training Word Vectors with Facebook's fastText
24:55
tanmay bakshi
14K views
How might LLMs store facts | DL7
22:43
3Blue1Brown
912K views
Introduction to NLP | GloVe & Word2Vec Transfer Learning
21:12
Normalized Nerd
11K views
Diffusion models from scratch in PyTorch
30:54
DeepFindr
266K views
Introduction to NLP | GloVe Model Explained
23:15
Normalized Nerd
69K views
Word2Vec Easily Explained- Data Science
22:50
Krish Naik
173K views