Natural Language Processing - Tokenization (NLP Zero to Hero - Part 1)

451,245 views

TensorFlow

Comments: 151

@jawadmansoor6064 1 year ago

Such a great lecture on NLP wow. I wish I had found it when it was uploaded, saving me two years.

@abdulbasitnisar 3 months ago

What did you do for two years? I mean, which course?

@Code4You1 2 years ago

Simple and straight to the point I love it!

@chiomaanyiam1138 2 years ago

Wow! Thank you for breaking this down in such an easy way.

@TheAdamSmithh 4 years ago

Thank you so much! This is so informative, so quickly, in well structured lessons. I'm using a TensorFlow package for R and this helps me understand my project so much better!

@muhammadsamy4081 6 months ago

Why are you using R instead of Python?

@chowadagod 4 years ago

I've always been discouraged learning NLP ..But you've just made it a whole lot easier

@laurencemoroney655 4 years ago

It's a huge field, and I'm just scratching the surface. I hope it's useful! :)

@muhammadhananasghar4326 3 years ago

Best explanation ever. The best teacher I have ever listened to.

@asadanees781 3 years ago

Thanks, Laurence Moroney, you are a blessing for us! Awesome information.

@rabadaba7 1 year ago

I love your videos! They are very professional and concepts are very clearly explained.

@18lan 2 years ago

If you are confused, like I was, about why "love" receives index 1, go to the end of this video, where it's explained: Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing

@nishalk781 4 years ago

Thanks for making it clear waiting for the next one

@laurencemoroney655 4 years ago

Thanks!

@Idontknowcode512 1 year ago

Thanks you made it so easy for me to understand nlp 🙏

@TensorFlow 1 year ago

We're happy to hear that the video was helpful. If you'd like to learn more about NLP, check out the NLP Zero to Hero playlist → goo.gle/nlp-z2h

@Idontknowcode512 1 year ago

@TensorFlow I have checked it. But I have one request: can we build a model like ChatGPT using TensorFlow? 🤔

@coded6799 3 years ago

This is a godsend. No other definition is possible.

@rishibhatia5056 8 months ago

Thanks for making it clear.

@kelvinsmith4894 4 years ago

Lol, you explained this so well that it made me want to implement my own library for tokenization

@rishavpaudel7591 4 years ago

loooool I too had this feeling :D :D

@singhanubhav 3 years ago

For NLP freshers - this video is more about encoding than being about tokenization itself. Read about both topics separately before going through this video to better understand it.

@srikrithibharadwaj6779 4 years ago

Thank u so much 🙏🏻 such a great information.

@LaurenceMoroney 4 years ago

Welcome!

@akshayshah483 4 years ago

Yeah. Zero to hero back

@laurencemoroney655 4 years ago

For 3 episodes, and I'm working on another 3 for text generation to come out in the not-too-distant future. I hope!

@biswanthpinnika7149 1 year ago

we can also use tokenization for converting sentences to words

@mattymallz4207 4 years ago

Fantastic video! Very informative. Thank you for sharing TensorFlow!

@laurencemoroney655 4 years ago

Thanks, Matty!

@mattymallz4207 4 years ago

Laurence Moroney, I have a specific TensorFlow question regarding Beautiful Soup, specifically about gathering text from HTML output. Is there any way we could start a dialogue?

@sharawyabdul6222 3 years ago

Thank u so much , This is very well explained.

@sunildingankar8657 3 months ago

I have been working in Marathi (an Indian regional language) for the last 20-odd years. For the last 8 years I have been working as a writer-translator. If I learn NLP, will I be able to combine my Marathi linguistic skills and NLP skills in practical use? If yes, how, and where can I use them?

@quadraticlife8314 2 years ago

Incredibly amazing!

@ronnierendel9503 4 years ago

Amazingly well said

@harikrishnanb7273 1 year ago

Tokenizer is deprecated now

@louiebeatty1872 4 months ago

what's used now?

@arpanghoshal6910 3 years ago

He's a Tensorflow guru!

@balachkhan1578 4 years ago

It's great. Waiting for the next.

@laurencemoroney655 4 years ago

Glad you enjoyed!

@georgesteele4838 1 year ago

Excellent presentation.

@819rajiv 3 years ago

Thank you so much, sir, for the great videos.

@fahemhamou6170 2 years ago

Thank you very much

@benjaminkimmang1962 1 year ago

quite informative. thanks.

@TensorFlow 1 year ago

Glad it was helpful!

@BeGreatttt 4 years ago

Great explanation, thanks a lot!!!

@ashimkarki9652 4 years ago

The legend is back

@laurencemoroney655 4 years ago

But you got me instead ;)

@sharjeelzubair4106 5 months ago

sentences = [
    'كم سعر الراجحي',
    'ما هي قيمة الراجحي؟',
    'هل تعرف سعر أرامكو؟'
]
{'سعر': 1, 'كم': 2, 'الراجحي': 3, 'ما': 4, 'هي': 5, 'قيمة': 6, 'الراجحي؟': 7, 'هل': 8, 'تعرف': 9, 'أرامكو؟': 10}
It's putting الراجحي and الراجحي? as two tokens. Is that because of Arabic?
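A possible explanation for the question above: the default filter string of the Keras Tokenizer lists ASCII punctuation only, so the Arabic question mark (U+061F) is never stripped and stays glued to the word. The sketch below imitates that filtering step in plain Python rather than calling TensorFlow; `split_words` and `ARABIC_FILTERS` are names made up for illustration, and the choice of extra characters is an assumption. With the real Tokenizer, the extended string would presumably be passed through its `filters` parameter.

```python
# Default filter characters documented for the Keras Tokenizer (ASCII only).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
# Assumption: the Arabic question mark, comma and semicolon are the extra
# punctuation marks worth stripping for this data.
ARABIC_FILTERS = DEFAULT_FILTERS + "\u061f\u060c\u061b"

def split_words(text, filters):
    """Replace every filter character with a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

# The second sentence from the comment, ending in the Arabic question mark.
sentence = "ما هي قيمة الراجحي" + "\u061f"

default_tokens = split_words(sentence, DEFAULT_FILTERS)
arabic_tokens = split_words(sentence, ARABIC_FILTERS)

print(default_tokens[-1])  # last word still carries the question mark
print(arabic_tokens[-1])   # last word without the question mark
```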

@yousefsharrab1093 1 year ago

Great introduction

@muhammadyaqoob9129 2 months ago

I need a little more help: can you please mention the books or research papers you have followed? Basically, I am asking for references so I can read them myself.

@rahulbhardwaj4568 4 years ago

Great, thanks for the info!

@Promptgeek2 1 year ago

A better explanation is impossible.

@muskanjain1256 3 years ago

@lmoroney I have come across chatbot deployments recently. It is said that chatbots have a problem with continued conversation. My question is: why can't we stack an LSTM on an LSTM model? I mean, if we could provide memory across sentences as well as within a particular sentence, the model might be able to store the essentials of previous conversations. Please help me with this query; I am new to NLP and very excited to learn more.

@eyasulencha5136 3 years ago

Amazing presentation. Thanks for the info.

@narendrapratapsinghparmar91 9 months ago

Thanks

@ouissemmouheb5283 2 years ago

Thank you so much!

@harmitchhabra989 3 years ago

So, it's like Markov Lempel compression?

@M_Zaroug 4 years ago

🤩😍😍🤩 Very informative, waiting for the rest

@laurencemoroney655 4 years ago

Thanks, Mohamed!

@oumelkheirofficial5216 3 years ago

What an amazing and simple way of explication, Thank you

@amaltej9372 1 year ago

THANKS 😇

@fakrulislam3140 4 years ago

Amazing presentation

@cloudlover9186 2 months ago

Good one. Sir, I want to know: with nlp("I have III years of exp"), checking _.ISNUM is not working. Is there a workaround for this? Will Roman numerals not be detected?

@ubaydullo_a757 2 years ago

thank you, it was helpful :)

@yami6499 3 years ago

great video

@theobellash6440 1 year ago

Nice video

@WassupCarlton 3 months ago

perhaps this is coming in a later video, but is there any rhyme/reason to the integers that get assigned to the words? or is it PURELY arbitrary?

@WassupCarlton 3 months ago

ope -- looks like the more frequent you are, the smaller your assigned integer. Correct?
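The observation above (more frequent words get smaller integers) can be reproduced with a small plain-Python sketch. This is an imitation under stated assumptions, not the Keras implementation: `build_word_index` is a hypothetical helper that lowercases, strips the default filter characters, counts words, and numbers them from 1 in descending order of frequency, with ties keeping first-seen order.

```python
from collections import Counter

# Default filter characters documented for the Keras Tokenizer; note that the
# apostrophe is not among them.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(sentences, filters=FILTERS):
    """Map each word to an integer; more frequent words get smaller integers."""
    strip = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(strip).split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["I love my dog", "I love my cat", "You love my dog!"]
word_index = build_word_index(sentences)
print(word_index)
```

Here "love" and "my" each appear three times, so they receive 1 and 2, while "cat" and "you" appear once and come last.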

@lencazero4712 6 months ago

Awesome

@meg33333 1 year ago

Hello everyone. I am new to the ML/NLP world and need some tips. My team is working on a project in which we want to convert text (especially Hindi or Sanskrit) to a set of specific images. Which algorithm or model should we go for, or where should we start? We have made the dataset, but now what?

@HuyNguyen-kd5vz 1 year ago

This is awesome

@Ricocase 3 years ago

Excellence! How do I leverage kMeans clustering to find similarities or segment sentences from one another?

@_petrok 4 years ago

Great introduction which is easy to understand. Can't wait for the next videos of this series! But is there any way to group words while ignoring some grammar? Like: "He plays piano - I play piano", where "plays" != "play", but it is basically the same word and tense. The part about ignoring the "!" in "dog!" is fascinating.

@laurencemoroney655 4 years ago

Yeah...that's a little more difficult in preprocessing text. I won't be covering that...sorry!

@NelsonYalta 4 years ago

Those are sub-words, and a different tools can be used for obtaining them, such as sentencepiece (github.com/google/sentencepiece). In this case the model searches for common sub words such as play and in case of plays it tokenizes as . It is also possible to tokenize as the character and as a sub-word.

@mayankdewli1010 2 years ago

Yup, of course. You can lemmatize or stem these words.
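To make the stemming suggestion concrete, here is a deliberately crude sketch. `crude_stem` is a toy rule set invented for illustration; it is not a real stemmer, and for actual work a library such as NLTK (for example its `PorterStemmer` or `WordNetLemmatizer`) would be the usual choice.

```python
def crude_stem(word):
    """Strip a few common English suffixes; a toy rule set, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens_a = [crude_stem(w) for w in "he plays piano".split()]
tokens_b = [crude_stem(w) for w in "i play piano".split()]

print(tokens_a)
print(tokens_b)
```

After stemming, "plays" and "play" collapse to the same token, so a tokenizer fitted on the stemmed text would give them one shared index.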

@samrasoli 1 year ago

useful

@ipekbar 4 years ago

Thank you for the video. Sometimes an exclamation mark can be informative for tasks such as sentiment classification, but the tokenizer filters it out. Is there a way to prevent this?

@vishnurajyadav8917 4 years ago

Did you get an answer for this from any other source?

@ipekbar 4 years ago

@vishnurajyadav8917 Yes, we can control this by changing the filters parameter: www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

@deepakdakhore 4 years ago

Very nice

@LaurenceMoroney 4 years ago

Thanks!

@harsh.vision 5 months ago

print(Hashing == Tokenization) What's the output?

@shivibhatia1613 4 years ago

Too good

@ishanghutake1566 4 years ago

Suppose you have 30 text files in one folder; how do you tokenize the words?
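One way to approach the folder question is to read every file into a list of strings first and tokenize that list afterwards. The sketch below covers only the reading step; `read_texts` is a hypothetical helper, and the `.txt` extension and UTF-8 encoding are assumptions. A temporary folder with two files stands in for the real folder of 30.

```python
from pathlib import Path
import tempfile

def read_texts(folder):
    """Return the contents of every .txt file in folder, sorted by file name."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]

# Demo with a throwaway folder standing in for the real one.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("I love my dog", encoding="utf-8")
    Path(folder, "b.txt").write_text("I love my cat", encoding="utf-8")
    texts = read_texts(folder)

print(texts)
```

The resulting list has the same shape as the `sentences` list used in the video, so it could presumably be handed to `fit_on_texts` in the same way.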

@mohithshivu5475 4 years ago

Sir, my name is Mohith and I am a final-year BE student. Can you help me with some doubts about NLP? I am working on data generalization and data sanitization. Our task is to identify whether a given text is sanitized or not, and generalized or not. How would this work in Python? Please help, sir; it would be very helpful to me.

@rameshsrivastavachandra 4 years ago

This code has an Apache license, so can it be reused?

@actu_r 4 years ago

What are the advantages of using the TF framework instead of other preprocessing methods, such as those spaCy or NLTK provide? :) Thank you

@laurencemoroney655 4 years ago

I can't compare with the others...but this way they're in a unified framework that makes it less code when I get around to training a NN with them (in episode 3)

@yunishuseynzade5630 3 years ago

Thank you so much. But I have a question: how can I use words in a language other than English? For example, building an NLP model in Azerbaijani.

@தமிழோன் 3 years ago

You need to find and download Azerbaijani corpus from the Internet. You can then prepare the word index using Tensorflow. The rest of the steps should be the same as the English example shown in the video. I don't know about the Azerbaijani language but some languages, like Tamil, don't have separate grammatical words like English. You need to make heaps of preprocessing before you prepare the word index. This is something you need to be aware of. Also, if you can't find a corpus for your language, use something called "hashing trick" (or "feature hashing") to hash the individual words in your language. Luckily, Tensorflow supports hashing trick.
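The "hashing trick" mentioned above can be illustrated without any corpus: each word is hashed straight to an integer bucket, so no vocabulary has to be built first. This is a plain-Python sketch, not the TensorFlow function; `hash_word`, `hash_sentence`, the choice of MD5, and the bucket count of 1000 are all assumptions made for the example.

```python
import hashlib

def hash_word(word, n_buckets):
    """Map a word to a stable integer in [1, n_buckets - 1]; 0 stays reserved."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n_buckets - 1) + 1

def hash_sentence(sentence, n_buckets=1000):
    """Encode a sentence as one bucket number per whitespace-separated word."""
    return [hash_word(w, n_buckets) for w in sentence.lower().split()]

encoded = hash_sentence("I love my dog")
print(encoded)
```

The trade-off is that two different words can land in the same bucket; a larger bucket count makes such collisions rarer.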

@oliverli9630 4 years ago

Can this be called hacked? Or are there reasons that Keras doesn't include this? (Notice: "'you're", the left quote is still there, and it's got "'": 11 recognized as a word. and num_words=4 doesn't really limit the word count down to 4.)
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    "Jack said, 'You're gonna love my cat!'"
]
tokenizer = Tokenizer(num_words = 4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6, 'jack': 7, 'said': 8, "'you're": 9, 'gonna': 10, "'": 11}

@rajareivan2417 1 year ago

I'm also wondering why this is the case, especially since setting num_words to 1 or even 0 still tokenizes all the provided words. Have you got the answer for this?
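A likely explanation, as far as I recall the Keras source: `word_index` always lists every word seen during fitting, and `num_words` only takes effect later, when text is converted to sequences, where only words with an index below `num_words` are kept. A falsy value such as 0 appears to be treated as "no limit". The sketch below imitates that rule in plain Python; `to_sequence` is a hypothetical stand-in, not the library function.

```python
# Word index as printed in the thread (first six entries only).
word_index = {"love": 1, "my": 2, "i": 3, "dog": 4, "cat": 5, "you": 6}

def to_sequence(text, word_index, num_words=None):
    """Encode text, keeping only words whose index is below num_words."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is None:
            continue  # unknown word, no OOV token in this sketch
        # A falsy num_words (None or 0) means "no limit".
        if num_words and idx >= num_words:
            continue
        seq.append(idx)
    return seq

limited = to_sequence("you love my dog", word_index, num_words=4)
unlimited = to_sequence("you love my dog", word_index, num_words=0)
print(limited)
print(unlimited)
```

With `num_words=4` only indices 1 to 3 survive, so "you" and "dog" are dropped from the sequence even though both remain in the word index.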

@sunanthakrishnan 4 years ago

Could you help me with python 3.8.2 compatible version of Tensorflow and Keras.

@aravindravindranatha4260 4 years ago

I need your advice on finding text similarity.

@xuantungnguyen9719 4 years ago

Hi. What happens if I set num_words to 0? I tried and it still prints all the words.

@renderdreality 4 years ago

Does NLP only process English? Could it do another language? My question is really whether it could be used to learn a different language as a basis and go from there.

@jinzo1171 3 years ago

It can be used for any language:)

@Acumen928 4 years ago

Thanks. Where is the next episode?

@LaurenceMoroney 4 years ago

next week!

@actu_r 4 years ago

Should we keep only nouns when topic modelling? I am quite new to NLP and it seems there is no clear universal rule of thumb for extracting topic information. What would you advise?

@Aoitetsugakusha 4 years ago

You can probably do decently well using just nouns, but you will probably also lose a lot of information if you filter out non-nouns at the Tokenization or pre-processing step. For example, if you only use nouns, you could very well pick up on a topic like "machine learning" in your dataset, but you might miss separate discussions of "deep learning," because "deep" is an adjective that would get filtered out and you would be left with just general "learning." An ultra-crude way you might augment this a bit is to instead do topic modeling on n-grams and keep only those n-grams that contain at least one noun, but I haven't tried this, so I can't assert it will actually work.
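The n-gram idea in the comment above could be sketched as follows. Everything here is illustrative: `NOUNS` is a hypothetical hand-written noun list standing in for a real part-of-speech tagger, and `ngrams` and `noun_ngrams` are made-up helper names. Like the commenter, I have not checked how well this works for actual topic modeling.

```python
# Hypothetical noun list; a real pipeline would get this from a POS tagger.
NOUNS = {"learning", "machine", "model"}

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def noun_ngrams(tokens, n, nouns=NOUNS):
    """Keep only the n-grams that contain at least one noun."""
    return [g for g in ngrams(tokens, n) if any(t in nouns for t in g)]

tokens = "we compare deep learning with classic machine learning".split()
kept = noun_ngrams(tokens, 2)
print(kept)
```

The bigram ("deep", "learning") survives because it contains a noun, so the adjective is not lost, while ("we", "compare") is dropped.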

@laurencemoroney655 4 years ago

@Aoitetsugakusha +1 Great answer

@kumarvikas_134 4 years ago

Depends what your objective is. If the end result is only centered around identifying entities then keeping NN/NNP may make sense(given your POS tagger is not making errors). It all depends upon the objective, for my use case I remembered I have extracted chunks of SVO phrases(Subject-Verb-Object) and then performed topic modeling, that had worked well for me, but I had made adjustments to my POS tagger to do this task well.

@TallRiderX 4 years ago

The colab is labeled as Course 3 - Week 1 - Lesson 1.ipynb - where can I sign up for the full course? Thank you!

@laurencemoroney655 4 years ago

The colab was adapted from one I wrote at Coursera, where it was course 3 in the TensorFlow: In Practice specialization. There's more there. Otherwise, this is a 3 part series, with part 2 now on the YT channel :)

@RS-vu5um 4 years ago

Is the link to Part 2: Sequencing - Turning Sentences into Data available?

@laurencemoroney655 4 years ago

Yep, came out yesterday, check yt.com/tf for details

@carlossegura403 4 years ago

Love Tokenizers ❤️

@LaurenceMoroney 4 years ago

@danylobaibak317 4 years ago

A question of ignoring the "!". It seems, the Tokenizer doesn't include "!" because it was filtered as punctuation. Let's assume, that we want to use punctuation and set `filters=''` for Tokenizer. In this case, Tokenizer is not smart enough to separate the token "dog" from the token "!" Here's the example in Colab colab.research.google.com/drive/1M6Nf-WQxorf_X9z2jFnCSJ_QjrY3i5BJ
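If punctuation should survive as tokens of its own, one option is to split the text before it reaches the tokenizer. The sketch below uses a regular expression for that; `split_with_punctuation` is a hypothetical helper, not part of Keras, and the pattern is only one reasonable choice.

```python
import re

def split_with_punctuation(text):
    """Return words and punctuation marks as separate lowercase tokens."""
    # \w+(?:'\w+)* keeps contractions such as "you're" together;
    # [^\w\s] matches any single character that is neither word nor space.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text.lower())

tokens = split_with_punctuation("You love my dog!")
print(tokens)
```

Here "dog" and "!" come out as two tokens, so "dog!" and "dog" no longer end up as different vocabulary entries.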

@LearnWithMilind 4 years ago

How many languages are supported? Or is only English supported?

@LaurenceMoroney 4 years ago

I've only tried English, but this technique should work with most languages. Try the notebook linked, and change the language and see what happens?

@hajar2629 4 years ago

Thank you. How can I run the same example on my Raspberry Pi?

@abail7010 4 years ago

Medogb Medo, exactly the same way, once you've installed Python and TensorFlow.

@rupeshmalpani 4 years ago

can 1679 15223 2 153692 be a word?

@VibhootiKishor 4 years ago

Cool

@laurencemoroney655 4 years ago

Thanks!

@sujeeshsvalath 4 years ago

How to detect the difference between "I love my dog" and "I love not my dog"?

@Metalocif 4 years ago

Beyond the obvious "it has one more word", there are several approaches. One that is fairly easy is to have a list of all words in a language with their connotations (this can be found online), one possible connotation being negation. Then, you can write code that inverts the connotation of a word if there is a word that implies negation near it.
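The lexicon approach described above might look like the toy sketch below. `LEXICON` and `NEGATORS` are tiny hypothetical word lists invented for the example (a real lexicon would be far larger), and the one-word window for "near" is an arbitrary assumption.

```python
# Hypothetical toy lexicon: +1 positive, -1 negative connotation.
LEXICON = {"love": 1, "like": 1, "hate": -1}
NEGATORS = {"not", "never", "no"}

def sentence_score(sentence, window=1):
    """Sum word connotations, flipping the sign when a negator is nearby."""
    tokens = sentence.lower().split()
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        if polarity == 0:
            continue
        # Look at the words within `window` positions on either side.
        nearby = tokens[max(0, i - window): i + window + 1]
        if any(t in NEGATORS for t in nearby):
            polarity = -polarity
        score += polarity
    return score

print(sentence_score("I love my dog"))
print(sentence_score("I love not my dog"))
```

The first sentence scores positive and the second negative, because "not" sits directly next to "love".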

@LaurenceMoroney 4 years ago

If you have lots of sentences that are similar except for the word 'not', label them accordingly and then train a classifier like we do here; the 'not' would become a really strong signal towards the negative. Give it a try, instead of using the sarcasm dataset. The code would be very similar to this video.

@vi.kran.t 4 years ago

I want the TensorFlow track jacket that you are wearing.

@tcidude 4 years ago

Chanson

@tingnews7273 4 years ago

Can anyone tell me what the "first principles" method of teaching is?

@LaurenceMoroney 4 years ago

"From first principles" means teaching with zero (or at least very few) assumptions

@tingnews7273 4 years ago

@LaurenceMoroney Thank you, I hope I can figure it out

@felixakwerh5189 4 years ago

"From first principles" could also mean from the smallest to the biggest, from the known to the unknown. Basically, it's a way of breaking concepts down to their simplest form.

@douggale5962 1 year ago

What? You end it when I was expecting you to at least say to put the 1's in the input layer. This is how you tokenize in general, nothing to do with AI.

@HealthyFoodBae_ 4 years ago

Yay😅

@alexanderpohl1949 4 years ago

04:15 is really misleading for anyone watching this as their entry to NLP. There are too many steps missing that need to be talked about in a 'Zero to Hero' tutorial series after this point, instead of jumping into sequencing. Even steps before this point. I see why these aren't included (because they are not included in TensorFlow). But at the same time, this is just setting an unrealistic standard. In machine learning terms, I'd say... this video is just mislabeled

@laurencemoroney655 4 years ago

...and what are these steps? With these videos and the codelabs, we'll have everything we need to build a simple text classifier, the beginnings of NLP.

@தமிழோன் 3 years ago

@laurencemoroney655 Maybe he's referring to the cleanup required for the grammar (like someone pointed out: play vs plays)? However, TensorFlow cannot include that in the library as he's suggesting, because TensorFlow is not an English-only library but a more generic one.

@rawnakfreak3539 10 months ago

Exam after 9 hours (⁠T⁠T⁠)

@cr0wzzz 1 year ago

The answer was simple all along. It's just dog

@siddvideos 4 years ago

Too late to the party Tensorflow!! It’s not 2010. Love the video though, thanks😎

@LaurenceMoroney 4 years ago

Ha! I can only produce so many....

@masternobody1896 4 years ago

This is so complicated

@laurencemoroney655 4 years ago

Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

@balachkhan1578 4 years ago

@laurencemoroney655 It's great. When will we get the next tutorial?

@laurencemoroney655 4 years ago

@balachkhan1578 We're releasing them weekly

@balachkhan1578 4 years ago

@laurencemoroney655 TensorFlow can't be installed with Python 3.8. Will the issue be solved, or should I switch to Python 3.7?

@LaurenceMoroney 4 years ago

@balachkhan1578 It's constantly being updated...so keep an eye on www.tensorflow.org/install. Right now it's up to 3.7 on there.

@Wanderlens197 4 years ago

Very Difficult to learn

@laurencemoroney655 4 years ago

Even with these instructions? I've tried to make it as simple as possible and provided a colab to step through the code yourself. Don't know if it can be simplified any more.

@tanvipurwar6048 3 years ago

@laurencemoroney655 It is a bit complicated the first time. But taking up a small dataset/project for NLP and then revisiting the video makes everything a lot clearer. Plus you pick up on things that slipped your mind the first time :)