Machine Learning with Text in scikit-learn (PyData DC 2016)

  20,082 views

Data School

Comments: 84
@squirrel2770 7 years ago
This guy. You teach amazingly well. Gifted communicator, looking forward to future content!
@dataschool 7 years ago
Thanks!
@brendensong8000 4 years ago
Oldies, but goodies!!!! It's awesome!!! Thank you!
@dataschool 3 years ago
Thank you so much!
@shashankpulijala3378 7 years ago
Thank you Kevin. I really liked the in-depth explanation of the concept. Teachers like you inspire me a lot....
@dataschool 7 years ago
You're very welcome! Thanks for your kind comment!
@hpchen5402 8 years ago
Another great video by Kevin! Thanks a lot for kindly sharing.
@dataschool 8 years ago
You're very welcome!
@ccchang0111 7 years ago
Thank you Kevin! Such an informative tutorial, with great detail but not redundant!
@dataschool 7 years ago
You're very welcome! Thanks for your kind words!
@ciurkut2 5 years ago
Another very good, easy to follow tutorial ^^
@dataschool 5 years ago
Thanks!
@7810 6 years ago
Amazing course! Thanks.
@dataschool 5 years ago
You're welcome!
@mkarthekeyan 7 years ago
Thank you very much for sharing the lecture. It's one of the best and most clearly explained lectures on this topic for beginners like me.
@dataschool 7 years ago
Great to hear!
@datascienceds7965 6 years ago
Please do a video on sentiment analysis.
@dataschool 6 years ago
Thanks for your suggestion!
@parambole8671 8 years ago
Awesome lecture as always
@dataschool 8 years ago
Thanks!
@shunmugaprabhusiddharthan2678 7 years ago
Excellent explanation! Thanks a lot... Kevin
@dataschool 7 years ago
You're very welcome!
@musasall5740 5 years ago
Excellent!
@dataschool 5 years ago
Thank you!
@hatrer2244 5 years ago
So clear! Big ups.
@dataschool 5 years ago
Thanks!
@Ankurkumar14680 5 years ago
Great video, amazing teaching skills...thanks a ton :)
@dataschool 5 years ago
You're welcome! Thanks for your kind words 😄
@rahulkulkarni7224 8 years ago
Hi Kevin, I follow your tutorials and recommend them to my friends. I have a question about this particular video: why did you choose the scikit-learn package for feature extraction, text cleaning, etc.? What advantages does it have over NLTK? NLTK can easily interact with scikit-learn for ML classification algorithms. I believe NLTK is more mature and has 1000 other features, which might not be required for basic text mining. I am just trying to understand the differences with respect to performance, ease of use, and interaction with scikit-learn for ML.
@dataschool 8 years ago
If your focus is machine learning, scikit-learn is a far better choice than NLTK. scikit-learn is built for machine learning, whereas NLTK includes machine learning as a small feature. scikit-learn supports the entire machine learning pipeline, from preprocessing through evaluation and even ensembling. I have some more thoughts about scikit-learn in this video: kzbin.info/www/bejne/f6S7iZ-Pi6enZ68 If your focus is Natural Language Processing, and machine learning is a secondary concern, then NLTK may be a good choice. But for higher-performance NLP, with a cleaner API and a simpler workflow, many people these days are choosing spaCy: spacy.io/ NLTK is not optimized for performance, whereas scikit-learn and spaCy are. Hope that helps!
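To make the "entire machine learning pipeline" point concrete, here is a minimal sketch of text preprocessing and classification chained together in scikit-learn. The four messages and their labels are made up for illustration, and Multinomial Naive Bayes is just one reasonable choice of classifier, not necessarily the only one used in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, for illustration only
texts = [
    "win a free prize now",
    "free cash offer, call now",
    "are we still meeting for lunch",
    "see you at the office tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorization and classification chained into a single estimator
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# New text goes through the same vectorizer before being classified
prediction = pipe.predict(["free prize offer"])
print(prediction[0])
```

Because the vectorizer lives inside the pipeline, new text is converted with the vocabulary learned during `fit`, so there is no separate transformation step to keep in sync.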
@ianyang8799 7 years ago
Hi Kevin, hope you can make videos on deep learning, such as CNN, RNN, LSTM.
@dataschool 7 years ago
Thanks for your suggestion! I'll consider it for the future!
@phuccoiinkorea3341 8 years ago
Good job! Thanks so much!
@dataschool 7 years ago
You're very welcome!
@ravisamal3533 5 years ago
It was a great session.
@dataschool 5 years ago
Glad you liked it!
@CarlosRomeroconnect 8 years ago
Easy and informative!
@dataschool 8 years ago
Thanks!
@salamatburj9502 5 years ago
Hi, Kevin! I have a question about using CountVectorizer with cross-validation. Can we just transform the data before splitting it into folds and training the model? In principle, it will not train on features which are not in the training set. Can you please elaborate on this? Thank you!
@dataschool 5 years ago
That's far beyond what I can cover in a YouTube comment, I'm sorry! This is explained in-depth during my course, however: www.dataschool.io/learn/
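One common way to sidestep the question is to put the vectorizer inside a pipeline and cross-validate the pipeline as a whole, so that the vocabulary is learned only from the training portion of each fold. The sketch below assumes a made-up eight-message dataset and a Naive Bayes classifier. It is not taken from the course mentioned in the reply.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash offer, call now",
    "claim your free reward today",
    "urgent offer, win cash now",
    "are we still meeting for lunch",
    "see you at the office tomorrow",
    "can you send me the report",
    "lunch tomorrow at the usual place",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# For each fold, the whole pipeline (vectorizer included) is refit on the
# training part only, then scored on the held-out part
scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())
```

Passing raw text rather than a precomputed document-term matrix is what lets `cross_val_score` refit the vectorizer within each fold.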
@radouane5591 8 years ago
Thanks, I downloaded all your videos. Although I did my capstone project in NLP (analyzing tweets), I think your self-paced class will be beneficial to me.
@dataschool 8 years ago
Excellent! Here's a link to learn more about the course: www.dataschool.io/learn/ Feel free to email me if you have any questions. My email address is on that page.
@radouane5591 8 years ago
Do you have a Black Friday deal for the class?
@dataschool 8 years ago
I'm sorry, but I don't. Worth asking though! :)
@hafizhassaan9263 4 years ago
Hello Sir! I want to ask a question: how can I convert a journal title to its journal abbreviation using NLP, or a method that is easier than NLP? Please guide me; waiting for your kind response. Thanks in anticipation.
@semanticgeek3072 8 years ago
Great video Kevin. Is this process an alternative to using NLTK?
@dataschool 8 years ago
NLTK is a library focused on Natural Language Processing (NLP), whereas scikit-learn is focused on machine learning. Some machine learning tasks can be done in NLTK, while others cannot. I choose to use scikit-learn for machine learning with text because it's a far better tool than NLTK for machine learning. However, if your focus is NLP, then NLTK (or spaCy) may be a good choice.
@Rijndhadu 8 years ago
Did you give a similar lecture at any other Python conference, or is it completely different?
@dataschool 8 years ago
This is a shorter version of the tutorial that I delivered at PyCon 2016. I made some improvements to the lesson for PyData DC and there are lots of good audience questions, but it may not be worth your time if you have already watched the lesson from PyCon. Thanks for asking!
@guohuashen599 7 years ago
Thank you Kevin for your great video! I have a question: how can I combine plural and singular words, or verbs with different tenses, and just keep one of them (I don't want to differentiate them)?
@dataschool 7 years ago
That's beyond the scope of what you can do natively with CountVectorizer. However, the tasks you are proposing may be of limited value, if your goal is predictive accuracy.
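Although CountVectorizer has no built-in stemming, it does accept a custom `analyzer` callable, which is one possible place to plug such a step in. The sketch below uses a hypothetical and deliberately naive `crude_stem` helper that only strips a trailing "s". A real project would more likely substitute a stemmer or lemmatizer from an NLP library such as NLTK or spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

def crude_stem(token):
    # Hypothetical, deliberately naive rule: drop a trailing "s" from longer
    # tokens. This is a stand-in for a real stemmer or lemmatizer.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

# Reuse CountVectorizer's default lowercasing and tokenization
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize as usual, then normalize each token
    return [crude_stem(tok) for tok in base_analyzer(doc)]

vect = CountVectorizer(analyzer=stemmed_analyzer)
dtm = vect.fit_transform(["cats chase dogs", "a dog chases a cat"])

# Singular and plural forms now share one column each
features = sorted(vect.vocabulary_)
print(features)
```

As the reply notes, whether this kind of normalization actually improves predictive accuracy is something to check empirically rather than assume.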
@warrock-5489 8 years ago
Hi Kevin, much appreciation for the tutorial! I have a question about how to merge other features with the vectorized features. For example, when we pass a 'text' column to TfidfVectorizer (fit and transform), how do we properly add other features, such as a 'subject' feature, to each of the training and testing instances? Thanks in advance. :)
@warrock-5489 8 years ago
I tried using a pipeline, but it's very confusing, because I store the training data in a pandas DataFrame with 2 columns (text, subject), and this causes an error when fitting the pipeline.
@dataschool 8 years ago
You would use a FeatureUnion: scikit-learn.org/stable/modules/pipeline.html I cover this in module 5 of my online course: www.dataschool.io/learn/ Hope that helps!
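The reply recommends FeatureUnion. The sketch below instead uses ColumnTransformer, which was added to scikit-learn after this thread and handles the per-column case with less plumbing. The DataFrame contents, the `label` column, and the choice of LogisticRegression are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with a free-text column and a categorical column
df = pd.DataFrame({
    "text": [
        "win a free prize now",
        "quarterly report attached",
        "free cash offer inside",
        "meeting moved to friday",
    ],
    "subject": ["promo", "work", "promo", "work"],
    "label": [1, 0, 1, 0],
})

ct = make_column_transformer(
    # A bare string selects a 1-D column, which is what text vectorizers expect
    (TfidfVectorizer(), "text"),
    # A list selects a 2-D frame, which is what OneHotEncoder expects
    (OneHotEncoder(handle_unknown="ignore"), ["subject"]),
)

pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(df[["text", "subject"]], df["label"])
preds = pipe.predict(df[["text", "subject"]])
print(preds)
```

The distinction between `"text"` and `["subject"]` matters here: passing the text column as a list would hand the vectorizer a 2-D object, which is a likely source of fitting errors like the one described above.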
@khizaraman386 2 years ago
I used Colab and it didn't show as much description for CountVectorizer as it showed in Jupyter. Could you please tell me the difference between Colab and Jupyter? Which one is better?
@dataschool 2 years ago
Neither is better, they are just different! This might help: www.dataschool.io/cloud-services-for-jupyter-notebook/
@khizaraman386 6 months ago
@dataschool Thanks a lot!! I am back at revising ML...so refreshing to be back!!!
@takbirhossaintushar7290 7 years ago
Sir, in the model we are just telling the machine whether a message is desperate or not. If we want to use more classes, say to predict whether a comment is positive, negative, or neutral, which scikit-learn commands would we use, and how would we implement this?
@dataschool 7 years ago
It sounds like you are just describing a 3-class problem instead of a 2-class problem. The basic scikit-learn code is exactly the same for classification problems, regardless of the number of classes. However, note that the relevant evaluation metrics are different when there are more than 2 classes. Hope that helps!
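As a rough illustration of "the code is exactly the same", the sketch below fits the same vectorizer-plus-classifier pipeline on three made-up sentiment classes. The comments and labels are invented, and the confusion matrix is shown as one example of an evaluation tool that extends naturally beyond two classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments with three possible labels
texts = [
    "great product, love it",
    "really love this, great",
    "terrible, hate it",
    "awful product, hate this",
    "it arrived on monday",
    "the package arrived today",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Identical code to the 2-class case: the classifier infers the classes from y
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)
preds = pipe.predict(texts)

# A 3x3 confusion matrix: rows are true classes, columns are predicted classes
class_order = ["negative", "neutral", "positive"]
cm = confusion_matrix(labels, preds, labels=class_order)
print(cm)
```

Predictions here are made on the training data only to keep the example short. A real evaluation would use held-out data.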
@kikiisboy4911 7 years ago
With an imbalanced multi-class text dataset, should I normalize the data with TF-IDF weights or not?
@dataschool 7 years ago
You could experiment to see whether TF-IDF is useful. There's no easy way to know in advance whether or not it will be better!
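One way to run that experiment is to cross-validate two otherwise identical pipelines, one with raw counts and one with TF-IDF weights. The sketch below uses a small made-up two-class dataset purely to show the mechanics. The scores it produces say nothing about which vectorizer would win on a real imbalanced multi-class problem.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash offer, call now",
    "claim your free reward today",
    "urgent offer, win cash now",
    "are we still meeting for lunch",
    "see you at the office tomorrow",
    "can you send me the report",
    "lunch tomorrow at the usual place",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {"counts": CountVectorizer(), "tfidf": TfidfVectorizer()}

results = {}
for name, vect in candidates.items():
    # Same classifier and same folds, only the vectorizer changes
    pipe = make_pipeline(vect, MultinomialNB())
    results[name] = cross_val_score(pipe, texts, labels, cv=2).mean()

print(results)
```

On a genuinely imbalanced dataset it would probably make sense to pass a `scoring` argument other than the default accuracy, though which metric is appropriate depends on the problem.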
@mdmasumbillah1796 5 years ago
Thank you.
@dataschool 5 years ago
You're welcome!
@ranjanpatel5146 8 years ago
How can I classify an email as a positive or negative response?
@dataschool 8 years ago
I would frame this as a classification problem. The response value is "positive" or "negative", the features are the text (as well as any other engineered features), and the email messages are the observations. The main challenge will be obtaining labeled training data, which is training data that has been labeled with the true response value (so that you can train your model). Hope that helps!
@srikantachaitanya6561 8 years ago
Thank you...
@dataschool 8 years ago
You're welcome! I hope the video is helpful to you.
@shamsuddinjunaid30 5 years ago
It's sklearn.model_selection instead of sklearn.cross_validation
@dataschool 4 years ago
That's correct, the scikit-learn API changed since this video was recorded.
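For anyone following along with a current scikit-learn release, the updated import looks like the sketch below. The sample messages, labels, and `random_state` value are placeholders chosen for illustration.

```python
# Older releases used: from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split

# Placeholder data: eight messages with binary labels
X = [
    "msg one", "msg two", "msg three", "msg four",
    "msg five", "msg six", "msg seven", "msg eight",
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# By default, 25% of the samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))
```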
@royklaassebos2400 7 years ago
Again, another great tutorial! Recently I've been watching a lot (i.e. almost all haha) of your videos; really love how you explain code line by line so that we can understand the "why" in addition to the "how". There are only a handful of people I've found online that have a similar teaching style. Honestly, I think your content is almost too good to give away for free. Maybe you should consider publishing your videos on Udemy (e.g. Kirill Eremenko is an excellent teacher as well and reaches a huge audience on Udemy: www.udemy.com/user/kirilleremenko/ ). Anyway, thanks again, really appreciate it!
@dataschool 7 years ago
Thanks so much for your kind words! I really appreciate it! I will be releasing paid courses in the future, but I also enjoy giving away a lot of material for free so that everyone can access it :)
@rohitnagal3704 6 years ago
If I have 100 articles, do I have to create 100 corpora for them, or something else?
@dataschool 6 years ago
Generally, each article would be a row in your dataset. Does that help?
@rohitnagal3704 6 years ago
If I have an article of 30 lines, is it converted into one single vector?
@dataschool 6 years ago
Yes
@rohitnagal3704 6 years ago
@dataschool thanks
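The "one article, one row" idea from the exchange above can be sketched as follows. The three articles are made up, and the line breaks inside them are there only to show that a multi-line article still becomes a single row of the document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each list element is one whole article, however many lines it spans
articles = [
    "Stocks rose on Monday.\nAnalysts expect further gains.\nMarkets closed higher.",
    "The team won the final.\nFans celebrated downtown.",
    "A new phone was announced today.",
]

vect = CountVectorizer()
dtm = vect.fit_transform(articles)

# One row per article, one column per unique token in the vocabulary
n_rows, n_cols = dtm.shape
print(n_rows, n_cols)
```

With 100 articles the list would simply have 100 elements, and the resulting matrix would have 100 rows.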
@gourusai101 6 years ago
Can anyone explain max_df and min_df clearly?
@dataschool 6 years ago
My answer here should help: stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer
@gourusai101 6 years ago
@dataschool thank you so much :)
@dataschool 6 years ago
You're very welcome!
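As a small complement to the linked answer, the sketch below shows both parameters at work on four made-up documents. A float is read as a proportion of documents and an integer as an absolute document count. The specific thresholds are chosen only to make the effect visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]

# max_df=0.75: drop terms found in more than 75% of documents
#              ("the" appears in all 4, so it is dropped)
# min_df=2:    drop terms found in fewer than 2 documents
#              (everything except "the" and "sat" appears only once)
vect = CountVectorizer(min_df=2, max_df=0.75)
vect.fit(docs)

kept = sorted(vect.vocabulary_)
print(kept)  # ['sat']
```

In other words, `max_df` trims terms that are too common to be informative, while `min_df` trims terms that are too rare.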