How do I use pandas with scikit-learn to create Kaggle submissions?

44,561 views

Data School

Have you been using scikit-learn for machine learning, and wondering whether pandas could help you to prepare your data and export your predictions? In this video, I'll demonstrate the simplest way to integrate pandas into your machine learning workflow, and will create a submission for Kaggle's Titanic competition in just a few lines of code!
VIDEO: What is machine learning, and how does it work? • What is machine learni...
VIDEO SERIES: Introduction to machine learning with scikit-learn: • Machine learning in Py...
SUBSCRIBE to learn data science with Python:
www.youtube.co...
JOIN the "Data School Insiders" community and receive exclusive rewards:
/ dataschool
== RESOURCES ==
GitHub repository for the series: github.com/jus...
Kaggle's Titanic competition: www.kaggle.com...
"loc" documentation: pandas.pydata.o...
"DataFrame" constructor documentation: pandas.pydata.o...
"to_csv" documentation: pandas.pydata.o...
"to_pickle" documentation: pandas.pydata.o...
"read_pickle" documentation: pandas.pydata.o...
== LET'S CONNECT! ==
Newsletter: www.dataschool...
Twitter: / justmarkham
Facebook: / datascienceschool
LinkedIn: / justmarkham
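
The workflow described above (prepare the data with pandas, train with scikit-learn, export predictions for Kaggle) can be sketched roughly as follows. This is a minimal sketch, not the video's exact code: the two tiny DataFrames are made-up stand-ins for Kaggle's train.csv and test.csv, the choice of Pclass and Parch as features is an assumption, and the CSV is written to an in-memory buffer so the example runs without any files.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for Kaggle's train.csv and test.csv (not real Titanic rows).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Parch": [0, 0, 1, 0, 2, 0],
})
test = pd.DataFrame({
    "PassengerId": [7, 8, 9],
    "Pclass": [1, 3, 2],
    "Parch": [1, 0, 0],
})

# Assumed feature choice, for illustration only.
feature_cols = ["Pclass", "Parch"]
X = train.loc[:, feature_cols]   # feature matrix as a DataFrame
y = train["Survived"]            # target as a Series

# scikit-learn accepts the pandas objects directly.
logreg = LogisticRegression()
logreg.fit(X, y)

# Predict for the rows whose outcome is unknown.
X_new = test.loc[:, feature_cols]
new_pred_class = logreg.predict(X_new)

# Build the submission: PassengerId as the index, Survived as the only column.
submission = pd.DataFrame(
    {"PassengerId": test["PassengerId"], "Survived": new_pred_class}
).set_index("PassengerId")

# Write to a buffer here; pass a filename to to_csv to create a real file.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Setting PassengerId as the index means to_csv writes it as the first column, so no extra unnamed index column appears in the file.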

Comments: 104
@aleksandramazurek1364 4 years ago
I'm going through all of your pandas videos, and I can't stress enough what a wonderful job you've done here. Just wanted to say thank you
@dataschool 4 years ago
Thanks very much for your kind words! 😄
@roccococolombo2044 7 years ago
Your pronunciation is exemplary, which is nice for us non-native English speakers.
@dataschool 7 years ago
Excellent! I tried very hard to be understandable.
@mratanusarkar 4 years ago
exactly!! I don't even need subtitles!! his voice, as well as the audio and recording devices are perfect!!
@jiangweiguo1150 7 years ago
I really appreciate this series of tutorials, thank you very much.
@dataschool 7 years ago
You're very welcome!
@urwellwisher4ui 1 year ago
You are just the best! As a beginner who is still polishing my skills, I find your teaching so helpful. Gratitude!
@dataschool 1 year ago
You are so welcome!
@salehahmed6806 4 years ago
Your teaching is really awesome. I stopped watching videos from other educational sites.
@dataschool 4 years ago
😊
@Pepper-li3fl 8 years ago
I was not aware that pandas data frames can be used in scikit-learn and I used to convert them to numpy arrays. Now my code is leaner. Also the bonus tip on object persistence is very helpful. Keep up the great work, many thanks!
@dataschool 8 years ago
Glad that the video is helpful to you! You're very welcome!
@hyakushiki23 5 years ago
Another excellent guide Kevin. I know the Titanic case is a popular one and I am glad I finally understand it. I also loved your Scikit-learn series. Lastly, thank you for explaining pickling. I read the MS Python book, but now I finally understand it thanks to you.
@dataschool 5 years ago
Thanks very much for your kind words! :)
@RayedWahed 8 years ago
Awesome!!! Didn't realize writing to a csv file was that easy. When you set the index to PassengerId and printed it, it seemed the formatting was a bit off but then I wrote the code, and viewed the generated csv file in Excel and it was perfect. Thanks :)
@dataschool 8 years ago
You're very welcome! I know what you mean about the formatting... that is just how pandas lets you know that a particular column is the index (rather than a "regular" DataFrame column).
@nagnathsatav9978 3 years ago
Thanks for your crystal-clear explanation; it all seems very easy after hearing your voice behind it. I think you were made for the data science community.
@dataschool 3 years ago
You're welcome!
@MrKishorebabukotha 4 years ago
Thank you very much for providing short and simple learning videos with practical examples.
@MrDavisv 6 years ago
Thank you so much for this video! I’ve been struggling with coding a LogisticRegression classifier in sklearn until I watched this video. Thank you!!
@dataschool 6 years ago
You're very welcome! Glad it was helpful to you!
@Davidemmanuelkatz 7 years ago
Needed this for so long!!! Big thanks from a helpless econ student
@dataschool 7 years ago
You're very welcome!! :)
@shobhitsrivastava4496 6 years ago
Thank you so much, sir, for this video series, because it cleared up much of my confusion about pandas. Great job!
@dataschool 6 years ago
Great to hear!
@adriandeveraaa 6 years ago
You are absolutely awesome, thank you for running down the simple structure.
@dataschool 6 years ago
Thanks very much for your kind words! I appreciate it :)
@adriandeveraaa 6 years ago
Do you have any videos on space transformation? Pre-processing tips such as space transformation? Btw Subscribed (:
@dataschool 6 years ago
Not yet, thanks for the suggestion!
@kushagrasaxena5202 4 years ago
I am having trouble making predictions for a CSV file where the model has to predict several labels. Can you help me out with it? I have trained a good model, but I don't know how to make predictions on the test file given by Kaggle.
@johnfreitas250 5 years ago
Great tutorial. Thank you so much.
@dataschool 5 years ago
You're very welcome!
@Shkvarka 4 years ago
Great!! Great job!!! Thank you!!!
@dataschool 4 years ago
Thank you!
@hridayborah9750 5 years ago
Your tutorials are very helpful. Thanks a lot, and please keep helping us.
@dataschool 5 years ago
Thanks!
@sherlocksu1131 7 years ago
Hello, Kevin. In the bonus part of the video, you mention the method ".to_pickle". Since I usually use "to_csv", what is the advantage of ".to_pickle", and what is the difference between them?
@dataschool 7 years ago
Great question! 'to_csv' writes the data out to a CSV file, which contains the data from the DataFrame in a structured format. You could read that CSV file into any other program, or even examine it in a text editor. 'to_pickle' writes the DataFrame object to a file. That file contains all of the information about the DataFrame itself, including the data. Only pandas can read that file. 'to_csv' is generally more useful since it is more flexible. However, 'to_pickle' allows you to preserve the exact DataFrame, so that you are able to reconstruct the identical DataFrame later. If you use 'to_csv', that will not preserve the properties of the DataFrame, just the data itself. Hope that helps!
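To make the difference concrete, here is a minimal sketch that round-trips the same DataFrame through both formats. The column names and the temporary file name are made up for illustration. The CSV copy comes back with its datetime and categorical columns reduced to plain text, while the pickled copy is identical to the original.

```python
import io
import os
import tempfile

import pandas as pd

# A small DataFrame with dtypes that plain text cannot describe.
df = pd.DataFrame({
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "grade": pd.Categorical(["low", "high"]),
})

# CSV round trip: only the values survive, so dtypes are re-inferred on read.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# Pickle round trip: the DataFrame object itself is stored and restored.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv.dtypes)     # text columns, not datetime/category
print(from_pickle.dtypes)  # same dtypes as the original df
```

One caution: only read pickle files from sources you trust, since unpickling can execute arbitrary code.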
@pratikmehta1152 5 years ago
@Sherlock Su: Scroll down. User "Nikita Zakharov" asked the same question, and Kevin answered it there.
@davidborger1487 7 years ago
I'd like to sign up for your email newsletter, but when I follow the link to the Data School page, the upper left panel, where I presume the subscription form is supposed to appear, just contains a continually rotating circle of dots. No form ever appears.
@dataschool 7 years ago
I'm sorry to hear that! Unfortunately, sometimes that widget does not load. Please try again: www.dataschool.io/ If that doesn't work for you, please email me at kevin@dataschool.io and I'll add you manually. Thanks for your interest, and sorry for the trouble! :)
@AI-Health-posts 6 years ago
Fantastic tutorial. Thanks for that!
@dataschool 5 years ago
You're very welcome!
@kishanastro 5 years ago
Thanks man, was really helpful!
@dataschool 5 years ago
You're very welcome!
@mukulsn1698 4 years ago
Your videos are very comprehensive and insightful... thank you! Can you also upload more advanced-level videos on pandas, Python, and machine learning?
@dataschool 4 years ago
Thanks for your suggestion! This course may interest you: www.dataschool.io/learn/
@PankajMishra-ey3yh 8 years ago
Hello, Kevin. This video is really amazing. I would like to suggest something: why don't you make a full series on tasks that have appeared on Kaggle (such as Titanic, Facial Keypoints Detection, etc.)? Learning through tasks would be great, because people would actually learn how to apply everything. What are your views on this? Can we expect something like this in the near future?
@dataschool 8 years ago
Thanks for the suggestion! I will see what I can do :)
@mohammadabulhasnat4387 4 years ago
When we fit the data with a classifier, do we pass a DataFrame/Series or a NumPy array?
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
Is X_train a DataFrame or a NumPy array?
@dataschool 4 years ago
scikit-learn understands both pandas objects and NumPy arrays.
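As a quick illustration of that point, the sketch below fits the same model once with pandas objects and once with the equivalent NumPy arrays, then compares the predictions. The toy values and the Pclass/Parch column names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data (made-up values).
X_train = pd.DataFrame({
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Parch": [0, 0, 1, 0, 2, 0],
})
y_train = pd.Series([0, 1, 1, 0, 1, 0], name="Survived")

# Fit once with a DataFrame/Series and once with plain NumPy arrays.
model_from_pandas = LogisticRegression().fit(X_train, y_train)
model_from_numpy = LogisticRegression().fit(X_train.to_numpy(), y_train.to_numpy())

pred_pandas = model_from_pandas.predict(X_train)
pred_numpy = model_from_numpy.predict(X_train.to_numpy())

# The input container does not change the result; predict returns a NumPy array.
same = np.array_equal(pred_pandas, pred_numpy)
print(same, type(pred_pandas))
```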
@AliElDousCS 7 years ago
Hello :) When I run new_pred_class= logreg.predict(x_new) it gives an error "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." Not sure why? Can you help? Thanks
@dataschool 7 years ago
Did you run the exact same code as me? You can double check your code here: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb Just click on number 22 in the table of contents to jump to the relevant code.
@zny918951 7 years ago
That's because there are some null values in the data; you should clean the data first.
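One common way to do that cleaning is to fill the missing values before fitting and predicting. The sketch below is only one option, under assumptions: the data is made up, Age is used as an example of a column with missing values, and the median is computed from the training data and reused for the new data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data in which Age has missing values.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({"Age": [np.nan, 47.0], "Pclass": [3, 1]})

feature_cols = ["Age", "Pclass"]

# Learn the fill value from the training data only, then reuse it for new data.
age_median = train["Age"].median()
X = train[feature_cols].fillna({"Age": age_median})
X_new = test[feature_cols].fillna({"Age": age_median})

logreg = LogisticRegression()
logreg.fit(X, train["Survived"])

# No NaN values remain, so predict no longer raises the ValueError.
new_pred_class = logreg.predict(X_new)
print(new_pred_class)
```

Checking `X_new.isna().sum()` before calling predict is a quick way to see which columns still contain missing values.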
@sudaksina 5 years ago
Sir, could you kindly tell me the name of a site where I can get a diabetes dataset as a CSV file with columns for HbA1c, or LDL and HDL? The Pima datasets do not have these columns. With thanks.
@dataschool 5 years ago
I don't know, sorry!
@MrNagios 3 years ago
Please show me the best way to import XML. I've struggled to figure it out.
@dataschool 3 years ago
I don't have a tutorial on that topic, sorry!
@sodapopinski9922 6 years ago
How does new_pred_class = logreg.predict(X_new) know you're looking for Survived without specifying it? For the training set it was logreg.fit(X, y), but for the test set it was new_pred_class = logreg.predict(X_new). Shouldn't it be new_pred_class = logreg.predict(X_new, y), with y being Survived?
@dataschool 6 years ago
With the fit method, the model is learning the relationship between X and y. With the predict method, you are making predictions for unknown target values using X_new. There is no "y" when you are making predictions. Does that help?
@sodapopinski9922 6 years ago
Why yes, it does, sir, but shouldn't there at least be an empty variable to store the predictions in? I'm probably overthinking it. I did the same thing while getting my degree in nursing: I overthought all of the problems.
@dataschool 6 years ago
It returns the predictions into whatever variable is on the left side of the equals sign.
@artemiofilho 5 years ago
How can I use the "sex" attribute as a feature in my X_train matrix? Can I transform it to 1 and 0, with 1 = male and 0 = female?
@dataschool 5 years ago
Exactly!
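A minimal sketch of that transformation, using the encoding from the question (1 = male, 0 = female). The new column name Sex_male is made up for illustration.

```python
import pandas as pd

# Made-up rows using the "male"/"female" labels of the Titanic Sex column.
train = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each label to a number; Sex_male is a hypothetical column name.
train["Sex_male"] = train["Sex"].map({"female": 0, "male": 1})
print(train)
```

The numeric column can then be listed among the feature columns alongside the others.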
@kritikeshsinha1716 5 years ago
Using different features gives different output. How can I decide which set of features is the right one?
@dataschool 5 years ago
My ML video series might be helpful to you: kzbin.info/aero/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
@injypal 5 years ago
Your videos are awesome. Can you please explain where should I use .filter vs .loc method? I'm new to pandas and want to know which method is recommended practice.
@dataschool 5 years ago
I'm not familiar with filter, sorry! loc is always useful though, see here: kzbin.info/www/bejne/rqfTf3Rtl6hrmdU
@kofteci408 7 years ago
Thank you, this is very nicely done and useful. A quick question. Say that I wish to use another classifier, like RidgeClassifier. Its ".fit" method also takes a third parameter "sample_weight". Can you please indicate what that might look like in your example, with the two predictor classes you are using? I can extrapolate from there. Thanks again.
@dataschool 7 years ago
I'm sorry, I'm not familiar with how to use sample_weight. Let me know if you find a good resource that explains it... thanks!
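For readers with the same question, here is a minimal sketch of what passing sample_weight might look like. It is not from the video: the data is made up, and the weighting rule (rows with a positive label count double) is an arbitrary assumption chosen only to show the shape of the argument, which is one weight per training row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeClassifier

# Toy data with two predictor columns (made-up values).
X = pd.DataFrame({
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Parch": [0, 0, 1, 0, 2, 0],
})
y = pd.Series([0, 1, 1, 0, 1, 0], name="Survived")

# One weight per row; the doubling rule is arbitrary, for illustration only.
sample_weight = np.where(y == 1, 2.0, 1.0)

clf = RidgeClassifier()
clf.fit(X, y, sample_weight=sample_weight)
pred = clf.predict(X)
print(pred)
```

Rows with larger weights contribute more to the fit, so weights are typically used when some observations matter more than others or when classes are imbalanced.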
@robindong3802 6 years ago
Thank you so much, Kevin. Great video. After y.shape I get a different output, (891,) rather than (891L,). Can you let me know why? Thank you so much.
@dataschool 6 years ago
"L" stands for "long integer". I think it used to be displayed in Python 2, but is no longer displayed in Python 3. It's an implementation detail, and nothing to worry about!
@LonglongFeng 7 years ago
Hi Kevin, How did you select the 'feature columns', based on your experience or intuition?
@dataschool 7 years ago
There are a few main ways to do feature selection: (1) domain expertise, (2) data exploration, (3) experimentation, (4) automated approaches, (5) model-specific approaches. That's the best I can summarize it in a YouTube comment! :)
@dataschool 5 years ago
I recently released a video about feature selection, which might be useful to you: kzbin.info/www/bejne/j5Kufph3oa2ap7M
@joescanlon7502 8 years ago
Very nice!
@dataschool 8 years ago
Thanks!
@nikitazakharov1046 7 years ago
Why would I use pickle if I can do the same things with to_csv and read_csv? Thanks
@dataschool 7 years ago
Great question! I think the main reason is that pickle files are smaller than plain text files.
@nikitazakharov1046 7 years ago
Thanks for the answer :)
@GabrielBecerraAyala 6 years ago
Pickle files also let you preserve objects and class structures. They are a powerful tool to do marshalling.
@ArunKumar-ir5tp 5 years ago
Hi, I have one question. I want to apply k-means clustering to my dataset, but I had a problem plotting a date column and a float column from a 2D array. Please explain.
@dataschool 4 years ago
I won't be able to help, good luck!
@injypal 5 years ago
Would you please create videos for .pivot, .pivot_table, merge, concat?
@dataschool 5 years ago
Thanks for your suggestion!
@sezan92 7 years ago
Hello. I am eager to learn about feature engineering for ML. Can you make a video about it?
@dataschool 7 years ago
Thanks for the suggestion! I'll definitely consider it for the future :)
@dataschool 5 years ago
You might enjoy my recent blog post about feature engineering: www.dataschool.io/introduction-to-feature-engineering/
@guysensei1152 4 years ago
Wish I'd found this video sooner.
@dataschool 4 years ago
Better late than never!
@samarthsaraswat6892 5 years ago
I'm getting a future warning while using logistic regression
@dataschool 5 years ago
That's because the API has changed over time. It's not a big deal, but you should do whatever the warning says.
@anonify88 4 years ago
You may use the following:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X, y)
@Natalia-sh6ck 1 year ago
Hi Kevin, I'm going through all of your pandas videos, and I wanted to thank you because I am enjoying the learning process with your detailed explanations. I'm commenting here because I am stuck on this one. I have installed scikit-learn (version 1.1.3), but when I run 'from sklearn.linear_model import LogisticRegression', it says 'Import "sklearn.linear_model" could not be resolved'. I have tried this:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X, y)
But it did not work either. Do you know what I could be missing? I am running all the commands in Visual Studio Code, not in Jupyter. Thank you!
@dataschool 1 year ago
Sounds like maybe there is a problem with the installation?
@Natalia-sh6ck 1 year ago
@dataschool Hi Kevin. Thank you for answering. I uninstalled it and installed it again, and now it works. Thanks!
@dataschool 1 year ago
Great to hear!
@seandhuynh 8 years ago
Thank you for the great tutorial.
@dataschool 8 years ago
You're very welcome!