How do I find and remove duplicate rows in pandas?

Views: 106,388

During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I'll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.
== RESOURCES ==
GitHub repository for the series: github.com/jus...
"duplicated" documentation: pandas.pydata.o...
"drop_duplicates" documentation: pandas.pydata.o...
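
As a rough illustration of the two methods named in the description, here is a minimal sketch. The `users` table below is hypothetical toy data, not the dataset used in the video, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data (not the dataset from the video).
users = pd.DataFrame({
    "age": [24, 53, 24, 33, 24],
    "zip_code": ["85711", "94043", "85711", "15213", "85711"],
})

# duplicated() returns a boolean Series: True where a row repeats an earlier row.
dup_mask = users.duplicated()
n_dups = dup_mask.sum()

# drop_duplicates() returns a new DataFrame without the flagged rows.
deduped = users.drop_duplicates()

# keep=False flags every copy of a repeated row, which is handy for inspection.
all_copies = users.loc[users.duplicated(keep=False)]

print(n_dups, deduped.shape, len(all_copies))
```

Both methods accept `keep` ('first', 'last', or False) and `subset` (a list of columns to compare), which is how their behavior can be adjusted.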

Comments: 233

@fredcalo 7 years ago

I spent hours trying to figure this stuff out through reading chapters and chapters in Python books. Then I come here, and everything I was trying to figure out was explained in 9 minutes. This was IMMENSELY helpful, thanks!

@dataschool 7 years ago

Awesome!! That's so great to hear!

@mea97905 8 years ago

I like your concise and precise videos. I really appreciate your efforts.

@dataschool 8 years ago

Thanks, I appreciate your comment!

@jordyleffers9244 4 years ago

lol, just when I felt you wouldn't handle the exact subject I was looking for: there came the bonus! Thanks!

@reubenwyoung 5 years ago

Thanks so much for this! You helped me combine 629 files and remove 250k duplicate rows! You're the man! *Subscribed*

@dataschool 5 years ago

Great to hear! 😄

@hongyeegan733 4 years ago

Wow! You were already teaching data science in 2014, when it was not even popular! By the way, your videos are really good: you speak slowly and clearly, which makes them easy for me to follow. Kudos to you!

@dataschool 4 years ago

Thanks very much for your kind words!

@emanueleco7363 4 years ago

You are the greatest teacher in the world

@minaha9213 2 years ago

I just found your channel and watched this as my first video of yours, and I pressed subscribe!!! Your explanation of the whole idea is remarkable 😃 Thanks a lot.

@dataschool a year ago

Thank you!

@tushargoyaliit 5 years ago

I'm from Punjab and studying at IIT, and even so, I only got a satisfying explanation of pandas from your videos. Thanks! Please share everything you've covered in text or tutorial format.

@dataschool 5 years ago

Is this what you are looking for? nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

@MrTheAnthonyBielecki 6 years ago

Exactly what I needed! Why not set up a Patreon so we can show some love?

@dataschool 6 years ago

Thanks for the suggestion! I am planning to set one up soon, and will let you know when it's live :)

@dataschool 6 years ago

I just launched my Patreon campaign! I'd love to have your support: www.patreon.com/dataschool/overview

@shashwatpaul3330 4 years ago

I have watched a lot of your videos; and I must say that the way, you explain is really good. Just to inform you that I am new to programming let alone Python. I want to learn a new thing from you. Let me give you a brief. I am working on a dataset to predict App Rating from Google Play Store. There is an attribute by name "Rating" which has a lot of null values. I want to replace those null values using a median from another attribute by name "Reviews". But I want to categorize the attribute "Reviews" in multiple categories like: 1st category would be for the reviews less than 100,000, 2nd category would be for the reviews between 100,001 and 1,000,000, 3rd category would be for the reviews between 1,000,001 and 5,000,000 and 4th category would be for the reviews anything more than 5,000,000. Although, I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the below command: gps['Reviews Group'] = [1 if x

@cablemaster8874 3 years ago

Really, your teaching method is very good, and your videos give a lot of knowledge. Thanks, Data School!

@dataschool 3 years ago

You're very welcome!

@cradleofrelaxation6473 a year ago

This is so helpful! Pandas has the best duplicates handling. Better than spreadsheets and SQL.

@dataschool a year ago

Thanks!

@supa.scoopa 4 months ago

THANK YOU for the keep tip, that's exactly what I was looking for!

@dataschool 4 months ago

Great to hear!

@dhananjaykansal8097 5 years ago

I didn't find much in Duplicates. Thanks so much sir. I can't thank u enough.

@dataschool 5 years ago

You're welcome!

@Beny123 6 years ago

Thank you! Here is a way to extract the non-duplicate rows: df = df.loc[~df.A.duplicated(keep='first')].reset_index(drop=True)

@dataschool 6 years ago

Thanks for sharing!

@rashayahya 4 years ago

I always find what I need in your channel.. and more... Thank you

@dataschool 4 years ago

Great to hear!

@balajibhaskarraokondhekar1823 3 years ago

You have done a very good job explaining DataFrames and made them very easy to understand, especially for people who work in Excel. Best wishes from me.

@dataschool 3 years ago

Thanks!

@ranveersharma1666 4 years ago

love u brother . u r changing so many lives, thanku ....the best teacher award goes to Data school.

@dataschool 4 years ago

Thanks very much for your kind words!

@Kristina_Tsoy a year ago

Kevin your videos are super helpful! thank you!!!

@dataschool a year ago

You're very welcome!

@oeb5542 4 years ago

Your efforts are very much appreciated. Thanks a million for sharing your Python knowledge with us. It has been a wonderful journey with your precise explanations. Keep up the hard work! Warm regards.

@dataschool 4 years ago

Thanks very much! 😄

@jessicafletcher0610 a year ago

OMG, I WANT TO THANK YOU SOOOO MUCH 😊 I had been stuck on this problem for days, and the way you explain it makes it so much easier than how I learned it in class. I was so happy not to see that error message 😂 Thank you

@dataschool a year ago

You're so very welcome! Glad I could help!

@cyl1040 4 years ago

I was able to remove the duplicate data from my CSV file~~~ Thank you. However, I suggest showing the resulting DataFrame after the deletion, for example: new_data = df.drop_duplicates(keep='first') followed by new_data.head(24898). With that addition, I think this video would be even better~~~

@deki90to 3 years ago

HOW DO YOU KNOW WHAT I NEED? YOU ARE MY FAV TEACHER FROM NOW

@dataschool 3 years ago

Ha! Thank you! 😊

@randyle2511 7 years ago

I like the way you explain things; it's very clear and precise. My problem is a little more complex: I want to remove an entire row when it meets the following conditions. If a row has the same value in the Latitude column as the previous row AND the same value in the Longitude column as the previous row, THEN remove that duplicated row. Basically, we have to compare two consecutive ROWS across both COLUMNS, and IF both conditions are met, remove the entire row. Let's say 15 rows have the same values in both the Latitude and Longitude columns (i.e., if Lat[1,1] == Lat[0,1] & Lon[1,2] == Lon[0,2] then remove, else skip; Lat = Col1, Long = Col2): then remove them all except one. Hope you got my point... :-). Looking forward to seeing your code.

@dataschool 7 years ago

Glad you like the videos! It's not immediately obvious to me how I would approach this problem, but I think that the 'shift' function from pandas might be useful. Good luck! Sorry that I can't provide any code.
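
One possible reading of the `shift` suggestion above, as a minimal sketch: compare each row's coordinates with the previous row's and drop the rows where both match. The `track` data is hypothetical; only the Latitude and Longitude column names come from the question.

```python
import pandas as pd

# Hypothetical GPS-like data with runs of repeated coordinates.
track = pd.DataFrame({
    "Latitude":  [10.0, 10.0, 10.0, 11.0, 10.0],
    "Longitude": [20.0, 20.0, 20.0, 20.0, 20.0],
})

coords = track[["Latitude", "Longitude"]]

# shift() moves every row down by one, so each row is compared with the row before it.
# A row is flagged only when BOTH columns equal the previous row's values.
same_as_previous = (coords == coords.shift()).all(axis=1)

# Keep the first row of each consecutive run and drop the repeats.
cleaned = track.loc[~same_as_previous]
print(cleaned)
```

Unlike `drop_duplicates`, this keeps a coordinate pair that reappears later after a different row in between. Note that missing values never compare equal, so consecutive NaN rows would not be dropped by this sketch.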

@goldensleeves 4 years ago

At the end are you saying that "age" + "zip code" must TOGETHER be duplicates? Or are you saying "age" duplicates and "zip code" duplicates must remove their individual duplicates from their respective columns? Thanks
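
For what it's worth, when `subset` lists several columns, pandas treats a row as a duplicate only if all of the listed columns match an earlier row together. A small sketch on hypothetical data (the values are made up; the column names follow the question):

```python
import pandas as pd

# Hypothetical rows chosen so that the three checks give different answers.
users = pd.DataFrame({
    "age": [24, 24, 24, 53],
    "zip_code": ["85711", "85711", "94043", "94043"],
})

# Both columns must match an earlier row for the row to be flagged.
together = users.duplicated(subset=["age", "zip_code"])

# Checking each column on its own flags more rows.
age_only = users.duplicated(subset=["age"])
zip_only = users.duplicated(subset=["zip_code"])

print(together.tolist(), age_only.tolist(), zip_only.tolist())
```

Only the second row is flagged by the combined check, because it is the only one that repeats an earlier age and zip code pair.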

@ItsWithinYou 2 years ago

If I have a DataFrame with a million rows and 15 columns, how do I figure out whether any columns in my DataFrame have mixed data types?
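
One way to approach this question, sketched under the assumption that "mixed" means a column holds more than one Python type: count the distinct element types per column. The DataFrame and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical data: one clean integer column and one column mixing int, str, and float.
df = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3.0],
})

# For each column, map every value to its Python type and count the distinct types.
types_per_column = df.apply(lambda col: col.map(type).nunique())

# Columns with more than one distinct type are the mixed ones.
mixed_columns = types_per_column[types_per_column > 1].index.tolist()
print(mixed_columns)
```

Missing values are stored as floats, so a text column containing NaN would also be reported as mixed by this check.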

@asadghnaim2332 3 years ago

When I use the parameter keep=False, I get fewer rows than with 'first' and 'last' combined. What is the reason for that??
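
A likely explanation, shown as a sketch on a hypothetical Series: for a value that appears n times, keep='first' and keep='last' each flag n - 1 rows, while keep=False flags all n. Once a value appears three or more times, the middle copies are counted by both 'first' and 'last', so the two counts added together exceed the keep=False count.

```python
import pandas as pd

# Hypothetical values: "a" appears three times, "b" twice, "c" once.
s = pd.Series(["a", "a", "a", "b", "b", "c"])

n_first = s.duplicated(keep="first").sum()  # flags every copy except the first
n_last = s.duplicated(keep="last").sum()    # flags every copy except the last
n_all = s.duplicated(keep=False).sum()      # flags every copy of a repeated value

print(n_first, n_last, n_all)
```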

@mariusnorheim 6 years ago

How can I remove duplicate rows based on 2 column values? I want to drop a row if two column values are the same. E.g. I have one column with Country = [USA, USA, Canada, USA] and an income column with values = [1000, 900, 900, 900]. I only want to drop the duplicate where both the country AND the income is 900. While if one row has country = Canada and income = 900 and second row has USA with income 900 I want to keep them both. Answers appreciated! Your videos are really helpful for learning pandas. Keep up the good work!

@dataschool 6 years ago

Sorry, I'm not quite clear on what the rules are for when a row should be kept and when it should be dropped. Perhaps you could think of this task in terms of filtering the DataFrame, rather than using the drop duplicates functionality?

@mariusnorheim 6 years ago

Thanks for the reply! I managed to improve my code to avoid the duplicates in the first place. Keep up your great work with the videos, really helpful for improving my skills!

@dataschool 6 years ago

Great to hear! :)

@anthonygonsalvis121 3 years ago

Very methodical explanation

@dataschool 3 years ago

Thanks!

@narbigogul5723 6 years ago

That's exactly what I was looking for, great explanation, thanks for sharing!

@dataschool 6 years ago

You're welcome!

@brianwaweru9764 3 years ago

Wait, Kevin: keep='first' means the rows marked as duplicates are the ones towards the bottom, meaning they have a higher index. keep='last' means...?? Oh man, I'm getting mixed up. Could someone please explain it to me? Kevin, please?
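
A tiny sketch that may help with this question: `keep` names the copy that is left unflagged. With keep='first' the earliest occurrence is treated as the original and the later ones are marked; with keep='last' it is the other way round. The Series below is hypothetical.

```python
import pandas as pd

# Hypothetical values: "x" appears at positions 0, 2, and 3.
s = pd.Series(["x", "y", "x", "x"])

# keep='first': position 0 is the original, positions 2 and 3 are flagged.
flag_first = s.duplicated(keep="first").tolist()

# keep='last': position 3 is the original, positions 0 and 2 are flagged.
flag_last = s.duplicated(keep="last").tolist()

print(flag_first)
print(flag_last)
```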

@chandrapatibhanuprakashap1862 2 years ago

It helps me a lot. Can you explain how we get the count of each duplicated value?
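
One way to get those counts, as a sketch on hypothetical data: count every value with `value_counts` and keep only the values that occur more than once.

```python
import pandas as pd

# Hypothetical column with two repeated values.
users = pd.DataFrame({
    "zip_code": ["85711", "94043", "85711", "15213", "85711", "94043"],
})

# Count how often each value occurs, then keep the ones seen more than once.
counts = users["zip_code"].value_counts()
repeated = counts[counts > 1]
print(repeated)
```

For whole rows rather than a single column, grouping by all columns and calling `.size()` gives the same kind of count per distinct row.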

@alishbakhan1084 a year ago

Thank you so much💕 your videos are really amazing...can you tell how to read any csv(without header on first line) and set first row with non null values as header...

@cafdo 3 years ago

Great video. This helped me tremendously. How would you go about finding duplicates "case insensitive" with a certain field?
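
A possible approach to the case-insensitive question, sketched on a hypothetical `name` column: build a lowercased key, find duplicates on the key, and use the resulting mask to filter the original rows so their original casing is preserved.

```python
import pandas as pd

# Hypothetical names that differ only by letter case.
df = pd.DataFrame({"name": ["Alice", "alice", "BOB", "Bob", "Carol"]})

# Compare on a lowercased copy of the column, leaving the original untouched.
key = df["name"].str.lower()
mask = key.duplicated()

# Keep the first spelling of each name.
deduped = df.loc[~mask]
print(deduped)
```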

@JoshKelson 5 years ago

Trying to figure out how to replace values above/below a threshold with the mean or median. If I find values that are skewing the data from a column, but don't want to exclude the whole row and drop the row, I just want to replace the value in one of the columns with a mean/median value. Can't figure out how to do this! IE: I want to replace all values in column 'age' that are above 130 (erroneous data), with the mean age of all the other values in 'age' column.

@dataschool 5 years ago

I'm sorry, I don't know the code for this off-hand. However, this would be a great question to ask during one of my monthly live webcasts with Data School Insiders: www.patreon.com/dataschool (join at the "Classroom Crew" level to participate)
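
For readers with the same question, here is one way it could be done, as a sketch rather than the author's method. The data is hypothetical; the `age` column and the 130 threshold come from the question.

```python
import pandas as pd

# Hypothetical ages with one clearly erroneous value.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0, 999.0]})

# Flag the implausible values.
too_high = df["age"] > 130

# Compute the mean from the remaining (valid) values only.
mean_valid = df.loc[~too_high, "age"].mean()

# mask() replaces the flagged values and leaves the rest unchanged.
df["age"] = df["age"].mask(too_high, mean_valid)
print(df)
```

Swapping `.mean()` for `.median()` gives the median-based variant.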

@chandramohanbettadpura4993 5 years ago

I have some missing dates in my dataset and want to add the missing dates to the dataset. I used isnull() to track these dates but I don't know how to add those dates into my dataset..Can you please help.Thanks

@dataschool 5 years ago

You might be able to use fillna and specify a method: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

@KaiZergTV a year ago

Thank you so much, you made my day. Finally i found the row of code, that i really needed to finish my task:)(Code Line 17)

@dataschool a year ago

Glad I could help!

@oasisgod1421 3 years ago

Great video. But I'd like just to find a duplicate column and then go to another column and find the duplicate and go to another column and find the duplicate and remain only one row with certain information.

@imad_uddin 3 years ago

Thanks a lot. It was a great help. Much appreciated!

@dataschool 3 years ago

You're welcome!

@reazahmed7004 3 years ago

How do I access iPython Jupyter Notebook link? it is not available in the github repository.

@dataschool 3 years ago

Is this what you were looking for? nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

@asifsohail5900 3 years ago

How can we efficiently find near duplicates from a dataset?

@rajoptional 3 years ago

Amazing and thanks bro , the right place for data queries

@dataschool 3 years ago

Happy to help

@lindafl2528 3 years ago

hello, thank you for the video, I'm wondering if you can make some tutorials about the API requests

@dataschool 3 years ago

Thanks for your suggestion!

@jeffhale739 5 years ago

Great video, Kevin! Super useful!

@dataschool 5 years ago

Thanks Jeff! :)

@antonyjoy5494 3 years ago

This is the case of complete duplicates. What should we do when we have to deal with incomplete duplicates? E.g., age, gender, and occupation are the same but zip is different. Could you also make a video on that, please?

@Anastasia-wy1uj 3 years ago

Jeez you just saved me so much work for a seemingly unsolvable project 🙏☕

@dataschool 3 years ago

That's awesome to hear!

@Anastasia-wy1uj 3 years ago

@@dataschool hey Kevin, I wonder if there's a way of grouping the results in groups that contain the found duplicate rows 🤔 I'm just thinking of a use case where some products (rows/index) with the same values (numerical and categorical) in features (columns) could be put into a product group so that a customer doesn't need to look through thousands of similar products but through much fewer product groups. The idea is of course implying product group feature selection beforehand and adding product variants afterwards (i.e. further product features that could differ among the products of one product group). I'd really appreciate your thoughts or advice on this 🙏 thanks 💙

@halildurmaz7827 3 years ago

Clean and informative !

@dataschool 3 years ago

Thanks!

@emilyyyjw 4 years ago

Hi, I am wondering whether you could identify an issue that I am having whilst cleaning a dataset with the help of your tutorials. I will post the commands that I have used below:
df["is_duplicate"] = df.duplicated()  # make a new column marking whether each row is a duplicate
df.is_duplicate.value_counts()  # -> False 25804, True 1591
df.drop_duplicates(keep='first', inplace=True)  # attempt to drop all duplicates other than the first instance
df.is_duplicate.value_counts()  # -> False 25804, True 728
I am struggling to identify why there are still some duplicates that are marked 'True'. Kind regards,

@dataschool 4 years ago

That's an excellent question! The problem is that by adding a new column called "is_duplicate", you actually reduce the number of rows which are duplicates of one another! Instead of adding that column, you should first check the number of duplicates with df.duplicated().sum(), then drop the duplicates, then check the number of duplicates again. Hope that helps!
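
The effect described in this reply can be reproduced on a small hypothetical DataFrame. The sketch below first shows the flag column getting in the way, then the check, drop, and re-check sequence suggested above.

```python
import pandas as pd

# Hypothetical data: the first row appears three times.
df = pd.DataFrame({"a": [1, 1, 1, 2], "b": ["x", "x", "x", "y"]})

# Problematic pattern: the flag column becomes part of the row comparison.
flagged = df.copy()
flagged["is_duplicate"] = flagged.duplicated()
flagged = flagged.drop_duplicates(keep="first")
# The first copy (False) and a later copy (True) no longer match each other,
# so one row marked True survives the drop.
leftover_true = flagged["is_duplicate"].sum()

# Suggested pattern: count, drop, then count again without adding a column.
n_before = df.duplicated().sum()
deduped = df.drop_duplicates()
n_after = deduped.duplicated().sum()

print(leftover_true, n_before, n_after)
```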

@rationalindian5452 3 years ago

Brilliant video .

@dataschool 3 years ago

Thanks!

@omgthisana10 3 months ago

very well explained ty !

@dataschool 3 months ago

You're very welcome!

@mahdibouaziz5353 4 years ago

you're amazing we need more videos in your channel

@dataschool 4 years ago

I do my best! I've got 20+ hours of additional videos available to Data School Insiders at various levels: www.patreon.com/dataschool

@ravinduabeygunasekara833 5 years ago

Great video! By the way, how do you know all this stuff? Do you take classes or read books?

@dataschool 5 years ago

Work experience, reading documentation, trying things out, teaching, reading tutorials, etc.

@MrMukulpandey a year ago

love to have more videos like this

@dataschool a year ago

Thanks for your support!

@deltatv9335 5 years ago

Hey Buddy, You are amazing and you remind me of Sheldon Cooper (BBT) because of the way you talk and also both of you are super smart. :-) One request- Please cover outliers sometime. Thanks.

@dataschool 5 years ago

Ha! Many people have commented something similar :) And, thanks for your topic suggestion!

@harshitagrwal9975 10 months ago

The user IDs are not the same, so how can the rows be duplicates?

@prakmyl 4 years ago

I get an error when I run users.drop_duplicates(subset=['age','zip_code']).shape. The error is "'bool' object is not callable". I even get the same error if I run users.duplicated().sum().

@dataschool 4 years ago

Remove the .shape, and see what the results look like. Also, compare your code against mine in this notebook: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

@robind999 5 years ago

simple and useful. thanks Kevin.

@dataschool 5 years ago

You're welcome!

@harneetlamba9512 5 years ago

Hi, In the above video, at 1:12 minutes - the pandas DataFrame is displayed in Tabular form, with all the variables separated by vertical line. But in latest jupyter notebook, we get a single line below variable name. Can we get the same display as earlier, with new Jupyter version ?

@dataschool 5 years ago

There's probably a way, but it's probably not easy. I'm sorry!

@ayatbadayatbad7688 5 years ago

Thank you for this useful tutorial. Quick question: how do you check whether a value in column A is present in column B, not necessarily in the same row? It is the same thing that the VLOOKUP function does in Excel. Many thanks for your feedback!

@dataschool 5 years ago

I'm not sure I understand your question, I'm sorry!
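
If the question is read as "for each value in column A, does it appear anywhere in column B?", `isin` is one option. This is a sketch under that assumed reading, with hypothetical data and a hypothetical `A_in_B` column name.

```python
import pandas as pd

# Hypothetical columns A and B.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 3, 5, 1]})

# isin() checks membership against all of B's values, regardless of row position.
df["A_in_B"] = df["A"].isin(df["B"])
print(df)
```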

@subuktageenshaikh2041 7 years ago

Hi, I have a question: how do I remove duplicates from rows that are text or sentences, like in the RCV1 data set?

@dataschool 7 years ago

The same process shown in the video will work for text data, as long as the duplicates are exact matches. Does that answer your question?

@engineeringlife2775 a year ago

Bonus Question 7:55

@KimmoHintikka 7 years ago

I had weird error with this one. Setting index col with index_col='user_id' does not work for me it raises KeyError: 'user_id' error. Instead I had to run users = pd.read_table('bit.ly/movieusers', sep='|', header=None, names=user_cols) first and then users.set_index('user_id') for this tutorial to work

@dataschool 7 years ago

Interesting! I'm not sure why that would be. But thanks for mentioning the workaround!

@jamesdoone3516 7 years ago

Really great job. Thank you very much!!

@prakmyl 4 years ago

Awesome videos, Kevin. Thanks a lot for sharing your knowledge.

@dataschool 4 years ago

Thanks Prakash!

@artistz1831 6 years ago

Hey Kevin, I am confused about dropping duplicates here: the number of duplicated age and zip code rows is 14, but after you drop the duplicates, the shape is 927. The total is 943, so shouldn't the correct shape be 943 - 14 = 929? Thanks a lot for your help!!!

@dataschool 6 years ago

I disagree with your statement "the number of duplicated age and zipcode is 14"... could you explain how you came to that conclusion? Thanks!

@mansoormujawar1279 7 years ago

Because of your quality pandas series, I started following you. About duplicates: in my use case, instead of dropping duplicate rows, I would like to keep the first instance and just remove the other duplicate values from a specific column, so the shape remains the same after removing the duplicate values from that column. I'd really appreciate it if you have some time to answer this, thanks.

@dataschool 7 years ago

Glad you like the series! I'm not sure I understand your question - perhaps the documentation for drop_duplicates will help? pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
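
If the goal is to blank out repeated values in one column while keeping every row, one option is to combine `duplicated` with `mask` instead of dropping rows. This is a sketch of one interpretation of the question, on hypothetical data.

```python
import pandas as pd

# Hypothetical data with a repeated city name.
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Rome", "Oslo"], "n": [1, 2, 3, 4]})

# Flag every repeat after the first occurrence.
repeat = df["city"].duplicated()

# mask() replaces the flagged cells with a missing value; no rows are removed.
df["city"] = df["city"].mask(repeat)
print(df)
```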

@mmarva3597 3 years ago

Thank you for this content! I have a question : how can we handle quasi redundant values in different columns ? (Imagine two different columns each containing similar values at 80%). Thanks a lot

@dataschool 3 years ago

When you say "handle", what is your goal? If you want to identify close matches, you can do what is called "fuzzy matching". Here's an example: pbpython.com/record-linking.html Hope that helps!

@mmarva3597 3 years ago

@@dataschool Merci beaucoup for the reply. Let me explain my question : I have two variables/features named categories (milk, snack,pasta,oil,etc) and categories_en(en:milk , en:snack, en: pasta). My goal is to keep only one feature since both features share the same information. It was suggested that running a chi square test would help me decide which feature to keep but it seems silly to me :( ( I have almost 2millions records)

@dataschool 3 years ago

It probably doesn't matter which feature you keep, if they contain roughly the same information.

@zhaoqilong1994 8 years ago

Is there a simple tutorial on regular expressions in Python available?

@dataschool 8 years ago

For learning regular expressions, I like these two resources: developers.google.com/edu/python/regular-expressions www.pythonlearn.com/html-270/book012.html

@arpitmittal7865 4 years ago

very useful videos.. can you please tell me how to find duplicate of just one specific row?

@dataschool 4 years ago

Sorry, I don't fully understand. Good luck!

@jatinshetty 4 years ago

Yo! You are a superb teacher!

@dataschool 4 years ago

Thank you!

@benogidan 7 years ago

cheers for this :) will definitely consider purchasing the package

@dataschool 7 years ago

You're very welcome! The pandas library is open source, so it's free!

@benogidan 7 years ago

sorry i meant on your website, the course ;)

@dataschool 7 years ago

Awesome! Let me know if you have any questions about the course. More information is here: www.dataschool.io/learn/

@abdulazizalsuayri4908 6 years ago

full of useful info. Thanx man

@dataschool 6 years ago

You're very welcome! :)

@dandixon9466 7 years ago

Great work man!

@dataschool 7 years ago

Thanks!

@peekayji 6 years ago

Great! Very well explained.

@dataschool 6 years ago

Thanks!

@syyamnoor9792 5 years ago

you are a hero...

@dataschool 5 years ago

That's very kind of you! :)

@DimasAnggaFM 4 years ago

great video!!

@dataschool 4 years ago

Thanks!

@krzysztofszeremeta1125 6 years ago

What is the best way to compare data from two files (with the same schema)?

@dataschool 6 years ago

I don't know if there's one right way to do this... it depends on the details. Sorry I can't give you a better answer!

@sherlocksu1131 7 years ago

Hi, when you mentioned "inplace" in the video, I was happy that pandas has this parameter for experimenting. But a problem arises: should I remember all the methods that have the inplace parameter, and remember which methods affect the original DataFrame, in case I use a DataFrame that has already changed when doing a calculation? It is a huge job to remember which methods have the 'inplace' parameter and which do not, isn't it..... TOT

@sherlocksu1131 7 years ago

That is a huge

@dataschool 7 years ago

The 'inplace' parameter is just for convenience. I do recommend trying to memorize when that parameter is available. But if you forget, that's fine, because you can always write code like this: ufo = ufo.drop('Colors Reported', axis=1) ...instead of this: ufo.drop('Colors Reported', axis=1, inplace=True)

@sherlocksu1131 7 years ago

Does the inplace argument default to False in every method? My problem is this: I worry that sometimes a method with the inplace parameter changes the original DataFrame, and sometimes it does not. So I get confused about when it affects the original DataFrame, since a wrong judgment might lead to a bad conclusion.

@dataschool 7 years ago

I think that 'inplace' is always False (by default) for all pandas functions.

@sagarbhadani1932 5 years ago

Hi, I need help. Suppose we have a table of transactions like the one below, where a transaction can contain several items in the Item column. How do I find the transactions that contain at least Coffee?
Transaction  Item
1            Tea
2            Cookies
2            Coffee
3            cookies
4            Bread
4            Cookies
4            Coffee

@dataschool 5 years ago

I'm not sure off-hand, good luck!
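
One possible approach to this question, as a sketch: find the transaction IDs that have a Coffee row, then select every row belonging to those transactions. The data is copied from the question; the lowercase comparison is an assumption, added because the question's data mixes "Cookies" and "cookies".

```python
import pandas as pd

# Data taken from the question above.
tx = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 4, 4, 4],
    "Item": ["Tea", "Cookies", "Coffee", "cookies", "Bread", "Cookies", "Coffee"],
})

# IDs of the transactions that include coffee (compared case-insensitively).
coffee_ids = tx.loc[tx["Item"].str.lower() == "coffee", "Transaction"].unique()

# All rows that belong to those transactions.
coffee_tx = tx[tx["Transaction"].isin(coffee_ids)]
print(coffee_tx)
```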

@VNTHOTA 5 years ago

You should have used sort_values option with users.loc[users.duplicated(keep=False)].sort_values(by='age')

@dataschool 5 years ago

Thanks for your suggestion!

@hiericzhu 6 years ago

Hi, I have a question. I want to mark consecutive duplicate values. For [1,1,1,0,2,3,2,4,2], my expected result is [True,True,True,False,False,False,False,...]. But pandas duplicated(keep=False) returns [True,True,True,False,True,False,True,False,True]. The function treats the '2' in a 2,x,2,y,2,z,2 sequence as duplicated, which is not what I want. How do I avoid that? I just want to mark the 1,1,1 as True. Thanks.

@dataschool 6 years ago

How about just using code like this: df.columnname == 1 Does that help?
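
A more general alternative, offered as a sketch rather than as the answer given above: mark a value when it equals its immediate neighbour on either side, using `shift`. The Series is the one from the question.

```python
import pandas as pd

# The sequence from the question.
s = pd.Series([1, 1, 1, 0, 2, 3, 2, 4, 2])

# A value belongs to a consecutive run if it equals the previous value
# or the next value; the non-adjacent 2s are therefore left unmarked.
in_run = s.eq(s.shift()) | s.eq(s.shift(-1))
print(in_run.tolist())
```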

@srincrivel1 5 years ago

you're doing god's work son!

@dataschool 5 years ago

Thanks!

@somantalha4888 2 years ago

beneficial videos. ❤

@dataschool a year ago

Thanks!

@anantgosai8884 2 years ago

That was so accurate, thanks a lot genius!

@dataschool 2 years ago

You're very welcome!

@Animesh19007 4 years ago

How do I keep the rows that contain null values in any column and remove the complete rows?

@dataschool 4 years ago

Does this help? kzbin.info/www/bejne/nHSwo4KVi9-Ygpo

@harshindublin 3 years ago

Thanks for the video

@dataschool 3 years ago

You're welcome!

@Drivebyeasy 7 years ago

Hello I want to know the concept of ReSampling please help

@dataschool 7 years ago

I'm sorry, I don't have any resources to offer you. Good luck!

@johnsonburgundypants 6 years ago

very clear, very concise!! :)

@dataschool 6 years ago

Thanks! Glad you liked it!

@zma314125 3 years ago

Thank you!

@dataschool 3 years ago

You're welcome!

@da_ta 5 years ago

thanks for tips and bonus ideas

@dataschool 5 years ago

You're welcome!

@muhammadbashar572 7 years ago

Hi, good afternoon. How do I remove the letter prefixes from values? For example, I have a column that contains customer income like J:10,000 and P:50,000, and I want to turn it into 10000 and 50000.

@dataschool 7 years ago

You can use string methods to strip the first two characters, and then the astype function to change the type from string to integer. These videos might be helpful to you: kzbin.info/www/bejne/mKDJknZmfsieftE kzbin.info/www/bejne/jGGkiKywi7KZa5Y Good luck!
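
The steps described in this reply might look like the sketch below. The values come from the question; removing the thousands separator is an extra step assumed here so that the conversion to integer succeeds.

```python
import pandas as pd

# Values in the format described in the question.
income = pd.Series(["J:10,000", "P:50,000"])

cleaned = (
    income.str[2:]                          # strip the first two characters ("J:")
          .str.replace(",", "", regex=False)  # drop the thousands separator
          .astype(int)                      # convert from string to integer
)
print(cleaned.tolist())
```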

@SahibzadaIrfanUllahNaqshbandi 7 years ago

Thanks for good channel. I like it very much. I have a query. I am working on tweets, I have to remove duplicate tweets as well as tweets which are different in at most one word. I can do first part, Will you please guide me how can I do the second part?? Thanks

@dataschool 7 years ago

That's probably beyond the scope of what you can do with pandas. Perhaps you can take advantage of a fuzzy string matching library.

@SahibzadaIrfanUllahNaqshbandi 7 years ago

Thanks...I will look into it.

@ashishacharya8427 7 years ago

How do I replace similar (near-duplicate) values with one of the values?

@dataschool 7 years ago

I think the process would depend a lot on the particular details of the problem you are trying to solve.

@moremirinplease 3 years ago

i love you, sir.

@dataschool 3 years ago

😊

@Ishkatan 2 years ago

Good lesson, but the datatype has to match. I found I had to process my pandas tables with .astype(str) before this worked.

@ajithtolroy5441 6 years ago

This is what I want, thanks for sharing :)

@dataschool 6 years ago

Great!

@sasa4840 5 years ago

Thanks! My question: how can we sort month names?

@dataschool 5 years ago

This video might be helpful to you: kzbin.info/www/bejne/r3TKe3qpnJWLl5Y

@duckthatgivesafuk8471 5 years ago

I really need help, guys. I have a table with a column named "Neighbourhood". This column has A LOT of names repeated MANY times. To be specific, the column "Neighbourhood" has 10 names that are repeated A LOT of times. My question is: I NEED HELP CREATING A SEPARATE COLUMN SPECIFYING HOW MANY TIMES EACH ELEMENT IN "NEIGHBOURHOOD" HAS BEEN COUNTED. If anyone can help me, please.

@dataschool 5 years ago

I'm not positive this would work, but I might start by creating a dictionary out of value_counts, and then use that as a mapping for the new column. Anyway, I hope you were able to figure out a solution!
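
The idea sketched in this reply does appear to work. Here is a minimal version on hypothetical data; the `times_seen` column name is an assumption, and only the "Neighbourhood" column name comes from the question.

```python
import pandas as pd

# Hypothetical neighbourhood names with repeats.
df = pd.DataFrame({"Neighbourhood": ["A", "B", "A", "C", "A", "B"]})

# Build a {name: count} dictionary from value_counts().
counts = df["Neighbourhood"].value_counts().to_dict()

# map() looks up each row's name in the dictionary to fill the new column.
df["times_seen"] = df["Neighbourhood"].map(counts)
print(df)
```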