Turning multiple CSV files into a single pandas data frame

31,951 views

Python and Pandas with Reuven Lerner


Comments: 59

@sloughpacman 2 years ago

Your material has the habit of provoking thought, which often leads me off on exploratory tangents for a couple of hours. Thanks for wasting my day, Reuven 😀

@ManjeeriG 7 months ago

Thank you so much for this tutorial. I have gone through multiple tutorials, but this one is the easiest and (of course) smartest way to combine multiple CSVs into one. 😊😊 You got a new subscriber.

@ReuvenLerner 7 months ago

I'm delighted that it helped!

@learner7273 1 year ago

I never knew you could have a list of dataframes. Thank you!

@ReuvenLerner 1 year ago

I'm delighted to hear it!

@Merqos 2 years ago

Thank you so much for this example! This made me save so much time!

@ReuvenLerner 2 years ago

I'm so happy to hear it!

@Merqos 2 years ago

@@ReuvenLerner Yeah, really! I was told to extract data from a vessel where all the data was delivered as Excel documents covering 10-minute intervals. That means for a single day I had to open 144 files and collect them in one sheet. I love you and the internet, man. Thanks again!

@matthewhsu7299 2 years ago

Learned something new today. Thank you so much!

@ReuvenLerner 2 years ago

Glad to hear it helped!

@silvajonatan 10 months ago

Hi, fantastic information. Very didactic. Thanks a lot.

@ReuvenLerner 10 months ago

My pleasure!

@mikej62 1 year ago

Nice!!! I've seen several ways to solve this problem, and this is the most efficient I've personally come across! Question for you: with the approach in In [13], there will be times when I want to verify the number of rows in each file. How would you do it? TIA!

@ReuvenLerner 1 year ago

Glad to hear it helped! If you want to verify the number of rows in a data frame, the fastest/easiest way is to ask for len(df.index). You can have an "if" statement in your "for" loop checking how many rows there are in the data frame, and thus know how many you retrieved. I don't believe that there's a way to check the number of rows in a file without reading it into a data frame first, at least with read_csv.

@mikej62 1 year ago

@@ReuvenLerner Gotcha! What I did was add another print statement before "all_dfs.append(new_df)": print(len(new_df.index)). I wanted to see how you would approach it. It's all for the purpose of documenting what I started with. Cheers!!
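The row-count check discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions: the sample files, the temporary directory, and the `row_counts` dictionary are hypothetical, while `all_dfs`, `new_df`, and `len(new_df.index)` come from the comments above.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: three small CSV files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for i, n_rows in enumerate([2, 3, 1]):
    pd.DataFrame({"x": range(n_rows)}).to_csv(
        os.path.join(tmpdir, f"part{i}.csv"), index=False
    )

all_dfs = []
row_counts = {}  # filename -> number of rows read from that file

for one_filename in sorted(glob.glob(os.path.join(tmpdir, "*.csv"))):
    new_df = pd.read_csv(one_filename)
    # Record the row count before adding the frame to the list.
    row_counts[os.path.basename(one_filename)] = len(new_df.index)
    all_dfs.append(new_df)

df = pd.concat(all_dfs, ignore_index=True)

# Sanity check: the combined frame has as many rows as the files had in total.
assert len(df.index) == sum(row_counts.values())
```

Keeping the counts in a dictionary rather than printing them makes it easier to save them later as documentation of the starting data.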

@33samogo 2 years ago

Excellent, worked fine for me also, thank you!

@ReuvenLerner 2 years ago

Excellent!

@varunsharma3485 2 years ago

Worked for me. Earlier I was using Excel Power Query to join multiple CSVs and then importing them into pandas, but it had a limitation of max 10 million rows in Excel. This tutorial is very helpful. Thanks, and subscribed. One issue I'm facing with this code is that I want to add each individual CSV file's name as a column in the concatenated data frame, and that is missing from this code. Any tips?

@ReuvenLerner 2 years ago

I'm delighted to know that the video helped! BTW, the maximum number of rows in Excel is (in)famous. It even led to the loss of a bunch of covid data in the UK, by silently removing a whole bunch of inputs. Thanks for joining me here, and I'll see you around...
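The question above about keeping each file's name as a column was not answered in the thread. One possible approach, offered only as a sketch and not as the video's method, is to call `assign` on each data frame inside the comprehension. The column name `source_file` and the sample files are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small CSV files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for name, values in [("jan.csv", [1, 2]), ("feb.csv", [3])]:
    pd.DataFrame({"sales": values}).to_csv(os.path.join(tmpdir, name), index=False)

# Tag every frame with the file it came from, then concatenate as usual.
all_dfs = [
    pd.read_csv(one_filename).assign(source_file=os.path.basename(one_filename))
    for one_filename in sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
]

df = pd.concat(all_dfs, ignore_index=True)
print(df)
```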

@tomtaetz20 2 years ago

How would you go about this when the CSV files don't all contain the same column names (some different, some the same)?

@ReuvenLerner 2 years ago

Ooh, that makes it more interesting! If all of the input data frames are guaranteed to contain the columns you want, then you can just select columns with double square brackets + a list of column names in the top line of the comprehension. That is, you can say pd.read_csv(one_filename)[['email', 'phone']]. But if they have different columns, and/or you have to do transformations, then this technique gets much messier. Perhaps you could call a function in the top line of the comprehension. And the function could find the columns it wants and output them. But it won't be quite as elegant.
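The "call a function in the top line of the comprehension" idea from the reply above might look something like this. It is a sketch under assumptions: the helper name `load_wanted`, the sample files, and the choice to fill missing columns with NaN via `reindex` are all hypothetical; only the `email` and `phone` column names come from the reply.

```python
import glob
import os
import tempfile

import pandas as pd

wanted = ["email", "phone"]

# Hypothetical sample data: the two files share only some columns.
tmpdir = tempfile.mkdtemp()
pd.DataFrame(
    {
        "name": ["Ann", "Bob"],
        "email": ["ann@example.com", "bob@example.com"],
        "phone": ["555-0100", "555-0101"],
    }
).to_csv(os.path.join(tmpdir, "a.csv"), index=False)
pd.DataFrame({"email": ["cy@example.com"], "city": ["Oslo"]}).to_csv(
    os.path.join(tmpdir, "b.csv"), index=False
)


def load_wanted(filename):
    """Read only the wanted columns that exist; add the rest as NaN."""
    one_df = pd.read_csv(filename, usecols=lambda name: name in wanted)
    return one_df.reindex(columns=wanted)


all_dfs = [
    load_wanted(one_filename)
    for one_filename in sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
]
df = pd.concat(all_dfs, ignore_index=True)
print(df)
```

Whether missing columns should become NaN or cause the file to be skipped depends on the data, so treat the `reindex` step as one option among several.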

@tomtaetz20 2 years ago

Awesome! Thank you, I will give it a try

@anushkajoshi8427 8 months ago

I was literally looking for the exact question in the comment box, before commenting mine.

@awaisraza2285 1 year ago

I have a main folder, and in that main folder there are 35 subfolders with names like S01, S02... S35. Each folder has a dataset with the same structure. How can I concatenate that data into one data frame?

@ReuvenLerner 1 year ago

You can! Look up the "recursive" parameter in glob.glob, and you'll see that you can get all of the files that match patterns across multiple subdirectories.
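A minimal sketch of the `recursive` suggestion above, assuming a hypothetical folder layout with subfolders S01 to S03 that each hold one file named `data.csv`. With `recursive=True`, the `**` part of the pattern matches any number of nested directories.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical layout: root/S01/data.csv, root/S02/data.csv, root/S03/data.csv
root = tempfile.mkdtemp()
for sub in ["S01", "S02", "S03"]:
    os.makedirs(os.path.join(root, sub))
    pd.DataFrame({"subject": [sub], "value": [1]}).to_csv(
        os.path.join(root, sub, "data.csv"), index=False
    )

# "**" only spans subdirectories when recursive=True is passed.
pattern = os.path.join(root, "**", "*.csv")
filenames = sorted(glob.glob(pattern, recursive=True))

df = pd.concat(
    [pd.read_csv(one_filename) for one_filename in filenames],
    ignore_index=True,
)
print(df)
```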

@regal7548 7 months ago

What if the datasets don't have anything in common? Say one is geological data, one is survey data, one is market analysis, and each of them has a massive number of null values. Also, the unique IDs are different: for example, one table has SLT20284903 and some others have just numbers. What do we do then?

@ReuvenLerner 7 months ago

Then you shouldn't be combining them, in this way or any other way! My assumption for this video was that you have a data set broken up across a number of CSV files, each with the same column names and dtypes. You want to take these multiple files and turn them into a single data frame. Pandas provides us with the "pd.concat" function, which is good for such things, but the problem is how you read them into Pandas quickly and easily. If you have geological data, survey data, and market analysis, then *perhaps* they have some factors in common. But you don't want them in the same data frame. Rather, read each into its own data frame, and use "join" or "merge" to combine them.

@regal7548 7 months ago

@@ReuvenLerner ok.. thank you
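To illustrate the "merge" suggestion in this thread, here is a small sketch. The two data frames, the shared `region` column, and the choice of an outer merge are all hypothetical; real data sets would need a key that they actually have in common.

```python
import pandas as pd

# Hypothetical frames that describe different things but share a "region" key.
geology = pd.DataFrame(
    {"region": ["north", "south"], "rock": ["granite", "basalt"]}
)
survey = pd.DataFrame(
    {"region": ["south", "east"], "responses": [120, 45]}
)

# An outer merge keeps every region from both frames; values that one
# side lacks show up as NaN in the result.
combined = geology.merge(survey, on="region", how="outer")
print(combined)
```

Unlike pd.concat, which stacks rows that share the same columns, a merge lines up rows by key and places the columns side by side.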

@gabbyf2906 1 year ago

Great video helped a lot!

@ReuvenLerner 1 year ago

So glad to hear it helped!

@israaaljowder9751 2 years ago

very informative video, Thanks

@ReuvenLerner 2 years ago

Glad it helped!

@nvduk3 1 year ago

If one of the CSV files in the middle of the set is blank, it breaks the for loop. How can I avoid that?

@ReuvenLerner 1 year ago

My best suggestion is to skip a data frame with zero rows from the output list. There might well be better solutions, though!

@nvduk3 1 year ago

@@ReuvenLerner Yes, the video was of immense help. I got the concept, but I was facing an issue: I'm trying to run it on over 2,300 .csv files, and many of them are blank, so it can't find the column names defined in the loop and stops there. Deleting them manually is time-consuming and kind of stupid, lol. Sorry, my Python is at an intermediate level, but I will try to skip the files where it can't find a matching column name, as you mentioned. Thanks a lot again!

@ReuvenLerner 1 year ago

@@nvduk3 Oh, right - if the file is completely blank, then you can't select columns. That is a problem! Unfortunately, that's a tough one to solve. Maybe you could write a function that loads the first 5 lines, and checks to see if there are any columns. If such a function returns True/False, then you can use it in the "if" of your comprehension, and only truly process those that have the columns you want. Yes, you'll end up reading (at least part of) each file twice, but that might still be best. I'm not sure.

@nvduk3 1 year ago

@@ReuvenLerner Skipping by column names didn't work, but skipping the blank sheets entirely worked after a few rounds of trial and error. Thanks a lot, I really appreciate it 👍🏽
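The True/False helper suggested earlier in this thread could be sketched like this. The function name `has_columns` and the sample files are hypothetical. The sketch relies on `pd.read_csv` raising `pandas.errors.EmptyDataError` for a completely empty file, and on `nrows=5` to avoid reading whole files during the check.

```python
import glob
import os
import tempfile

import pandas as pd

wanted = ["email", "phone"]

# Hypothetical sample data: two usable files, one empty file, and one
# file that lacks the wanted columns.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"email": ["a@example.com", "b@example.com"],
              "phone": ["555-0100", "555-0101"]}).to_csv(
    os.path.join(tmpdir, "good1.csv"), index=False)
pd.DataFrame({"email": ["c@example.com"], "phone": ["555-0102"]}).to_csv(
    os.path.join(tmpdir, "good2.csv"), index=False)
open(os.path.join(tmpdir, "blank.csv"), "w").close()  # zero-byte file
pd.DataFrame({"x": [1]}).to_csv(os.path.join(tmpdir, "other.csv"), index=False)


def has_columns(filename, columns):
    """Return True if the file can be parsed and contains all the columns."""
    try:
        head = pd.read_csv(filename, nrows=5)
    except pd.errors.EmptyDataError:
        # Raised when the file has nothing to parse, e.g. it is blank.
        return False
    return set(columns).issubset(head.columns)


filenames = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
all_dfs = [
    pd.read_csv(one_filename)[wanted]
    for one_filename in filenames
    if has_columns(one_filename, wanted)
]
df = pd.concat(all_dfs, ignore_index=True)
print(df)
```

As the reply notes, each usable file is read (at least partly) twice, which may or may not matter for a few thousand small files.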

@guocity 8 months ago

That's really helpful. What about reading multiple CSV files, and then reading a new CSV file while ignoring repeated rows?

@ReuvenLerner 8 months ago

To read multiple CSV files, you need to run read_csv multiple times. And I think that if you want to ignore repeated rows, that's something you have to do after creating the data frame, not during the loading process.
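Removing repeated rows after loading, as described above, can be sketched with `drop_duplicates`. The sample files are hypothetical, and this treats a row as repeated only when every column matches.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: the row (2, "y") appears in both files.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "label": ["x", "y"]}).to_csv(
    os.path.join(tmpdir, "a.csv"), index=False)
pd.DataFrame({"id": [2, 3], "label": ["y", "z"]}).to_csv(
    os.path.join(tmpdir, "b.csv"), index=False)

filenames = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
df = pd.concat(
    [pd.read_csv(one_filename) for one_filename in filenames],
    ignore_index=True,
)

# Drop rows that are exact duplicates of an earlier row, then renumber.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

To compare only some columns, `drop_duplicates` also accepts a `subset` argument listing the columns to check.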

@koosschutter1675 1 year ago

Where can I download some of these large files for testing? I want to split or combine some CSV files, but my basic laptop can't load everything into memory and I need all the columns. ChatGPT and I came up with appending files, which is slow. I'm not a data scientist; I just want to split and combine CSV files and test my own and my PC's capabilities.

@ReuvenLerner 1 year ago

You can download the (large!) files from Pandas Workout from here: files.lerner.co.il/pandas-workout-data.zip

@koosschutter1675 1 year ago

@@ReuvenLerner Thank you very much.

@traal 2 years ago

Would this also work with a generator expression passed into pd.concat? That would look nicer and probably save some memory. 😊

@ReuvenLerner 2 years ago

It would definitely work. But I'm not sure how much memory it would save, because at the end of the day, we still get all of the data from the three smaller data frames. And the way that behind-the-scenes memory is managed, I'm going to guess (and might be wrong about this) that the usage would be the same. But it would probably be wise to experiment with this; I'll try to find some time to do that, and if I have interesting results, I'll report them on this channel.
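For reference, the generator-expression variant asked about above might look like this. The sample files are hypothetical, and the sketch makes no claim about memory savings. As far as I know, pandas gathers the items into a list internally before concatenating, which would be consistent with the guess in the reply.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: three small CSV files.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"part": [i, i], "value": [10 * i, 10 * i + 1]}).to_csv(
        os.path.join(tmpdir, f"part{i}.csv"), index=False
    )

filenames = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))

# Parentheses instead of square brackets: a generator, not a list.
df = pd.concat(
    (pd.read_csv(one_filename) for one_filename in filenames),
    ignore_index=True,
)
print(df)
```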

@gam3rman85 11 months ago

helpful. thanks!

@ReuvenLerner 10 months ago

Glad it helped!

@ReadyF0RHeady 1 day ago

How do I clean up synonyms in my CSV files (e.g. "United States" in one file and "United States of America" in another)? I want all of my CSV files to use the same country name.

@FletchersHeating 1 year ago

Thanks for this video! Is there a tutorial or information on how to do the same but for multiple data frames? ie. one csv = one dataframe Many thanks

@ReuvenLerner 1 year ago

Glad it helped! The same code would basically work for what you want, if you don't then concatenate the data frames together at the end. You'll end up with a list of data frames, each from a separate CSV file, with which you can do whatever you want.
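A sketch of the "one CSV, one data frame" variant described above, skipping the final concatenation. Using a dictionary keyed by filename instead of a plain list is an assumption made here for easier lookup; the names `dfs_by_name` and the sample files are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small CSV files.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(tmpdir, "first.csv"), index=False)
pd.DataFrame({"b": [3, 4, 5]}).to_csv(os.path.join(tmpdir, "second.csv"), index=False)

filenames = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))

# A list of separate data frames, one per file, with no pd.concat at the end.
all_dfs = [pd.read_csv(one_filename) for one_filename in filenames]

# The same frames as a dict, so each can be looked up by its filename.
dfs_by_name = {
    os.path.basename(one_filename): one_df
    for one_filename, one_df in zip(filenames, all_dfs)
}
print(dfs_by_name["second.csv"])
```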

@tedfitzgerald4202 1 year ago

Really super video.

@ReuvenLerner 1 year ago

Glad you liked it!

@KingofDiamonds117 2 years ago

I can't get this to work for me. I need help.

@ReuvenLerner 2 years ago

What are you doing, and what error do you get?

@KingofDiamonds117 2 years ago

@@ReuvenLerner I tried doing this:

import glob
data = glob.glob['channel1/*.csv']

and got:

TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13700/502398532.py in <module>
      1 import glob
----> 2 data = glob.glob['channel1/*.csv']
TypeError: 'function' object is not subscriptable

@ReuvenLerner 2 years ago

@@KingofDiamonds117 "function object is not subscriptable" means: you're using [] on a function, and you really should be using (). Try that!

@KingofDiamonds117 2 years ago

@@ReuvenLerner I'm not getting any errors this time, thanks. I have poor eyesight so it's difficult for me to see properly.
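For anyone hitting the same error as in this thread, here is a short sketch contrasting the two forms. The `channel1` directory is recreated inside a temporary folder with hypothetical files so the example runs anywhere.

```python
import glob
import os
import tempfile

# Hypothetical setup: a "channel1" folder containing two small CSV files.
channel_dir = os.path.join(tempfile.mkdtemp(), "channel1")
os.makedirs(channel_dir)
for name in ["a.csv", "b.csv"]:
    with open(os.path.join(channel_dir, name), "w") as f:
        f.write("x\n1\n")

pattern = os.path.join(channel_dir, "*.csv")

# Square brackets try to index the function object itself, which fails.
try:
    glob.glob[pattern]
except TypeError as e:
    error_message = str(e)
    print("Error:", error_message)

# Round parentheses call the function and return the matching filenames.
data = sorted(glob.glob(pattern))
print(data)
```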