Your material has the habit of provoking thought, which often leads me off on exploratory tangents for a couple of hours. Thanks for wasting my day, Reuven 😀
@ManjeeriG · 7 months ago
Thank you so much for this tutorial. I have gone through multiple tutorials, but this one is the easiest and (of course) smartest way to combine multiple CSVs into one. 😊😊 You got a new subscriber.
@ReuvenLerner · 7 months ago
I'm delighted that it helped!
@learner7273 · a year ago
I never knew you could have a list of dataframes. Thank you!
@ReuvenLerner · a year ago
I'm delighted to hear it!
@Merqos · 2 years ago
Thank you so much for this example! This saved me so much time!
@ReuvenLerner · 2 years ago
I'm so happy to hear it!
@Merqos · 2 years ago
@@ReuvenLerner Yeah, really! I was told to extract data from a vessel where all the data was given in 10-minute-interval Excel documents. That meant that for a single day, I had to open 144 files and collect them into one sheet. I love you and the internet, man. Thanks again!
@matthewhsu7299 · 2 years ago
Learned something new today. Thank you so much!
@ReuvenLerner · 2 years ago
Glad to hear it helped!
@silvajonatan · 10 months ago
Hi, fantastic information. Very didactic. Thanks a lot.
@ReuvenLerner · 10 months ago
My pleasure!
@mikej62 · a year ago
Nice!!! I've seen several ways to solve this problem, and this is the most efficient I've personally come across! Question for you: in In [13], there will be times I want to use this approach because I want to verify the number of rows in each file. How would you do it? TIA!
@ReuvenLerner · a year ago
Glad to hear it helped! If you want to verify the number of rows in a data frame, the fastest/easiest way is to ask for len(df.index). You can have an "if" statement in your "for" loop checking how many rows there are in the data frame, and thus know how many you retrieved. I don't believe that there's a way to check the number of rows in a file without reading it into a data frame first, at least with read_csv.
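A sketch of that row-counting loop; the file names and contents here are invented, and the two tiny sample CSV files are written out first so the snippet is self-contained:

```python
import pandas as pd

# Write two tiny sample CSV files (invented names and contents)
for name, n in [('a.csv', 2), ('b.csv', 3)]:
    with open(name, 'w') as f:
        f.write('x,y\n')
        f.writelines(f'{i},{i * 10}\n' for i in range(n))

all_dfs = []
for one_filename in ['a.csv', 'b.csv']:
    new_df = pd.read_csv(one_filename)
    print(f'{one_filename}: {len(new_df.index)} rows')  # per-file row count
    all_dfs.append(new_df)

df = pd.concat(all_dfs)
print(f'total: {len(df.index)} rows')  # 2 + 3 = 5
```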
@mikej62 · a year ago
@@ReuvenLerner Gotcha! What I did was add another print statement, print(len(new_df.index)), before "all_dfs.append(new_df)". I wanted to see how you would approach it. All for the purpose of documenting what I started with. Cheers!!
@33samogo · 2 years ago
Excellent, worked fine for me also, thank you!
@ReuvenLerner · 2 years ago
Excellent!
@varunsharma3485 · 2 years ago
Worked for me. Earlier I was using Excel Power Query to join multiple CSVs and then importing them into pandas, but it had a limitation of roughly 1 million rows in an Excel sheet. This tutorial is very helpful. Thanks, and subscribed. One issue I am facing with this code is that I want to add each individual CSV file's name as a column in the concatenated data frame, which is missing from this code. Any tips?
@ReuvenLerner · 2 years ago
I'm delighted to know that the video helped! BTW, the maximum number of rows in Excel is (in)famous. It even led to the loss of a bunch of COVID data in the UK, by silently removing a whole bunch of inputs. Thanks for joining me here, and I'll see you around...
@tomtaetz20 · 2 years ago
How would you go about this when the CSV files don't all contain the same column names (some different, some the same)?
@ReuvenLerner · 2 years ago
Ooh, that makes it more interesting! If all of the input data frames are guaranteed to contain the columns you want, then you can just select columns with double square brackets + a list of column names in the top line of the comprehension. That is, you can say pd.read_csv(one_filename)[['email', 'phone']]. But if they have different columns, and/or you have to do transformations, then this technique gets much messier. Perhaps you could call a function in the top line of the comprehension. And the function could find the columns it wants and output them. But it won't be quite as elegant.
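That column-selecting comprehension might look like this; the file names and column names are made up, and the sample files are created first so the snippet runs on its own:

```python
import pandas as pd

# Two sample files sharing 'email' and 'phone', but with different extras
with open('c1.csv', 'w') as f:
    f.write('email,phone,age\na@example.com,555-1111,30\n')
with open('c2.csv', 'w') as f:
    f.write('email,phone,city\nb@example.com,555-2222,Oslo\n')

# Select just the shared columns in the top line of the comprehension
df = pd.concat([pd.read_csv(one_filename)[['email', 'phone']]
                for one_filename in ['c1.csv', 'c2.csv']])
print(df)
```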
@tomtaetz20 · 2 years ago
Awesome! Thank you, I will give it a try
@anushkajoshi8427 · 8 months ago
I was literally looking for this exact question in the comment box before commenting mine.
@awaisraza2285 · a year ago
I have a main folder, and in that main folder there are 35 subfolders, named S01, S02, ... S35. Each folder has a dataset with the same structure. How can I concatenate that data into one data frame?
@ReuvenLerner · a year ago
You can! Look up the "recursive" parameter in glob.glob, and you'll see that you can get all of the files that match patterns across multiple subdirectories.
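A sketch of that recursive glob; the directory and file names are invented, and a miniature S01/S02 tree is built first so the snippet is self-contained:

```python
import glob
import os
import pandas as pd

# Build a miniature version of the main-folder/subfolder layout
for sub in ['S01', 'S02']:
    os.makedirs(os.path.join('main', sub), exist_ok=True)
    with open(os.path.join('main', sub, 'data.csv'), 'w') as f:
        f.write('a,b\n1,2\n')

# '**' together with recursive=True descends into every subdirectory
filenames = sorted(glob.glob('main/**/*.csv', recursive=True))
df = pd.concat([pd.read_csv(one_filename) for one_filename in filenames])
print(len(filenames), len(df.index))
```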
@regal7548 · 7 months ago
What if the datasets don't have anything in common? For example, one is geological data, one is survey data, one is market analysis, and each of them has a massive number of null values. Also, the unique IDs are different: for example, one table has SLT20284903, and some others have just numbers. What do we do then?
@ReuvenLerner · 7 months ago
Then you shouldn't be combining them, in this way or any other way! My assumption for this video was that you have a data set broken up across a number of CSV files, each with the same column names and dtypes. You want to take these multiple files and turn them into a single data frame. Pandas provides us with the "pd.concat" method, which is good for such things, but the problem is how you read them into Pandas quickly and easily. If you have geological data, survey data, and market analysis, then *perhaps* they have some factors in common. But you don't want them in the same data frame. Rather, read each into its own data frame, and use "join" or "merge" to combine them.
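A minimal sketch of that join/merge idea, with invented column names and data: each dataset becomes its own data frame, and "merge" lines them up on a shared key instead of stacking rows:

```python
import pandas as pd

geo = pd.DataFrame({'site_id': [1, 2], 'rock_type': ['basalt', 'granite']})
survey = pd.DataFrame({'site_id': [1, 2], 'score': [7.5, 8.1]})

# Columns end up side by side, matched on the key, not stacked vertically
combined = geo.merge(survey, on='site_id')
print(combined)
```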
@regal7548 · 7 months ago
@@ReuvenLerner OK, thank you.
@gabbyf2906 · a year ago
Great video, helped a lot!
@ReuvenLerner · a year ago
So glad to hear it helped!
@israaaljowder9751 · 2 years ago
Very informative video, thanks!
@ReuvenLerner · 2 years ago
Glad it helped!
@nvduk3 · a year ago
If one of the CSV files in between is blank, it breaks the for loop. How can I avoid that?
@ReuvenLerner · a year ago
My best suggestion is to skip a data frame with zero rows from the output list. There might well be better solutions, though!
@nvduk3 · a year ago
@@ReuvenLerner Yes, the video was of immense help. I got the concept, but was facing an issue: I am trying to run it over more than 2,300 .csv files, and many of them are blank, so the loop can't find the column names defined in it and stops there. Manually deleting them is time-consuming and kinda stupid, lol. I'm sorry, as my Python is at an intermediate level, but I will try to skip the files where it can't find a matching column name, as you mentioned. Thanks a lot again!
@ReuvenLerner · a year ago
@@nvduk3 Oh, right - if the file is completely blank, then you can't select columns. That is a problem! Unfortunately, that's a tough one to solve. Maybe you could write a function that loads the first 5 lines, and checks to see if there are any columns. If such a function returns True/False, then you can use it in the "if" of your comprehension, and only truly process those that have the columns you want. Yes, you'll end up reading (at least part of) each file twice, but that might still be best. I'm not sure.
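One possible sketch of that check-then-read idea. The helper name "has_columns" is invented; the key fact is that a completely blank file makes read_csv raise pd.errors.EmptyDataError, which is what the helper catches. The sample files are written out first so the snippet is self-contained:

```python
import pandas as pd

# Sample inputs: one normal CSV file and one completely blank one
with open('good.csv', 'w') as f:
    f.write('a,b\n1,2\n')
open('blank.csv', 'w').close()

def has_columns(filename):
    """Peek at the first few lines; False if the file has no columns at all."""
    try:
        pd.read_csv(filename, nrows=5)
        return True
    except pd.errors.EmptyDataError:
        return False

# Use the check in the "if" clause of the comprehension
all_dfs = [pd.read_csv(one_filename)
           for one_filename in ['good.csv', 'blank.csv']
           if has_columns(one_filename)]
df = pd.concat(all_dfs)
print(len(df.index))  # only good.csv contributes rows
```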
@nvduk3 · a year ago
@@ReuvenLerner Skipping the column names didn't work, but skipping the blank sheets entirely worked, somehow, after a bit of trial and error. Thanks a lot, I really appreciate it 👍🏽
@guocity · 8 months ago
That's really helpful. What about reading multiple CSV files, where reading a new CSV file should ignore repeated rows?
@ReuvenLerner · 8 months ago
To read multiple CSV files, you need to run read_csv multiple times. And I think that if you want to ignore repeated rows, that's something you have to do after creating the data frame, not during the loading process.
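A sketch of that remove-duplicates-afterward approach; the file names and contents are invented, and drop_duplicates runs only after the concat:

```python
import pandas as pd

# Two sample files that share one identical row (3,4)
with open('d1.csv', 'w') as f:
    f.write('a,b\n1,2\n3,4\n')
with open('d2.csv', 'w') as f:
    f.write('a,b\n3,4\n5,6\n')

df = pd.concat([pd.read_csv(one_filename)
                for one_filename in ['d1.csv', 'd2.csv']])
df = df.drop_duplicates(ignore_index=True)  # repeated rows removed afterward
print(len(df.index))  # 4 rows read, 3 unique rows kept
```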
@koosschutter1675 · a year ago
Where can I download some of these large files for testing? I want to split or combine some CSV files, but my basic laptop can't load everything into memory, and I need all the columns. ChatGPT and I came up with appending files, which is slow. I'm not a data scientist; I just want to split and combine CSV files and test my and my PC's capabilities.
@ReuvenLerner · a year ago
You can download the (large!) files from Pandas Workout from here: files.lerner.co.il/pandas-workout-data.zip
@koosschutter1675 · a year ago
@@ReuvenLerner Thank you very much.
@traal · 2 years ago
Would this also work with a generator expression passed into pd.concat? That would look nicer and probably save some memory. 😊
@ReuvenLerner · 2 years ago
It would definitely work. But I'm not sure how much memory it would save, because at the end of the day, we still get all of the data from the three smaller data frames. And the way that behind-the-scenes memory is managed, I'm going to guess (and might be wrong about this) that the usage would be the same. But it would probably be wise to experiment with this; I'll try to find some time to do that, and if I have interesting results, I'll report them on this channel.
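For reference, the generator-expression version (sample files invented): pd.concat accepts any iterable of data frames, though it materializes them internally, which fits the guess above that overall memory usage ends up about the same:

```python
import pandas as pd

for name in ['g1.csv', 'g2.csv']:
    with open(name, 'w') as f:
        f.write('x\n1\n2\n')

# No square brackets: a generator expression instead of a list comprehension
df = pd.concat(pd.read_csv(one_filename)
               for one_filename in ['g1.csv', 'g2.csv'])
print(len(df.index))  # 4
```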
@gam3rman85 · 11 months ago
Helpful, thanks!
@ReuvenLerner · 10 months ago
Glad it helped!
@ReadyF0RHeady · a day ago
How do I clean up synonyms in my CSV files (e.g. "United States" in one file and "United states of america" in another)? I want all of my CSV files to have the same country name.
@FletchersHeating · a year ago
Thanks for this video! Is there a tutorial or information on how to do the same, but ending up with multiple data frames? I.e., one CSV = one data frame. Many thanks!
@ReuvenLerner · a year ago
Glad it helped! The same code would basically work for what you want, if you don't then concatenate the data frames together at the end. You'll end up with a list of data frames, each from a separate CSV file, with which you can do whatever you want.
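In other words, stop just before the pd.concat step; a sketch with invented sample files:

```python
import pandas as pd

for name in ['f1.csv', 'f2.csv']:
    with open(name, 'w') as f:
        f.write('n\n10\n20\n')

# One data frame per CSV file; no concatenation at the end
all_dfs = [pd.read_csv(one_filename)
           for one_filename in ['f1.csv', 'f2.csv']]
for one_df in all_dfs:
    print(one_df['n'].sum())  # work with each data frame on its own
```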
@tedfitzgerald4202 · a year ago
Really super video.
@ReuvenLerner · a year ago
Glad you liked it!
@KingofDiamonds117 · 2 years ago
I can't get this to work for me. I need help.
@ReuvenLerner · 2 years ago
What are you doing, and what error do you get?
@KingofDiamonds117 · 2 years ago
@@ReuvenLerner I tried doing this:

import glob
data = glob.glob['channel1/*.csv']

and got:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13700/502398532.py in <module>
      1 import glob
----> 2 data = glob.glob['channel1/*.csv']

TypeError: 'function' object is not subscriptable
@ReuvenLerner · 2 years ago
@@KingofDiamonds117 "function object is not subscriptable" means: you're using [] on a function, and you really should be using (). Try that!
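The corrected call, for anyone who hits the same error: parentheses call the function, while square brackets try to index it:

```python
import glob

data = glob.glob('channel1/*.csv')  # () to call, not [] to index
print(type(data))  # glob.glob always returns a list (empty if no matches)
```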
@KingofDiamonds117 · 2 years ago
@@ReuvenLerner I'm not getting any errors this time, thanks. I have poor eyesight so it's difficult for me to see properly.