First post! That’s my husband he knows about data…
@LuisRomaUSA2 жыл бұрын
He knows a lot of good stuff about data 😁. He's the first non-introductory Python YouTuber I have found so far 🎉
@venvanman2 жыл бұрын
aww this is cute
@sketch16252 жыл бұрын
Guess he's really in a "pickle" now.
@foobarAlgorithm2 жыл бұрын
Awww now you guys need a The DataCouple channel if you both do data science! Love your content
@Arpan_Gupta Жыл бұрын
Nice work Mr. ROB
@lashlarue7924 Жыл бұрын
You are my new favorite YouTuber, Sir. I'm learning more from you than anyone else, by a country mile!
@Jvinniec2 жыл бұрын
One really cool feature of .read_parquet() is that it passes through additional parameters for whichever backend you're using. For example the filters parameter in pyarrow allows you to filter data at read, potentially making it even faster: df = pd.read_parquet("myfile.parquet", filters=[('col_name', '
@robmulla2 жыл бұрын
Whoa. That is really cool. I didn't realize you could do that. I've used athena which allows you to query parquet files using standard SQL and it's really nice.
@PhilcoCup Жыл бұрын
Athena is amazing when backed with parquet files, I've used it in order to be able to read through 600M+ records that were in those parquets easily
@incremental_failure Жыл бұрын
That's the real use case for parquet. Feather doesn't have this.
@mschuer1002 жыл бұрын
As always, awesome video...a real eye opener on most efficient file formats. I have only used pickle as compression, but will now investigate feather and parquet. Thanks for putting this together for all of us.
@robmulla2 жыл бұрын
Glad it was helpful! I use parquet all the time now and will never go back.
@holgerbirne18452 жыл бұрын
Very good video :). One note: pickle files can be compressed. If you compress them, they become much smaller, but reading and writing become slower. Overall, parquet and feather are still much better.
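A small sketch of the pickle compression mentioned above, assuming a made-up DataFrame and temporary file names. The exact size difference depends heavily on the data, so no numbers are claimed here:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical data, only used to have something to write.
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.arange(100_000), "group": rng.integers(0, 10, 100_000)})

out_dir = Path(tempfile.mkdtemp())
raw_path = out_dir / "data.pkl"
gz_path = out_dir / "data.pkl.gz"

df.to_pickle(raw_path)                     # ".pkl" extension: written uncompressed
df.to_pickle(gz_path, compression="gzip")  # explicitly gzip-compressed

raw_size = raw_path.stat().st_size
gz_size = gz_path.stat().st_size
print(f"uncompressed: {raw_size} bytes, gzip: {gz_size} bytes")

# On read, pandas infers the compression from the ".gz" extension.
restored = pd.read_pickle(gz_path)
```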
@robmulla2 жыл бұрын
Good point! There are many ways to save/compress that I probably didn't cover. Thanks for watching the video.
@nancyzhang67902 жыл бұрын
I saw people mentioned feather on Kaggle sometimes, but had no clue what they were talking about. Finally, I got answers to many questions in my mind. Thank you!
@robmulla2 жыл бұрын
Yes. Feather and parquet formats are awesome for when you want to quickly read and write data to disk. Glad the video helped you learn!
@nascentnaga10 ай бұрын
as someone moving into datascience this is such a great explainer! thank you
@69k_gold10 ай бұрын
I looked this up, and it's a pretty cool format, I kinda guessed that it could be a column-based storage strategy when you said that we can efficiently get only select columns, but after I looked it up and found it to be true, it felt very exciting. Anyways, hats off to Google's engineers for thinking out of the box on this, the number of things we can do just by storing data as column-lines rather than row-lines is a lot. Of course, the trade-off is that it's very expensive to modify column-wise data, so this is more useful for static datasets that require multi-dim analysis
@DainiusKirsnauskas8 ай бұрын
Man, I thought this video is a clickbait, but it was awesome. Thank you!
@walterpark8824 Жыл бұрын
Exactly what I needed to know, and to the point. Thanks. As Einstein said, 'Everything should be as simple as possible, and no simpler!'
@robmulla Жыл бұрын
That’s a great quote. Glad you found this helpful.
@spontinimalkyАй бұрын
You explain very clearly. Thank you.
@bendirval36122 жыл бұрын
A major design objective of feather is to be able to be read by R. If you are doing pandas-type data science stuff, this is a significant advantage.
@robmulla2 жыл бұрын
Great point. The R package called "arrow" can read in both parquet and feather files.
@Banefane10 ай бұрын
Very clear, very structured, and the details are intuitive to understand!
@chrisdsouza18853 ай бұрын
These file saving methods are really useful 😊
@rrestituti Жыл бұрын
Amazing! Got one new member. Thanks, Rob! 😉
@robmulla Жыл бұрын
Glad you liked it. Thanks for commenting!
@KirowOnet Жыл бұрын
This was the first video from the channel that randomly appeared in my feed. I clicked, I watched - I liked and subscribed :D. This video planted a seed in my mind, and some others inspired me to try. So a few days later I got a playground environment running in Docker. I'm not a data scientist, but tips and tricks from your videos could be useful for any developer. I used to write code to check some datasets, but with pandas and Jupyter Notebook it's way faster. Thank you for sharing your experience!
@robmulla Жыл бұрын
Wow, I really appreciate this feedback. Glad you found it helpful and got some code working yourself. Share with friends and keep an eye out for new videos dropping soon!
@wonderland860 Жыл бұрын
This video greatly helped me. I didn't know so many ways to dump a DataFrame. I then did a further test, and found the compression option plays a big role:
df.to_pickle(FILE_NAME, compression='xz') -> 288M
df.to_pickle(FILE_NAME, compression='bz2') -> 322M
df.to_pickle(FILE_NAME, compression='gzip') -> 346M
df.to_pickle(FILE_NAME, compression='zip') -> 348M
df.to_pickle(FILE_NAME, compression='infer') -> 679M # default compression
df.to_parquet(FILE_NAME, compression='brotli') -> 334M
df.to_parquet(FILE_NAME, compression='gzip') -> 355M
df.to_parquet(FILE_NAME, compression='snappy') -> 423M # default compression
df.to_feather(FILE_NAME) -> 500M
@robmulla Жыл бұрын
Nice findings! Thanks for sharing. Funny that compressing parquet still works. I didn't know that.
@DeathorGloryNow Жыл бұрын
@@robmulla Actually if you check the docs parquet files are snappy compressed by default. You have to explicitly say `compression=None` to not compress it. Snappy is the default because it adds very little time to read/write with modest compression and low CPU usage while still maintaining the very nice columnar properties (as you showed in the video). It is also the default for Spark. Other compressions like gzip get it smaller but at a much more significant cost to speed. I'm not sure this is still the case but in the past they also broke some of the nice properties because it is compressing the entire object.
@jeremynicoletti90605 ай бұрын
Thanks for sharing; I think I'll start using feather and parquet for some of my data needs.
@jamesafh994 ай бұрын
Thanks a lot! Loved the video, and it helped me with what I needed ❣️ Keep going with these videos. They are really worth it 🔥
@cristianmendozamaldonado3241 Жыл бұрын
I really love it man, thank you. You saved a life
@robmulla Жыл бұрын
Thanks! Maybe not saved a life, but saved a few minutes of compute time!
@gsm74908 ай бұрын
Parquet really saved me :) I have around one year of data, and each day is approx. 2 GB in CSV format. Parquet is both compact and fast, but you have to use filtering and load only the necessary columns "on demand".
@beethovennine2 жыл бұрын
Rob, you did it again...keep'em coming, good job!
@robmulla2 жыл бұрын
Thanks!
@FilippoGronchi2 жыл бұрын
Excellent as usual Rob...very very useful indeed
@robmulla2 жыл бұрын
Thank you sir!
@gustavoadolfosanchezhurtad1412 Жыл бұрын
Very clear and insightful explanation, thanks Rob, keep it up!
@robmulla Жыл бұрын
Thanks Gustavo. I’ll try my best.
@MatthiasBussonnier2 жыл бұрын
On the first pass, when you timeit the CSV writing, you time both generating the dataset and writing it to CSV. So your results are likely biased, since you only time the writing for the other formats. (Sure, it does not change the final message, I just want to point it out.) Also, you can use the -o flag of timeit to output the result to a variable, which can help you, for example, make a plot of the times.
@robmulla2 жыл бұрын
Good point about timing the dataframe generation. It should be negligible but fair to note. Also great tip on using -o. I didn't know about that! It looks like from the docs it writes the entire stdout, so it would need to be parsed. ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit Still a handy tip. Thanks!
@ozymet Жыл бұрын
Very good stuff. The essence of information.
@robmulla Жыл бұрын
Glad you liked it!
@ozymet Жыл бұрын
@@robmulla I saw few more videos, insta sub. Thank you. Glad to find you.
@humbertoluzoliveira Жыл бұрын
Hey Guy, nice job. Congratulations! Thanks for video.
@robmulla Жыл бұрын
Thanks for watching Humberto.
@bothuman-n4b Жыл бұрын
Hi Rob. I'm from Argentina, you are the best!!!
@truthgaming2296 Жыл бұрын
Thanks Rob, it helps a beginner like me a lot to realize there are weaknesses in the CSV format 😉
@niflungv10982 жыл бұрын
This is good to know. I'm going into web development now, so I usually use JSON format for serialization... I'm still new to python so I didn't know about parquet and feather. Thank you!
@robmulla2 жыл бұрын
Glad you found it helpful. Share it with anyone else you think would benefit!
@javiercmh Жыл бұрын
Very engaging and clear. Thanks!
@robmulla Жыл бұрын
Thanks for watching. 🙌
@arielspalter74252 жыл бұрын
Excellent tutorial Rob. Subscribed!
@robmulla2 жыл бұрын
Thanks so much for the feedback. Thanks for subscribing!
@DAN_1992 Жыл бұрын
Thanks a lot, just brought down my database backup size to MBs.
@robmulla Жыл бұрын
Glad it helped. That’s a huge improvement!
@chrisogonas Жыл бұрын
Great stuff! Thanks for sharing.
@robmulla Жыл бұрын
Glad you enjoyed it!
@chrisogonas Жыл бұрын
@@robmulla 👍
@Schmelon Жыл бұрын
interesting to learn the existence of parquet and feather files. nothing beats csv for portability and ease of use
@robmulla Жыл бұрын
Yea, for small/medium files CSV gets the job done.
@mr_easy10 ай бұрын
great comparison. What about HDF5 format? Is it in anyway better?
@JohnMitchellCalif Жыл бұрын
super clear and useful! Subscribed
@robmulla Жыл бұрын
Awesome, thank you!
@olucasharp Жыл бұрын
Huge thanks for sharing 🍀
@robmulla Жыл бұрын
Glad you liked it! Thanks for the comment.
@danieleingredy6108 Жыл бұрын
This blew my mind, duuude
@robmulla Жыл бұрын
Happy to hear that! Share with others so their minds can be blown too!
@marcosoliveira87312 жыл бұрын
I've learned a great deal with this video. Thank you!
@robmulla2 жыл бұрын
Thanks so much for the feedback. Glad you learned from it!
@MAKSIMILIN-h8e Жыл бұрын
Nice video. I'm going to rewrite the storage on the parquet
@robmulla Жыл бұрын
You should! Parquet is awesome.
@reasonableguy67062 жыл бұрын
Rob, You're a natural communicator (or you worked really hard at acquiring that skill) - most effective. I follow you on twitch and I'm currently going through your youtube content to come up to speed. Thanks for sharing your time and experience. Have you thought about aggregating your content into a book as a companion to your content - something like "Data Analysis Using Python/Pandas - No BS, Just Good Stuff" ?
@robmulla2 жыл бұрын
Hey. Thanks for the kind words. I’ve never considered myself a naturally good communicator and it’s a skill I’m still working on, but I appreciate your positive feedback. The book idea is great, maybe sometime in the future….
@pablodelucchi353 Жыл бұрын
Thanks Rob, awesome information! Learning a lot from your channel. Keep it up!
@robmulla Жыл бұрын
Isn’t learning fun?! Thanks for watching.
@krishnapullak Жыл бұрын
Good tips on speeding up large file read and write
@robmulla Жыл бұрын
Glad you liked it! Thanks for the feedback.
@MarcBenkert0012 жыл бұрын
Thanks, great comp. One thing about Parquet - it has some limitations in what chars column names can take, I spent quite some time renaming col names 1 year ago - perhaps that has fallen away by now.
@robmulla2 жыл бұрын
Good point! I've noticed this too. Definitely a limitation that makes it sometimes unusable. Thanks for watching!
@arpanpatel91912 жыл бұрын
Great video!! Small things matter the most. Thanks
@robmulla2 жыл бұрын
Absolutely! Thanks.
@anoopbhagat132 жыл бұрын
learnt something new today. Thank you Rob for this useful & informative video.
@robmulla2 жыл бұрын
Learn something new every day and before long you will be teaching others!
@pawarasiriwardhane3260 Жыл бұрын
This content is really awesome
@robmulla Жыл бұрын
Appreciate that!
@MrWyYu2 жыл бұрын
Great summary of data types. Thanks
@robmulla2 жыл бұрын
Thanks for the feedback! Glad you found it helpful.
@baharehbehrooziasl9517 Жыл бұрын
Great! Thank you for this very helpful video.
@robmulla Жыл бұрын
Glad it was helpful!
@casey7411 Жыл бұрын
Very informative video! Subscribed :)
@robmulla Жыл бұрын
Glad it helped! 🙏
@riessm Жыл бұрын
In addition to everything, parquet is the native file format to spark and can fully support spark‘s lazy computing (spark will only ever read the columns and rows that are needed for the desired output). If you ever prep really big data for spark, parquet is the way to go.
@robmulla Жыл бұрын
That’s a great point. Same with polars!
@riessm Жыл бұрын
@@robmulla Need to have a closer look at polars then! 🙂
@mint91214 ай бұрын
Great comparison, thanks
@rafaelnegreiros_analyst Жыл бұрын
Amazing. Congrats for the video
@robmulla Жыл бұрын
Glad you like the video. Thanks for watching.
@SergioBerlottoJr2 жыл бұрын
Awesome information! Thank you for this.
@robmulla2 жыл бұрын
Glad you liked it!
@steven7639 Жыл бұрын
Fantastic video
@robmulla Жыл бұрын
Fantastic comment. 😎
@vigneshwarselva9276 Жыл бұрын
Was very useful, thanks much
@robmulla Жыл бұрын
Thanks! Glad you learned something new.
@safsaf2k Жыл бұрын
This is excellent, thank you man
@robmulla Жыл бұрын
Glad it helped!
@CalSticks2 жыл бұрын
Really useful video - thanks. I was just searching for some Pandas videos for some light upskilling on the weekend, so this was a great find.
@robmulla2 жыл бұрын
Glad I could help! Check out my other videos on pandas too if you liked this one.
@againstthegrain59142 жыл бұрын
Hey this was very useful to me thank you for sharing!!
@robmulla2 жыл бұрын
So glad you found it useful.
@FranciscoPMatosJr Жыл бұрын
Try adding "Brotli" compression when creating the file. The file size drops considerably and reads are much faster. Example:
To save the file:
from pyarrow import csv, parquet
parse_options = csv.ParseOptions(delimiter=delimiter)
data_arrow = csv.read_csv(temp_file, parse_options=parse_options, read_options=csv.ReadOptions(autogenerate_column_names=autogenerate_column_names, encoding=encoding))
parquet.write_table(data_arrow, parquet_file + '.brotli', compression='BROTLI')
To read the file:
pd.read_parquet(file, engine='pyarrow')
@robmulla Жыл бұрын
Oh. Very cool I need to check that out.
@i-Mik Жыл бұрын
It's useful for me, thanks a lot!
@robmulla Жыл бұрын
Happy to hear that!
@hugoy1184 Жыл бұрын
Thank u very much for sharing such useful skills! 😉Subscribed!
@robmulla Жыл бұрын
Anytime! Glad you liked it.
@aaronsayeb65664 ай бұрын
the pickle format seems to be significantly faster (10x) than parquet in the final 5mil row test
@bkcy183 ай бұрын
Amazing content!
@ChrisHalden007 Жыл бұрын
Great video. Thanks
@robmulla Жыл бұрын
You are welcome!
@EVL624 Жыл бұрын
Very good and informative video
@robmulla Жыл бұрын
So nice of you. Thanks for the feedback.
@Zoltag002 жыл бұрын
Great video - It would have been good to at least mention the downsides to pickle and also the built in compatibility with zip files. Haven't come across feather before, will try it out
@robmulla2 жыл бұрын
Great point! I did forget to mention that pandas will auto-unzip. I still like parquet the best.
@Zoltag002 жыл бұрын
@@robmulla - Agreed, parquet has some serious benefits You know it also supports a compression option? Use it with gzip to see your parquet file get even smaller (and you only need to use it on write)
@danilzubarev295211 ай бұрын
Lol this video changed my life :D Thank you so much.
@yogiananta9674 Жыл бұрын
awesome ! thank you for this tutorial
@robmulla Жыл бұрын
You're very welcome! Share with a friend.
@huuquannguyen66882 жыл бұрын
I really hope you make a video about Data Cleaning in Python soon. Thanks a lot for all your awesome tutorials
@robmulla2 жыл бұрын
I'll try my best. Thanks for the feedback!
@sangrampattnaik744 Жыл бұрын
Very nice explanation. Can you compare Dask and PySpark ?
@JoeMcMullin9 ай бұрын
Great video and content.
@abhisekrana9039 ай бұрын
stumbled on to this awesome video and absolutely loved it. Just out of curiosity - what tool are you using for making Jupyter notebook with themes especially dark theme?
@robmulla9 ай бұрын
Glad you enjoyed the video. I have a different video that covers my jupyter setup including theme: kzbin.info/www/bejne/a6HJYZKYpbOVodk
@vladvol855 Жыл бұрын
Hello! Very interesting! Thank you! Can you please tell me is any limitation for a DF to save in parquet in terms of number of columns? Excel allow around 16-17k columns to save! Thank you for the answer!
@Patrick-hl1wp Жыл бұрын
super awesome tricks, thank you
@robmulla Жыл бұрын
Glad you like them! Thanks for watching.
@melanp4698 Жыл бұрын
12:28 "When your data set gets very large." - Me working with 800GB json files: :) Good video regardless, i might give them a test sometime.
@robmulla Жыл бұрын
Haha. It’s all relative. When your data can’t fit in local ram you need to start using things like spark.
@pele512 Жыл бұрын
Thanks for the great benchmark. In R / Python hybrid environment I sometimes use `csv.gz` or `tsv.gz` to address the size issue with CSV but retain the ability to quickly pipe these through line based processors. It would be interesting to see how gzipped flat files perform. I do agree that parquet/feather is a better way to go for many reasons, they are superior especially from the data engineering point of view.
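For reference, the gzipped-CSV approach can be sketched as follows. The data and file name are assumptions; the point is that the file stays readable both by pandas and by plain line-based tools:

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

gz_path = Path(tempfile.mkdtemp()) / "data.csv.gz"
df.to_csv(gz_path, index=False, compression="gzip")

# pandas infers gzip from the ".gz" extension when reading.
restored = pd.read_csv(gz_path)

# The same file can still be streamed line by line without pandas.
with gzip.open(gz_path, "rt") as fh:
    header = fh.readline().strip()
print(header)
```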
@robmulla Жыл бұрын
I do the same with gzipped CSV files. Good idea about making a comparison. I’ll add it to the list of potential future videos.
@Extremesarova2 жыл бұрын
Informative video! I've heard about feather and pickle, but never used them. I think I should give feather and parquet a try! I'd like to get some materials on machine learning and data science that are not introductory - something for middle and senior engineers :)
@robmulla2 жыл бұрын
Glad you found it useful. I’ll try to make some more ML videos in the near future.
@coopernik Жыл бұрын
I’m working on a little project and I have a csv file that’s 15GB. If I get what you’re telling me, I could turn it into a parquet file and save tons of memory space and time?
@codeman99-dev Жыл бұрын
Hey, just want to mention that when you wrote the pickle file to disk, you did so with no compression. While other formats have compression by default.
@robmulla Жыл бұрын
Good point. I guess it would be slower and smaller if compressed.
@franciscoborn2859 Жыл бұрын
You can compress parquet too with the compression parameter =D, and it is even smaller.
@user-hy1lm2rd9q Жыл бұрын
really good video! thank you
@엘더스크롤2 жыл бұрын
Parquet really is a revolution... It's the best data format: it drastically cuts storage size and greatly speeds up loading the data again later.
@robmulla2 жыл бұрын
I agree. Parquet is great!
@Andrew-ud3xl3 ай бұрын
I didn't know about reading just select columns in polars. I wanted to see how much bigger a 320 MB parquet file gets when converted to CSV and JSON: CSV was over 5 times larger and JSON 17.5 times.
@lorenzowinchler1743 Жыл бұрын
Nice video! Thank you. What about hdf5 format? Thanks!
@robmulla Жыл бұрын
Thanks! I haven’t used Hdf5 much but I’d be interested to hear how it compares.
@leonjbr Жыл бұрын
Hi Rob! I love your channel. It is very helpful. I would like to ask you a question: is HDF5 any better than all the options you showed in the video?
@robmulla Жыл бұрын
Good question. I didn't cover it because I thought it's an older, lesser used format.
@leonjbr Жыл бұрын
@@robmulla so the answer is no?
@robmulla Жыл бұрын
@@leonjbr The answer is - I don't know but probably not. 😁
@leonjbr Жыл бұрын
@@robmulla ok thanks.
@CoolerQ Жыл бұрын
I don't know about "better" but HDF5 is a very popular data format in science.
@LissetteBF Жыл бұрын
Interesting video! Thanks. I tried to compress several CSVs into one parquet file, but I had several problems with ISO 8601 datetimes with time zones. I couldn't change the format despite all my efforts, so I had to keep using CSV, since it had no problems converting with to_datetime. Any suggestions for compressing files without datetime problems? Thanks!
@robmulla Жыл бұрын
Oh yes! I've had this same problem and it can be really annoying. Have you made sure you updated apache arrow to the latest version? stackoverflow.com/questions/58854466/best-way-to-save-pandas-dataframe-to-parquet-with-date-type
@Levince36 Жыл бұрын
Thank you very much 😂, I got something totally new to me.
@robmulla Жыл бұрын
Happy to hear it.
@nirbhay_raghav2 жыл бұрын
Another awesome video. It has become my favorite channel. Only regret is that I found it too late. Small correction: it should be 0.3s and 0.08s for parquet files. You mistakenly wrote 0.3ms and 0.08ms while converting. Thanks.
@robmulla2 жыл бұрын
Appreciate that you are finding my videos helpful. Good catch on finding that typo!
@Jay-og6nj Жыл бұрын
i was going to comment that, but decided to check first, least should have caught that. Good video.
@ivangordiychuk55102 жыл бұрын
Thanks!!! It is really helpful for me!)
@robmulla2 жыл бұрын
Thanks Ivan. Glad you found it helpful. Please share it with anyone else you think might also learn from it.
@gregory8988 Жыл бұрын
Rob, could you explain how to add contents with internal links to paragraphs of the jupyter notebook?
@robmulla Жыл бұрын
I think you use something like this in the markdown: [section title](#section-title) check this link: stackoverflow.com/questions/28080066/how-to-reference-a-ipython-notebook-cell-in-markdown
@gregory8988 Жыл бұрын
@@robmulla clear explanation, thank you!
@malcolmanderson6735 Жыл бұрын
Hey Rob, great video. Only nit, around the 8 minute mark you called 308 ms ".3 ms" as opposed to ".3 seconds"
@robmulla Жыл бұрын
Oops! Good catch. Thanks for watching.
@nassarzinho Жыл бұрын
Hello Rob. Thanks for your videos. Rob, assuming a scenario where we don't have a machine with a lot of memory but a large dataset. Which lib can I use that will allow me to load and manipulate this dataset in memory? Do you have any videos discussing alternatives for this limitation? I'm thinking in the steps of eda until a ML modeling as example.
@robmulla Жыл бұрын
Yes! I have a lot of videos on the topic. Check my video on polars, pandas alternatives and data pipelines with pandas/pyspark/polars!
@nassarzinho Жыл бұрын
Tkx @@robmulla. I'll look for these videos. 👏🏽
@ruckydelmoro2500 Жыл бұрын
Can i do this for building text recognition? how to save and read the data if it's image?
@robmulla Жыл бұрын
Not sure what you mean. That's possible but not really related to this. There are other ways of saving text and images that would be better.
@DevenMistry367 Жыл бұрын
Hey Rob, this was a really nice video! Can you please make a tutorial where you try to write this data to a database? Maybe sqlite or postgres? And explain bottlenecks? (Optional: with or without using an ORM).
@robmulla Жыл бұрын
I was actually working on just this type of video and even looking at stuff like duckdb where you can write SQL on parquet files.
@dist3212 жыл бұрын
Great videos! Thank you for posting them. I wonder if feather is faster to read a >2G file.tsv than csv in chunks.
@robmulla2 жыл бұрын
Thanks for watching Ondina! I think it would depend on the data types within the >2G file. I think the only difference between tsv and csv is a comma ',' vs tab '\t' separator between values. Hope that helps.
@scottybridwell2 жыл бұрын
Nice video. How does the performance and storage size of parquet, feather compare to hdf/pytables?
@robmulla2 жыл бұрын
Great question. I have no idea! I need to learn more about how they compare.
@joeloug744 Жыл бұрын
Good tips, can you create a video about parallel processing
@robmulla Жыл бұрын
Thanks. That’s definitely on my todo list.
@crazymunna2397 Жыл бұрын
amazing info
@robmulla Жыл бұрын
Thanks!
@luketurner314 Жыл бұрын
3:34 another example: if the data is going to be read in via JavaScript (like on a website) then wouldn't JSON be the best option?
@robmulla Жыл бұрын
JSON has a lot of great use cases, and that is probably one of them. This video is more focused on large datasets to be processed in bulk. JSON is great for data that is less structured.