
Speed Up Data Processing with Apache Parquet in Python

8,570 views

NeuralNine


Comments: 18
@islam9212 10 months ago
It hurt my eyes to see the calculator used when a Python console was right there. For a future video, it would be interesting to include a comparison with the pickle, Feather, and Jay formats.
@chndrl5649 9 months ago
The memory usage differs between the two DataFrames because of the data types. A CSV stores everything as text, so columns that pandas cannot infer as numeric come back as strings, which take much more memory than numeric types.
@tb9359 10 months ago
Had never heard of Parquet. Thank you. It looks very useful.
@jeremiahhauser7148 9 months ago
Interesting, but I am not convinced. If I understood correctly, selecting columns cut the time by a factor of about 3 for both methods (4 s -> 1.3 s for CSV and 0.24 s -> 0.08 s for Parquet). So Parquet is faster either way, but whether it is specifically better for column-wise access still needs to be demonstrated. Like the other commenter, I would also be interested in a broader comparison with other formats. Great channel, keep up the good work.
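The ratios behind this comment can be checked with a few lines of arithmetic. This is only a sketch using the four timings quoted above; the variable names are made up for illustration, and no new measurements are involved.

```python
# Timings quoted in the comment (seconds): reading all columns vs. selected columns.
csv_all, csv_subset = 4.0, 1.3
parquet_all, parquet_subset = 0.24, 0.08

# Speedup each format gets from selecting columns.
csv_speedup = csv_all / csv_subset              # roughly 3.1x
parquet_speedup = parquet_all / parquet_subset  # roughly 3.0x

# Parquet's advantage over CSV, with and without column selection.
advantage_all = csv_all / parquet_all           # roughly 16.7x
advantage_subset = csv_subset / parquet_subset  # roughly 16.3x

print(f"CSV speedup from selecting columns:     {csv_speedup:.1f}x")
print(f"Parquet speedup from selecting columns: {parquet_speedup:.1f}x")
print(f"Parquet vs CSV, all columns:            {advantage_all:.1f}x")
print(f"Parquet vs CSV, selected columns:       {advantage_subset:.1f}x")
```

Both formats gain about the same factor from column selection, and Parquet's lead over CSV stays near 16x in both cases, which is the commenter's point: these numbers alone do not show an extra column-wise benefit.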
@multitaskprueba1 4 months ago
You are a genius! Fantastic video! Thanks!
@dana-pw3us 6 months ago
Why not compare the sizes of the files on disk? Are they different?
@JeremyLangdon1 9 months ago
I think pandas tries to infer data types from the CSV and often defaults to string, which takes much more space and CPU. Parquet has the data types built into the file, so pandas does not need to infer anything. It would be more interesting to specify the data types when reading the CSV, to make it a more “even” comparison.
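A minimal sketch of what declaring the types at read time could look like, assuming pandas is installed. The sample data, column names, and chosen dtypes are hypothetical and are not taken from the video.

```python
import io

import pandas as pd

# Hypothetical sample data: 1,000 rows with an integer ID and a repeated text label.
labels = ("red", "green", "blue")
lines = ["id,label"] + [f"{i},{labels[i % 3]}" for i in range(1000)]
csv_text = "\n".join(lines)

# Default read: pandas infers the types, and the label column becomes a string column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Typed read: the dtypes are declared up front instead of being inferred.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "label": "category"},
)

# deep=True counts the memory held by the string objects themselves.
inferred_bytes = int(inferred.memory_usage(deep=True).sum())
typed_bytes = int(typed.memory_usage(deep=True).sum())

print(f"inferred dtypes: {inferred_bytes} bytes")
print(f"declared dtypes: {typed_bytes} bytes")
```

With declared dtypes, the CSV-backed DataFrame should use noticeably less memory, which would make the memory side of a CSV vs. Parquet comparison fairer. Parsing text still costs time, so this does not by itself equalize read speed.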
@Gabriel-cf3bw 9 months ago
Nice tutorial! Very introductory!
@slothner943 9 months ago
I usually go for the Feather format. I never understood the difference; it's just that for me and the data I'm handling (few columns), Feather seems to be quicker.
@JLSXMK8 10 months ago
I have a related question: since Parquet files are "column-oriented", do you think they would be a good way to store database backups? Example scenario: say you want to back up a database whose data is in a stable state. It contains a large number of product records: their IDs, descriptions, number of purchases per product, prices, etc. Would it be a good idea to store the backup in a Parquet file, since it would be faster to load if a future transaction left the data in an unstable state? You could roll back the transactions too; however, what if too many of them fail and all of them need to be rolled back?
@KingOfAllJackals 9 months ago
Parquet isn’t a generic file format. It IS a table, so you don’t “store backups” in a Parquet file. I guess you could back up each table independently, but nearly every real DB has much more efficient and powerful native backup infrastructure. Parquet, however, is where a lot of transactional data ends up for analytics. Columnar storage is better suited to large analytic workloads; row stores are better suited to OLTP workloads. You would never want to use Parquet for things like “deduct $7.83 from customer 1234’s checking account”.
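The row-store vs. column-store distinction can be illustrated with plain Python containers. This is a toy sketch, not how Parquet or any database is implemented; the product fields follow the question above, and the values are invented.

```python
# Hypothetical product table in a row-oriented layout: one dict per record.
row_store = [
    {"id": 1, "description": "pen", "purchases": 40, "price": 1.50},
    {"id": 2, "description": "notebook", "purchases": 25, "price": 3.00},
    {"id": 3, "description": "stapler", "purchases": 10, "price": 7.25},
]

# The same table in a column-oriented layout: one list per column.
# Values at the same position across the lists belong to the same record.
column_store = {key: [row[key] for row in row_store] for key in row_store[0]}

# Analytic query (sum one column): the columnar layout reads a single list
# and never touches the IDs, descriptions, or prices.
total_purchases = sum(column_store["purchases"])

# Transactional update (change one record): the row layout touches a single
# dict, whereas a columnar layout would need a positional write per column.
row_store[1]["price"] = 2.75

print(total_purchases)
print(row_store[1])
```

Scanning one column across many records favors the columnar layout, while reading or changing a whole record favors the row layout, which matches the analytics vs. OLTP split described in the comment.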
@JLSXMK8 9 months ago
@KingOfAllJackals That is exactly what I was thinking of using it for: backing up tables in the database. You interpreted that correctly. I would NOT edit the contents of the Parquet backups.
@N0rberK 10 months ago
Tnx Capt.
@julianreichelt1719 9 months ago
nice
@farshidzamanirad9691 10 months ago
Awesome!
@codewithmajid4841 10 months ago
ok Boss
@codewithmajid4841 10 months ago
I am a junior data scientist from Pakistan.