Speed Up Data Processing with Apache Parquet in Python

9,639 views

NeuralNine

1 day ago

Comments: 19
@ДаниилИмани
@ДаниилИмани 15 days ago
00:30 - what is a parquet file
01:36 - installing pyarrow; downloading data (NYC taxi dataset)
02:52 - reading parquet file; exporting to csv
04:38 - benchmarking reading time for parquet vs csv
06:30 - file size comparison for parquet vs csv; getsizeof function from sys module
08:12 - loading only specific columns of parquet file
08:58 - benchmarking reading specific columns for parquet vs csv
@chndrl5649
@chndrl5649 1 year ago
The reason for the memory difference between the two dataframes is the datatypes. CSV stores everything as text, so pandas often ends up with string columns, which take much more space than numeric datatypes.
@islam9212
@islam9212 1 year ago
It hurt my eyes when I saw the calculator even though a Python console exists. For a future video, it would be interesting to include a comparison with the pickle, feather, and jay formats.
@tb9359
@tb9359 1 year ago
Had never heard of Parquet. Thank you. It looks very useful.
@jeremiahhauser7148
@jeremiahhauser7148 1 year ago
Interesting, but I am not convinced. If I got it correctly, when selecting columns the time went down by a factor of 3 for both methods (4 s -> 1.3 s and 0.24 s -> 0.08 s). So parquet is better anyway, but whether it is specifically better for column-wise access still needs to be demonstrated. Like the other commenter, I would also be interested in a broader comparison with other formats. Great channel, keep up the good work.
@JLSXMK8
@JLSXMK8 1 year ago
I have a related question: Since parquet files are "column-oriented", do you think they would be a good way to store database backups? Example scenario: Let's say you want to store a database backup, assuming the data in the database is in a stable state; it contains a large number of product records: their IDs, descriptions, how many purchases per product, the product prices, etc. Would it be a good idea to store a backup of this database in a parquet file, since the backups would be faster to load if the data becomes unstable via a transaction in the future? You could roll back the transactions too; however, what if too many of them fail and all of them need to be rolled back?
@KingOfAllJackals
@KingOfAllJackals 1 year ago
Parquet isn’t a generic file format. It IS a table, so you don’t “store backups” in a Parquet file. I guess you could back up each table independently, but nearly every real DB has much more efficient and powerful native backup infrastructure. Parquet, however, is where a lot of transactional data ends up for analytics. Columnar storage is more suited to large analytic workloads. Row stores are more suited to OLTP workloads. You would never want to use Parquet for things like “deduct $7.83 from customer 1234’s checking account”.
@JLSXMK8
@JLSXMK8 1 year ago
@@KingOfAllJackals That is exactly what I thought of possibly using it for; I could use it to back up tables in the database. You did interpret that correctly. I would NOT edit the contents of the parquet backups.
@dana-pw3us
@dana-pw3us 10 months ago
Why not compare the sizes of the files on disk? Are they different?
@Gabriel-cf3bw
@Gabriel-cf3bw 1 year ago
Nice tutorial! Very introductory!
@slothner943
@slothner943 1 year ago
I usually go for the feather format. Never understood the difference - just that for me and the data I'm handling (few columns), feather seems to be quicker.
@JeremyLangdon1
@JeremyLangdon1 1 year ago
I think pandas tries to infer data types from CSV and often defaults to string. That takes much more space and CPU. Parquet has data types built into the file, so pandas does not need to infer anything. What would be more interesting is, when reading the CSV, to specify the data types to make it a more “even” comparison.
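A fairer CSV read along those lines might look like this (a sketch; the column names are hypothetical NYC-taxi-style names, not taken from the video):

```python
import pandas as pd

# Write a small CSV to read back (stand-in data)
pd.DataFrame({
    "VendorID": [1, 2],
    "fare_amount": [9.5, 14.0],
    "tpep_pickup_datetime": ["2024-01-01 08:00:00", "2024-01-01 09:30:00"],
}).to_csv("trips.csv", index=False)

# Spell out the dtypes instead of letting pandas infer them
df = pd.read_csv(
    "trips.csv",
    dtype={"VendorID": "int32", "fare_amount": "float32"},
    parse_dates=["tpep_pickup_datetime"],
)
```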
@multitaskprueba1
@multitaskprueba1 7 months ago
You are a genius! Fantastic video! Thanks!
@N0rberK
@N0rberK 1 year ago
Tnx Capt.
@farshidzamanirad9691
@farshidzamanirad9691 1 year ago
Awesome!
@julianreichelt1719
@julianreichelt1719 1 year ago
nice
@codewithmajid4841
@codewithmajid4841 1 year ago
I am Junior data scientist From Pakistan
@codewithmajid4841
@codewithmajid4841 1 year ago
ok Boss