Working with Large Data Sets Made Easy: Understanding Pandas Data Types

31,087 views

ArjanCodes

A day ago

In this video, we'll show you how to use the Pandas library to make working with large datasets easy. You'll learn about the different data types that Pandas supports and see some examples of how to use them to optimize your memory usage.
Git repository: github.com/ArjanCodes/2023-pa...
🚀 Next-Level Python Skillshare Class: skl.sh/3ZQkUEN
👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
🎓 Courses:
The Software Designer Mindset: www.arjancodes.com/mindset
The Software Designer Mindset Team Packages: www.arjancodes.com/sas
The Software Architect Mindset: Pre-register now! www.arjancodes.com/architect
Next Level Python: Become a Python Expert: www.arjancodes.com/next-level...
The 30-Day Design Challenge: www.arjancodes.com/30ddc
🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
💬 Discord: discord.arjan.codes
🐦Twitter: / arjancodes
🌍LinkedIn: / arjancodes
🕵Facebook: / arjancodes
📱Instagram: / arjancodes
♪ Tiktok: / arjancodes
👀 Code reviewers:
- Yoriz
- Ryan Laursen
- James Dooley
- Dale Hagglund
🎥 Video edited by Mark Bacskai: / bacskaimark
💻 Code example by Henrique Branco: / henriqueajnb
🔖 Chapters:
0:00 Intro
0:54 About pandas
1:37 Types in pandas
3:57 Type conversion
4:30 Data type inference and conversion
11:50 Optimizing memory using categorical types
16:20 Outro
#arjancodes #softwaredesign #python
DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments: 66
@ArjanCodes 7 months ago
👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
@KLM1107 1 year ago
It's probably worth noting that pandas 2.0 (which is due to be released soon) adds support for an Apache Arrow backend, which brings it in line with Polars and massively reduces memory usage for non-numeric data types.
@RomualdMenuet 1 year ago
And it enables missing values for all types 😍 Until now, due to NumPy types, integers with missing values had to be cast to float, which is pretty annoying and inefficient.
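To illustrate the cast-to-float behaviour described above, here is a minimal sketch. It uses the pandas nullable `"Int64"` extension dtype as a stand-in, not the Arrow backend itself, and the sample values are made up for illustration.

```python
import pandas as pd

# Default NumPy-backed inference: the missing value forces a cast to float64.
with_numpy = pd.Series([1, 2, None])

# Nullable extension dtype (capital "I"): integers stay integers,
# and the gap is stored as a missing value instead of NaN.
with_nullable = pd.Series([1, 2, None], dtype="Int64")

print(with_numpy.dtype)
print(with_nullable.dtype)
print(with_nullable.isna().sum())
```

The same dtype string can be used with `astype("Int64")` on an existing column.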
@RomualdMenuet 1 year ago
Great video as always 👍 Two optimizations, if I may 🙏, as I know the library. Your final code reads the file and detects types twice, which isn't efficient. You could read it once and avoid type inference altogether:
1. You are loading the CSV twice: once with bad types, only to use the column names, and then again for the values, skipping the first two rows with `read_csv(..., skiprows=2)` to get better type inference. You could just pass `skiprows=[1]` and read it only once with the desired type inference. When given a list-like, read_csv skips only the rows at the provided indices (here, the second line). This loads the column names and the required values in one go, with the improved type inference (the first line, holding the column names, is excluded from type inference by default).
2. You can provide the types directly when reading the CSV by passing the dict to read_csv like this: `.read_csv(..., dtype=mapping_types_conversion)`. This prevents pandas from wasting memory and time trying to infer them. "category" is also a valid type for pandas, so you can add it to your dict and have read_csv build the desired dataframe from the start.
Parsing CSVs is pretty slow, and type inference can have a significant memory impact at load time, especially with big files. These optimizations will lower your memory footprint and can speed things up a lot.
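A minimal sketch of both suggestions combined, under stated assumptions: the CSV content, the column names, and the contents of `mapping_types_conversion` are hypothetical and only mimic a file whose second line would spoil type inference.

```python
import io

import pandas as pd

# Hypothetical file: line 0 holds the column names, line 1 is a descriptive
# row that would otherwise turn every column into generic strings.
CSV_TEXT = """zip_code,city,state,orders
postal code,city name,state code,order count
01037,sao paulo,SP,12
04534,sao paulo,SP,7
20021,rio de janeiro,RJ,3
"""

# Hypothetical mapping, standing in for the video's mapping_types_conversion.
mapping_types_conversion = {
    "zip_code": "string",
    "city": "category",
    "state": "category",
    "orders": "int32",
}

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    skiprows=[1],                    # skip only the second line (index 1)
    dtype=mapping_types_conversion,  # no type inference for these columns
)

print(df.dtypes)
```

The file is parsed once, and every column arrives with its target type.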
@gabrielmedeiros2277 1 year ago
Any Polars fans around here?
@ErikS- 1 year ago
Many of the gaps will be (or already are) solved in the next pandas release... My recommendation is to spend your time learning pandas + NumPy + SciPy rather than alternatives like Polars, Modin, Dask, Vaex... In the end, pandas will close the gap with these alternatives.
@martincontreras1521 1 year ago
@ErikS- I love pandas, but it is useful to have other tools. Polars is a nice library but still quite immature.
@mberlinger3 1 year ago
Need to find a use case!
@Victorinoeng 1 year ago
Worth mentioning that huge savings are possible for numeric data types as well, just by using a smaller byte representation for integers and floats. Usually, the min/max values are small enough to fit in 32 bits instead of the original 64-bit representation. int64 -> int32 = 50% less memory. But indeed, I am curious to see how the new backend in pandas 2.0 handles this.
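The 50% figure can be checked directly. This is a small sketch with made-up data; the range check against `np.iinfo` is an added precaution, not something from the comment.

```python
import numpy as np
import pandas as pd

# Made-up column of 100,000 integers stored as int64 (8 bytes each).
values = pd.Series(np.arange(100_000, dtype="int64"))

# Only shrink the type if every value fits in the 32-bit range.
limits = np.iinfo("int32")
fits_in_32_bits = limits.min <= values.min() and values.max() <= limits.max

smaller = values.astype("int32") if fits_in_32_bits else values

bytes_64 = values.memory_usage(index=False)
bytes_32 = smaller.memory_usage(index=False)
print(bytes_64, bytes_32)
```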
@iliasaarab7922 1 year ago
More pandas/NumPy stuff would be appreciated! What’s the difference between the pandas object and string representations?
@JeremyLangdon1 1 year ago
We have been using “pandera” to define our pandas schemas. It is great for quickly converting the inferred data types to the target types (“coerce”). The key feature is that it can validate a pandas dataframe against the pandera schema and issue a report of errors when the dataframe fails validation (either before or after coercion). It also supports checking against predefined constraints, rules, unique keys, etc., which significantly boosts confidence in both sourcing data (inputs) and transforming it to generate output datasets. Totally worth looking into!
@robbiebatley4348 1 year ago
Nice video! One minor thing that might be worth noting is that pandas converted the postcode to int, so the memory increase is for an int -> categorical conversion. There’s a chance categorical would still be more memory efficient than object.
@dokick2746 1 year ago
Pandas 2.0.0 is dropping in about 2 weeks with huge improvements
@fernandino1909 1 year ago
Please do one using pandas. Something like a small ML app to crunch numbers. Love your content
@exoticgolfholidays 1 year ago
👍 As a coding hobbyist, it was fascinating to watch pandas being used from the terminal and not Jupyter notebooks. Perhaps Arjan could consider a series of videos that actually builds a project?
@androiduser457 11 months ago
This is golden. I used to handle hundreds of GB of data, and the memory usage was massive. 64 GB of RAM didn't even cut it. Apparently I have to use Modin.
@hcubill 1 year ago
Thanks Arjan! I didn't know about this categorical data type. Thanks so much!
@ArjanCodes 1 year ago
Happy to help!
@azadnoorani7065 1 year ago
High-quality content! Amazing explanation. More pandas, please!
@ArjanCodes 1 year ago
More to come!
@sambroderick5156 1 year ago
Pandas is a bit messy. I switched to Pola-rs and haven’t looked back.
@scottmckaygibson879 1 year ago
Just a heads up: when you want to return a certain type using the .astype(int) function, you'll get a NumPy integer type such as np.int64, which many things outside of pandas don't like. You'll still have to do int(thing) to convert it back to a normal Python int if your use case needs a regular int outside of pandas.
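A short sketch of that gotcha, assuming made-up values and using `json.dumps` as one example of code outside pandas that rejects NumPy integers:

```python
import json

import pandas as pd

counts = pd.Series([1.0, 2.0, 3.0]).astype(int)

value = counts.iloc[0]
print(type(value))  # a NumPy integer type, not the built-in int

# The standard-library JSON encoder does not know NumPy scalars.
try:
    json.dumps({"count": value})
except TypeError as error:
    print(f"json.dumps failed: {error}")

# Converting back to a plain Python int solves it.
plain = int(value)
encoded = json.dumps({"count": plain})
print(encoded)
```

`value.item()` is an equivalent way to get the plain Python value.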
@barrykruyssen 1 year ago
Awesome. I've been using datasets for a while on very large sqlite3 reads, but not with categorical types. I'll have to see how that works. Thanks.
@kacperwodarczyk9349 1 year ago
Are there going to be more videos about FastAPI, or maybe a personal project? I would love to watch you code something from scratch.
@edgeeffect 11 months ago
That was very interesting. Another team (not mine, which was doing JS/PHP) where I used to work did a lot of pandas, so it's nice to get a grasp of what they were doing... It would be good now if you did a part two with some examples of what can be achieved with pandas once the data is efficiently loaded.
@casc4 1 year ago
Amazing job!! Great content!
@ArjanCodes 1 year ago
Thanks so much!
@pradeepgb986 6 months ago
Just a small tip! We can use the categorical type when we know that the column holds only a small number of distinct values (for example: Gender --> Male, Female). In these cases, the categorical type is ideal for saving memory.
@MinhVu-ym4tk 1 year ago
this video is all I need now :D thanks
@ArjanCodes 1 year ago
You are welcome!
@svensorgenfrey9859 1 year ago
Hey Arjan, I am a big fan of your videos. I find this one especially helpful.
@ArjanCodes 1 year ago
Thank you, Sven, glad you found it helpful!
@shaun_v 1 year ago
Great video. Very useful. Thank you.
@ArjanCodes 1 year ago
You are welcome! ❤
@acatch22 1 year ago
It would be interesting to see the difference in memory consumption vs. speed when specifying 32-bit vs. 64-bit. I constantly had slowdowns in NumPy code because it assigns 64-bit by default unless told otherwise, so when I used 32-bit data (e.g. from an image) it had to convert one of the operands to the other's dtype, which was costly for large arrays. So I always specify 32-bit unless I need the accuracy, e.g. for matrix calculations, SVD, or polynomial fitting.
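A small sketch of the dtype mismatch described above, with made-up arrays. It only shows which dtypes result; it does not measure the speed difference.

```python
import numpy as np

# Stand-in for 32-bit image data.
image = np.ones((4, 4), dtype=np.float32)

# np.ones defaults to float64 when no dtype is given.
weights = np.ones((4, 4))

# Mixed operands: the float32 array is promoted, and the result is float64.
mixed = image * weights

# Matching dtypes up front keeps everything in 32 bits.
matched = image * weights.astype(np.float32)

print(mixed.dtype, mixed.nbytes)
print(matched.dtype, matched.nbytes)
```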
@Micro-bit 1 year ago
But the pandas backend has switched to PyArrow (pandas 2.0), so please make a small update!!! :) I'm struggling with datatype changes in PyArrow (pandas 2).
@djangodeveloper07 1 year ago
Great tutorial, especially the last comparison part. What if the column names are dynamic? What is the proper way to handle that with 1M records and 20+ columns?
@aminebouaita9202 1 year ago
Excellent
@ArjanCodes 1 year ago
Thank you so much!
@dinoscheidt 1 year ago
That cable management in the cover image 😰😱
@alansnyder8448 7 months ago
Is there any opportunity to use Pydantic when importing data from CSV files to pandas? What about storing DataFrames vs CSV files?
@olo90 1 year ago
Another useful conversion is downcasting, e.g., going from int64 to int32 etc.
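One way to downcast without picking the target type by hand is `pd.to_numeric` with its `downcast` parameter. A minimal sketch with made-up values:

```python
import pandas as pd

counts = pd.Series([3, 120, 40_000], dtype="int64")

# "integer" picks the smallest signed integer type that holds all values.
signed = pd.to_numeric(counts, downcast="integer")

# "unsigned" does the same among unsigned types (values must be >= 0).
unsigned = pd.to_numeric(counts, downcast="unsigned")

print(counts.dtype, signed.dtype, unsigned.dtype)
```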
@dm_zemo 5 months ago
@ArjanCodes I’ve become obsessed with learning Python and your videos have been incredibly helpful on my journey. 🤓 I have a question/video suggestion: is there a way to run a Python script within a JavaScript app and have both dynamically update each other? Azgaar’s Fantasy Map Generator is an amazing open source JavaScript app and I want to “inject” my own Python apps into it and expand its functionality. My goal is to use and build upon the data the simulator already contains. I want to add modular Python systems that use the FMG data (CSVs, I think) as a starting point to simulate even more detailed systems (such as a Markov model simulating weather patterns using the biome data of map cells). One method I think I understand is exporting CSVs from FMG, importing them into Python, and using pandas (or something better) to parse it all. I believe there must be a better way to build this project. I’m struggling to understand how best to solve this problem. Any advice, or maybe a video featuring you modding Azgaar’s Fantasy Map Generator? lol
@pablogonzalez7959 1 year ago
I would love a similar video on NumPy. It is a library I use a lot, and I don't know how to use type hints with it, for example.
@ArjanCodes 1 year ago
Noted!
@Moncamonca 1 year ago
Just passing by to note that it is possible to specify data types in the read methods. It is always more efficient (it avoids the processing cycles spent on type inference) and sometimes it is almost necessary; e.g. in the zip codes column you would lose leading zeros if you let pandas read it as integers. You could always `str.zfill(5)` the strings later, but is that good design?
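The leading-zero problem is easy to reproduce. A minimal sketch with a made-up two-row CSV; the column names are hypothetical:

```python
import io

import pandas as pd

CSV_TEXT = "zip_code,city\n01037,sao paulo\n20021,rio de janeiro\n"

# Inferred types: the zip codes become integers and "01037" turns into 1037.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicit type: the column is kept as text, so the leading zero survives.
explicit = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"zip_code": str})

# The after-the-fact repair mentioned above, padding back to 5 characters.
repaired = inferred["zip_code"].astype(str).str.zfill(5)

print(inferred["zip_code"].tolist())
print(explicit["zip_code"].tolist())
print(repaired.tolist())
```

The repair only works when the intended width is known, which is one argument for declaring the type at read time.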
@Geza_Molnar_ 1 year ago
Just wanted to note the leading zeros of the zip codes, and then saw your comment :-)
@thelethalmoo 1 year ago
I'd love a rundown on debugging pandas. I always get lost in the debugger when working with pandas.
@RedShipsofSpainAgain 1 year ago
What are the general rules of when it makes sense to convert an "object" type column into a "categorical" type? From your example, the ZIP codes may not be the best to convert to categorical type, since there are so many ZIP codes, whereas the State column only has a few values, so it makes sense to convert it from object type to categorical. Are there heuristics to decide when to convert to categorical? Like if the num_unique values is
@Geza_Molnar_ 1 year ago
Compare the number of unique values with the total count of values in the column (or with the number of rows). Also, keep an eye on missing data such as empties, NULLs, Nones, N/As and the like, for a start (think about whether it is more useful for the use case to represent them as e.g. null or as a separate category). Then make the conversion and compare the memory usage with the original, as in the video. It's quick and beats the heuristics ;-)
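That measure-it-yourself approach can be sketched as follows. The column is made up (three state codes repeated many times), and the variable names are only illustrative.

```python
import pandas as pd

# Made-up low-cardinality column: 100,000 rows, 3 distinct values.
states = pd.Series(["SP", "RJ", "MG", "SP"] * 25_000)

# Share of distinct values; close to 0 suggests a good categorical candidate.
unique_ratio = states.nunique() / len(states)

# deep=True counts the string contents, not just the references to them.
as_strings = states.memory_usage(index=False, deep=True)
as_category = states.astype("category").memory_usage(index=False, deep=True)

print(f"unique ratio: {unique_ratio:.5f}")
print(f"strings: {as_strings} bytes, category: {as_category} bytes")
```

For a column where nearly every value is distinct, the same comparison can come out the other way.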
@rafiullah-zz1lf 1 year ago
When reading a CSV, a column of numeric values longer than 11 digits is read and saved incorrectly in exponent (scientific) notation. How can this be resolved when reading the CSV?
@MrLotrus 1 year ago
Great autocomplete when typing pip install pandas. What is it? Some oh-my-zsh settings or plugin?
@ArjanCodes 1 year ago
It’s called Fig. I’m going to cover my Mac setup in detail in a few weeks.
@ponnethajmal5751 1 year ago
Can you make a video about Python generic types?
@dversoza 1 year ago
Maybe you didn't notice, but when retrieving data from Olist you hovered over my city here in Brazil! Was that a signal?
@nuurnwui 1 year ago
Where can I give multiple thumbs up?
@gabrielgil2346 11 months ago
Boss
@BuFu1O1 1 year ago
oooooo, now we're talking
@arnoldwolfstein 1 year ago
idk why this channel turned into a data channel instead of a Python one.
@joelmamedov404 1 year ago
Way too many steps for data processing. Thankfully we have databases and SQL.
@ali-om4uv 1 year ago
None of this would be any different in a relational database. SQL performance is very dependent on types.
@Matt0x00 1 year ago
If you are not giving types the same consideration when building a database, you should not be touching a database.
@signoc1964 7 months ago
Polars is the fast in-between. It's kind of like a merge between SQL and pandas: it reads more like SQL, but you can do things in a pandas sort of way, and knowing pandas may help you understand how it works. Using Polars is a lot easier than working with SQL in many ways. For example, you can select columns dynamically by type or regex, it is easier to pivot/unpivot, and you can build expressions from functions, which makes repetitive code easier to read than in pure SQL (since installing a lot of single-purpose functions in a database gets messy after a while).