Intro to Python Dask: Easy Big Data Analytics with Pandas!

Views: 14,642

Bryan Cafferky

Days ago

Comments: 22

@atanu4321 2 years ago

Thanks for the awesome introduction to Python Dask.

@BryanCafferky 2 years ago

YW. Thanks for watching.

@Septumsempra8818 2 years ago

The Champ Cafferky!

@arturkunz 2 years ago

15:30 The rounding difference is probably because you use the full dataset in ddf and only part of it in pdf. Great introduction to Dask! At the moment I am only working with NumPy for data engineering (deep learning with images). Would you say it makes sense to save images to pandas DataFrames? It would probably make a lot of things easier and, by using Dask, even faster because of the parallelization.

@BryanCafferky 2 years ago

It may make it easier depending on what you are trying to do. Seems like keeping the image attributes with the image might make it easier to use both together.

@sawantamang2069 2 years ago

How reliable is it to use in production for data ingestion?

@KOMPAJAM 2 years ago

When do you realize you have to leverage Dask on a DataFrame? What error message would you get?

@BryanCafferky 2 years ago

When your dataframes are taking a significant chunk of available memory, it's good to think of trying Dask. You do get a memory error, though. See this blog post: towardsdatascience.com/how-to-avoid-memory-errors-with-pandas-22366e1371b1
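
One way to check how much memory a dataframe is taking is pandas' own `memory_usage`. The snippet below is only a minimal sketch: the example frame and the 8 GiB budget are assumptions made up for illustration, not figures from the video, and what counts as a "significant chunk" is left to your own judgment.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; substitute your own DataFrame here.
df = pd.DataFrame({
    "id": np.arange(100_000),
    "label": ["row-" + str(i) for i in range(100_000)],
})

# deep=True also counts the string data held in the text column,
# not just the fixed-width numeric columns.
frame_bytes = int(df.memory_usage(deep=True).sum())

# Assumed budget: treat 8 GiB as the RAM available to this process.
ram_budget_bytes = 8 * 1024**3
share = frame_bytes / ram_budget_bytes

print(f"DataFrame uses {frame_bytes / 1024**2:.1f} MiB "
      f"({share:.2%} of the assumed budget)")
```

Keep in mind that operations such as joins, sorts, and groupbys can need several times the frame's own size as working memory, so the share reported here is a lower bound on what pandas will actually use.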

@KOMPAJAM 2 years ago

@BryanCafferky Thanks Bryan!

@jamiew3986 2 years ago

Thanks for your video! I've recently been exploring Dask and realized that it only has read_sql_table, but no read_sql_query function. I used to read SQL queries into Python using pyodbc/sqlalchemy, but it looks like that's not possible with Dask.

@BryanCafferky 2 years ago

Yeah. I mean you could use pandas with SQL but that would not scale. You could create a view in the SQL database and query that with read_sql_table. It is a limitation. I think Spark wins on that one.

@jamiew3986 2 years ago

@BryanCafferky Thanks. I do have another question, maybe related to Dask or Jupyter Notebook. The data I am currently working on has over 50MM rows. It takes a long time to read in via pyodbc even though I read in chunks. I then converted it to a Dask dataframe, hoping it would speed up the data manipulations. I called .persist() after adding 3 more columns, and it would run for 30 minutes until I got a memory error. I am using my company desktop (96GB available, 12 processors), so I'm surprised it's taking that much memory. Groupby seems to take a long time as well. I tried R's data.table and haven't had a memory error yet. Have you encountered the same situation before, or do you have any guess as to what could be causing this issue?

@BryanCafferky 2 years ago

@jamiew3986 If possible, maybe create a table or view on the source database that limits the columns and data to what you need. Also, trim large columns, for example by using substring to take only part of a string. Then read this in and write it directly to a parquet file. Once that has been done, load the parquet file. Not sure why it's running out of memory, but you don't have scale-out on a single machine. This would be easier in a cloud environment. Here are some links that may help:

docs.dask.org/en/stable/dataframe-sql.html

Measuring memory usage: docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.memory_usage.html

A blog use case with detailed info on optimization: blog.dask.org/2021/03/11/dask_memory_usage#:~:text=When%20possible%2C%20you%20can%20fine,wasting%20RAM%20or%20CPU%20cores.

Hope that helps.

@I677000 2 years ago

Do I need to know Python before taking this course?

@BryanCafferky 2 years ago

Yes. pandas is a Python library so this video is not useful if you don't know Python pandas.

@I677000 2 years ago

@BryanCafferky I guess I need to finish a Python course first 😅

@BryanCafferky 2 years ago

@I677000 Focus on pandas. You don't need to become an expert on all of Python. The book Python for Data Analysis by Wes McKinney is a good one.

@ericxls93 1 year ago

Great video as usual, thank you. But after about a week of hammering the subject, I could not load a data table from an Azure SQL database ☹️… back to pandas… (having to do loops to deal with the memory limits 🤦‍♂️)

@BryanCafferky 1 year ago

Hmmm..Ok. Sorry to hear that. For Azure SQL, Databricks might be a better option.

@ericxls93 1 year ago

@BryanCafferky I would love to use Databricks, but the system is far too mature to change it. I contacted Microsoft, and it appears my issue has to do with the version of SQLAlchemy... It only works with version 1.4 and below.