15:30 the rounding difference is probably because you use the full dataset in ddf and only part of it in pdf. Great introduction to Dask! At the moment I'm only working with NumPy for data engineering (deep learning with images). Would you say it makes sense to save images to pandas DataFrames? It would probably make a lot of things easier, and with Dask it could even be fast thanks to parallelization.
@BryanCafferky 2 years ago
It may make it easier depending on what you are trying to do. Seems like keeping the image attributes with the image might make it easier to use both together.
@sawantamang2069 2 years ago
How reliable is it to use in production for data ingestion?
@KOMPAJAM 2 years ago
When do you realize you have to leverage Dask on a DataFrame? What error message would you get?
@BryanCafferky 2 years ago
When your DataFrames are taking a significant chunk of available memory, it's good to think about trying Dask. You do eventually get an error, though. See this blog: towardsdatascience.com/how-to-avoid-memory-errors-with-pandas-22366e1371b1
@KOMPAJAM 2 years ago
@@BryanCafferky Thanks Bryan!
@jamiew3986 2 years ago
Thanks for your video! I've recently been exploring Dask and realized that it only has read_sql_table, but no read_sql_query function. I used to read SQL queries into Python using pyodbc/SQLAlchemy, but it looks like that's not possible with Dask.
@BryanCafferky 2 years ago
Yeah. I mean, you could use pandas with SQL, but that wouldn't scale. You could create a view in the SQL database and query that with read_sql_table. It is a limitation; I think Spark wins on that one.
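A sketch of that view workaround, using a throwaway SQLite file so it's self-contained. The table and view names are hypothetical; in practice the connection string would point at your real database.

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd

# Stand-in source table in a local SQLite file.
con = sqlite3.connect("demo.db")
pd.DataFrame({"id": [1, 2, 3], "amt": [10.0, 20.0, 5.0]}).to_sql(
    "sales", con, index=False, if_exists="replace")

# The "query" lives in the database as a view...
con.execute("DROP VIEW IF EXISTS big_sales")
con.execute(
    "CREATE VIEW big_sales AS SELECT id, amt FROM sales WHERE amt >= 10")
con.commit()
con.close()

# ...so Dask can read it as if it were a table.
# index_col is required: Dask uses it to partition the reads.
ddf = dd.read_sql_table("big_sales", "sqlite:///demo.db", index_col="id")
print(len(ddf))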
@jamiew3986 2 years ago
@@BryanCafferky Thanks. I have another question, maybe related to Dask or Jupyter notebooks. The data I'm currently working on has over 50MM rows. It takes a long time to read in via pyodbc even though I read in chunks. I then converted to a Dask DataFrame, hoping it would speed up the data manipulations. I called .persist() after adding 3 more columns, and it ran for 30 minutes until I got a memory error. I'm using my company desktop (96GB available, 12 processors), so I'm surprised it's taking that much memory. Groupby seems to take a long time as well. I tried R's data.table and haven't had a memory error yet. Have you encountered this situation before, or do you have any guess as to what could be causing this issue?
@BryanCafferky 2 years ago
@@jamiew3986 If possible, create a table or view on the source database that limits the columns and rows to what you need. Also trim large columns, e.g., use substring to keep just part of a string. Then read this in and write it directly to a Parquet file. Once that's done, load the Parquet file. I'm not sure why it's running out of memory, but you don't have scale-out; this would be easier in a cloud environment. Here are some links that may help:
docs.dask.org/en/stable/dataframe-sql.html
Measuring memory usage: docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.memory_usage.html
A blog use case with detailed info on optimization: blog.dask.org/2021/03/11/dask_memory_usage
Hope that helps.
@I677000 2 years ago
Do I need to know Python prior to this course?
@BryanCafferky 2 years ago
Yes. pandas is a Python library, so this video won't be useful if you don't know Python and pandas.
@I677000 2 years ago
@@BryanCafferky I guess I need to finish a Python course first 😅
@BryanCafferky 2 years ago
@@I677000 Focus on pandas. You don't need to become an expert on all of Python. The book Python for Data Analysis by Wes McKinney is a good one.
@ericxls93 1 year ago
Great video as usual, thank you. But after about a week of hammering at the subject, I could not load a data table from an Azure SQL database ☹️… back to pandas (and looping to deal with the memory limits 🤦♂️).
@BryanCafferky 1 year ago
Hmm, OK. Sorry to hear that. For Azure SQL, Databricks might be a better option.
@ericxls93 1 year ago
@@BryanCafferky I would love to use Databricks, but the system is far too mature to change. I contacted Microsoft, and it appears my issue has to do with the version of SQLAlchemy: it only works with version 1.4 and below.
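For reference, if the conflict really is with SQLAlchemy 2.x, pinning the 1.4 line in the environment is one way to test that theory (this assumes pip; a conda or Poetry setup would use its own pin syntax):

```shell
# Install Dask's dataframe extras alongside the last 1.4.x SQLAlchemy.
pip install "dask[dataframe]" "sqlalchemy<2.0"
```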
@ButchCassidyAndSundanceKid 1 year ago
Is Dask better than Polars?
@BryanCafferky 1 year ago
Dask is more like Apache Spark, i.e., it scales out. Polars is a single-machine DataFrame library for data science.