Azure Databricks using Python with PySpark

77,029 views

Bryan Cafferky

A day ago

Learn how to use Python on Spark with the PySpark module in the Azure Databricks environment. Basic concepts are covered, followed by an extensive demonstration in a Databricks notebook. Bring your popcorn!
Notebook at: github.com/bcafferky/shared/t...

Comments: 95
@shaibalbose9831 · 6 months ago
An amazing presentation like this is only possible from someone with deep understanding and clarity - with the bonus of excellent communication skills ... thank you!!
@BryanCafferky · 6 months ago
You're welcome!
@thehumbleone1936 · 24 days ago
Even though it's been 5 years since he published this, it is still jam-packed with knowledge on how to work with PySpark. Thank you Bryan!
@BryanCafferky · 19 days ago
Thank you!
@christianlira1259 · 4 years ago
Two excellent Azure Databricks videos, Bryan, and thank you for taking the time to share your knowledge.
@suhasreddybondugula3210 · 3 years ago
Really helped me to understand PySpark as a beginner. Hoping to see videos on real-time and streaming data. Thanks and keep sharing your wonderful knowledge Bryan.
@digwijoymandal8662 · 4 years ago
When I first noted this video, I never knew that I would be watching it till the end. But I took my time and watched it all the way through, and it took me 2 days as I practiced all along. It's totally worth it. Keep sharing your knowledge. Cheers!
@BryanCafferky · 4 years ago
Wow! That's awesome! Thanks for sharing that.
@SuperGnarley · 4 years ago
Sir Cafferky! Thanks to your generous brilliance and my YouTube search skills, my day is made! Thank you so much for the information.
@BryanCafferky · 4 years ago
Glad you found it. I'm writing a book on Azure Databricks that will be out soon!
@SurenderSingh-rn9tp · 4 years ago
Really great explanation. Totally worth spending 2-3 hours to watch the video and understand all the concepts in detail. Thanks @Bryan Cafferky
@SIVERITOO · 4 years ago
I really had to log in just to like and subscribe. Your explanations are awesomely straight to the point with no time wasted, really excellent.
@BryanCafferky · 4 years ago
Thanks!
@SaurabRao · 3 years ago
Was really glad when you said 'highly recommend you don't restrict yourself to Python' in a video which deep dives into Python with PySpark! A really good video.
@BryanCafferky · 3 years ago
Thanks. If you have not seen it, I have an in-depth series in progress. kzbin.info/www/bejne/iXO3p32LZ9t4pcU
@SaurabRao · 3 years ago
@@BryanCafferky yes :)
@umuttekakca6958 · 3 years ago
Could not have been showcased more nicely and concisely.
@BryanCafferky · 3 years ago
Thank you!
@saavipihu6381 · 4 years ago
Nice tutorial, very well explained, thanks Bryan !!
@mithileshsanam9561 · 2 years ago
26:36 Minor correction in the code: df.selectExpr() takes column names the way SQL would, so if we have spaces in column names, it won't pick up the actual values. Instead, use df.withColumnRenamed().
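For anyone following along, here is a minimal sketch of that rename approach, assuming the video's sdf_diabetes dataframe and its 'blood pressure' column (this is not code from the notebook itself):
# Rename the column once, then reference it normally everywhere else
sdf_renamed = sdf_diabetes.withColumnRenamed("blood pressure", "bloodpressure")
sdf_renamed.select("bloodpressure").show(5)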
@saideepkaranam8783 · 3 years ago
Link to the dataset used in the video: raw.githubusercontent.com/AvisekSukul/Regression_Diabetes/master/Custom%20Diabetes%20Dataset.csv. You can download the CSV from Bryan's GitHub too, FYI. Cheers!
@christophersly8448 · 4 years ago
Fantastic explanation Bryan!
@krzysztofprzysowa9284 · 5 years ago
Great tutorial as always!
@stateside_story · 4 years ago
Really great tutorial ... Thank you Bryan !
@techsteering · 3 years ago
Thanks for this amazing video. Exactly what I was looking for.
@balanm8570 · 5 years ago
Awesome tutorial. Liked it very much.
@pradeepnagaraj7347 · 4 years ago
Excellent Bryan, Thanks!
@vajikaakbar9107 · 2 years ago
Really helped me, thank you so much. Keep sharing your knowledge.
@BryanCafferky · 2 years ago
Glad to help.
@zidu2010 · 2 years ago
Super helpful! Thanks a lot!
@saurinpatel2507 · 4 years ago
Very good video. It would be awesome if you could create a similar video just for ML.
@rdawson3648 · 3 years ago
Excellent! Thank you very much...
@BryanCafferky · 3 years ago
You're welcome.
@vaibhavrana4953 · 10 months ago
Very good tutorial.
@nabilaabraham9503 · 4 years ago
Hi Bryan! Great tutorial! At 26:44, when you rename columns like 'blood pressure' to 'bloodpressure', the actual data doesn't get copied over. It looks like the new column 'bloodpressure' is just populated with the string 'bloodpressure' over and over again. That's not supposed to happen, right? The same thing happened when I used your syntax to copy columns with the SQL statements. Could you please advise on how to actually copy the data over?
@nabilaabraham9503 · 4 years ago
Backticks do the trick! For anyone wondering, use `blood pressure` to reference a column heading with a space in it.
@BryanCafferky · 4 years ago
@@nabilaabraham9503 Yeah! Doh! Missed that. Just retried this with sdf_diabetes.selectExpr("`blood pressure` as bp").show() and it works fine. Thanks for asking!
@ranmax123 · 5 years ago
Thanks Bryan, great video. There are a couple of issues in the demo. When you do sdf.selectExpr, the output changes: the values in the columns with spaces change. The same thing happens when you use sdf.filter.sort on the 'blood pressure' column; the values in the blood pressure column become all 0. Is this something you observed?
@BryanCafferky · 4 years ago
I have not. I'll have to go back and try that again. Thanks.
@navinsenguttuvan4037 · 3 years ago
Loved this lecture Bryan! I'm curious to know, given that the Spark engine optimizes the SQL code, is it a good idea to use Python UDFs for processing at all?
@BryanCafferky · 3 years ago
If Apache Arrow is enabled, Python UDFs can perform well. However, if SQL can do what you need, I would use SQL.
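A minimal sketch of what an Arrow-backed (vectorized) Python UDF looks like, assuming the demo's sdf_diabetes dataframe; the function name and conversion are illustrative, not from the video:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

@pandas_udf(DoubleType())
def mmhg_to_kpa(bp: pd.Series) -> pd.Series:
    # With Arrow enabled, each partition arrives as a pandas Series, so the math runs vectorized
    return bp * 0.1333

sdf_diabetes.withColumn("bp_kpa", mmhg_to_kpa("blood pressure")).show(5)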
@navinsenguttuvan4037 · 3 years ago
@@BryanCafferky Got it. Also interested to know your comments on using the DataFrame API vs SQL for performance?
@BryanCafferky · 3 years ago
@@navinsenguttuvan4037 The Spark Dataframe and Spark SQL are based on the same code and both go through the Catalyst optimizer so, in general, they should be equal in performance. For UDFs, like Python user-defined functions, make sure Apache Arrow is enabled.
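You can see this for yourself by running the same query both ways and comparing plans; a hedged sketch, assuming age and glucose columns in the demo dataset:
sdf_diabetes.createOrReplaceTempView("diabetes")
via_sql = spark.sql("SELECT age, AVG(glucose) AS avg_glucose FROM diabetes GROUP BY age")
via_api = sdf_diabetes.groupBy("age").agg({"glucose": "avg"})
via_sql.explain()   # both explain() outputs should show the same optimized plan
via_api.explain()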
@navinsenguttuvan4037 · 3 years ago
@@BryanCafferky I understand. Thank you Bryan!
@amusicated · 4 years ago
I must say this video is very, very thorough. I searched quite a bit to find the notebook you're using. Would I be able to get it from you somehow?
@BryanCafferky · 4 years ago
You can find the notebook in my GitHub repo. It is the .dbc file. You need to import it into Databricks. github.com/bcafferky/shared/blob/master/AzureDatabricksPython/AzureDatabricksPython.zip
@DarthBuLB · 4 years ago
Hi, how do I mount two Azure Storage (blob) accounts and copy a file from one mount to another using Python (shutil)? I am not using dbutils since Databricks is still in preview.
@BryanCafferky · 4 years ago
Azure Databricks is not in preview. It's GA. However, Azure Data Factory is a better option to copy files.
@krish_telugu · 4 years ago
Hi Bryan: How do I import unstructured data into DBFS? It always makes us convert it into a table and stores it in /Filestore/tables. Is there any way to load JSON or XML files which cannot be loaded as a table?
@BryanCafferky · 4 years ago
Hi Krishna, You can't upload image files with the Databricks UI. However, you can use the Databricks REST API. See the link: docs.microsoft.com/en-us/azure/databricks/data/filestore Python code is one example. A better way may be to upload the images to Azure Blob storage and then read the images from there. You can use Azure Storage Explorer, a free Windows GUI app to upload files to Blob, or tools like PowerShell. Keeping files separate from the Databricks workspace is probably a good idea for production implementations as you have better shared access and get more admin features.
@krish_telugu · 4 years ago
@@BryanCafferky Thanks for the fast reply Bryan, my question was about semi-structured files like JSON or XML which cannot be converted into a table easily.
@BryanCafferky · 4 years ago
@@krish_telugu Right. You can use the same approach to upload those types of files. The UI documentation says you can upload JSON files the same way as CSV files, assuming they are not a nested hierarchy, i.e. just a two-dimensional table. Also, to convert files to tables see docs.databricks.com/data/data.html
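As a rough sketch of the read-then-register pattern for a JSON file already in DBFS (the file path and table name here are made up for illustration):
sdf_json = spark.read.option("multiLine", True).json("/FileStore/tables/sample.json")
sdf_json.printSchema()                                       # Spark infers a schema even for semi-structured data
sdf_json.write.mode("overwrite").saveAsTable("sample_json")  # now queryable from SQL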
@cimedp1141 · 4 years ago
Great!! Question: what is the best way to analyze 35 thousand tables of 98 rows each contained in a single Spark dataframe? Process each of the 35,000 tables one by one as Spark tables, or convert the entire dataframe to pandas and work locally with the tables?
@BryanCafferky · 4 years ago
If the schemas are the same for all the tables, maybe you could append them together and treat them as one table. Then you could partition the table to run your query in parallel. Processing the 35k separately is not likely to perform well.
@cimedp1141 · 4 years ago
@@BryanCafferky The schemas are the same. The data is already in a single dataframe. There are data from 5 sensors located at 35 thousand different locations; therefore, there are 98 samples from 5 different sensors at each of the 35 thousand points. The dataframe contains 3,430,000 rows. But it is necessary to analyze all 35 thousand points separately.
@BryanCafferky · 4 years ago
@@cimedp1141 I think you need to find a way to partition the data and then do the analysis; it may require looping over each partition. Take a look at Python user-defined aggregate functions; Apache Arrow makes this performant. docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html What do you need to do for each set of the 35k points? Train an ML model, calculate something?
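A hedged sketch of that per-group pattern using applyInPandas (available in Spark 3.x); the sensor dataframe and column names are assumptions, not from this thread:
import pandas as pd

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the ~98 rows for one location; do the per-location analysis here
    return pd.DataFrame({"location_id": [pdf["location_id"].iloc[0]],
                         "mean_reading": [pdf["reading"].mean()]})

result = sensor_sdf.groupBy("location_id").applyInPandas(
    summarize, schema="location_id long, mean_reading double")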
@chicagobeast12 · 2 years ago
Question - if I have a script written using pandas for transformations in a Databricks notebook, would I need to convert all the code to PySpark to realize the benefits, or would it be okay if I only converted the 'inefficient blocks' and used pandas for some of the simpler munging tasks?
@BryanCafferky · 2 years ago
There is a big difference between a local pandas dataframe, which only exists on the driver node, and a distributed Spark dataframe, which is spread over the cluster. Both have their uses and you do need to keep in mind which one you want in a given situation. However, you can make the code migration easier by using the Apache Koalas library, which applies pandas code to Spark. See koalas.readthedocs.io/en/latest/ Also, for more in-depth coverage on PySpark, see my playlist at kzbin.info/aero/PL7_h0bRfL52qWoCcS18nXcT1s-5rSa1yp PySpark videos start at Lesson 20. Thanks for watching.
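A small sketch of the Koalas route Bryan mentions (the file path and column name are illustrative; on Spark 3.2+ runtimes the same API ships as pyspark.pandas):
import databricks.koalas as ks   # newer runtimes: import pyspark.pandas as ps

kdf = ks.read_csv("/FileStore/tables/diabetes.csv")   # pandas-style call, but distributed over the cluster
kdf["glucose_z"] = (kdf["glucose"] - kdf["glucose"].mean()) / kdf["glucose"].std()
kdf.head()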
@marcelkore · 5 years ago
Hi Bryan, great tutorials. They helped me get a lay of the land with Databricks. You mention providing access to your notebooks. Where would those be?
@BryanCafferky · 5 years ago
Thanks, and glad they helped. My content is on GitHub at github.com/bcafferky/shared. Look for the folders starting with AzureDatabricks. You may also be interested in the other videos and content, which are deeper dives.
@gracefan9688 · 5 years ago
@@BryanCafferky Hi Bryan, thanks for the tutorial. The GitHub link is a 404.
@BryanCafferky · 5 years ago
@@gracefan9688 Hmmm... I just clicked the link and it came up. Might be an internet connection issue.
@BryanCafferky · 5 years ago
You can also go to github.com and search for bcafferky.
@ilia8265 · 5 years ago
@@BryanCafferky Hi Bryan, great video. I checked everything you had on your GitHub and couldn't find this particular notebook. Could you provide a link to it please? This was extremely helpful for migrating from Jupyter to Databricks. Thanks!
@techproductowner · 7 months ago
Hi Bryan, can you please help me understand: if my role is ETL, do I need to learn PySpark, or can ADF do the job of transferring and transforming the data?
@BryanCafferky · 7 months ago
ETL/ELT is just copying and transforming data, and there are many tools to do that. ADF is one that integrates with Databricks. Databricks Workflows and Delta Live Tables are others. Python scripts can also do the job. ADF and Databricks are very popular for this task on Azure.
@whharding1243 · 2 years ago
Brian, do you mind a random question? When in Databricks notebooks and writing base Python on a local pandas dataframe, is that technically still PySpark? Not sure why that question matters to me, but it kind of bothers my brain not knowing for certain 🙃 If it is PySpark, does that mean even pandas dataframes get passed to the optimiser, or is that restricted to distributed dataframes? Loving the videos, thank you. Also really love your sign-off, thanks for pulling for us, great person!
@whharding1243 · 2 years ago
*Bryan (apologies)
@BryanCafferky · 2 years ago
It's a good question. When you create a local pandas dataframe, it is not a Spark dataframe and it only exists on the driver node. A PySpark dataframe is broken into chunks (partitions) and distributed over the nodes. PySpark dataframes try to give you syntax as close to pandas as possible, but not 100%. See my video on the new PySpark pandas API, which uses 100% pandas syntax except that the library name is changed. kzbin.info/www/bejne/j3W4lWN6Za-seK8 But the bottom line is that a pandas dataframe is not partitioned and lives only on the driver node. That's why you need to be careful about how large they are, as the driver has limited memory.
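A tiny sketch of crossing that boundary explicitly in a Databricks notebook (the data and names are illustrative):
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})  # lives only on the driver
sdf = spark.createDataFrame(pdf)                                    # now partitioned across the cluster
small_pdf = sdf.filter("value > 10").toPandas()                     # collected back into driver memory, so keep it small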
@rezguizina6013 · 4 years ago
Hello Bryan! I checked the link to the notebook but I still don't see the diabetes notebook you are using in the video! Is there any way to get it? I would be grateful! Thank you!
@BryanCafferky · 4 years ago
Just uploaded it in a separate file. github.com/bcafferky/shared/tree/master/AzureDatabricksPython
@rezguizina6013 · 4 years ago
@@BryanCafferky Thank you so much!
@harshaagarwal8480 · 4 years ago
Sir, is your GitHub link for the notebook posted somewhere?
@BryanCafferky · 3 years ago
Here it is. Just saw this comment. github.com/bcafferky/shared/tree/master/AzureDatabricksPython
@FPrimeHD1618 · 2 years ago
How has your experience been in the solutions space? Is your job more along the lines of a sales engineering type role? The reason I ask is that I just recently turned down a solutions role at my company and chose to stay non-client-facing :)
@BryanCafferky · 2 years ago
Most of my career has been hands-on data engineering development and deployment. Sales is fun because you get to sample a lot of things, but it can be harder to get in-depth knowledge. Also, in sales, you have to keep your integrity by always recommending the best thing for the customer whether or not it will sell a product. Not all salespeople get that.
@FPrimeHD1618 · 2 years ago
@@BryanCafferky I appreciate the response, thanks!
@NavneetKumar-rj6wo · 3 years ago
The video is really good, but I can't find the Git repository at the path mentioned in the video. Can you please share the path with me?
@BryanCafferky · 3 years ago
Oops. I'll add it to the description too. Here it is: github.com/bcafferky/shared/tree/master/AzureDatabricksPython
@NavneetKumar-rj6wo · 3 years ago
@@BryanCafferky Thanks Bryan for sharing the GitHub link... I'm getting an error while importing the .dbc files in Azure Databricks... any suggestions are welcome.
@BryanCafferky · 3 years ago
@@NavneetKumar-rj6wo What is the error message, etc?
@JimRohn-u8c · 2 years ago
Since SQL is native to Spark, is there any benefit to using PySpark over Spark SQL?
@BryanCafferky · 2 years ago
It depends on what you are doing. PySpark with Arrow enabled lets you run custom functions on data over the cluster. Also, Python has more functionality, like being able to train ML models. Performance for data engineering work may depend on specifics, but both SQL and dataframes go through the optimizer. Of course, language preference is a plus for whichever you prefer.
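As one example of something SQL alone can't do, here is a hedged sketch of training a model with pyspark.ml on the demo dataframe (the feature and label column names are assumptions, not from the video):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["glucose", "bmi", "age"], outputCol="features")
train_df = assembler.transform(sdf_diabetes).select("features", "outcome")
model = LogisticRegression(labelCol="outcome").fit(train_df)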
@raniataha9876 · 4 years ago
Thanks!
@ismafoot11 · 3 years ago
Qapla' brother!
@dmzone64 · 4 years ago
"Sequel," not "S-Q-L"... we old guard fellows should know...