Advancing Spark - Rethinking ETL with Databricks Autoloader

26,572 views

Advancing Analytics

A day ago

Comments: 63
@theenduser1037 4 years ago
I read all the comments and I watched the full video too, because it's very informative. I had an interview question on the same topic. I wish I had known about this video earlier.
@OliverVallack 4 years ago
Great video as always, keep them coming!
@theenduser1037 3 years ago
I desperately need help writing a JSON file to Azure Table Storage from Databricks.
@HiYurd 1 year ago
Really good job on this tutorial. TY
@meghapatil6426 3 years ago
This was extremely helpful!
@AdvancingAnalytics 3 years ago
Glad it was helpful!
@jhoop2002 3 years ago
Cool video!!!
@kunalr_ai 4 years ago
Keep it up
@JackPickle 3 years ago
Great tutorial, many thanks. Is there any chance you could post your code?
@mc1912 3 years ago
For the first 3 files, I think you don't see anything in the queue because the messages are already consumed by the time you query the queue. When you upload the next 2 files the stream is stopped, so you see the messages in the queue because they haven't been consumed. Once you trigger the stream, those messages are consumed and disappear from the queue.
@AdvancingAnalytics 3 years ago
I'd need to rewatch the vid (it's been a while!) but the initialisation of an autoloader run (ie: when there is no existing checkpoint) will use directory listing, as the event grid setup will only trigger messages for new/added files. Otherwise yep, messages will stay in the queue until flagged to dequeue once the stream has processed them - it can sometimes be tricky to see what messages are in the queue as it's processing through them!
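For context, a minimal sketch of the Autoloader pattern being discussed - all paths and the schema below are hypothetical placeholders, not the code from the video:

# Minimal Autoloader stream (Databricks notebook; `spark` is the ambient session).
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = (spark.readStream
    .format("cloudFiles")                   # the Autoloader source
    .option("cloudFiles.format", "csv")     # format of the incoming files
    .schema(schema)
    .load("/mnt/landing/sales/"))           # hypothetical landing path

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/landing/sales/_checkpoint")
    .trigger(once=True)                     # process everything queued, then stop
    .start("/mnt/delta/sales"))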
@skipa9906 5 months ago
Great video. Where can we get the sample files?
@briancuster7355 2 years ago
I've used autoloader before and had some problems with it loading parquet files. I never figured out what the problem was though.
@morrolan 3 years ago
I see that the default file discovery mode is directory listing. This removes the need for the event grid and storage queue complexity. I am surprised that you didn't mention this or talk about it. Wouldn't this be ideal for many use cases?
@AdvancingAnalytics 3 years ago
The directory listing approach is the same used in standard Spark Streaming over a file source, I guess I was focusing on the notification queues as the major autoloader benefit at that time, avoiding slow-down over large directories. Autoloader has had a bunch of features added since that give it additional benefits, such as the new schema evolution elements. Probably a very different conversation now than when it was released :) Simon
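As a rough illustration of those newer schema features (option names as documented for Autoloader; paths are hypothetical):

# Schema inference + evolution: Autoloader tracks the inferred schema at
# cloudFiles.schemaLocation and can evolve it when new columns appear.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/sales")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/landing/sales/"))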
@777ashwin 3 years ago
You should avoid using directory listing mode in production; the storage costs are going to be very high because it will continuously list the files in the directory.
@davidsou 3 years ago
What's the difference between Autoloader and achieving the same result with Spark Streaming in Databricks?
@konradpaniec6038 3 years ago
He answered that question at the bottom of the comment section
@shivaniverma2881 2 years ago
How do you read a CSV file when the data itself contains commas?
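Not something the video covers, but the standard Spark CSV reader options should handle this - a sketch assuming the source file wraps such fields in quotes:

# Commas inside a value are fine as long as the field is quoted in the file;
# the quote/escape options tell Spark how to interpret them.
df = (spark.read
    .option("header", "true")
    .option("quote", "\"")    # fields containing commas must be wrapped in quotes
    .option("escape", "\"")   # a doubled quote escapes a quote inside a quoted field
    .csv("/mnt/landing/file.csv"))   # hypothetical path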
@sid0000009 4 years ago
Thanks for the video. I believe it only supports immutable source files. If we use trigger-once mode, the only advantage we get is that we don't need to determine which files are new and need processing each time. This kind of confused me, as checkpoints are meant to determine the same thing? What difference would it make if we didn't use Autoloader and just used the trigger-once option with a checkpoint - wouldn't it determine by itself which files were already processed? Thank you!
@abhradwipmukherjee3697 1 year ago
Great as usual. If my streaming data is getting appended to a single file, will this work?
@AlejandraLemmo 2 years ago
Hello, and thank you for this video. We have implemented Autoloader in the PRD environment, but the process is taking longer - it went from running in 4 minutes to 10 minutes. I see a connection with the fact that the folder where the files arrive contains many files...
@AdvancingAnalytics 2 years ago
Are you using cloudFiles.useNotifications? If not, then yes, it'll be using directory listing mode, which will read the entire contents of the blob and compare it to the checkpoint to work out what files need to be loaded - this process shouldn't take a super lengthy time, but can start hitting performance problems if you have many thousands of files! If that's not the case, then it's more likely a cluster sizing issue with the streaming query, which shouldn't be impacted by existing files! Simon
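For reference, a sketch of switching to notification mode (cloudFiles option names per the Autoloader docs; the service principal values are placeholders you'd normally pull from a secret scope):

# Notification mode: Autoloader creates the Event Grid subscription and
# storage queue itself, using a service principal with rights to do so.
sub_id = "<subscription-id>"        # hypothetical placeholders
tenant_id = "<tenant-id>"
client_id = "<client-id>"
client_secret = "<client-secret>"
resource_group = "<resource-group>"

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", sub_id)
    .option("cloudFiles.tenantId", tenant_id)
    .option("cloudFiles.clientId", client_id)
    .option("cloudFiles.clientSecret", client_secret)
    .option("cloudFiles.resourceGroup", resource_group)
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/sales")
    .load("/mnt/landing/sales/"))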
@deepjyotimitra1340 2 years ago
Thank you so much for the video - it's going to reduce the pain of a lot of data engineers. One quick question from my side: does Autoloader support Data Lake Gen1? And if I don't want to create any event subscription, can I still achieve this?
@guptaashok121 3 years ago
I tried to implement this using CSV as the source, and it worked. But I am having an issue with the header: the header record is getting written to my target delta table with each file. I am not sure where I can set that option - I checked all the cloudFiles options and did not find one.
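For what it's worth, plain CSV reader options can sit alongside the cloudFiles.* options and should pass through to the underlying reader - a sketch assuming that's the missing piece:

# "header" is a normal CSV option, not a cloudFiles.* one, but it can be
# set on the same reader so the header row isn't ingested as data.
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
])

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")           # skip the header row in each file
    .schema(schema)
    .load("/mnt/landing/sales/"))       # hypothetical path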
@jayjaysize 4 years ago
Thanks for the video. How can you get an audit of who uploaded the file? Is it possible to pass the metadata of the uploading user's name, so that you can log that in the delta lake as well?
@AdvancingAnalytics 4 years ago
Good question! Don't know - there's metadata available in the queue but I've not looked into detail. Certainly worth checking out! Simon
@nsudeesh601 3 years ago
How do you configure Autoloader on AWS? Please share a video link if you have one. Thanks
@akankshagupta5067 3 years ago
Thank you for the video, but I have a question: I see a spike in the writeStream dashboard each time a file is loaded into the blob, but why is the row count still 0 when I check the table, given I am writing directly into the table? df.writeStream.format("csv").option("delimiter", ",").option("checkpointLocation", "mnt/path/path/_checkpoint").table("table1")
@werner6816 1 year ago
Does Autoloader + delta tables work with serverless SQL warehouses too?
@AdvancingAnalytics 1 year ago
Yep, should do. Although we would tend to use autoloader via a normal spark cluster to load the data, then use DBX Serverless to query the delta tables - don't have to use a single cluster for everything :)
@rakeshraushan6573 2 years ago
Can we use Autoloader with a delta source? Is it recommended, or am I headed in the wrong direction? My scenario involves transformation of new records arriving in a delta table. Please suggest. Thanks
@dineshbandaru5821 4 years ago
Can we write data into a data warehouse using this method? And if so, how?
@vinayakakm-n5v 1 year ago
I'm getting this error: "com.databricks.sql.cloudfiles.errors.CloudFilesIllegalArgumentException: Found notification-setup authentication options for the (default) directory"
@bhaveshpatelaus 4 years ago
What if the files get deleted in the blob storage? Does this affect the dataframe row count? Do you have to have all files present in blob storage at all times? Just thinking in terms of cost savings - can a regular clean-up of blob storage run to save storage costs?
@AdvancingAnalytics 4 years ago
Hey - the files in blob are ignored once triggered. When a blob lands, it triggers a read into the dataframe, after which there is no permanent link between that blob and the data in your delta table. So yes, you can absolutely clear down your blob storage once the files have been processed. Just bear in mind that if you load a file with the same name, it will be ignored by autoloader (it will assume you have already processed that file!)
@bhaveshpatelaus 4 years ago
@@AdvancingAnalytics Great, thanks for the response. Loving the work you are doing. Keep it up.
@JD-xd3xp 3 years ago
Can I use Auto Loader to load a shapefile (geospatial file)?
@qweasdzxc2007 2 years ago
Thank you! What is the best way of loading multiple tables using Autoloader? Should I create a separate stream for each directory, or is there a convenient way of doing ELT for several sources using one Autoloader stream?
@AdvancingAnalytics 2 years ago
Each separate source/destination is a separate streaming query. We usually build out generic, parameterised autoloader processes then trigger the notebooks in parallel to run various streams simultaneously
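A rough sketch of what such a generic, parameterised process might look like - the function and parameter names here are illustrative, not the actual Advancing Analytics framework:

def start_autoloader_stream(source_path, target_path, checkpoint_path, file_format, schema):
    # One streaming query per source/destination pair; call this once per feed.
    df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .schema(schema)
        .load(source_path))
    return (df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(target_path))

# An orchestrator (or parallel notebook runs) would then invoke this per source, e.g.:
# start_autoloader_stream("/mnt/landing/sales/", "/mnt/delta/sales",
#                         "/mnt/checkpoints/sales", "csv", sales_schema)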
@jhansle1 2 years ago
@@AdvancingAnalytics Basically took the same approach, but can't get it working with Excel xlsx files. Any guidance on this?
@AdvancingAnalytics 2 years ago
@@jhansle1 Hrm - I've not seen Autoloader working with the Excel reader, given that's a 3rd party driver. Were you doing direct file streaming, or trying to get it to work within the cloudFiles context? Simon
@Rothbardo 4 years ago
So should any processes that update data based on data changes happen at the data lake layer, and then it'll automatically get updated with autoloader? What if a file gets overwritten?
@AdvancingAnalytics 4 years ago
It's fairly blunt about that - it ONLY pushes through new files. Files that are appended to/overwritten don't trigger a new write. Definitely narrows the use case - basically if you're receiving transactional data into a regular location with unique file names, Autoloader will be great. If you have a more complex ingestion pattern, it's probably not appropriate. For the later layers of the lake, I wouldn't use Autoloader - ie: you wouldn't point it at a delta table source and trigger based on new files there. For a chain of updates, you can stream directly from a delta table in a similar way, which won't pick up duplicate records. Definitely some downsides to the tool, doesn't solve everything. But if your data fits the expected pattern, could be a very nice way of doing early-stage ingestion. Simon
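For the delta-to-delta chaining Simon mentions, a minimal sketch (paths hypothetical):

# Streaming from a Delta table reads newly committed rows rather than
# reacting to file-level events, so it suits the later lake layers.
df = (spark.readStream
    .format("delta")
    .load("/mnt/delta/bronze/sales"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/silver/sales/_checkpoint")
    .start("/mnt/delta/silver/sales"))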
@Rothbardo 4 years ago
@@AdvancingAnalytics Thanks for your reply! Love your content, man. Great work!
@marcocaviezel2672 3 years ago
Hi Simon! Once again, an awesome video! I followed your instructions and used ADLS Gen2 as source and sink. Somehow, for me it only worked if I used .option("cloudFiles.validateOptions", "false") in the readStream command, but everything worked as expected in the end (except that I don't see the messages in the queue). Do you have any idea why I need this option, and did you find a solution for how to show the queues?
@marcocaviezel2672 3 years ago
Hi again, I just found the solution. I had Databricks only as a Storage Blob Data Contributor on the blob. I changed this to the Contributor role and now it works fine (except that the queues aren't displayed).
@AdvancingAnalytics 3 years ago
@@marcocaviezel2672 If the queues aren't being created, it might be in directory listing mode. If you've got cloudfiles.useNotifications set to true, then it will be creating the event grid topic & queue according to your config? The dequeue time is also pretty quick, so it's possible the messages are getting to the queue, then being read & deleted before you see them in there :) Simon
@marcocaviezel2672 3 years ago
@@AdvancingAnalytics Hi Simon, thanks for your help! I toggled cloudfiles.useNotifications to true, but it definitely could be that the data is already deleted, since I'm testing with a really small dataset. I thought that the messages might be saved so I could use them as a history, but if I understand correctly, that's not the case. Have a great day! Marco
@morris2582 3 years ago
Good video! Thanks. When I try Autoloader, I find that it does not support streaming reads of the XML file format - is that right? When I try Autoloader I get this exception: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
@AdvancingAnalytics 3 years ago
Yeah, we've found Autoloader can be a little fussy about the files that it can read, and it can sometimes hide away the errors that would normally be returned by datareader failures. One thing is definite - if the stream reader can't read the file type, then nor can Autoloader!
@vinayakakm-n5v 1 year ago
Where can I get the client ID and client secret?
@AdvancingAnalytics 1 year ago
That's the Service Principal - it's an "application" created within Azure Active Directory (soon to be renamed Microsoft Entra ID). You can get those parameters from the app object itself through AAD.
@vinayakakm-n5v 1 year ago
@@AdvancingAnalytics Does the redirect URL have to be something relevant, or can it be ignored?
@vinayakakm-n5v 1 year ago
@@AdvancingAnalytics I used the correct keys but I'm getting this error: "com.databricks.sql.cloudfiles.errors.CloudFilesIllegalArgumentException: Found notification-setup authentication options for the (default) directory"
@gamachu2000 3 years ago
What is the difference between Autoloader and an event trigger from Azure Data Factory? Are they the same? Thanks
@AdvancingAnalytics 3 years ago
Hi! They're two different implementations that both use EventGrid behind the scenes. So the actual technology that triggers is the same. However, an ADF file trigger will trigger for each and every file that lands in blob. With Autoloader, the queue is held separately so you have the ability to offset and trigger periodically to process files that have come in since it last ran. For me, Autoloader has more flexibility, but it depends on your scenario
@IanSkelton 3 years ago
Great video! I had this set up successfully reading JSON files from ADLS Gen2 with DBR version 7.0. However, when upgrading to DBR 7.2 or 7.3 it creates the event grid objects but fails to read the JSON files. Have you had any success with these versions?
@AdvancingAnalytics 3 years ago
Yep! Currently using it on-project with a client on Runtime 7.3 LTS. Autoloader over CSV/Parquet rather than JSON, but it should be fine. I did find it suddenly demanded I provide a schema, even for parquet files, but once provided all seemed absolutely fine! Simon
@IanSkelton 3 years ago
Thanks Simon. Turns out the nullability constraints in the schema definition are ignored in DBR 7.0. In 7.2 / 7.3, if nulls are encountered and your schema is set not to allow them, the stream fails with a generic "file could not be read" error. This had me stumped for ages!
@AdvancingAnalytics 3 years ago
@@IanSkelton Yeah, it's not a great error message! We've come to know "file could not be read" as "someone messed up the schema definition" - nullability would definitely come into it!