52. Databricks | Pyspark | Delta Lake Architecture: Internal Working Mechanism

44,174 views

Raja's Data Engineering

1 day ago

Comments: 89
@farzicoderz 2 years ago
What wonderful work you have done to help us understand not just the theory but also the practical look and feel. I highly appreciate your efforts to create such valuable content for us.
@rajasdataengineering7585 2 years ago
Thank you for your kind words, Ayushi!
@battulasuresh9306 2 years ago
It would be very helpful if the lectures were arranged in order.
@vinodhmani7773 1 year ago
Thanks for all the videos, sir. I have read a few data engineering books/blogs in recent times; your sessions are way more detailed, with practical knowledge. Thanks for taking the time to do this.
@rajasdataengineering7585 1 year ago
Thanks and welcome. Glad it helps data engineers in the community
@purnimasharma9734 2 years ago
Very nice and helpful tutorials. The lectures are so good and to the point that I went through the entire series in a day. Learnt so much, thank you for posting these videos. I have become your follower and fan.
@rajasdataengineering7585 2 years ago
Thank you Purnima
@ranjansrivastava9256 9 months ago
Dear Raja, a small request: the videos in this interview series don't appear in sequential order. For example, I can't see video numbers 5, 6, 7, 8, 49, 50, 51, etc. If possible, could you please arrange them?
@rajasdataengineering7585 9 months ago
Hi Ranjan, those video numbers follow the numbering of the entire video list, so some are missing from the interview series because those topics are not part of the interview questions.
@SidharthanPV 2 years ago
I was wondering how Delta Lake handles ACID features, and then this video came along. Thank you for making this!
@rajasdataengineering7585 2 years ago
Thank you
@pridename2858 1 year ago
These are really eye-opening videos. Kindly keep making them; they are so full of knowledge. Great work.
@rajasdataengineering7585 1 year ago
Thanks for your comment! Hope it helps you gain knowledge of Delta internals. Sure, I will keep creating more videos.
@vaidhyanathan07 4 months ago
I have a couple of questions. The checkpoint .parquet file is not showing up in _delta_log after inserting more than 10 records; probably, as you said, an admin setting has been configured differently. 1) How do I check whether the checkpoint .parquet file exists but is hidden? 2) If it is hidden, how do I view it? 3) If I use the command c = spark.conf.set("spark.databricks.delta.checkpointInterval", "10") and then print c, it shows "None".
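A note for readers hitting the same issue: spark.conf.set() returns None, so printing its result will always show "None"; read the value back with spark.conf.get(). Below is a minimal sketch for a Databricks notebook (where spark and dbutils are predefined; the table path is a hypothetical example). Newer runtimes may also write compacted *.json checkpoint files instead of a single .checkpoint.parquet, which matches the compacted.json observation further down this thread.

```python
# Set the checkpoint interval, then read it back; set() itself returns None.
spark.conf.set("spark.databricks.delta.checkpointInterval", "10")
print(spark.conf.get("spark.databricks.delta.checkpointInterval"))  # -> 10

# Checkpoint files are not hidden: list the transaction log to see them.
# Depending on the runtime you may see *.checkpoint.parquet or compacted
# *.json files. The path below is a hypothetical example.
for f in dbutils.fs.ls("/FileStore/tables/delta_demo/_delta_log"):
    print(f.name)
```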
@satijena5790 1 year ago
Very, very nice explanation. I am already a fan of Raja's Data Engineering channel. Just wondering whether I can get a copy of this notebook for practice, please?
@demudunaidugompa 11 months ago
Great content and very helpful. Thank you so much for sharing valuable content.
@rajasdataengineering7585 11 months ago
Glad it was helpful!
@ravipaul1657 2 years ago
Do we need Delta tables if we have Synapse Analytics and we are performing our ETL tasks using Azure Databricks?
@rajasdataengineering7585 2 years ago
It is not mandatory, but it can be decided based on the overall project requirements and the recommended architecture.
@shakthimaan007 2 months ago
Awesome work, bro. Have you put these notebooks somewhere on your GitHub? Can you share them with us if possible?
@PinaakGoel 28 days ago
I have a doubt regarding the update operation. You mentioned that the Delta engine scans for the particular files containing records that need to be updated and then updates them. But if that is the case, how is time travel possible? Updating the existing files would result in a loss of historical data.
@rajasdataengineering7585 28 days ago
Parquet files are immutable in nature. So during an update, the relevant files are scanned, and new parquet files are created based on the updated values. It won't overwrite the existing parquet files.
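A minimal time-travel sketch illustrating this point (the table path is a hypothetical example): because the old parquet files are left in place, earlier versions can still be reconstructed through the transaction log.

```python
# Read an earlier version of a Delta table by version number.
df_v0 = (spark.read.format("delta")
              .option("versionAsOf", 0)
              .load("/FileStore/tables/delta_demo"))

# ...or by timestamp.
df_old = (spark.read.format("delta")
               .option("timestampAsOf", "2024-01-01")
               .load("/FileStore/tables/delta_demo"))

# Inspect the versions recorded in the transaction log.
spark.sql("DESCRIBE HISTORY delta.`/FileStore/tables/delta_demo`").show()
```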
@PinaakGoel 28 days ago
@@rajasdataengineering7585 Understood. Thanks for your reply, and kudos for your effort in compiling this Databricks playlist!
@rajasdataengineering7585 28 days ago
You are welcome!
@sravankumar1767 2 years ago
In Delta tables, how should we distinguish a Day 0 full load from Day 1 incremental loading? In our project we need to create separate Day 0 and Day 1 pipelines, and there is no MERGE statement in our Databricks notebook. How should we determine whether it is Day 0 or Day 1? Could you please clarify my doubts?
@rajasdataengineering7585 2 years ago
Hi Sravan, I couldn't understand the requirement exactly, but I can guide you based on what I understood. You have two different pipelines populating data into a Delta table, and later you want to know which pipeline was executed. For this scenario, it's better to go with log messages: in the log output file, we can provide detailed information about the pipeline.
@sravankumar1767 2 years ago
@@rajasdataengineering7585 In the user story they mentioned Day 0 and Day 1 for the ingestion and consumption pipelines. We didn't write any MERGE statement for Day 1. How should we know in Delta tables whether it is a full load or an incremental one? Is there any specific field available for full and incremental loading? 🤔
@rajasdataengineering7585 2 years ago
No Sravan, there is no specific functionality to handle this scenario
@joyo2122 2 years ago
@@sravankumar1767 The first-time run should always be a full load, then incremental.
@farzicoderz 2 years ago
If I get it right, do you mean you want a separate pipeline for loading data for the first time (Day 0), containing all data up to the current date from the source system, also called a historical or full load? So you can have one pipeline to load the full data. Then, once you are done loading Day 0, you would want to read incrementally any data that came after your Day 0 load; you call that Day 1, the BAU data, and you can build a separate pipeline for it. If you don't want a MERGE statement, you basically want to keep all the data you read. Say on Day 0 you have IDs 1, 2, 3; now on Day 1, if IDs 1, 5, 6 come in, you want to store them all, without checking whether there is already an existing record for a previously seen ID like ID 1. Is that what you are saying? There is also the concept of slowly changing dimensions (SCD1, SCD2, etc.); you should give it a read.
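For reference, a hedged sketch of what the Day 1 upsert would look like if a MERGE statement were used (the table name, key column, and incremental_df DataFrame are all hypothetical); without MERGE, the Day 1 pipeline is typically a plain append, as described above.

```python
from delta.tables import DeltaTable

# Hypothetical names: demo.customers is the Day 0 target table;
# incremental_df holds the Day 1 (BAU) records keyed by id.
target = DeltaTable.forName(spark, "demo.customers")

(target.alias("t")
       .merge(incremental_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()      # SCD Type 1: overwrite matching rows
       .whenNotMatchedInsertAll()   # insert brand-new rows
       .execute())
```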
@Vishu-ru4iw 9 months ago
Hi sir, the video is really helpful and clear, but when I tried the same thing I got 00000000000000000001.00000000000000000006.compacted.json on the 10th execution instead of a checkpoint parquet file. Can you please help with this?
@patriotbharath 2 years ago
Please provide the code snippet and file you are using for practice, bro. Great content.
@totnguyen3308 1 year ago
Thanks for your tutorial. I have a question about how to create another folder in DBFS like in the video. I tried right-clicking and creating a folder, but it didn't work.
@rajasdataengineering7585 1 year ago
You can create a folder using the file system command %fs mkdirs or the dbutils command dbutils.fs.mkdirs.
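For example, a minimal sketch with a hypothetical folder path:

```python
# Creates a folder in DBFS from a notebook cell (path is a hypothetical example).
dbutils.fs.mkdirs("/FileStore/tables/my_new_folder")
# Equivalently, in its own notebook cell:  %fs mkdirs /FileStore/tables/my_new_folder
```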
@totnguyen3308 1 year ago
@@rajasdataengineering7585 Thank you, I did it.
@rajasdataengineering7585 1 year ago
Great
@midhunrajaramanatha5311 2 years ago
Can you provide a link to access the notebooks in the video description? That would be very useful.
@joyo2122 2 years ago
The time-travel functionality for tables is awesome.
@rajasdataengineering7585 2 years ago
Very true
@UmerPKgrw 2 months ago
[2024-08-15 11:27 BST] Databricks Community Edition is very slow. Pages are taking too much time to load. My internet speed is fine. Does anyone know why this is?
@kelvink6470 1 year ago
Explained very well. Thank you.
@rajasdataengineering7585 1 year ago
Glad you liked it!
@sangeetharamakrishnan6288 2 years ago
Really helpful video... Thank you very much indeed.
@rajasdataengineering7585 2 years ago
Thanks!
@ashishbarwad9471 2 months ago
Best means best. The best video ever if you are interested in learning.
@rajasdataengineering7585 2 months ago
Thank you
@Umerkhange 1 year ago
When working on big projects, do you create a framework or just use the core Spark APIs?
@rajasdataengineering7585 1 year ago
There is no standard framework; a framework is designed depending on the use case.
@UmerPKgrw 1 year ago
@@rajasdataengineering7585 It would be really great if you made a video on frameworks. There is a channel on YouTube which explains development through a framework, but his videos are very boring. Ref: Dirtylab
@UmerPKgrw 1 year ago
@@rajasdataengineering7585 Datyrlab
@venkatasai4293 2 years ago
Hi Raja, Delta tables do not support bucketing. How can we achieve bucketing with a Delta table? Also, could you please make one detailed video on bucketing explaining the internals? When we create buckets in Hive, the total number of files equals the number of buckets, but in Spark it is different. Could you explain how the data from two files is distributed across the nodes? It would be great for us. Thank you.
@rajasdataengineering7585 2 years ago
Hi Venkata, yes, Delta tables do not support buckets, but there are two workarounds: 1. We can use the Delta table optimization Z-ORDER, which co-locates related data, much like bucketing, and improves performance. 2. We can write bucketed data as parquet files to a location and convert those parquet files to a Delta table. Yes, I can post a video on bucketing soon.
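A hedged sketch of both workarounds (the table, column, and path names are hypothetical; note that the conversion in the second workaround keeps the file layout but drops the bucketing metadata, since Delta itself has no notion of buckets):

```python
# Workaround 1: Z-ORDER an existing Delta table on the would-be bucket key.
spark.sql("OPTIMIZE demo.sales ZORDER BY (customer_id)")

# Workaround 2: write bucketed parquet as an external table at a known path,
# then convert the files at that path to Delta.
(df.write.format("parquet")
   .bucketBy(8, "customer_id")
   .sortBy("customer_id")
   .option("path", "/mnt/demo/sales_bucketed")
   .saveAsTable("demo.sales_bucketed"))

spark.sql("CONVERT TO DELTA parquet.`/mnt/demo/sales_bucketed`")
```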
@venkatasai4293 2 years ago
@@rajasdataengineering7585 Thank you for the info. In my requirement I have 1 fact table and 20 base tables. In this scenario, which will be more efficient: bucketing or broadcasting? Since AQE is enabled, it will prefer a broadcast join. Also, for this requirement, which cluster instance will be more efficient: compute optimized or storage optimized?
@venkatasai4293 2 years ago
@@rajasdataengineering7585 Also a small doubt: let us assume I have df1 with 5 bucketed files and df2 with 5 bucketed files, so 10 files in total, and I have 4 worker nodes. How does the data distribution happen here? How does it eliminate shuffling?
@rajasdataengineering7585 2 years ago
Hi Venkat, broadcast is suitable if your dimension tables are tiny (around 10 MB in size). Regarding cluster type, it depends on what kind of operations you perform on these tables and on their size.
@rajasdataengineering7585 2 years ago
In bucketing, the tables on both sides are bucketed on a specific key, which means the data has already been shuffled once based on that key and the sorted data has been written to disk. So there are 5 bucketed files for each table, and each file matches exactly one file on the other side. This means pre-sorted data is loaded into cluster memory and all the relevant keys sit in the same executor's memory for both tables. So no further shuffling is needed, which boosts performance.
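A minimal sketch of that setup (table names and the df1/df2 DataFrames are hypothetical); after bucketing both sides on the join key, the physical plan should show a sort-merge join with no Exchange (shuffle) on either side:

```python
# Write both sides bucketed and sorted on the join key into 5 buckets each.
for name, df in [("demo.t1_bucketed", df1), ("demo.t2_bucketed", df2)]:
    (df.write.format("parquet")
       .bucketBy(5, "key")
       .sortBy("key")
       .saveAsTable(name))

joined = spark.table("demo.t1_bucketed").join(spark.table("demo.t2_bucketed"), "key")
joined.explain()  # expect SortMergeJoin with no Exchange nodes
```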
@niteshsoni2282 1 year ago
Great, sir... loved it!
@rajasdataengineering7585 1 year ago
Thank you
@tejashrikadam7704 1 year ago
You are doing great work 😊
@rajasdataengineering7585 1 year ago
Thanks! Hope it helps the data engineering community.
@kamatchiprabu 1 month ago
Clearly understood, sir. Thanks!
@rajasdataengineering7585 1 month ago
Glad to hear that! You are welcome
@adityaarbindam 3 months ago
Excellent explanation, Raja. Very insightful.
@rajasdataengineering7585 3 months ago
Glad you liked it! Keep watching
@adityaarbindam 3 months ago
Is that you, Kartik? I am guessing because of the way you use Notepad++ 🙂
@rajasdataengineering7585 3 months ago
No, this is Raja
@bhavanabh-o1h 10 months ago
Can you please share this notebook?
@andre__luiz__ 1 year ago
Thank you for this video and its amazing content!
@rajasdataengineering7585 1 year ago
Glad you enjoyed it!
@ravikumar-sz1je 2 years ago
Very good explanation
@rajasdataengineering7585 2 years ago
Thank you
@Umerkhange 1 year ago
Superb!
@padmavathyk1538 1 year ago
Could you please post the queries that you used in the video?
@tanushreenagar3116 1 year ago
Best tutorial 👌
@rajasdataengineering7585 1 year ago
Glad it helped
@sravankumar1767 2 years ago
Nice explanation Raja 👌 👍 👏
@rajasdataengineering7585 2 years ago
Thank you Sravan!
@harshadeep7506 9 months ago
Nice one
@rajasdataengineering7585 9 months ago
Thanks for watching
@sapkyoshi 1 year ago
What are all the slashes for? Can anyone tell me?
@sureshrecinp 1 year ago
Thank you for the info.
@rajasdataengineering7585 1 year ago
Any time!
@aravind5310 1 year ago
Nice content.
@rajasdataengineering7585 1 year ago
Thanks!
@sowmyakanduri-t8t 4 months ago
The lectures are very good, but they are not organized properly. The series covers more PySpark in Databricks than Databricks itself.