61. Databricks | Pyspark | Delta Lake : Slowly Changing Dimension (SCD Type2)

  Рет қаралды 42,971

Raja's Data Engineering

Raja's Data Engineering

Күн бұрын

Azure Databricks Learning:
==================
How to handle Slowly Changing Dimension Type2 (SCD Type2) requirement in Databricks using Pyspark?
This video covers end to end development steps of SCD Type 2 using Pyspark in Databricks environment
#DatabricksSCDType2 #SCDType2, #SparkSCDType2,#PySparkSCDType2,#SlowlyChangingDimenson2, #DatabricksSlowlyChangingDimension2 #DatabricksPerformanceOptimization, #DatabricksScenarioBasedInterviewQuestion, #SparkScenarioBasedInterviewQuestion, #DatabricksReadCsvInterviewQuestion, #SparkJobs, #NumberofSparkJobs, #DatabricksSparkJobs,#DatabricksRealtime, #SparkRealTime, #DatabricksInterviewQuestion, #DatabricksInterview, #SparkInterviewQuestion, #SparkInterview, #PysparkInterviewQuestion, #PysparkInterview, #BigdataInterviewQuestion, #BigdataInterviewQuestion #BigDataInterview #PysparkPerformanceTuning #PysparkPerformanceOptimization #PysparkPerformance #PysparkOptimization #PysparkTuning #DatabricksTutorial, #AzureDatabricks #Databricks #Pyspark #Spark #AzureDatabricks #AzureADF #Databricks #LearnPyspark #LearnDataBRicks #DataBricksTutorial #azuredatabricks #notebook #Databricksforbeginners

Пікірлер: 71
@nareshvemula2204
@nareshvemula2204 Жыл бұрын
Thank you for your video. Nice explanation. I found below gap in the solution. In SCD Type-2, at any point of time, Source record might stop coming due to various reasons. At that time, the corresponding record in the target should be kept the status as "deleted" or "not active". So, if we do "left outer join" we won't be able to identify which record in target need to be in "deleted" status. it should be full outer join and need to alter code accordingly to make the target reflect the correct data.
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Thanks for your comment. Yes this solution doesn't cover delete feature. It is mainly upsert in this example
@vaibhavvalandikar3913
@vaibhavvalandikar3913 4 ай бұрын
Wonderful 🙌 I have got a similar use case at work Will be using this approach Thanks!
@rajasdataengineering7585
@rajasdataengineering7585 4 ай бұрын
Glad it was helpful!
@arryanderson
@arryanderson Жыл бұрын
i checked so many documents, so many articles. None of them explained the concept, and somehow copied the same what databricks told. This is the only video which explains it well. and Step 5 is the most important and the merge_key there is the trick
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Thank you for your comment!
@chappasiva
@chappasiva 2 ай бұрын
Hi Raja, Nice explanation. Not only this topic, you covered each and every topic in very much in detail. could you please share notebook for the above one.
@ranjansrivastava9256
@ranjansrivastava9256 9 ай бұрын
Well Explained Raja ! Appreciate your hard work Bhai !!!!
@rajasdataengineering7585
@rajasdataengineering7585 9 ай бұрын
It's my pleasure! Thank you Ranjan
@oiwelder
@oiwelder 2 жыл бұрын
Hi, could you recommend me a book on SCD using pyspark? I would like to delve deeper. I've done projects based on the Azure documentation and with this presentation made by you, I was very excited. Thanks for sharing your knowledge. I speak of Brazil and I already admire you. Thanks.
@rajasdataengineering7585
@rajasdataengineering7585 2 жыл бұрын
Thanks Welder for your comments! There is no specific book or documentation for scd but databricks has created few examples for merge operation with Pyspark and that can be leveraged
@oiwelder
@oiwelder 2 жыл бұрын
@@rajasdataengineering7585 how to insert multiple conditions in merge? Current format is inserting newlines even with no changes from current records.
@rishadm1771
@rishadm1771 Жыл бұрын
@@oiwelder I would suggest you to sort that out first using pyspark df and then do the merge operations after it
@naganavya0705
@naganavya0705 4 ай бұрын
Hi Raja , instead of joins ,Can we write 2 merge conditions.. in sql we write stored procedures, it just insert the all the rows and when matches then update
@premsaikarampudi3944
@premsaikarampudi3944 Жыл бұрын
Hi, Do we have a Type 3 SCD video ? "Introduce new columns for updated values."
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Hi, no not yet created video on scd type 3
@gurumoorthysivakolunthu9878
@gurumoorthysivakolunthu9878 Жыл бұрын
This is Epic, Sir... This is very Ultimate... Thank you, Sir...
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Thank you
@sravankumar1767
@sravankumar1767 2 жыл бұрын
Nice explanation 👌
@rajasdataengineering7585
@rajasdataengineering7585 2 жыл бұрын
Thanks!
@indhu139
@indhu139 10 ай бұрын
Hii ,when using merge key .. so this works only for table having two primary keys or can we apply this to table with one primary key as merge key ?
@rajasdataengineering7585
@rajasdataengineering7585 10 ай бұрын
Hi, we can do for single primary key also
@indhu139
@indhu139 10 ай бұрын
@@rajasdataengineering7585 Thank you so much for your prompt reply .
@rajasdataengineering7585
@rajasdataengineering7585 10 ай бұрын
Welcome
@vaibhavb8772
@vaibhavb8772 2 жыл бұрын
Hi sir thank you this amazing videos. I am doing same thing. For me I need to track deletion of record as well. Where I can change in current notebook so that I will able to track deletion of records
@petermarquesliveLIVE
@petermarquesliveLIVE 2 жыл бұрын
Great explaination.. Thank you.. Could you also make video on SCD type 1
@rajasdataengineering7585
@rajasdataengineering7585 2 жыл бұрын
Thanks Peter. Have already made a video on SCD type 1. It is added in delta lake play list
@petermarquesliveLIVE
@petermarquesliveLIVE 2 жыл бұрын
@@rajasdataengineering7585 ok thank you
@purnimasharma9734
@purnimasharma9734 2 жыл бұрын
Hi Raja, Great tutorial! Can you add a video on Change Data Capture (CDC) for deletes?
@rajasdataengineering7585
@rajasdataengineering7585 2 жыл бұрын
Hi Purnima, thank you. Sure, will post a video for CDC
@Umerkhange
@Umerkhange Жыл бұрын
do we need to create separate Database in delta lake for dimensions tables. Usually we create database name "RDS" and put all the reporting tables such as dimenation and fact tables as dim_product,dim_customer,dim_date and Sales_Fact. is there any standard we can follow?
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
We can keep fact and dimension tables in same database
@UmerPKgrw
@UmerPKgrw Жыл бұрын
@@rajasdataengineering7585 is there any video where you have demonstrated such concepts. Stagging area, model and reporting. I saw the earlier videos where you have told about silver, bronze and gold. Do we create similar folders in Data lake.
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Yes that is the solution architecture of delta databases
@tanushreenagar3116
@tanushreenagar3116 Жыл бұрын
Superb sir 👌 can I access to this notebook
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Thank you
@sujithkumar2362
@sujithkumar2362 2 жыл бұрын
Hai when we are running multiple times by updating the same record the history data is growing when we run 4 times with update we are getting 3 inactive and 1 active record ..can you help out this as in other process history is not growing ...
@rajasdataengineering7585
@rajasdataengineering7585 2 жыл бұрын
Hi Sujith, Keeping all historical records is business requirement in most of the cases. But if you don't want to keep many historical records, SCD type 1 or type 3 should be helpful
@sujithkumar2362
@sujithkumar2362 2 жыл бұрын
Thanks for that reply
@TamilSelvan-gq7vn
@TamilSelvan-gq7vn Жыл бұрын
Hi sir, Can you do a Video for Delta Lake SCD Type 4 also.
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Sure Tamil, will make a video on this requirement
@TamilSelvan-gq7vn
@TamilSelvan-gq7vn Жыл бұрын
@@rajasdataengineering7585 thanks for your reply sir.
@Nrn-bi4wu
@Nrn-bi4wu 10 ай бұрын
Hi Mr.Raja i found your tutorial videos very interesting. Would you be able to share the dbc files for the tutorial...it will be helpful to refresh my memory and also prep for any future interviews...i will be happy to subscribe to a paid online course as well if any for the dbc files...
@gurumoorthysivakolunthu9878
@gurumoorthysivakolunthu9878 Жыл бұрын
Hi Sir.... As you mentioned this type SCD 2 will. Create duplicate primary keys... What is the solution for this problem, Sir...? Thank you, Sir...
@IamAdarshKumar
@IamAdarshKumar Жыл бұрын
Use an "updated_at" field as a timestamp
@bhaveshkilaru7555
@bhaveshkilaru7555 Жыл бұрын
Hi Sir, SInce you mentioned as SCD type 2, if another new record for combination(222, unit 2) comes from source, then how are you deleting the current inactive record and make the current active record as in active. As SCD type 2 should maintain only one history record?
@tallaravikumar4560
@tallaravikumar4560 Жыл бұрын
Is it really one version of history should be maintained in scd2 ?
@madhurasarkar4983
@madhurasarkar4983 9 ай бұрын
In step4 we are filtering the record where targetpk is not null...and then we are making mergekey column with null value.. There is a mistake while explaining.
@rajasdataengineering7585
@rajasdataengineering7585 9 ай бұрын
Apologies if I made any mistake while explaining. Hope you got the concept
@madhurasarkar4983
@madhurasarkar4983 9 ай бұрын
@@rajasdataengineering7585 yes I got the concept thanks for explaining nicely.
@niharikakota6027
@niharikakota6027 Жыл бұрын
Hi sir, I am doing merge Operation which is working for the first time... When ran that merge statement multiple times it is giving ambiguity error.... Can you please suggest solution for this
@indhu139
@indhu139 10 ай бұрын
Hi Niharika .. even am facing this issue for the first few runs it works fine then it shows duplicate records .. how did u resolve that
@srinivasvemula-f2y
@srinivasvemula-f2y 6 ай бұрын
🙏🏻🙏🏻🙏🏻🙏🏻
@rajasdataengineering7585
@rajasdataengineering7585 6 ай бұрын
Keep watching
@niharikakota6027
@niharikakota6027 Жыл бұрын
Hi sir , I am using simple SQL merge query in databricks... MERGE INTO ...TARGET TABLE... USING ...SOURCE TABLE... ON ....MERGE CONDITION... WHEN MATCHED THEN UPDATE SET ....MATCH CONDITION.... WHEN NOT MATCHED THEN INSERT () VALUES() but this is working fine for first time, wheni ran same condition multiple times.. it is giving ambiguity error
@niharikakota6027
@niharikakota6027 Жыл бұрын
Please suggest solution for this issue
@kartheeksurampalli7583
@kartheeksurampalli7583 Жыл бұрын
@@niharikakota6027 One reason could be multiple matches on the condition, try narrowing the condition to single match and check
@TechYantra2
@TechYantra2 Жыл бұрын
thanks !
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Welcome!
@ardavanmoin3321
@ardavanmoin3321 2 жыл бұрын
Hi Raja, is it possible to share this notebook please?
@aishwaryap.s.v.s7387
@aishwaryap.s.v.s7387 2 жыл бұрын
can you give documentation link of slowly changing dimension
@amandeepsingh7648
@amandeepsingh7648 5 ай бұрын
HI i am getting this error while doing Merge - com.databricks.sql.transaction.tahoe.DeltaUnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same, please suggest
@RahulJaiswal-nw1kf
@RahulJaiswal-nw1kf Жыл бұрын
can we pls have the code file please or the Ipython Notebook if possible ??
@farookshaik7462
@farookshaik7462 2 жыл бұрын
What about delete?
@anjibabumakkena
@anjibabumakkena Жыл бұрын
seems to be very lengthy process sir.
@rajasdataengineering7585
@rajasdataengineering7585 Жыл бұрын
Yes that's right. SCD type 2 is always quite lengthy
@shankrukulkarni3234
@shankrukulkarni3234 Жыл бұрын
Can you please share the excel
@AIartWork-pw3lo
@AIartWork-pw3lo Жыл бұрын
When I tried the same , I'm getting the below error. What could be the reason?. I have a row with updated values. This should eventually make flag as N and insert the updated value as new row. But this is failing in merge logic. com.databricks.sql.transaction.tahoe.DeltaUnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same
62. Databricks | Pyspark | Delta Lake: Time Travel
8:47
Raja's Data Engineering
Рет қаралды 15 М.
버블티로 부자 구별하는법4
00:11
진영민yeongmin
Рет қаралды 20 МЛН
Это было очень близко...
00:10
Аришнев
Рет қаралды 4,1 МЛН
Good teacher wows kids with practical examples #shorts
00:32
I migliori trucchetti di Fabiosa
Рет қаралды 12 МЛН
REAL 3D brush can draw grass Life Hack #shorts #lifehacks
00:42
MrMaximus
Рет қаралды 11 МЛН
SCD: Slowly changing dimensions explained with real examples
25:43
75. Databricks | Pyspark | Performance Optimization - Bucketing
22:03
Raja's Data Engineering
Рет қаралды 18 М.
74. Databricks | Pyspark | Interview Question: Sort-Merge Join (SMJ)
16:46
Raja's Data Engineering
Рет қаралды 16 М.
Understand Slowly Changing Dimensions
23:21
Bryan Cafferky
Рет қаралды 21 М.
71. Databricks | Pyspark | Window Functions: Lead and Lag
15:04
Raja's Data Engineering
Рет қаралды 9 М.
65. Databricks | Pyspark | Delta Lake: Vacuum Command
15:32
Raja's Data Engineering
Рет қаралды 17 М.
버블티로 부자 구별하는법4
00:11
진영민yeongmin
Рет қаралды 20 МЛН