Comments
@use-lucky 1 day ago
How can I get this PPT?
@manjunathb.n7465 2 days ago
Hi Sir, thanks for all your videos. Can you please make a video on how to read private API data incrementally, with credentials and a bearer token, using Databricks?
@rajasdataengineering7585 1 day ago
Hi Manju, sure, I will create a video on this requirement soon.
@yash.1th 2 days ago
Hi Sir, can you please start the Unity Catalog series?
@rajasdataengineering7585 2 days ago
Hi Yash, sure, I will start soon.
@suryasabulalmathew1331 5 days ago
Hi Sir, in these examples you have shown, can you explain why many jobs are created for each join query you executed? I have understood the stages, the explain plan, and the DAG, but the number of jobs is not clear to me. Can you shed some light on it?
@amiyarout217 6 days ago
nice
@rajasdataengineering7585 6 days ago
Thanks! Keep watching
@HariMuppa 6 days ago
Your explanation is greatly appreciated.
@rajasdataengineering7585 6 days ago
Glad it was helpful! Keep watching
@avirupmukherjee2080 8 days ago
Hi Raja, thanks for explaining. I just wanted to check: the first cluster you created was a job cluster, but later you created two all-purpose clusters. Could you please explain why you used all-purpose clusters instead of job clusters?
@AnujRastogi-m6v 9 days ago
I need more SQL and PySpark problems for practice. Is there any paid course of yours that I can purchase just for practice?
@TejaswiniKulkarni-g9s 9 days ago
Can you please add a video on the Unity Catalog feature in Databricks?
@milind6217 11 days ago
Hello @rajasdataengineering7585, could you please share the CSV files used in the course? The course provides excellent information, and having the files available would be really helpful.
@saifahmed7843 11 days ago
Greetings, is this sufficient for the Databricks Certified Associate Developer exam? I'd appreciate any clarity.
@rajasdataengineering7585 11 days ago
Yes, I have covered almost all the topics. If you understand all the concepts explained on this channel, that's more than sufficient.
@saifahmed7843 11 days ago
@rajasdataengineering7585 Thank you so much for the info.
@rajasdataengineering7585 10 days ago
Welcome
@raviyadav2552 11 days ago
I found the explanation very detailed. Great work, keep it up, sir!
@rajasdataengineering7585 11 days ago
Thank you, Ravi! Keep watching
@ShivamGupta-wn9mo 12 days ago
Hi Raja, can you cover all streaming-related use cases with respect to Spark and Databricks?
@rajasdataengineering7585 11 days ago
Sure, Shivam, I will create videos soon.
@ShivamGupta-wn9mo 12 days ago
Great video, Raja!
@rajasdataengineering7585 12 days ago
Thank you! Keep watching
@saayamprakash8832 12 days ago
Hi sir, can we group by dept, take the average, and then filter using a HAVING clause?
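For context, a minimal PySpark sketch of that idea (the column names dept and salary here are assumptions, not necessarily the ones used in the video): the DataFrame API has no HAVING keyword, so the same effect comes from a filter applied after the aggregation, while Spark SQL supports HAVING directly.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical employee data: (name, dept, salary)
df = spark.createDataFrame(
    [("a", "IT", 5000), ("b", "IT", 7000), ("c", "HR", 3000)],
    ["name", "dept", "salary"],
)

# DataFrame API: 'HAVING' is just a filter on the aggregated column
df.groupBy("dept").agg(avg("salary").alias("avg_salary")) \
    .filter(col("avg_salary") > 4000) \
    .show()

# Spark SQL: the HAVING clause works as usual
df.createOrReplaceTempView("emp")
spark.sql("select dept, avg(salary) as avg_salary from emp group by dept having avg(salary) > 4000").show()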
@tusharagarwal3553 13 days ago
Where can we get these notebooks? Please share links to all the notebooks so that we can revise.
@ShivamGupta-wn9mo 14 days ago
great
@rajasdataengineering7585 14 days ago
Thank you! Keep watching
@AnjanaDevi-e4s 15 days ago
Really awesome explanation. No other video taught the difference between the above three with this much clarity, thank you.
@rajasdataengineering7585 14 days ago
Thank you! Keep watching
@ShivamGupta-wn9mo 15 days ago
We need a separate series on Spark Streaming.
@rajasdataengineering7585 14 days ago
Sure, I will create one soon.
@ShivamGupta-wn9mo 16 days ago
Simpler way:

from pyspark.sql.functions import explode

df_flattened = df.select("*", explode("Employee").alias("new_emp")) \
    .drop("Employee") \
    .select("Department", "new_emp.emp_name", "new_emp.salary", "new_emp.yrs_of_service", "new_emp.Age")
df_flattened.show()
@HariprasanthSenthilkumar 16 days ago
In step 2 (Databricks cell no. 7), while executing filterDF, you are concatenating all columns of both source and target and filtering on whether the resulting strings are equal. But with this approach, rows whose columns actually differ between source and target can still produce the same concatenated string; such rows should be updated in the final table, yet the given condition filters them out. Example: dim1=100, dim2=201, dim=300, dim=400 versus Target_dim1=100, Target_dim2=20, Target_dim=1300, Target_dim=400. Both concatenate to 100201300400, so in this case the row is filtered out without being updated.
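To make the point concrete, a small hedged sketch (the column names below are hypothetical, not the ones from the notebook): a plain concat of the two rows above collides on the same string, while concat_ws with a delimiter keeps the column boundaries and so distinguishes them. Comparing the columns individually, or hashing the delimited string, avoids the false match as well.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws

spark = SparkSession.builder.getOrCreate()

# One hypothetical row whose source and target columns differ but whose plain concat collides
df = spark.createDataFrame(
    [("100", "201", "300", "400", "100", "20", "1300", "400")],
    ["dim1", "dim2", "dim3", "dim4", "t_dim1", "t_dim2", "t_dim3", "t_dim4"],
)

# Plain concat: "100201300400" == "100201300400", so the row looks unchanged and is filtered out of the update set
plain = df.filter(
    concat("dim1", "dim2", "dim3", "dim4") == concat("t_dim1", "t_dim2", "t_dim3", "t_dim4")
)

# concat_ws keeps boundaries: "100|201|300|400" != "100|20|1300|400", so the row is correctly seen as changed
delimited = df.filter(
    concat_ws("|", "dim1", "dim2", "dim3", "dim4") == concat_ws("|", "t_dim1", "t_dim2", "t_dim3", "t_dim4")
)

print(plain.count(), delimited.count())  # 1 row wrongly matched with plain concat, 0 with the delimiter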
@WB_Tom 16 days ago
I learned that Databricks uses the data lakehouse and the Delta Lake format. In that case, if I create or add files to DBFS, will they be converted into Delta Lake or stay as they are? And is DBFS the delta lakehouse?
@ShivamGupta-wn9mo 17 days ago
Hi Raja, can you also add a detailed Spark Streaming section? Thanks, your content is great!!
@rajasdataengineering7585 17 days ago
Hi Shivam, sure, I will cover advanced concepts in Spark Streaming. I have already covered the basics in one of the previous videos.
@AjithM-h5q 17 days ago
I have chosen my career path by learning from and watching these videos, Raja. Thanks much!
@rajasdataengineering7585 17 days ago
All the best! Keep watching
@AjithM-h5q 17 days ago
Hi, all the sessions are well organized with detailed explanations. Can we have any notes for these topics to summarize or revise for learning purposes? Thanks much, and great work!
@wolfguptaceo 18 days ago
Sir, why did you switch from Databricks Community Edition to Azure in this video?
@rajasdataengineering7585 18 days ago
Hi, not all features are available in Community Edition, so I used Azure Databricks to cover all the important features.
@wolfguptaceo 18 days ago
@rajasdataengineering7585 Hi sir, thanks for clarifying.
@ShivamGupta-wn9mo 19 days ago
from pyspark.sql.functions import col, when

df_ans = df.withColumn(
    "new_id",
    when((col("id") % 2 != 0) & (col("id") != df.count()), col("id") + 1)
    .when(col("id") % 2 == 0, col("id") - 1)
    .otherwise(col("id")),
) \
    .drop("id") \
    .orderBy(col("new_id"))
df_ans.show()

df.createOrReplaceTempView("students")
spark.sql('''
    select *,
           case when id % 2 == 0 then id - 1
                when id % 2 != 0 and id != (select count(*) from students) then id + 1
                else id
           end as new_id
    from students
    order by new_id
''').show()
@ShivamGupta-wn9mo 20 days ago
My sol:

from pyspark.sql.functions import dense_rank, min, max
from pyspark.sql.window import Window

window_base = Window.orderBy("date")

# DataFrame API
df_t = df.withColumn(
    "diff",
    dense_rank().over(window_base) - dense_rank().over(window_base.partitionBy("status")),
) \
    .groupBy("status", "diff") \
    .agg(min("date").alias("start_date"), max("date").alias("end_date")) \
    .orderBy("start_date")
df_t.show()

# Spark SQL 1
df.createOrReplaceTempView("match")
spark.sql('''
    with cte as (
        select *,
               dense_rank() over (order by date)
                 - dense_rank() over (partition by status order by date) as diff
        from match)
    select status, diff, min(date) as start_date, max(date) as end_date
    from cte
    group by status, diff
    order by start_date
''').show()

# Spark SQL 2
df.createOrReplaceTempView("match")
spark.sql('''
    with cte as (
        select *,
               dense_rank() over (order by date) as rn1,
               dense_rank() over (partition by status order by date) as rn2,
               dense_rank() over (order by date)
                 - dense_rank() over (partition by status order by date) as diff
        from match)
    select a.status, max(a.start_date) as start_date, max(a.end_date) as end_date
    from (select date, status, diff,
                 min(date) over (partition by status, diff) as start_date,
                 max(date) over (partition by status, diff) as end_date
          from cte
          order by date) a
    group by a.status, a.diff
    order by start_date asc
''').show()

Enjoy!
@rajasdataengineering7585 19 days ago
Thank you for sharing your approach
@amiyarout217 20 days ago
great explanation
@rajasdataengineering7585 19 days ago
Glad it was helpful! Keep watching
@aiswaryakraj4210 20 days ago
data = [
    ("2020-06-01", "Won"),
    ("2020-06-02", "Won"),
    ("2020-06-03", "Won"),
    ("2020-06-04", "Lost"),
    ("2020-06-05", "Lost"),
    ("2020-06-06", "Lost"),
    ("2020-06-07", "Won")
]
df = spark.createDataFrame(data, ['event_date', 'event_status'])
@devmaharaj4640 21 days ago
Hello, your videos inspire me to choose data engineering as a career. Why did you stop making interview question videos?
@rajasdataengineering7585 21 days ago
Hello, thanks for your comment! I have covered almost all possible questions, which is why I stopped. I will revisit them once again and cover any missing topics.
@lanyofrancis1195 21 days ago
Thanks for the videos and the nice explanation. I have a question: the default partition size is 128 MB, so for a 2 GB RDD file, the number of partitions created will be 16. In your example, are you changing the default partition size to 10 MB instead of the default 128 MB? Please correct me if I am missing anything.
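For reference, a small hedged sketch of the knob that usually controls this (assuming the video tunes spark.sql.files.maxPartitionBytes; the exact setting demonstrated there may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default input partition size is 128 MB, so a 2 GB file is read into roughly 2048 / 128 = 16 partitions
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Lowering it to 10 MB would split the same 2 GB file into roughly 2048 / 10 ≈ 205 partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", str(10 * 1024 * 1024))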
@ShivamGupta-wn9mo 22 days ago
If, after reading and writing all the CSV files, we upload the same 5 CSV files again, will it process them again or skip them after checking the checkpoint metadata?
@rk-ej9ep 22 days ago
Such great info. Awesome!
@rajasdataengineering7585 22 days ago
Glad it was helpful! Keep watching
@saturdaywedssunday 23 days ago
Hi Anna, nice to see your videos, and you do reply to all our doubts. One thing related to the transformations applied here: we haven't done any kind of partitioning yet, right? We have just read the data, so how are we confirming these as narrow and wide transformations? Will partitioning happen by default once the data is read? Please clarify, Anna.
@MichaelAdebayo-y4n 23 days ago
Would watching these videos be enough to help pass the Databricks Certified Associate Developer for Apache Spark 3.0 - Scala exam?
@rajasdataengineering7585 23 days ago
Yes, most of the concepts are covered on this channel.
@ShivamGupta-wn9mo 24 days ago
great
@rajasdataengineering7585 23 days ago
Thanks
@RajBalaChauhan-b4w 24 days ago
Thank you for such clarity. But I have a query: the Catalyst Optimizer will consider a broadcast join by itself if a table is small enough to fit in memory, even if we haven't applied a broadcast join explicitly. So is it really going to help us with performance optimization, or will the performance remain the same even after applying a broadcast join?
@rajasdataengineering7585 23 days ago
The Catalyst Optimizer won't apply a broadcast join by default. Either we need to apply it manually, or Adaptive Query Execution needs to be enabled (AQE is enabled by default in recent Spark versions).
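For context, a minimal hedged sketch of applying a broadcast join manually (the tables here are hypothetical); Spark also exposes spark.sql.autoBroadcastJoinThreshold, which governs when small tables are broadcast automatically:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical small dimension table and large fact table
small_df = spark.range(100).withColumnRenamed("id", "dept_id")
large_df = spark.range(1000000).withColumnRenamed("id", "dept_id")

# Manual broadcast hint: the small side is shipped to every executor, avoiding a shuffle of the large side
joined = large_df.join(broadcast(small_df), on="dept_id")
joined.explain()  # the physical plan should show a BroadcastHashJoin

# Threshold for automatic broadcasting of small tables; setting it to -1 disables auto-broadcast
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))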
@chaitanyanagare757 24 days ago
Thanks, Raja.
@rajasdataengineering7585 24 days ago
Welcome!
@chaitanyanagare757 24 days ago
Great video content. Thank you so much!
@rajasdataengineering7585 24 days ago
Glad you liked it! Keep watching
@PraveenKumar-ev1uv 24 days ago
How can I get the opportunity to work on Databricks with PySpark? What real-time scenarios should I get started with?
@ShivamGupta-wn9mo 25 days ago
great playlist
@rajasdataengineering7585 24 days ago
Thank you
@nitinpandey4857 26 days ago
How does spark.read differ from spark.load?
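For context, a minimal sketch of how the two usually relate (the file paths here are hypothetical): spark.read returns a DataFrameReader, and load() is a method on that reader rather than a separate entry point on the SparkSession, so there is no spark.load by itself.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Format-specific shorthand on the DataFrameReader
df_csv = spark.read.option("header", True).csv("/tmp/example.csv")

# Generic load(): the format is set explicitly...
df_generic = spark.read.format("csv").option("header", True).load("/tmp/example.csv")

# ...or, with no format given, load() falls back to spark.sql.sources.default (parquet by default)
df_parquet = spark.read.load("/tmp/example.parquet")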
@Prakash-r9o2w 28 days ago
Hi sir, if possible, can you provide the CSV files that you have used? Thank you.
@rabink.5115 29 days ago
I believe Auto Loader can now also be applied as batch processing, so it can be triggered whenever files arrive.
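A minimal hedged sketch of that pattern, assuming a Databricks notebook where spark is predefined and all paths are hypothetical: Auto Loader (the cloudFiles source) combined with an availableNow trigger processes whatever files have arrived since the last run and then stops, while the checkpoint records which files were already ingested.

stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/tmp/autoloader/schema")  # hypothetical path
    .load("/tmp/autoloader/landing")                                # hypothetical input directory
)

# availableNow=True runs like a batch job on each invocation; previously ingested files are skipped
(
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/autoloader/checkpoint")     # hypothetical path
    .trigger(availableNow=True)
    .start("/tmp/autoloader/output")
)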
@rabink.5115 1 month ago
Hi Raja, do you have this code stored on GitHub?
@varalaxmi1742 1 month ago
Hi, in this video you say there is serialization and deserialization overhead with off-heap memory, but in previous videos you said we can avoid serialization and deserialization with off-heap memory. Could you please clarify which one is correct?
@ramvrikshsharma7724 1 month ago
Can you please make a video on DLT?
@rajasdataengineering7585 1 month ago
I have covered the basics of DLT in another playlist.
@ankitachaturvedi1138 1 month ago
This interview series is really helpful. I haven't worked much on Databricks, but these videos give great insight into its internal workings and concepts. I am able to crack interviews. Thanks a lot for such informative videos!!
@rajasdataengineering7585 1 month ago
Glad to hear this! Keep watching
@amiyarout217 1 month ago
Use ChatGPT for sample dataset creation.