Hello from Spain. One question: when you use the Bronze data, the DataFrame is called bronze_df, but when you build the logic it is called delta_bronze. Is this a mistake? Thanks a lot.
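For reference, a minimal sketch of a bronze-layer read/write where the DataFrame keeps a single consistent name; the paths, table name, and file format are assumptions (a Databricks/Delta environment is also assumed), not the video's actual code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

# Read the raw source files into the bronze DataFrame (one consistent name).
bronze_df = spark.read.format("json").load("/mnt/raw/orders")

# Persist the same DataFrame as the bronze Delta table.
bronze_df.write.format("delta").mode("append").saveAsTable("bronze_orders")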
@sureshkondapaturi7403 · 26 days ago
Please provide the repo for this.
@baharfathalizadeh3945 · a month ago
It was helpful, thanks
@DataEngineeringToolbox · 18 days ago
Thanks for your comment.
@baharfathalizadeh3945 · a month ago
It was so helpful
@baharfathalizadeh3945 · a month ago
Thanks
@DataEngineeringToolbox · a month ago
Welcome
@houstonfirefox · a month ago
Good content. Suggestion: Edit out the long pauses and repeated "Uhmmms". I usually have the scripts already written and on another screen ready to copy into the presentation so the learners don't have to wait while I type large sections of code.
@DataEngineeringToolbox · a month ago
Thank you so much for the feedback and kind words! 😊 I really appreciate your suggestion. You're absolutely right: editing out long pauses and "uhms" can make the video more engaging. I'll definitely work on improving that in future videos. Having the script ready to paste is a fantastic idea, too! It would help keep the flow smooth while still demonstrating the process. Thanks for sharing your tip; I'll give it a try!
@lucasrueda3089 · 2 months ago
Hi, I ran this in VS Code and got many errors related to environment variables.
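For readers hitting the same thing, a hedged sketch of the environment-variable setup that often gets local PySpark runs working in VS Code; every path below is a placeholder that depends on your installation:

import os

# Point PySpark at the local Spark and Java installations (placeholder paths).
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.5.0-bin-hadoop3"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"
# Make the driver and the workers use the same Python interpreter.
os.environ["PYSPARK_PYTHON"] = r"C:\Python311\python.exe"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()
print(spark.version)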
@ll_ashu_ll · 5 months ago
Please make more videos related to PySpark and Databricks.
@houstonfirefox · 5 months ago
Very good video. I would recommend ensuring all syntax errors are edited out or re-recorded so the viewer doesn't get confused. The channel name (Data Engineering Toolbox) is a bit confusing, as these function comparisons between SQL Server and PySpark fall under the realm of Data Science. A Data Engineer moves, converts and stores data from system to system, whereas a Data Scientist extracts and interprets the data provided by the Data Engineer. A small point, to be sure, but I wanted to be more accurate. On Variance: the avg_rating column returned integer values because the underlying column review_score was also an integer. To get the PySpark equivalent of a floating-point avg_rating, you could change the column type to FLOAT (not really necessary) or use CONVERT(FLOAT, VAR(review_score)) to return the true (more accurate) variance, complete with decimal places. New sub. I'm interested to see even more Data Science-equivalent functions in SQL Server that may be native (e.g., CORR()) and how to write functions that emulate some of the functionality in PySpark 🙂
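On that integer-aggregate point, a minimal PySpark sketch of casting before aggregating; the DataFrame and column values here are illustrative assumptions, not the video's code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative dataset with an integer review_score column.
reviews_df = spark.createDataFrame([(4,), (5,), (3,)], ["review_score"])

# Casting to double makes the floating-point intent explicit and mirrors the
# CONVERT(FLOAT, VAR(review_score)) suggestion on the SQL Server side.
stats_df = reviews_df.agg(
    F.avg(F.col("review_score").cast("double")).alias("avg_rating"),
    F.var_samp(F.col("review_score").cast("double")).alias("rating_variance"),
)
stats_df.show()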
@DataEngineeringToolbox · 5 months ago
Thanks for your amazing suggestions.
@yashwanthv5604 · 9 months ago
Can you provide the GitHub link for the code?
@Grover-mb · 9 months ago
Good video, friend, but I'm getting an error that I can't solve:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[10], line 14
      3 spark = sqlContext.sparkSession \
      4     #.appName("Mi_Aplicacion") \
      5     #.getOrCreate()
      6
      7 # Your Spark code here
      8 jdbcDF = spark.read.format('jdbc') \
      9     .option('url', url) \
     10     .option('query', query) \
     11     .option('user', user) \
     12     .option('password', password) \
     13     .option('driver', driver) \
---> 14     .load()
     16 spark.stop()  # Don't forget to stop the Spark session when finished

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pyspark\sql\readwriter.py:314, in DataFrameReader.load(self, path, format, schema, **options)
    312     return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    313 else:
--> 314     return self._df(self._jreader.load())

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
...
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Output is truncated.
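A Py4JJavaError on .load() for a JDBC read is frequently caused by the JDBC driver jar not being on the classpath. Below is a hedged sketch of one way to supply it through spark.jars.packages; the Maven coordinates, connection details, and query are assumptions to adjust for your environment:

from pyspark.sql import SparkSession

# Placeholder connection details -- adjust to your SQL Server instance.
url = "jdbc:sqlserver://localhost:1433;databaseName=mydb;encrypt=false"
query = "SELECT TOP 10 * FROM dbo.some_table"
user = "my_user"
password = "my_password"

spark = (
    SparkSession.builder
    .appName("jdbc-read")
    # Pull the SQL Server JDBC driver from Maven (coordinates are an assumption).
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11")
    .getOrCreate()
)

jdbcDF = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("query", query)
    .option("user", user)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
jdbcDF.show()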
@venvilhenrydsilva8354 · 10 months ago
"You are trying to pass an insecure Py4j gateway to Spark. This" " is not allowed as it is a security risk." while sc = SparkContext(conf=conf)
@nickoder4374 · 10 months ago
Rubbish.
@平凡-p1v · 11 months ago
The code is not clear to follow.
@DataEngineeringToolbox · 11 months ago
Thanks, I will try to provide better tutorials in the future.
@sravankumar1767 · 11 months ago
Superb explanation 👌 👏 👍
@DataEngineeringToolbox · 11 months ago
Thanks
@lucaslira5 · a year ago
If you use Auto Loader, this isn't necessary.
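For readers unfamiliar with it, a hedged sketch of the Databricks Auto Loader pattern the comment refers to; the paths, file format, and table name are assumptions, and a Databricks notebook (where spark is predefined) is assumed:

# Incrementally ingest new files from cloud storage into a bronze Delta table.
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .load("/mnt/raw/orders")
)

(
    bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("bronze_orders")
)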
@平凡-p1v · a year ago
The video is not clear even in full-screen mode.
@DataEngineeringToolbox · a year ago
Thanks for the feedback! I apologize for the video quality issue. I'm working on improving it for future videos. Your input is valuable, and I appreciate your understanding.