Comments
@Ali-q4d4c 3 days ago
amazing explanation. are you planning to make more videos on performance tuning?
@BryanCafferky 1 day ago
Yes. It's on my list. :-) Thanks
@alivecoding4995 3 days ago
How is DataBricks an alternative to Airflow? DataBricks isn't a workflow engine, is it?
@BryanCafferky 3 days ago
Yes. Databricks has a workflow service that is powerful. See docs.databricks.com/en/jobs/index.html
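For readers who want to see what that workflow service looks like in code, here is a minimal, hedged sketch of creating a scheduled notebook job through the Databricks Jobs REST API; the workspace URL, token, notebook path, cluster settings, and cron schedule are placeholder assumptions, not values from the video or the docs page above.

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "<personal-access-token>"                         # placeholder token

# A single-task job that runs a notebook on a small job cluster every night at 2:00 AM.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "load_bronze",
            "notebook_task": {"notebook_path": "/Workspace/etl/load_bronze"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```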
@needtorename6425 4 days ago
Why do we need to deactivate the environment, considering we want to create it for all of our projects and we need it?
@BryanCafferky 3 days ago
You create a separate environment for each project so you must deactivate one to switch to another.
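As an illustration of the per-project environment idea (not from the video), here is a minimal Python sketch using only the standard library; the project names are hypothetical.

```python
import venv

# Each project gets its own isolated environment with its own site-packages.
venv.create("projA/.venv", with_pip=True)  # environment for project A
venv.create("projB/.venv", with_pip=True)  # environment for project B

# Only one environment is active per shell session, so switching projects means
# deactivating the current one first (shell commands shown as comments):
#   source projA/.venv/bin/activate    # work on project A
#   deactivate                         # leave A before switching
#   source projB/.venv/bin/activate    # work on project B
```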
@needtorename6425 4 days ago
After days of struggle, I found this video, which managed to clearly explain the concept of a virtual environment. Great job!
@tomwright9904 5 days ago
cron job + make?
@tomwright9904 5 days ago
Hmm... not sure about the idea of throwing away configuration that is written down in favor of a bunch of undocumented, non-recreatable jobs.
@shimeineri208 6 days ago
Thank you
@IIIxwaveIII 7 days ago
I liked this vid. I understand your view, but I'm not sure I would call Airflow a scheduler... Anyway, it's late 2024: which open-source, on-prem tool (bare metal and private cloud) would you use for ETL processes? (The more options the merrier.) 10x!
@BryanCafferky 6 days ago
Good question. When I did on-prem ETL, we used SQL Server Integration Services (SSIS), which is proprietary but an excellent ETL tool. For open source, I really like Dagster (dagster.io/), which is a Python-based data orchestration framework. While it has some of the same issues as Airflow, the data-centric nature of Dagster, including integrated data validation, data lineage tracking, and composability, makes it far superior to Airflow in my opinion. I have not used it in production, so I would recommend a POC and pilot before committing to Dagster. The free Dagster University online training is good to get started. Dagster is still an orchestrator, not an ETL service. I've seen and looked at some open-source ETL tools but have not dug in enough to recommend any. Does that help?
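To make the Dagster suggestion concrete, here is a minimal, hedged sketch of its asset-based style (it assumes dagster and pandas are installed; the asset names and logic are illustrative only, not from the video):

```python
import pandas as pd
from dagster import Definitions, asset, materialize


@asset
def raw_orders() -> pd.DataFrame:
    # In a real pipeline this would read from a source system.
    return pd.DataFrame({"order_id": [1, 2], "amount": [100.0, 250.0]})


@asset
def order_totals(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Declaring raw_orders as a parameter is what gives Dagster the
    # data lineage mentioned above.
    return raw_orders.assign(amount_with_tax=raw_orders["amount"] * 1.07)


defs = Definitions(assets=[raw_orders, order_totals])

if __name__ == "__main__":
    # Materialize both assets locally; in a deployment you would use `dagster dev`.
    materialize([raw_orders, order_totals])
```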
@AdelSa1023 7 days ago
Hi Bryan, thanks for creating such an informative series. I endorse you for your work here. I have a comment regarding the spark-sql shell, which you said does not exist. I have installed Spark 3.5.2 locally on my Ubuntu machine and it does contain the spark-sql shell.
@BryanCafferky 6 days ago
Interesting. I think that must have been added to a new release of Spark since I recorded this video. Thanks for the comment.
@anshuman7559 8 days ago
41:18 In this example, why are 2000 and 8% not considered facts? They are measurable metrics. Please, someone explain. Thanks.
@BryanCafferky 6 days ago
Sorry for any confusion. 2,000 is a fact because it is a quantity and is likely to be summarized by the business. 8% is an attribute of the payment method, so in this case I am saying I would add it to the payment dimension. If the business wants to focus on the credit card interest, e.g., aggregate it over dimensions, then it could be created as a fact. This is where there is ambiguity. How the data is used will drive whether it is a fact or a dimension, but facts must be quantifiable. Thanks for the question.
@frag_it 9 days ago
The new way is to use streaming tables or materialized views; there are no more live tables. Also, the implementation I am trying with cloud_files doesn't seem to be working at all:
CREATE OR REPLACE MATERIALIZED VIEW mat_tst AS
SELECT * FROM cloud_files("/Volumes/main/bronze/csv", "csv",
  map('schema', 'ID INT, Name STRING, Shortcode STRING, Category STRING', 'header', 'true', 'mergeSchema', 'true'))
@bulgunchimidova9081 12 days ago
Thank you so much, Bryan! So glad I found you on Reddit. Many were recommending your channel.
@BryanCafferky 12 days ago
You're welcome! Glad my content is helpful.
@jeremyturner4327 13 days ago
The Data tab appears to be absent in the new UI?
@BenMai178 14 days ago
Good explanation
@dunlapww 14 days ago
This is a phenomenal presentation on dimensional modeling, but I don't understand the implementation of surrogate keys. I feel like I'm missing an obvious, low-compute way of maintaining all the surrogate keys on your facts. No videos I've seen discuss this. It seems that every time a new fact record is generated, you have to join every related dim on the natural key and update the fact with that dim's surrogate key, so that you can later perform joins using the surrogate key. Am I thinking through this correctly?
@BryanCafferky 13 days ago
Yeah. It does add complexity, but you have the gist of it. Surrogate keys are particularly important when you want SCD history, since natural keys would result in duplicate keys on the dim tables. Also, they isolate the data warehouse from changes in the backend systems. But they do add some extra work.
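To illustrate the lookup pattern being confirmed here, below is a minimal PySpark sketch; table and column names such as dw.dim_customer, customer_sk, and customer_nk are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Dimension carries both the surrogate key (customer_sk) and natural key (customer_nk).
dim_customer = spark.table("dw.dim_customer")
# Staged fact rows arrive with only the natural key.
stg_sales = spark.table("staging.sales")

fact_sales = (
    stg_sales
    .join(
        dim_customer.select("customer_sk", "customer_nk"),
        on="customer_nk",
        how="left",
    )
    # Unmatched natural keys fall back to a reserved "unknown member" key (-1).
    .withColumn("customer_sk", F.coalesce(F.col("customer_sk"), F.lit(-1)))
    .drop("customer_nk")
)

fact_sales.write.mode("append").saveAsTable("dw.fact_sales")
```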
@dunlapww 11 days ago
Thank you for confirming my understanding and great presentation!
@dhayes_ 15 days ago
Hey Bryan! This is amazing, thank you for the great video. Could we get a download link for the slide deck?
@bottle0ketchup211 16 days ago
Awesome video, appreciate your knowledge sharing as always!
@rmj5410 17 days ago
Absolutely the best explanation of Databricks I've ever heard
@banihas22 17 days ago
Excellent intro to DuckDB! I love how you start with an overview to give better context ❤
@hassanahmed2781 18 days ago
Thank God. I wanted to use it for a non-data-engineering task and was so confused, since every video talked about data engineering and ETL, and it looked like a scheduler that managed dependent tasks. Thanks for this amazing video explaining what it actually is, especially when something is promoted in the wrong direction by almost everyone else. This is the stuff I love YouTube for 🤍
@aeggeska1 20 days ago
I have been reading their website, and I just can't understand what Airflow even is or does.
@georgwagner5577 21 days ago
I used to work with pandas - I hate it. Time to switch :) Thank you, I love your guidance.
@antwanto 22 days ago
wow that was very informative and amazing, thank you for your efforts
@mustafakara7739 22 days ago
I learned Docker just to automate R scripts along with Python. I understand what you mean about the steep learning curve; it requires an intermediate/advanced level of Docker and networking knowledge. But the DockerOperator can be used to automate other programming languages.
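For readers curious about the DockerOperator approach this comment mentions, here is a minimal, hedged Airflow sketch; it assumes a recent Airflow 2.x with the Docker provider installed, and the image name, script path, and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="r_script_in_docker",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Airflow only schedules the container; the R script runs inside the image.
    run_r = DockerOperator(
        task_id="run_r_script",
        image="my-org/r-etl:latest",       # hypothetical image with R and the script baked in
        command="Rscript /scripts/etl.R",
    )
```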
@S2100211 23 days ago
You need to get rid of your picture, as it takes up space on the screen and hides the code that you type, and also close the commands box, as it also takes up space and is confusing. Apart from that, this was a very useful explanation.
@JuanRuiz-pf5eu 23 days ago
Wonderful!! Thanks! Great Job!!
@tomfontanella6585 26 days ago
Excellent video. Thanks for the fair and clear perspective.
@BryanCafferky 26 days ago
You're welcome!
@mubar2 27 days ago
From your perspective, why would someone choose MS Fabric over DBX? Stored procedures, the quickstarts, and ADF? It seems like costs would be higher in MS Fabric from having cluster-specific servers.
@BryanCafferky 27 days ago
Good question. 1) Fabric is an end-user, self-service platform, whereas Databricks is more of a data engineering and AI solution-building platform. 2) Fabric is extremely well suited to supporting Power BI reporting and can even eliminate the need to load a Tabular Model. 3) The t-shirt-sized Fabric capacities mean you select a fixed compute level that all the Fabric objects and services share, so the cost is contained. However, you could find that you need to increase the Fabric capacity at some point. 4) With Fabric, you don't need to manage or even explicitly create the underlying Azure resources; Fabric does it for you and even hides the details. This is extended with the OneLake and shortcut concepts. Make sense?
@mubar2 27 days ago
@@BryanCafferky Would you then say that Fabric is more akin to a Modern Data Stack and more focused on the Analytics Engineer role rather than on a dimensionally modeled warehouse?
@BryanCafferky 26 days ago
@@mubar2 I think Fabric is aimed at the end user, not the engineer. It gives business users tools to do analysis, build dashboards, and build solutions, but they will not have the skills to build a Kimball-style data warehouse, in my opinion. They will get something up and running fast.
@SYLDE-c3v 29 days ago
Thanks for the explanations. Really liked your approach, good pace, not too easy, not too long. I took some notes :)
@BryanCafferky 28 days ago
Thank you! Glad it is helpful.
@user-ox1ud8zn9g 1 month ago
I would add that, after conducting data analysis, the data scientist develops and tunes a machine learning (ML) model or a set of models. They test these models and design the overall ML pipeline to prevent issues like model drift and other undesirable effects. The ML engineer then takes over, building the ML pipeline, including configuration files, and deploying it. They also test the deployment cycle and provide feedback to the data scientist. This process includes monitoring and notification systems as part of the MLOps framework. This description is based on my personal experience, though practices may vary across different industries.
@Srinivasan-xd9ql 1 month ago
There is no code in the GitHub repo related to the DLT.
@regilenemariano9244 1 month ago
At last, someone who isn't Indian, for the love of God. Good Lord. Thank you for this video!
@konzy2 1 month ago
In Airflow, the correct pattern is not to write top-level code. In the Postgres example, in production you could have a dbt file or .sql file that contains the queries. Airflow specifically says not to do processing on Airflow itself; it's used for kicking off jobs elsewhere, monitoring them, and dealing with the results. Some examples would be running a Glue job, running Apache Spark, starting an EMR cluster, making a RESTful call to an API, or training a model on a Ray cluster. Most of your code to do the data processing should be on those platforms and can be in Scala, Java, Golang, or C. Apache NiFi is also good for ETL, but parts of it require that data move through its processors, such as converting from one file format to another or regexing columns. So some parts of it need more compute to process the data, requiring you to scale the NiFi cluster. NiFi 1.x is Java-only; only recently, with 2.x, is Python supported.
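As a sketch of the "kick off jobs elsewhere" pattern this comment describes, here is a minimal, hedged Airflow DAG that submits a Spark job to an external cluster; it assumes a recent Airflow 2.x with the Spark provider installed, and the connection id, application path, and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="submit_spark_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # No data processing happens in Airflow itself; it just submits and monitors.
    transform = SparkSubmitOperator(
        task_id="transform_orders",
        conn_id="spark_default",               # points at the external Spark cluster
        application="/jobs/transform_orders.py",
    )
```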
@BryanCafferky 1 month ago
Thanks for your thoughts.
@jaydeep244821 1 month ago
Your video series is incredibly informative and offers a fantastic way to grasp the concepts of Spark. I truly appreciate the effort and dedication you’ve put into creating these videos and sharing your knowledge. Great work! 👏
@BryanCafferky 1 month ago
Thank You!
@JustinTimeNocap 1 month ago
Great content. Your teaching style is on point: practical and straight to the point. Spark is really blowing up in the market. Regards from Brazil.
@BryanCafferky 1 month ago
Thank You! Yes. Spark is on fire. :-)
@shreyasd99 1 month ago
Hi, I am also trying to build a DLT pipeline manually. I have done everything the same way, but it shows "Waiting for resources" for a very long time.
@BryanCafferky 1 month ago
Hmmm... Not sure what you mean by building manually. I think that's the only way you can create DLT pipelines. Bear in mind, you can NOT run a notebook directly in the notebook UI for DLT.
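For context, here is a minimal, hedged sketch of what a Python DLT notebook looks like; as noted above, it only runs when attached to a DLT pipeline, not cell-by-cell in the notebook UI. The path and table names are illustrative, and `spark` is provided by the pipeline runtime.

```python
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Raw orders loaded incrementally from cloud storage")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/main/bronze/orders_csv")       # hypothetical volume path
    )


@dlt.table(comment="Cleaned orders")
def orders_silver():
    return dlt.read_stream("orders_bronze").where(F.col("order_id").isNotNull())
```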
@shreyasd99 1 month ago
Hey, sorry, I didn't mean building manually. I meant running manually after managing the cluster configurations (node type ID for both the driver and the worker) and then choosing whether to store the target schema in the catalog's schema. I've given the location of the notebook for the pipeline. Not sure where it's going wrong...
@BryanCafferky 1 month ago
@@shreyasd99 Take a look at this blog about DLT cluster configuration docs.databricks.com/en/delta-live-tables/settings.html
@tonatiuhdeleon8236 1 month ago
Does Databricks charge money every time you run a notebook code chunk?
@BryanCafferky 1 month ago
I think it is really the compute you use that costs the money. If it's 40 nodes running in parallel over 200 TB of data for 3 hours, that will cost something, whereas a single node running for a few minutes is cheap. I do it all the time on my personal account.
@tonatiuhdeleon8236 1 month ago
@@BryanCafferky thank you good sir
@osoucy 1 month ago
Awesome decision tree for selecting DLT or not! If people are interested in DLT but are concerned about vendor lock-in or "stickiness", you might want to consider Laktory (kzbin.info/www/bejne/eIuuYYN7YrSln7M), which lets you declare a pipeline using a YAML configuration file and deploy it as DLT, but also as a Databricks job, or even run it locally if you want to move away from Databricks.
@BryanCafferky 1 month ago
Looks interesting. How does it fit in and work with Databricks Asset Bundles (DABs)? Does it work with Azure DevOps? Thanks
@osoucy 1 month ago
@@BryanCafferky It can actually be used as a replacement for DABs. I started working on this project before DABs was announced :) It offers more or less the same capabilities as DABs, except for a few key differences:
- State management is more aligned with Terraform than with DABs, meaning that the state is not automatically saved to your Databricks workspace. As a consequence, deployments are global by default and not "user-specific".
- Laktory supports multiple IaC backends. You can use Terraform, but you can also use Pulumi.
- Laktory supports almost any Databricks resource. You can deploy not only notebooks and jobs, but also catalogs, schemas, tables, clusters, warehouses, vector search endpoints, queries, secrets, etc.
- As per my initial comment, Laktory is an ETL framework, so you can use it to define all your data transformations through SQL statements or Spark Chain, a serialized expression of Spark commands.
In other words, Laktory is like DABs + dbt in a single framework, but with a strong focus on DataFrame and Spark transformations. I haven't used it with Azure DevOps yet, but they would definitely work nicely together using the Laktory CLI. You can find an example of a GitHub Actions workflow here: github.com/okube-ai/lakehouse-as-code/blob/main/.github/workflows/_job_laktory_deploy.yml The syntax is a bit different with Azure DevOps, but it would be very similar.
@phillipdataengineer 1 month ago
Learning so much with your videos! F* awesome, man, thank you!
@BryanCafferky 1 month ago
Glad they are helping.
@SuperIloveeric 1 month ago
This is amazing! Everything you do is so well aligned with your goal to effectively streamline. Thank you.
@BryanCafferky 1 month ago
Thank You!
@yosh_2024 1 month ago
Useful and objective analysis.
@marcjkeppler3590 1 month ago
How does this video have fewer than 1k likes? 😅
@hemalpbhatt 1 month ago
I am getting this error:
UnityCatalogServiceException: [RequestId=4a3d6ef7-7b72-4487-b53b-b08bad1f0894 ErrorClass=INVALID_PARAMETER_VALUE] GenerateTemporaryPathCredential uri /FileStore/tables/DimDate.csv is not a valid URI. Error message: INVALID_PARAMETER_VALUE: Missing cloud file system scheme.
I added dbfs and the syntax still won't run:
[UC_FILE_SCHEME_FOR_TABLE_CREATION_NOT_SUPPORTED] Creating table in Unity Catalog with file scheme dbfs is not supported. Instead, please create a federated data source connection using the CREATE CONNECTION command for the same table provider, then create a catalog based on the connection with a CREATE FOREIGN CATALOG command to reference the tables therein. SQLSTATE: 0AKUC
@BryanCafferky 1 month ago
Looks like you have Unity Catalog enabled and it is conflicting with this code. UC came out after I created this video.
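One hedged illustration of a Unity Catalog-friendly alternative (not shown in the video) is to place the CSV in a UC Volume rather than /FileStore and read it from there; the catalog, schema, and volume names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from a Unity Catalog Volume path instead of the legacy /FileStore location.
dim_date = (
    spark.read
    .option("header", "true")
    .csv("/Volumes/main/default/raw_files/DimDate.csv")
)

# Save as a managed Unity Catalog table (three-level namespace).
dim_date.write.mode("overwrite").saveAsTable("main.default.dim_date")
```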
@ichtot71 1 month ago
Honest question: isn't split-apply-combine just map-reduce?
@BryanCafferky 1 month ago
Looks very similar, but split-apply-combine works on DataFrames whereas map-reduce works on RDDs.
@ichtot71 1 month ago
@@BryanCafferky I thought it was just a pattern?
@BryanCafferky 1 month ago
@@ichtot71 It is supported in the PySpark library. See www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
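As a minimal, hedged sketch of split-apply-combine on a Spark DataFrame using the pandas UDF support discussed in the linked post (column names are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 5.0)], ["grp", "value"])


def demean(pdf: pd.DataFrame) -> pd.DataFrame:
    # "Apply" step: runs once per group as an ordinary pandas DataFrame.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf


# "Split" on grp, "apply" demean to each group, "combine" into one Spark DataFrame.
result = df.groupBy("grp").applyInPandas(demean, schema="grp string, value double")
result.show()
```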
@yosh_2024 1 month ago
I watched some of Bryan's videos earlier. But this introduction video itself is cool....
@l.kennethwells2138 1 month ago
Bryan, I have the Community Edition of Databricks; will I be able to follow along with most of the lectures with just that? Thank you
@BryanCafferky 1 month ago
Yes. You can. It won't scale out to multiple machines but the code should still work.
@NoahPitts713 1 month ago
Great and timeless info as always! Quick question: how has your experience been using the IDENTITY column since making this video?
@BryanCafferky 1 month ago
I have not dared to try it again. lol I don't think it is the best idea for a scaled-out service, to be honest. It's OK if updates are limited, but I would still try to find an alternative.
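One commonly used alternative (not named in the video, so treat this as an illustrative sketch only) is to assign surrogate keys with row_number() offset by the current maximum key; table and column names are hypothetical, and the pattern assumes a single-writer load.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

dim = spark.table("dw.dim_product")             # existing dimension with product_sk
new_rows = spark.table("staging.new_products")  # incoming rows keyed by product_nk

# Offset new keys by the current maximum surrogate key (0 if the dim is empty).
max_key = dim.agg(F.max("product_sk")).first()[0] or 0

keyed = new_rows.withColumn(
    "product_sk",
    F.row_number().over(Window.orderBy("product_nk")) + F.lit(max_key),
)

keyed.write.mode("append").saveAsTable("dw.dim_product")
```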
@NeverthelessXS 1 month ago
great tutorial
@devigugan 1 month ago
Excellent narrative ❤❤❤
@jeanpro1825 1 month ago
So helpful!
@danielejiofor7032 1 month ago
Best DB tutorial out there!!!