@Neilakirby 21 days ago
Awesome series. I work for a large organisation and was wondering how to implement the medallion architecture. Would it be best to have workspaces per domain/group, e.g. Transport/Finance/HR, each with bronze/silver/gold?
@pankajmaheshwari128 a month ago
I have just completed this whole playlist, as I'm about to start a new project on Fabric. Thanks a lot for providing this framework-level information. Highly appreciated. Please do continue updating this playlist with more insights. Liked and subscribed 😊
@Nalaka-Wanniarachchi 5 months ago
Nice share on the best-practices round-up...
@geirforsmo8749 2 months ago
Hi, great video series. I'm really enjoying watching and learning from your contributions. One thing I need to ask about: when you read the CSV file, it has no headers. You then do the apply_transformations step, which I believe needs header info to work properly, or am I wrong? I can't see any step or code that adds header info before you do df = PricePaidWrangler.apply_transformations(df). Can you comment on this?
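For what it's worth, I'd have expected a step something like the one below before the transformations. This is just a sketch on my side - the column names and file path are made up, since I don't know your actual schema - but it shows what I mean by supplying the header info explicitly when reading a headerless CSV:

    from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType

    # Made-up column names and path for illustration; the real Price Paid schema will differ.
    schema = StructType([
        StructField("transaction_id", StringType(), True),
        StructField("price", DecimalType(12, 2), True),
        StructField("date_of_transfer", DateType(), True),
        StructField("postcode", StringType(), True),
    ])

    # header=False because the raw CSV has no header row; the schema supplies the column names.
    df = spark.read.csv("Files/bronze/price_paid/*.csv", schema=schema, header=False)
    df = PricePaidWrangler.apply_transformations(df)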
@StefanoMagnasco-po5bb 5 months ago
Thanks for the great video, very useful. One question: you are using PySpark in your notebooks, but how would you recommend modularizing the code in Spark SQL? Maybe by defining UDFs in separate notebooks that are then called in the 'parent' notebook?
@endjin 5 months ago
Sadly you don't have many options here without falling back to Python/Scala. You can modularise at a very basic level by using notebooks as the "modules", each containing a bunch of cells with Spark SQL commands, and then calling those notebooks from the parent notebook. Otherwise, as you say, one step further would be defining UDFs in Python and using spark.udf.register so they can be invoked from SQL - there's a rough sketch of that below. Ed
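To make the UDF route a bit more concrete, here's a minimal sketch - the notebook, function and table names are all illustrative. A "module" notebook registers a Python function as a Spark SQL UDF, and the parent notebook runs it (e.g. via %run) before using the function from SQL:

    # --- "module" notebook, e.g. udf_definitions ---
    from pyspark.sql.types import StringType

    def clean_postcode(value):
        # Illustrative helper: normalise postcodes to upper case with no spaces.
        return value.replace(" ", "").upper() if value is not None else None

    # Register the function so it can be invoked from Spark SQL in this session.
    spark.udf.register("clean_postcode", clean_postcode, StringType())

    # --- parent notebook ---
    # %run udf_definitions
    # ...then, in a Spark SQL cell:
    # SELECT clean_postcode(postcode) AS postcode FROM bronze_price_paid

It's not true modularity in the software-engineering sense, but it keeps the SQL-facing notebooks free of Python boilerplate.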
@MuhammadKhan-wp9zn 3 months ago
This is framework-level work. I'm not sure how many will understand and appreciate the effort you put into creating this video, but I highly appreciate your thoughts and work. At one point I was wondering how I would build a framework if I got the chance, and you've given a very nice guideline here. Once again, thank you for the video; I'd like to see your other videos too.
@endjin 2 months ago
Thanks for the comment! Ed
@gpc39 5 months ago
Very useful. One thing I would like to do is avoid having to add lakehouses to each notebook. Is there a way to do this within the notebook? Ultimately, given two lakehouses, Bronze and Silver, I want to merge from the Bronze table into the Silver table. I have the merge statement working; it's just the adding of the lakehouses that I can't see how to do. I'm doing most of the programming with SQL, as I'm less adept with PySpark, but I'm learning. Thanks, Graham
@endjin 5 months ago
Hi Graham. Thanks for your comment! By "do this within the notebook" do you mean "attach a Lakehouse programmatically"? If so, take a look at this: community.fabric.microsoft.com/t5/General-Discussion/How-to-set-default-Lakehouse-in-the-notebook-programmatically/m-p/3732975 To my understanding, a notebook needs to have at least one Lakehouse attached to it in order to run Spark SQL statements that read from Lakehouses. Once it has one, remember that you can reference other Lakehouses in your workspace by using two-part naming (e.g. SELECT * FROM <lakehouse_name>.<table_name>) without having to explicitly attach them. And if you need to reference Lakehouses from other workspaces, you'll need to add a shortcut first and then use two-part naming. There's a small sketch of what I mean for your merge scenario below. Ed
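Here's a minimal sketch of two-part naming applied to a Bronze-to-Silver merge. The lakehouse, table and column names are illustrative, and it assumes both lakehouses live in the same workspace with at least one of them attached to the notebook:

    # Run from a notebook with (at least) one lakehouse attached. Both lakehouses are
    # referenced with two-part naming (lakehouse.table); names are illustrative.
    spark.sql("""
        MERGE INTO Silver.price_paid AS tgt
        USING Bronze.price_paid_staging AS src
        ON tgt.transaction_id = src.transaction_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

The same statement works in a Spark SQL cell directly; wrapping it in spark.sql() here is just for the example.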
@ThePPhilo 5 months ago
Great videos 👍👍 Microsoft advocates using separate workspaces for bronze, silver and gold, but that seems harder to achieve due to some current limitations. If we go with a single workspace and a folder-based setup like the example, will it be hard to switch to separate workspaces in future? Is there any prep we can do to make that switch easier going forward (or would there be no need to switch to a multi-workspace approach)?
@edfreeman7867 2 months ago
Hi there, thanks for your comment! It's a great question.

Personally, unless there are strong requirements (e.g. high data sensitivity or unique security requirements) for splitting zones into separate workspaces, I would default to one workspace. In my experience, the same team is often involved in managing all three layers of the Lakehouse, and "end users" mostly only get access to semantic models (or the "Gold" layer at a push), so giving every zone its own workspace isn't really justified. Deployment also becomes trickier when multiple workspaces are involved. All that being said, I can appreciate there are scenarios where multiple workspaces are more suitable.

With regard to designing for the future: my first comment would be "only do it if you need to". No need to change tack just for the sake of it. If a single workspace is working for you, stick with it. But if you do need to switch for whatever reason, there will sadly always be a significant migration overhead. The best thing you can do is have well-structured workspaces and notebooks like I've shown here. Highlight the artifacts that are relevant to each layer of the architecture so you can get a picture of the interdependencies. Deployment pipelines and the REST APIs will be your friends if you have to migrate, too - worth getting familiar with these if you haven't already.

The beauty of the medallion architecture is that the core structure of your data pipelines stays the same, whatever workspace architecture you opt for!

Hope this helps, Ed
@ramonsuarez6105 5 months ago
Excellent video, thanks. Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads? Why not use Python files or Spark job definitions instead of some of the notebooks that only have classes and methods? How do you integrate these notebooks with testing in CI/CD?
@endjin 5 months ago
All great questions! I'll be covering most of your points in upcoming videos, but for now I'll try to answer them here.

> Do you ever use a master notebook or pipeline to run the 3 stages one after the other for the initial upload or subsequent incremental uploads?

Yes, we either use a single orchestration notebook or a pipeline to chain the stages together. On the notebook side, more recently we've been experimenting with the "new" mssparkutils.notebook.runMultiple() utility function to create our logical DAG for the pipeline (there's a rough sketch of this below). We've found it quite powerful so far. On the pipeline side, the simplest thing to do is to have multiple notebook activities chained together in the correct dependency tree. The benefit of the first method is that the same Spark session is used, which is particularly appealing in Azure Synapse Analytics, where Spark sessions take a while to provision (although this matters less in Fabric, since sessions are provisioned much more quickly). The benefit of the pipeline approach is that it's nicer to visualise, but arguably harder to code review given its underlying structure.

One thing we do in either option is make the process metadata-driven. We have an input parameter object that captures the variable bits of configuration about how the pipeline should run, e.g. ingestToBronze = [true/false], ingestToBronzeConfig = {...}, processToSilver = [true/false], processToSilverConfig = {...}, and so on. This contains all the information we need to control the flow of the various permutations of processing. Stay tuned - there'll be a video on this later in the series!

> Why not use Python files or Spark job definitions instead of some of the notebooks that only have classes and methods?

We could, and we sometimes do! But the reality is that managing custom Python libraries and Spark Job Definitions is a step up the maturity ladder that not every org is ready to take. This video was meant to provide some inspiration for a happy middle ground - still using notebooks, but structuring them in a way that follows good development practices and would make it easier to migrate to custom libraries in the future should that be necessary.

> How do you integrate these notebooks with testing in CI/CD?

Generally we create a separate set of notebooks that serve as our tests. See endjin.com/blog/2021/05/how-to-test-azure-synapse-notebooks. Naturally, developing these tests isn't great within a notebook, and it's a bit cumbersome to take advantage of some of the popular testing frameworks out there (pytest/behave), but some tests are better than no tests. Then, to integrate into CI/CD, we wrap our test notebooks in a Data Factory pipeline and call that pipeline from ADO/GitHub: learn.microsoft.com/en-us/fabric/data-factory/pipeline-rest-api#run-on-demand-item-job

Sadly I can't cover absolutely everything in this series, so I hope this comment helps!
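To make the notebook-orchestration option a bit more concrete, here's a rough sketch of the kind of thing I mean. Notebook names, config keys and parameter shapes are all illustrative, and the exact DAG schema is described in the Fabric documentation for runMultiple():

    # Illustrative, metadata-driven configuration: the variable bits that control the run.
    pipeline_config = {
        "ingestToBronze": True,
        "ingestToBronzeConfig": {"source": "price_paid", "loadType": "incremental"},
        "processToSilver": True,
        "processToSilverConfig": {"mergeKeys": ["transaction_id"]},
        "processToGold": True,
        "processToGoldConfig": {},
    }

    # Logical DAG for the run: each activity is a notebook, and the dependencies express
    # the Bronze -> Silver -> Gold ordering. Notebook names are illustrative.
    dag = {
        "activities": [
            {
                "name": "IngestToBronze",
                "path": "Ingest To Bronze",
                "args": pipeline_config["ingestToBronzeConfig"],
            },
            {
                "name": "ProcessToSilver",
                "path": "Process To Silver",
                "args": pipeline_config["processToSilverConfig"],
                "dependencies": ["IngestToBronze"],
            },
            {
                "name": "ProcessToGold",
                "path": "Process To Gold",
                "args": pipeline_config["processToGoldConfig"],
                "dependencies": ["ProcessToSilver"],
            },
        ]
    }

    # Runs the notebooks in dependency order within the same Spark session.
    mssparkutils.notebook.runMultiple(dag)

In practice you'd build the activity list conditionally from the true/false flags in the config, so the same orchestration notebook can handle the different permutations of processing.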
@ramonsuarez6105 5 months ago
@endjin Thanks a lot, Ed. You guys do a great job with your videos and posts. Super helpful and inspiring :)
@ManojKatasani 4 months ago
Very clean explanation, appreciate your efforts. Is there any chance we could get the code for each layer (Bronze to Silver, etc.)? Thanks in advance.
@jensenb9 5 months ago
Great stuff. Is this E2E content hosted in a Git repo somewhere that we can access? Thanks!
@endjin 5 months ago
Not yet, but I believe that is the plan.
@EllovdGriek 2 months ago
@endjin I had the same question. Great content, and I wanted to try it myself with the guidance of your code. So, is it available somewhere?