Advancing Spark - External Tables with Unity Catalog

  16,173 views

Advancing Analytics

1 day ago

Comments: 34
@paulroome3928
@paulroome3928 2 years ago
Great video! One suggestion - each external location can/should be used to create many external tables. Register an external location on a parent folder of your ADLS account, and when you create external tables in child directories, Unity will automatically figure out that you have access to do that!
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
Oh absolutely! The location I added was for the full lake container in this case - apologies if that wasn't clear in the vid!
@alexischicoine2072
@alexischicoine2072 1 year ago
Creating tables as external has saved me many times when deleting tables by mistake. Having that extra step of having to delete the data in storage isn't so bad if the data is important and hard to recover. If you mess up, recreating the external table is very easy. Obviously, if you're in a mature organization where you don't do anything manual in prod, it's not as much of an issue.
@AyyyyyyyyyLmao
@AyyyyyyyyyLmao 1 year ago
I came here looking for tips on how to build a table within the unity game engine. Not what I asked for but a great video nonetheless!
@ShravanKumar-yv4be
@ShravanKumar-yv4be 1 year ago
Any suggestions as to when to use managed and external tables? Would it be a good idea to use managed for the bronze/silver layers and external for the gold layer?
@KarthikPaladugu-kz8rt
@KarthikPaladugu-kz8rt 1 year ago
The articles don't show how to update external locations in existing workflows after enabling Unity Catalog. Since we can't use DBFS per the recommendations, we need to use external locations - but how do we update existing code to point to them? We'll use the upgrade option in Unity Catalog to migrate external tables, but how do we update workflows to point to external locations?
@dofa47
@dofa47 1 year ago
Hi, late to the party but I have a question: can I have read-only access to the storage account with the data, so that no one can modify the prod data? In other words, where is the metadata of the external table saved?
@BritonWells
@BritonWells 1 year ago
Is there a solution/update to the issue around 15:00? To me that seems like a deal breaker - I want to expose the same external table to many curated catalogs. Do managed tables have the same limit?
@KarloKotarac
@KarloKotarac 1 year ago
Thanks for the great video(s)! Unfortunately, I am not able to create an external location as you suggested in the video (root@storage_account_name) that should (I guess) point to the whole storage account - I need to do it container by container. Also, creating external tables no longer seems possible using storage credentials directly (previously it was done with the WITH clause). Do you have any idea how to address this?
@yashgoel5315
@yashgoel5315 1 year ago
Is there any video on how to set up Unity Catalog?
@RaiOkami
@RaiOkami 2 years ago
Great videos! I am relatively new to Databricks and even more so to UC, so your videos have been a really great help! I am interested in implementing it in our project for a client, just to get the permissions and governance stuff out of the way. But what exactly is the best practice for storing these tables? Is it really better to save them as external tables rather than managed? I was told on the forums that UC would handle the file saves in ADLS plus the actual table registration in the metastore. Yet, by default, it is still a managed table.
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
So our /old/ advice was always to use external for full control over storage. Given UC gives far more flexibility, that's less important. Now the question is how strongly coupled you want the SQL object to be to the underlying data. If someone deletes the table, do you want the data to be deleted too? Comes down to process/ownership.
@RaiOkami
@RaiOkami 2 years ago
@@AdvancingAnalytics Appreciate the response! I see. So for managed tables the real big drawback is that the data itself is tied to the table object in the metastore (be it in Hive or UC), unlike external tables, where you can drop the table "representation" of the files from ADLS but can pull them back at any time with all the original data untouched. Did I get the rationale?
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
@@RaiOkami yep, exactly
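A minimal illustration of the distinction discussed in this exchange (all names and paths are hypothetical): the only DDL difference is the LOCATION clause, and that clause is what keeps the data alive after a DROP.

```python
# Hypothetical names/paths. Managed: Unity Catalog owns the files, so
# DROP TABLE removes them too. External: the files stay in ADLS and the
# table can later be re-created over the same path.
managed_ddl = "CREATE TABLE main.gold.report (id INT)"
external_ddl = (
    "CREATE TABLE main.gold.report_ext (id INT) "
    "LOCATION 'abfss://lake@mystorageacct.dfs.core.windows.net/gold/report'"
)
# Inside a Databricks notebook:
# spark.sql(managed_ddl)
# spark.sql(external_ddl)
```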
@palanithangaraj797
@palanithangaraj797 1 year ago
Can you create persistent views from external tables within Unity Catalog?
@alexanderowens4898
@alexanderowens4898 2 years ago
Always love your videos… a couple of questions: 1. How can I provide an external path using Scala saveAsTable() for UC? 2. Wouldn't the use of external tables limit the ability to get and use lineage tracking if you load data from and then save data to external locations (unmanaged tables)?
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
Hey! When using external tables traditionally, you can create the database with a location (CREATE DATABASE mydatabase LOCATION '/mnt/data/mydatabase/', etc). Then if you use saveAsTable() it would inherit the external location of that database. I've not tried it with UC, so can't guarantee that it works! And with lineage, I've not looked into how well supported external tables are currently, but when lineage is GA, I'd hope that it supports both types of table!
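A sketch of the pre-Unity-Catalog pattern described in that reply (path and names are hypothetical): give the database a LOCATION, and saveAsTable() writes the table's files under it.

```python
# Hypothetical path/names; this is the hive_metastore-era pattern,
# and (as noted above) not verified against Unity Catalog.
db_ddl = "CREATE DATABASE IF NOT EXISTS mydatabase LOCATION '/mnt/data/mydatabase/'"

# Inside a Databricks notebook:
# spark.sql(db_ddl)
# Subsequent saveAsTable() calls inherit the database location:
# df.write.format("delta").saveAsTable("mydatabase.events")
# and the table files land under /mnt/data/mydatabase/events
```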
@andrewli7542
@andrewli7542 2 years ago
Something like df.write.option("path", "s3://some/path").saveAsTable("t") should work. Lineage will support both managed and external tables.
@alexanderowens4898
@alexanderowens4898 2 years ago
@@andrewli7542 That was initially what I had tried, though we use an Azure mount path (/mnt/account/somepath) and I was receiving a "Missing Cloud file system scheme" error.
@andrewli7542
@andrewli7542 2 years ago
@@alexanderowens4898 You can't use mounts with external locations for now. A full path with abfss:// is needed.
@alexanderowens4898
@alexanderowens4898 2 years ago
@@andrewli7542 Great tip - this is what I ended up doing: .write.option("path", s"abfss://$container@$account.dfs.core.windows.net/$system/$table").saveAsTable(s"$container.$system.$table") and it worked great! Thank you very much.
@majetisaisowmya909
@majetisaisowmya909 1 year ago
Can you please help me create a metastore, catalog, and table automatically using Python or PowerShell?
@AdvancingAnalytics
@AdvancingAnalytics 1 year ago
So you'll still need to go through the SQL side of things, but you can build up the SQL string programmatically. For example, using Python inside a Databricks notebook you could use: CatName = "myCatalog"; SQLString = f"CREATE CATALOG {CatName}"; spark.sql(SQLString). That writes a SQL command for you, then executes it. You can use that in a loop to programmatically create schemas, tables, whatever you want!
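Extending that idea into the loop mentioned in the reply (catalog and schema names are made up for illustration):

```python
# Build the SQL strings first, then execute each one with spark.sql()
# inside a Databricks notebook. All names here are hypothetical.
def create_schema_sql(catalog: str, schema: str) -> str:
    """Return the DDL string for one schema in the given catalog."""
    return f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"

statements = [create_schema_sql("myCatalog", s) for s in ("bronze", "silver", "gold")]

# for stmt in statements:
#     spark.sql(stmt)
```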
@mauromi68
@mauromi68 2 years ago
Really a great video. I'm new to Databricks Unity Catalog and I tried to replicate these steps, but I still get the error "Error in SQL statement: UnauthorizedAccessException: PERMISSION_DENIED: request not authorized". It seems to me I did whatever I had to do:
- I created a Databricks access connector in Azure (which becomes a managed identity)
- I created an ADLS Gen2 storage account (data lake with hierarchical namespace) plus a container
- On my data lake container I assigned the Storage Blob Data Contributor role to the managed identity above
- I created a new Databricks Premium workspace
- I created a new metastore in Unity Catalog that "binds" the access connector to the data lake
- I bound the metastore to the Premium Databricks workspace
- I gave my Databricks user admin permission on the above Databricks workspace
- I created a new cluster in the same Premium workspace, choosing runtime 11.1 and "single user" access mode
- I ran the notebook, which correctly created a new catalog, assigned proper rights to it, created a schema, and confirmed that I am the owner of that schema
The only (but most important) SQL command of the same notebook that fails is the one that tries to create a managed Delta table and insert two records: CREATE TABLE IF NOT EXISTS quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table (columnA Int, columnB String) PARTITIONED BY (columnA); When I run it, it starts working and in fact it starts creating the folder structure for this Delta table in my storage account; however, it then fails with the following error: java.util.concurrent.ExecutionException: Failed to acquire a SAS token for list on /data/a3b9da69-d82a-4e0d-9015-51646a2a93fb/tables/eab1e2cc-1c0d-4ee4-9a57-18f17edcfabb/_delta_log due to java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: request not authorized
Please consider that I didn't have any folder created under the "unity-catalog" container before running the table creation command. So it seems that it can successfully create the folder structure, but after it creates the "table" folder, it can't acquire the SAS token. I can't understand this, since I am an admin in this workspace, the Databricks managed identity is assigned the contributor role on the storage container, and Databricks actually starts creating the other folders. What else should I configure?
@mauromi68
@mauromi68 2 years ago
I found it: it's not enough to assign the Storage Blob Data Contributor role to the Azure Databricks access connector at container level - you also need to assign the same role to the same connector at STORAGE ACCOUNT level. I couldn't find this information in the documentation, and I frankly can't understand why it's needed, since the Delta table path was being created. However, this way it works.
@rakeshprasad4969
@rakeshprasad4969 2 years ago
Is this managed identity for all the workspaces I have in my tenant, or for one specific one? If it's the latter, how do we know which workspace an MI belongs to?
@rakeshprasad4969
@rakeshprasad4969 2 years ago
I should have watched a bit more - the step at 8:18 explains it. Thanks, good video. I think ADB should have this MI concept without Unity too; it's a good approach and puts it on par with ADF or any other Azure PaaS (from an identity perspective). Can anyone use an MI that is in the catalog, or has ADB released grants for MIs too? I would like to control who/which code has access to which MI.
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
@@rakeshprasad4969 Good question - I haven't looked at the permission model for who/what has access to the managed identity. I assume the access is delegated by the user who adds the credential to the workspace.
@t0nyc0llins
@t0nyc0llins 1 year ago
@@rakeshprasad4969 @AdvancingAnalytics I'd really like to understand this too. If I create an MI with contributor access to the root, can all analysts use this connection to read/write anything in the lake? Hopefully I am missing something key.
@blackdeckerz0r
@blackdeckerz0r 2 years ago
The only problem with abfss is that Python-only code like pandas and open() doesn't work with that path. We are currently migrating from mounting the storage account to abfss, and we found this limitation.
@AdvancingAnalytics
@AdvancingAnalytics 2 years ago
Unity Catalog expects you to work entirely through the registered SQL tables, so it's only with the initial table registration that you use the full abfss:// path... That said, it's harder to use pandas with the SQL tables ;)
@andrewli7542
@andrewli7542 2 years ago
For now, access to External location paths is limited to Spark Dataset API or dbutils.fs API. PySpark Pandas APIs that use Spark Dataset APIs under the hood should also work like ps.read_table().
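A small sketch of that pandas-on-Spark route (catalog/schema/table names are hypothetical): go through the registered table name rather than an abfss:// path.

```python
def uc_table_name(catalog: str, schema: str, table: str) -> str:
    """Build the three-level name Unity Catalog expects, e.g. for ps.read_table()."""
    return f"{catalog}.{schema}.{table}"

# Inside a Databricks notebook with Unity Catalog enabled:
# import pyspark.pandas as ps
# psdf = ps.read_table(uc_table_name("main", "bronze", "customers"))
# pdf = psdf.to_pandas()  # plain pandas, no abfss:// path or mount needed
```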
@t0nyc0llins
@t0nyc0llins 1 year ago
Can I ask how you now access files on your data lake with pandas etc.? We are just starting to move to Unity Catalog and I am unsure how we do this without mounting.