Simplify and Scale Data Engineering Pipelines with Delta Lake

27,956 views

Databricks


Comments: 12
@andyharman1771 · 4 years ago
Starts at 3:10
@Databricks · 4 years ago
Thanks Andy, I trimmed it. Video starts right at 0:00
@jasonabhi · 4 years ago
Amazing hands-on session!
@KoushikPaulliveandletlive · 4 years ago
Wonderful demonstration and a very handy notebook. Here are my assumptions:
1. Delta Lake keeps multiple versions of the data (like HBase).
2. Delta Lake takes care of atomicity for the user, showing only the latest files unless specified otherwise.
3. Delta Lake checks the schema before appending, to prevent corruption of the table. This makes the developer's job easy; similar things can be achieved with manual effort, like explicitly specifying the schema instead of inferring it.
4. In the case of an update, it always overwrites the entire table or the entire partition (DataFrames are immutable).
Questions:
1. If it keeps multiple versions, is there a default limit on the number of versions?
2. Since it keeps multiple versions, is it only for smaller tables? For tables in the terabytes, won't it be a waste of space?
3. In a relational DB, data is tightly coupled with the metadata/schema, so we can only get the data from the table, not from the data files. In Hive/Spark this is different: external tables are allowed, and we can recreate a table without access to the metadata. How is this handled in Delta Lake? Since there are multiple snapshots/versions of the same table, can someone access the data without the log/metadata? In Hive/Spark, multiple tables can be created on the same data with different tools (Hive, Presto, Spark). Can other tools share the same data with Delta Lake?
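For reference, a minimal PySpark sketch of assumptions 1–3 above: each write creates a new table version readable via time travel, and appends with a mismatched schema are rejected. The path /tmp/delta/events and the schemas are hypothetical; Delta Lake is assumed to be available on the cluster (it ships with Databricks runtimes; on plain Spark, add the delta-core package).

```python
# Hypothetical sketch: Delta Lake versioning and schema enforcement.
from pyspark.sql import SparkSession

# On plain Spark, the Delta package must be on the classpath, e.g.
# spark-submit --packages io.delta:delta-core_2.12:<version>
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "/tmp/delta/events"  # hypothetical table location

# Each write commits a new version to the table's _delta_log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(5, 10).write.format("delta").mode("append").save(path)     # version 1

# Time travel: read an earlier version by number (or by timestamp).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()  # only ids 0..4

# Schema enforcement: an append with a mismatched schema fails with an
# AnalysisException instead of silently corrupting the table.
bad = spark.createDataFrame([("oops",)], ["not_id"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)
```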
@vinyasshetty4042 · 4 years ago
For updates, it will not overwrite the entire table; it looks at the files that contain the data to be updated and creates new copies of only those files. The new files contain the updated records plus the non-updated records from the same files. To eventually clean up the older versions, you run a VACUUM command. Currently only Spark SQL works for querying the Delta location, but I believe they are working on making Presto and Hive work with it.
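A hedged sketch of the update-then-vacuum flow described in this reply, using the Delta Lake Python API. It reuses the hypothetical /tmp/delta/events table from the sketch above; the update condition, new value, and retention window are made up for illustration.

```python
# Hypothetical sketch: in-place UPDATE followed by VACUUM.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # Delta package assumed available
dt = DeltaTable.forPath(spark, "/tmp/delta/events")

# UPDATE rewrites only the data files containing matching rows; the new
# files hold the updated rows plus the untouched rows from those files.
dt.update(condition="id = 3", set={"id": "30"})

# Older files remain (for time travel) until VACUUM removes the ones no
# longer referenced and older than the retention window (default 7 days).
dt.vacuum(168)  # retention in hours
```

Note that VACUUM only deletes files older than the retention window, so time travel to recent versions keeps working after it runs.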
@CoopmanGreg · 2 years ago
If the streaming/batch notebook you demonstrated were being run in a workflow, and let's say 100k rows had streamed in successfully, but then an error occurs and the job fails: as I understand it, the 100k rows and all other changes made in the workflow would be automatically rolled back. Is this correct?
@nit46hin · 4 years ago
Great demo... very useful for learning the Delta architecture.
@Databricks · 4 years ago
Thanks for the feedback, Nithin! Glad you enjoyed it.
@nit46hin · 4 years ago
Can you share the steps for importing the notebook from the GitHub link into Databricks Community Edition?
@dennylee4934 · 4 years ago
Please refer to the "Importing Notebooks" section of github.com/delta-io/delta/tree/master/examples/tutorials/saiseu19#importing-notebooks for step-by-step instructions. HTH!