Example: I ran a delete operation with a filter that will take 5 minutes to complete, and right after the delete command I ran an update query with the same filter. What happens when both queries finish? If it were not Delta Lake, I would have gotten an exception for the second query because the first had not completed; and if I had waited and run the update after the delete was complete, the update would have no effect on the table, because no data matching that filter would be left in the table.
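A minimal sketch of this scenario using the Delta Lake Python API; the table path and filter column are hypothetical. Delta Lake uses optimistic concurrency control: both commands read the same table snapshot, and whichever commits second is validated against the winner's commit instead of silently overlapping it.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

events = DeltaTable.forPath(spark, "/tmp/events")  # hypothetical path

# Session 1: long-running delete on the filter.
events.delete("date = '2017-01-01'")

# Session 2, started before the delete commits. If the delete wins the
# race, the update's snapshot is stale: the files it read have been
# removed, so Delta fails the second commit with
# io.delta.exceptions.ConcurrentDeleteReadException rather than
# updating rows that no longer exist.
events.update(condition="date = '2017-01-01'",
              set={"status": lit("archived")})
```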
@Xiphos76 4 years ago
So for schema changes that involve completely new fields, Delta handles them gracefully with .option("mergeSchema", "true"). However, how do you handle situations where the schema change is more subtle? When processing JSON into a table, I've come across situations where a key is logically the same but differs only in case: MyField vs. myfield. Databricks ingestions fail even when mergeSchema is set to true, with the following error: "org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data to save: myfield"
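A minimal sketch reproducing the failure described above, with hypothetical JSON records whose keys differ only by case. Spark's JSON reader infers MyField and myfield as two separate columns, and the case-insensitive Delta writer then rejects them as duplicates.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

records = ['{"MyField": 1}', '{"myfield": 2}']
df = spark.read.json(spark.sparkContext.parallelize(records))

# Fails with: org.apache.spark.sql.AnalysisException:
#   Found duplicate column(s) in the data to save: myfield
(df.write.format("delta")
   .option("mergeSchema", "true")
   .mode("append")
   .save("/tmp/events"))   # hypothetical table path
```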
@dennyglee 4 years ago
While Apache Spark can be run in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema, and Parquet is case-sensitive when storing and returning column information. To avoid issues and possible data corruption, Delta Lake does not allow column names that differ only by case. For more info, please refer to databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html. HTH!
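One possible workaround, sketched below; this is not an official Delta API, just an assumption-laden normalization step. The idea is to read in case-sensitive mode so that both spellings can be referenced individually, then fold columns whose names differ only by case into one lower-cased column before writing. The source path, table path, and field names are hypothetical.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()
# Case-sensitive analysis lets us reference MyField and myfield
# individually without an ambiguous-column error.
spark.conf.set("spark.sql.caseSensitive", "true")

def fold_case_duplicates(df: DataFrame) -> DataFrame:
    groups = {}                      # lower-cased name -> original names
    for name in df.columns:
        groups.setdefault(name.lower(), []).append(name)
    # coalesce each group of case-only duplicates into a single
    # lower-cased column (assumes the duplicates share a data type)
    cols = [coalesce(*[df[n] for n in names]).alias(lower)
            for lower, names in groups.items()]
    return df.select(cols)

df = spark.read.json("/tmp/raw/events.json")   # hypothetical source
(fold_case_duplicates(df)
   .write.format("delta")
   .option("mergeSchema", "true")
   .mode("append")
   .save("/tmp/events"))
```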
@GregKramerTenaciousData 4 years ago
Denny, if I work through all your vids, do you award a certificate? :)