Apache Iceberg has rocketed to the forefront of the big data industry in recent years. Combine it with Trino, the open source query engine that powers both Starburst Galaxy and Starburst Enterprise, and the Icehouse architecture is born.
Sections:
00:00 - Why Apache Iceberg is winning the table format race
00:30 - Icehouse architecture is the new open data lakehouse
00:43 - Data lakehouses replace Hive data lakes
02:27 - Icehouse data architecture over Delta Lake and Hudi
04:02 - Why Iceberg defines Icehouse architecture
04:17 - Apache Iceberg and Trino
What is a data Icehouse architecture?
At its heart, an Icehouse architecture is two revolutions rolled into one. It pairs the Apache Iceberg table format with the Trino query engine and adds a few other components: data ingestion, data governance, data management, and automatic capacity management. If you want to read more about Icehouse architecture, Starburst CEO Justin Borgman sets it out in the Icehouse Manifesto: www.starburst.io/blog/icehous...
You can think of an Icehouse architecture as a roadmap for achieving an open data lakehouse. This is a big shift in data strategy, and it is a revolution in two parts.
The first revolution pushes against Apache Hive, the dominant data lake technology still in use today. If you run production data lakes, there's a good chance they use Hive, a legacy of its early position in Hadoop clusters. But as data velocity has increased, Hive's inability to easily update files has become a major drawback. The Icehouse architecture pushes against this, disrupting Hive's dominance and, along with the other data lakehouse table formats, presenting a data lakehouse alternative.
The second revolution pushes against Delta Lake and Hudi, the other data lakehouse table formats. Although all data lakehouse table formats collect more metadata than Hive and capture changes in the state of the dataset, only Apache Iceberg fully realizes the potential of an open data lakehouse. That potential is to combine the flexibility of a data lake on cloud object storage with the performance of a data warehouse. This data warehouse-like experience is built on inexpensive cloud object storage from AWS, Azure, or GCP. It offers major cost savings for organizations that adopt it and embraces data openness.
#Data #DataAnalytics #DataEngineering #Trino #ApacheIceberg #Iceberg #Icehouse #IcehouseArchitecture #DataIngestion #Hive #ApacheHive #DataLake #DataLakehouse #DataWarehouse #OpenDataLakehouse #DataStrategy #DataRevolution #Starburst #StarburstGalaxy #DataManagement #DataGovernance