Create on premise Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse

  6,722 views

BI Insights Inc

1 day ago

Comments: 27
@BiInsightsInc
@BiInsightsInc 1 year ago
Link to Data Lake videos, on-premise and AWS: kzbin.info/www/bejne/en21moipZqqpnq8&t kzbin.info/www/bejne/gafXqZd8bMeSopo
@hungnguyenthanh4101
@hungnguyenthanh4101 1 year ago
Can you try another project with Delta Lake and Hive Metastore?
@SonuKumar-fn1gn
@SonuKumar-fn1gn 3 months ago
Great video 👍 Thanks for sharing the video 💝
@jeanchindeko5477
@jeanchindeko5477 4 months ago
3:18 Apache Iceberg has ACID transactions out of the box; it's not Nessie that brings ACID transactions to Iceberg. In the Iceberg specification the catalog only has knowledge of the list of snapshots, and the catalog doesn't track the individual files that are part of a commit or snapshot.
@orafaelgf
@orafaelgf 7 months ago
Great video, congrats. If possible, bring an end-to-end architecture with streaming data ingested directly into the lakehouse, and also something on the integration of a data lake with a data lakehouse.
@BiInsightsInc
@BiInsightsInc 7 months ago
That's a great idea 💡. I will put something together that combines data streaming and the data lake. This will give an end-to-end implementation.
@luisurena1770
@luisurena1770 1 month ago
Is it possible to replace Dremio with Trino?
@BiInsightsInc
@BiInsightsInc 27 days ago
Yes, it’s possible to use Trino with Nessie’s catalog. Here is the link to their docs: projectnessie.org/iceberg/trino/
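As an illustration, a Trino catalog properties file for an Iceberg catalog backed by Nessie might look roughly like the following. This is a sketch based on Trino's Iceberg connector; the exact property names, file path, Nessie URI, and warehouse location are assumptions to verify against the linked docs:

```
# etc/catalog/iceberg.properties (hypothetical path and values)
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=http://localhost:19120/api/v2
iceberg.nessie-catalog.default-warehouse-dir=s3://warehouse/
```

With a catalog file like this in place, Trino would query the same Iceberg tables that Dremio sees through Nessie.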
@Muno_edits06
@Muno_edits06 2 months ago
I want to create an Iceberg table with a REST catalog using PyIceberg. Does this setup work for it?
@BiInsightsInc
@BiInsightsInc 2 months ago
Hi there, you can create an Iceberg table using the Python library or the SQL console. This setup is using Nessie's catalog. If you mean Tabular's "rest catalog" then that's not used in this tutorial.
@Muno_edits06
@Muno_edits06 2 months ago
@@BiInsightsInc I am trying to make a REST catalog for Nessie using the PyIceberg library. I am trying to access the following URI: "uri": "localhost:19120/api/v1", but it is not reachable.
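One possible cause: `/api/v1` is Nessie's native REST API, while PyIceberg's `rest` catalog speaks the Iceberg REST protocol, which newer Nessie versions expose under a separate `/iceberg` path. A minimal sketch of the catalog configuration follows; the endpoint path, branch name, MinIO credentials, and bucket are assumptions for a local setup like the one in the video:

```python
# Sketch: pointing PyIceberg's REST catalog at Nessie's Iceberg REST endpoint.
# The exact endpoint and S3 settings depend on your Nessie/MinIO deployment.
catalog_props = {
    "type": "rest",
    # Nessie's Iceberg REST endpoint (branch "main"), not the native /api/v1 API
    "uri": "http://localhost:19120/iceberg/main",
    # MinIO settings, assumed defaults for a local docker-compose stack
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "minioadmin",
    "s3.secret-access-key": "minioadmin",
}

# Requires a running Nessie/MinIO stack, so the calls are left commented out:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("nessie", **catalog_props)
# catalog.create_namespace("demo")
```

Note also that PyIceberg expects a full URL including the scheme (`http://...`), not a bare `host:port` string.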
@jestinsunny575
@jestinsunny575 1 month ago
Great video! I have set this up on a server, but how would I be able to read data from the tables and write to them using PyIceberg or any other library? I'm trying to fetch data from Iceberg using an API. I have tried a lot of methods and they are not working. Please help, thanks.
@BiInsightsInc
@BiInsightsInc 1 month ago
Thanks. You can use the Dremio client to query/write data stored in the tables. If you want to use a Python library, I will cover that in a future video.
8 months ago
Today I use Apache NiFi to retrieve data from APIs and DBs, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose, and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?
@BiInsightsInc
@BiInsightsInc 8 months ago
Hi there, Dremio is a SQL query engine like Trino and Presto. You do not insert/ingest data into Dremio directly. The S3 layer is where you store your data. Apache Iceberg provides the lakehouse management services (upsert/merge) for the objects in the catalog. I'd advise handling upsert/merge in the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle upserts using SQL: medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
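To illustrate the catalog-layer upsert the reply describes, here is a sketch of an Iceberg MERGE statement as it would be submitted from a Spark session configured with the Nessie catalog. The table and column names are hypothetical; the pattern follows the linked article:

```python
# Sketch of an Iceberg upsert via MERGE. The catalog, table, and column
# names (nessie.sales.sales_data, updates, id) are made-up examples.
merge_sql = """
MERGE INTO nessie.sales.sales_data AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Requires a live Spark session with the Nessie/Iceberg catalog configured:
# spark.sql(merge_sql)
```

Running the merge through the catalog keeps each upsert an atomic Iceberg commit, instead of hand-managing parquet files in S3.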
@KlinciImut
@KlinciImut 3 months ago
Is Nessie storing the data in a different file, or will it refer to and update the original 'sales_data.csv' file?
@BiInsightsInc
@BiInsightsInc 3 months ago
Hey there, the data is managed by Iceberg and yes, it's stored in Parquet format.
@KlinciImut
@KlinciImut 2 months ago
@@BiInsightsInc So the original CSV file stays as it is, and Nessie/Iceberg will create Parquet files which contain the actual, most up-to-date data. Is my understanding correct?
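That understanding matches how the stack works: the source CSV is read once at ingestion, after which Iceberg manages its own Parquet data files plus metadata, and later updates never touch the CSV. A small sketch, where the file name mirrors the video and the PyIceberg append is commented out because it needs the running catalog:

```python
import csv
import os
import tempfile

# Create a tiny stand-in for the video's sales_data.csv
path = os.path.join(tempfile.mkdtemp(), "sales_data.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerow([1, 100])
    writer.writerow([2, 250])

# Ingestion reads the CSV once...
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows))  # 2 rows read; the CSV itself is never modified afterwards

# ...and all writes/updates go to Iceberg-managed Parquet files, e.g. with
# PyIceberg (commented out: requires the running Nessie catalog and pyarrow):
# table = catalog.load_table("demo.sales")
# table.append(pa.Table.from_pylist(rows))
```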
@andriifadieiev9757
@andriifadieiev9757 1 year ago
Great video, thank you!
8 months ago
This is insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering if this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control scales with bigger datasets and a higher change frequency, this would be a crazy good solution to implement.
@BiInsightsInc
@BiInsightsInc 8 months ago
Yes, it is possible to query data using a specific snapshot id. We can time travel using an available snapshot id to view our Iceberg data from a different point in time; see Time Travel Queries. Processing large datasets depends on your setup. If you have multiple nodes with enough RAM/compute power, then you can process large data, or leverage a cloud cluster that you can scale up or down depending on your needs.
select count(*) from s3.ctas.iceberg_blog AT SNAPSHOT '4132119532727284872';
@dipakchandnani4310
@dipakchandnani4310 26 days ago
It is giving me an "A fatal error has been detected by the Java Runtime Environment" error. It was working fine 3-4 months back but is failing now. I am using a Mac and there is no OpenJDK installed: "cd /opt/java/openjdk → cd: no such file or directory: /opt/java/openjdk". I appreciate this video and your help here.
@BiInsightsInc
@BiInsightsInc 26 days ago
Try installing OpenJDK and retry. The error can be caused by a missing installation or corrupted files.
@nicky_rads
@nicky_rads 1 year ago
Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?
@BiInsightsInc
@BiInsightsInc 1 year ago
Thanks. Yes, the Dremio engine brings various services together to offer data lakehouse functionality. I will be going over Iceberg and project Nessie in the future.
@hungnguyenthanh4101
@hungnguyenthanh4101 1 year ago
Very good!
8 months ago
Amazing!