Link to Data Lake videos (on-premises and AWS): kzbin.info/www/bejne/en21moipZqqpnq8&t kzbin.info/www/bejne/gafXqZd8bMeSopo
@hungnguyenthanh4101 · 1 year ago
Can you try another project with Delta Lake and Hive Metastore?
@SonuKumar-fn1gn · 3 months ago
Great video 👍 Thanks for sharing 💝
@jeanchindeko5477 · 4 months ago
3:18 Apache Iceberg has ACID transactions out of the box; it’s not Nessie that brings ACID transactions to Iceberg. In the Iceberg specification, the catalog only has knowledge of the list of snapshots; the catalog doesn’t track the individual files that are part of a commit or snapshot.
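To make that concrete: in Iceberg, the catalog holds little more than a pointer to the table's current metadata file; the snapshots, manifest lists, manifests, and data files all live in the table's own storage. A rough sketch of the on-disk layout (paths illustrative, not from the video):

    warehouse/db/table/
      metadata/
        v3.metadata.json   <- table metadata: schema, snapshot log, current snapshot id
        snap-*.avro        <- manifest lists: one per snapshot
        *-m0.avro          <- manifests: list the data files plus column stats
      data/
        *.parquet          <- the actual data files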
@orafaelgf · 7 months ago
Great video, congrats. If possible, bring an end-to-end architecture with streaming data ingested directly into the lakehouse, and also something on the integration of a data lake with a data lakehouse.
@BiInsightsInc · 7 months ago
That’s a great idea 💡. I will put something together that combines data streaming and the data lake. This will give an end-to-end implementation.
@luisurena1770 · 1 month ago
Is it possible to replace Dremio with Trino?
@BiInsightsInc · 27 days ago
Yes, it’s possible to use Trino with Nessie’s catalog. Here is the link to their docs: projectnessie.org/iceberg/trino/
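For reference, a minimal sketch of a Trino catalog file (e.g. etc/catalog/iceberg.properties) wired to Nessie, with property names as described in the linked docs; the host, branch, and warehouse path below are assumptions, not values from the video:

    connector.name=iceberg
    iceberg.catalog.type=nessie
    # Nessie's native REST API endpoint
    iceberg.nessie-catalog.uri=http://nessie:19120/api/v2
    iceberg.nessie-catalog.ref=main
    iceberg.nessie-catalog.default-warehouse-dir=s3://warehouse/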
@Muno_edits06 · 2 months ago
I want to create an Iceberg table with a REST catalog using pyiceberg. Does this setup work for it?
@BiInsightsInc · 2 months ago
Hi there, you can create an Iceberg table using the Python library or the SQL console. This setup is using Nessie's catalog. If you mean Tabular's "REST catalog", then that's not used in this tutorial.
@Muno_edits06 · 2 months ago
@BiInsightsInc I am trying to use a REST catalog for Nessie with the pyiceberg library. I am trying to access the following URI: "uri": "localhost:19120/api/v1", but it is not connecting.
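Two things are worth checking here: that URI is missing a scheme (http://), and /api/v1 is Nessie's own API, not an Iceberg REST endpoint; recent Nessie versions also expose an Iceberg REST endpoint under /iceberg. A minimal pyiceberg sketch under those assumptions (the catalog name, namespace, table, and schema are illustrative):

    from pyiceberg.catalog import load_catalog
    from pyiceberg.schema import Schema
    from pyiceberg.types import LongType, NestedField, StringType

    # point pyiceberg at Nessie's Iceberg REST endpoint, not /api/v1
    catalog = load_catalog(
        "nessie",
        **{"type": "rest", "uri": "http://localhost:19120/iceberg"},
    )

    catalog.create_namespace("demo")  # skip if the namespace already exists
    schema = Schema(
        NestedField(1, "id", LongType(), required=True),
        NestedField(2, "name", StringType(), required=False),
    )
    catalog.create_table("demo.events", schema=schema)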
@jestinsunny575 · 1 month ago
Great video! I have set this up on a server; how would I be able to read data from the tables and write to them using pyiceberg or any other library? I'm trying to fetch data from Iceberg via an API. I have tried a lot of methods and they are not working. Please help, thanks.
@BiInsightsInc · 1 month ago
Thanks. You can use the Dremio client to query/write data stored in the tables. If you want to use a Python library, I will cover that in a future video.
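For reference, Dremio exposes an Arrow Flight endpoint (port 32010 by default) that can be queried from Python with pyarrow; a minimal sketch, assuming default ports and made-up credentials and table names:

    from pyarrow import flight

    client = flight.FlightClient("grpc://localhost:32010")
    # basic-auth handshake; returns an authorization header for subsequent calls
    token = client.authenticate_basic_token("dremio_user", "dremio_password")
    options = flight.FlightCallOptions(headers=[token])

    descriptor = flight.FlightDescriptor.for_command(
        "SELECT * FROM nessie.sales LIMIT 10"
    )
    info = client.get_flight_info(descriptor, options)
    reader = client.do_get(info.endpoints[0].ticket, options)
    table = reader.read_all()  # pyarrow.Table
    print(table.to_pandas())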
8 months ago
Today I use Apache NiFi to retrieve data from APIs and DBs, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?
@BiInsightsInc · 8 months ago
Hi there, Dremio is a SQL query engine like Trino and Presto; you do not insert/ingest data into Dremio directly. The S3 layer is where you store your data, and Apache Iceberg provides the lakehouse management (upsert/merge) for the objects in the catalog. I'd advise handling upsert/merge at the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle upserts using SQL: medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
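In the spirit of the linked article, a minimal PySpark sketch of a catalog-level upsert; the catalog/table names, landing path, and join key are assumptions, and the SparkSession is presumed to be already configured with the Nessie/Iceberg catalog and the Iceberg SQL extensions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # stage the incoming batch (e.g. parquet landed by NiFi) as a temp view
    spark.read.parquet("s3a://landing/sales/").createOrReplaceTempView("updates")

    # upsert through the Iceberg catalog, not by rewriting files in S3
    spark.sql("""
        MERGE INTO nessie.db.sales AS t
        USING updates AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)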
@KlinciImut · 3 months ago
Is Nessie storing the data in a different file, or will it refer to and update the original 'sales_data.csv' file?
@BiInsightsInc · 3 months ago
Hey there, the data is managed by Iceberg, and yes, it’s stored in the parquet format.
@KlinciImut · 2 months ago
@BiInsightsInc So the original CSV file stays as it is, and Nessie/Iceberg will create parquet files that contain the actual, most up-to-date data. Is my understanding correct?
@andriifadieiev9757 · 1 year ago
Great video, thank you!
8 months ago
This is so insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering whether this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control scales with bigger datasets and a higher change frequency, this would be a crazy good solution to implement.
@BiInsightsInc · 8 months ago
Yes, it is possible to query data using a specific snapshot id. We can time travel using an available snapshot id to view our Iceberg data from a different point in time; see Time Travel Queries. The processing of large datasets depends on your setup: if you have multiple nodes with enough RAM/compute power, you can process large data, or you can leverage a cloud cluster that you scale up or down depending on your needs. For example:

    select count(*) from s3.ctas.iceberg_blog AT SNAPSHOT '4132119532727284872';
@dipakchandnani4310 · 26 days ago
It is giving me an "A fatal error has been detected by the Java Runtime Environment" error. It was working fine 3-4 months back but is failing now. I am using a Mac and there is no OpenJDK installed: "cd /opt/java/openjdk" gives "cd: no such file or directory: /opt/java/openjdk". I appreciate this video and your help here.
@BiInsightsInc · 26 days ago
Try installing OpenJDK and re-try. The error can be caused by a missing installation or corrupted files.
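For reference, on macOS a common way to get OpenJDK is via Homebrew; the version pin below is an assumption, not something from the video:

    brew install openjdk@17
    # verify which JVM is picked up afterwards
    java -version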
@nicky_rads · 1 year ago
Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?
@BiInsightsInc · 1 year ago
Thanks. Yes, Dremio's engine brings various services together to offer data lakehouse functionality. I will be going over Iceberg and Project Nessie in the future.