How to build on-premise Data Lake? | Build your own Data Lake | Open Source Tools | On-Premise

12,566 views

BI Insights Inc


Comments: 47
@alaab82 · 6 months ago
one of the best tutorials on YouTube, thank you so much!
@hernanlopezvergara6133 · 10 months ago
Thank you very much for this short but very useful video!
@pabloqp7929 · 1 month ago
Unbelievable, thank you!
@datawise.education · 11 months ago
I love all your videos Haq. Great work. :)
@calvinball1 · 20 days ago
Amazing
@khanetor · 3 months ago
Thank you so much for the video! I am trying to build the same setup, but I ran into some issues. I use the official Hive v4 Docker image. I think it does not have the necessary jars to work with MinIO, because I got ClassNotFound errors. I hope you can help me with 3 questions:
1. Do you know the complete list of jars needed to make Hive work with MinIO?
2. Do I need to use the Hive 3 HMS instead of v4?
3. Does HMS need access to MinIO? In the diagram there is no connection between HMS and MinIO, but it seems there is one, because the ClassNotFound errors were due to missing AWS jars, hinting that some AWS activity was going on.
@BiInsightsInc · 2 months ago
Hi there, I don't have a comprehensive list, but check their docs; they may publish a dependency list. You will need the AWS S3 jars along with the Hadoop jars at a minimum. You can start with Hive 3, since it has more coverage and someone may already have a solution out there. I covered the Hive Metastore XML file where we provide the MinIO credentials and URL, so yes, it does need a connection. Since MinIO does not ship its own jars, we use the AWS jars to connect to it, hence the AWS-related errors. Hope this helps with your project.
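For reference, a minimal sketch of the S3A settings that XML config file needs; the endpoint and credentials below are placeholders to adapt to your setup, and the jars themselves (hadoop-aws plus aws-java-sdk-bundle) must also be on the metastore classpath:

  <!-- hive-site.xml / metastore-site.xml sketch; values are placeholders -->
  <property><name>fs.s3a.endpoint</name><value>http://minio:9000</value></property>
  <property><name>fs.s3a.access.key</name><value>minioadmin</value></property>
  <property><name>fs.s3a.secret.key</name><value>minioadmin</value></property>
  <property><name>fs.s3a.path.style.access</name><value>true</value></property>
  <property><name>fs.s3a.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property>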
@khanetor · 2 months ago
@BiInsightsInc thanks for the tips!
@orafaelgf · 2 months ago
Hi, I'm revisiting this video because I managed to create the entire infra, but Trino version 351 does not accept the delta_lake connector. In addition, the hive-hadoop2 connector has been renamed to hive. I tried everything to make it work, changing the Trino version to >=400 and adding the delta_lake and iceberg connectors, and I couldn't. It would be great to see an update of this video using newer versions. Thanks
@BiInsightsInc · 2 months ago
Hi Rafael, can you share your Docker and catalog files? Then we can see which components you've upgraded and what's causing the issue. This tutorial does not use Iceberg or Delta Lake; maybe those components are causing the issue.
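For anyone attempting the same upgrade, a minimal sketch of a Hive catalog file for Trino >=400, where the connector is named hive rather than hive-hadoop2; the metastore URI, endpoint, and credentials are placeholders, and the newest releases move these to native s3.* properties, so check the docs for your exact version:

  # etc/catalog/minio.properties (sketch; values are placeholders)
  connector.name=hive
  hive.metastore.uri=thrift://hive-metastore:9083
  hive.s3.endpoint=http://minio:9000
  hive.s3.aws-access-key=minioadmin
  hive.s3.aws-secret-key=minioadmin
  hive.s3.path-style-access=true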
@orafaelgf · 2 months ago
@BiInsightsInc Nice. I sent a connection invite on LinkedIn so I can send you the files there. Thanks a lot.
@orafaelgf · 2 months ago
@BiInsightsInc I sent an invitation on your LinkedIn today (and you already accepted). I described my situation by message. Thank you very much in advance for any help.
@paraffin333 · 1 month ago
Hi, when I try to create a schema I get an error that says doesBucketExist. I think this has something to do with TLS access. If possible, could you please guide me through that?
@BiInsightsInc · 1 month ago
Hi there, judging by the doesBucketExist error, it seems the S3 bucket is not there. Check the bucket name and try again.
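If you want to verify the bucket from outside Trino, the MinIO client can do it; the alias, endpoint, credentials, and bucket name below are placeholders:

  # point mc at the server, then list/create buckets
  mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
  mc ls local            # list existing buckets
  mc mb local/datalake   # create the bucket if it is missing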
@TheMahardiany · 1 year ago
Thank you for the video, great as always 🎉 I want to ask: when we use Trino as the query engine, can we use DML and even DDL on that external table, or can we only select from it? Thank you
@BiInsightsInc · 1 year ago
There are a number of limitations on DML in Hive. Please read the documentation for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables; especially when the data volume is huge, these operations become too slow. DML operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it's better to handle edits in the files and do a full refresh via an external table, and only use DML on managed tables as a last resort. We define the table via DDL, so yes to DDL.
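A minimal sketch of that full-refresh pattern in Trino SQL against a Hive catalog; the catalog, schema, table, and bucket names are placeholders:

  -- drop the mapping and re-point it at the refreshed files
  DROP TABLE IF EXISTS minio.sales.orders;
  CREATE TABLE minio.sales.orders (
    order_id BIGINT,
    amount   DOUBLE
  )
  WITH (
    format = 'PARQUET',
    external_location = 's3a://datalake/sales/orders/'
  );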
@akaile2233 · 1 year ago
Thank you for the video. Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark.)
@BiInsightsInc · 1 year ago
Below is sample code to read data from and write it back to a MinIO bucket with Spark.

  package com.medium.scala.sparkbasics

  import com.amazonaws.SDKGlobalConfiguration
  import org.apache.spark.sql.SparkSession

  object MinIORead_Medium extends App {
    System.setProperty(SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true")

    lazy val spark = SparkSession.builder().appName("MinIOTest").master("local[*]").getOrCreate()

    val s3accessKeyAws = "minioadmin"
    val s3secretKeyAws = "minioadmin"
    val connectionTimeOut = "600000"
    val s3endPointLoc: String = "127.0.0.1:9000"

    // point the S3A filesystem at the MinIO endpoint
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", s3endPointLoc)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", s3accessKeyAws)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", s3secretKeyAws)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.timeout", connectionTimeOut)
    spark.sparkContext.hadoopConfiguration.set("spark.sql.debug.maxToStringFields", "100")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true")

    val yourBucket: String = "minio-test-bucket"
    val inputPath: String = s"s3a://$yourBucket/data.csv"
    val outputPath = s"s3a://$yourBucket/output_data"

    // read the CSV from the bucket
    val df = spark.read.option("header", "true").csv(inputPath)

    // write it back as Parquet
    df.write.mode("overwrite").parquet(outputPath)
  }
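To run something like this you also need the S3A jars on Spark's classpath; one way is via --packages (the versions here are assumptions that must match your Spark/Hadoop build, and the jar path is a placeholder):

  spark-submit \
    --class com.medium.scala.sparkbasics.MinIORead_Medium \
    --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
    target/your-app.jar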
@akaile2233 · 1 year ago
@BiInsightsInc Hi sir, I did everything like in your video and it worked fine, but when I remove the schema 'sales' I get an 'access denied' error?
@BiInsightsInc · 1 year ago
@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive Metastore. In this example we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.
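For orientation, in the standard Hive metastore schema the DBS table holds the schemas/databases and TBLS holds the tables; a sketch of how to locate the objects for a schema before removing anything (the schema name is a placeholder):

  -- run against the MariaDB metastore database
  SELECT d.NAME AS schema_name, t.TBL_NAME
  FROM DBS d
  LEFT JOIN TBLS t ON t.DB_ID = d.DB_ID
  WHERE d.NAME = 'sales';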
@akaile2233 · 1 year ago
@BiInsightsInc Sorry to bother you, but there are a lot of tables in metastore_db; which ones should I delete?
@orafaelgf · 8 months ago
great video, congrats.
@GuilhermeMendesG · 28 days ago
Really great video! I have a question. I'm trying to create an external table in Trino using an external_location (which is MinIO), and when reading the Parquet files, 2 string columns apparently could not be converted to VARCHAR. The error is: Erro SQL [65536]: Query failed (#20241118_143937_00020_d5ivv): io.trino.spi.type.VarcharType. What should I do?
@BiInsightsInc · 27 days ago
This can happen if one record has a type different from what is defined in the metastore. Check your data for anomalies. Also make sure the table includes all the columns that are in the file. Here is a sample table script: github.com/hnawaz007/pythondataanalysis/blob/main/data-lake/Create%20Table.sql
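When debugging this kind of type error it helps to compare what the metastore has declared against the actual file schema; a sketch, assuming a table named minio.sales.customers:

  -- what Trino/the metastore believe the columns are
  DESCRIBE minio.sales.customers;
  SHOW CREATE TABLE minio.sales.customers;

Then inspect the Parquet footer itself (for example with parquet-tools, or df.printSchema() in Spark); every declared column type must match the file's physical type.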
@Chay-u1y · 1 year ago
Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?
@BiInsightsInc · 1 year ago
Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction: resources.min.io/c/lakehouse-architecture-with-iceberg-minio?x=jAF4uk&Lakehouse+%2B+Icerberg+on+PF+1.0+-+080322&hsa_acc=8976569894&hsa_cam=17954061482&hsa_grp=139012460799&hsa_ad=614757163838&hsa_src=g&hsa_tgt=kwd-1717916787486&hsa_kw=apache%20iceberg&hsa_mt=b&hsa_net=adwords&hsa_ver=3&gclid=Cj0KCQjwnrmlBhDHARIsADJ5b_mrNZMG2PHc14akJyBoy3nW-8INcEQ8MFRjifDGkjGDeDiNqcAxVvkaAgToEALw_wcB
@anujsharma4011 · 1 year ago
Can we connect Trino directly to S3, with no Hive in between? I want to install Trino on EC2.
@BiInsightsInc · 1 year ago
I'm afraid not. Trino needs the table schema/metadata, and that's managed by the Hive metastore. Alternatively we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
@LucasRalambo-bp3vb · 1 year ago
This is very informative!!! Thank you... Can you please also make a video about creating an open source version of Amazon Forecast?
@BiInsightsInc · 1 year ago
Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source tools. I will cover time-series forecasting in the future. In the meantime, check out the ML predictive analytics video: kzbin.info/www/bejne/ioOZp6Fqob9mg9E&t
@juliovalentim6178 · 1 year ago
Sorry for the noob question, but can I create a data lake like this on a PowerEdge T550 server instead of my desktop or laptop, without resorting to paid cloud services?
@BiInsightsInc · 1 year ago
Yes, you can create this setup on your own server. That way you use your own infrastructure and avoid paid services and exposing data to outside services.
@juliovalentim6178 · 1 year ago
@BiInsightsInc Hi! Thank you very much for answering my question. Could you tell me what RAM, cache, and SSD requirements the server needs to run this setup without slowing down data science processing?
@BiInsightsInc · 1 year ago
@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you run tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference; the second link covers a production-scale data lake. Hope this helps. blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/ min.io/product/reference-hardware
@juliovalentim6178 · 1 year ago
@@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!
@zera215 · 9 months ago
You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?
@BiInsightsInc · 9 months ago
@zera215 I have covered Apache Iceberg, and how to use it in a similar setup, in the following video: kzbin.info/www/bejne/rJ-xeXevoayne80
@zera215 · 9 months ago
@BiInsightsInc I am looking for a fully open source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?
@BiInsightsInc · 8 months ago
@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together: iceberg.apache.org/hive-quickstart/
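Following that quickstart pattern, a minimal sketch of creating an Iceberg table through Hive so the metastore stays in the stack; the table and column names are placeholders:

  -- in Hive (Hive 4), backed by the metastore
  CREATE TABLE taxis (trip_id BIGINT, fare DOUBLE)
  STORED BY ICEBERG;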
@zera215 · 8 months ago
@BiInsightsInc Thank you, and congrats on your great work =-D
@oscardelacruz3087 · 9 months ago
Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.
@BiInsightsInc · 9 months ago
You are not able to connect to MinIO in DBeaver? What's the error you receive there?
@oscardelacruz3087 · 9 months ago
@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.
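Two things commonly cause 0 rows with nested folders in the Hive connector, sketched here with placeholder names: the connector does not descend into subdirectories unless recursion is enabled, and date-partitioned layouts are only discovered when the folders use Hive-style year=2024/month=01 names and are registered as partitions:

  # in the Trino Hive catalog properties
  hive.recursive-directories=true

  -- or, for a partitioned table, register the folders
  CALL minio.system.sync_partition_metadata('bronze', 'customers', 'ADD');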
@hungnguyenthanh4101 · 1 year ago
How do you build a data lakehouse? Please make the next video on the lakehouse topic. ❤
@BiInsightsInc · 1 year ago
Yes, the data lakehouse is on my radar. I will cover it in future videos.
@wallacecamargo1043 · 1 year ago
Thanksss!!!