One of the best tutorials on YouTube, thank you so much!
@hernanlopezvergara6133 10 months ago
Thank you very much for this short but very useful video!
@pabloqp7929 1 month ago
Unbelievable, thank you!
@datawise.education 11 months ago
I love all your videos Haq. Great work. :)
@calvinball1 20 days ago
Amazing
@khanetor 3 months ago
Thank you so much for the video! I am trying to build the same setup, but I ran into some issues. I use the official Hive v4 Docker image. I think it does not have the necessary JARs to work with MinIO, because I got ClassNotFound errors. I hope you can help me with three questions: 1. Do you know the complete list of JARs needed to make Hive work with MinIO? 2. Do I need to use the Hive 3 HMS instead of v4? 3. Does the HMS need access to MinIO? In the diagram there is no connection between HMS and MinIO, but it seems there is, because the ClassNotFound errors were due to missing AWS JARs, hinting that some AWS activity was going on.
@BiInsightsInc 2 months ago
Hi there, I don't have a comprehensive list, but check their docs; they may have a dependency list. You will need the AWS S3 JARs along with the Hadoop JARs at a minimum. You can start with Hive 3, since it has more coverage and someone may already have posted a solution. I covered the Hive metastore XML file where we provide the MinIO credentials and URL, so yes, it does need a connection. Since MinIO does not ship its own JARs, we use the AWS JARs to connect to it, hence the AWS-related errors you see. Hope this helps with your project.
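For reference, a minimal hive-site.xml sketch of the S3A properties involved. The property names are the standard hadoop-aws ones; the endpoint and credentials shown are assumptions matching the defaults used in the video's docker-compose setup, and the JARs typically needed are hadoop-aws plus aws-java-sdk-bundle, with versions matched to the Hadoop build inside the image:

<!-- assumed endpoint: the MinIO service on the docker-compose network -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio:9000</value>
</property>
<!-- assumed credentials: MinIO defaults -->
<property>
  <name>fs.s3a.access.key</name>
  <value>minioadmin</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>minioadmin</value>
</property>
<!-- MinIO serves buckets under the path, not as subdomains -->
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>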
@khanetor 2 months ago
@BiInsightsInc thanks for the tips!
@orafaelgf 2 months ago
Hi, I'm revisiting this video because I managed to create the entire infra, but Trino version '351' does not accept 'delta_lake', and the 'hive-hadoop2' connector has since been renamed to 'hive'. I tried everything to make it work, changing the Trino version to >=400 and trying the 'delta_lake' and 'iceberg' connectors, but I couldn't. It would be great to see an update of this video using newer versions. Thanks.
@BiInsightsInc 2 months ago
Hi Rafael, can you share your Docker and catalog files, so we can see which components you've upgraded and what's causing the issue? This tutorial does not use Iceberg or Delta Lake; maybe those components are causing the issue.
@orafaelgf 2 months ago
@@BiInsightsInc Nice. I sent a connection invite on LinkedIn so I can send you the files there. Thanks a lot.
@orafaelgf 2 months ago
@@BiInsightsInc I sent an invitation on your LinkedIn today (and you already accepted). I sent my situation by message. Thank you very much in advance for any help.
@paraffin333 1 month ago
Hi, when I try to create a schema, I get an error that says doesBucketExist. I think this has something to do with TLS access. If possible, could you please guide me through that?
@BiInsightsInc 1 month ago
Hi there, judging by the doesBucketExist error, it seems the S3 bucket is not there. Check the bucket name and try again.
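A quick way to rule out a naming problem is to create the bucket in MinIO first and then point the schema at it explicitly. A minimal Trino sketch, where the 'minio' catalog and 'datalake' bucket names are hypothetical:

-- the bucket (here 'datalake') must already exist in MinIO
CREATE SCHEMA minio.sales
WITH (location = 's3a://datalake/sales/');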
@TheMahardiany a year ago
Thank you for the video, great as always 🎉 I want to ask: in this video, when we use Trino as the query engine, can we use DML and even DDL on that external table, or can we only select from it? Thank you
@BiInsightsInc a year ago
There are a number of limitations on running DML against Hive. Please read the documentation link for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables, especially if the data volume is huge, as these operations become too slow. DML operations are considerably faster when run against a partition/bucket instead of the full table. Nevertheless, it is better to handle the edits in the file and do a full refresh via an external table, and only use DML on managed tables as a last resort. We define the table via DDL, so yes, DDL is supported.
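To illustrate the external-table approach from the reply above: define the table over the files once, then replace the files for a "full refresh" instead of running DML. A minimal Trino sketch; the catalog, schema, bucket, and column names are all hypothetical:

-- DDL: map an external location to a queryable table
CREATE TABLE minio.sales.customers (
    customer_id VARCHAR,
    customer_name VARCHAR
)
WITH (
    external_location = 's3a://datalake/sales/customers/',
    format = 'PARQUET'
);

-- a "refresh" is just overwriting the Parquet files under the
-- external_location and re-running the query; no DML is needed
SELECT * FROM minio.sales.customers;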
@akaile2233 a year ago
Thank you for the video. Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark)
@BiInsightsInc a year ago
Below is sample code to write data to a MinIO bucket with Spark.

package com.medium.scala.sparkbasics

import com.amazonaws.SDKGlobalConfiguration
import org.apache.spark.sql.SparkSession

object MinIORead_Medium extends App {

  // Skip TLS certificate validation for the self-signed local MinIO endpoint
  System.setProperty(SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true")

  lazy val spark = SparkSession.builder().appName("MinIOTest").master("local[*]").getOrCreate()

  val s3accessKeyAws = "minioadmin"
  val s3secretKeyAws = "minioadmin"
  val connectionTimeOut = "600000"
  val s3endPointLoc: String = "127.0.0.1:9000"

  // Point the S3A filesystem at MinIO and supply the credentials
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", s3endPointLoc)
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", s3accessKeyAws)
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", s3secretKeyAws)
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.timeout", connectionTimeOut)
  spark.sparkContext.hadoopConfiguration.set("spark.sql.debug.maxToStringFields", "100")
  // MinIO serves buckets under the path, not as subdomains
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true")

  val yourBucket: String = "minio-test-bucket"
  val inputPath: String = s"s3a://$yourBucket/data.csv"
  val outputPath = s"s3a://$yourBucket/output_data.csv"

  // Read the CSV from MinIO (the .csv() call sets the input format)
  val df = spark
    .read
    .option("header", "true")
    .csv(inputPath)

  // Write it back to the bucket as Parquet
  df
    .write
    .mode("overwrite")
    .parquet(outputPath)
}
@akaile2233 a year ago
@@BiInsightsInc Hi sir, I did everything as in your video and it worked fine, but when I try to remove the schema 'sales' I get an 'access denied' error?
@BiInsightsInc a year ago
@@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive metastore. In this example, we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.
@akaile2233 a year ago
@@BiInsightsInc Sorry to bother you, but there are so many tables in metastore_db; which ones should I delete?
@orafaelgf 8 months ago
Great video, congrats.
@GuilhermeMendesG 28 days ago
Really great video! I have a question. I'm trying to create an external table in Trino using an external_location (which is MinIO), and when reading the Parquet files, two string columns apparently could not be converted to VARCHAR. The error is: SQL Error [65536]: Query failed (#20241118_143937_00020_d5ivv): io.trino.spi.type.VarcharType. What should I do?
@BiInsightsInc 27 days ago
This can happen if one record has a type different from what is defined in the metastore. Check your data for anomalies. Also make sure the table definition includes all the columns that are in the file. Here is a sample table script: github.com/hnawaz007/pythondataanalysis/blob/main/data-lake/Create%20Table.sql
@Chay-u1y a year ago
Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?
@BiInsightsInc a year ago
Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction: resources.min.io/c/lakehouse-architecture-with-iceberg-minio?x=jAF4uk&Lakehouse+%2B+Icerberg+on+PF+1.0+-+080322&hsa_acc=8976569894&hsa_cam=17954061482&hsa_grp=139012460799&hsa_ad=614757163838&hsa_src=g&hsa_tgt=kwd-1717916787486&hsa_kw=apache%20iceberg&hsa_mt=b&hsa_net=adwords&hsa_ver=3&gclid=Cj0KCQjwnrmlBhDHARIsADJ5b_mrNZMG2PHc14akJyBoy3nW-8INcEQ8MFRjifDGkjGDeDiNqcAxVvkaAgToEALw_wcB
@anujsharma4011 a year ago
Can we connect Trino directly to S3, with no Hive in between? I want to install Trino on EC2.
@BiInsightsInc a year ago
I'm afraid not. Trino needs the tables' schema/metadata, and that's managed by the Hive metastore. Alternatively, we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
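For context, this is where the metastore enters the picture: the Trino catalog file names the metastore, and the metastore holds the table-to-S3 mappings. A minimal hive catalog file sketch; the hostnames and credentials are assumptions for a typical docker-compose setup:

connector.name=hive
# assumed metastore host/port on the compose network
hive.metastore.uri=thrift://hive-metastore:9083
# assumed MinIO endpoint and default credentials
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3.path-style-access=true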
@LucasRalambo-bp3vb a year ago
This is very informative!!! Thank you... Can you please also make a video about creating an open-source version of Amazon Forecast?
@BiInsightsInc a year ago
Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source. I will cover time-series forecasting in the future. In the meantime, check out the ML predictive analytics in the following video: kzbin.info/www/bejne/ioOZp6Fqob9mg9E&t
@juliovalentim6178 a year ago
Sorry for the noob question, but can I create a data lake like this on a PowerEdge T550 server instead of my desktop or laptop, without resorting to paid cloud services?
@BiInsightsInc a year ago
Yes, you can create this setup on your server. This way you will use your own infrastructure and avoid paid services and data exposure to outside services.
@juliovalentim6178 a year ago
@@BiInsightsInc Hi! Thank you very much for answering my question. And would you be able to tell me what RAM, cache and SSD requirements I need to have on the server to implement this setup, without slowing down processing for Data Science?
@BiInsightsInc a year ago
@@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you run tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference; the second link is for a production-scale data lake. Hope this helps. blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/ min.io/product/reference-hardware
@juliovalentim6178 a year ago
@@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!
@zera215 9 months ago
You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?
@BiInsightsInc 9 months ago
@zera215 I have covered Apache Iceberg and how to use it in a similar setup in the following video: kzbin.info/www/bejne/rJ-xeXevoayne80
@zera215 9 months ago
@@BiInsightsInc I am looking for a fully open-source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?
@BiInsightsInc 8 months ago
@@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together: iceberg.apache.org/hive-quickstart/
@zera215 8 months ago
Thank you, and congrats on your great work =-D
@oscardelacruz3087 9 months ago
Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.
@BiInsightsInc 9 months ago
You are not able to connect to MinIO in DBeaver? What's the error you receive there?
@oscardelacruz3087 9 months ago
@@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep into a directory tree can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.
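One plausible cause, assuming the year/month/day folders are meant as Hive-style partitions: the Hive connector only returns rows for partitions registered in the metastore, so unregistered directories read as 0 rows. A sketch using Trino's partition-sync procedure; the catalog, schema, table, and column names are hypothetical, and discovery expects directories named in the Hive style (year=2024/month=01/day=15):

-- partition columns must be declared and listed last
CREATE TABLE minio.bronze.customers (
    customer_id VARCHAR,
    year VARCHAR,
    month VARCHAR,
    day VARCHAR
)
WITH (
    external_location = 's3a://datalake/bronze/erp/customers/',
    format = 'PARQUET',
    partitioned_by = ARRAY['year', 'month', 'day']
);

-- register partition directories that already exist in MinIO
CALL minio.system.sync_partition_metadata('bronze', 'customers', 'ADD');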
@hungnguyenthanh4101 a year ago
How do we build a data lakehouse? Please make the next video on the lakehouse topic. ❤
@BiInsightsInc a year ago
Yes, the data lakehouse is on my radar. I will cover it in future videos.