Incremental Data Load in Hive | Big data interview questions

48,291 views

GK Codelabs

Comments: 117
@ArtAlive 4 years ago
Thanks a lot, that was an awesome explanation. I had been searching for the answer to this. Thank you so much!
@kumarrk6343 5 years ago
9:40 to 12:40. No words. What a simple explanation. Really mind-blowing. I have been rejected in more than 25 interviews so far, even though I have 2 years of genuine big data experience. Now I know where I was lacking. I can definitely crack my next interview with the help of your videos.
@anshulbisht4130 4 years ago
What did you do for those 2 years?
@venkatramana7980 4 years ago
Bro, you really are a hero. Helping others without expecting anything is a really big thing. Thanks a lot, bro.
@sivak9750 4 years ago
The best and simplest explanation. I didn't find this solution anywhere else. Thanks a lot!!
@DeepakSharma_youtube 4 years ago
Very good explanation, but I have a few questions, because I've used a slightly different approach in our prod environment, and this approach also doesn't solve our issue (Q3 below). Q1: At 14:42 you didn't update the date to 2019-04-23, but it shows in your view. How? Q2: How would you handle DELETEs on the source system? Q3: As we approach Day 30, or Day 365, etc., the main EXT table becomes huge. Is there a way to 'reset' that base table at some point so it doesn't grow every time?
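On Q3, one pattern (not from the video; the table and view names below are assumed) is to periodically compact the ever-growing external location: materialize the deduplicated view as a snapshot and reload the base from it, so superseded rows are physically dropped. A minimal sketch:

```sql
-- Hypothetical periodic compaction. emp_latest is assumed to be the
-- dedup view that keeps max(modified_date) per empid.
CREATE TABLE inc_table_compact AS
SELECT * FROM emp_latest;

-- Reload the base from the compacted snapshot (or repoint the external
-- table's LOCATION at the snapshot directory instead).
INSERT OVERWRITE TABLE inc_table
SELECT * FROM inc_table_compact;
```

Deletes (Q2) still need an explicit delete feed or a MERGE on an ACID table; see the sketch further down the thread.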
@nareshj6370 3 years ago
I have the same question :)
@aa-kj9zm 4 years ago
This seems like real-time work. I am learning Hadoop but had lost my way because I am not taking any training. This is very helpful. I will check out all your videos. Thanks for this awesome video.
@sunshinemoon922 2 years ago
Awesome video, sir. Very useful for interviews. Thank you very much.
@RaviKumar-uu4ro 5 years ago
Tons of thanks for your valuable videos. Really marvelous and incomparable to any other.
@christiandave100 3 years ago
Why the extra subquery t2? We can remove the second subquery, e.g. select t1.* from (select * from inc_table) t1 join (select empid, max(modified_date) max_modified from inc_table t2 group by empid) s on t1.empid = s.empid and t1.modified_date = s.max_modified
@ririraman7 2 years ago
Brilliant video! Much needed... to the point!
@Sagar-gi5zq 2 years ago
What if we don't have a modified_date column...?
@gauravpathak7017 4 years ago
Just wow! This is the best explanation of incremental load in Hive that anyone can get. Cheers :)
@ramkumarananthapalli7151 1 year ago
Quite useful!! Thank you for making it 💐💐
@hemanthreddykolli 3 years ago
This video is very helpful for understanding the CDC concept. Thanks for sharing your knowledge.
@subramanianchenniappan4059 5 years ago
I am a Java developer with hands-on Hadoop experience. I will watch all your videos. Thanks for your help!
@svdfxd 5 years ago
I am preparing for big data interviews, and an interview series like this is really helpful. Please add Spark interview questions as well. The way you explain patiently, with examples, is really good.
@GKCodelabs 5 years ago
Sure! My interview series will cover a wide range of interview questions across all big data technologies. Hive, Spark, HBase, and data warehousing concepts will be a major part of them, as these are the skills most in demand in interviews. #KeepWatching :)
@adityapratapsingh7649 3 years ago
Thanks for the detailed video. I have one question: we can do the same with window functions, right, like using row_number()? So which approach is the more optimized one? select * from (select *, row_number() over (partition by id order by modifiedDate desc) as rk from v1) a where rk = 1
@dhivakarsathya3918 3 years ago
I would prefer the group by and inner join that GK used, which runs much faster than window functions in Hive. Better to use sqoop incremental import if possible; otherwise the HDFS storage size will become massive and your view will take a long time to process.
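For reference, here is a cleaned-up sketch of the group-by-plus-join dedup view being compared in this thread (the inc_table, empid, and modified_date names are assumed from the discussion, not copied from the video):

```sql
-- View that keeps, for each empid, only the row carrying the latest
-- modified_date across all daily files under the external table.
CREATE VIEW IF NOT EXISTS emp_latest AS
SELECT t1.*
FROM inc_table t1
JOIN (
  SELECT empid, MAX(modified_date) AS max_modified
  FROM inc_table
  GROUP BY empid
) s
  ON t1.empid = s.empid
 AND t1.modified_date = s.max_modified;
```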
@debatrii 4 years ago
Very good, well explained, thanks.
@sourav7413 3 years ago
Thanks for your good, informative video...
@ajinkyahatolkar6518 2 years ago
Quite an informative video. Thanks!
@puneetbhatia 2 years ago
Explained amazingly. Thank you so much!!
@bobbyvenkatesan3657 4 years ago
Thanks for sharing these kinds of videos. Very helpful.
@narasimharao3665 4 years ago
If we do it like this, duplicate records will accumulate in the underlying files every time, the file size will grow enormously, and whenever we run the view, its subqueries will also hurt performance. Instead, we can use sqoop (the sqoop incremental option) to import only the incremental data into an HDFS directory or cloud storage (like AWS S3).
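For context, a sqoop incremental pull of this shape would look roughly like the sketch below; the JDBC URL, credentials, and column names are placeholders, not details from the video:

```bash
# Import only rows whose modified_date is newer than the last run,
# merging updates into the existing HDFS copy on the primary key.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/hr \
  --username etl_user -P \
  --table employees \
  --target-dir /data/employees_inc \
  --incremental lastmodified \
  --check-column modified_date \
  --last-value "2019-04-22 00:00:00" \
  --merge-key empid
```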
@bsrameshonline 4 years ago
Very good explanation of incremental loading.
@rajnimehta9189 3 years ago
Awesome explanation
@naveenvinayak1088 4 years ago
Well explained.. helping others without expectations.
@tallaravikumar4560 2 years ago
Good explanation, but your text editor is hard to read; at least black with a white font would have been clearer. Also, what if the modified date is not updated?
@udaynayak4788 2 years ago
Thank you so much for the detailed explanation.
@sumitkumarsahoo 4 years ago
Thanks a lot! I was actually looking for something like this for loading incremental data.
@arunkumar-th8vy 4 years ago
Can you please help me? If we don't have any date column, then after loading day 2 into my history table (day 1), how do I ensure it doesn't contain any duplicates?
@the_high_flyer 4 years ago
Super... thanks a ton for your video.
@astropanda1623 1 year ago
Very good explanation
@Sumit261990 4 years ago
Nice video, really useful. Thanks a lot.
@akshaychoudhari5641 2 years ago
How is incremental load in Hive different from the incremental load we do with sqoop? Can you explain?
@bigdatabites6551 2 years ago
Great job, bro....
@abhiganta 5 years ago
Hi sir, what is the need to create t2? We can directly query (select empid, max(modDate) from inc_table group by empid) s and then join t1 and s? Please correct me if I'm wrong.
@vermad6233 2 years ago
Same question from my side!
@ajaythedaredevil7220 5 years ago
1. Can we use a merge statement for simplification? 2. What if an employee id has been deleted in the new data set and we no longer want it in our final table? I can see the join will keep the removed employee id as well. Many thanks!
@abhiganta 5 years ago
I have the same doubt.. What if some records are deleted from the source DB and we need to remove those records in Hive?
@arindampatra6283 4 years ago
That would mean the new data has all the employees' info, and you can simply filter on the latest date 😊
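On the merge question above: Hive 2.2+ does have a MERGE statement, but it requires the target to be a transactional (ACID, ORC) table, and deletes only work if the feed marks them somehow. A rough sketch under those assumptions (all table, column, and flag names here are hypothetical):

```sql
-- target_emp: transactional ORC table (empid, name, modified_date).
-- daily_feed: staged increment with an op column ('D' marks deletes).
MERGE INTO target_emp t
USING daily_feed s
  ON t.empid = s.empid
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, modified_date = s.modified_date
WHEN NOT MATCHED THEN INSERT VALUES (s.empid, s.name, s.modified_date);
```

Without a delete marker in the source extract, the view-based approach has no way to distinguish a deleted row from an unchanged one.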
@saurav0777 4 years ago
Is this the implementation for SCD type 2 as well?
@kilarivenkatesh9844 3 years ago
Nice explanation.. bro
@NextGen_Tech_Hindi 8 months ago
Thanks man, you are amazing 😍❤❤❤
@tarunreddy5917 1 year ago
@GKCodelabs Is there any difference between incremental data and delta data?
@richalikhyani7204 3 years ago
Do you have any course playlist?
@vru5696 3 years ago
Awesome video... Any videos related to SCD and SCD revert in Hive? Please share the link.
@pravinmahindrakar6144 7 months ago
Thanks a lot. I think we can use the row_number window function to get the updated records, partitioning by emp_id and ordering by date desc, and finally filtering for row_number = 1.
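A sketch of that window-function variant (names assumed; the DESC matters so that row number 1 is the newest record per key):

```sql
-- Keep only the most recent row per empid.
SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY empid
                            ORDER BY modified_date DESC) AS rn
  FROM inc_table t
) ranked
WHERE rn = 1;
```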
@rakshithbs882 4 years ago
Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@manikandanl4909 3 years ago
We are not selecting all columns in the S alias subquery, so we join with the t1 alias to get all the columns.
@rajeshkumardash611 4 years ago
@GKCodelabs This may not work if the data has deleted records.
@anilpatil6783 5 years ago
Thank you, GK. This incremental data load is the basis of millions of ETL jobs. Thank you for such a pitch-perfect explanation. I have a question: how is the logic after 9:40 put into production? I mean, how is it actually made to run every day? Here I can see a view only. Is this view used to load data from staging to some other layer each day?
@rathnakarlanka2624 4 years ago
Thank you, GK. If we miss the incremental data extract a couple of times and we join on the max date, then there is a chance of missing records, right? So how do we overcome this problem?
@ANUKARTHIM 4 years ago
Thanks for the video. Good work. Looking for more videos on HBase and related topics: how regions work in HBase, how to define regions while creating an HBase table, and much more, bro.
@sumitkhandwekar6021 2 years ago
Amazing brooooo
@rohitaute9928 1 year ago
What if we need to maintain versions in HBase?
@sagarsinghrajpoot6788 5 years ago
I got this real-time case. Thanks :) Now we know how to handle incremental data, but do you have any video on a different use case, a data transformation use case in Hive (applied business transformations)? If yes, please tell me. I became a fan of yours, man. From now on I will also practice like this on my own....
@ravikumark6746 4 years ago
Great, sir.. thank you
@bhushanmayank 5 years ago
I have a doubt: how long are we going to store the daily files in HDFS? Don't you think the performance of the view is going to suffer as more CSV files accumulate in the HDFS location the view runs over? Is there any way to keep only the relevant records in a fresh file for Hive to process, and move the rest to cold storage?
@sagarsinghrajpoot6788 5 years ago
You are awesome, man ;) I liked your videos. I feel like I am watching Netflix, they are so easy to understand :)
@prashantahire143 5 years ago
Good explanation! How do you perform an incremental Hive load from HDFS for a partitioned table? The table does not have a date/timestamp column.
@GKCodelabs 5 years ago
Hi Prashant, thanks for your comment.. ☺️ We can use many other internal checkpoints in such cases. Thanks for sharing the scenario; I will surely explain this in one of my coming videos.. #KeepWatching
@arindampatra6283 4 years ago
I think you would have got your answer by now? If not, let's discuss. What is the partition column? How are you loading new data into that table?
@mahesh.h1b339 1 year ago
@GKCodelabs What type of join is this, bro?
@gandlapentasabjan9115 3 years ago
Hi bro, very helpful videos, thank you so much for sharing this with us. I have a small doubt: suppose we don't have a date column, then what do we do?
@junaidansari675 5 years ago
Very helpful. Please make videos about the other components and theory as well, and Hadoop admin job related videos...
@Kutub2005 3 years ago
Hello GK Codelabs, thanks for this awesome video. Would you please make a video on adding a column where the modified date will be reflected? The scenario is that I don't have a modified_date column in my existing Hive table, so if I want to use the strategy you have shown in this video, how do I add a modified_date column to the existing Hive table and the HDFS data?
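For the schema half of that question, Hive can append a column to an existing table definition; rows already on disk simply read it as NULL until they are rewritten with real values. A minimal sketch (table name assumed):

```sql
-- Existing files have no such field, so Hive returns NULL for it
-- until the data is reloaded with modified_date populated.
ALTER TABLE inc_table ADD COLUMNS (modified_date STRING);
```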
@MrManish389 5 years ago
Sir, explained simply and easily.
@naveengupta7268 4 years ago
Great explanation!
@naveenvinayak1088 4 years ago
Can you do a video about Kafka?
@ArunKumar-gw2ux 4 years ago
Good explanation! But this won't work on enterprise-scale data; it is not a scalable solution. For instance, if the incremental data is kept for 12 months and updates arrive every day, this deduping will take a long time to complete.
@kiranmudradi26 4 years ago
Can you please let us know a better solution for such scenarios? Thanks.
@seetharamireddybeereddy222 5 years ago
Can I know how to work with a staging table in Hive?
@zeeshan42007 4 years ago
Can you please share this Cloudera image? The one I downloaded from CDH is very heavy and I am not able to work with it.
@mahammadshoyab9717 5 years ago
I hope I'm not asking too much, but could you explain one end-to-end scenario, from pulling from Kafka to HDFS landing and Hive loading? It would be very helpful for people who are struggling to clear interviews.
@M-Fash0070 1 year ago
Please make one more video on RDBMS to Hive, maintaining history, updated data, and new data....
@deepikakumari5369 4 years ago
Nice explanation. Please upload a video on how to handle the many small files generated as Hive output.. Thank you :)
@gsp4420 3 years ago
Hi, if we don't have a sequence id and the CSV/table data contains duplicates, but the full combination of row values is unique, how do we do the incremental load in this situation? Thank you.
@gsp4420 3 years ago
And we won't get a load date, and there is no unique column in the source or target table.
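When there is no key, load date, or timestamp and only the full row is unique, one possible fallback (a sketch with assumed table names, not something covered in the video) is to collapse exact duplicates on the whole row:

```sql
-- With no surrogate key or modified_date, treat the entire row as the
-- identity and keep one copy of each distinct row.
INSERT OVERWRITE TABLE target_table
SELECT DISTINCT * FROM staging_table;
```

Note this cannot distinguish an update from a brand-new row; it only removes exact duplicates.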
@deepikakumari5369 4 years ago
Sir, will you please answer this? What approach should we take to load thousands of small 1 KB files using Hive? Do we load them one by one, or should we merge them and load them at once, and how?
@ririraman7 2 years ago
I believe Hive is not meant for small files!
@ravishankarrallabhandi531 6 months ago
How can we handle the case where source records are closed / deleted?
@arupanandaprasad2202 3 years ago
How do you do an incremental load in Spark?
@kumarraja4759 4 years ago
Nice explanation, but I have a question here. Is the final join query really required to pick the latest records? I believe selecting all the columns with max(modified_date) would give the desired output. Correct me if I'm wrong.
@manikandanl4909 3 years ago
Bro, when we aggregate by some column (here mod_date), we need to list all the other columns in the GROUP BY. If we have hundreds of columns, we would have to list every one of them. That's why it is joined back to the original table.
@ririraman7 2 years ago
@manikandanl4909 Thank you for clearing up the doubt, brother!
@arunkumarreddy9736 4 years ago
What if we don't have a date column? Can you please help?
@snagendra5415 2 years ago
Could you do a video on the small-files problem?
@The_Code_Father_v1.0 4 years ago
@GKCodelabs Can you please make some similar videos on PySpark, with use cases asked in interviews?
@sathishanumaiah6907 5 years ago
Could you please explain the same process with RDBMS data instead of files?
@routhmahesh9525 3 years ago
Can you please make a video on handling small files in Apache Spark?
@ambikaprasadbarik6400 4 years ago
Thanks a lot!
@rajeshreddy906 4 years ago
Hi, your videos are great. If you don't mind, could you please post a video on the sort-merge-bucket (SMB) join?
@ravikirantuduru1061 4 years ago
Can you create videos on Spark joins when the data is skewed, and on joining small data with large data: how to do the joins, and how Spark does sort-merge and shuffle joins?
@sudhakarsubramani1528 2 years ago
Thank you
@PramodKhandalkar5 3 years ago
Thanks, man
@rohitsotra2010 5 years ago
What if we don't have any date column like modified_date??
@suhaskolaskar552 4 years ago
You need to learn about slowly changing dimensions first; then you won't ask this question.
@kalyanis6886 3 years ago
Hello, can you share the VM image?
@haranadhsanka9699 5 years ago
Really appreciated. Can you also explain the Spark way of doing an incremental load? Thanks in advance.
@GKCodelabs 5 years ago
Sure, Haranadh, I will explain it in one of my coming videos. #KeepWatching ☺️
@rakshithbs882 4 years ago
@GKCodelabs Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@nlaxman5091 4 years ago
Hi bro, please do one video on how to choose memory, cores, and executors in a Spark cluster.
@mahammadshoyab9717 5 years ago
Hi bro, how do you perform an incremental load when the table has no primary key or datestamp columns? Thanks in advance.
@arindampatra6283 4 years ago
If there's no primary key, the data is gibberish 😊
@rakshithbs882 4 years ago
@arindampatra6283 Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@arunsakkumar8463 4 years ago
Please post the top interview questions for Hive.
@BigDataWithSky 1 year ago
Why are you not uploading videos regularly?
@vikky7480 5 years ago
Please make a video on accumulators and broadcast variables, as well as aggregateByKey(), with a good example.
@GKCodelabs 5 years ago
Awesome, Vicky, somehow you guessed what the next video is going to be about ☺️☺️. It's the very next video, the one you requested.. coming soon (in a couple of days) ☺️☺️☺️💐
@Shiva-kz6tn 4 years ago
16:10 You don't need to ask that :)
@swapnilpatil1422 2 years ago
I was asked this question twice...
@dineshughade6570 4 years ago
The screen needs to be clearer. I am barely managing to see your screen.