Sorry, but you are way wrong on the MQ side. You can have persistent queues, ordering, multiple consumers, etc. This is wrong information.
@isharkpraveen · 10 days ago
This guy and his content are gold ❤❤❤... In just 5 minutes he explained it beautifully...
@isharkpraveen · 10 days ago
Please explain the Catalyst optimizer.
@adityakvs3529 · 29 days ago
Sir, I have some questions. Can I contact you?
@isharkpraveen · a month ago
Simple and clean explanation 👍
@sky-i8d · 2 months ago
Hadoop is generally for big data, so the default block size is 128 MB. Having such small files can significantly waste storage, as at least one block will be assigned to each file. Please correct me if I am wrong here.
@shishirkumar4932 · 3 months ago
Can we partition by a range of data?
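In case it helps: Spark does expose range-based repartitioning in memory via repartitionByRange. A minimal sketch, where the column name and partition count are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-partition-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "2024-01-05"), (2, "2024-03-17"), (3, "2024-07-29")],
    ["order_id", "order_date"],
)

# Rows are split into partitions by sorted ranges of the key,
# unlike hash partitioning, which scatters values across partitions.
ranged = df.repartitionByRange(2, "order_date")

Note that on-disk partitionBy is value-based, so a "range" write usually means first deriving a coarser column (year, month, etc.) and partitioning on that.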
@RishavKumar-n2o · 3 months ago
Nice content.
@engineerbaaniya4846 · 3 months ago
Is it correct to say that, in Spark, partitioning creates multiple folders while bucketing creates multiple files?
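Roughly, yes. A minimal PySpark sketch of both write paths; the column names, bucket count, paths, and table name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
df = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "IN")], ["user_id", "country"]
)

# Partitioning: one sub-folder per distinct value,
# e.g. /tmp/partitioned_demo/country=IN/, /tmp/partitioned_demo/country=US/
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/partitioned_demo")

# Bucketing: a fixed number of files, split by a hash of the column.
# Note bucketBy only works with saveAsTable, not a plain path write.
df.write.mode("overwrite").bucketBy(4, "user_id").sortBy("user_id").saveAsTable("bucketed_demo")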
@Khang-lt4gk · 3 months ago
Question 1 at 3:15: issues with many small files on Hadoop.
- Resource utilization problem: each task is assigned to process the data in a single partition. Many small files -> many small partitions -> many tasks required -> many tasks queued -> frequent context switching -> heavy load on the driver node (which allocates and orchestrates tasks among executors and cores) -> higher chance of driver OOM.
- The .metadata file (responsible for storing the addresses of compressed partition files) ends up with many key pairs to map -> low shuffle efficiency for almost every transformation.
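A common mitigation, as a minimal PySpark sketch (the paths and the target file count are made-up placeholders): read the many small files, then rewrite them as fewer, larger ones.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("/data/events_small_files")   # hypothetical input path
(df.coalesce(16)                  # merge down to ~16 output files without a full shuffle
   .write.mode("overwrite")
   .parquet("/data/events_compacted"))                # hypothetical output path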
@punpompur · 3 months ago
Wouldn't it be possible for data in buckets to be skewed as well? Does the hash function ensure that each bucket will be the same size?
@keerthymganesh1495 · 3 months ago
Very clear explanation 😊
@isharkpraveen · 3 months ago
He explained it really well, in just a 4-minute video.
@yoniperach · 4 months ago
In your diagram you present a CA system, but no true CA system exists once there is more than one DB node. That really confused me.
@anudipray4492 · 4 months ago
Docker works fine on Windows Home edition, sir.
@gurumoorthysivakolunthu9878 · 4 months ago
The explanation is crisp and clear... Thank you... Please share the next part of this video...
@srinivasjagalla7864 · 4 months ago
Nice discussion.
@Rohit-r1q1h · 4 months ago
Can you post a video now for data engineering interviews, and also post question sets as well?
@pradhyumansinghmandloi8240 · 4 months ago
Can you make a video on the system design topics and techniques we should learn, or on what you are going to cover in this series?
@adityakvs3529 · 4 months ago
How is the hash code decided?
@DataSavvy · 4 months ago
The hash code is calculated using a hash algorithm.
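As a rough illustration: Spark's bucketing uses its internal Murmur3-based hash, so treat this as a sketch of the idea rather than the exact physical layout.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (42,)], ["id"])

# hash() is Spark's built-in Murmur3 hash; pmod keeps the bucket index non-negative.
df.withColumn("bucket", F.expr("pmod(hash(id), 8)")).show()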
@jayantmeshram7370 · 5 months ago
I am trying to run your code on an Ubuntu system but am not able to. Could you please make a video on how to run your code on Ubuntu/Linux? Thanks.
@anudeepk7390 · 5 months ago
Did the participant consent to this being posted online? If not, you should blur his face.
@DataSavvy · 5 months ago
Yes, it was agreed with the participants.
@roopashastri9908 · 5 months ago
How can we orchestrate this in Airflow? What should the schedule interval be?
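One possible shape for this, as a sketch only: the operator, connection id, job path, and daily schedule below are all assumptions, since the right interval depends on how often the upstream data lands.

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_job_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # placeholder; match it to your data's arrival cadence
    catchup=False,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/my_job.py",   # hypothetical job path
        conn_id="spark_default",
    )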
@DeepakNanaware-ze8pq · 6 months ago
Hey bro, your videos are good, but if you create a video on how to handle the small-file issue in a practical way, it will be very helpful for us.
@khrest-lt6gj · 6 months ago
I really like the pace of the videos. Great job! And thank you!
@biswadeeppatra1726 · 6 months ago
Please share the doc that you are using in this video.
@VivekKBangaru · 6 months ago
Very informative. Thanks, buddy!
@akashhudge5735 · 6 months ago
In the Lambda architecture, so far no one has explained how deduplication is handled when the batch and stream processing results are combined in the serving layer. Whatever data is processed by the streaming layer will eventually get processed by the batch layer, right? If that is true, then the previously processed streaming-layer data is no longer required. So do we need to remove the data processed by the streaming layer?
@jasbirkumar7770 · 7 months ago
Sir, can you tell me something about "housekeeping executive Spark data"? I don't understand the word "Spark". The facility company JLL requires him to have Spark experience.
@deepanshuaggarwal7042 · 7 months ago
Is "flatMapGroupsWithState" a stateful operation? Do you have any tutorial on it?
@briandevvn · 8 months ago
To whoever is wondering when to use groupByKey() over reduceByKey(): groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of the values for each key, we cannot use reduceByKey(), since median is not an associative operation.
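A minimal sketch of that median example, using toy data and the PySpark RDD API:

import statistics

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("median-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 5), ("a", 3), ("b", 2), ("b", 8)])

# reduceByKey cannot express this: the median needs all of a key's values at once.
medians = pairs.groupByKey().mapValues(lambda vals: statistics.median(vals))
print(medians.collect())   # e.g. [('a', 3), ('b', 5.0)]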
@sreekantha2010 · 8 months ago
Awesome!! Wonderful explanation. Before this I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.
@BishalKarki-pe8hs · 8 months ago
vak mugi
@ldk6853 · 8 months ago
Terrible accent… 😮
@maturinagababu98 · 9 months ago
Hi sir, please help me with the following requirement:
+---+-----+
| id|count|
+---+-----+
|  a|    3|
|  b|    2|
|  c|    4|
+---+-----+
I need the following output using Spark: a a a b b c c c c
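One possible approach, sketched with Spark SQL's array_repeat and explode; the toy DataFrame mirrors the table above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repeat-rows").getOrCreate()
df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

# array_repeat builds [id, id, ...] count times; explode turns it back into rows.
result = df.select(F.expr("explode(array_repeat(id, count)) AS id"))
result.show()   # a, a, a, b, b, c, c, c, c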
@ramyajyothi8697 · 9 months ago
What do you mean by an application needing a lot of joins? Can you please clarify how joins affect the architecture decision?
@suresh.suthar.24 · 9 months ago
I have one doubt: are reserved memory and YARN overhead memory the same? I ask because reserved memory also stores Spark internals. Thank you for your time.
@vipulbornare34 · 3 months ago
No, reserved memory and YARN overhead memory are not exactly the same, though both deal with memory allocation in a Spark application.
Reserved memory: this is the memory Spark sets aside for its own internal operations, such as the data structures required for task execution, internal buffers, and other overhead used by Spark itself. It is typically not available for user tasks or for storing data in RDDs or DataFrames. Reserved memory is usually fixed, with a portion of the total executor memory set aside for this purpose.
YARN overhead memory: this is memory allocated specifically to account for overhead when running Spark on YARN. YARN manages the resources for distributed applications, and overhead memory covers things like the container's JVM overhead and any other extra memory the container needs beyond the executor heap (native allocations, logging, etc.). It is configured with the spark.yarn.executor.memoryOverhead property (spark.executor.memoryOverhead in newer Spark versions).
Both affect how much memory is available for your actual Spark jobs, but they refer to different kinds of overhead: one internal to Spark, the other related to the YARN resource manager.
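To make the distinction concrete, a minimal sketch of where the configurable knob lives. The values are placeholders, not recommendations; the roughly 300 MB of reserved memory is hard-coded in Spark and has no public setting.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-demo")
    .config("spark.executor.memory", "4g")          # executor heap: execution + storage + user memory
    .config("spark.executor.memoryOverhead", "1g")  # extra container memory YARN grants beyond the heap
    .getOrCreate()
)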
@ahmedaly6999 · 9 months ago
How do I join a small table with a big table when I want to keep all the data from the small table? The small table is 100k records and the large table is 1 million records:
df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
It runs out of memory, and I can't broadcast the small df; I don't know why. What is the best approach here? Please help.
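One likely explanation, offered as a sketch rather than a definitive fix: Spark's broadcast hash join cannot broadcast the row-preserving side of an outer join, so hinting broadcast(smalldf) in a left join that must keep every smalldf row gets ignored. The toy DataFrames below stand in for the ones in the comment.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
smalldf = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "v"])
largedf = spark.createDataFrame([(1, 10), (3, 30)], ["id", "w"])

# Option 1: if the big side is narrow enough to fit in memory
# (1M slim rows often is), broadcast it instead of the small side.
df = smalldf.join(F.broadcast(largedf), "id", "left")

# Option 2: skip broadcasting and let the sort-merge join handle it;
# at 100k x 1M rows, OOMs usually point to skewed keys or too few
# shuffle partitions rather than sheer data volume.
df = smalldf.join(largedf, "id", "left")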
@naveena2226 · 9 months ago
Hi all, I just got to know about the wonderful videos on the DataSavvy channel. On the "executor OOM - big partitions" slide: in Spark, every partition is of block size only, right (128 MB)? Then how can a big partition cause an issue? Can someone please explain? I'm a little confused here. Even with a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB each. Even if one of the partitions is large, it cannot exceed 128 MB, right? So how does OOM occur?
@sheshkumar8502 · 10 months ago
Hi, how are you?
@praptijoshi9102 · 10 months ago
Amazing!
@adityakvs3529 · 10 months ago
I have a doubt: who takes care of task scheduling?
@kaladharnaidusompalyam851 · 10 months ago
If we maintain replicas of the data on three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate results from the copies of the data? How does it work: what is the mechanism in Hadoop that ensures only one replica of a block gets processed when two more duplicates exist?
@prathapganesh7021 · 10 months ago
Great content, thank you!
@RakeshMumbaikar · 10 months ago
Very well explained.
@ayushigupta542 · 10 months ago
Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.
@Pratik0917 · 10 months ago
Then why aren't people using Datasets everywhere?
@TarikaBhardwaj · 10 months ago
Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.
@harshitsingh9842 · 10 months ago
Where is the volume?
@harshitsingh9842 · 10 months ago
A comparison table at the end of the video would be appreciated.