Comments
@alchemista2 10 days ago
Sorry, but you are way off on the MQ side: you can have persistent queues, ordering, multiple consumers, etc. This is incorrect information.
@isharkpraveen 10 days ago
This guy and his content are gold ❤❤❤... He explained it very beautifully in just 5 minutes.
@isharkpraveen 10 days ago
Please explain the Catalyst optimizer.
@adityakvs3529 29 days ago
Sir, I have some questions. Can I contact you?
@isharkpraveen 1 month ago
Simple and clean explanation 👍
@sky-i8d 2 months ago
Hadoop is generally for big data, and the default block size is 128 MB, so having such small files can significantly waste storage, since at least one block will be assigned to each file. Please correct me if I am wrong here.
@shishirkumar4932 3 months ago
Can we partition by a range of values?
@RishavKumar-n2o 3 months ago
Nice content.
@engineerbaaniya4846 3 months ago
Is it correct to say that, in Spark, partitioning will create multiple folders while bucketing will create multiple files?
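A minimal PySpark sketch of that difference, with made-up paths and column names: partitionBy writes one sub-directory per partition value, while bucketBy splits the output into a fixed number of files by hashing the bucket column.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
    df = spark.range(1000).withColumn(
        "country", F.when(F.col("id") % 2 == 0, "IN").otherwise("US"))

    # Partitioning: one folder per distinct value of the partition column,
    # e.g. /tmp/users_partitioned/country=IN/ and .../country=US/
    df.write.mode("overwrite").partitionBy("country").parquet("/tmp/users_partitioned")

    # Bucketing: a fixed number of files, split by a hash of the bucket column;
    # note that bucketBy requires saveAsTable because it needs table metadata
    df.write.mode("overwrite").bucketBy(8, "id").sortBy("id").saveAsTable("users_bucketed")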
@Khang-lt4gk 3 months ago
Question 1 at 3:15: issues with many small files on Hadoop.
- Resource utilization: each task is assigned to process the data in a single partition. Many small files -> many small partitions -> many tasks required -> many tasks queued -> frequent context switching -> high load on the driver node (for allocating and orchestrating tasks among executors and cores) -> higher possibility of driver OOM.
- Metadata: the .metadata file (responsible for storing the addresses of the compressed partition files) ends up with many key pairs to map -> low shuffle efficiency for almost every transformation.
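If useful, a minimal sketch of one common mitigation, compacting many small files into fewer larger ones; the paths and target file count are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    # Read a directory containing many small files (hypothetical path)
    df = spark.read.parquet("/data/events_small_files")

    # Rewrite as a small, fixed number of larger files; coalesce avoids a full
    # shuffle, while repartition(16) would rebalance the data more evenly
    df.coalesce(16).write.mode("overwrite").parquet("/data/events_compacted")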
@punpompur 3 months ago
Wouldn't it be possible for data in buckets to be skewed as well? Does the hash function ensure that each bucket will be the same size?
@keerthymganesh1495 3 months ago
Very clear explanation 😊
@isharkpraveen 3 months ago
He explained it well in just a 4-minute video.
@yoniperach 4 months ago
In your diagram you present a CA system, but no true CA system exists once there is more than one database node. That really confused me.
@anudipray4492 4 months ago
Docker is working fine on Windows Home edition, sir.
@gurumoorthysivakolunthu9878 4 months ago
The explanation is crisp and clear... Thank you... Please share the next part of this video...
@srinivasjagalla7864 4 months ago
Nice discussion
@Rohit-r1q1h 4 months ago
Can you post videos for data engineering interviews now, and also post question sets as well?
@pradhyumansinghmandloi8240 4 months ago
Can you make a video on the system design topics and techniques we should learn, or the ones you are going to cover in this series...
@adityakvs3529 4 months ago
How is the hash code decided?
@DataSavvy 4 months ago
The hash code is calculated using a hash algorithm.
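As a rough illustration of the idea, not Spark's exact internals: a hash partitioner maps each key to a bucket/partition by taking the key's hash modulo the number of partitions (Spark's RDD HashPartitioner uses the key's hashCode, and DataFrame bucketing uses a Murmur3 hash).

    # Simplified illustration of hash partitioning (not Spark's actual code)
    def assign_partition(key, num_partitions):
        # hash() stands in for the real hash function used by Spark
        return hash(key) % num_partitions

    for k in ["user_1", "user_2", "user_3"]:
        print(k, "->", assign_partition(k, 4))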
@jayantmeshram7370 5 months ago
I am trying to run your code on an Ubuntu system but am not able to. Could you please make a video on how to run your code on Ubuntu/Linux? Thanks.
@anudeepk7390 5 months ago
Did the participant consent to this being posted online? If not, you should blur his face.
@DataSavvy 5 months ago
Yes, it was agreed with the participants.
@roopashastri9908 5 months ago
How can we orchestrate this in Airflow? What should the schedule interval be?
@DeepakNanaware-ze8pq 6 months ago
Hey bro, your videos are good, but a video on how to handle the small file issue in a practical, hands-on way would be very helpful for us.
@khrest-lt6gj 6 months ago
I really like the pace of the videos. Great job! And thank you!
@biswadeeppatra1726 6 months ago
Please share the doc that you are using in this video
@VivekKBangaru 6 months ago
Very informative. Thanks, buddy.
@akashhudge5735 6 months ago
In the Lambda architecture, so far no one has explained how deduplication is handled when the batch- and stream-processed data are combined in the serving layer. Whatever data is processed by the streaming layer will eventually get processed by the batch layer, right? If that is true, then the data previously processed by the streaming layer is no longer required, so do we need to remove the data processed by the streaming layer?
@jasbirkumar7770 7 months ago
Sir, can you tell me something about Spark data for a housekeeping executive? I don't understand the word "Spark". The facility company JLL requires Spark experience.
@deepanshuaggarwal7042 7 months ago
"flatMapGroupsWithState" is a statefull operation? Do you have any tutorial on it?
@briandevvn 8 months ago
For whoever is wondering when to use groupByKey() over reduceByKey(): groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of the set of values for each key, we cannot use reduceByKey(), since median is not an associative operation.
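A minimal PySpark sketch of that median example, assuming each key's value list fits in executor memory (which is exactly the risk that comes with groupByKey()).

    from pyspark.sql import SparkSession
    import statistics

    spark = SparkSession.builder.appName("median-per-key").getOrCreate()
    rdd = spark.sparkContext.parallelize([("a", 1), ("a", 5), ("a", 3), ("b", 2), ("b", 8)])

    # Median is not associative, so reduceByKey cannot compute it incrementally;
    # groupByKey collects all values per key so they can be sorted.
    medians = rdd.groupByKey().mapValues(lambda vals: statistics.median(vals))
    print(medians.collect())  # e.g. [('a', 3), ('b', 5.0)] (order may vary)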
@sreekantha2010 8 months ago
Awesome!! Wonderful explanation. Before this I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.
@BishalKarki-pe8hs 8 months ago
vak mugi
@ldk6853 8 months ago
Terrible accent… 😮
@maturinagababu98 9 months ago
Hi sir, please help me with the following requirement:
+---+-----+
| id|count|
+---+-----+
|  a|    3|
|  b|    2|
|  c|    4|
+---+-----+
I need the following output using Spark: a a a b b c c c c
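One hedged way to produce that output, using the column names from the comment; passing a column as the array_repeat count assumes Spark 3.x, and F.expr("array_repeat(id, count)") is an alternative on older releases.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("repeat-rows-by-count").getOrCreate()
    df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

    # Build an array with `count` copies of id, then explode it into rows
    result = df.select(F.explode(F.array_repeat(F.col("id"), F.col("count"))).alias("id"))
    result.show()  # rows: a, a, a, b, b, c, c, c, c (row order may vary)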
@ramyajyothi8697 9 months ago
What do you mean by an application needing a lot of joins? Can you please clarify how the joins affect the architecture decision?
@suresh.suthar.24 9 months ago
I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.
@vipulbornare34 3 months ago
No, reserved memory and YARN overhead memory are not exactly the same, though they both deal with memory allocation in a Spark application.

Reserved memory: the memory Spark sets aside for internal operations, such as the data structures required for task execution, internal buffers, and other overheads used by Spark itself (alongside the execution, storage, and shuffle memory pools). It is typically not available for user tasks or for storing data in RDDs or DataFrames, and it is usually a fixed portion of the total executor memory.

YARN overhead memory: memory specifically allocated to account for overhead when running Spark on YARN. YARN manages the resources for distributed applications, and overhead memory covers things like the container's JVM overhead and any extra memory YARN needs to manage the container (logging, etc.). It is configured with the spark.yarn.executor.memoryOverhead property.

Both affect how much memory is available for your actual Spark jobs, but they refer to different kinds of overhead: one internal to Spark and one related to the YARN resource manager.
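A hedged example of where these settings live; the values are arbitrary, spark.yarn.executor.memoryOverhead is the older property name, and newer Spark releases use spark.executor.memoryOverhead.

    from pyspark.sql import SparkSession

    # Illustrative values only; tune these for your cluster
    spark = (SparkSession.builder
             .appName("memory-config-example")
             .config("spark.executor.memory", "4g")            # executor heap (execution + storage + user memory)
             .config("spark.executor.memoryOverhead", "512m")  # extra memory YARN adds to the container (off-heap, JVM overhead)
             .getOrCreate())
    # Reserved memory (roughly 300 MB per executor) is fixed by Spark itself and
    # is not exposed as a normal configuration setting, so it is not set here.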
@ahmedaly6999 9 months ago
How do I join a small table with a big table when I want to keep all the data from the small table? The small table has 100k records and the large table has 1 million records.
df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
It runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
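A hedged sketch using stand-in DataFrames: for a left outer join Spark can generally only broadcast the right, non-preserved side, which may be why a broadcast hint on smalldf has no effect. If the 1-million-row table is narrow enough to fit in memory, the hint can go on that side instead; otherwise let Spark fall back to a sort-merge join.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("small-big-left-join").getOrCreate()

    # Hypothetical stand-ins for the two tables described in the comment
    smalldf = spark.range(100000)    # column "id"
    largedf = spark.range(1000000)   # column "id"

    # Left outer join keeps every row of smalldf; the broadcast hint goes on the
    # right side (largedf) and only makes sense if that side fits in memory
    joined = smalldf.join(broadcast(largedf), on="id", how="left_outer")
    joined.explain()  # look for BroadcastHashJoin vs SortMergeJoin in the plan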
@naveena2226 9 months ago
Hi all, I just got to know about the wonderful videos on the DataSavvy channel. On the executor OOM / big partitions slide: in Spark, every partition is of block size (128 MB), right? Then how can a big partition cause an issue? Can someone please explain this? I'm a little confused here. Even if there is a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB each. Even if one of the partitions is larger, it cannot exceed 128 MB, right? So how does OOM occur?
@sheshkumar8502 10 months ago
Hi, how are you?
@praptijoshi9102 10 months ago
Amazing.
@adityakvs3529 10 months ago
I have a doubt: who takes care of task scheduling?
@kaladharnaidusompalyam851 10 months ago
If we maintain replicas of the data on three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate results from the copies of the data being executed? What is the mechanism in Hadoop that ensures only one copy of a block needs to be processed when there are two more duplicates?
@prathapganesh7021 10 months ago
Great content, thank you.
@RakeshMumbaikar 10 months ago
Very well explained.
@ayushigupta542 10 months ago
Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.
@Pratik0917 10 months ago
Then people aren't using Datasets everywhere?
@TarikaBhardwaj 10 months ago
Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.
@harshitsingh9842 10 months ago
Where is the volume?
@harshitsingh9842 10 months ago
Having a comparison (diff) table at the end of the video would be appreciated.