Sorry, but you are way wrong on the MQ side. You can have persistent queues, ordering, multiple consumers, etc. This is wrong information.
@isharkpraveen · 10 days ago
This guy and his content are gold ❤❤❤... In just 5 minutes he explained it beautifully...
@isharkpraveen · 10 days ago
Please explain the Catalyst optimizer.
@adityakvs3529 · 29 days ago
Sir, I have some questions. Can I contact you?
@isharkpraveen · a month ago
Simple and clean explanation 👍
@sky-i8d · 2 months ago
Hadoop is generally for big data, so the default block size is 128 MB. Having such small files can significantly waste storage, as at least one block will be assigned to each file. Please correct me if I am wrong here.
@shishirkumar4932 · 3 months ago
Can we partition by a range of data?
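In case it helps: Spark does expose range-based repartitioning in memory via repartitionByRange. A minimal sketch, where the column name and partition count are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-partition-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "2024-01-05"), (2, "2024-03-17"), (3, "2024-07-29")],
    ["order_id", "order_date"],
)

# Rows are split into partitions by sorted ranges of the key,
# unlike hash partitioning, which scatters values across partitions.
ranged = df.repartitionByRange(2, "order_date")

Note that on-disk partitionBy is value-based, so a "range" write usually means first deriving a coarser column (year, month, etc.) and partitioning on that.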
@RishavKumar-n2o · 3 months ago
Nice content.
@engineerbaaniya4846 · 3 months ago
Is it correct to say that, in Spark, partitioning creates multiple folders while bucketing creates multiple files?
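Roughly, yes. A minimal PySpark sketch of both write paths; the column names, bucket count, paths, and table name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
df = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "IN")], ["user_id", "country"]
)

# Partitioning: one sub-folder per distinct value,
# e.g. /tmp/partitioned_demo/country=IN/, /tmp/partitioned_demo/country=US/
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/partitioned_demo")

# Bucketing: a fixed number of files, split by a hash of the column.
# Note bucketBy only works with saveAsTable, not a plain path write.
df.write.mode("overwrite").bucketBy(4, "user_id").sortBy("user_id").saveAsTable("bucketed_demo")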
@Khang-lt4gk · 3 months ago
Question 1 at 3:15: issues with many small files on Hadoop.
- Resource utilization problem: each task is assigned to process the data in a single partition. Many small files -> many small partitions -> many tasks required -> many tasks queued -> frequent context switching -> heavy load on the driver node (which allocates and orchestrates tasks among executors and cores) -> higher chance of driver OOM.
- The .metadata file (responsible for storing the addresses of compressed partition files) ends up with many key pairs to map -> low shuffle efficiency for almost every transformation.
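A common mitigation, as a minimal PySpark sketch (the paths and the target file count are made-up placeholders): read the many small files, then rewrite them as fewer, larger ones.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("/data/events_small_files")   # hypothetical input path
(df.coalesce(16)                  # merge down to ~16 output files without a full shuffle
   .write.mode("overwrite")
   .parquet("/data/events_compacted"))                # hypothetical output path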
@punpompur · 3 months ago
Wouldn't it be possible for data in buckets to be skewed as well? Does the hash function ensure that each bucket will be the same size?
@keerthymganesh1495 · 3 months ago
Very clear explanation 😊
@isharkpraveen · 3 months ago
He explained it really well, in just a 4-minute video.
@yoniperach · 4 months ago
In your diagram you present a CA system, but no true CA system exists once there is more than one DB node. That really confused me.
@anudipray4492 · 4 months ago
Docker works fine on Windows Home edition, sir.
@gurumoorthysivakolunthu9878 · 4 months ago
The explanation is crisp and clear... Thank you... Please share the next part of this video...
@srinivasjagalla7864 · 4 months ago
Nice discussion.
@Rohit-r1q1h · 4 months ago
Can you post a video now for data engineering interviews, and also post question sets as well?
@pradhyumansinghmandloi8240 · 4 months ago
Can you make a video on the system design topics and techniques we should learn, or on what you are going to cover in this series?
@adityakvs3529 · 4 months ago
How is the hash code decided?
@DataSavvy · 4 months ago
The hash code is calculated using a hash algorithm.
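As a rough illustration: Spark's bucketing uses its internal Murmur3-based hash, so treat this as a sketch of the idea rather than the exact physical layout.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (42,)], ["id"])

# hash() is Spark's built-in Murmur3 hash; pmod keeps the bucket index non-negative.
df.withColumn("bucket", F.expr("pmod(hash(id), 8)")).show()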
@jayantmeshram7370 · 5 months ago
I am trying to run your code on an Ubuntu system but am not able to. Could you please make a video on how to run your code on Ubuntu/Linux? Thanks.
@anudeepk7390 · 5 months ago
Did the participant consent to this being posted online? If not, you should blur his face.
@DataSavvy · 5 months ago
Yes, it was agreed with the participants.
@roopashastri9908 · 5 months ago
How can we orchestrate this in Airflow? What should the schedule interval be?
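One possible shape for this, as a sketch only: the operator, connection id, job path, and daily schedule below are all assumptions, since the right interval depends on how often the upstream data lands.

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_job_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # placeholder; match it to your data's arrival cadence
    catchup=False,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/my_job.py",   # hypothetical job path
        conn_id="spark_default",
    )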
@DeepakNanaware-ze8pq · 6 months ago
Hey bro, your videos are good, but if you create a video on how to handle the small-file issue in a practical way, it will be very helpful for us.
@khrest-lt6gj · 6 months ago
I really like the pace of the videos. Great job! And thank you!
@biswadeeppatra1726 · 6 months ago
Please share the doc that you are using in this video.
@VivekKBangaru · 6 months ago
Very informative. Thanks, buddy!
@akashhudge5735 · 6 months ago
In the Lambda architecture, so far no one has explained how deduplication is handled when the batch and stream processing results are combined in the serving layer. Whatever data is processed by the streaming layer will eventually get processed by the batch layer, right? If that is true, then the previously processed streaming-layer data is no longer required. So do we need to remove the data processed by the streaming layer?
@jasbirkumar7770 · 7 months ago
Sir, can you tell me something about "housekeeping executive Spark data"? I don't understand the word "Spark". The facility company JLL requires him to have Spark experience.
@deepanshuaggarwal7042 · 7 months ago
Is "flatMapGroupsWithState" a stateful operation? Do you have any tutorial on it?
@briandevvn · 8 months ago
To whoever is wondering when to use groupByKey() over reduceByKey(): groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of the values for each key, we cannot use reduceByKey(), since median is not an associative operation.
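A minimal sketch of that median example, using toy data and the PySpark RDD API:

import statistics

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("median-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 5), ("a", 3), ("b", 2), ("b", 8)])

# reduceByKey cannot express this: the median needs all of a key's values at once.
medians = pairs.groupByKey().mapValues(lambda vals: statistics.median(vals))
print(medians.collect())   # e.g. [('a', 3), ('b', 5.0)]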
@sreekantha2010 · 8 months ago
Awesome!! Wonderful explanation. Before this I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.
@BishalKarki-pe8hs · 8 months ago
vak mugi
@ldk6853 · 8 months ago
Terrible accent… 😮
@maturinagababu98 · 9 months ago
Hi sir, please help me with the following requirement:
+---+-----+
| id|count|
+---+-----+
|  a|    3|
|  b|    2|
|  c|    4|
+---+-----+
I need the following output using Spark: a a a b b c c c c
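One possible approach, sketched with Spark SQL's array_repeat and explode; the toy DataFrame mirrors the table above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repeat-rows").getOrCreate()
df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

# array_repeat builds [id, id, ...] count times; explode turns it back into rows.
result = df.select(F.expr("explode(array_repeat(id, count)) AS id"))
result.show()   # a, a, a, b, b, c, c, c, c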
@ramyajyothi8697 · 9 months ago
What do you mean by an application needing a lot of joins? Can you please clarify how joins affect the architecture decision?
@suresh.suthar.24 · 9 months ago
I have one doubt: are reserved memory and YARN overhead memory the same? I ask because reserved memory also stores Spark internals. Thank you for your time.
@vipulbornare34 · 3 months ago
No, reserved memory and YARN overhead memory are not exactly the same, though both deal with memory allocation in a Spark application.
Reserved memory: this is the memory Spark sets aside for its own internal operations, such as the data structures required for task execution, internal buffers, and other overhead used by Spark itself. It is typically not available for user tasks or for storing data in RDDs or DataFrames. Reserved memory is usually fixed, with a portion of the total executor memory set aside for this purpose.
YARN overhead memory: this is memory allocated specifically to account for overhead when running Spark on YARN. YARN manages the resources for distributed applications, and overhead memory covers things like the container's JVM overhead and any other extra memory the container needs beyond the executor heap (native allocations, logging, etc.). It is configured with the spark.yarn.executor.memoryOverhead property (spark.executor.memoryOverhead in newer Spark versions).
Both affect how much memory is available for your actual Spark jobs, but they refer to different kinds of overhead: one internal to Spark, the other related to the YARN resource manager.
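To make the distinction concrete, a minimal sketch of where the configurable knob lives. The values are placeholders, not recommendations; the roughly 300 MB of reserved memory is hard-coded in Spark and has no public setting.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-demo")
    .config("spark.executor.memory", "4g")          # executor heap: execution + storage + user memory
    .config("spark.executor.memoryOverhead", "1g")  # extra container memory YARN grants beyond the heap
    .getOrCreate()
)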
@ahmedaly6999 · 9 months ago
How do I join a small table with a big table when I want to keep all the data from the small table? The small table is 100k records and the large table is 1 million records:
df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
It runs out of memory, and I can't broadcast the small df; I don't know why. What is the best approach here? Please help.
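One likely explanation, offered as a sketch rather than a definitive fix: Spark's broadcast hash join cannot broadcast the row-preserving side of an outer join, so hinting broadcast(smalldf) in a left join that must keep every smalldf row gets ignored. The toy DataFrames below stand in for the ones in the comment.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
smalldf = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "v"])
largedf = spark.createDataFrame([(1, 10), (3, 30)], ["id", "w"])

# Option 1: if the big side is narrow enough to fit in memory
# (1M slim rows often is), broadcast it instead of the small side.
df = smalldf.join(F.broadcast(largedf), "id", "left")

# Option 2: skip broadcasting and let the sort-merge join handle it;
# at 100k x 1M rows, OOMs usually point to skewed keys or too few
# shuffle partitions rather than sheer data volume.
df = smalldf.join(largedf, "id", "left")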
@naveena2226 · 9 months ago
Hi all, I just got to know about the wonderful videos on the DataSavvy channel. On the "executor OOM - big partitions" slide: in Spark, every partition is of block size only, right (128 MB)? Then how can a big partition cause an issue? Can someone please explain? I'm a little confused here. Even with a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB each. Even if one of the partitions is large, it cannot exceed 128 MB, right? So how does OOM occur?
@sheshkumar8502 · 10 months ago
Hi, how are you?
@praptijoshi9102 · 10 months ago
Amazing!
@adityakvs3529 · 10 months ago
I have a doubt: who takes care of task scheduling?
@kaladharnaidusompalyam851 · 10 months ago
If we maintain replicas of the data on three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate results from the copies of the data? How does it work: what is the mechanism in Hadoop that ensures only one replica of a block gets processed when two more duplicates exist?
@prathapganesh7021 · 10 months ago
Great content, thank you!
@RakeshMumbaikar · 10 months ago
Very well explained.
@ayushigupta542 · 10 months ago
Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.
@Pratik0917 · 10 months ago
Then why aren't people using Datasets everywhere?
@TarikaBhardwaj · 10 months ago
Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.
@harshitsingh9842 · 10 months ago
Where is the volume?
@harshitsingh9842 · 10 months ago
A comparison table at the end of the video would be appreciated.