System Design distributed web crawler to crawl Billions of web pages | web crawler system design

183,980 views

Tech Dummies Narendra L

5 years ago

Learn webcrawler system design, software architecture
Design a distributed web crawler that will crawl all the pages on the internet.
Question asked in most of the top company interviews like GOOGLE, FACEBOOK, and AMAZON
Let's learn how to build Google's spiderbot, or a Google-style distributed web crawler.
#crawlersystemdesign
#systemdesigntips #systemdesign #computerscience #learnsystemdesign #interviewpreperation #amazoninterview #googleinterview #uberinterview #micrsoftinterview
#crawler #webcrawler

Comments: 197
@sumonmal009 3 years ago
Estimation 5:30, HLD 6:33, queue management 25:30, update and duplicate handling 33:40, SimHash 39:26, storage 42:00
@arvindaaswani1303 4 years ago
Awesome explanation. As an engineer, I know how much hard work goes on behind the scenes. Really appreciate it 👏
@adamhughes9938 4 years ago
Makes me sad that this dude crams so much amazing content into these videos and gets 42k views but the dumbest 10 second videos get millions of views... I wish youtube had a notion of content score and quality.
@junjiechen7341 3 years ago
ikr! too much going on to be fully appreciated in his vids.
@warriorgeneral2735 3 years ago
Hey it totally depends on what people are interested in...
@Sarah-il5dr 4 years ago
Guys, please like the video. As an engineer, I know how much hard work goes into a video like this. This is my go-to system design resource. Great work!
@Sarah-il5dr 4 years ago
Goutam Singh well, some of the audience might be “engineer-to-be”😉
@ghisskartadchoo3618 1 year ago
Fake comments
@iitgupta2010 5 years ago
Finally you've started building the videos in the actual flow. That's really great, and it will really help viewers understand and build real knowledge of SD. Great, bro.
@ksenthu 3 years ago
The most detailed and clear content on crawler design I've seen. Thanks for doing this. It would be great if you could also clarify how data moves between the various services such as the Extractor, Duplicate Detection, URL Filter and Loader.
@petar55555 2 years ago
Great in-depth system design. The only part I would probably skip is the heap (each queue is already tied to a thread/worker), as it looks more like a bottleneck and serves only as a timer to slow down crawling for politeness, which can be done in different ways.
@aarushjuneja6640 1 year ago
I was also thinking on the same lines.
@NANDINIGOEL 7 months ago
Think of it this way: URLs go through priority-based queues first and then host-based queues, but once they are in the host queues you no longer know which host to handle first (the priority is lost). So the priority queue is filled with the first element of each back queue, and URLs are then downloaded based on priority while ensuring politeness. Merging k sorted arrays is a good pointer for this. If that is the doubt, there is no point in locking a thread to each queue, because then priority is per queue and not across all queues. Suppose one host has only priority-100 URLs and the others have priorities 1-99: why should the priority-100 host be prioritized? It should not be, unless we implement something similar to a nice call to raise priority over time and avoid starvation.
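A minimal sketch of the heap idea discussed in this thread, under stated assumptions: one FIFO back queue per host, and a min-heap keyed by the earliest time each host may be contacted again. The class name `BackQueueSelector`, the `delay_seconds` parameter, and the explicit `now` argument are hypothetical choices for illustration, not details taken from the video.

```python
import heapq
from collections import deque


class BackQueueSelector:
    """Toy selector: one FIFO per host, heap ordered by next allowed fetch time."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds  # assumed fixed politeness delay per host
        self.queues = {}            # host -> deque of pending URLs
        self.heap = []              # entries are (earliest_allowed_time, host)

    def add(self, host, url, now=0.0):
        # A host seen for the first time becomes eligible immediately.
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return (host, url) for the host that became eligible first, or None."""
        if not self.heap or self.heap[0][0] > now:
            return None  # nothing queued, or every host is still cooling down
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        if self.queues[host]:
            # Re-insert the host with its next allowed fetch time.
            heapq.heappush(self.heap, (now + self.delay, host))
        else:
            del self.queues[host]
        return host, url
```

In this sketch the heap is what lets a worker find the next eligible host in O(log n) instead of scanning every back queue, which is the trade-off the comments above are debating. Note one simplification: a host whose queue drains and is later re-added becomes eligible immediately, so a real crawler would need to remember the last fetch time separately.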
@monikaa8230 3 years ago
I have a suggestion to include two things in your videos which will definitely help: 1. QPS Calculation 2. Sharding key when we are planning to shard the DB
@TheMdaliazhar 28 days ago
Thanks for this. Most detailed design. No other youtuber explained exactly how the URL Frontier works.
@jessica-mx5pw 1 year ago
thank you for the video! this was by far the most helpful system design video walk through I've seen. I've been struggling a lot with system design. Thank you for putting this together!
@venjan21 3 years ago
Generally I don't post comments but this is one of the best system design (in detail) I have ever seen. It has re-kindled my thought process on how to think for a System Design question.
@howellPan 5 years ago
Great content.. appreciate the details and thoroughness!
@sampathkarupakula7647 5 years ago
making things clear and easier, thanks for your effort. I really appreciate your efforts.
@ajaypremshankar 5 years ago
It's not easy to make such an in-depth, content-rich video. Thank you Narendra :)
@prathashukla6596 4 years ago
Awesome explanation of all the high-level components. Good job
@PeterParker-vn2hv 2 years ago
Narendra, thank you for this excellent video. Much appreciated.
@adrianliu2817 5 years ago
You are the best! Enjoyed all of your system design videos!
@anastasianaumko923 1 year ago
Thank you for this elaborate design, great work!
@manojbgm 3 years ago
Awesome, knowledgeable. thank you for the video
@JM_utube 4 years ago
thank you so much for posting! i love your videos. i just got asked this in a facebook interview and i wish i had seen this video beforehand.
@akshaymonga 2 years ago
very nice n detailed video, thank you sir!
@pryansh_ 2 years ago
very informative, thanks
@heller166 3 years ago
This is going to be a lot of help for my distributed systems course :). Thanks for all the hard work.
@ashish0687 5 years ago
Thank you Naren. These videos are a great source of learning. I very much appreciate the detail, time, and effort on your part to build the content and share it. If possible, can you also please make a video about geohashing (and use cases around performing geospatial searches)...
@ragdoll2324 4 years ago
Very detailed discussion. Thanks for making this video.
@ShabnamKhan-cj4zc 3 years ago
Thanks a lot for explaining all the modules in a simple manner. Your channel is the place where one can stop and learn everything the easy way. Thanks a ton and keep doing this great work
@rhythmPhil 5 years ago
Thanks for your work. This was really interesting.
@theFifthMountain123 3 months ago
Had to watch multiple times to understand everything in the video. Thanks for the awesome explanation!
@aleeshaali7180 2 years ago
Best channel I have come across for learning about system design. Thank you, and keep it up. Kudos for the wonderful work!!!
@t4ruvk107 4 years ago
Thanks for your time, efforts and content.
@iitgupta2010 5 years ago
I really, really appreciate your effort, bro. Whoever asks me, I always suggest your name first. There are a few others like gkcs, but if you ask me they are nothing compared to your design skills. You really talk about the things that matter. This is something I have not found even in paid courses. In one word, this is awesome. You should have a lot of subscribers. They will come soon.
@harishaseri 4 years ago
Best explained. Thank you so much, Naren
@iitgupta2010 5 years ago
The word I crawled from this video is "basically", and I inverted-indexed it... lol (don't have that much time 😝). Great video as always
@impossible7434 3 years ago
such an amazing explanation, thank you very much, keep up the good work
@manmohanakash4222 3 years ago
This is the kind of teammate I would like to work with. So much content. Thanks for sharing
@argstutorial2916 3 years ago
Very nice conceptual explanations and use of tools. You have put a lot of energy into R&D. I hope this will help those who are seeking to develop their own systems for data processing / scraping mechanisms. Great work, keep it up, man.
@Akashkumar-md6rg 4 years ago
Thank you sir!! For such great content. Your videos are the most practical and interesting way to learn CS. You made me your fan, sir... I really appreciate your hard work. Keep going.🙌🙌
@alokuttamshukla 5 years ago
Thank you so much for these efforts. I mean, a 45-minute video with so much to grasp is no joke.
@TechDummiesNarendraL 5 years ago
I tried to make it short, but failed to do so
@alokuttamshukla 5 years ago
@@TechDummiesNarendraL No, I am in no way complaining at all. I loved it. I am so thankful to you for this.
@TechDummiesNarendraL 5 years ago
@@alokuttamshukla thanks
@readingsteiner6061 4 years ago
@@TechDummiesNarendraL In his Lettres Provinciales, the French philosopher and mathematician Blaise Pascal famously wrote: "I would have written a shorter letter, but I did not have the time." : ) Buddy, you're awesome. Keep up the good work. Wish you the best.
@JM_utube 4 years ago
after watching a lot of system design videos i really had to understand that this level of detail is NOT EXPECTED in an interview. i really stressed myself out trying to ask so many clarifying questions, and cover every single aspect of a system in a 45 minute block. this is not expected. remember - these videos are edited, shortened, rehearsed, and practiced. trust me when i say set a lower bar for yourself for interviews LOL thanks!!!
@SkyCityInc 2 years ago
This was a really, really excellent overview, thank you for putting this video together!
@utsavkapoor6069 6 months ago
Great explanation man. Loved your videos. Why have you stopped making these? Hope to see you back soon!!
@theranajayant 4 years ago
Hey Narendra, you have chosen quite an interesting topic, and it's interesting to learn. You are curating really good and valuable content.
@vishalmahavratayajula9658 4 years ago
Awesome video. Can't thank you enough, Narendra
@user-hj2lb8mg8o 4 years ago
Hi, really awesome videos, thanks!
@AyushRaj-so3zh 3 years ago
This was GOLD !! Amazing content
@spyros5528 1 year ago
Superb video, very helpful. Thank you.
@sayantanray9595 4 years ago
Helpful and detailed!!!
@chaitanyareddy9848 4 years ago
Dude, awesome job.
@pinkylover911 2 years ago
A lot of great effort has been put into your videos, thanks
@RealAbhishekSingh 3 years ago
wow, such great explanation, thank you :)
@elachichai 3 years ago
Definitely helpful ! Appreciate it Narendra!
@neoli8110 3 years ago
Why do you need a heap? It sounds like a bottleneck right there. Why can't the back queue selector use load balancing, like round robin, to select a queue and remove an item from it?
@w.maximilliandejohnsonbour725 4 years ago
Nice info...!!!!!.
@roooooot9545 4 years ago
Great work
@apurvasharma2853 4 years ago
Excellent explanation!
@helikopter1231 2 years ago
Wow such detail and explained so well! Thank you so much! You actually made it sound interesting haha - im not a huge fan of web stuff but this actually made me curious.
@aashnavaid6918 2 years ago
amazing video thank you so very much sir!!!
@vedant9173 3 years ago
Sir, thank you so much for these great lessons
@renon3359 3 years ago
Great video man. You deserve many more subscribers.
@PiyushSingh-vx7bx 4 years ago
Amazing explanation brother 🔥
@keshavKumar-le4df 3 months ago
Nice explanation.
@aliaksandrsheliutsin2374 1 year ago
Just have to say that it's amazing content. Keep it up, Narendra!
@hlibpylypets1333 2 years ago
Very detailed explanation - best ever :)
@SimranGupta-pz7nw 2 years ago
Thank you so much for the beautiful explanation :)
@Imkflow 2 years ago
Thanks for the work on this, very helpful. Quick note: I think if every processor needs to receive the same message, what you need is a topic instead of a queue.
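A minimal in-memory sketch of the distinction raised above, assuming the usual messaging semantics: a work queue hands each message to exactly one consumer, while a topic delivers a copy to every subscriber. `WorkQueue` and `Topic` are hypothetical illustrative classes, not the API of any particular message broker.

```python
from collections import deque


class WorkQueue:
    """Each published message is consumed by exactly one consumer."""

    def __init__(self):
        self.messages = deque()

    def publish(self, msg):
        self.messages.append(msg)

    def consume(self):
        # Whichever consumer calls first takes the message; others never see it.
        return self.messages.popleft() if self.messages else None


class Topic:
    """Each published message is delivered to every subscriber."""

    def __init__(self):
        self.inboxes = {}  # subscriber name -> private deque of messages

    def subscribe(self, name):
        self.inboxes[name] = deque()

    def publish(self, msg):
        # Fan out: every subscriber gets its own copy of the message.
        for inbox in self.inboxes.values():
            inbox.append(msg)

    def consume(self, name):
        inbox = self.inboxes[name]
        return inbox.popleft() if inbox else None
```

If, for example, a link extractor and a duplicate detector both need to see every fetched page, the topic shape fits; if several identical workers are sharing a load, the queue shape fits.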
@IdoKleinman 2 years ago
Good stuff! Thank you. One suggestion, for the next video, keep the information text slides on screen for more than 300ms...
@StormcastMarine 1 year ago
Thanks a lot for the video mate, really useful
@wellingtonrafaelbarrosamor4260 2 years ago
Awesome, very didactic
@Wei-up2jn 3 years ago
Great content! One question I have in mind is why we want to use one queue per host. Is it because the HTTP connection overhead is high if you connect to different hosts back and forth? But in reality the URLs coming from the front queues might be mixed across different hosts, e.g. a.com/a, b.com, a.com/c; in that case we still have to connect back and forth (assuming we only have one back queue), unless we could guarantee that all URLs from the same host arrive together at the back queue router.
@puravshah2342 5 years ago
Hi Naren, thanks for the awesome video, can you also make a video on designing distributed scheduling system
@TechDummiesNarendraL 5 years ago
Sure
@samirhere4341 5 years ago
Great video. Keep up the good work. Can you do a system design video on amazon fresh/getbojo/blue apron/plated/embrace box/trytheworld? The concept of how subscriptions and a continuous recurring delivery system work. Thank you
@stalera 3 years ago
Thanks a lot for making the effort to build this video. This was amazing. I learnt a lot from it. Just one question: why would you want to store the file content in compressed form? Is it used anywhere later? I couldn't find any mention of it.
@shreyasns1 2 years ago
@Narendra, Thanks for the video and detailed explanation. Could you also add the links to white papers you mentioned in the video description? This would help us to dive deep further to understand the concepts. Thanks again
@subee128 2 days ago
Thanks
@gouravkhanijoe1059 2 years ago
Nice
@shreyade5000 1 year ago
Nice content but long pause at 40:31, it distracts you if you are listening with concentration. Please edit it.
@augustoclaro 2 years ago
I have watched this video so many times in the past year that I'm almost quoting every word you say
@dharmendrabhojwani 5 years ago
awesome
@puneetpatwari 5 years ago
Nice video. I have 1 question. In the URL frontier, there is a heap. I want to know if the heap is stored at only 1 place and is thread-safe?
@pengli7213 3 years ago
What is the implementation of back queue? I don't think it's a Kafka queue, right? Or there might be too many topics. I guess it can be a key-value data structure, such as [domain_name, url, fetched(boolean)] ? Each time when we want to get a url from the "back queue", we just query the key-value and get a url which is not fetched ?
@DebasisUntouchable 4 years ago
Great video! Thanks for sharing! Can you please recommend a book where I can find such great examples of system design?
@nazmavazid9141 2 years ago
Very very nice sir
@forte9910 4 years ago
one question on front queues: if a site is newly added (perhaps as the result of being linked from a site previously crawled), will you leave it in the front queues forever and periodically crawl it?
@meetpatel5054 1 month ago
Instead of coupling back queues with threads, I would say have more threads for priority URLs and fewer for others. For this to work, we can handle politeness at the front queues, where we put the subsequent URLs in low-priority queues.
@nikhilagrawal8888 4 years ago
amazing
@kartik-agarwal 2 years ago
Kudos
@CODFactory 2 years ago
a) Why not use a graph DB instead of Bigtable or anything else?
b) Why do those envelope calculations, like 6 PB, when we never used them and never proved that the design will handle that amount of data?
c) We definitely should talk about how to make it distributed, since one crawler cannot crawl everything. How are we going to make sure that multiple crawlers are not crawling the same things?
d) How are we going to store these documents in different DBs, and what kind of sharding are we going to use?
I think those are some important things to talk about, especially when giving interviews.
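One common way to approach point (c) can be sketched as follows, assuming URLs are partitioned across crawler nodes by a hash of the host name so that a given host is always owned by a single node. The function name `crawler_for` and the `num_crawlers` parameter are hypothetical; this is an illustration under those assumptions, not the scheme used in the video.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url, num_crawlers):
    """Map a URL to a crawler index based on a stable hash of its host."""
    # urlsplit(...).hostname is already lower-cased by the standard library.
    host = urlsplit(url).hostname or ""
    # Use a cryptographic digest rather than built-in hash(), which is
    # randomized per process and would disagree between nodes.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Because every URL of a host lands on the same node, that node can also enforce politeness for the host locally. A plain modulo reassigns most hosts whenever `num_crawlers` changes; consistent hashing is the usual remedy if nodes are added or removed often.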
@parupatimadhukarreddy6972 5 years ago
Hi Narendra, I am a software developer who mainly works with JavaScript technologies. I saw the distributed systems videos on your channel; it is interesting to learn about the architectural side of the web space, and even newbies are able to understand the conceptual part of the subject. I appreciate your efforts. What technologies or tools do I need to learn or start with to get to know more about distributed systems? Thank you
@iitgupta2010 5 years ago
I think we should decouple the priority-based crawler from the normal crawler; otherwise, because of the back queue router, all low-priority URLs will starve and never get a chance to be crawled. We can have two or more systems responsible for crawling every minute or less (like the share market), every 5 minutes or 1 hour ... 1 day or week, up to 1 month. This way we can scale them very easily and manage them better. This also helps us build in politeness.
@vishalraut20 4 years ago
What is the purpose of Redis? if we are pushing the entries in the queue, what is the need of cache?
@rishabhnitc 2 years ago
As always excellent. just remove the music at 46 second mark :)
@Tony-cy2yr 4 years ago
Is someone knocking on the other side of the wall at 40:46? I saw you waiting for them to finish. :) BTW, a question: at 13:56, when will the obsolete persistent storage on the bottom right be cleared out?
@ramesh4joylife 1 year ago
It would have helped much better if you had gone through this entire thing with an example crawl from a scaled site
@chickentikkasauce1301 4 years ago
Heap is an implementation detail. Im being nit picky (this is a great video) but just some thoughts - Why does time stamp based priority even matter in this system? You didn’t mention that. It could be because you don’t want certain queues to get starved. A simpler approach might be to process each queue round robin and only mention the priority queue to your interviewer if they nudge you in that direction or if you want to slowly build to it to discuss trade offs. If each back queue has a priority, then just call out that we want a priority queue. You could say back queues have same priority but maybe other back queues dedicated to urls that we expect are updated at a faster rate have higher priority. But then you need a solution to the problem of other lower priority queues getting starved.
@psn999100 4 years ago
Great explanation. Yes, what I gather is that the "URL Frontier" essentially implements two things:
1. Priority selection -> front queues
2. Politeness guarantee -> back queues
The main issue we are looking at is how to pick the next URL from the "URL Frontier" microservice to send to a thread for processing. As you said, we could use a round-robin method where all back queues get picked from equally, or a kind of "weighted" method, i.e. a priority_queue-based solution, to make sure the hottest websites get crawled at tighter time intervals. I think it's always better to give the simplest approach first (i.e. just draw a black box tagged "Queue Selection") and deep dive later if the interviewer wishes. There is a saying in the system design world: "KISS" (Keep It Simple, Stupid). It's unlikely that you would run your interviewer out of questions, so it is better to nudge the interviewer in your direction of thinking by giving out the slightest of hints, so that they start asking the questions you already have the answers to.
@RAJESH2010able 5 years ago
Hi Narendra, will it be possible for you to do a video on 'Design Online food ordering service like Uber eats/doordash and explain how to integrate it with existing (Uber) ride-sharing service'?
@karllopes 4 years ago
What is the connection between the auxiliary table and the Heap? I know the table is used to store the host/back queue # info and the heap is used to store the back queue #/last time visited info. I do not see how the two fit together in the workflow. i.e. When is the table queried/Why do we need to store the host/back queue #? Is the table only used for the BackQueue Router to know which back queue to add the current URL?
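One possible reading of how the auxiliary table fits in, sketched under assumptions: there is a fixed pool of back queues, the table maps each active host to the index of its queue, and a queue is released for another host once it drains. `BackQueueRouter`, `route`, and `pop` are hypothetical names, and the heap of (time, queue index) entries that the selector would consult is left out here.

```python
from collections import deque


class BackQueueRouter:
    """Toy router: a host -> queue-index table over a fixed pool of back queues."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}               # the auxiliary table
        self.free = list(range(num_queues))   # queue indexes with no host assigned

    def route(self, host, url):
        """Append url to its host's queue; return the index, or None if no queue is free."""
        if host not in self.host_to_queue:
            if not self.free:
                return None  # caller would leave the URL in the front queues for now
            self.host_to_queue[host] = self.free.pop()
        idx = self.host_to_queue[host]
        self.queues[idx].append(url)
        return idx

    def pop(self, idx):
        """Take the next URL from queue idx and release the queue once it is empty."""
        url = self.queues[idx].popleft()
        if not self.queues[idx]:
            host = next(h for h, q in self.host_to_queue.items() if q == idx)
            del self.host_to_queue[host]
            self.free.append(idx)
        return url
```

In this reading the table is consulted only on the write path, when the router decides which queue a URL belongs to, while the heap is consulted on the read path, when a worker asks which queue may be served next.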
@ambermani1667 4 years ago
19:06 Why did we jump directly to the conclusion of using a bloom filter? Why can't a distributed hash table work to know whether a site is already crawled or not? It's not O(n). We can hash the URLs and shard them based on the hash, then search for the URL in the specific shard's hash table.
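For comparison with the hash-table approach in this comment, here is a minimal bloom filter sketch. The sizes (`num_bits`, `num_hashes`) and the use of seeded SHA-256 digests as the hash family are assumptions chosen for illustration. Lookups in both structures take constant time; what the bloom filter offers is a small fixed memory footprint, paid for with occasional false positives (and never false negatives).

```python
import hashlib


class BloomFilter:
    """Toy bloom filter for 'have we seen this URL?' checks."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array packed into bytes

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # One seeded digest per hash function (assumes num_hashes <= 256).
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely unseen; True means probably seen.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The two ideas also combine: a bloom filter can sit in front of a sharded hash table so that most unseen URLs are rejected without a network lookup.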
@nitinkulkarni7942 4 years ago
Naren, if size is a problem @45:00 and therefore NoSQL is not recommended, what about Redis, how can we still use Redis as cache for the same pages?
@mtsmithtube 2 years ago
@16:38 "make it a standard convention of converting it to a lowercase" - careful because URLs are case sensitive. Maybe your duplicate detector should do a case insensitive compare but you don't want to lose the original case when saving urls.
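A minimal sketch of the safer normalization this comment points toward, assuming only the case-insensitive parts of a URL (scheme and host) are lower-cased while the path and query keep their original case. `normalize_url` is a hypothetical helper; it also drops default ports and fragments, and it does not handle userinfo or IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lower-case scheme and host, drop default port and fragment, keep path case."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is returned lower-cased
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"  # keep only non-default ports
    # Path and query are left untouched because servers may treat them case-sensitively.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))
```

With this helper, `HTTP://Example.COM:80/Path/Page?Q=1#frag` and `http://example.com/Path/Page?Q=1` compare equal for duplicate detection, while `/Path/Page` and `/path/page` remain distinct.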
@NdubisiOnuora 3 years ago
Great explanation! You made it seem easy. How long did it take for you to understand the whitepapers for the Googlebot?
@scalechamp 1 year ago
It's not the way google bot works, it's the Mercator Web Crawler architecture
@RahulSathe.07 4 years ago
Hey Naren, awesome video. What would be a good (& scalable) way to keep track of duplicate URLs?
@NANDINIGOEL 7 months ago
Bloom filter / count min sketch