8: Design a Web Crawler | Systems Design Interview Questions With Ex-Google SWE

  Views: 27,982

Jordan has no life

A day ago

They call me spiderman the way the ladies (euphemism for my friends) crawl into my web

Comments: 143
@bchandra72 10 months ago
The videos were highly beneficial for my FAANG system design interview. I purchased several paid system design courses, yet your videos surpassed them all.
@jordanhasnolife5163 10 months ago
That's really nice to hear, thanks a lot!
@knightbird00 1 month ago
Bookmarks
2:40 Capacity estimations / number of machines
4:00 Process overview
Talking points
6:45 Frontier evaluation (local vs. distributed)
9:25 Duplicate fetches, robots.txt, DNS (consistent hash with state), duplicate content (Redis)
Processing & storing
21:55 Fetch, parse, extract, add to frontier
22:50 Storage: S3, HDFS
Diagram 30:12
@panhejia 10 months ago
Best discussion of a crawler that I have seen. You abstract the whole HLD into frontier, content deduper, and URL deduper, and talk about each piece in detail: frontier strategies, content deduper DB/memcache selection, URL deduping, and partitioning. You even touched on the failure modes at the end. Nice job!
@jordanhasnolife5163 10 months ago
Thanks!! Glad you enjoyed :)
@maxvettel7337 4 months ago
Can't go through this without time codes. Thanks for the video, Jordan
0:35 Intro
1:31 Problem Requirements
2:31 Capacity Estimates
3:50 Process Overview
6:45 Fetching URLs to Crawl
7:28 Local Frontier Evaluation
8:48 Distributed Frontier
9:22 Avoiding Duplicate Fetches
10:02 Avoiding Duplicate Fetching, Optimized
11:05 Avoid Duplicate Content On Different Sites
12:43 Content Hash Checking
14:45 Content Hash Low Latency
16:55 Domain Name Service
19:22 Robots.txt
21:51 Fetching the URL
22:43 Storing Results
24:03 Pushing To The Frontier
26:43 Architectural Choices
30:11 Final Design
@mrsimo7144 4 months ago
People like you are a godsend. Thank you.
@maxvettel7337 4 months ago
@@mrsimo7144 thanks, I appreciate it, hope this helps someone in preparation
@jordanhasnolife5163 4 months ago
Thank you sir!
@dare2expressmonicakothari415 7 months ago
Excellent video and explanation. Every trade-off that you speak about is an invaluable nugget and made it super easy to understand why we should choose a particular strategy in design. One of the best channels on System Design !! Keep up the great work!
@jordanhasnolife5163 7 months ago
Thank you!!
@quirkyquester 2 months ago
Beautifully explained. I read the Grokking book and was still kind of confused, then I came to watch your video to get more depth and details. Thank you so much! Watching your video makes me want to pick up a Flink book to dive deep into how it works and what it's capable of. Thank you!
@2sourcerer 3 months ago
I find it a little easier to watch other videos about the same topic first because yours is more detailed. I have a little trouble grasping a lot of the details discussed before the main diagram was shown at the very end of the video. There are pieces here and there, but it's hard to see the big picture, and that makes it harder to understand. Maybe that's why a high-level diagram is usually recommended for the actual interview process. It makes things a little bit easier to grasp. Excellent content, thank you!
@vimalma1093 3 months ago
I would like to provide constructive criticism. This video is applicable for a new grad or maybe a mid-level candidate, but definitely not for Senior or Staff. Reasons below:
1) No distributed, scalable application would implement the approach discussed at 7:07. Distributed is the way to go. If a senior or staff candidate even mentions this approach, I would fail them.
2) At 10:06, if using Kafka or any distributed message queue, partition by the domain, and hashing/distribution is handled inherently. No need to implement it ourselves. Implementing consistent hashing of URLs (which in itself is a system design question) is not in scope of the web crawler design.
3) At 14:50, use multi-master Redis replication, which handles conflicts by itself. Besides, this is not a real-time, user-facing application that needs super low latency. It is a crawler running across many machines, and hitting a Redis cluster running across DCs should still take milliseconds. Mentioning CRDTs is fine, but using one here is overkill and a blunder in this use case. Implementing a CRDT across multiple URL processing nodes is error-prone, as the nodes may come up and go down depending on the queue size.
4) Again at 19:45, use a distributed cache like Redis and use a proper Redis client to cache contents locally with an expiry set. Storing that critical information on just the local machine is a blunder in a cloud environment where nodes come and go.
Overall a nice video, but a bit simplified and misleading at times.
@jordanhasnolife5163 3 months ago
I appreciate the critiques, especially from someone with a lot of experience like yourself. I'll provide the following responses.
1) I never claim this is the best approach. Nonetheless, it is still distributed. You start off in a bunch of places with a seed URL and continue onwards. Saying that you'll fail someone for even mentioning this seems completely unfair to me. In Leetcode interviews, you're often expected to rattle off a non-optimal solution before you optimize. I'm doing the same here. Nonetheless, this isn't a "mock" interview, nor do I claim it is. I reserve the right to go off on tangents, where you might not in an actual interview, because it's important to me that people understand the tradeoffs of different approaches in depth.
2) You have to provide a partition key. I know that Kafka does the partitioning under the hood, but people will inevitably ask how this is done, and so I think that for the sake of educating viewers I don't mind mentioning the thought behind how Kafka partitioning works. Certainly not proposing implementing it ourselves.
3) Regardless of whether you want to call it a CRDT, a multi-leader converging Redis key-value map is a "conflict-free replicated data type". It's the same idea. If I can avoid using a specific technology name and talk about the underlying reason behind it, I do so, again for educational purposes. I'm not quite sure how the nodes going up or coming down is really a big issue for using something like a CRDT; all they have to do is register themselves with some service discovery platform, and then they can easily communicate with one another. I do completely agree, though, that in reality using an existing database technology (like multi-master Redis) is the proper approach, unless you're really looking to optimize things. I do certainly agree with your point that not coupling the storage of these document hashes with the number of web crawlers is the correct approach in retrospect; that layer can and should scale independently.
4) Agreed.
Happy to hear other places where you feel this is misleading.
@vimalma1093 3 months ago
@@jordanhasnolife5163 I agree I was a bit harsh on failing a candidate. I am sorry. Appreciate your honesty, and keep making videos.
@jordanhasnolife5163 3 months ago
@@vimalma1093 Appreciate your feedback! Always nice to have a constructive conversation that doesn't get heated :)
@slover4384 4 months ago
The anti-entropy mentioned at 16:54 doesn't seem to be actual anti-entropy. Anti-entropy implies that data on each node should almost always be in sync but occasionally goes out of sync, so we detect the out-of-sync state with a "cheap" detection protocol, and only occasionally need to follow that up and fix misaligned state using an expensive operation. In this case, the data on each node is almost always way out of sync, so anti-entropy wouldn't be what we use. We'd literally just share state without any cheap detection mechanism. So maybe I'd call it "gossip-based sharing of hashed state" instead, because what you are really implying (I think) is that you will share hashed values between the nodes using an epidemic protocol rather than a broadcast or a centralized one. This is orthogonal to the entropy comment, which isn't really what is going on here.
@jordanhasnolife5163 4 months ago
It's not anti entropy in the dynamo sense, you're right. This is just normal CRDT behavior.
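A minimal sketch of the "normal CRDT behavior" referred to here, assuming the content hashes are kept in a grow-only set (a G-Set) on each node. The class name `GSet` and the sample hash strings are illustrative and not taken from the video.

```python
class GSet:
    """Grow-only set CRDT: elements are only ever added, and merge is set union."""

    def __init__(self, items=()):
        self._items = set(items)

    def add(self, content_hash: str) -> None:
        self._items.add(content_hash)

    def __contains__(self, content_hash: str) -> bool:
        return content_hash in self._items

    def merge(self, other: "GSet") -> None:
        # Union is commutative, associative, and idempotent, so replicas
        # converge regardless of the order in which they exchange state.
        self._items |= other._items

    def snapshot(self) -> frozenset:
        return frozenset(self._items)


# Two crawler nodes record hashes independently, then exchange state.
node_a = GSet()
node_b = GSet()
node_a.add("hash-1")
node_b.add("hash-2")
node_b.add("hash-1")

node_a.merge(node_b)
node_b.merge(node_a)
```

After the exchange both replicas hold the same two hashes, which is the convergence property the thread is discussing.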
@Adityaq3k 5 months ago
For HTML content dedup, instead of having our own CRDT implementation, we can run Redis in an active-active geo distributed topology. Thank you for the video!!
@meenalgoyal8933 9 months ago
Thank you for the video! :D For document content dedup, I was thinking of another option. Maybe have a Redis cluster partitioned by document location and still use a single leader. This way, the cache of checksums for document content from a given geographic location will stay close to that location, so we get more and faster cache hits. And in the background these cache partitions can sync up. What are your thoughts?
@jordanhasnolife5163 9 months ago
I think the main thing to note here is that websites with different URLs can have the same content
@jordanhasnolife5163 9 months ago
So location-based partitioning doesn't necessarily help us
@jiananstrackjourney370 4 months ago
Nice video! One question: what difference does it make if we instead have one bigger Kafka message queue and let the Flink instances consume from that one big queue, if the Flink instances and Kafka are in the same data center?
@jordanhasnolife5163 4 months ago
1) That queue has far more load on it 2) The flink instances have to read and discard far more messages.
@aprilli531 3 months ago
Two things I noticed:
1. For I/O-bound tasks, we can use async or just way more threads than the CPU thread count, because a lot of threads will be waiting on I/O anyway. So we probably need way fewer than 100 servers here.
2. DFS is less preferred but still doable. We don't need a stack; a worker can call a gRPC API for each child URL, and this API is served by the same set of workers. So it's still recursion, it just gets run by multiple workers instead of by multiple threads or sequentially in the same thread.
@jordanhasnolife5163 3 months ago
1) agree 2) what if you run out of workers?
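The first point (many in-flight fetches per machine for I/O-bound work) can be sketched with `asyncio`. This is only an illustration: the `fetch` coroutine simulates network latency with `asyncio.sleep` instead of making real HTTP requests, and the limit of 100 in-flight requests is an arbitrary assumption.

```python
import asyncio


async def fetch(url: str, limit: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches are in flight at once.
    async with limit:
        await asyncio.sleep(0.01)  # stand-in for waiting on the network
        return f"<html>{url}</html>"


async def crawl(urls, max_in_flight: int = 100):
    limit = asyncio.Semaphore(max_in_flight)
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch(u, limit) for u in urls))


urls = [f"https://example.com/{i}" for i in range(500)]
pages = asyncio.run(crawl(urls))
```

A single thread handles all 500 simulated fetches here because each one spends nearly all of its time waiting rather than using the CPU.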
@HrN-h1m 9 months ago
Hey Jordan, great and in-depth content. Just one quick suggestion : Can you please put a framework to every problem? Like, solve problems using the same framework so that we can draw a pattern. Something Like Functional req -> Non-Functional req -> Capacity Esti -> Data Model -> High Level -> Deep Dive -> Tradeoffs/Future etc. Also if you can explain using images and design flows more, it sticks better. I know this is extra work but it will really help us. Your content is really useful but a little verbose. Anyhow, it's really helpful and free, so no complaints :)
@jordanhasnolife5163 9 months ago
Hey I appreciate it! I try to generally follow this format actually, but sometimes I feel that it can be too rigid. For example, if I talk about one component, I may have to talk about other components before it makes sense to discuss tradeoffs. Thanks for the feedback, will try to take this into account.
@HrN-h1m 9 months ago
@@jordanhasnolife5163 Will it be okay for you to share your notes that you use during these videos?
@jordanhasnolife5163 9 months ago
@@HrN-h1m Yep! Going to do so once I finish up with the series
@erictsai1592 11 months ago
Hey great videos as always! Just wondering if you will talk about one popular design question ads aggregation in near real-time and batch processing some time?
@jordanhasnolife5163 11 months ago
Can you elaborate on this question? I'm not entirely sure what we're trying to design here.
@erictsai1592 11 months ago
@@jordanhasnolife5163 Sure! Given the requirements below, design an ad-click event aggregation system for online advertising, so that campaign managers can adjust bidding strategies, control the budget, change targeted audience groups, etc.
- Events are sent to a distributed message queue.
- Facebook or Google scale.
- Design for as much granularity as possible, with a max of 1 min.
- Events can arrive later than expected.
- Need to view both real-time and archival information.
- Two query requirements:
  1. Return the number of click events in the last M minutes for a particular advertisement.
  2. Return the top N ads clicked in the last M minutes.
@KannanMavila 7 months ago
Great content, as always! Thank you Jordan :) A small feedback: it would be nice to maintain a crawl depth with every url in the queue to avoid Crawler Traps (infinite loops).
@jordanhasnolife5163 7 months ago
Good idea!
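A rough sketch of the crawl-depth suggestion above, assuming a simple in-memory BFS frontier. The `get_links` callback, the `calendar_trap` link generator, and the depth limit of 3 are all hypothetical stand-ins for illustration.

```python
from collections import deque


def crawl(seed: str, get_links, max_depth: int = 3):
    """Breadth-first crawl that stores each URL's depth alongside it."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


def calendar_trap(url: str):
    # Hypothetical crawler trap: every page links to a brand-new "next" page.
    return [url + "/next"]


visited = crawl("https://example.com/cal", calendar_trap, max_depth=3)
```

Without the depth check this loop would never terminate on the trap, since URL dedup alone does not help when every generated URL is new.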
@reedomrahman8799 11 months ago
Yo, been watching your system design 2.0 vids. They have been a great supplement to DDIA. Have you made a video on how to do estimations in system design?
@jordanhasnolife5163 11 months ago
I have not, but I do them in just about every video if there's anything to be learned there haha. I don't know that there's a cut and dry way to do "estimations", I think that it's sort of problem specific.
@canijustgetanamealre 9 months ago
these are great breakdowns. I much prefer your explanations to what grokking has
@ReallyFrickinEasy 2 months ago
8:30 “DB, like how I am, Down Bad.” This guy is a national treasure!😂
@jordanhasnolife5163 2 months ago
I'm like Nic Cage
@JoseEGuerra 7 months ago
Great video as always, Jordan! Can't thank you enough for all the value you have been providing. On the topic of having nodes working at full capacity: hash range by hostname evenly distributes URLs across partitions, but some URLs may require more work than others to be fetched. This can be amplified if larger hostnames end up in the same partition. What would be your strategy so that, for the current URLs in the queue, we don't end up with idle nodes and nodes having more work than they can handle? Is there a feasible solution, considering you are keeping the same number of servers with similar capacity and still want to finish the task in one week?
@jordanhasnolife5163 7 months ago
Yeah this is definitely a problem for sure! I think in such a situation, you'd want a couple of nodes continuously pinging the flink nodes to see how backed up they are on their queues, and potentially modifying hashing configurations in zookeeper accordingly. This certainly isn't easy, as it requires querying some state from the backed up node to get it onto the idle node in order to run there!
@slover4384 4 months ago
About final architecture diagram: Where do you store the actual data about which URLs were processed, or failed to process, and which S3 results links go with each URL? I'd think there needs to be some datastore to maintain all this. Also, if a crawler dies, how do you recover the robots.txt information to prevent any new crawler from breaking the rate limits?
@jordanhasnolife5163 4 months ago
Oof jeez yeah lazy diagram. Will certainly have to be the case that we store S3 links to each URL. Technically, you could store those per flink instance and then access them directly, but agree that writing them to a DB upon completion is preferable in order to have views per crawler iteration. If a crawler dies, we ideally should have its state checkpointed in flink. This either means that we already have the robots.txt info pre-cached, or in the event that we don't, we just go fetch it again on the first hostname load.
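One way to picture the "pre-cached or fetch it again on the first hostname load" behavior is a per-host cache of parsed robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the `rules_for` helper, the `fake_fetch` function, and the sample robots.txt body are assumptions for illustration, and no network call is made.

```python
from urllib.robotparser import RobotFileParser

robots_cache = {}  # host -> parsed rules; in the design this state lives on the crawler node


def rules_for(host: str, fetch_robots_txt) -> RobotFileParser:
    """Return cached rules for a host, fetching and parsing them on first use."""
    if host not in robots_cache:
        parser = RobotFileParser()
        parser.parse(fetch_robots_txt(host).splitlines())
        robots_cache[host] = parser
    return robots_cache[host]


def fake_fetch(host: str) -> str:
    # Hypothetical robots.txt body standing in for an HTTP GET of /robots.txt.
    return "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"


rules = rules_for("example.com", fake_fetch)
allowed = rules.can_fetch("my-crawler", "https://example.com/public/page")
blocked = rules.can_fetch("my-crawler", "https://example.com/private/page")
delay = rules.crawl_delay("my-crawler")
```

If the cache is lost with the node, the next lookup for that host simply repopulates it, at the cost of one extra robots.txt fetch.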
@WINDSORONFIRE 6 months ago
Your video is the best for sure. Two questions. Wouldn't the load balancer become a single point of failure? Also, I don't see any kind of mechanism to prevent crawling a particular website too fast.
@jordanhasnolife5163 6 months ago
Yep - didn't draw it in the diagram but this is why we use an active-active or active-passive configuration for fault tolerance with load balancers. See my load balancing concept video for reference.
@WINDSORONFIRE 6 months ago
@@jordanhasnolife5163 Thx. How much would you charge, if you were willing, that is, to do a 1-hour video call to review one of my designs? I am not rich and I'm sure you're super busy. But I thought I'd ask.
@reddy5095 9 months ago
Shouldn't we use Cassandra to store the 1 MB files? HDFS is usually for large files, ~125 MB or more.
@jordanhasnolife5163 9 months ago
I think that's fair, or alternatively every 128 URL crawls you can aggregate the results into a file and upload to HDFS. Either way works.
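The aggregation idea in this reply can be sketched as a generator that groups crawled pages until a size target is reached. The `batch_pages` name and the 128 MiB default are assumptions; the example call uses a tiny 4 KiB target so it runs instantly.

```python
def batch_pages(pages, target_bytes: int = 128 * 1024 * 1024):
    """Group (url, body) pairs into batches of roughly target_bytes each."""
    batch, size = [], 0
    for url, body in pages:
        batch.append((url, body))
        size += len(body)
        if size >= target_bytes:
            yield batch  # one batch would become one large file upload
            batch, size = [], 0
    if batch:
        yield batch  # flush the final partial batch


# Ten 1 KiB pages with a 4 KiB target, just to show the grouping.
pages = ((f"https://example.com/{i}", b"x" * 1024) for i in range(10))
batches = list(batch_pages(pages, target_bytes=4 * 1024))
```

With roughly 1 MB pages and the default target, each batch would hold on the order of 128 crawl results, matching the number mentioned in the reply.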
@KarthikVasudevan27 7 months ago
Hey Jordan, thanks for the content. Very useful. Got a question, buddy: is there a need for the LB? Can't we just direct the messages to the respective Kafka topics directly, with a local hash function determining the node and topic?
@jordanhasnolife5163 7 months ago
You can, we just then risk inconsistency in the state of each publisher with respect to which partitions handle which keys. A single load balancer makes it so that this isn't a problem, which can be useful for Kafka, since pretty much any of our servers can publish there and otherwise they'd all have to listen to ZooKeeper.
@leocambridge 7 months ago
Could you explain how to send URLs from node to node? At 10:00 you said DB requires a network call, isn't node to node also a network call? If there are 10 nodes, then 1 node need to send urls to the other 9 nodes?
@jordanhasnolife5163 7 months ago
Fair point.
1) You're just putting incoming URLs into Kafka (using the built-in load balancer, partitioned on host ID), which I'd expect to be very fast since it's just writing to the end of a log.
2) The main thing here is to avoid duplicate work, hence why we want the same URLs being handled on the same node. If I can do this, I can avoid things like having to reach out to a DNS service each time that I look to load a link. If we're performing load balancing *anyways*, because we want to ensure that each node has equal amounts of load, then sending certain links to certain hosts based on the URL itself allows us to cache state locally on each processor node without having to reach out to a database to see if we've already read that URL.
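A minimal sketch of routing by host so that every URL for a given host lands on the same partition. It uses plain modulo hashing rather than consistent hashing or Kafka's own partitioner, and `partition_for` and the partition count of 100 are illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using a stable hash of its hostname."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Take 8 bytes of the digest as an integer; sha256 is stable across
    # processes, unlike Python's built-in hash() for strings.
    return int.from_bytes(digest[:8], "big") % num_partitions


p1 = partition_for("https://example.com/a", 100)
p2 = partition_for("https://example.com/b?page=2", 100)
p3 = partition_for("https://another.example.org/", 100)
```

Because `p1` and `p2` share a host, they always map to the same partition, which is what lets a node keep DNS results and robots.txt rules for that host in local memory.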
@chawlagarima 5 months ago
Thanks. Excellent explanation and video. I have one question: how will the URLs be sent to each processor? Will the load balancer assign the URLs based on a hash of the host, or would the system need a coordination service like ZooKeeper?
@jordanhasnolife5163 5 months ago
Both most likely. The load balancer listens to zookeeper basically
@Xyzhshshhdj 10 months ago
Hi @jordanhasnolife5163, your content is one of the best out there in system design. I have a few doubts regarding the architecture. Load balancer: is it just the Kafka cluster, where the partition key is the hash of the host and each Flink instance always reads the data from that particular partition? Won't that cause an issue once that particular Flink instance dies?
@jordanhasnolife5163 10 months ago
Hey! So if our flink node dies, another one can come up, and restore its state from the S3 checkpoint. Or an existing one can take over and grab the share of the keys from that kafka partition.
@Amin-wd4du 3 months ago
Why is S3 traffic going through the LB? Flink looks like overkill for this problem.
@jordanhasnolife5163 3 months ago
I basically use Flink as a way of avoiding the "what do you do if your stream consumer goes down" question that I'd get asked all the time if I didn't. I don't really know what makes it overkill, as your stream consumer is in fact prone to going down lmao. Probably a typo on my part as for the S3 thing.
@time_experiment 9 months ago
Why can't we just cache the domain name mappings? Earlier in the problem we use the assumption that we would index about 1B webpages. If we assume all those are uniquely hosted, then we can also map all of the IPs to domain name with about 8GB of memory.
@jordanhasnolife5163 9 months ago
DNS can change :) But yeah we basically are caching them until we get 404s
@time_experiment 9 months ago
@@jordanhasnolife5163 Thanks for your videos. Love the depth.
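The "cache until it stops working" idea from this exchange might look like the sketch below. `DnsCache` and `fake_resolve` are hypothetical; in practice the resolver would be something like a wrapper around `socket.gethostbyname`, and the caller would invoke `invalidate` after a failed fetch.

```python
class DnsCache:
    """Per-node hostname -> IP cache with explicit invalidation."""

    def __init__(self, resolve):
        self._resolve = resolve  # function that performs the real DNS lookup
        self._ips = {}

    def lookup(self, host: str) -> str:
        if host not in self._ips:
            self._ips[host] = self._resolve(host)
        return self._ips[host]

    def invalidate(self, host: str) -> None:
        # Called when a fetch against the cached IP fails, so the next
        # lookup goes back to DNS and picks up any changed record.
        self._ips.pop(host, None)


calls = []


def fake_resolve(host: str) -> str:
    # Stand-in resolver that returns a new documentation-range IP per call.
    calls.append(host)
    return f"203.0.113.{len(calls)}"


cache = DnsCache(fake_resolve)
first = cache.lookup("example.com")
second = cache.lookup("example.com")   # served from cache, no resolver call
cache.invalidate("example.com")        # e.g. after a failed fetch
third = cache.lookup("example.com")    # resolver is consulted again
```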
@seemapandey2020 9 months ago
How are we achieving prioritisation among different URLs here?
@jordanhasnolife5163 9 months ago
In this video, I don't talk about this. However, presumably, we'd need to build a custom distributed priority queue, which is really just a partitioned database index on whatever field it is that you care about. Ideally we only care about making a local index, and not global one though, or else that becomes the "top k" problem which is a whole separate video.
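A local (per-partition) priority frontier could be sketched with a binary heap, as below. This is an in-memory stand-in for the partitioned database index described in the reply, and the `LocalFrontier` class and numeric priorities are assumptions.

```python
import heapq
import itertools


class LocalFrontier:
    """Per-node priority queue of URLs; higher priority values pop first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = LocalFrontier()
frontier.push("https://example.com/low", priority=1)
frontier.push("https://example.com/high-1", priority=5)
frontier.push("https://example.com/high-2", priority=5)
order = [frontier.pop() for _ in range(len(frontier))]
```

Each node only orders the URLs it owns, which avoids the global "top k" problem mentioned in the reply.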
@tarunpahuja3443 5 months ago
How are we sending the data (URLs) from Flink to the LB? Is there a queue to send messages from Flink to the LB?
@jordanhasnolife5163 5 months ago
Nope, just a normal http request
@tarunpahuja3443 5 months ago
@@jordanhasnolife5163 Would it be a bad idea to send URLs to the LB through a Kafka queue, given that we are already using it?
@tarunpahuja3443 5 months ago
@@jordanhasnolife5163 Would it be a bad idea to use a Kafka queue over an HTTP request to send URLs back to the LB?
@IiAzureI 10 months ago
Is Kafka your load balancer? I don't see how Kafka is doing anything if it's completely local to each node. Doesn't the LB have to do the partitioning of the URLs? (ye I don't understand kafka, reading more now).
@jordanhasnolife5163 10 months ago
Kafka is itself a distributed system with many nodes. There's typically a zookeeper and load balancer that don't have to live on the same nodes as the queues.
@alphabeta644 5 months ago
At 14:27, we claim that with an 8 GB centralized Redis (storing hash(html)) we can solve the HTML dedupe problem (i.e. different URLs with the same HTML content). If that is a practical solution, why can we not have another 8 GB centralized Redis (storing hash(URL)) for the URL dedupe problem, to overcome the problem of the node-local frontier?
@alphabeta644 5 months ago
And another related question. At @17:09 the slide says too large to cache whole hostname to IP mapping. If the # of URLs we assume is ~1B, why is it not OK to cache the IP address of all these URLs, that would only be 32 bits per entry, so only 4GB total memory for 1B urls.
@jordanhasnolife5163 5 months ago
1) I'd prefer not to have a centralized redis, it's just the only solution there. It's easier to avoid a read to redis, throw the thing in the kafka queue, and just let the consumer instantly throw it out or something. Perhaps in practice redis is faster though as it leads to fewer kafka reads and writes. 2) I'm not sure how many URLs there are, perhaps it's completely cachable. That being said, the IP addresses corresponding to URLs can also change, so in that situation you'd have to go back to DNS.
@bogdax 8 months ago
If we partition data by host, how do we deal with hot partitions then? Some nodes could run forever, while other may be idle.
@jordanhasnolife5163 8 months ago
While I'd hope that no individual host is overly complex, I believe Kafka can perform dynamic load balancing, so while each host is limited to one consumer, a consumer with far fewer messages gets assigned more hosts.
@tarunpahuja3443 5 months ago
One humble request: can you please take one or two examples to explain the final flow of the design at the end, while drawing all the components together?
@jordanhasnolife5163 5 months ago
I can give it a shot
@priteshacharya 8 months ago
On your final design, Can you explain the path to write to S3. The diagram shows write from flink -> LB -> S3, but shouldn't the flink be able to write to s3 directly, why would it go to LB?
@jordanhasnolife5163 8 months ago
That's correct, I mainly had the load balancer there to represent how the crawling consumers would communicate with one another. I agree that we'd write straight to S3.
@777roadkill 5 months ago
Hi Jordan, if we partition by hash of the domain name, could this lead to 'hot' partitions? The node where the Twitter or Wikipedia domain is mapped could possibly have a lot more work to do than some other nodes in the system. Is this something to be worried about, or are we OK with some hotter partitions as a trade-off to enable caching of DNS and robots.txt? Thanks again!
@jordanhasnolife5163 5 months ago
Yeah it could - in theory you could split even further down the host domain if needed a.k.a facebook.com/a/* and facebook.com/b/* or move some of the hosts off of the high load flink node and just keep around the domain with a ton of pages
@777roadkill 5 months ago
@@jordanhasnolife5163 Thanks man! I think that could do the trick for us! I was also considering a 'hybrid' approach similar to the twitter example you shared. In this case if a website is 'popular' then we partition manually using some range based partitioning schema but use consistent hashing for all other websites. What do you think of this? Appreciate you!
@jordanhasnolife5163 5 months ago
@@777roadkill Seems reasonable to me! Just have to know what's popular in advance somehow..
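A sketch of the hybrid idea in this thread: ordinary hosts are keyed by hostname alone, while hosts known in advance to be hot are split further by their first path segment (in the spirit of the facebook.com/a/* example). The `HOT_HOSTS` set and the `partition_key` function are hypothetical, and splitting on path prefix is a simplification of the range-based scheme proposed above.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts known in advance to be very large.
HOT_HOSTS = {"en.wikipedia.org"}


def partition_key(url: str) -> str:
    """Return the key that would be hashed to choose a partition."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in HOT_HOSTS:
        # Spread a hot host across partitions by its first path segment.
        first_segment = parts.path.strip("/").split("/", 1)[0]
        return f"{host}/{first_segment}"
    return host


hot_key = partition_key("https://en.wikipedia.org/wiki/Web_crawler")
cold_key = partition_key("https://example.com/a/b")
```

The trade-off is that per-host state such as robots.txt rules and rate limits for a hot host is then spread over several nodes and has to be shared or duplicated.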
@booboo910301 5 months ago
The initial estimate of the concurrency needed seems to consider only synchronous behavior. If you use async, I'd imagine the number of nodes you need will be much lower.
@jordanhasnolife5163 5 months ago
Makes sense to me
@maxvettel7337 11 months ago
Good support for my preparation
@javaadpatel9097 2 months ago
Can a flink processor take multiple urls off the kafka queue to process in parallel? Otherwise wouldn't we need a ton of kafka partitions and flink consumers since doing the processing takes quite a while for content download, content parsing and content checking/storing?
@jordanhasnolife5163 2 months ago
I think you're basically making the case for mini batching here, look into spark streaming :)
@javaadpatel9097 2 months ago
@@jordanhasnolife5163 makes sense, can flink also do mini batching or is that a spark only thing?
@jordanhasnolife5163 2 months ago
@@javaadpatel9097 To tell you the truth I'm not sure off the top of my head, would recommend googling it
@javaadpatel9097 2 months ago
So flink can do mini batching as well 👍🏽. From a video I watched it also seems like it can be run on historical data to do analysis as well, which I think mini batching helps a lot with because you don't need the output in real-time.
@jordanhasnolife5163 2 months ago
@@javaadpatel9097 nice find!
@Hollowpoint321 8 months ago
Hi Jordan - just wondering how you would modify this system to ensure stuff is re-crawled with approximate accuracy? E.g. in your intro you said you'd be aiming to complete this within 1 week - if your crawler is running persistently, how would you enqueue sites to be re-crawled a week later, given that there is no message visibility equivalent in kafka, versus a traditional message broker?
@jordanhasnolife5163 8 months ago
Can you elaborate on this question? You'd typically just enqueue the same seed websites that you enqueued into kafka before (maybe we add a date_id or a crawl_id to show that this is a different iteration of the crawl).
@KENTOSI 11 months ago
Hey Jordan, this was excellent coverage of this interview question. I was asked this once and it didn't even occur to me to think about the robots.txt file. Nice work!
@jordanhasnolife5163 11 months ago
I appreciate it!
@ahmedkhaled7960 10 months ago
If the content of the web pages that we are crawling is text I think we can just use a relational database and have a UNIQUE index over the content column and call it a day
@jordanhasnolife5163 10 months ago
A couple things:
1) It's a lot of text, and comparing text for equality can only be done in O(n) time in relation to the length of the text (to be fair, it takes the same amount of time to generate all of those hashes, but at least then we don't have to store them).
2) To actually do the indexing and ordering over bigger pieces of text, we need to compare the two pieces of text, which takes longer if they are full documents.
3) More data to send over the network.
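To make the hash-based alternative concrete, here is a minimal sketch that stores a short fingerprint of each page instead of the full text. The in-memory `seen_hashes` set stands in for the Redis store discussed elsewhere in the thread, and truncating SHA-256 to 8 bytes is an assumption chosen so that 1B pages would need roughly 8 GB of raw fingerprints.

```python
import hashlib

seen_hashes = set()  # stand-in for a shared store such as Redis


def is_new_content(html: str) -> bool:
    """Return True the first time a page body is seen, False for duplicates."""
    # Hashing is still O(n) in the page length, but only 8 bytes are kept
    # and compared afterwards, instead of the full document.
    fingerprint = hashlib.sha256(html.encode("utf-8")).digest()[:8]
    if fingerprint in seen_hashes:
        return False
    seen_hashes.add(fingerprint)
    return True


page = "<html><body>Hello crawler</body></html>"
first_seen = is_new_content(page)
second_seen = is_new_content(page)  # same body served from a different URL
other_seen = is_new_content("<html><body>Different page</body></html>")
```

A truncated hash admits a small chance of false duplicates; whether that is acceptable depends on how costly it is to skip a page by mistake.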
@Thordata 21 days ago
Such a great video, but the background music was not needed.
@jordanhasnolife5163 20 days ago
there is 0 background music in this video lol
@sauhard12 6 months ago
It is not clear to me why DNS needs to be mapped to IP? I mean, my server can make HTTP request to the URL directly.
@jordanhasnolife5163 6 months ago
What do you think that does internally?
@sauhard12 6 months ago
@@jordanhasnolife5163 Yeah. I understand that. But are you trying to imply that if my server were to call the DNS which internally maps to an IP, the latency would be lower if I were to hit the IP directly? And to achieve this, we cache the DNS to IP map and use IP.
@jordanhasnolife5163 6 months ago
@@sauhard12 As far as I understand it yes, but let me know if I'm incorrect
@ShivangiSingh-wc3gk 6 months ago
I am randomly jumping from video to video and sometimes miss some things which I would know if I watched the videos in sequence, but I love your teaching style. I will eventually get to reading data intensive design, but I feel like I'm lazily grasping a lot. If you are in the DC area, hit me up. Would love to take you out to lunch.
@0xhhhhff 11 months ago
Chaddest system designer i know
@jordanhasnolife5163 11 months ago
Big designs, small...well you know
@jshaikh3897 11 months ago
Is this a good video to start for total beginners in sys design...?
@jordanhasnolife5163 11 months ago
I'd probably start from episode 1 of my systems design concepts 2.0 series.
@LeoLeo-nx5gi 11 months ago
Hi Jordan, awesome indepth explanation, thanks!!
@jordanhasnolife5163 11 months ago
Thanks!!
@felixliao8314 4 months ago
great stuff buddy. thanks for doing this!
@ShortGiant1 4 months ago
Top notch! Although I’m not sure the 1B number is accurate. For example just Wikipedia has about 60M websites! Not just trying to find a fault in your video, just trying to reason about the design. With such a high number, it might not be possible to crawl all of Wikipedia from just 1 host..
@jordanhasnolife5163 4 months ago
Yep! Definitely the case that for certain hosts we might have to repartition them even further!
@yrfvnihfcvhjikfjn 11 months ago
Are you using a bath towel in your kitchen?
@jordanhasnolife5163 11 months ago
Nope I just have a bunch of hand towels lol
@LouisDuran 7 months ago
I was asked this question from Amazon a couple years ago... and failed because I hadn't watched this video first. 😞
@jordanhasnolife5163 7 months ago
Only way to improve is to fail first :)
@ramannanda 6 months ago
Also, things will be I/O bound, so CPU estimates are not that important.
@jordanhasnolife5163 6 months ago
Suppose it depends on the task actually, most databases these days are CPU bound funny enough. But yeah hard to say empirically really without doing some profiling.
@mridulsetia4974 11 months ago
Great content, just one suggestion. Please type out notes instead of writing
@jordanhasnolife5163 11 months ago
I think I'm in a bit too deep for this one but I appreciate the feedback!
@AlbertLeng 11 months ago
At around 3:12, how do you determine 500 threads are needed?
@AR-jx3cs 8 months ago
@@AlbertLeng It's based on the assumption that an average web-page takes 1/3 of a second to load.
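The arithmetic behind a figure like 500 threads can be sketched as follows. The 1B page count and the 1/3-second load time come from this thread; the one-week crawl window is an assumption made here purely for illustration, and `threads_needed` is a hypothetical helper name.

```python
import math


def threads_needed(total_pages: int, window_seconds: float, seconds_per_page: float) -> int:
    """Threads required if each thread fetches one page at a time.

    Required throughput is total_pages / window_seconds pages per second,
    and each in-flight fetch occupies a thread for seconds_per_page.
    """
    pages_per_second = total_pages / window_seconds
    return math.ceil(pages_per_second * seconds_per_page)


ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # assumed crawl window, not from the thread
estimate = threads_needed(1_000_000_000, ONE_WEEK_SECONDS, 1 / 3)
print(estimate)  # lands in the same ballpark as the 500 quoted above
```

Under these assumptions the crawl needs roughly 1,650 pages per second, and at about three pages per second per thread that works out to a little over 500 threads.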
@harshjha6774 3 months ago
THANKS CRACKED A FAANG BECAUSE OF U AND ME
@jordanhasnolife5163 3 months ago
Congrats!!! Good luck in the new role!
@GeorgeDicu-hs5yp 10 months ago
You don't talk about costs in any of these videos. Cost is a very important aspect which can redefine the entire solution. But good video, you bring new dimensions to my thought process.
@jordanhasnolife5163 10 months ago
Duplicate comment, responded to the last one lol
@emenikeanigbogu9368 4 months ago
Jordan I got you at +2.5
@sachinfulsunge9977 5 months ago
Bro has no control over intrusive thoughts
@jordanhasnolife5163 5 months ago
No sir
@maxmanzhos8411 10 months ago
GOAT
@joealtona2532 10 months ago
Your choice of Flink is questionable. It's overkill for simple URI fetching and not really practical for JavaScript-driven websites. The crawler worker must be more sophisticated and carry a sort of headless browser, Python libs like Beautiful Soup, etc. Wrapping this in a Docker image and running it on Kubernetes or another container orchestrator might work better.
@jordanhasnolife5163 10 months ago
I don't necessarily see why the two can't be used in conjunction, to be honest. I think that you make a good point regarding needing certain libraries, but if we don't run those directly within Flink, I think that having a stateful consumer that guarantees message replay still makes it a viable job orchestrator for a bunch of Docker images deployed on Kubernetes (as you mentioned). Appreciate the comment!
@jordanhasnolife5163 10 months ago
From my understanding of actual web-crawling-as-a-service companies, much of the work is done via a bunch of proxies (unfortunately on user devices, without even explicit consent) in order to avoid violating robots.txt policies and/or to seem like a genuine user to the site.
@canijustgetanamealre 9 months ago
@@jordanhasnolife5163 this is correct
@ronin4518 6 months ago
I don't understand what using Flink has to do with the crawler's implementation or even JavaScript. If you watched the video, you'd know Flink was used for resilience, i.e. local state management that helps an instance recover and continue consuming from Kafka from the last known state. It has very little to do with how the worker fetches page content.
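The recovery behaviour described above (snapshot the local state together with the log position, then resume from the last snapshot after a crash) can be illustrated with a toy model. This is plain Python, not the Flink or Kafka API: `Checkpoint`, `CrawlWorker`, and the in-memory list standing in for a Kafka partition are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Snapshot of a worker: next log offset to read plus its local state."""
    offset: int = 0
    seen: frozenset = frozenset()


class CrawlWorker:
    def __init__(self, log, checkpoint=None, checkpoint_every=2):
        self.log = log  # stand-in for a replayable Kafka partition
        restored = checkpoint or Checkpoint()
        self.offset = restored.offset
        self.seen = set(restored.seen)  # local state, e.g. URLs already handled
        self.checkpoint_every = checkpoint_every
        self.last_checkpoint = restored

    def run(self, max_messages=None):
        """Consume from the current offset; return the latest checkpoint."""
        processed = 0
        while self.offset < len(self.log):
            if max_messages is not None and processed >= max_messages:
                break  # simulate the worker dying mid-stream
            self.seen.add(self.log[self.offset])  # placeholder for fetch + parse
            self.offset += 1
            processed += 1
            if self.offset % self.checkpoint_every == 0:
                self.last_checkpoint = Checkpoint(self.offset, frozenset(self.seen))
        return self.last_checkpoint


log = ["a.com", "b.com", "c.com", "d.com", "e.com"]
crashed = CrawlWorker(log)
checkpoint = crashed.run(max_messages=3)   # dies after 3 messages, snapshot is at offset 2
recovered = CrawlWorker(log, checkpoint)   # restart from the last snapshot
recovered.run()                            # replays "c.com", then finishes the log
```

Because state and offset are snapshotted together, the replayed message is applied to state that has not yet seen it, so the worker ends up consistent without caring how the page fetch itself is implemented.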
@AP-eh6gr 9 months ago
my IQ shot up to 145 watching this content 😏
@zen5882 9 months ago
something else of mine shot up