web crawler system design
description
design a web crawler to discover new or updated content. it works by taking a seed URL and traversing the content of each page in search of new content (similar to traversing a graph). assume you have all the resources you need to make it possible.
use cases
- data mining
    - the web is huge, so the crawled data can be used to build recommendation systems, feed AI models, or build virtual maps
- search engines
    - Google uses a web crawler to feed its main service
- site archiving (old websites)
    - a website can go through multiple stages or versions; archiving keeps a timeline of them
- web monitoring
    - keep track of the behavior of the web, for example to spot harmful content related to your brand
constraints
- search within a single web page at a time
- keep the content updated
- check for loops
- workers can go offline at any time
- the system should be polite (limit the number of requests per site)
main functionalities
because of the potential scope of the project, we'll limit the goals to the following (a minimal sketch of this loop comes right after the list):
- given a set of URLs, download and parse the body of each page
- search for relevant content; it might be HTML, images, or new URLs, but in this case we only support text and URLs
- save the content for future usage (persist the data)
- pass the new URLs back to the URL list
- repeat these steps
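Below is a minimal, single-threaded sketch of this loop, assuming plain Python and the standard library. The regex-based link extraction and the in-memory store are naive stand-ins for the parser, URL extractor, and store components described later in the blueprint.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF = re.compile(r'href="([^"#]+)"')   # naive link pattern, good enough for a sketch

def fetch(url: str) -> str:
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)        # URLs waiting to be visited
    seen = set(seeds)              # loop check: never enqueue a URL twice
    store = {}                     # stand-in for the persistent store
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            body = fetch(url)                       # 1. download the body
        except OSError:
            continue                                # skip failed fetches for now
        store[url] = body                           # 2. persist the content
        for href in HREF.findall(body):             # 3. extract new URLs
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)               # 4. pass them back to the list
    return store                                    # 5. repeat until done
```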
key features
- extensibility
    - we want the ability to scale or add new features without having to modify the original structure
- scalability
    - because the internet is massive and websites keep updating and adding new content, we need a way to scale gracefully as the internet grows
- robustness
    - the web is full of skulduggery and errors; the system needs a way to recognize and recover gracefully from the errors it may face
- politeness
    - it's important to limit the number of requests we send to foreign servers
blueprint
URL seed
the initial URLs we pass to the system to start discovering from. this set of URLs can be pre-selected to ensure the flow succeeds. to keep it simple, the URL seed can also pull URLs from the URL storage to keep already-crawled content updated (sketched below).
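A small sketch of how the URL seed could merge a fixed seed list with URLs pulled back from the URL storage for re-crawling. The sqlite schema (a `urls` table with `url` and `last_crawled_at` columns) and the one-week re-crawl window are assumptions made for illustration only.

```python
import sqlite3
import time

STATIC_SEEDS = ["https://example.com/"]   # pre-selected starting points (placeholder)
RECRAWL_AFTER = 7 * 24 * 3600             # revisit pages older than a week (assumed)

def load_seeds(db_path: str) -> list:
    """Return the static seeds plus every stored URL that is due for a re-crawl."""
    conn = sqlite3.connect(db_path)
    cutoff = time.time() - RECRAWL_AFTER
    rows = conn.execute(
        "SELECT url FROM urls WHERE last_crawled_at < ?", (cutoff,)
    ).fetchall()
    conn.close()
    # merge static seeds with stale URLs, preserving order and uniqueness
    return list(dict.fromkeys(STATIC_SEEDS + [row[0] for row in rows]))
```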
frontier
this part of the system handles multiple responsibilities worth mentioning (a sketch that combines them follows the list):
- politeness
- priority
    - this can mean multiple things, for example:
        - home pages take priority
        - pages with a high probability of being updated
- freshness
    - this is a topic of active research; there's an insightful answer about it on Stack Overflow
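Here's one possible sketch of a frontier that combines priority and politeness: a priority queue decides the order, and a per-host delay keeps us from hammering any single server. The shallow-path priority heuristic and the one-second delay are illustrative assumptions, not the only way to do it.

```python
import heapq
import itertools
import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0  # seconds to wait between requests to the same host (assumed)

class Frontier:
    def __init__(self):
        self._heap = []                      # entries: (priority, tie-breaker, url)
        self._counter = itertools.count()    # tie-breaker so equal priorities stay FIFO
        self._next_ok = {}                   # host -> earliest time we may hit it again

    def push(self, url: str) -> None:
        # toy priority: shallower paths (home pages) come out first
        priority = urlparse(url).path.count("/")
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        """Return the next URL whose host is allowed to be fetched, or None."""
        deferred, chosen = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlparse(item[2]).netloc
            if time.time() >= self._next_ok.get(host, 0.0):
                self._next_ok[host] = time.time() + CRAWL_DELAY
                chosen = item[2]
                break
            deferred.append(item)            # host is still cooling down
        for item in deferred:                # put deferred URLs back in the queue
            heapq.heappush(self._heap, item)
        return chosen
```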
fetcher
fetches and downloads the web page, and is also responsible for handling request errors. we could re-insert failed URLs back into the URL storage, or add a message queue that the URL seed subscribes to. it's also important to keep a retry limit so we don't fall into an infinite loop (see the sketch below).
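A hedged sketch of that error handling: failed URLs are re-queued with a retry counter and dropped once they pass a limit, so a permanently broken link can't loop forever. The retry limit of 3 and the `retry_queue` parameter are assumptions for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen

MAX_RETRIES = 3  # assumed limit before we give up on a URL

def fetch(url: str, retries: int, retry_queue: list):
    """Download a page body, or re-queue the URL on failure and return None."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        if retries < MAX_RETRIES:
            # in the full design this could go back to URL storage or onto a
            # message queue that the URL seed is subscribed to
            retry_queue.append((url, retries + 1))
        return None
```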
monitoring
checks the health of the system and helps us trigger alerts when something goes wrong and needs manual checking (a sketch of such statistics follows)
- check statistics
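As a sketch, the statistics could be as simple as a few counters with an alert threshold; the 50% error-rate rule below is an arbitrary assumption, and real monitoring would likely export these counters to a metrics system instead.

```python
from dataclasses import dataclass

@dataclass
class CrawlStats:
    fetched: int = 0          # pages downloaded successfully
    failed: int = 0           # fetches that errored out
    extracted_urls: int = 0   # new URLs discovered

    def error_rate(self) -> float:
        total = self.fetched + self.failed
        return self.failed / total if total else 0.0

    def should_alert(self) -> bool:
        # flag for manual checking when more than half of recent fetches fail
        return (self.fetched + self.failed) >= 20 and self.error_rate() > 0.5
```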
parser
parses or formats the data into a desired structure to make it easier to extract information from the web page
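A possible sketch using the standard library's HTMLParser to pull the visible text out of a downloaded page, skipping script and style blocks.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}              # tags whose contents we ignore

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def parse_text(html: str) -> str:
    """Return the page's visible text as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```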
store
stores relevant information in the database. we could save the text or HTML files on disk and keep only the file path in the database, assuming that some data doesn't need to live in the cache or in primary memory
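A sketch of that split, assuming sqlite for the metadata and a local directory for the raw bodies; both are placeholders for whatever database and blob storage the real system would use.

```python
import hashlib
import sqlite3
from pathlib import Path

DATA_DIR = Path("crawl_data")   # assumed location for raw page bodies

def save(db_path: str, url: str, body: str) -> None:
    """Write the body to disk and record only the file path in the database."""
    DATA_DIR.mkdir(exist_ok=True)
    file_path = DATA_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    file_path.write_text(body, encoding="utf-8")   # large blob lives on disk
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, path TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, path) VALUES (?, ?)",
        (url, str(file_path)),
    )
    conn.commit()
    conn.close()
```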
URL extractor
extracts relevant URLs from the parsed content
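A small sketch: collect href attributes from anchor tags and resolve them against the page's own URL, keeping only http(s) links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.base_url, value))

def extract_urls(base_url: str, html: str) -> list:
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return [u for u in extractor.links if u.startswith(("http://", "https://"))]
```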
URL filter
its main responsibility is to keep the content unique, meaning it's important not to keep redundant information. some of the cases it handles are (see the sketch after this list):
- different URLs having the same content
- duplicated URLs
- harmful content (blacklist)
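A sketch covering those three cases: URLs are normalized before the duplicate check, page bodies are hashed so different URLs with identical content get caught, and blacklisted hosts are rejected. The blacklist entry and the normalization rules are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlparse, urlunparse

BLACKLIST = {"malware.example"}   # hypothetical blocked host

seen_urls = set()      # normalized URLs we've already accepted
seen_content = set()   # hashes of bodies we've already stored

def normalize(url: str) -> str:
    # drop fragments and lowercase the host so trivially different URLs collide
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc.lower(), p.path.rstrip("/"), "", p.query, ""))

def accept_url(url: str) -> bool:
    norm = normalize(url)
    if urlparse(norm).netloc in BLACKLIST:
        return False               # harmful / blacklisted host
    if norm in seen_urls:
        return False               # duplicated URL
    seen_urls.add(norm)
    return True

def accept_content(body: str) -> bool:
    digest = hashlib.sha256(body.encode()).hexdigest()
    if digest in seen_content:
        return False               # same content reached through another URL
    seen_content.add(digest)
    return True
```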
important notes
- we use workers or bots to handle the URLs
- we use a message queue to handle parallel work (sketched below)
- we could
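A sketch of the worker model, using the standard library's queue and threads as a stand-in for a real message queue between the frontier and the fetchers. Each worker pulls a URL, processes it, and pushes any new URLs back so the work happens in parallel; `process` is a placeholder for the fetch/parse/extract pipeline.

```python
import queue
import threading

def worker(tasks: "queue.Queue", process) -> None:
    while True:
        url = tasks.get()
        if url is None:                 # sentinel: shut this worker down
            tasks.task_done()
            break
        for new_url in process(url):    # fetch, parse, extract ... (placeholder)
            tasks.put(new_url)
        tasks.task_done()

def run_workers(seed_urls, process, num_workers=4) -> None:
    tasks = queue.Queue()
    for url in seed_urls:
        tasks.put(url)
    threads = [
        threading.Thread(target=worker, args=(tasks, process), daemon=True)
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    tasks.join()                        # wait until the queue drains
    for _ in threads:
        tasks.put(None)                 # release the workers
```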