
web crawler system design

description

design a web crawler that discovers new or updated content. it works by taking a seed URL, downloading the page, and following the links it finds to discover more content (similar to traversing a graph). assume you have all the resources you need to make it possible.
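
a minimal sketch of that traversal, assuming a hypothetical fetch_links(url) helper that downloads a page and returns the URLs it links to:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of the web graph starting from the seed URLs.

    fetch_links is a hypothetical callable that downloads a page and
    returns the URLs it links to.
    """
    queue = deque(seed_urls)
    visited = set(seed_urls)     # avoids re-crawling pages (loop check)
    crawled = []

    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return crawled
```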

web crawler example

usages

  • data mining
    • the web is huge, so we can use the collected information to build recommendation systems, feed AI models, or build virtual maps
  • search engines
    • google uses a web crawler to feed its main search service
  • site archiving (old websites)
    • a web site can go through multiple stages or versions; crawling lets us keep a timeline of them
  • web monitoring
    • keep track of how the web behaves, for example to filter out content that could be harmful to your brand

constraints

  • search in a single web page
  • keep the content updated
  • check for loops
  • workers can go offline any time
  • system should be polite (limit the number of requests per site)

main functionalities

because of the potential scope of the project, we’ll limit the goals to:

  • given a set of URLs, download each page and parse the body
  • search for relevant content; it might be html, images, or new urls. in this case we only support text and URLs
  • save the content for future usage (persist data)
  • pass these new URLs back to the list
  • repeat the steps
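
a sketch of that loop, where download_page, parse_body, extract_urls, and save_content are hypothetical helpers standing in for the components described in the blueprint below:

```python
def crawl_loop(url_list, download_page, parse_body, extract_urls, save_content):
    """Runs the steps above until the URL list is exhausted."""
    seen = set(url_list)
    while url_list:
        url = url_list.pop(0)
        body = download_page(url)            # download the page
        text = parse_body(body)              # parse the body, keep text only
        save_content(url, text)              # persist data for future usage
        for new_url in extract_urls(body):   # pass the new URLs to the list
            if new_url not in seen:
                seen.add(new_url)
                url_list.append(new_url)     # and repeat the steps
```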

key features

  • extensibility
    • we want to be able to scale or add new features without modifying the original structure
  • scalability
    • the internet is massive and web sites keep updating and adding new content, so we need a way to scale gracefully as the internet does
  • robustness
    • the web is full of skulduggery and errors; the system needs a way to recognize and recover gracefully from the errors it may face
  • politeness
    • it’s important to limit the number of requests we send to each foreign server
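
a minimal sketch of that politeness rule as a per-host rate limiter; the one-second delay is an arbitrary placeholder:

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """Enforces a minimum delay between requests sent to the same host."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds
        self.last_request = {}               # host -> timestamp of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_request.get(host, 0.0)
        if elapsed < self.delay_seconds:
            time.sleep(self.delay_seconds - elapsed)
        self.last_request[host] = time.time()
```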

blueprint

web crawler blueprint

URL Seed

the initial URLs that we pass to the system to start discovering from. this set of URLs can be pre-selected to ensure the success of the flow. to keep it simple, the URL seed can also pull URLs from the url storage so that already-crawled content stays updated
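
a small sketch of that idea, assuming a hypothetical url_storage.get_stale_urls method that returns URLs whose content hasn’t been re-crawled recently:

```python
def build_seed(preselected_urls, url_storage, refresh_limit=100):
    """Combine hand-picked seed URLs with stale URLs pulled from the url storage."""
    seed = list(dict.fromkeys(preselected_urls))          # keep order, drop duplicates
    for url in url_storage.get_stale_urls(limit=refresh_limit):
        if url not in seed:
            seed.append(url)
    return seed
```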

frontier

this part of the system handles multiple responsibilities worth mentioning

fetcher

fetches and downloads the web page. it is also responsible for handling request errors: we could re-insert the failed URLs back into the URL storage, or publish them to a message queue that the URL seed subscribes to. it is also important to keep a retry limit so we don’t fall into an infinite loop
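
a minimal sketch of that behavior, assuming the third-party requests library and a hypothetical requeue callback that puts the URL back into the URL storage or message queue:

```python
import requests

MAX_RETRIES = 3  # arbitrary limit so we don't retry forever

def fetch(url, requeue, retries=0, timeout=10):
    """Download a page; on failure, requeue the URL until the retry limit is hit."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        if retries + 1 < MAX_RETRIES:
            requeue(url, retries + 1)        # back to the URL storage / message queue
        return None
```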

monitoring

checks the health of the system and helps us trigger alerts when something goes wrong and needs manual checking

  • check statistics
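
a small sketch of the statistics side; the event names and the alert threshold are illustrative assumptions:

```python
from collections import Counter

class CrawlStats:
    """Keeps simple counters that a monitoring dashboard or alert can read."""

    def __init__(self, error_rate_threshold=0.2):
        self.counts = Counter()
        self.error_rate_threshold = error_rate_threshold

    def record(self, event):                 # e.g. "fetched", "failed", "parsed"
        self.counts[event] += 1

    def error_rate(self):
        total = self.counts["fetched"] + self.counts["failed"]
        return self.counts["failed"] / total if total else 0.0

    def needs_attention(self):
        """True when the error rate is high enough to warrant a manual check."""
        return self.error_rate() > self.error_rate_threshold
```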

parser

parses or formats the raw data into a desired structure to make it easier to extract information from the web page
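
a short sketch, assuming the third-party beautifulsoup4 package; it keeps only the visible text and the raw links, since that’s all we support:

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

def parse(html):
    """Turn raw HTML into the structure the rest of the pipeline consumes:
    the visible text and the raw href values (we only support text and URLs)."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    return {"text": text, "hrefs": hrefs}
```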

store

stores the relevant information in the database. we could save the text or html files on disk and keep only the path in the database, assuming that some data doesn’t need to live in the cache or primary memory
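
a minimal sketch of that idea with sqlite and the local filesystem; the table layout and file naming are assumptions:

```python
import hashlib
import pathlib
import sqlite3

def save_page(db, content_dir, url, text):
    """Write the text to disk and keep only the path (plus the URL) in the database."""
    content_dir = pathlib.Path(content_dir)
    content_dir.mkdir(parents=True, exist_ok=True)
    path = content_dir / (hashlib.sha256(url.encode()).hexdigest() + ".txt")
    path.write_text(text, encoding="utf-8")

    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, path TEXT)")
    db.execute("INSERT OR REPLACE INTO pages (url, path) VALUES (?, ?)", (url, str(path)))
    db.commit()

# usage: save_page(sqlite3.connect("crawler.db"), "./pages", url, text)
```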

URL extractor

extracts relevant URLs from the parsed content
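
a short sketch, assuming it receives the raw href values produced by the parser and resolves them against the page URL:

```python
from urllib.parse import urljoin, urlparse

def extract_urls(page_url, hrefs):
    """Resolve relative links against the page URL and keep only http(s) URLs."""
    urls = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            urls.append(absolute)
    return urls
```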

URL filter

its main responsibility is to keep the content unique, meaning it’s important not to store redundant information. some of the cases it filters out are:

  • different urls having the same content
  • duplicated urls
  • harmful content (blacklist)
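
a minimal sketch of such a filter covering the three cases above; the normalization rules and the blacklist format are assumptions:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

class URLFilter:
    """Drops duplicated URLs, blacklisted hosts, and already-seen content."""

    def __init__(self, blacklist=()):
        self.blacklist = set(blacklist)      # e.g. {"spam.example.com"}
        self.seen_urls = set()
        self.seen_content = set()

    @staticmethod
    def normalize(url):
        # lowercase the host and drop fragments so trivially different
        # spellings of the same URL compare equal
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def allow(self, url, content=None):
        normalized = self.normalize(url)
        if urlsplit(normalized).netloc in self.blacklist:
            return False                     # harmful content (blacklist)
        if normalized in self.seen_urls:
            return False                     # duplicated urls
        if content is not None:
            digest = hashlib.sha256(content.encode()).hexdigest()
            if digest in self.seen_content:
                return False                 # different urls having the same content
            self.seen_content.add(digest)
        self.seen_urls.add(normalized)
        return True
```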

important notes

  • we use workers or bots to process the URLs
  • use a message queue to distribute the parallel work
  • we could
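
a minimal sketch of that worker model using an in-process queue as a stand-in for a real message queue; process_url is a hypothetical function wrapping the fetch/parse/store pipeline:

```python
import queue
import threading

def run_workers(seed_urls, process_url, num_workers=4):
    """Workers pull URLs from a shared queue; process_url may return new URLs
    to enqueue. A real deployment would swap the queue for a message broker."""
    url_queue = queue.Queue()
    for url in seed_urls:
        url_queue.put(url)

    def worker():
        while True:
            url = url_queue.get()
            try:
                for new_url in process_url(url) or []:
                    url_queue.put(new_url)
            except Exception:
                pass                         # a robust version would log and requeue
            finally:
                url_queue.task_done()        # mark the URL as handled so join() can return

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()                         # wait until every discovered URL is processed
```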

updated blueprint

web crawler blueprint v2