TSE Crawler Requirements Spec
The TSE crawler is a standalone program that crawls the web and retrieves webpages starting from a “seed” URL. It parses the seed webpage, extracts any embedded URLs, then retrieves each of those pages, recursively, but limiting its exploration to a given “depth”.
The crawler shall:
- execute from a command line with usage syntax `./crawler seedURL pageDirectory maxDepth` (see the argument-validation sketch below),
  - where `seedURL` is used as the initial URL,
  - where `pageDirectory` is the pathname for an existing directory in which to write downloaded webpages, and
  - where `maxDepth` is a non-negative integer representing the maximum crawl depth.
- crawl all pages reachable from `seedURL`, following links to a maximum depth of `maxDepth`; where `maxDepth=0` means that the crawler explores only the page at `seedURL`, `maxDepth=1` means that the crawler explores only the page at `seedURL` and those pages to which `seedURL` links, and so forth inductively (see the crawl-loop sketch below).
- pause at least one second between page fetches.
- ignore URLs that are not “internal” (meaning, outside the designated CS50 server).
- write each explored page to the `pageDirectory` with a unique document ID (see the page-file sketch below), wherein
  - the document `id` starts at 1 and increments by 1 for each new page,
  - and the filename is of form `pageDirectory/id`,
  - and the first line of the file is the URL,
  - and the second line of the file is the depth,
  - and the rest of the file is the page content (the HTML, unchanged).
In a requirements spec, “shall do” means “must do”.
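The command-line requirements above can be checked before any crawling begins. The following is a minimal argument-validation sketch; the variable names, exit codes, and the writability test (creating a file inside `pageDirectory`) are illustrative assumptions, not part of the spec.

```c
#include <stdio.h>

int main(int argc, char* argv[])
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s seedURL pageDirectory maxDepth\n", argv[0]);
    return 1;
  }

  const char* seedURL = argv[1];
  const char* pageDirectory = argv[2];

  // maxDepth must parse cleanly as a non-negative integer
  int maxDepth;
  char extra;
  if (sscanf(argv[3], "%d%c", &maxDepth, &extra) != 1 || maxDepth < 0) {
    fprintf(stderr, "maxDepth must be a non-negative integer\n");
    return 2;
  }

  // pageDirectory must be an existing, writable directory;
  // one simple (hypothetical) test is to try creating a file inside it
  char path[1024];
  snprintf(path, sizeof(path), "%s/.crawler", pageDirectory);
  FILE* fp = fopen(path, "w");
  if (fp == NULL) {
    fprintf(stderr, "pageDirectory '%s' is not a writable directory\n", pageDirectory);
    return 3;
  }
  fclose(fp);

  printf("would crawl %s to depth %d, writing pages into %s\n",
         seedURL, maxDepth, pageDirectory);
  return 0;
}
```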
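To illustrate the `maxDepth` semantics, here is a crawl-loop sketch under simplifying assumptions: page fetching and link extraction are stubbed out, duplicate-URL detection and the “internal” test are omitted, and the names (`crawlItem`, `pageFetch`, `pageScan`) are hypothetical rather than a required interface.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct crawlItem {
  char* url;
  int depth;
  struct crawlItem* next;
} crawlItem_t;

// stub: a real crawler would fetch the page's HTML over HTTP here
static char* pageFetch(const char* url) {
  (void)url;
  return strdup("<html>stub page</html>");
}

// stub: a real crawler would scan the HTML for embedded URLs;
// returning 0 links keeps this sketch terminating immediately
static int pageScan(const char* html, char* found[], int max) {
  (void)html; (void)found; (void)max;
  return 0;
}

int main(void) {
  const char* seedURL = "http://example.invalid/index.html"; // placeholder seed
  const int maxDepth = 2;                                    // example depth limit
  int docID = 1;                                             // document IDs start at 1

  // work list of (URL, depth) pairs still to explore, seeded at depth 0
  crawlItem_t* todo = malloc(sizeof(crawlItem_t));
  todo->url = strdup(seedURL);
  todo->depth = 0;
  todo->next = NULL;

  while (todo != NULL) {
    crawlItem_t* item = todo;   // take the next page off the work list
    todo = todo->next;

    sleep(1);                   // pause at least one second between fetches
    char* html = pageFetch(item->url);
    printf("doc %d: %s (depth %d)\n", docID++, item->url, item->depth);

    // follow links only if this page is shallower than maxDepth;
    // maxDepth=0 therefore explores the seed page and nothing else
    if (item->depth < maxDepth) {
      char* links[100];
      int n = pageScan(html, links, 100);
      for (int i = 0; i < n; i++) {
        crawlItem_t* child = malloc(sizeof(crawlItem_t));
        child->url = links[i];
        child->depth = item->depth + 1;  // one level deeper than its parent
        child->next = todo;
        todo = child;
      }
    }
    free(html);
    free(item->url);
    free(item);
  }
  return 0;
}
```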
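The per-page file format (URL on the first line, depth on the second, then the unmodified HTML) can be written with straightforward stdio calls. This page-file sketch assumes the caller already has the URL, depth, and HTML in hand; `pageSave` is a hypothetical helper name, not part of the spec.

```c
#include <stdio.h>
#include <stdbool.h>

// write one page as pageDirectory/id: URL, then depth, then the raw HTML
bool pageSave(const char* pageDirectory, int docID,
              const char* url, int depth, const char* html)
{
  char path[1024];
  snprintf(path, sizeof(path), "%s/%d", pageDirectory, docID); // filename pageDirectory/id

  FILE* fp = fopen(path, "w");
  if (fp == NULL) {
    return false;
  }
  fprintf(fp, "%s\n", url);    // first line: the URL
  fprintf(fp, "%d\n", depth);  // second line: the depth
  fputs(html, fp);             // rest of the file: the HTML, unchanged
  fclose(fp);
  return true;
}

int main(void) {
  // hypothetical usage: document 1, fetched at depth 0, saved into the current directory
  return pageSave(".", 1, "http://example.invalid/index.html", 0,
                  "<html><body>hello</body></html>\n") ? 0 : 1;
}
```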
Be polite
Webservers do not like crawlers (think about why).
Indeed, if you hit a web server too hard, its operator may block your crawler based on its Internet address.
Actually, they’ll usually block your whole domain.
A hyperactive CS50 crawler could cause some websites to block the whole of dartmouth.edu.
To be polite, our crawler purposely slows its behavior by introducing a delay, sleeping for one second between fetches.
Furthermore, our crawler will limit its crawl to a specific web server inside CS, so we don’t bother any other servers on campus or beyond.
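One simple way to keep the crawl on a single server is a prefix test on each candidate URL, as in the sketch below; the prefix shown is a placeholder, since the actual designated CS50 server is specified elsewhere in the course materials.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

// placeholder prefix; the real designated server is defined in the course materials
static const char* internalPrefix = "http://cs50server.example/";

// a URL is "internal" if it begins with the designated server's prefix
static bool isInternal(const char* url) {
  return strncmp(url, internalPrefix, strlen(internalPrefix)) == 0;
}

int main(void) {
  printf("%d\n", isInternal("http://cs50server.example/index.html")); // prints 1
  printf("%d\n", isInternal("http://www.example.com/"));              // prints 0
  return 0;
}
```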