TSE Crawler Requirements Spec
The TSE crawler is a standalone program that crawls the web and retrieves webpages starting from a “seed” URL. It parses the seed webpage, extracts any embedded URLs, then retrieves each of those pages recursively, limiting its exploration to a given “depth”.
The crawler shall:
- execute from a command line with usage syntax
  ./crawler seedURL pageDirectory maxDepth
  - where seedURL is used as the initial URL,
  - where pageDirectory is the pathname for an existing directory in which to write downloaded webpages, and
  - where maxDepth is a non-negative integer representing the maximum crawl depth.
- crawl all pages reachable from seedURL, following links to a maximum depth of maxDepth; where maxDepth=0 means that the crawler only explores the page at seedURL, maxDepth=1 means that the crawler only explores the page at seedURL and those pages to which seedURL links, and so forth inductively.
- pause at least one second between page fetches.
- ignore URLs that are not “internal” (that is, URLs outside the designated CS50 server).
- write each explored page to the pageDirectory with a unique document ID, wherein
  - the document id starts at 1 and increments by 1 for each new page,
  - and the filename is of form pageDirectory/id,
  - and the first line of the file is the URL,
  - and the second line of the file is the depth,
  - and the rest of the file is the page content (the HTML, unchanged).
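The command-line rules above can be checked before any page is fetched. The following is a minimal sketch in Python (the spec does not prescribe an implementation language). The helper name parse_args and its error messages are assumptions made for illustration, not part of the spec.

```python
import os


def parse_args(argv):
    """Validate [seedURL, pageDirectory, maxDepth].

    Returns (seed_url, page_dir, max_depth) or raises ValueError.
    `argv` excludes the program name (hypothetical helper, not from the spec).
    """
    if len(argv) != 3:
        raise ValueError("usage: ./crawler seedURL pageDirectory maxDepth")
    seed_url, page_dir, depth_text = argv
    # The spec requires pageDirectory to exist already; it is not created here.
    if not os.path.isdir(page_dir):
        raise ValueError(f"pageDirectory is not an existing directory: {page_dir}")
    # Only plain ASCII digits pass, so "-1", "1.5", and "" are all rejected.
    if not (depth_text.isascii() and depth_text.isdecimal()):
        raise ValueError(f"maxDepth must be a non-negative integer: {depth_text}")
    return seed_url, page_dir, int(depth_text)
```

A caller would typically pass sys.argv[1:], print the ValueError message to stderr, and exit with a non-zero status when validation fails.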
In a requirements spec, “shall do” means “must do”.
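To make the depth rule and the page-file format concrete, here is a small Python sketch of a breadth-first crawl. It is an illustration under stated assumptions, not the required design: the names save_page and crawl are hypothetical, pages come from a caller-supplied fetch function (so the demo needs no network), links are found with a deliberately naive regular expression, and the one-second pause and the “internal” URL check are left out to keep the focus on depth and file layout.

```python
import os
import re
import tempfile
from collections import deque
from urllib.parse import urljoin


def save_page(page_dir, doc_id, url, depth, html):
    """Write pageDirectory/id: URL on line 1, depth on line 2, then the HTML."""
    path = os.path.join(page_dir, str(doc_id))
    # newline="" turns off newline translation so the HTML is written unchanged.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(f"{url}\n{depth}\n{html}")


def crawl(seed_url, page_dir, max_depth, fetch):
    """Explore pages breadth-first up to max_depth; return the number saved.

    `fetch(url)` is assumed to return the page HTML, or None on failure.
    """
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    doc_id = 0
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        doc_id += 1  # IDs start at 1 and grow by 1 per saved page
        save_page(page_dir, doc_id, url, depth, html)
        if depth < max_depth:  # pages at max_depth are saved but not expanded
            # Naive link extraction (assumption); a real parser is more robust.
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return doc_id


# Demo on an in-memory "site" with made-up URLs.
SITE = {
    "http://example.org/a.html": '<a href="b.html">B</a>',
    "http://example.org/b.html": '<a href="c.html">C</a> <a href="a.html">A</a>',
    "http://example.org/c.html": "no links",
}

with tempfile.TemporaryDirectory() as page_dir:
    saved = crawl("http://example.org/a.html", page_dir, 1, SITE.get)
    print(saved, sorted(os.listdir(page_dir)))  # prints: 2 ['1', '2']
```

With maxDepth=1 the demo saves the seed page and the one page it links to; c.html is two links away from the seed and is never fetched.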
Be polite
Webservers do not like crawlers (think about why).
Indeed, if you hit a web server too hard, its operator may block your crawler based on its Internet address.
Actually, they’ll usually block your whole domain.
A hyperactive CS50 crawler could cause some websites to block the whole of dartmouth.edu.
To be polite, our crawler purposely slows its behavior by introducing a delay, sleeping for one second between fetches.
Furthermore, our crawler will limit its crawl to a specific web server inside CS, so we don’t bother any other servers on campus or beyond.
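The two politeness rules can be sketched together in Python. This is only an illustration: the host name cs50.example.org is a placeholder (the spec does not name the designated server), and is_internal and fetch_politely are hypothetical helper names. The sleep function is a parameter so the behavior can be exercised without actually waiting.

```python
import time
from urllib.parse import urlparse

# Placeholder host (assumption); substitute the designated CS50 server.
INTERNAL_HOST = "cs50.example.org"


def is_internal(url, host=INTERNAL_HOST):
    """True only for http(s) URLs whose host is exactly the designated server."""
    parts = urlparse(url)
    # .hostname is already lowercased, so the comparison ignores case.
    return parts.scheme in ("http", "https") and parts.hostname == host


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Yield (url, fetch(url)) for internal URLs, pausing between fetches."""
    first = True
    for url in urls:
        if not is_internal(url):
            continue  # never contact servers outside the designated host
        if not first:
            sleep(delay)  # pause before every fetch except the first
        first = False
        yield url, fetch(url)
```

Comparing the full host name, rather than testing whether the URL merely starts with or contains it, avoids treating a look-alike such as cs50.example.org.evil.com as internal.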