Goals

  • To understand the specifications for the Tiny Search Engine (TSE)
  • To develop an implementation plan for the TSE Crawler
  • To demonstrate the execution of one Crawler solution

My first Terminal script shows a short run of the crawler over the CS50 home page.

My second Terminal script shows some short runs of the crawler over the small ‘letters’ playground, then a run over the ‘wikipedia’ playground; I eventually killed the maxDepth=2 crawl because it would have taken a long time to finish.

Specifications

A good software implementation requires a good software design, and a good software design must be based on a clear set of requirements. We think about these specifications as a sort of “contract” between the programmer (who writes the code) and the customer (the ultimate user of the software). Thus, we need three specs:

  • Requirements spec: specifies what the software must do
  • Design spec: specifies the structure of the software in a language-independent, machine-independent way
  • Implementation spec: specifies the language-dependent, machine-dependent details of the implementation.

We’ll take a closer look at the design process in the next lecture. For today, let’s apply this approach to the TSE.

Tiny Search Engine (TSE)

Our Tiny Search Engine (TSE) design is inspired by the material in the paper Searching the Web, by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan (Stanford University); ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1 (August 2001).

TSE Requirements Spec

The Tiny Search Engine (TSE) shall consist of three subsystems:

  1. crawler, which crawls the web from a given seed to a given maxDepth and caches the content of the pages it finds, one page per file, in a given directory.
  2. indexer, which reads files from the given directory, builds an index that maps from words to pages (URLs), and writes that index to a given file.
  3. querier, which reads the index from a given file, and a query expressed as a set of words optionally combined by (AND, OR), and outputs a ranked list of pages (URLs) in which the given combination of words appear.

Each subsystem is a standalone program executed from the command line, but they inter-connect through files in the file system.

In a spec document, we write “shall do” to mean “must do”.

We’ll look deeper at the requirements for each subsystem later. First, let’s go to the next level on the overall TSE: the Design Spec.

TSE Design Spec

The overall architecture presented below shows the modular decomposition of the system:

(Figure: Tiny Search Engine modular design)

The above diagram is consistent with the Requirements Spec: we can clearly see three sub-systems, their interconnection through files, and the user interface for submitting queries to the querier. The querier subsystem has an internal ranking module, which we anticipate might be separate from the query processor module; we’ll look more closely when we come to the querier design.

Next, we describe each sub-system and its high-level design.

The crawler crawls a website and retrieves webpages starting with a specified URL. It parses the initial webpage, extracts any embedded URLs (href links) and retrieves those pages, and crawls the pages found at those URLs, limiting itself to maxDepth hops from the seed URL. When the crawler process is complete, the indexing of the collected documents can begin.
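
Jumping ahead to implementation for a moment, here is a minimal sketch of that crawl loop in C. It leans on the bag, hashtable, and webpage modules from libcs50, and a pagedir_save() helper from the ‘common’ library described later; treat the specific function names and signatures here as assumptions for illustration, not the required interface.

#include "bag.h"
#include "hashtable.h"
#include "webpage.h"
#include "pagedir.h"   // pagedir_save(); exact interface is an assumption

static void crawl(char *seedURL, char *pageDirectory, const int maxDepth)
{
  bag_t *toCrawl = bag_new();              // pages yet to be fetched
  hashtable_t *seen = hashtable_new(200);  // URLs already found
  int docID = 0;                           // files are named 1, 2, 3, ...

  // start from the seed page at depth 0
  // (a real implementation normalizes and copies seedURL first)
  bag_insert(toCrawl, webpage_new(seedURL, 0, NULL));
  hashtable_insert(seen, seedURL, "");

  webpage_t *page;
  while ((page = bag_extract(toCrawl)) != NULL) {
    if (webpage_fetch(page)) {                       // retrieve the page's HTML
      pagedir_save(page, pageDirectory, ++docID);    // write URL, depth, HTML to a file
      if (webpage_getDepth(page) < maxDepth) {       // scan it for embedded URLs
        int pos = 0;
        char *url;
        while ((url = webpage_getNextURL(page, &pos)) != NULL) {
          // a real crawler also normalizes the URL, ignores external URLs,
          // and manages the url string's memory; hashtable_insert returns
          // false if we have already seen this URL
          if (hashtable_insert(seen, url, "")) {
            bag_insert(toCrawl, webpage_new(url, webpage_getDepth(page) + 1, NULL));
          }
        }
      }
    }
    webpage_delete(page);
  }
  bag_delete(toCrawl, NULL);
  hashtable_delete(seen, NULL);
}

Note that a bag makes no promise about ordering; that is fine, because the requirements bound only the depth of the crawl, not the order in which pages are visited.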

The indexer extracts all the keywords for each stored webpage and creates a lookup table that maps each word found to all the documents (webpages) where the word was found.
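
To make that lookup table concrete in implementation terms: one natural representation, using libcs50’s hashtable and counters modules, is a hashtable that maps each word to a counters set of (documentID, count) pairs. A minimal sketch, where the wrapper names index_t, index_new, and index_add are hypothetical rather than a required API:

#include "hashtable.h"
#include "counters.h"

typedef hashtable_t index_t;     // word -> counters (docID -> count)

index_t *index_new(const int numSlots)
{
  return hashtable_new(numSlots);
}

// record one occurrence of 'word' in the document numbered 'docID'
void index_add(index_t *index, const char *word, const int docID)
{
  counters_t *ctrs = hashtable_find(index, word);
  if (ctrs == NULL) {                      // first occurrence of this word anywhere
    ctrs = counters_new();
    hashtable_insert(index, word, ctrs);
  }
  counters_add(ctrs, docID);               // increment this document's count
}

The counters module already stores counts keyed by an integer, which is exactly what we need once each stored page is identified by its document ID (the file name in the pageDirectory).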

The query engine responds to requests (queries) from users. The query processor module loads the index and searches for pages that include the search keywords. Because there may be many hits, we need a ranking module to rank the results (e.g., high to low number of instances of a keyword on a page).
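
With the index structure sketched above, the count stored for a (word, document) pair is exactly the number of instances of that keyword on that page, so the score for a single-word query is just that count. One plausible rule for a query like “word1 AND word2” is to take the smaller of the two counts; this is only an illustration, not necessarily the scoring the Lab will require:

#include "counters.h"

// hypothetical scoring of one document for the query "word1 AND word2":
// both words must appear, so take the smaller of the two counts;
// counters_get returns 0 when docID is absent, which zeroes the score
int score_and(counters_t *word1Counts, counters_t *word2Counts, const int docID)
{
  int c1 = counters_get(word1Counts, docID);
  int c2 = counters_get(word2Counts, docID);
  return (c1 < c2) ? c1 : c2;
}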

TSE Crawler Specs

We’ll look deeper at the requirements for the indexer and querier later. Right now, let’s focus on the crawler:

  • Requirements Spec.
  • Design Spec.
  • Implementation spec: The Design Spec describes abstractions - abstract data structures and pseudocode. The same design could be implemented in C or Java or another language. The Implementation Spec gets more specific: it is language, operating system, and hardware dependent. The implementation spec includes many or all of these topics:
    • Detailed pseudo code for each of the objects/components/functions,
    • Definition of detailed APIs, interfaces, function prototypes and their parameters,
    • Data structures (e.g., struct names and members),
    • Security and privacy properties,
    • Error handling and recovery,
    • Resource management,
    • Persistent storage (files, databases, etc.).

The Lab3 assignment is written like an implementation spec, right down to the specific function prototypes.
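
For example, an implementation spec pins down interfaces at this level of detail; these particular prototypes are only an illustration with hypothetical names, not the required crawler interface:

#include <stdbool.h>
#include "webpage.h"

// Crawl from seedURL to at most maxDepth hops, saving each fetched page
// into its own file in pageDirectory.
void crawl(char *seedURL, char *pageDirectory, const int maxDepth);

// Write the given (already-fetched) page into pageDirectory, using
// documentID as the file name; return true on success.
bool pagedir_save(webpage_t *page, const char *pageDirectory, const int documentID);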

You need to write the Implementation spec for the Crawler in Lab 4.

Demonstration

Crawler execution and output

Below is a snippet of the output when the program starts to crawl the CS50 website to a depth of 1. The crawler prints status information as it goes along.

Note: you might consider wrapping this debugging print-out code in an #ifdef block that can be enabled by a compile-time switch (a sketch appears after the logging function below). See the Lecture extra about this trick.

$ ./crawler http://old-www.cs.dartmouth.edu/~cs50/index.html data 1
 0   Fetched: http://old-www.cs.dartmouth.edu/~cs50/index.html
 0     Saved: http://old-www.cs.dartmouth.edu/~cs50/index.html
 0  Scanning: http://old-www.cs.dartmouth.edu/~cs50/index.html
 0     Found: http://old-www.cs.dartmouth.edu/~cs50/
 0     Added: http://old-www.cs.dartmouth.edu/~cs50/
 0     Found: https://canvas.dartmouth.edu/courses/14815
 0  IgnExtrn: https://canvas.dartmouth.edu/courses/14815
 0     Found: https://piazza.com/dartmouth/spring2017/cosc050sp17/home
 0  IgnExtrn: https://piazza.com/dartmouth/spring2017/cosc050sp17/home
 0     Found: http://old-www.cs.dartmouth.edu/~cs50/Schedule.pdf
 0  IgnExtrn: http://old-www.cs.dartmouth.edu/~cs50/Schedule.pdf
 0     Found: http://www.cs.dartmouth.edu/~cs50/Logistics/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Logistics/
 0     Found: http://www.cs.dartmouth.edu/~cs50/Lectures/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Lectures/
 0     Found: http://www.cs.dartmouth.edu/~cs50/Reading/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Reading/
 0     Found: http://www.cs.dartmouth.edu/~cs50/examples/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/examples/
 0     Found: http://www.cs.dartmouth.edu/~cs50/Sections/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Sections/
 0     Found: http://www.cs.dartmouth.edu/~cs50/Labs/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Labs/
 0     Found: http://www.cs.dartmouth.edu/~cs50/Labs/Project/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Labs/Project/
 0     Found: https://gitlab.cs.dartmouth.edu
 0  IgnExtrn: https://gitlab.cs.dartmouth.edu
 0     Found: http://www.cs.dartmouth.edu/~cs50/Resources/
 0  IgnExtrn: http://www.cs.dartmouth.edu/~cs50/Resources/
 0     Found: http://old-www.cs.dartmouth.edu/~dfk/
 0     Added: http://old-www.cs.dartmouth.edu/~dfk/
 1    Fetched: http://old-www.cs.dartmouth.edu/~dfk/
 1      Saved: http://old-www.cs.dartmouth.edu/~dfk/
 1    Fetched: http://old-www.cs.dartmouth.edu/~cs50/
 1      Saved: http://old-www.cs.dartmouth.edu/~cs50/

Notice how I printed the depth of the current crawl at left, then indented slightly based on the current depth, then printed a single word meant to indicate what is being done, then printed the URL. By ensuring a consistent format, and choosing a simple/unique word for each type of line, I can post-process the output with grep, awk, and so forth, enabling me to run various checks on the output of the crawler. Much better than a mish-mash of arbitrary output formats!

To make this easy, I wrote a simple function to print those lines:

// log one word (1-9 chars) about a given url
inline static void logr(const char *word, const int depth, const char *url)
{
  printf("%2d %*s%9s: %s\n", depth, depth, "", word, url);
}

I thus have just one printf call, and if I want to tweak the format, I just need to edit one place and not every log-type printf in the code.

Notice the inline modifier. This means the compiler is allowed to compile this code ‘in line’ where the function call occurs, rather than compiling code that actually jumps to a function and returns. Syntactically, in every way, it’s just like a function - but more efficient. Great for tiny functions like this one, where it’s worth duplicating the code (making the executable bigger) to save time (making the program run slightly faster).

Anyway, at strategic points in the code, I make a call like this one:

   logr("Fetched", page->depth, page->url);

Contents of pageDirectory after crawler has run

For each URL crawled, the program creates a file and writes into it the URL and the depth at which it was crawled, followed by the entire contents of the webpage. With maxDepth = 1, as in this example, only a few webpages are crawled and only a few files are created. Below is a peek at the files created during the above crawl. Notice how each file starts with the URL, then a number (the depth of that page during the crawl), then the contents of the page (here I printed only the first line of the content).

$ ls
1	2	3
$ head -3 1
http://old-www.cs.dartmouth.edu/~cs50/index.html
0
<!DOCTYPE html>
$ head -3 2
http://old-www.cs.dartmouth.edu/~dfk/
1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
$ head -3 3
http://old-www.cs.dartmouth.edu/~cs50/
1
<!DOCTYPE html>
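
That file format (URL on line 1, depth on line 2, then the HTML) is straightforward to produce. Here is a sketch of the page-saving helper used in the crawl-loop sketch earlier, again with a hypothetical interface and assuming the webpage accessor functions from libcs50:

#include <stdio.h>
#include <stdbool.h>
#include "webpage.h"

// hypothetical helper: write one fetched page into pageDirectory/documentID
bool pagedir_save(webpage_t *page, const char *pageDirectory, const int documentID)
{
  char pathname[256];   // assumes a reasonably short directory name
  snprintf(pathname, sizeof(pathname), "%s/%d", pageDirectory, documentID);

  FILE *fp = fopen(pathname, "w");
  if (fp == NULL) {
    return false;                                 // could not create the file
  }
  fprintf(fp, "%s\n", webpage_getURL(page));      // line 1: the URL
  fprintf(fp, "%d\n", webpage_getDepth(page));    // line 2: the crawl depth
  fprintf(fp, "%s", webpage_getHTML(page));       // rest: the page contents
  fclose(fp);
  return true;
}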

Organization of the TSE code

Let’s take a look at the structure of the TSE solution - so you can see what you’re aiming for.

Directory structure

My TSE comprises six subdirectories:

  • libcs50 - a library of code we provide
  • common - a library of code you write
  • crawler - the crawler
  • indexer - the indexer
  • querier - the querier
  • data - with subdirectories where the crawler and indexer can write files, and the querier can read files.

My top-level .gitignore file excludes data from the repository, because the data files are big, change often, and don’t deserve to be saved.

The full tree looks like this:

├── common
│   ├── index.c
│   ├── index.h
│   ├── Makefile
│   ├── pagedir.c
│   ├── pagedir.h
│   ├── word.c
│   └── word.h
├── crawler
│   ├── crawler.c
│   ├── crawlertest.sh
│   ├── Makefile
│   └── README.md
├── indexer
│   ├── indexer.c
│   ├── indexsort.awk
│   ├── indexsort.sh
│   ├── indextest.c
│   ├── indextest.sh
│   ├── Makefile
│   └── README.md
├── libcs50
│   ├── bag.c
│   ├── bag.h
│   ├── counters.c
│   ├── counters.h
│   ├── file.c
│   ├── file.h
│   ├── file.md
│   ├── hashtable.c
│   ├── hashtable.h
│   ├── jhash.c
│   ├── jhash.h
│   ├── Makefile
│   ├── memory.c
│   ├── memory.h
│   ├── memory.md
│   ├── README.md
│   ├── set.c
│   ├── set.h
│   ├── webpage.c
│   ├── webpage_fetch.c
│   ├── webpage.h
│   ├── webpage_internal.h
│   └── webpage.md
├── Makefile
├── querier
│   ├── fuzzquery.c
│   ├── Makefile
│   ├── querier.c
│   ├── README.md
│   └── testqueries
└── README.md

Source files

My crawler, indexer, and querier each consist of just one .c file. They share some common code, which I keep in the common directory:

  • pagedir - a suite of functions to help the crawler write pages to the pageDirectory and help the indexer read them back in
  • index - a suite of functions that implement the “index” data structure; this module includes functions to write an index to a file (used by indexer) and read an index from a file (used by querier).
  • word - a function NormalizeWord used by both the indexer and the querier (a minimal sketch follows this list).
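
For illustration, such a normalizer can be as simple as lower-casing the word in place; the exact name and behavior required are defined by the Lab specs:

#include <ctype.h>

// convert a word to lower case, in place, and return it
char *NormalizeWord(char *word)
{
  if (word != NULL) {
    for (char *p = word; *p != '\0'; p++) {
      *p = tolower((unsigned char)*p);
    }
  }
  return word;
}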

Each of the program directories (crawler, indexer, querier) includes a few files related to testing, as well.

You’ll recognize the Lab3 data structures - they’re all in the libcs50 library. Note the flatter organization - there’s not a separate subdirectory (with Makefile or test code) for each data structure.

Makefiles

The starter kit includes a Makefile at the top level and another to build the libcs50 library.

The top-level Makefile recursively calls Make on each of the source directories.

The libcs50/Makefile demonstrates how you can build a library, in this case libcs50.a, from a collection of object .o files. Study this Makefile, because you’ll need to write something similar for your common directory.

You can then link these libraries into your other programs without having to list all the individual .o files on the command line. For example, when I build my crawler, Make runs commands like the following:

make -C crawler
gcc -Wall -pedantic -std=c11 -ggdb -DAPPTEST  -I../libcs50 -I../common   -c -o crawler.o crawler.c
gcc -Wall -pedantic -std=c11 -ggdb -DAPPTEST  -I../libcs50 -I../common crawler.o ../common/common.a ../libcs50/libcs50.a -lcurl -o crawler

The crawler/Makefile is written in good style, with appropriate use of variables, so the rule that causes Make to run the above commands is actually much shorter:

crawler: crawler.o $(LLIBS)
        $(CC) $(CFLAGS) $^ $(LIBS) -o $@

We’ll work more with Makefiles in upcoming classes.

Activity

In today’s activity your group will run the crawler over a set of web pages and try to understand why it takes the steps it does.