TSE Indexer Requirements Spec
The TSE indexer is a standalone program that reads the document files produced by the TSE crawler, builds an index, and writes that index to a file. Its companion, the index tester, loads an index file produced by the indexer and saves it to another file.
The indexer shall:
- execute from a command line with usage syntax
  ./indexer pageDirectory indexFilename
  - where pageDirectory is the pathname of a directory produced by the crawler, and
  - where indexFilename is the pathname of a file into which the index should be written; the indexer creates the file (if needed) and overwrites the file (if it already exists).
- read documents from the pageDirectory, each of which has a unique document ID, wherein
  - the document id starts at 1 and increments by 1 for each new page,
  - and the filename is of form pageDirectory/id,
  - and the first line of the file is the URL,
  - and the second line of the file is the depth,
  - and the rest of the file is the page content (the HTML, unchanged).
- build an inverted-index data structure mapping from words to (documentID, count) pairs, wherein each count represents the number of occurrences of the given word in the given document. Ignore words with fewer than three characters, and “normalize” the word before indexing. (Here, “normalize” means to convert all letters to lower-case.)
- create a file indexFilename and write the index to that file, in the format described below.
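The index-building requirement above can be sketched as follows. This is a minimal illustration in Python, not a prescribed implementation. It assumes that a "word" is a maximal run of ASCII letters and that the page text has already had its HTML markup removed; the spec fixes neither point. The names new_index, add_document, and normalize are hypothetical.

```python
import re
from collections import defaultdict

MIN_WORD_LENGTH = 3  # the spec ignores words with fewer than three characters


def normalize(word):
    """Normalize a word as the spec defines it: convert letters to lower case."""
    return word.lower()


def new_index():
    """Create an empty inverted index: word -> {doc_id: count}."""
    return defaultdict(lambda: defaultdict(int))


def add_document(index, doc_id, text):
    """Count each qualifying word of `text` under `doc_id` in `index`.

    Assumption (not from the spec): a word is a maximal run of ASCII
    letters, and `text` contains no HTML markup.
    """
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) < MIN_WORD_LENGTH:
            continue
        index[normalize(word)][doc_id] += 1


index = new_index()
add_document(index, 1, "The cat saw the dog at home")
add_document(index, 2, "A dog is a DOG")
```

After these two calls, the entry for "dog" records one occurrence in document 1 and two in document 2, while "at", "is", and "a" are absent because they are shorter than three characters.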
The indexer shall validate its command-line arguments:
- pageDirectory is the pathname for an existing directory produced by the crawler, and
- indexFilename is the pathname of a writeable file; it may or may not already exist.
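One possible shape for this validation is sketched below in Python. The function name parse_args is hypothetical. The sketch checks only that pageDirectory is an existing directory; how to recognize a directory as crawler-produced is not defined in this section, so that check is left out. Writeability is tested by opening the file in append mode, which is an assumption of this sketch rather than a requirement.

```python
import os


def parse_args(argv):
    """Return (page_directory, index_filename), or raise ValueError.

    `argv` has the shape of sys.argv: program name, then the arguments.
    """
    if len(argv) != 3:
        raise ValueError("usage: ./indexer pageDirectory indexFilename")
    page_directory, index_filename = argv[1], argv[2]

    if not os.path.isdir(page_directory):
        raise ValueError(f"not a directory: {page_directory}")

    # Append mode creates the file if needed but, unlike "w", does not
    # truncate an existing file while we are only validating.
    try:
        with open(index_filename, "a"):
            pass
    except OSError as err:
        raise ValueError(f"cannot write {index_filename}: {err}") from err

    return page_directory, index_filename
```

A side effect of this approach is that a successful validation leaves an empty indexFilename behind if the file did not exist, which is harmless here because the indexer goes on to write that file anyway.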
The indexer may assume that
- pageDirectory has files named 1, 2, 3, …, without gaps.
- The content of files in pageDirectory follows the format as defined in the specs; thus your code (to read the files) need not have extensive error checking.
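These assumptions make the reading loop short. The Python sketch below walks the files 1, 2, 3, … and stops at the first missing ID, which is safe only because the files are numbered without gaps. The function name load_pages and the UTF-8 encoding are assumptions of this sketch, and, in line with the text above, it does no error checking on the file contents.

```python
import os


def load_pages(page_directory):
    """Yield (doc_id, url, depth, html) for files 1, 2, 3, ... in order.

    Stops at the first missing ID, relying on the spec's assumption
    that the files are numbered without gaps.
    """
    doc_id = 1
    while True:
        path = os.path.join(page_directory, str(doc_id))
        if not os.path.isfile(path):
            return
        # Assumption: page files are UTF-8 text.
        with open(path, encoding="utf-8") as page:
            url = page.readline().rstrip("\n")   # first line: the URL
            depth = int(page.readline())         # second line: the depth
            html = page.read()                   # the rest: page content
        yield doc_id, url, depth, html
        doc_id += 1
```

Because load_pages is a generator, a caller can index each page as it is read instead of holding every page in memory at once.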
The index tester shall:
- execute from a command line with usage syntax
  ./indextest oldIndexFilename newIndexFilename
  - where oldIndexFilename is the name of a file produced by the indexer, and
  - where newIndexFilename is the name of a file into which the index should be written.
- load the index from the oldIndexFilename into an inverted-index data structure.
- create a file newIndexFilename and write the index to that file, in the format described below.
It need not validate its command-line arguments other than to ensure that it receives precisely two arguments; it may simply try to open the oldIndexFilename for reading and, later, try to open the newIndexFilename for writing.
You may want to run this program as part of a testing script that verifies that the output is identical to (or equivalent to) the input.
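Since lines and docIDs may appear in any order, two index files can be equivalent without being byte-identical. One way a testing script might check equivalence is to sort both orderings away before comparing, as in this Python sketch. The function names canonical_index and indexes_equivalent are hypothetical, and the sketch assumes well-formed input files.

```python
def canonical_index(path):
    """Read an index file into a sorted list of (word, sorted pairs).

    Sorting removes the two orderings the format leaves free: the order
    of lines in the file and the order of docIDs within a line.
    """
    entries = []
    with open(path, encoding="utf-8") as index_file:
        for line in index_file:
            fields = line.split()
            if not fields:
                continue  # tolerate a blank line
            doc_ids = [int(n) for n in fields[1::2]]
            counts = [int(n) for n in fields[2::2]]
            entries.append((fields[0], sorted(zip(doc_ids, counts))))
    return sorted(entries)


def indexes_equivalent(old_path, new_path):
    """True if both files describe the same word -> (docID, count) pairs."""
    return canonical_index(old_path) == canonical_index(new_path)
```

A script could run the index tester and then call indexes_equivalent on oldIndexFilename and newIndexFilename, treating a False result as a test failure.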
The index tester may assume that
- The content of the index file follows the format specified below; thus your code (to recreate an index structure by reading a file) need not have extensive error checking.
Index file format
The indexer writes the inverted index to a file, and both the index tester and the querier read the inverted index from a file; the file shall be in the following format.
- one line per word, one word per line
- each line provides the word and one or more (docID, count) pairs, in the format
  word docID count [docID count]…
  - where word is a string of lower-case letters,
  - where docID is a positive integer,
  - where count is a positive integer,
  - where the word and integers are separated by spaces.
Within the file, the lines may be in any order.
Within a line, the docIDs may be in any order.
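To make the line format concrete, here is a small Python sketch that renders and parses a single index line. The names format_line and parse_line are hypothetical. The sketch assumes single spaces between fields when writing and accepts any run of whitespace when reading.

```python
def format_line(word, counts):
    """Render one index line: the word, then "docID count" per document."""
    fields = [word]
    for doc_id, count in counts.items():
        fields += [str(doc_id), str(count)]
    return " ".join(fields)


def parse_line(line):
    """Inverse of format_line: return (word, {doc_id: count})."""
    word, *numbers = line.split()
    values = [int(n) for n in numbers]
    # Even positions hold docIDs, odd positions hold the matching counts.
    return word, dict(zip(values[0::2], values[1::2]))


line = format_line("dog", {1: 1, 2: 2})
# line is "dog 1 1 2 2"
```

Parsing that line back with parse_line yields the word "dog" and the same docID-to-count mapping, which is the round trip the index tester performs over a whole file.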