Assignment 6: Search Engine

Due: Tuesday, November 23rd by 11:59 PM

Getting started

Start by importing the file CS201_Assign6.zip into Eclipse: File->Import->Existing Projects into Workspace, click Select archive file, select CS201_Assign6.zip from the file selection dialog, and then click the Finish button. You should see a project called CS201_Assign6 in the Package Explorer.

Your task is to implement the main method of the SearchEngine class, as described below.

Your Task

Write a search engine that reads all of the words from a collection of text files and builds an index recording every occurrence of every word in those files.

Each time an occurrence of a word is observed, the search engine should record:

  • the name of the file the word occurred in
  • the author of the document
  • the title of the document
  • the line number of the line in which the word occurred
  • the text of the line in which the word occurred

The search engine should accept queries. The user will enter a series of queries, each naming a word to look for. For each query, the search engine should consult its index and print a list of all occurrences of the queried word.

The query !q should be interpreted as a request to exit the program.
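One way the query loop might be structured is sketched below. The class name QueryLoop, the method run, and the use of plain strings as stand-ins for results are all assumptions made for illustration, as is the assumption that the index keys are already lower case. The prompt and messages follow the example session in this assignment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a query loop; not a required design. */
class QueryLoop {
    /**
     * Reads queries from in until "!q" (or end of input) and prints the
     * matches to out. Each String in the index is a stand-in for one
     * already-formatted occurrence; keys are assumed to be lower case.
     */
    static void run(BufferedReader in, PrintStream out, Map<String, List<String>> index) {
        try {
            while (true) {
                out.print("Query> ");
                String query = in.readLine();
                if (query == null || query.trim().equals("!q")) {
                    out.println("Bye!");
                    return;
                }
                long start = System.currentTimeMillis();
                List<String> results = index.get(query.trim().toLowerCase());
                long elapsed = System.currentTimeMillis() - start;
                out.println("Search completed in " + elapsed + " millisecond(s)");
                int count = (results == null) ? 0 : results.size();
                out.println(count + " result(s)");
                for (int i = 0; i < count; i++) {
                    out.println("Result " + (i + 1) + ":");
                    out.println(results.get(i));
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so that callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing the input and output streams as parameters, rather than using System.in and System.out directly, is a design choice that makes the loop easy to test with canned input.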

Example session (the user types the text that follows the "Directory to scan for documents:" and "Query>" prompts):

Directory to scan for documents: H:/documents
Scanning documents
.............
Scanning completed in 5155 milliseconds
Query> portico
Search completed in 1 millisecond(s)
9 result(s)
Result 1:
  Document: H:/documents/doriangray.txt
  Author  : Oscar Wilde
  Title   : The Picture of Dorian Gray
  Line    : 3465
  Text    : jade-green piles of vegetables.  Under the portico, with its grey,
Result 2:
  Document: H:/documents/doriangray.txt
  Author  : Oscar Wilde
  Title   : The Picture of Dorian Gray
  Line    : 8509
  Text    : all dark.  After a time, he went away and stood in an adjoining portico
Result 3:
  Document: H:/documents/10booksarch.txt
  Author  : Vitruvius
  Title   : Ten Books on Architecture
  Line    : 529
  Text    : THEATRE PORTICO ACCORDING TO VITRUVIUS                               152
Result 4:
  Document: H:/documents/10booksarch.txt
  Author  : Vitruvius
  Title   : Ten Books on Architecture
  Line    : 2851
  Text    : Portico of Metellus, and the Marian temple of Honour and Valour
Result 5:
  Document: H:/documents/10booksarch.txt
  Author  : Vitruvius
  Title   : Ten Books on Architecture
  Line    : 2852
  Text    : constructed by Mucius, which has no portico in the rear.
Result 6:
  Document: H:/documents/10booksarch.txt
  Author  : Vitruvius
  Title   : Ten Books on Architecture
  Line    : 5519
  Text    : 16. Portico
Result 7:
  Document: H:/documents/10booksarch.txt
  Author  : Vitruvius
  Title   : Ten Books on Architecture
  Line    : 11503
  Text    : Metellus, portico of, 78.
Result 8:
  Document: H:/documents/ulysses.txt
  Author  : James Joyce
  Title   : Ulysses
  Line    : 10330
  Text    : The portico.
Result 9:
  Document: H:/documents/littledorrit.txt
  Author  : Charles Dickens
  Title   : Little Dorrit
  Line    : 36633
  Text    : steps of the portico, looking at the fresh perspective of the street in
Query> haberdasher
Search completed in 0 millisecond(s)
0 result(s)
Query> phaeton
Search completed in 0 millisecond(s)
7 result(s)
Result 1:
  Document: H:/documents/pandp.txt
  Author  : Jane Austen
  Title   : Pride and Prejudice
  Line    : 2369
  Text    : to drive by my humble abode in her little phaeton and ponies."
Result 2:
  Document: H:/documents/pandp.txt
  Author  : Jane Austen
  Title   : Pride and Prejudice
  Line    : 5387
  Text    : quest of this wonder; It was two ladies stopping in a low phaeton at the
Result 3:
  Document: H:/documents/pandp.txt
  Author  : Jane Austen
  Title   : Pride and Prejudice
  Line    : 5710
  Text    : along, and how often especially Miss de Bourgh drove by in her phaeton,
Result 4:
  Document: H:/documents/pandp.txt
  Author  : Jane Austen
  Title   : Pride and Prejudice
  Line    : 10790
  Text    : till I have been all round the park. A low phaeton, with a nice little
Result 5:
  Document: H:/documents/ulysses.txt
  Author  : James Joyce
  Title   : Ulysses
  Line    : 29926
  Text    : phaeton with good working solidungular cob (roan gelding, 14 h).
Result 6:
  Document: H:/documents/littledorrit.txt
  Author  : Charles Dickens
  Title   : Little Dorrit
  Line    : 4055
  Text    : his time, and had kept a phaeton, he said. He boasted that he stood up
Result 7:
  Document: H:/documents/littledorrit.txt
  Author  : Charles Dickens
  Title   : Little Dorrit
  Line    : 14249
  Text    : They had a little open phaeton for the journey, and were soon in it on
Query> !q
Bye!

Increasing the Heap Size

If your program is exiting with an OutOfMemoryError, you may need to increase the amount of memory your program is allowed to allocate.

To do this, right-click the SearchEngine class, and choose Run As->Run Configurations.... Select the Arguments tab, and in the box labeled VM arguments enter the following option:

-Xmx512m

This sets the maximum size of the heap to 512 megabytes. The heap is the region of memory the program uses when allocating objects with the new operator. You can increase this value until the program is able to run without generating an out of memory error.
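If you want to confirm that the VM argument took effect, a small check like the following can help. The class name HeapCheck is hypothetical; Runtime.getRuntime().maxMemory() is a standard library call. The reported value may differ slightly from the number you entered, depending on the JVM.

```java
/** Hypothetical helper for checking the heap limit; not part of the assignment. */
class HeapCheck {
    /** Returns the maximum amount of memory the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```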

Requirements, Hints

Suggested approach: define a class to represent a single occurrence of a word in a document. Its fields should record the filename of the document, author, title, line number, and text line.
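A minimal sketch of such a class follows. The class name Occurrence, the field names, and the getter names are assumptions chosen for illustration; the assignment does not prescribe them. The toString method imitates the layout used in the example session.

```java
/** One occurrence of a word in a document (hypothetical name and fields). */
class Occurrence {
    private final String fileName;
    private final String author;
    private final String title;
    private final int lineNumber;
    private final String lineText;

    Occurrence(String fileName, String author, String title, int lineNumber, String lineText) {
        this.fileName = fileName;
        this.author = author;
        this.title = title;
        this.lineNumber = lineNumber;
        this.lineText = lineText;
    }

    String getFileName() { return fileName; }
    String getAuthor() { return author; }
    String getTitle() { return title; }
    int getLineNumber() { return lineNumber; }
    String getLineText() { return lineText; }

    /** Formats the occurrence in the layout shown in the example session. */
    @Override
    public String toString() {
        return "  Document: " + fileName + "\n"
             + "  Author  : " + author + "\n"
             + "  Title   : " + title + "\n"
             + "  Line    : " + lineNumber + "\n"
             + "  Text    : " + lineText;
    }
}
```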

Your index should be a map where the key type is String and the value type is a List whose elements are instances of your occurrence class (as described above). (A list is necessary because a word may occur in multiple documents, and may occur multiple times in the same document.)
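The map described above might be wrapped as in the following sketch. The class name WordIndex and its methods add and lookup are hypothetical, and the type parameter T stands in for your occurrence class so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper around a map from words to lists of occurrences. */
class WordIndex<T> {
    private final Map<String, List<T>> index = new HashMap<>();

    /** Records one occurrence of word, creating the word's list on first use. */
    void add(String word, T occurrence) {
        List<T> list = index.get(word);
        if (list == null) {
            list = new ArrayList<>();
            index.put(word, list);
        }
        list.add(occurrence);
    }

    /** Returns the occurrences recorded for word, or an empty list if there are none. */
    List<T> lookup(String word) {
        List<T> list = index.get(word);
        return (list == null) ? Collections.<T>emptyList() : list;
    }
}
```

Because each list keeps its elements in insertion order, results come back in the order in which the occurrences were added.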

You can assume that somewhere in each file, there is a line which begins with the text Author: and another line which begins with the text Title:. Use those lines as the author and title of the document, respectively.
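Recognizing these lines could look something like the sketch below. The class name HeaderLines and the method valueAfter are hypothetical, and trimming the surrounding whitespace is an assumption about how the author and title should be displayed.

```java
/** Hypothetical helper for pulling the author or title out of a line. */
class HeaderLines {
    /**
     * Returns the text that follows prefix, with surrounding whitespace
     * removed, if line begins with prefix; otherwise returns null.
     */
    static String valueAfter(String prefix, String line) {
        if (line.startsWith(prefix)) {
            return line.substring(prefix.length()).trim();
        }
        return null;
    }
}
```

For example, HeaderLines.valueAfter("Author:", "Author: Jane Austen") returns "Jane Austen", and the same call on a line that does not start with "Author:" returns null.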

A word should be considered to be a sequence of letters.

The search engine should treat upper and lower case letters as equivalent. So, for example, the queries Carriage, carriage, and CARRIAGE should all return the same set of occurrences.
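The two rules above (words are sequences of letters, and case is ignored) could be applied together as in this sketch. The class name WordSplitter and the method wordsIn are hypothetical. The sketch assumes that any non-letter character ends the current word, so "jade-green" yields two words; that reading is an interpretation, not something the assignment states.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that breaks a line of text into lower-case words. */
class WordSplitter {
    /** Returns the words in line, lower-cased, in order of appearance. */
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the word being built (assumption).
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Lower-casing words as they are indexed only helps if each query is lower-cased the same way before it is looked up.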

The java.io.File class represents the name of a single file or directory. When called on a File object representing a directory, the listFiles() method returns an array of File objects representing files in that directory. So, your program can do something like this to build the index:

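// Requires: import java.io.BufferedReader; import java.io.File; import java.io.InputStreamReader;
BufferedReader keyboard = new BufferedReader(new InputStreamReader(System.in));

// print (not println) leaves the cursor on the prompt line, as in the example session
System.out.print("Directory to scan for documents: ");
String dirName = keyboard.readLine();

File dir = new File(dirName);
// listFiles() returns null if dir is not a directory or cannot be read
File[] contents = dir.listFiles();

if (contents != null) {
        for (File f : contents) {
                if (f.isFile() && f.getPath().endsWith(".txt")) {
                        // open the file f, add all words to the index
                }
        }
}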

As the example above demonstrates, you only need to consider files whose names end with ".txt" when building the index.
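For the step of opening a file and visiting its lines, one possible approach is sketched below. The class name LineReaderSketch and the method readLines are hypothetical. Reading all of the lines before indexing them is a design assumption, not a requirement; it can be convenient because the Author: and Title: lines may appear anywhere in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that collects the lines of a document. */
class LineReaderSketch {
    /**
     * Reads every remaining line from reader. The element at position i
     * of the returned list is line number i + 1 of the document.
     */
    static List<String> readLines(BufferedReader reader) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrown unchecked so that callers need no throws clause.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

For a File f, the reader could be created with new BufferedReader(new FileReader(f)), and it should be closed once the lines have been read.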

Test corpus

To test your program, download the following zip file:

CS201_Assign6_Corpus.zip

Extract the zip file. When running your program, specify the directory containing the extracted text files.

These documents (and many others) are in the public domain, and are available from the Project Gutenberg Website.

Grading

For a grade of up to 90 points, complete all of the features described above. Of those 90 points, 15 are allocated to how well you used classes and objects, and to coding style (indentation, meaningful identifiers, etc.).

For up to 100 points, handle multiple-word queries. If the user types multiple words enclosed in double quote characters, the search engine should return a list of all occurrences of that exact sequence of words.
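One possible building block for quoted queries is sketched here. The class name PhraseMatch and both method names are hypothetical. The sketch assumes that a phrase must appear within a single line and that both lists already contain lower-case words; the assignment does not say whether a phrase may span lines. Collections.indexOfSubList is a standard library method.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical helpers for multiple-word (quoted) queries. */
class PhraseMatch {
    /** Returns true if query is wrapped in double quote characters. */
    static boolean isPhraseQuery(String query) {
        String q = query.trim();
        return q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
    }

    /**
     * Returns true if the words of phrase appear in lineWords as a
     * contiguous run, in the same order. An empty phrase never matches.
     */
    static boolean containsPhrase(List<String> lineWords, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(lineWords, phrase) >= 0;
    }
}
```

A search could look up the first word of the phrase in the index and then keep only the occurrences whose line passes containsPhrase.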

Submitting

Export your project as a zip file (File->Export...->Archive File) and upload it to the submission server as assign6. The URL of the server is

https://camel.ycp.edu:8443/