CMPU 101 - Lab 10

$Revision: 1.5 $

In this lab you will write classes that count the number of times words appear in an input document.

Getting Started

Download the file cs101-lab10.zip from the Labs section of the class web page.  Save the file in your home directory.  Extract the contents by opening a terminal window and running the command:

unzip cs101-lab10.zip

The WordTabulator and WordCount classes

One way to solve the problem is to use a pair of classes: let's call them WordCount and WordTabulator.

WordCount represents a single word and the number of times it appears in the input file.

It has the following methods:

public WordCount(String word_)
Constructor.  The word_ parameter specifies the word the object represents.  The count should be set to 1.

public String getWord()
Return the word the object represents.

public void addToCount()
Increase the count by 1.

public int getCount()
Get the count (the number of times the word occurs in the input file).
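
As a rough illustration of the description above, here is a minimal sketch of a WordCount class.  The field names word and count are assumptions (the lab does not name them), and the class is shown without the public modifier only so that the sketch can live in any source file; treat it as one possible shape, not the required solution.

```java
// Sketch of the WordCount class described above.
// The field names `word` and `count` are assumptions, not given in the lab.
class WordCount {
    private final String word; // the word this object represents
    private int count;         // how many times the word has been seen

    public WordCount(String word_) {
        word = word_;
        count = 1; // the object is created on the word's first occurrence
    }

    public String getWord() {
        return word;
    }

    public void addToCount() {
        count++;
    }

    public int getCount() {
        return count;
    }
}
```

A freshly constructed WordCount reports a count of 1, and each call to addToCount raises that count by one.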

WordTabulator represents a collection of WordCount objects.

It has the following methods:

public WordTabulator()
Constructor.  Initializes the object so that it contains an empty collection of WordCount objects.  It needs to create a new ArrayList<WordCount> object and save a reference to it in an instance variable.

public void addWordOccurrence(String word)
Record an occurrence of the given word.  This is done by searching for a matching WordCount object in the collection.  If one is found, then its count is increased by one.  If no matching WordCount is found, then a new one is created for the word and added to the collection.

public int getNumWords()
Returns the number of distinct words that have been added to the collection.

public WordCount getWordCount(int index)
Get the WordCount object whose index in the collection is given.

public WordCount getWordCountByWord(String word)
Get the WordCount object for the given word.  Return null if the word does not appear in the collection.

public void sortByWordCount()
Sort the collection of WordCount objects so that they are arranged in increasing order by their counts.  This means that the most-frequently occurring words will be located at the end of the collection (at the highest index values).  You can implement this method using the following statement:
java.util.Collections.sort(wordCountList, new WordCountComparator());
This assumes that the instance variable storing the ArrayList is called wordCountList.  The WordCountComparator class is included in the lab zipfile.
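
The methods above could fit together roughly as in the following sketch.  Two things here are assumptions made so the example is self-contained: the compact WordCount class is a stand-in for your own, and the comparator built with Comparator.comparingInt is a stand-in for the WordCountComparator class from the lab zipfile, whose source is not shown in this handout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;

// Compact stand-in for the WordCount class described earlier (assumption).
class WordCount {
    private final String word;
    private int count = 1;
    WordCount(String word_) { word = word_; }
    String getWord() { return word; }
    void addToCount() { count++; }
    int getCount() { return count; }
}

class WordTabulator {
    // The handout suggests the name wordCountList for this instance variable.
    private final ArrayList<WordCount> wordCountList;

    public WordTabulator() {
        wordCountList = new ArrayList<WordCount>();
    }

    public void addWordOccurrence(String word) {
        WordCount existing = getWordCountByWord(word);
        if (existing != null) {
            existing.addToCount();                  // word seen before
        } else {
            wordCountList.add(new WordCount(word)); // first occurrence
        }
    }

    public int getNumWords() {
        return wordCountList.size();
    }

    public WordCount getWordCount(int index) {
        return wordCountList.get(index);
    }

    public WordCount getWordCountByWord(String word) {
        // Linear search; compare strings with equals, not ==.
        for (WordCount wc : wordCountList) {
            if (wc.getWord().equals(word)) {
                return wc;
            }
        }
        return null;
    }

    public void sortByWordCount() {
        // Stand-in comparator (assumption): orders by increasing count.
        Collections.sort(wordCountList, Comparator.comparingInt(WordCount::getCount));
    }
}
```

Note that in this sketch every call to addWordOccurrence scans the list from the start, so the work per word grows with the number of distinct words seen so far.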

Your Tasks

Your first task is to create the WordCount and WordTabulator classes described above.  You can use the JUnit test classes WordCountTest and WordTabulatorTest to make sure that they work correctly.

Suggestion: first implement WordCount, then test it using WordCountTest, then implement WordTabulator, and then test it using WordTabulatorTest.

Your second task is to finish the CountWordsInFile class.  This class consists of a main method that opens an input file and uses a scanner to read words out of the input file.  You should modify the program so that it creates a WordTabulator object to keep track of the number of times each word appears in the input file.  After all of the words in the file have been tabulated, it should use the WordTabulator object to print to System.out:

  1. The number of distinct words in the input file
  2. The 10 most-frequently occurring words and their occurrence counts
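
Because sortByWordCount arranges the collection in increasing order, the most frequent words end up at the highest indices, so the report can walk backward from the end of the collection.  The sketch below shows only that index arithmetic; the class name TopWords and the method topIndices are hypothetical helpers invented for this illustration and are not part of the lab files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the lab files).
class TopWords {
    // Returns the indices of the last n positions in a collection of the
    // given size, highest index first.  If the collection holds fewer
    // than n items, every index is returned.
    static List<Integer> topIndices(int size, int n) {
        List<Integer> result = new ArrayList<Integer>();
        int stop = Math.max(0, size - n); // never step below index 0
        for (int i = size - 1; i >= stop; i--) {
            result.add(i);
        }
        return result;
    }
}
```

With a sorted WordTabulator, each index produced by topIndices(getNumWords(), 10) could be passed to getWordCount, and the resulting word and count printed on one line.  The Math.max guard matters for short input files that contain fewer than 10 distinct words.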

Running the CountWordsInFile program

To run the CountWordsInFile program, first open a terminal window and run the commands

cd
cd cs101-lab10

Two text files are included for you to test your program with:

declind.txt
The text of the U.S. Declaration of Independence.
pandp.txt
The text of Pride and Prejudice by Jane Austen.

To run the program, first compile all of your files, then run the following command in the terminal window:

java CountWordsInFile filename.txt

substituting the name of the file you want to read in place of filename.txt.

Test your program using declind.txt first, since it will take a fairly long time to tabulate all of the words in Pride and Prejudice.

If all goes well you should see output similar to the following (this is for Pride and Prejudice):

The file contained 6324 distinct words
The most-frequent words:
the: 4331
to: 4163
of: 3609
and: 3585
her: 2227
i: 2069
a: 1956
in: 1879
was: 1847
she: 1710

If your program does not terminate, you can kill it by typing Control-C in the terminal window.

When You Are Done

Show the output of the CountWordsInFile program on one of the input files to a lab coach or instructor.

Run the following commands in a terminal window:

cd
zip -9r cs101-lab10-solution.zip cs101-lab10

Make sure all files get added properly.

Submit cs101-lab10-solution.zip using the CS 101 Submission Website.  Make sure you submit cs101-lab10-solution.zip, not cs101-lab10.zip.