CMPU 102 - Lab 12

$Revision: 1.3 $

In this lab you will implement two hash functions for strings, additive and multiplicative.  You will then perform an experiment to see how frequently collisions occur when these hash functions are used on words occurring in an input document.

Getting Started

Import lab12.zip into your Eclipse workspace.

The StringHash interface

The StringHash interface defines a single method:

public int computeHash(String s)
Compute a hash code for the String s and return it.

An object implementing the StringHash interface is a functor that computes a hash code for a String value.

First Task

Your first task is to implement two classes that implement the StringHash interface.

The AdditiveStringHash class should compute a hash code for the input String value by computing the sum of numeric values of the characters in the String.
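The additive scheme described above can be sketched as a standalone method.  This is only an illustration: the names AdditiveSketch and additiveHash are hypothetical and not part of lab12.zip, and the sketch assumes that the "numeric value" of a character is its Java char value.

```java
// Illustrative sketch only; AdditiveSketch and additiveHash are hypothetical names.
class AdditiveSketch {
    // Sum the char values of every character in s (assumed meaning of "numeric value").
    static int additiveHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(additiveHash("cat")); // 99 + 97 + 116 = 312
        System.out.println(additiveHash("act")); // same characters, so also 312
    }
}
```

Note that the order of the characters has no effect on the result, which is worth keeping in mind when you look at your collision counts.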

The MultiplicativeStringHash class should compute a hash code for the input String value by summing the products of each numeric character value with decreasing powers of a factor C, from C^{n-1} down to C^0.  In other words, if the numeric values of characters 0, 1, 2, ..., n-1 in a String are given by the sequence

a_0, a_1, a_2, ..., a_{n-1}

then the multiplicative hash code for the String should be the sum

(C^{n-1} ⋅ a_0) + (C^{n-2} ⋅ a_1) + (C^{n-3} ⋅ a_2) + ... + (C^1 ⋅ a_{n-2}) + (C^0 ⋅ a_{n-1})

A simple way to compute this sum is with a loop from 0..n-1 in which each iteration multiplies the current sum by C and then adds a_i, where i is the loop variable.  The initial value of the sum should be 0.
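The loop described above can be sketched as follows.  The names MultiplicativeSketch and multiplicativeHash are hypothetical, the factor is passed as a parameter only to keep the sketch standalone (the lab class stores it in an instance variable instead), and the sketch again assumes that a character's numeric value is its char value.  For the two-character String "ab" with C = 23, the sum is (23 ⋅ 97) + 98 = 2329.

```java
// Illustrative sketch only; MultiplicativeSketch and multiplicativeHash are hypothetical names.
class MultiplicativeSketch {
    // Multiply the running sum by c, then add the next character value.
    // After the last iteration, character i has been multiplied by c exactly (n-1-i) times.
    static int multiplicativeHash(String s, int c) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum = sum * c + s.charAt(i); // int overflow wraps around silently
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(multiplicativeHash("ab", 23)); // 23 * 97 + 98 = 2329
        // With c = 31 this is the same polynomial that String.hashCode() documents.
        System.out.println(multiplicativeHash("hello", 31) == "hello".hashCode());
    }
}
```

Unlike the additive sketch, reordering the characters generally changes the result here, because each position is weighted by a different power of the factor.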

The MultiplicativeStringHash class should have an instance variable to store the value of the factor C.  That way, you can vary C to investigate the effect of the choice of factor on hash collisions.

Both AdditiveStringHash and MultiplicativeStringHash should have a toString method.  AdditiveStringHash's toString should simply return the String "Additive".  MultiplicativeStringHash's toString should return a String containing "Multiplicative" followed by the value of the factor C that the object uses.
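As a rough illustration of a toString method that reports the factor, here is a standalone sketch.  The class name FactorLabelSketch and the field name factor are hypothetical, and the "(C=...)" format is only an assumption based on the sample output in the Testing section; any String containing "Multiplicative" followed by the factor would fit the description above.

```java
// Illustrative sketch only; FactorLabelSketch and its field name are hypothetical.
class FactorLabelSketch {
    private final int factor; // the factor C chosen when the object is created

    FactorLabelSketch(int factor) {
        this.factor = factor;
    }

    @Override
    public String toString() {
        // Assumed format, modeled on the sample output line "Multiplicative(C=23)".
        return "Multiplicative(C=" + factor + ")";
    }

    public static void main(String[] args) {
        System.out.println(new FactorLabelSketch(23)); // prints Multiplicative(C=23)
    }
}
```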

Second Task

Your second task is to complete the TestCollisions class.  First, implement the findNumCollisions method.  This method takes a StringHash functor and a Set of String values representing the words in the input file.  You should create a Set of Integer objects to represent the set of hash values that have been encountered so far while computing the hash codes of the input Strings.  You can create a new empty set of Integer objects as follows:

Set<Integer> hashCodeSet = new HashSet<Integer>();

Traverse each String in the set of Strings and compute its hash code using the functor.  Check your set of hash codes to see if the hash code has been encountered previously.  If the hash code has appeared previously (as the hash code of an earlier word), count that as a single collision.  The contains method on the Set will tell you whether or not a particular hash code is contained in the Set.  In either case (whether or not the hash code has occurred previously), add the hash code to the set of hash codes as a new Integer object.  (The add method adds a new Integer to the Set.)

After every String has been checked, return the total number of collisions.
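The counting procedure above can be sketched as follows.  This is a minimal illustration under stated assumptions, not the lab's TestCollisions code: the class name CollisionSketch is hypothetical, the nested StringHash interface is a stand-in for the one provided in lab12.zip, and the exact signature of findNumCollisions in the lab skeleton may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; CollisionSketch is a hypothetical name.
class CollisionSketch {
    // Stand-in for the StringHash interface supplied with the lab.
    interface StringHash {
        int computeHash(String s);
    }

    // Count the words whose hash code was already produced by an earlier word.
    static int findNumCollisions(StringHash hash, Set<String> words) {
        Set<Integer> hashCodeSet = new HashSet<Integer>();
        int collisions = 0;
        for (String word : words) {
            int code = hash.computeHash(word);
            if (hashCodeSet.contains(code)) {
                collisions++; // an earlier word already produced this code
            }
            hashCodeSet.add(code); // adding an existing value leaves the set unchanged
        }
        return collisions;
    }

    public static void main(String[] args) {
        // A deliberately weak hash for demonstration: the length of the word.
        StringHash lengthHash = s -> s.length();
        Set<String> words = new HashSet<String>(Arrays.asList("cat", "dog", "bird"));
        System.out.println(findNumCollisions(lengthHash, words)); // 1 ("cat" and "dog" share length 3)
    }
}
```

Because each repeated hash code counts once per word, the result equals the number of words minus the number of distinct hash codes, regardless of the order in which the Set is traversed.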

Once you have implemented findNumCollisions, you can finish the main method.  The main method should make calls to the runTest method, testing a different hash function each time.  The specific hash functions that should be tested are described in a comment.  For example, you can test the additive hash function using the statement

runTest(new AdditiveStringHash(), wordSet);

Testing

To run TestCollisions,

  1. First, right-click on "TestCollisions.java" and choose "Run As->Java Application".  You should see an error message in the Console window.

  2. Next, right-click on "TestCollisions.java" in the Package Explorer, choose "Run As->Run...", and then enter
    pandp.txt
    
    in the "Arguments" tab.  Then click "Run".

This will test your hash functions on the words from Pride and Prejudice by Jane Austen.  You should see output similar to the following:

Read 12659 distinct words
Additive: 11236 collisions
Multiplicative(C=23): 1 collisions
...more lines follow...

After performing steps 1 and 2 above, you can run the program again by clicking the "Run" button in the toolbar.

When You Are Done

Show the output of running TestCollisions to a lab coach or instructor.

Ask yourself two questions: why do so many collisions occur with the additive hash function, and why do so many occur with the multiplicative hash function when C is a power of two?

Submit by running the following commands in a terminal window:

cd
cd eclipse-workspace
submit102 lab12