$Revision: 1.3 $
In this lab you will implement two hash functions for strings, additive and multiplicative. You will then perform an experiment to see how frequently collisions occur when these hash functions are used on words occurring in an input document.
Import lab12.zip into your Eclipse workspace.
The StringHash interface defines a single method:
An object implementing the StringHash interface is a functor that computes a hash code for a String value.
Your first task is to implement two classes that implement the StringHash interface.
The AdditiveStringHash class should compute a hash code for the input String value by computing the sum of numeric values of the characters in the String.
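One way the additive hash could look is sketched below. The method name hash and its int return type are assumptions, since the interface's method is not reproduced in this handout; adapt them to whatever StringHash declares in lab12.zip.

```java
// Assumed shape of the lab's interface; the real method name may differ.
interface StringHash {
    int hash(String s);
}

class AdditiveStringHash implements StringHash {
    @Override
    public int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            // A char widens to int, giving its numeric (UTF-16) value.
            sum += s.charAt(i);
        }
        return sum;
    }
}
```

With this sketch, "abc" hashes to 97 + 98 + 99 = 294, and so does any rearrangement of those letters.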
The MultiplicativeStringHash class should compute a hash code for the input String value by summing the products of each numeric character value with decreasing powers of a factor C from C^{n-1} down to C^{0}. In other words, if the numeric values of characters 0, 1, 2, ..., n-1 in a String are given by the sequence
a_{0}, a_{1}, a_{2}, ..., a_{n-1}
then the multiplicative hash code for the String should be the sum
(C^{n-1} ⋅ a_{0}) + (C^{n-2} ⋅ a_{1}) + (C^{n-3} ⋅ a_{2}) + ... + (C^{1} ⋅ a_{n-2}) + (C^{0} ⋅ a_{n-1})
A simple way to compute the sum of this sequence is with a loop from 0..n-1 where each iteration of the loop multiplies the current sum by C and then adds a_{i}, where i is the loop variable. The initial value of the sum should be 0.
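The multiply-then-add loop is an instance of Horner's rule. As a worked check for n = 3, expanding the three loop iterations from an initial sum of 0 gives back the required sum:

```latex
\begin{aligned}
s_0 &= 0 \cdot C + a_0 = a_0 \\
s_1 &= s_0 \cdot C + a_1 = C\,a_0 + a_1 \\
s_2 &= s_1 \cdot C + a_2 = C^{2} a_0 + C^{1} a_1 + C^{0} a_2
\end{aligned}
```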
The MultiplicativeStringHash class should have an instance variable to store the value of the factor C. That way, you can vary C to investigate the choice of factor on hash collisions.
Both AdditiveStringHash and MultiplicativeStringHash should have a toString method. AdditiveStringHash's toString should simply return the String "Additive". MultiplicativeStringHash's toString should return a String containing "Multiplicative" followed by the value of the factor C that the object uses.
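A minimal sketch of the multiplicative class follows, under a few assumptions: the interface method is called hash (a hypothetical name, as the real signature is not shown here), the factor is passed to a constructor, and toString uses the "Multiplicative(C=23)" format seen in the sample output later in this lab.

```java
// Assumed shape of the lab's interface; the real method name may differ.
interface StringHash {
    int hash(String s);
}

class MultiplicativeStringHash implements StringHash {
    // The factor C, stored so different values can be tried.
    private final int factor;

    MultiplicativeStringHash(int factor) {
        this.factor = factor;
    }

    @Override
    public int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            // Multiply the running sum by C, then add the next character.
            // int arithmetic wraps around silently on overflow.
            sum = sum * factor + s.charAt(i);
        }
        return sum;
    }

    @Override
    public String toString() {
        // Format assumed from the sample output.
        return "Multiplicative(C=" + factor + ")";
    }
}
```

As a sanity check, with C = 31 this loop computes the same value as Java's built-in String.hashCode, which is defined by the same polynomial.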
Your second task is to complete the TestCollisions class. First, implement the findNumCollisions method. This method takes a StringHash functor and a Set of String values representing the words in the input file. You should create a Set of Integer objects to represent the set of hash values that have been encountered by computing the hash codes of the input Strings. You can create a new empty set of Integer objects as follows:
Set<Integer> hashCodeSet = new HashSet<Integer>();
Iterate over the Strings in the word set and compute each one's hash code using the functor. Check your set of hash codes to see whether the hash code has been encountered previously. If it has (as the hash code of an earlier word), count that as a single collision. The contains method on the Set tells you whether a particular hash code is in the Set. In either case (whether or not the hash code has occurred previously), add the hash code to the set of hash codes as a new Integer object. (The add method adds a new Integer to the Set.)
After every String has been checked, return the total number of collisions.
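The counting procedure described above might be sketched as follows. The enclosing class name CollisionCounter and the interface method name hash are placeholders for illustration; in the lab the method belongs in TestCollisions and should call whatever method StringHash actually declares.

```java
import java.util.HashSet;
import java.util.Set;

// Assumed shape of the lab's interface; the real method name may differ.
interface StringHash {
    int hash(String s);
}

class CollisionCounter {
    static int findNumCollisions(StringHash hashFunction, Set<String> words) {
        Set<Integer> hashCodeSet = new HashSet<Integer>();
        int collisions = 0;
        for (String word : words) {
            int code = hashFunction.hash(word);
            // A hash code seen for an earlier word counts as one collision.
            if (hashCodeSet.contains(code)) {
                collisions++;
            }
            // Record the code either way; adding a duplicate is a no-op.
            hashCodeSet.add(code);
        }
        return collisions;
    }
}
```

Because the words are distinct, k words sharing one hash code contribute k - 1 collisions, regardless of the order in which the Set is traversed.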
Once you have implemented findNumCollisions, you can finish the main method. The main method should make calls to the runTest method, testing a different hash function each time. The specific hash functions that should be tested are described in a comment. For example, you can test the additive hash function using the statement
runTest(new AdditiveStringHash(), wordSet);
To run TestCollisions, open its run configuration and enter pandp.txt in the "Arguments" tab. Then click "Run".
This will test your hash functions on the words from Pride and Prejudice by Jane Austen. You should see output similar to the following:
Read 12659 distinct words
Additive: 11236 collisions
Multiplicative(C=23): 1 collisions
...more lines follow...
Once the first run has been set up as described above, you can run the program again by clicking the "Run" button in the toolbar.
Show the output of running TestCollisions to a lab coach or instructor.
Ask yourself why so many collisions occur with the additive hash function, and with the multiplicative hash function when C is a power of two.
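A small experiment can help you explore these questions. The sketch below uses standalone static helpers (hypothetical names, not part of the lab code) and a few hand-picked words. It relies on Java's int being 32 bits wide, so multiplying by 256 shifts the running sum left by 8 bits and anything pushed past bit 31 is lost.

```java
class PowerOfTwoDemo {
    // Sum of the character values.
    static int additiveHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // The same multiply-then-add loop as the multiplicative hash.
    static int multiplicativeHash(String s, int c) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum = sum * c + s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Anagrams contain the same characters, so their sums are equal.
        System.out.println(additiveHash("listen") == additiveHash("silent")); // true

        // With C = 256 only the last four characters survive in 32 bits.
        System.out.println(multiplicativeHash("walking", 256)
                == multiplicativeHash("talking", 256)); // true

        // With C = 23 the first character still affects the result.
        System.out.println(multiplicativeHash("walking", 23)
                == multiplicativeHash("talking", 23)); // false
    }
}
```

Try other word pairs and other powers of two to see how many trailing characters still influence the hash code in each case.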
Submit by running the following commands in a terminal window:
cd
cd eclipse-workspace
submit102 lab12