Wednesday, March 21, 2012

Java Hashtable and the Apache Commons Lang package

Hash Table Review
Hash tables are data structures that map an object, the key, to another object, the value. Each key can only be mapped to one value, but different keys can be mapped to the same value. An array can be thought of as a hash table in which the key is an index and the value is the object stored in the corresponding position in the array. Each index maps one and only value in the array. Arrays have the advantage that, knowing the key, the time to retrieve the corresponding value (access time) is O(1). On the other hand, their limitation is that the key can only be a (non negative) integer.
Hash tables are designed to have O(1) access time just like an array, but with the feature of accepting arbitrary keys, for example floating point numbers, strings, or arbitrary objects. A hash table can be described as a collection of buckets containing a certain number of [key,value] pairs with distinct keys (e.g., Introduction to Algorithms by Cormen et al.). Each bucket is labeled by an integer and each key can only be in one of the buckets. The bucket label is called hash code. Each key uniquely defines its hash code, but two different keys can have the same hash code. The hash code can be computed in many ways, depending on the application and there are a lot of studies on this topic alone. Luckily, depending on the programming language of choice, there are facilities that simplify the creation of hash functions. For example, the Apache Software Foundation provides the commons lang package to create such functions in java, while the Boost C++ Libraries provide functions to do the same in C++. In my experience, using these functions greatly simplifies your life if you want to use hash tables.
Now, let's say we want to put a new [key,value] pair into a hash table. The given key will be compared to each key in the bucket corresponding to the key hash code. If the key is already there, its value will be replaced with the new one, otherwise the pair will be simply added to the bucket. This is the behavior we expect since each key maps to one value only. Similarly if we are looking for the value mapped to a given key, the hash table looks for the [key,value] pair in the bucket corresponding to the key hash code and returns the value (or nothing if the key is not there). If the buckets contain a small number of elements (and we will assume they will), the insertion time and access time are both O(1). For these two basic operations to work properly it must be possible to compare the keys to each other.

Java Hashtable
Now let's see what we need to use hash tables in java. First of all, it should be clear that we need to define a hash code function for the objects we intend to use as keys. The java Object class contains the hashCode() function, which should be overridden by the objects we want to use as keys. Second, we need to make sure that our keys can be compared to each other. This is done by overriding another method from the Object class, the equals() method.
The HashCodeBuilder class and the EqualsBuilder class from the Apache Commons Lang package allow to easily create these two function. Here's an example of a class that's ready to be used as a hash table key:

 import org.apache.commons.lang3.builder.EqualsBuilder;  
 import org.apache.commons.lang3.builder.HashCodeBuilder;  
 public class Point {  
      private final int x, y;  
      public Point(int x, int y) {  
           this.x = x;  
           this.y = y;  
      }  
      @Override  
      public int hashCode() {  
           return new HashCodeBuilder().append(x).append(y).toHashCode();  
      }  
      @Override  
      public boolean equals(Object other) {  
           if (other instanceof Point == false) {  
                return false;  
           }  
           if (this == other) {  
                return true;  
           }  
           Point rhs = (Point) other;  
           return new EqualsBuilder().append(x, rhs.x).append(y, rhs.y).isEquals();  
      }  
 }  

The Point class represents a simple data structure to store integer coordinates on a plane. The use of EqualsBuilder is probably an overkill for such a simple class but I think it's good practice to familiarize with it since it provides interesting features, e.g., the ability to use the super class' equals method as its basis.
Now the class Point can be used in a java Hashtable or HashSet (similar in behavior to a table, but only stores keys, no values). The use of final members in the class Point shouldn't surprise. It makes perfect sense that the members that contribute to the computation of the hash code won't change during the program execution. If this happened for a key already in the hash table, its hash code would suddenly change and thus any consequent access would fail to find the key because the hash table would look in a different bucket than before. Here's an example using the Point class:

 java.util.Hashtable<Point, Integer> table = new java.util.Hashtable<Point, Integer>();  
 table.put(new Point(0, 0), 0);  
 table.put(new Point(0, 1), 5);  
 table.put(new Point(1, 0), 8);  
 table.put(new Point(1, 1), 10);  
 // replacing a value  
 table.put(new Point(1, 0), 2);  
 System.out.println(table.get(new Point(1, 1)));  
 System.out.println(table.get(new Point(1, 0)));  

Running the code will print 10 and 2. Note that the type of the value of choice (Integer) is irrelevant to the example.

Double pitfall
here's an interesting example using the java Double class as a key. The functions hashCode() and equals() are already defined for this class and they (usually) work properly. However, I ran in an interesting behavior that can be reproduced easily with the following code (java 1.7):

 java.util.Hashtable<Double, String> table = new java.util.Hashtable<Double, String>();  
 table.put(0.1, "George");  
 table.put(Double.NaN, "Paul");  
 table.put(0.0, "John");  
 System.out.println(table.get(0.0));  
 // is this the same as before?  
 System.out.println(table.get(-0.0));  
 // replacing?  
 table.put(-0.0, "Ringo");  
 System.out.println(table.get(0.0));  
 System.out.println(table.get(-0.0));  
 System.out.println(table.get(Double.NaN));  

This example shows how the hashCode() function for the Double class seems to violate the assumption that two identical objects will necessarily lead to the same hash code: the hash codes for the value 0.0 and -0.0 are different. In fact, although 0.0 and -0.0 are the same number, their binary representations are different, thus the different hash codes (the hash code is computed using this binary representation). The design choice that led to the definition of this hash function can be considered poor, although it's consistent with the java guidelines to define hashCode(). Similarly, the equals() method has the same odd behavior:
 double positiveZero = 0.0, negativeZero = -0.0;  
 System.out.println(new Double(positiveZero).equals(new Double(negativeZero)));  
 System.out.println(positiveZero == negativeZero);  

The same applies to the Float class. My suggestion when using floating point values (or tuples of them) as keys in a java hash table is to make sure that only one of the two binary representations of zero is used throughout the code. For example:

 import org.apache.commons.lang3.builder.EqualsBuilder;  
 import org.apache.commons.lang3.builder.HashCodeBuilder;  
 public class PointFloat {  
      private final float x, y;  
      public PointFloat(float x, float y) {  
           if (x == 0)  
                x = 0;  
           if (y == 0)  
                y = 0;  
           this.x = x;  
           this.y = y;  
      }  
      @Override  
      public int hashCode() {  
           return new HashCodeBuilder().append(x).append(y).toHashCode();  
      }  
      @Override  
      public boolean equals(Object other) {  
           if (other instanceof PointFloat == false) {  
                return false;  
           }  
           if (this == other) {  
                return true;  
           }  
           PointFloat rhs = (PointFloat) other;  
           return new EqualsBuilder().append(x, rhs.x).append(y, rhs.y).isEquals();  
      }  
      public static void main(String[] args) {  
           java.util.Hashtable<PointFloat, Integer> table = new java.util.Hashtable<PointFloat, Integer>();  
           table.put(new PointFloat(0f, 0f), 0);  
           // replacing now works correctly  
           table.put(new PointFloat(-0f, 0f), 1);  
           table.put(new PointFloat(0f, -0f), 2);  
           table.put(new PointFloat(-0f, 1f), 5);  
           table.put(new PointFloat(1f, 0f), 8);  
           table.put(new PointFloat(1f, 1f), 10);  
           table.put(new PointFloat(1f, 0f), 3);  
           System.out.println(table.get(new PointFloat(1f, 1f)));  
           System.out.println(table.get(new PointFloat(1f, 0f)));  
           System.out.println(table.get(new PointFloat(0f, 0f)));  
      }  
 }  

The class constructor makes sure that if its members x,y are equal to zero, only the positive zero representation is used. Since the two member variables are declared final it's safe to assume that only the positive zero representation is used. The hash table now behaves as expected (the first 4 lines in the constructor can be commented to trigger the unwanted behavior).

No comments:

Post a Comment