A faster and more compact Set

Looking at some of the new code being added to Ontopia, I thought it might be useful to point out that Ontopia includes a class called CompactHashSet, which is both faster and more compact than java.util.HashSet. So when you use Sets in your code, it may well be worth using CompactHashSet instead of HashSet.
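
Since CompactHashSet implements java.util.Set (the test code further down passes one to a method taking a Set), switching is just a matter of changing the constructor call. A minimal usage sketch, using raw types as in the test code below:

import java.util.Set;
import net.ontopia.utils.CompactHashSet;

public class Example {
  public static void main(String[] argv) {
    Set names = new CompactHashSet(); // instead of: new HashSet()
    names.add("name");
    System.out.println(names.contains("name")); // prints "true"
  }
}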

The in-memory engine uses Sets throughout for all the collections in the core API. There are a lot of these: the set of topics, the set of names for each topic, the set of item identifiers for each topic, … Ontopia originally used a lot of memory, and we identified HashSet as a major source of this, so CompactHashSet was written to reduce memory usage. An interesting side-effect was that it also turned out to be faster.

HashSet uses open hashing (also known as separate chaining), which means that each bucket in the hash table refers to a linked list of entries whose hash codes place them in the same bucket. This means that for each entry in the set an extra linked-list node object must be allocated, which of course requires extra memory.
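
To make the overhead concrete, here is a hedged sketch of open hashing; the names are illustrative, not the JDK's actual internals:

class ChainedSet {
  // One of these node objects is allocated for every element in the
  // set; this is where the extra memory goes.
  static class Entry {
    Object value;
    Entry next; // next entry whose hash code maps to the same bucket
    Entry(Object value, Entry next) { this.value = value; this.next = next; }
  }

  private Entry[] buckets = new Entry[16];

  public boolean add(Object value) {
    int ix = (value.hashCode() & 0x7FFFFFFF) % buckets.length;
    for (Entry e = buckets[ix]; e != null; e = e.next)
      if (e.value.equals(value))
        return false; // already present
    buckets[ix] = new Entry(value, buckets[ix]); // the extra allocation
    return true;
  }

  public boolean contains(Object value) {
    int ix = (value.hashCode() & 0x7FFFFFFF) % buckets.length;
    for (Entry e = buckets[ix]; e != null; e = e.next)
      if (e.value.equals(value))
        return true;
    return false;
  }
}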

CompactHashSet, by contrast, uses closed hashing (also known as open addressing). With this strategy, if the bucket you want to place a new entry in is already occupied, you run further down the hash table (in a predictable way!) looking for a free bucket. This means you can do away with the linked lists entirely, thus saving memory.
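
Again as a hedged illustration (this is not the actual CompactHashSet code, and it omits removal and rehashing), the core of closed hashing with linear probing looks roughly like this:

class ProbingSet {
  // Elements are stored directly in the table; no per-entry node objects.
  private Object[] table = new Object[16];

  public boolean add(Object value) {
    int ix = (value.hashCode() & 0x7FFFFFFF) % table.length;
    // Run further down the table until we find the value or a free
    // bucket; a real implementation rehashes well before the table
    // fills up, so this loop terminates.
    while (table[ix] != null && !table[ix].equals(value))
      ix = (ix + 1) % table.length;
    if (table[ix] != null)
      return false; // already present
    table[ix] = value;
    return true;
  }

  public boolean contains(Object value) {
    int ix = (value.hashCode() & 0x7FFFFFFF) % table.length;
    while (table[ix] != null) {
      if (table[ix].equals(value))
        return true;
      ix = (ix + 1) % table.length;
    }
    return false;
  }
}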

So how much faster and more compact is CompactHashSet? I put together a little unscientific test and ran it three times for each of the two implementations. The test first adds to the set 1 million times, then does 1 million lookups, then traverses the set, and finally attempts 1 million removals. (See below for the test code.) These are the results, with times in milliseconds and memory in bytes:

CLASS          TIME   MEMORY 
HashSet        4477   44644624
HashSet        4447   44651984
HashSet        4500   44632824
CompactHashSet 2416   22886464
CompactHashSet 2351   22889408
CompactHashSet 2370   22895872

In other words, in our little test, CompactHashSet is nearly twice as fast and uses about half as much memory compared to HashSet. When you win on both speed and memory use there isn’t much left, really…

Except, of course, reliability. Initially, there were some problems with the CompactHashSet class. In some cases, the run down the hash table to find a free bucket could get into an infinite loop without ever finding one. That has since been solved. And, many years ago, there was a memory leak in it, causing deleted objects to be kept when rehashing. This caused serious performance problems for some customers and took months to track down.
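
For the curious, here is a hedged sketch of how bugs of that general kind can arise; the names are illustrative, this is not the actual CompactHashSet code, and it shows only one plausible shape of the leak. In closed hashing, remove() cannot simply null out a bucket, since that would cut short the probe runs of other entries; the bucket is instead marked with a sentinel ("tombstone"), and it is the rehashing step that must remember to drop these:

class TombstoneSketch {
  private static final Object DELETED = new Object(); // tombstone sentinel
  private Object[] table = new Object[16];

  public void remove(Object value) {
    int ix = (value.hashCode() & 0x7FFFFFFF) % table.length;
    // Probe past tombstones. Note that if the table could ever fill
    // up completely, this loop would never terminate; the infinite
    // loop bug mentioned above was of that general kind.
    while (table[ix] != null && !value.equals(table[ix]))
      ix = (ix + 1) % table.length;
    if (table[ix] != null)
      table[ix] = DELETED; // keeps probe runs for other entries intact
  }

  public void rehash(int newSize) {
    Object[] old = table;
    table = new Object[newSize];
    for (int i = 0; i < old.length; i++) {
      // The crucial step: tombstones must be dropped here. A variant
      // that marks deleted buckets by retaining the object itself as
      // its own marker, and then copies everything non-null into the
      // new table, keeps deleted objects reachable forever.
      if (old[i] == null || old[i] == DELETED)
        continue;
      int ix = (old[i].hashCode() & 0x7FFFFFFF) % table.length;
      while (table[ix] != null)
        ix = (ix + 1) % table.length;
      table[ix] = old[i];
    }
  }
}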

EDIT: Discussion on reddit.com shows that many people misunderstood the above. The infinite loop bug was found during initial testing. The memory leak was found once customers started deploying the code for real, which may have been about a year after it was implemented. This was in 2003. Since then we have found exactly zero bugs in the code. END EDIT

By now, however, we have an extensive test suite for the class, and it’s been in use unchanged for many years with no problems. Using CompactHashSet should be entirely safe.

The test code

If you want to try the test yourself, here is the test code:

import java.util.Set;
import java.util.HashSet;
import java.util.Iterator;
import net.ontopia.utils.CompactHashSet;

public class TestHashSet {
  private static final int TIMES = 1000000;
  private static final int MAX = 5000000;
  
  public static void main(String[] argv) {
    // first, get the JIT going
    test(false, new CompactHashSet());
    test(false, new HashSet());

    // then, do real timings
    for (int ix = 0; ix < 3; ix++)
      test(true, new HashSet());
    for (int ix = 0; ix < 3; ix++)
      test(true, new CompactHashSet());
  }

  public static void test(boolean output, Set set) {
    long start = System.currentTimeMillis();

    if (output) {
      System.gc(); System.gc();
    }
    long before = Runtime.getRuntime().totalMemory() -
      Runtime.getRuntime().freeMemory();
    
    // add
    for (int ix = 0; ix < TIMES; ix++)
      set.add(new Long(Math.round(Math.random() * MAX)));

    if (output) {
      System.gc(); System.gc();
      long after = Runtime.getRuntime().totalMemory() -
        Runtime.getRuntime().freeMemory();
      System.out.println("Memory before: " + before);
      System.out.println("Memory after: " + after);
      System.out.println("Memory usage: " + (after - before));
    }
    
    // lookup
    int count = 0;
    for (int ix = 0; ix < TIMES; ix++) {
      Long number = new Long(Math.round(Math.random() * MAX));
      if (set.contains(number))
        count++;
    }

    // iterate
    Iterator it = set.iterator();
    while (it.hasNext()) {
      Long number = (Long) it.next();
    }

    // remove
    for (int ix = 0; ix < TIMES; ix++) {
      Long number = new Long(Math.round(Math.random() * MAX));
      set.remove(number);
    }

    if (output)
      System.out.println("TIME: " + (System.currentTimeMillis() - start));
  }
}

11 thoughts on “A faster and more compact Set”

  1. Interesting. I tested this with gnu.trove.THashSet, and it is 17% slower and uses 25% less memory (on average, on my MacBook Pro). Using it would trade speed for memory. 🙂

  2. FYI, javolution.util.FastSet is both slower and uses more memory than the other set implementations, so it is not so fast after all.

  3. Never mind my first comment about trove4j. It uses more memory *and* is slower than CompactHashSet. Not sure why I came to the original conclusion.

    Anyway, the CompactHashSet rules… 🙂

  4. Hi, just out of curiosity, what exactly caused the two bugs you mentioned? Diffs would be ideal, but I’m just wondering what kind of bugs people make in what’s essentially just a (very) good implementation of a textbook hash table.

  5. I tried looking at the CVS logs for these, but they don’t actually show any of these problems, which is a bit confusing.

    There were some minor issues, like the offset overflowing in some cases, and a divide by zero if the set was created with zero size. Also, it was possible for the nullObject (see code) to leak out through the toArray() methods.

    That’s pretty much it, I think.

  6. I tried my own little benchmark for a real-world scenario I have and here are the results for everyone’s information:

    My scenario was based on a real-world need that I have in my analytical application. I do searches on large vectors (1MM items and up), and the cases where I do matches against sets usually involve short strings that repeat many times (this is a financial markets app, so there are many transactions per symbol). I took sample vectors from a real data set and ran several iterations of searches over a million-item vector to get average timings (there was a GC before each test, outside the timing calculation). I matched against 5 symbols and 50 symbols. The vector has 500 distinct symbols in it (each repeating on average 2000 times).

    I tested the JDK HashSet, this CompactHashSet and Google’s ImmutableSet.

    The tests were conducted on a PC running Windows 7 with 12GB RAM and JDK 1.6.0_18. Both Windows and the JDK are 64-bit.

    The results had HashSet winning hands down for this unscientific scenario (timings in msec, averaged; the last two rows show times as a percentage of HashSet’s):

                5 symbols  50 symbols    sum
    hashset         19.88       22.34  42.22
    compact         32.54       48.28  80.82
    immutable       21.63       31.54  53.17
    compact          164%        216%   191%
    immutable        109%        141%   126%

    I didn’t have time for a whole suite of tests, but it could very well be that the performance would vary significantly due to:
    * the distribution of values
    * the data type (here String)
    * the specific JVM used
    * the platform

    My conclusion: if you are on the hunt for the best performing set implementation there probably isn’t an implementation today that has the “best performance” for all scenarios uniformly – you should test for your specific scenario and find out what works for you.

  7. Your code looks awesomely fast and has a very low footprint (I’m looking for memory optimisation in my app).

    Compared to a HashSet of Strings, it uses 2–3 times less memory!

    Your link is dead, and I think you should create a tiny project on GitHub or something, because it would be useful to many other people, and this piece of code deserves more than a 404 Not Found.
