A faster and more compact Set

Looking at some of the new code that is being added to Ontopia I thought it might be useful to point out that as part of Ontopia we have a class called CompactHashSet, which is both faster and more compact than java.util.HashSet. So when you use Sets in your code, it might be worth your while to use the CompactHashSet instead of HashSet.

The in-memory engine uses Sets throughout for all the collections in the core API. There are a lot of these: the set of topics, the set of names for each topic, the set of item identifiers for each topic, … Ontopia originally used a lot of memory, and we identified HashSet as the source of a lot of this, and CompactHashSet was written in order to reduce memory usage somewhat. An interesting side-effect was that it also turned out to be faster.

HashSet uses open hashing, which means that each bucket in the hash table refers to a linked list of entries whose hash codes place them in the same bucket. This means that for each entry in the set an extra linked list element object must be allocated, which of course requires extra memory.
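To make that cost concrete, here is a minimal sketch of chained hashing. This is illustrative only, not HashSet's actual internals (resizing is left out for brevity) — the point is the extra Node object allocated per element:

```java
// Minimal sketch of open hashing (chaining). Illustrative only;
// not java.util.HashSet's actual implementation. Every add()
// allocates an extra Node object besides the element itself.
class ChainedSet {
  static class Node {
    Object element;
    Node next; // link to the next entry in the same bucket
    Node(Object element, Node next) {
      this.element = element;
      this.next = next;
    }
  }

  private final Node[] buckets = new Node[16]; // no resizing, for brevity

  boolean contains(Object o) {
    int ix = (o.hashCode() & 0x7FFFFFFF) % buckets.length;
    for (Node n = buckets[ix]; n != null; n = n.next)
      if (n.element.equals(o))
        return true;
    return false;
  }

  boolean add(Object o) {
    if (contains(o))
      return false;
    int ix = (o.hashCode() & 0x7FFFFFFF) % buckets.length;
    buckets[ix] = new Node(o, buckets[ix]); // one extra object per element
    return true;
  }
}
```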

CompactHashSet, by contrast, uses closed hashing. This is a strategy where if the bucket you want to place a new entry in is already occupied, you run further down the hash table (in a predictable way!) looking for free buckets. This means you can do away with the linked list, thus saving memory.
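The same idea sketched in code, assuming simple linear probing (CompactHashSet's actual probe sequence may differ): collisions are resolved by stepping to the next bucket, so no per-entry node objects are needed. Removal is left out of the sketch; supporting it requires marking deleted buckets, which is exactly the tricky territory behind the bugs discussed further down.

```java
// Minimal sketch of closed hashing (open addressing) with linear
// probing. Illustrative only; not the actual CompactHashSet code.
class ProbingSet {
  private Object[] table = new Object[16];
  private int size = 0;

  boolean add(Object o) {
    if (size * 2 >= table.length)
      rehash(); // keep the load factor at or below 50%
    int ix = findIndex(o);
    if (table[ix] != null)
      return false; // already present
    table[ix] = o;
    size++;
    return true;
  }

  boolean contains(Object o) {
    return table[findIndex(o)] != null;
  }

  // Runs down the table (in a predictable way!) until it finds
  // either the object itself or a free bucket. No linked list.
  private int findIndex(Object o) {
    int ix = (o.hashCode() & 0x7FFFFFFF) % table.length;
    while (table[ix] != null && !table[ix].equals(o))
      ix = (ix + 1) % table.length;
    return ix;
  }

  private void rehash() {
    Object[] old = table;
    table = new Object[old.length * 2];
    size = 0;
    for (Object o : old)
      if (o != null)
        add(o);
  }
}
```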

So how much faster and more compact is CompactHashSet? I put together a little unscientific test and ran it three times for each of the implementations. The test first adds to the set 1 million times, then does 1 million lookups, then traverses the set, then removes 1 million objects. (See below for the test code.) These are the results:

CLASS           TIME (ms)   MEMORY (bytes)
HashSet         4477        44644624
HashSet         4447        44651984
HashSet         4500        44632824
CompactHashSet  2416        22886464
CompactHashSet  2351        22889408
CompactHashSet  2370        22895872

In other words, in our little test, CompactHashSet is nearly twice as fast and uses about half as much memory compared to HashSet. When you win on both speed and memory use there isn’t much left, really…

Except, of course, reliability. Initially, there were some problems with the CompactHashSet class. In some cases, the run down the hash table to find free buckets could get into an infinite loop without ever finding a bucket. That's now been solved. And, many years ago, there was a memory leak: deleted objects were kept alive when rehashing. This caused serious performance problems for some customers and took months to track down.

EDIT: Discussion on reddit.com shows that many people misunderstood the above. The infinite loop bug was found during initial testing. The memory leak was found once customers started deploying the code for real, which may have been about a year after it was implemented. This was in 2003. Since then we have found exactly zero bugs in the code. END EDIT

By now, however, we have an extensive test suite for the class, and it’s been in use unchanged for many years with no problems. Using CompactHashSet should be entirely safe.

The test code

If you want to try the test yourself, here is the test code:

import java.util.Set;
import java.util.HashSet;
import java.util.Iterator;
import net.ontopia.utils.CompactHashSet;

public class TestHashSet {
  private static final int TIMES = 1000000;
  private static final int MAX = 5000000;
  
  public static void main(String[] argv) {
    // first, get the JIT going
    test(false, new CompactHashSet());
    test(false, new HashSet());

    // then, do real timings
    for (int ix = 0; ix < 3; ix++)
      test(true, new HashSet());
    for (int ix = 0; ix < 3; ix++)
      test(true, new CompactHashSet());
  }

  public static void test(boolean output, Set set) {
    long start = System.currentTimeMillis();

    if (output) {
      System.gc(); System.gc();
    }
    long before = Runtime.getRuntime().totalMemory() -
      Runtime.getRuntime().freeMemory();
    
    // add
    for (int ix = 0; ix < TIMES; ix++)
      set.add(new Long(Math.round(Math.random() * MAX)));

    if (output) {
      System.gc(); System.gc();
      long after = Runtime.getRuntime().totalMemory() -
        Runtime.getRuntime().freeMemory();
      System.out.println("Memory before: " + before);
      System.out.println("Memory after: " + after);
      System.out.println("Memory usage: " + (after - before));
    }
    
    // lookup
    int count = 0;
    for (int ix = 0; ix < TIMES; ix++) {
      Long number = new Long(Math.round(Math.random() * MAX));
      if (set.contains(number))
        count++;
    }

    // iterate
    Iterator it = set.iterator();
    while (it.hasNext()) {
      Long number = (Long) it.next();
    }

    // remove
    for (int ix = 0; ix < TIMES; ix++) {
      Long number = new Long(Math.round(Math.random() * MAX));
      set.remove(number);
    }

    if (output)
      System.out.println("TIME: " + (System.currentTimeMillis() - start));
  }
}

New reordering optimizer

The most important optimizer in the tolog implementation is the one which reorders the predicates before a query is run, so that they execute in the most efficient order. (Warning: Existing OKS users, you should read this posting (especially the last part, under “Consequences”), because it may cause difficulties for you when upgrading.)

What is this optimization?

The canonical example of such an optimization is this query, which finds opera composers inspired by Shakespeare:

composed-by($OPERA : work, $COMPOSER : composer),
based-on($OPERA : result, $WORK : source),
written-by($WORK : work, shakespeare : writer)?

The query is naturally written starting with the composer, moving to the opera, then from the opera to the work it’s based on, and finally from the work to Shakespeare. Unfortunately, this order is suboptimal. Executing the query this way produces first 150 opera/composer combinations, then from that 81 opera/composer/work combinations, and finally the correct 4 combinations by filtering on Shakespeare. This takes 52 milliseconds on my machine.

The reordering optimizer rewrites the query before it’s run to this:

written-by($WORK : work, shakespeare : writer),
based-on($OPERA : result, $WORK : source),
composed-by($OPERA : work, $COMPOSER : composer)?

This produces the correct 4 works in the first predicate, then uses the remaining two to fill in the missing values. Execution time is 0 milliseconds on my machine.
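One plausible, and much simplified, way to sketch the reordering itself: greedily pick the cheapest remaining predicate given which variables are bound so far, then mark its variables as bound and repeat. The names and the greedy scheme here are illustrative assumptions, not the actual tolog optimizer code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of predicate reordering: repeatedly pick the cheapest
// remaining predicate, given the variables bound so far. Not the
// actual tolog optimizer; a greedy illustration of the idea.
class ReorderSketch {
  interface CostFn {
    int cost(String[] predicate, Set<String> bound);
  }

  static List<String[]> reorder(List<String[]> predicates, CostFn estimator) {
    List<String[]> ordered = new ArrayList<>();
    List<String[]> remaining = new ArrayList<>(predicates);
    Set<String> bound = new HashSet<>();
    while (!remaining.isEmpty()) {
      String[] best = null;
      int bestCost = Integer.MAX_VALUE;
      for (String[] pred : remaining) {
        int cost = estimator.cost(pred, bound);
        if (cost < bestCost) {
          bestCost = cost;
          best = pred;
        }
      }
      remaining.remove(best);
      ordered.add(best);
      for (String arg : best)
        if (arg.startsWith("$"))
          bound.add(arg); // this predicate binds its variables
    }
    return ordered;
  }
}
```

With a cost function that simply counts unbound variables, the Shakespeare query above comes out in exactly the rewritten order: written-by first, then based-on, then composed-by.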

The question is: how does the optimizer decide which predicate to start with? Originally we used a class called SimpleCostEstimator, which implemented a very simple heuristic: cost depends on the number of unbound variables in the predicate. This worked beautifully in the Shakespeare query, because the written-by clause has only one unbound variable, so it’s the best place to start. After that, based-on has one variable bound and one open, so based-on goes next, and so on.
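A hypothetical reconstruction of that heuristic, using the weights quoted further down in the issue 11 example (10 per unbound variable, 1 per literal or bound value); not the actual SimpleCostEstimator code:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the original heuristic: the cost of a
// predicate is driven by how many of its arguments are still
// unbound variables. Not the actual SimpleCostEstimator code.
class SimpleCostSketch {
  // arguments: the predicate's arguments; bound: variables
  // already bound by earlier predicates in the ordering
  static int cost(List<String> arguments, Set<String> bound) {
    int cost = 0;
    for (String arg : arguments) {
      if (arg.startsWith("$") && !bound.contains(arg))
        cost += 10; // unbound variable: expensive
      else
        cost += 1;  // literal or already-bound variable: cheap
    }
    return cost;
  }
}
```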

Unfortunately, this soon turned out not to be enough, and so the SimpleCostEstimator quickly grew more complex. In fact, today it’s pretty hard to understand, and while it works, it certainly cannot be said to work well. Ideally, what it needs is more detailed knowledge of each specific predicate, which is what lines 56-65 have started gathering. It should be clear from a glance at the code that this approach is never going to scale.

A better solution is to have the estimator set up a scale of costs, and then ask the predicates to rank themselves on that scale. This is what the PredicateDrivenCostEstimator does. In this system, dynamic association predicates (which are what is used in the Shakespeare query) rank themselves based on the number of open variables and bound variables or literals that they get as parameters. The system is mostly the same as before, but it emphasises starting values a bit better. So the Shakespeare query comes out the same with the new optimizer.

The query in issue 11 shows the difference between the two quite clearly:

select $T from 
  $T=@T25720, 
  bk:vises-nytt-vindu($T : bk:vises-vindu)?

The old estimator gives the = clause a cost of 10 for the unbound $T variable, and a cost of 1 for the literal, altogether 11. The bk:vises-nytt-vindu predicate has only a single variable, and so gets 10. So the optimizer chooses to start there. Which is dumb, because this way we first make a list of all topics which have that unary association, then remove the ones which do not have the ID @T25720.

The new estimator asks the predicates. bk:vises-nytt-vindu sees that it has one open parameter and no bound ones, and so estimates a BIG_RESULT. The = predicate sees that it has one open and one bound parameter, which means that it will produce a single result row, and so returns SINGLE_RESULT. It’s now obvious what to do.
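The shape of that scheme, sketched in code. SINGLE_RESULT and BIG_RESULT appear above; the interface, the other constant, and the numeric values are illustrative assumptions, not the actual Ontopia API.

```java
// Hedged sketch of predicate-driven cost estimation: the
// estimator defines a scale, and each predicate ranks its own
// expected result size from the parameters it gets. Interface
// and constant values are illustrative, not Ontopia's API.
class CostScale {
  static final int SINGLE_RESULT = 1;
  static final int SMALL_RESULT  = 10;
  static final int BIG_RESULT    = 1000;
}

interface SelfRankingPredicate {
  // boundparams[i] is true if parameter i is already bound when
  // the predicate runs; returns a value from the scale above.
  int getCost(boolean[] boundparams);
}

// The '=' predicate: with one side bound, it pins the other side
// to a single value, so it produces a single result row.
class EqualsPredicate implements SelfRankingPredicate {
  public int getCost(boolean[] boundparams) {
    if (boundparams[0] || boundparams[1])
      return CostScale.SINGLE_RESULT;
    return CostScale.BIG_RESULT; // nothing bound: cannot enumerate cheaply
  }
}

// A unary association predicate like bk:vises-nytt-vindu: with
// its one parameter open, it produces every topic that has the
// association, i.e. a big result.
class UnaryAssociationPredicate implements SelfRankingPredicate {
  public int getCost(boolean[] boundparams) {
    return boundparams[0] ? CostScale.SMALL_RESULT : CostScale.BIG_RESULT;
  }
}
```

With these estimates, the issue 11 query starts with the = clause, binds $T, and only then tests the association — the behaviour described above.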

Consequences

What the consequences are is hard to say. As far as we can see, the new estimator always produces better results, but since it’s new and much less tested than the old, it’s very likely that it will make bad choices in some cases. So users upgrading to the new open source release may find either that some queries crash, or that some run much slower than before. (Most likely they’ll also find that some queries run much faster than before.)

We cannot really say before it’s been tried, because this is very complex stuff. What we can say is that we’ve had this estimator for a couple of years, and when tested next to the old one it always seems to perform as well as or better than the old one. The test suite also passes with the new estimator, so it’s looking good.

This is the motivation for issue 13, which, if implemented, would allow users to set a single property in a single property file to tell tolog to stick to the old estimator. We will try to get this implemented before we release.