Looking at some of the new code that is being added to Ontopia I thought it might be useful to point out that as part of Ontopia we have a class called CompactHashSet, which is both faster and more compact than java.util.HashSet. So when you use Sets in your code, it might be worth your while to use the CompactHashSet instead of HashSet.
The in-memory engine uses Sets throughout for all the collections in the core API. There are a lot of these: the set of topics, the set of names for each topic, the set of item identifiers for each topic, … Ontopia originally used a lot of memory, and we identified HashSet as the source of a lot of this, and CompactHashSet was written in order to reduce memory usage somewhat. An interesting side-effect was that it also turned out to be faster.
HashSet uses open hashing, which means that each bucket in the hash table refers to a linked list of entries whose hash codes place them in the same bucket. This means that for each entry in the set an extra linked list element object must be allocated, which of course requires extra memory.
CompactHashSet, by contrast, uses closed hashing. This is a strategy where if the bucket you want to place a new entry in is already occupied, you run further down the hash table (in a predictable way!) looking for free buckets. This means you can do away with the linked list, thus saving memory.
So how much faster and more compact is CompactHashSet? I put together a little unscientific test and ran it three times for each of the implementations. The test first adds to the set 1 million times, then does 1 million lookups, then traverses the set, then removes 1 million objects. (See below for the test code.) This is the results:
CLASS TIME MEMORY
HashSet 4477 44644624
HashSet 4447 44651984
HashSet 4500 44632824
CompactHashSet 2416 22886464
CompactHashSet 2351 22889408
CompactHashSet 2370 22895872
In other words, in our little test, CompactHashSet is nearly twice as fast and uses about half as much memory compared to HashSet. When you win on both speed and memory use there isn’t much left, really…
Except, of course, reliability. Initially, there were some problems with the CompactHashSet class. In some cases, the run down the hashtable to find free buckets could get into an infinite loop without ever finding a bucket. That’s now been solved. And, many years ago, there was a memory leak in it, causing deleted objects to be kept when rehashing. This caused serious performance problems for some customers and took months to track down.
EDIT: Discussion on reddit.com shows that many people misunderstood the above. The infinite loop bug was found during initial testing. The memory leak was found once customers started deploying the code for real, which may have been about a year after it was implemented. This was in 2003. Since then we have found exactly zero bugs in the code. END EDIT
By now, however, we have an extensive test suite for the class, and it’s been in use unchanged for many years with no problems. Using CompactHashSet should be entirely safe.
The test code
If you want to try the test yourself, here is the test code:
import java.util.Set;
import java.util.HashSet;
import java.util.Iterator;
import net.ontopia.utils.CompactHashSet;
public class TestHashSet {
private static final int TIMES = 1000000;
private static final int MAX = 5000000;
public static void main(String[] argv) {
// first, get the JIT going
test(false, new CompactHashSet());
test(false, new HashSet());
// then, do real timings
for (int ix = 0; ix < 3; ix++)
test(true, new HashSet());
for (int ix = 0; ix < 3; ix++)
test(true, new CompactHashSet());
}
public static void test(boolean output, Set set) {
long start = System.currentTimeMillis();
if (output) {
System.gc(); System.gc();
}
long before = Runtime.getRuntime().totalMemory() -
Runtime.getRuntime().freeMemory();
// add
for (int ix = 0; ix < TIMES; ix++)
set.add(new Long(Math.round(Math.random() * MAX)));
if (output) {
System.gc(); System.gc();
long after = Runtime.getRuntime().totalMemory() -
Runtime.getRuntime().freeMemory();
System.out.println("Memory before: " + before);
System.out.println("Memory after: " + after);
System.out.println("Memory usage: " + (after - before));
}
// lookup
int count = 0;
for (int ix = 0; ix < TIMES; ix++) {
Long number = new Long(Math.round(Math.random() * MAX));
if (set.contains(number))
count++;
}
// iterate
Iterator it = set.iterator();
while (it.hasNext()) {
Long number = (Long) it.next();
}
// remove
for (int ix = 0; ix < TIMES; ix++) {
Long number = new Long(Math.round(Math.random() * MAX));
set.remove(number);
}
if (output)
System.out.println("TIME: " + (System.currentTimeMillis() - start));
}
}