New reordering optimizer

The most important optimizer in the tolog implementation is the one which reorders the predicates before a query is run, in order to ensure the optimal execution order. (Warning: Existing OKS users, you should read this posting (especially the last part, under “Consequences”), because it may cause difficulties for you when upgrading.)

What is this optimization?

The canonical example of such an optimization is this query, which finds opera composers inspired by Shakespeare:

composed-by($OPERA : work, $COMPOSER : composer),
based-on($OPERA : result, $WORK : source),
written-by($WORK : work, shakespeare : writer)?

The query is naturally written starting with the composer, moving to the opera, then from the opera to the work it’s based on, and finally from the work to Shakespeare. Unfortunately, this order is suboptimal. Executing the query this way produces first 150 opera/composer combinations, then from that 81 opera/composer/work combinations, and finally the correct 4 combinations by filtering on Shakespeare. This takes 52 milliseconds on my machine.

The reordering optimizer rewrites the query before it’s run to this:

written-by($WORK : work, shakespeare : writer),
based-on($OPERA : result, $WORK : source),
composed-by($OPERA : work, $COMPOSER : composer)?

This produces the correct 4 works in the first predicate, then uses the remaining two to fill in the missing values. Execution time is 0 milliseconds on my machine.

The question is: how does the optimizer decide which predicate to start with? Originally we used a class called SimpleCostEstimator, which implemented a very simple heuristic: cost depends on the number of unbound variables in the predicate. This worked beautifully in the Shakespare query, because the written-by clause has only one unbound varible, so it’s the best place to start. After that, based-on has one variable bound and one open, so based-on goes next, and so on.

Unfortunately, this soon turned out not to be enough, and so the SimpleCostEstimator quickly grew more complex. In fact, today it’s pretty hard to understand, and while it works, it certainly cannot be said to work well. Ideally, what it needs is more detailed knowledge of each specific predicate, which is what lines 56-65 have started gathering. It should be clear from a glance at the code that this approach is never going to scale.

A better solution is to have the estimator set up a scale of costs, and then ask the predicates to rank themselves on that scale. This is what the PredicateDrivenCostEstimator does. In this system, dynamic association predicates (which are what is used in the Shakespeare query) rank themselves based on the number of open variables and bound variables or literals that they get as parameters. The system is mostly the same as before, but it emphasises starting values a bit better. So the Shakespeare query comes out the same with the new optimizer.

The query in issue 11 shows the difference between the two quite clearly:

select $T from 
  bk:vises-nytt-vindu($T : bk:vises-vindu)?

The old estimator gives the = clause a cost of 10 for the unbound $T variable, and a cost of 1 for the literal, altogether 11. The bk:vises-nytt-vindu has only a single varible, and so gets 10. So the optimizer chooses to start there. Which is dumb, because this way we first make a list of all topics which have that unary association, then remove the ones which do not have the ID @T25720.

The new estimator asks the predicates. bk:vises-nytt-vindu sees that it has one open parameter and no bound ones, and so estimates a BIG_RESULT. The = predicate sees that it has one open and one bound parameter, which means that it will produce a single result row, and so returns SINGLE_RESULT. It’s now obvious what to do.


What the consequences are is hard to say. As far as we can see, the new estimator always produces better results, but since it’s new and much less tested than the old, it’s very likely that it will make bad choices in some cases. So users upgrading to the new open source release may find either that some queries crash, or that some run much slower than before. (Most likely they’ll also find that some queries run much faster than before.)

We cannot really say before it’s been tried, because this is very complex stuff. What we can say is that we’ve had this estimator for a couple of years, and when tested next to the old one it always seems to perform as well as or better than the old one. The test suite also passes with the new estimator, so it’s looking good.

This is the motivation for issue 13, which, if implemented, would allow users to set a single property in a single property file to tell tolog to stick to the old estimator. We will try to get this implemented before we release.