Ontopia 5.0.0beta1 released

The Ontopia project is proud to announce the release of its first open source beta version with Ontopia 5.0.0b1. This is still a beta, and so we encourage users to try it out and report problems to us, either on the mailing list or in the issue tracker.

The main change in this release is of course that Ontopia is now open source, and so the need for license keys has disappeared. And since everything is free there is now just one distribution, and not six as before.

In addition to bug fixes and minor changes, these are the main new features in this version:

TopicMapsLab joins the Ontopia project

Hannes Niederhausen of the TopicMapsLab in Leipzig became the second external committer to the Ontopia project, and has now written a TMAPI 2.0 implementation for the Ontopia project. The TopicMapsLab has announced that it intends, through Hannes, to continue work on the project, and also to use Ontopia for its own infrastructure.

New reordering optimizer

The most important optimizer in the tolog implementation is the one which reorders the predicates before a query is run, in order to ensure the optimal execution order. (Warning: Existing OKS users, you should read this posting (especially the last part, under “Consequences”), because it may cause difficulties for you when upgrading.)

What is this optimization?

The canonical example of such an optimization is this query, which finds opera composers inspired by Shakespeare:

composed-by($OPERA : work, $COMPOSER : composer),
based-on($OPERA : result, $WORK : source),
written-by($WORK : work, shakespeare : writer)?

The query is naturally written starting with the composer, moving to the opera, then from the opera to the work it’s based on, and finally from the work to Shakespeare. Unfortunately, this order is suboptimal. Executing the query this way produces first 150 opera/composer combinations, then from that 81 opera/composer/work combinations, and finally the correct 4 combinations by filtering on Shakespeare. This takes 52 milliseconds on my machine.

The reordering optimizer rewrites the query before it’s run to this:

written-by($WORK : work, shakespeare : writer),
based-on($OPERA : result, $WORK : source),
composed-by($OPERA : work, $COMPOSER : composer)?

This produces the correct 4 works in the first predicate, then uses the remaining two to fill in the missing values. Execution time is 0 milliseconds on my machine.

The question is: how does the optimizer decide which predicate to start with? Originally we used a class called SimpleCostEstimator, which implemented a very simple heuristic: cost depends on the number of unbound variables in the predicate. This worked beautifully in the Shakespeare query, because the written-by clause has only one unbound variable, so it’s the best place to start. After that, based-on has one variable bound and one open, so based-on goes next, and so on.
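
The heuristic can be sketched in a few lines of Java. This is purely illustrative and is not Ontopia’s actual SimpleCostEstimator API; it just shows the core idea of counting unbound variables:

```java
import java.util.List;
import java.util.Set;

class SimpleCostSketch {

  // Sketch of the old heuristic: the cost of a predicate is the number
  // of its arguments that are still unbound variables at this point in
  // the execution order. Arguments starting with '$' are variables;
  // anything else is a literal or topic reference (always "bound").
  static int cost(List<String> arguments, Set<String> bound) {
    int cost = 0;
    for (String arg : arguments)
      if (arg.startsWith("$") && !bound.contains(arg))
        cost++; // one more unbound variable means a higher cost
    return cost;
  }

  public static void main(String[] args) {
    Set<String> bound = Set.of(); // nothing bound yet
    // written-by($WORK : work, shakespeare : writer) -> cost 1
    System.out.println(cost(List.of("$WORK", "shakespeare"), bound));
    // composed-by($OPERA : work, $COMPOSER : composer) -> cost 2
    System.out.println(cost(List.of("$OPERA", "$COMPOSER"), bound));
  }
}
```

With nothing bound, written-by costs 1 and composed-by costs 2, so the optimizer starts with written-by, exactly as described above.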

Unfortunately, this soon turned out not to be enough, and so the SimpleCostEstimator quickly grew more complex. In fact, today it’s pretty hard to understand, and while it works, it certainly cannot be said to work well. Ideally, what it needs is more detailed knowledge of each specific predicate, which is what lines 56-65 have started gathering. It should be clear from a glance at the code that this approach is never going to scale.

A better solution is to have the estimator set up a scale of costs, and then ask the predicates to rank themselves on that scale. This is what the PredicateDrivenCostEstimator does. In this system, dynamic association predicates (which are what is used in the Shakespeare query) rank themselves based on the number of open variables and bound variables or literals that they get as parameters. The resulting ordering is mostly the same as before, but starting values are weighted a little better, so the Shakespeare query comes out the same with the new optimizer.

The query in issue 11 shows the difference between the two quite clearly:

select $T from 
  $T=@T25720, 
  bk:vises-nytt-vindu($T : bk:vises-vindu)?

The old estimator gives the = clause a cost of 10 for the unbound $T variable, and a cost of 1 for the literal, 11 altogether. The bk:vises-nytt-vindu clause has only a single variable, and so gets 10. So the optimizer chooses to start there, which is dumb: this way we first make a list of all topics which have that unary association, then remove the ones which do not have the ID @T25720.

The new estimator asks the predicates. bk:vises-nytt-vindu sees that it has one open parameter and no bound ones, and so estimates a BIG_RESULT. The = predicate sees that it has one open and one bound parameter, which means that it will produce a single result row, and so returns SINGLE_RESULT. It’s now obvious what to do.
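
A rough Java sketch of this self-ranking idea follows. The constant names echo the ones mentioned above (BIG_RESULT, SINGLE_RESULT), but everything else, including the method names and the numeric values, is made up for illustration and does not reflect Ontopia’s actual implementation:

```java
class PredicateDrivenSketch {
  // A shared scale of estimated result sizes; the actual values only
  // matter relative to each other.
  static final int SINGLE_RESULT = 1;    // exactly one row
  static final int BIG_RESULT    = 1000; // scans many topics

  // The '=' predicate: with one side bound it produces a single row.
  static int equalsCost(boolean leftBound, boolean rightBound) {
    if (leftBound || rightBound)
      return SINGLE_RESULT;
    return BIG_RESULT; // both sides open: would enumerate everything
  }

  // A unary dynamic association predicate like bk:vises-nytt-vindu:
  // with its only variable open, it must scan all topics carrying
  // that association; with it bound, it is just a cheap filter.
  static int unaryAssociationCost(boolean argBound) {
    return argBound ? SINGLE_RESULT : BIG_RESULT;
  }

  public static void main(String[] args) {
    // $T = @T25720: one open, one bound -> SINGLE_RESULT, so start here
    System.out.println(equalsCost(false, true));
    // bk:vises-nytt-vindu($T : bk:vises-vindu), $T open -> BIG_RESULT
    System.out.println(unaryAssociationCost(false));
  }
}
```

The estimator just picks the predicate reporting the smallest result, so here it starts with the = clause, as the new optimizer does.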

Consequences

What the consequences are is hard to say. As far as we can see, the new estimator always produces better results, but since it’s new and much less tested than the old, it’s very likely that it will make bad choices in some cases. So users upgrading to the new open source release may find either that some queries crash, or that some run much slower than before. (Most likely they’ll also find that some queries run much faster than before.)

We cannot really say before it’s been tried, because this is very complex stuff. What we can say is that we’ve had this estimator for a couple of years, and when tested next to the old one it always seems to perform as well as or better than the old one. The test suite also passes with the new estimator, so it’s looking good.

This is the motivation for issue 13, which, if implemented, would allow users to set a single property in a single property file to tell tolog to stick to the old estimator. We will try to get this implemented before we release.
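
If issue 13 is implemented, the switch might look something like the following in a tolog property file. The property name here is entirely made up for illustration; the real name would be decided as part of implementing the issue:

```
# Hypothetical setting to fall back to the old SimpleCostEstimator
net.ontopia.tolog.new-estimator = false
```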

Source code in Subversion

We have now started loading the Ontopia source code into Subversion over at the Ontopia Google Code project. So far, what we have is the Java source code for the product plus the tests. We also have a guide to building the code.

This means that you can now browse the code and build the ontopia.jar and the ontopia-tests.jar files.

We are still working on getting the code for the web applications and so on into Subversion. We also need to finish the build scripts to make it possible to build a complete distribution.

Moving the source code

Today Geir Ove and I had our first meeting where we decided how to approach the uploading of the code and how to organize the source code in the new project. So new code should start to appear in Google Code pretty soon. (I’ll post when it does.)

However, a major part of this work is deciding what will not go up, because over the past decade we have accumulated mountains of cruft. We are making use of this move to throw away lots of stuff that is not needed, and which we will be better off without.

This is the lines of code count on the old source tree:

http://cloc.sourceforge.net v 1.07  T=144.0 s (30.0 files/s, 3714.8 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Java               2438     61492     54827    214443 x   1.36 =      291642.48
XML                 393      4506      6238     57678 x   1.90 =      109588.20
JSP                 809      6231      3181     38039 x   1.48 =       56297.72
Python              181      7785      4619     26197 x   4.20 =      110027.40
HTML                319      2011       358     25274 x   1.90 =       48020.60
CSS                  53      1707       463      6981 x   1.00 =        6981.00
C#                    5       685       568      3095 x   1.36 =        4209.20
Javascript           21       325       361      2610 x   1.48 =        3862.80
DTD                  14       379       450      1143 x   1.90 =        2171.70
DOS Batch            22       113       195       697 x   0.63 =         439.11
Bourne Shell         21       114       170       632 x   3.81 =        2407.92
SQL                   7       145        80       531 x   2.29 =        1215.99
make                 27        92        56       150 x   2.50 =         375.00
Lisp                  2        47        88        84 x   1.25 =         105.00
XSLT                  1        12         1        33 x   1.90 =          62.70
ASP.Net               4         8         4        27 x   1.29 =          34.83
-------------------------------------------------------------------------------
SUM:               4317     85652     71659    377614 x   1.69 =      637441.65
-------------------------------------------------------------------------------

One surprise here is the C# code. We have a Topic Maps engine written in C# by Graham Moore, that he brought with him when he joined Ontopia back in 2003. It doesn’t belong in the Ontopia project, so if anyone wants it, please let us know, and we’ll give it to you for nothing.

We also have some old Python code, as you can see, including the tmproc Topic Maps engine, an early tolog implementation based on it, plus the old Autogen framework. If anyone wants any of these, please let us know.

Email blackout

We have been getting lots of emails with requests for access to the source code, free licenses, and slides and other Ontopia documentation. We’re very happy to see all this interest in the product, but unfortunately replying to all these requests takes time. We have to consider the request, in some cases talk to management, find the material requested, and so on.

At the same time we are really eager to get the open sourcing done, and are spending as much time as we can on setting up the infrastructure and getting the code ready; at the moment that work is all that remains. The more time we can dedicate to it, the faster everything will become available for everyone.

Therefore, we have decided not to reply to such requests before we have released as much of our material as we can, since doing so would delay the release for everyone else.

So if we are not replying to your request now you know why. Sorry.