The tolog optimizations in Ontopia

At TMRA 2010 it struck me that we’d never really documented or described the optimizations we do on tolog queries, so I thought it would be useful to explain them. I did this in a 5-minute presentation in the open space session; the slides are embedded below.

An Ontopia/Joomla integration

At TMRA 2010 a team of Czech researchers at the University of Economics in Prague presented their integration of Ontopia with the Joomla! CMS. Joomla is written in PHP, so they used the TMRAP web service to invoke tolog queries on the Ontopia server, and then present the results with XSLT. They have a UI for managing this, and even a framework for building user-friendly tolog query builders.

For more information, see the presentation page, which has the full paper and the slides. (Later, it will also have a video of the presentation.)

Building Web Applications with Ontopia, part 3

The Christmas break gave Trond Pettersen enough time to complete part 3 in his series about Building Web Applications with Ontopia. This part shows how to write the actual JSP pages that make up the web application, which means the series now takes you all the way through building an entire web application with Ontopia and Topic Maps.

For reference, these are the parts completed so far:

  1. Installation and requirements
  2. Creating the database
  3. Writing the JSPs

Triggers

While presenting tolog updates at TMRA 2009 I casually referred to the possibility of using tolog updates to create a trigger mechanism in Ontopia similar to triggers in relational databases. After the presentation the idea grew on me, and so I decided to elaborate a little further on it and present it in the open space session. The result was this little presentation, which gives a brief outline of the idea.

tolog updates

Unlike SQL, tolog has so far only supported queries, but that has now changed. Lars Marius Garshol has extended the tolog implementation in Ontopia with support for tolog updates, which add DELETE, UPDATE, INSERT, and MERGE statements to the familiar SELECT.

The update statements are described in more detail in the TMRA 2009 paper. The slides may be a bit more approachable, but have less detail, obviously. The idea for the update language is actually quite old. Here’s a blog post about an early design from 2006.

Anyway, if you want to try it, you need to get the latest source code from Subversion. In Omnigator you’ll find the “Generic Query” plugin; if you put in an update statement and click the “Update” button, it will run the statement for you. One way to try it is to navigate to the city Munich in the Italian Opera topic map in Omnigator, then open the generic query plugin in a new tab. In that tab, run the following statement:

update value($TN, "München")
from topic-name(munich, $TN)

Now go back to the original tab, and press reload. You’ll see the name of the city change. Magic!
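
The same update can also be run programmatically. Below is a minimal Java sketch of what that looks like; it assumes the topic map is already open, and the update() method on the query processor (and its int return value) is my assumption about the API, so check the documentation for the exact signature.

import net.ontopia.topicmaps.core.TopicMapIF;
import net.ontopia.topicmaps.query.core.QueryProcessorIF;
import net.ontopia.topicmaps.query.utils.QueryUtils;

public class RenameMunich {
  // Runs the same update statement as in the Omnigator example above.
  public static void run(TopicMapIF topicmap) throws Exception {
    QueryProcessorIF processor = QueryUtils.getQueryProcessor(topicmap);
    // update() is assumed to return the number of modified values
    int changed = processor.update(
      "update value($TN, \"München\") from topic-name(munich, $TN)");
    System.out.println("Values changed: " + changed);
  }
}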

Using Ontopia in PHP

Jan Schreiber has written an excellent blog post on accessing Ontopia from PHP, which shows how you can use the Java TMAPI implementation in PHP or alternatively run a tolog query from PHP.

As Jan pointed out in private conversation, there is probably a fair bit of overhead in calling Java classes from PHP in this way, so running tolog queries is probably the most efficient way to do this, since then all the heavy lifting happens inside the JVM.
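
For reference, the tolog route amounts to very little code on the Java side. The sketch below shows roughly the calls a PHP-to-Java bridge would end up making; it assumes the topic map is already loaded, and uses the composer type from the opera topic map purely as an example query.

import net.ontopia.topicmaps.core.TopicMapIF;
import net.ontopia.topicmaps.query.core.QueryProcessorIF;
import net.ontopia.topicmaps.query.core.QueryResultIF;
import net.ontopia.topicmaps.query.utils.QueryUtils;

public class TologFromAnywhere {
  // Runs a tolog query and prints the values in the COMPOSER column.
  public static void listComposers(TopicMapIF topicmap) throws Exception {
    QueryProcessorIF processor = QueryUtils.getQueryProcessor(topicmap);
    QueryResultIF result = processor.execute("instance-of($COMPOSER, composer)?");
    try {
      while (result.next())
        System.out.println(result.getValue("COMPOSER"));
    } finally {
      result.close();
    }
  }
}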

Performance problems at NRK/Skole

In the presentation of the NRK/Skole case at the August 27 meeting, the customer spent some time on the performance problems they had been experiencing recently, and there was quite a bit of discussion of what the possible causes might be. Given that the whole site is based on Ontopia, much of the suspicion naturally centered on the Ontopia software and possible performance problems with it.

The initial report from the customer was that the site was “unstable”, meaning that it would sometimes not respond or be very slow, and on occasion it would have to be restarted. There was just one server with Ontopia deployed on it, but traffic on the site could only be described as modest (it’s not officially in use by schools yet), so a single server should easily have been able to handle the load.

The customer wanted to switch to Ontopia 5.0.0 from OKS 4.1 in order to be able to use two load-balanced servers in a cluster (OKS 4.1 is priced per CPU, so it was too expensive to do it with this version). Unfortunately, testing showed that Ontopia 5.0.0 performed even worse than 4.1. It was at this point that the customer asked us to investigate and solve the performance problems.

Investigation

Our first step was to use TologSpy, a query profiler for tolog queries, to see if any of the queries in the application were slow. We quickly determined that the new optimizer made one of the queries very slow because it, well, de-optimized the query quite considerably. However, this was a problem with Ontopia 5.0.0 only, and couldn’t be the cause of the problems with OKS 4.1. Further, inserting a comment with a pragma to the optimizer should solve the problem.

Further testing showed that, apart from this, the application seemed perfectly fine. We tested it with JMeter scripts running 10 client threads in parallel without being able to provoke any real problems. TologSpy showed that the bad query was still an issue (even with the pragma), but otherwise everything looked healthy.

So we changed tactics. Instead of using JMeter to pound some randomly selected URLs, we wrote a Python script to play back portions of the access log from the production site. Initially this gave the same results, but then the customer noticed that two sets of URLs were “hanging” and not giving responses, so further investigation naturally focused on these.

Problem #1

It turned out that for some clips the metadata on the left would never be shown. Instead, the server would use all available memory and 100% CPU until the thread died with an out-of-memory error. Further digging showed that some of the metadata fields were being parsed with an Antlr-based parser written specifically for this project (that is, not part of Ontopia at all) in order to have the metadata nicely formatted. Running the parser from the command line, without Ontopia or a web server, we found that some string values would make it consume 100% CPU until it hit the ceiling of 700 MB RAM, and then crash.

Studying the Antlr-based parser showed that its lexer had a token definition which could match empty strings. If an illegal character (that is, one that couldn’t match any token) appeared, the lexer would take the only option open to it and generate an infinite list of empty string tokens in front of that character.

This, of course, was the problem. Any time a user asked for one of these clips, the server would use all available CPU and memory, dumping most of the objects out of the Ontopia object cache. Typically, the user would reload when the server didn’t respond the first time, triggering the same behavior all over again. Other pages would be slow during this time, especially as all the caches had to be refilled, and the server would appear frozen or sluggish for a while before things calmed down again.

Fixing this required changing just a few lines in the .g file in the project code.

Problem #2

The second problem was the de-optimized tolog query. Setting the pragma option to turn off the new optimizer did not, for some reason, solve the problem. Running the query in the tolog plug-in in Omnigator worked fine, even if the query was a little slow. When run from the project’s Java code, however, the query would never complete.

It took a while to work out what was causing this, but in the end we realized that the inference rules were being handled differently in Omnigator and in the Java code. The Omnigator plugin passes all rule declarations to the query processor together with the query, as a single string, whereas the Java code parsed the rule declarations into a DeclarationContextIF object, which was then passed in alongside the query.

Further study showed that the code which optimizes the inference rules was not being run when the rules were parsed into a context object. It was, however, run when the rules were declared as part of the query string. Once this observation was made, the fix was quite simple.
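
To make the distinction concrete, here is a rough sketch of the two ways of passing rule declarations, using a made-up is-composer rule. The first form corresponds to what the Omnigator plugin does, the second to what the project code did. The parseDeclarations() helper and the two-argument execute() are my recollection of the API, so verify the exact method names against the Javadoc.

import net.ontopia.topicmaps.core.TopicMapIF;
import net.ontopia.topicmaps.query.core.DeclarationContextIF;
import net.ontopia.topicmaps.query.core.QueryProcessorIF;
import net.ontopia.topicmaps.query.core.QueryResultIF;
import net.ontopia.topicmaps.query.utils.QueryUtils;

public class RuleDeclarationExample {
  // A made-up inference rule; assumes a topic with the id 'composer',
  // as in the opera topic map.
  static final String RULES = "is-composer($C) :- instance-of($C, composer). ";
  static final String QUERY = "is-composer($WHO)?";

  public static void bothWays(TopicMapIF topicmap) throws Exception {
    QueryProcessorIF processor = QueryUtils.getQueryProcessor(topicmap);

    // 1) Omnigator-style: rules and query passed as a single string.
    QueryResultIF r1 = processor.execute(RULES + QUERY);
    r1.close();

    // 2) Project-style: rules parsed into a declaration context first,
    //    then passed in alongside the query.
    DeclarationContextIF context = QueryUtils.parseDeclarations(topicmap, RULES);
    QueryResultIF r2 = processor.execute(QUERY, context);
    r2.close();
  }
}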

Conclusion

In other words, a bug in the project code (in its Antlr-based parser) was the cause of the performance and stability issues, while a fairly obscure bug in the tolog parser was the reason why 5.0.0 performed even worse than the old version. Now that these two issues are out of the way, the site is behaving much better, and the customer will be able to change over to a load-balanced environment with two clustered Ontopia servers.

tolog.properties

Ontopia 5.0.0 enabled a new tolog optimizer by default, a move we knew risked causing difficulties for customers. One customer (NRK/Skole) has reported that they suspect this change caused problems for their application following the upgrade.

It is possible to control which optimizers are turned on and off for a query by including a comment in a special syntax. Unfortunately, for NRK to turn off the new optimizer they would need to add that OPTION string to every single query in their application, since in 5.0.0 there is no way to set the option globally.
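
To illustrate, this is roughly what a query with the option comment looks like when run from Java; the comment has to travel with every query string, which is what makes the per-query approach so painful. The option name optimizer.reorder is written from memory here and should be checked against the tolog documentation.

import net.ontopia.topicmaps.core.TopicMapIF;
import net.ontopia.topicmaps.query.core.QueryProcessorIF;
import net.ontopia.topicmaps.query.core.QueryResultIF;
import net.ontopia.topicmaps.query.utils.QueryUtils;

public class NoReorderingExample {
  public static QueryResultIF run(TopicMapIF topicmap) throws Exception {
    // The option comment disables the reordering optimizer for this one
    // query only; "optimizer.reorder" is assumed, not verified.
    String query =
      "/* #OPTION: optimizer.reorder = false */ " +
      "instance-of($T, composer)?";
    return QueryUtils.getQueryProcessor(topicmap).execute(query);
  }
}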

To avoid this, revision 450 implements a file called tolog.properties. This is a normal Java properties file, loaded from the classpath. In the new version, options in the query take precedence over options in tolog.properties, which in turn take precedence over the defaults.

This means you can now control properties globally for your application, should you wish to. If there is no tolog.properties everything will work as before.

New reordering optimizer

The most important optimizer in the tolog implementation is the one which reorders the predicates before a query is run, in order to get the best possible execution order. (Warning: existing OKS users should read this posting, especially the last part under “Consequences”, because the change may cause difficulties when upgrading.)

What is this optimization?

The canonical example of such an optimization is this query, which finds opera composers inspired by Shakespeare:

composed-by($OPERA : work, $COMPOSER : composer),
based-on($OPERA : result, $WORK : source),
written-by($WORK : work, shakespeare : writer)?

The query is naturally written starting with the composer, moving to the opera, then from the opera to the work it’s based on, and finally from the work to Shakespeare. Unfortunately, this order is suboptimal. Executing the query in this order first produces 150 opera/composer combinations, then from those 81 opera/composer/work combinations, and finally the correct 4 combinations by filtering on Shakespeare. This takes 52 milliseconds on my machine.

The reordering optimizer rewrites the query before it’s run to this:

written-by($WORK : work, shakespeare : writer),
based-on($OPERA : result, $WORK : source),
composed-by($OPERA : work, $COMPOSER : composer)?

This produces the correct 4 works in the first predicate, then uses the remaining two predicates to fill in the missing values. Execution time is 0 milliseconds on my machine.

The question is: how does the optimizer decide which predicate to start with? Originally we used a class called SimpleCostEstimator, which implemented a very simple heuristic: cost depends on the number of unbound variables in the predicate. This worked beautifully in the Shakespeare query, because the written-by clause has only one unbound variable, so it’s the best place to start. After that, based-on has one variable bound and one open, so based-on goes next, and so on.

Unfortunately, this soon turned out not to be enough, and so the SimpleCostEstimator quickly grew more complex. In fact, today it’s pretty hard to understand, and while it works, it certainly cannot be said to work well. Ideally, what it needs is more detailed knowledge of each specific predicate, which is what lines 56-65 of that class have started gathering. It should be clear from a glance at the code that this approach is never going to scale.

A better solution is to have the estimator set up a scale of costs, and then ask the predicates to rank themselves on that scale. This is what the PredicateDrivenCostEstimator does. In this system, dynamic association predicates (which are the kind used in the Shakespeare query) rank themselves based on the number of open variables and bound variables or literals they get as parameters. The system is mostly the same as before, but it emphasizes starting values a bit better, so the Shakespeare query comes out the same with the new optimizer.

The query in issue 11 shows the difference between the two quite clearly:

select $T from 
  $T=@T25720, 
  bk:vises-nytt-vindu($T : bk:vises-vindu)?

The old estimator gives the = clause a cost of 10 for the unbound $T variable, and a cost of 1 for the literal, 11 altogether. The bk:vises-nytt-vindu clause has only a single variable, and so gets 10, so the optimizer chooses to start there. Which is dumb, because this way we first make a list of all topics which have that unary association, then remove the ones which do not have the ID @T25720.

The new estimator asks the predicates. bk:vises-nytt-vindu sees that it has one open parameter and no bound ones, and so estimates a BIG_RESULT. The = predicate sees that it has one open and one bound parameter, which means that it will produce a single result row, and so returns SINGLE_RESULT. It’s now obvious what to do.
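
The shape of the approach can be sketched roughly as follows. This is illustrative pseudo-API only, not the actual Ontopia classes, but it captures the division of labor: the estimator owns the cost scale, and each predicate places itself on that scale based on which of its parameters are bound.

// Illustrative sketch only; not the real Ontopia interfaces.
interface SelfRankingPredicate {
  // Cost scale owned by the estimator, roughly from cheapest to costliest.
  int FILTER_RESULT = 1;   // only removes rows, produces none
  int SINGLE_RESULT = 2;   // produces exactly one row
  int SMALL_RESULT  = 10;
  int BIG_RESULT    = 100; // produces many rows

  // Each predicate ranks itself, given which parameters are already bound.
  int estimateCost(boolean[] boundParams);
}

class EqualsPredicate implements SelfRankingPredicate {
  public int estimateCost(boolean[] boundParams) {
    // One side bound (a literal or bound variable), the other open:
    // exactly one result row.
    if (boundParams[0] != boundParams[1])
      return SINGLE_RESULT;
    // Both sides bound: the predicate only filters existing rows.
    if (boundParams[0] && boundParams[1])
      return FILTER_RESULT;
    return BIG_RESULT; // both open: could match almost anything
  }
}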

Consequences

What the consequences are is hard to say. As far as we can see, the new estimator always produces better results, but since it’s new and much less tested than the old, it’s very likely that it will make bad choices in some cases. So users upgrading to the new open source release may find either that some queries crash, or that some run much slower than before. (Most likely they’ll also find that some queries run much faster than before.)

We cannot really say before it’s been tried, because this is very complex stuff. What we can say is that we’ve had this estimator for a couple of years, and when tested alongside the old one it always seems to perform at least as well. The test suite also passes with the new estimator, so it’s looking good.

This is the motivation for issue 13, which, if implemented, would allow users to set a single property in a single property file to tell tolog to stick to the old estimator. We will try to get this implemented before we release.