In the presentation of the NRK/Skole case at the August 27 meeting the customer spent some time on the performance problems that they have been experiencing recently, and there was quite a bit of discussion at the meeting of what the possible causes might be. Given that the whole site is based on Ontopia, much of the suspicion obviously centered on the Ontopia software and possible performance problems with it.
The initial report we had from the customer was that the site was “unstable”, meaning that it would sometimes not respond or be very slow. On occasion it would also have to be restarted. There was just one server with Ontopia deployed on it, but traffic on the site could only be described as modest (it’s not officially in use by schools yet), so it should easily be able to handle the traffic.
The customer wanted to switch to Ontopia 5.0.0 from OKS 4.1 in order to be able to use two load-balanced servers in a cluster (OKS 4.1 is priced per CPU, so it was too expensive to do it with this version). Unfortunately, testing showed that Ontopia 5.0.0 performed even worse than 4.1. It was at this point that the customer asked us to investigate and solve the performance problems.
Our first step was to use TologSpy, a query profiler for tolog queries, to see if any of the queries in the application were slow. We quickly determined that the new optimizer made one of the queries very slow because it, well, de-optimized the query quite considerably. However, this was a problem with Ontopia 5.0.0 only, and couldn’t be the cause of the problems with OKS 4.1. Further, inserting a comment with an pragma to the optimizer should solve the problem.
Further testing showed that other than this the application seemed perfectly fine. We tested it with JMeter scripts running 10 client threads in parallel without being able to provoke any real problems. TologSpy showed that the bad query was still an issue (even with the pragma), but other than that everything seemed just fine.
So we changed tactics. Instead of using JMeter to pound some randomly selected URIs we decided to write a Python script to play back portions of the access log from the production site. This gave the same results, initially. Then the customer noticed that two sets of URLs were “hanging” and not giving responses. So further investigation naturally focused on these.
It turned out that for some clips the metadata on the left would never be shown. Instead, the server would use all available memory and 100% CPU until the thread died with an out-of-memory error. Further digging showed that some of the metadata fields were being parsed with an Antlr-based parser written specifically for this project (ie: not part of Ontopia at all) in order to have the metadata nicely formatted. Running the parser from the command-line without Ontopia or a web server we found that some string values would make it consume 100% CPU until it hit the ceiling of 700 MB RAM, and then crash.
Studying the Antlr-based parser showed that its lexer had a token definition which could match empty strings. If an illegal character (that is, one that couldn’t match any token) appeared, the lexer would take the only option open to it and generate an infinite list of empty string tokens in front of that character.
This, of course, was the problem. Any time a user asked for one of these clips the server would use all available CPU and all memory, dumping out most of the objects in the Ontopia object cache. Typically, the user would reload when the server didn’t respond the first time, thus setting off the same thing again. Other pages would of course be slow during this time, especially as all caches would have to be filled up again, and the server would appear frozen/sluggish for a while before things calmed down again.
Fixing this required just changing a few lines in the .g file in the project code.
The second problem was the de-optimized tolog query. Setting the pragma option to turn off the new optimizer for some reason did not solve the problem. Running the query in the tolog plug-in in Omnigator worked fine, even if the query was a little slow. Running from the project Java code, however, the query would never complete.
It took a while to work out what was causing this, but in the end it was realized that the inference rules were running differently in Omnigator from in the Java code. The Omnigator plugin passes all rule declarations to the query processor together with the query as a single string, but the Java code was parsing the rule declarations into a DeclarationContextIF object, which was then passed in together with the query.
Further study showed that the code which optimizes the inference rule was not being run when rules were being parsed into a context object. They were, however, run when the rules were declared as part of the query. Once this observation was made, the fix was quite simple.
A bug in the project code (in the parser) was in other words the cause of the performance and stability issues. A fairly obscure bug in the tolog parser was the reason why 5.0.0 performed even worse than the old version. Now that these two issues are out of the way the site is behaving much better, and the customer will be able to change over to a load-balanced environment with two clustered Ontopia servers.