Performance problems at NRK/Skole

In the presentation of the NRK/Skole case at the August 27 meeting the customer spent some time on the performance problems they had been experiencing recently, and there was quite a bit of discussion of what the possible causes might be. Given that the whole site is based on Ontopia, much of the suspicion obviously centered on the Ontopia software and possible performance problems with it.

The initial report we had from the customer was that the site was “unstable”, meaning that it would sometimes not respond or be very slow. On occasion it would also have to be restarted. The site ran on a single server with Ontopia deployed on it, but traffic could only be described as modest (the site is not yet officially in use by schools), so one server should easily have been able to handle the load.

The customer wanted to switch to Ontopia 5.0.0 from OKS 4.1 in order to be able to use two load-balanced servers in a cluster (OKS 4.1 is priced per CPU, so it was too expensive to do it with this version). Unfortunately, testing showed that Ontopia 5.0.0 performed even worse than 4.1. It was at this point that the customer asked us to investigate and solve the performance problems.

Investigation

Our first step was to use TologSpy, a query profiler for tolog queries, to see if any of the queries in the application were slow. We quickly determined that the new optimizer made one of the queries very slow because it, well, de-optimized the query quite considerably. However, this was a problem with Ontopia 5.0.0 only, and couldn’t be the cause of the problems with OKS 4.1. Further, inserting a comment with a pragma to turn off the new optimizer should solve the problem.

Further testing showed that other than this the application seemed perfectly fine. We tested it with JMeter scripts running 10 client threads in parallel without being able to provoke any real problems. TologSpy showed that the bad query was still an issue (even with the pragma), but other than that everything seemed just fine.

So we changed tactics. Instead of using JMeter to pound some randomly selected URIs we decided to write a Python script to play back portions of the access log from the production site. This gave the same results, initially. Then the customer noticed that two sets of URLs were “hanging” and not giving responses. So further investigation naturally focused on these.
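The post doesn’t show the replay script itself, but a minimal sketch of that approach might look like the following. The log format (Apache/nginx combined) and all names here are assumptions, not the actual project code:

```python
import re
import urllib.request

# Matches the request part of a combined-format access log line,
# e.g. "GET /some/path HTTP/1.1", and captures the path.
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

def parse_paths(log_lines):
    """Pull the request path out of each combined-format log line,
    skipping lines that don't contain a request."""
    paths = []
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m:
            paths.append(m.group(1))
    return paths

def replay(paths, base_url):
    """Request each path in order, recording the status code or the
    error. Hanging URLs surface as timeouts instead of stalling the
    whole run."""
    results = []
    for path in paths:
        try:
            with urllib.request.urlopen(base_url + path, timeout=30) as resp:
                results.append((path, resp.status))
        except Exception as exc:
            results.append((path, repr(exc)))
    return results
```

Replaying real traffic in its original order is what surfaced the hanging URLs here; randomly chosen URIs under JMeter had missed them entirely.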

Problem #1

It turned out that for some clips the metadata on the left would never be shown. Instead, the server would use all available memory and 100% CPU until the thread died with an out-of-memory error. Further digging showed that some of the metadata fields were being parsed with an Antlr-based parser written specifically for this project (i.e., not part of Ontopia at all) in order to have the metadata nicely formatted. Running the parser from the command line, without Ontopia or a web server, we found that some string values would make it consume 100% CPU until it hit the ceiling of 700 MB RAM, and then crash.

Studying the Antlr-based parser showed that its lexer had a token definition which could match empty strings. If an illegal character (that is, one that couldn’t match any token) appeared, the lexer would take the only option open to it and generate an infinite list of empty string tokens in front of that character.
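To see why a zero-width token rule is fatal, here is a toy hand-rolled lexer in Python. This is purely illustrative (the real parser was ANTLR-generated and its grammar isn’t shown in the post): the WS rule below can match the empty string, so an illegal character makes the lexer emit empty tokens forever without consuming input.

```python
import re

# Toy token definitions. The WS rule uses '*' instead of '+', so it can
# match zero characters -- the same class of bug described above.
# (Invented for illustration; not the project's ANTLR grammar.)
TOKEN_RULES = [
    ("WORD", re.compile(r"[a-z]+")),
    ("WS", re.compile(r" *")),  # bug: a zero-width match is possible
]

def tokenize(text, max_tokens=100):
    """First-match lexer. At an illegal character every real rule fails,
    the zero-width WS rule 'succeeds' without consuming any input, and
    the loop emits empty WS tokens forever. The max_tokens cap is added
    here only so the demo terminates; the real parser had no such cap
    and consumed all available memory instead."""
    pos = 0
    tokens = []
    while pos < len(text) and len(tokens) < max_tokens:
        for name, rule in TOKEN_RULES:
            m = rule.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()  # a zero-width match leaves pos unchanged
                break
        else:
            # unreachable here: the buggy WS rule matches at every position
            raise SyntaxError("illegal character at %d" % pos)
    return tokens
```

Without the cap the token list just grows until memory runs out, which matches the observed symptoms: 100% CPU, climbing memory use, and finally an out-of-memory crash.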

This, of course, was the problem. Any time a user asked for one of these clips the server would use all available CPU and all memory, dumping out most of the objects in the Ontopia object cache. Typically, the user would reload when the server didn’t respond the first time, thus setting off the same thing again. Other pages would of course be slow during this time, especially as all caches would have to be filled up again, and the server would appear frozen/sluggish for a while before things calmed down again.

Fixing this required just changing a few lines in the .g file in the project code.

Problem #2

The second problem was the de-optimized tolog query. For some reason, setting the pragma option to turn off the new optimizer did not solve the problem. Running the query in the tolog plug-in in Omnigator worked fine, even if the query was a little slow. Run from the project Java code, however, the query would never complete.

It took a while to work out what was causing this, but in the end we realized that the inference rules were being handled differently in Omnigator than in the Java code. The Omnigator plug-in passes all rule declarations to the query processor together with the query as a single string, but the Java code was parsing the rule declarations into a DeclarationContextIF object, which was then passed in together with the query.

Further study showed that the code which optimizes inference rules was not being run when the rules were parsed into a context object. It was run, however, when the rules were declared as part of the query itself. Once this observation was made, the fix was quite simple.

Conclusion

A bug in the project code (in the Antlr-based parser) was, in other words, the cause of the performance and stability issues, while a fairly obscure bug in tolog’s rule handling was the reason 5.0.0 performed even worse than the old version. Now that these two issues are out of the way the site is behaving much better, and the customer will be able to change over to a load-balanced environment with two clustered Ontopia servers.

5 thoughts on “Performance problems at NRK/Skole”

  1. Cool detective story. Too few developers get to debug their own code after it’s been running in production for a while. Hope it’s not my code which started this witch hunt 🙂

  2. It may be a sensitive subject, but just out of curiosity (and to compare with my own experiences debugging performance problems), can you estimate the time from the start of the investigations to their successful conclusions, in person-hours and/or calendar hours/days? Once the observation is made, the fix can be simple. But in my experience, making the observation can be anything but simple.

    By way of comparison: a database I was responsible for once had performance issues that were affecting not only the users of the database but other users of the server. If we had done the right thing from the beginning, with no false steps, the two of us who worked on it should have solved the problem in half a day; in reality, it took a full calendar day for us to make a partial solution, and several more days to do the job properly.

    If you got the analysis described here done in less than a week, I think I take my hat off to you.

  3. @Stig: As far as I can tell it was not your code. I think someone else wrote the Antlr parser. And the rule optimization problem was caused by code I wrote in the product… 🙂

    @Michael: In calendar time I spent 2 weeks on this (so you can leave your hat on :)). Of course, the customer had a person on it, and he’s the one who’s worked on this the longest, and I can’t really estimate how long he spent. Much of the delay here is due to the problem occurring on a customer site and trying to work with it remotely, which of course leads to long delays. In actual hours I’ve reported about 13 hours.

    You are of course quite right that the biggest problem here was identifying what it was that was causing the problem. Identifying the tolog problem was easy, but spotting the Antlr-caused problem was quite difficult, since we didn’t have a profiler to help us with this. And hardest of all, in a sense, was establishing that these were the only problems. That last part is not 100% finished yet, but the customer is working on it.

    –Lars M.
