Debian 9.9, SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64 GNU/Linux, Oracle Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode), Tomcat 8.5.14-1+deb9u3
We upgraded our webservers from Lucee 5.2 to 5.3 (release) on Monday. The former has been rock solid for us, but we need a new feature in 5.3, so we decided it was time to take the plunge.
Apart from onSessionEnd throwing NPEs, which we don't consider critical, we didn't see any issues for the first couple of hours. Then one of the servers suddenly became unresponsive. As this happened after office hours, I didn't get a chance to take a closer look at the cause.
A second server OOMed yesterday at midday. The hprof was some 11 GB in size and I couldn't get it to parse in MAT. A third server then tanked yesterday evening; it didn't produce an OOM dump, it just went to an extremely high load and didn't seem to service any requests anymore. This time I got the chance to take a closer look at what happened.
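(In case anyone wants to try parsing a dump that size: MAT is an Eclipse application, and its own heap limit in MemoryAnalyzer.ini is what usually makes large hprofs fail to parse, so raising -Xmx there on a machine with enough RAM is worth a try. The 16g figure below is just an assumption for an 11 GB dump; the last lines of MemoryAnalyzer.ini would then look like this:)

```ini
-vmargs
-Xmx16g
```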
Whenever a server stalled, there were hardly any warning signs beforehand: no slowly increasing heap usage, no slowly climbing load. It seems to happen purely at random, out of the blue.
catalina.out shows lots and lots of these around the time of the event, though they show the result rather than the cause:
lucee/server/global.log just shows the onSessionEnd NPEs and, from time to time, the following, which I guess may actually be related to the problem:
In lucee/web/exception.log I see the issue starting with a couple of timeouts in the very same code, all of them timing out on a cached query. The database in question shows no issues at all; my guess is that there's some sort of deadlock in the cache-put operation for a cached query. The dump only shows the named locks, and that list naturally grows over time as more and more threads stall. This is the first log entry; the rest all show the same pattern:
None of the named locks in that list has any relation to the script that seems to stall here. I checked that part of the code and we're not using any locking in that specific area, so I'm fairly certain the named locks at the beginning are just the ordinary handful of pages being regenerated for memcached and have nothing to do with the RamCache problem for the query.
This is the query bit that's reported as hanging here:
A little later, other cached queries show the same issue, which again suggests a deadlock in the query caching under concurrency.
I have since updated the servers to a newer 5.3 build, but judging from the first errors I see in the logs, the issue has not gone away; other cached queries are affected too, the aforementioned script just happens to be one of the most requested. We haven't had a server stall so far, but I guess it's only a matter of time. So the newer build at least may be affected as well, though I can't say for sure until one of the machines actually goes down again. Here's a fresh one from the log, a short while after the upgrade:
The pattern in the stack traces is always the same, starting at org.apache.commons.collections4.map.AbstractHashedMap.getEntry(AbstractHashedMap.java:461) and ending at lucee.runtime.tag.Query.doEndTag(Query.java:537).
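That getEntry frame makes me suspect classic unsynchronized-map trouble: AbstractHashedMap, like java.util.HashMap, is not thread-safe, and concurrent puts and gets can corrupt its bucket chains so that a reader spins forever at full CPU, which would match a stall without an OOM. Purely as an illustration (hypothetical names, not Lucee's actual cache code), a get-or-compute cache that is safe under this kind of concurrency would look something like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a thread-safe query-result cache.
// ConcurrentHashMap.computeIfAbsent runs the loader at most once per key
// and never corrupts internal state under concurrent access, unlike a
// plain (Abstract)HashedMap shared between request threads.
public class QueryCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    public V getOrLoad(K key, Supplier<V> loader) {
        // Concurrent callers for the same key block briefly and then all
        // see the single computed value; callers for other keys proceed.
        return cache.computeIfAbsent(key, k -> loader.get());
    }
}
```

The point is only that computeIfAbsent gives per-key atomicity: the loader runs once, and no thread ever traverses half-updated buckets the way it can with an unsynchronized map.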
The datasources in question are PostgreSQL datasources defined in the server admin. My guess is that there has been some change to the query caching that makes it susceptible to lock congestion under concurrency.
I'll try the latest snapshot next, but as the machines in question are running in production, I'll probably have to downgrade back to 5.2 for the time being, though I'd like to wait for the next server stall before making any rash decisions.
The issue is very likely hard to reproduce: one of our servers ran for more than a day without stalling once, and that was unfortunately the one we're running FusionReactor on. We also track Tomcat and general server metrics with munin, and as I said, there really are no warning signs before it all goes south. I guess that as long as the request queue doesn't grow too much, everything is shiny; when it does, things go really bad, as per usual.