Lucee 5.3 hangs on Linux and has connection failures on Windows under concurrency

Description

I just spent most of the weekend trying configurations, heap dumps, Java profiling, different operating systems, different versions of Java, and different versions of Tomcat, as well as updating the system and verifying things. I even learned how to use the YourKit Java profiler and jmap to create dumps of the hanging Tomcat server. Very confusing stuff.

I'm pretty sure there is a problem with a missing synchronized block somewhere in Lucee 5.3. This is like a needle in a haystack for me, because nothing useful for debugging is logged during the hang. If there were a way to modify Lucee to make this easier to debug, that would help find future concurrency issues. Because the logs are so useless on this particular problem, I can't give you a helpful stack trace. I can give you one that makes it look like Tomcat is the problem, because that is the only thing that gets logged in catalina.out. Someone complained about something like this in 2015, and no one helped them.

Exception in thread "http-nio-10888-exec-4" java.lang.IllegalMonitorStateException
at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:151)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:103)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:31)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
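
For context on the exception itself: `ReentrantLock.unlock()` throws `IllegalMonitorStateException` when the calling thread does not hold the lock. The sketch below is not taken from Lucee or Tomcat; it only reproduces that rule in isolation, and the class and method names are made up for illustration. Since `LinkedBlockingQueue.take()` acquires its lock before releasing it, seeing the exception raised there is at least unusual and suggests the worker thread was disturbed by something outside the queue code, though the trace alone doesn't show what.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class, not part of Lucee or Tomcat.
public class UnlockWithoutOwnership {

    /** Returns true if unlock() from a thread that does not own the lock is rejected. */
    static boolean unlockFromOtherThreadFails() {
        ReentrantLock lock = new ReentrantLock();
        AtomicBoolean rejected = new AtomicBoolean(false);

        lock.lock(); // the calling thread is now the owner
        try {
            Thread other = new Thread(() -> {
                try {
                    lock.unlock(); // this thread never acquired the lock
                } catch (IllegalMonitorStateException e) {
                    rejected.set(true);
                }
            });
            other.start();
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock(); // legal: released by the owner
        }
        return rejected.get();
    }

    public static void main(String[] args) {
        System.out.println("non-owner unlock rejected: " + unlockFromOtherThreadFails());
    }
}
```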

In the process of testing, I've learned several new technologies, and learned that this bug does not behave the same in every environment. For example, on Windows, Lucee 5.3 can't consistently survive a load test. It fails randomly with connection reset errors. However, Lucee 5.2 can sustain massive amounts of concurrency and request volume.

On Ubuntu 18.04 (or perhaps any Linux?), Lucee 5.3 hangs Tomcat 8.5, Tomcat 9, and WildFly 100% of the time when you put it under concurrent load.

I stripped out my entire application and all of my configuration, and I could still hang Lucee 5.3 with CFML that does pretty close to nothing.

On Lucee 5.2, that same Linux system can handle massive concurrency and request volume.

My production server still uses Lucee 5.2, which seems to be a necessity considering the number of problems there have been with the Lucee 5.3 admin and with stability.

I'm sorry I didn't have time to report this sooner, but I've been dealing with Lucee 5.3 crashes for 1 to 2 months now, on pretty much any snapshot or RC release you make available.

Lucee 5.3 is also completely unreliable when I make a custom build of the Lucee 5.3 branch from GitHub.

While I can make Lucee 5.3 fail under high concurrency, the more disturbing thing is that it will often fail randomly, even when I'm only doing 2 or 3 requests during regular development. Something is very unsafe about Lucee 5.3.

I wish I knew a way to further isolate a concurrency flaw in Java/Lucee, but even in CFML, concurrency problems usually show up as strange side effects instead of announcing themselves. I think concurrency in the CFML engine is broken in the current Lucee 5.3 branch, and the breakage seems to sit pretty close to the servlet container layer. It works fine on the CFML side, and it works fine if you do only 1 thing at a time, or a modest amount of work, but lately it has crashed constantly during my regular work, so I'm giving up on Lucee 5.3 for now. I won't be able to report further issues with this series until I can verify that the admin works and Lucee 5.3 is stable under load, since it isn't safe for regular work right now.

At least Windows doesn't crash the Tomcat instance, but I guess Linux or this servlet technology doesn't know how to gracefully recover from some kinds of epoll wait/notify bugs. I would hope that a system like this would give you tools and methods to restart these hanging threads. This is the worst part about using Lucee or Java: it just gets stuck managing the threads somehow, and you can't do anything but restart it. I've seen it happen in production with Lucee 5.2 a few times, where nothing was wrong and it just gave up serving requests. It happens constantly on Lucee 5.3, though. I wonder if it is possible to build a technology that could log or restart these stuck threads by always reserving 1 extra thread for administration, giving us an interface or automatic methods for killing and restarting the connection queues to minimize downtime and keep the Lucee cache warm.
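
The logging half of that idea can be approximated from inside the JVM with standard APIs. What follows is a minimal sketch, not an existing Lucee or Tomcat feature: a helper built on `java.lang.management.ThreadMXBean` that counts deadlocked threads and dumps every live thread's stack. The class name `ThreadDumper` and its method names are hypothetical, and it does not restart anything. It also only helps if some thread in the JVM can still run it, for example one reserved for administration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of Lucee or Tomcat.
public class ThreadDumper {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Number of threads the JVM currently reports as deadlocked. */
    static int deadlockedThreadCount() {
        // findDeadlockedThreads() also covers java.util.concurrent locks,
        // but is only available when synchronizer monitoring is supported.
        long[] ids = THREADS.isSynchronizerUsageSupported()
                ? THREADS.findDeadlockedThreads()
                : THREADS.findMonitorDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock was found
    }

    /** Builds a plain-text dump: one header line per thread, then its stack frames. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        // false/false: skip locked-monitor details so the call works on any JVM.
        for (ThreadInfo info : THREADS.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
        System.out.print(dumpAllThreads());
    }
}
```

A hang caused by threads waiting forever, rather than a true lock cycle, would not be counted as a deadlock here, so the per-thread states in the dump are the more useful output in that case.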

I'm trying to provide as much information as I can. I worked hard on verifying this problem to make sure it isn't specific to one configuration.
I've set up 6 different environments and configurations of Lucee:
Custom Lucee 5.3 build on Tomcat 8.5 on Windows and Linux - both unstable.
Official Lucee 5.3 RC build on Tomcat 8.5 and Tomcat 9 on Linux - both unstable.
Multiple newer Lucee 5.3 snapshots on Tomcat 8.5 / Tomcat 9 on both Windows and Linux - unstable.
In addition, I tried both Java 8 and Java 11 in several of those configurations, and the same problem exists in all of them.

The same Windows and Linux machines can handle massive concurrency and never fail on Lucee 5.2. The Lucee 5.2 series of releases was of amazing quality, but Lucee 5.3 has been full of problems for 6 to 12 months. I also wrote a custom Java HTTP web server with AIO, which never fails under concurrency and is faster than Tomcat. Tomcat and WildFly also function correctly when I make static requests to them. Once I engage the Lucee 5.3 engine under a brief load test, the parent process becomes unresponsive within a few seconds on Linux.
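
The report doesn't include the exact load test that was run, so the following is only a rough sketch of what a brief concurrent load test can look like in plain Java. The class `MiniLoadTest`, its method names, the default URL, and the request counts are all placeholders, not values from this issue. It fires a fixed number of GET requests from a thread pool and counts how many come back with HTTP 200.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical load-test sketch; names and defaults are placeholders.
public class MiniLoadTest {

    /** Sends `total` GET requests from `concurrency` threads; returns how many got HTTP 200. */
    static int run(String url, int concurrency, int total) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            Callable<Boolean> task = () -> fetchOk(url);
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                results.add(pool.submit(task));
            }
            int ok = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get(30, TimeUnit.SECONDS)) {
                        ok++;
                    }
                } catch (Exception e) {
                    // A timeout or failed task counts as a failed request.
                }
            }
            return ok;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Performs one GET and reports whether the server answered 200 within the timeouts. */
    static boolean fetchOk(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) {
                    // Drain the body so the connection can be reused.
                }
            }
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false; // connection reset, timeout, or non-2xx response
        }
    }

    public static void main(String[] args) {
        // Placeholder URL: point this at the CFML page under test.
        String url = args.length > 0 ? args[0] : "http://localhost:8888/index.cfm";
        int total = 1000;
        int ok = run(url, 50, total);
        System.out.println(ok + " of " + total + " requests succeeded");
    }
}
```

Requests that hit a hung server show up as read timeouts and are counted as failures, so a sudden drop in the success count is the signal to look for.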

Environment

Pretty much any combination of Java, Java application server, and operating system has the issue.

Activity

Andrew Kretzer 16 November 2018 at 15:36

I can confirm that this fixes my issue here:
https://luceeserver.atlassian.net/projects/LDEV/issues/LDEV-2072

Bruce Kirkpatrick 16 November 2018 at 14:04

I tested it on Windows and Linux. I'm not able to hang Lucee on the new snapshot with ApacheBench anymore. Thanks!

Pothys - MitrahSoft 16 November 2018 at 10:57

I can't reproduce the issue in versions 5.3.1.91 and 5.3.2.14.

Can you please test with the above versions?

Michael Offner 16 November 2018 at 07:37

In 5.3.1 we will simply disable async execution. In 5.3.2 we will rewrite the implementation to get to the bottom of the issue.

Fixed

Created 11 November 2018 at 22:21
Updated 8 May 2020 at 19:25
Resolved 16 November 2018 at 10:58