CFHTTP occasionally causes request to hang

Description

Update: We haven't seen this happen even once during the last 10 days, so I suspect that it was due to very rare circumstances and might not be worth investigating any further at this point. I've updated the priority to Trivial; you may choose to dismiss this altogether for the time being.

Today I noticed an active request in FusionReactor that was running for more than 16 hours. The stack trace shows it hanging in a fairly ordinary cfhttp call:

<cfhttp url="#variables.strRequestUrl#" result="variables.result" getasbinary="yes" throwOnError="no" timeout="30"></cfhttp>

There is really nothing fancy going on in this job: Application.cfc is mostly empty, and judging from the stack trace nothing but that cfhttp code is involved. FusionReactor's profiler shows 100% of the processing time spent in that tag.

RequestTimeout in the Lucee Administrator is set to 50 seconds; this.applicationTimeout in the job's Application.cfc is set to 2 hours.
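For illustration, the relevant part of the job's Application.cfc looks roughly like this (a sketch only; the application name is a placeholder, and the 50-second request timeout itself lives in the Lucee Administrator, not in code):

component {
    // hypothetical application name, for illustration only
    this.name = "cfhttpJob";
    // application timeout of 2 hours, as mentioned above
    this.applicationTimeout = createTimeSpan(0, 2, 0, 0);
    // the 50-second request timeout is configured globally in the Lucee Administrator;
    // it could also be overridden per request via <cfsetting requesttimeout="50">,
    // but no such override exists in this code base
}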

The target URL of that CFHTTP call is a plain http:// URL, not even https.

A first attempt to kill the thread via FusionReactor was not successful; the message was

"We were unable to kill the request transaction. This might be due to unreleasable locks on the request thread" - though that might be a FusionReactor issue. Transit Locks shows
"Transaction Kill Not Attempted: Transit did not attempt to kill this thread because it owns synchronizers which would not automatically be released.
Lock 0: java.util.concurrent.ThreadPoolExecutor$Worker@6182a26"

Forcing the death of the thread after that didn't work either - the result says "The Thread has been stopped.", but the request is still showing up as Running in FusionReactor with a "Manual Abort Failed" label tacked onto the status code.

I have attached the stacktrace and a screenshot of the profiler from FusionReactor from before I attempted to kill the request.

P.S.: "Occasionally" is perhaps an overstatement - we've only encountered this once so far, but we've only had this in production for some two weeks.

PPS: It's still happening occasionally, i.e. once or twice a week at most (with about 6-8 million web requests served each day). We run these CFHTTP calls to an Amazon webservice inside a named cflock that encloses a larger bit of code; as the primary request hangs, the lock is never released and concurrent requests waiting for that same lock end up in the lock timeout. I have a gut feeling that there might be some kind of race condition leading to a deadlock, but I really cannot make heads or tails of the stack trace alone.
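A simplified sketch of that pattern (not our actual code; lock name, timeout and URL variable are placeholders):

<cflock name="amazonServiceLock" type="exclusive" timeout="60">
    <!--- ... larger block of code ... --->
    <cfhttp url="#variables.strAmazonUrl#" method="get"
            result="variables.result" throwonerror="no" timeout="30"></cfhttp>
    <!--- ... more code; when the cfhttp above hangs, this lock is never released
          and every concurrent request waiting on "amazonServiceLock" runs into
          the lock timeout --->
</cflock>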

PPPS (2018-09-18): This still happens occasionally, and not just with the CFHTTP call to Amazon but with other CFHTTP calls as well - and it seems to be unrelated to any cflock in the code; rather, and that's just a guess, it looks like some sort of built-in locking is involved.

In this example we have a very simple cfhttp call inside a scheduled task; the Application.cfc for this app is empty and there is no cflock or anything else suspicious around this call:
<cfhttp timeout="5" method="get" url="http://#variables.strURL#/" result="request.loggedInOutputAll"></cfhttp>

This request has been running for over three days now. Profiling started 14-Sep-18 20:50:04.242 according to FusionReactor, and the State is "Finished" - though the request is still listed as active. I have attached another stack trace; it's a linear flow until lucee.runtime.tag.Http._doEndTag(Http.java), where there is a second branch that ends with the final Waiting on lock <0x6349b70d> (a java.lang.Object).

There is a second active request that started some five minutes before that one and also seems to hang after a CFHTTP call; the last bit of its stack trace is
100,0% - 60,169s lucee.runtime.tag.Http.doEndTag(Http.java)
100,0% - 60,169s lucee.runtime.tag.Http._doEndTag(Http.java)

  • 74,6% - 44,880s lucee.commons.io.IOUtil.toString(IOUtil.java)
    74,6% - 44,880s lucee.commons.io.IOUtil.toString(IOUtil.java)
    74,6% - 44,880s lucee.commons.io.IOUtil.copy(IOUtil.java)
    74,6% - 44,880s lucee.commons.io.IOUtil.copy(IOUtil.java)
    74,6% - 44,880s java.lang.Object.wait(Native Method)
    74,6% - 44,880s Waiting on lock <0x744bdd4b> (a java.lang.Object)

  • 25,4% - 15,289s java.lang.Object.wait(Native Method)
    25,4% - 15,289s Waiting on lock <0x37b9199d> (a lucee.runtime.tag.Http)

Again, I had to restart Tomcat to get rid of these processes.

Update 2019-09-24: Today we're presented with a wonderful opportunity for reproducing the problem described here, as our hoster seems to be under a DDoS attack, which currently causes some 30-90% packet loss in our connection to the outside world. In this scenario I would simply expect a cfhttp with a timeout attribute of 30 to die after more or less exactly 30 seconds. What I am seeing right now, though, are five fairly ancient requests, all of which have been hanging in cfhttp calls for more than two hours. The Request Timeout in the Lucee Admin is set to 50 seconds and there is no cfsetting override defined anywhere in the code, so there is no reason these requests should run for over a minute, let alone several hours. So as far as I can tell, CFHTTP may cause requests to enter a zombie state that can only be terminated by a Tomcat restart.
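For reference, this is the behaviour I would expect from such a call (a minimal sketch with a placeholder URL, just to spell out the expectation):

<cfset variables.tickStart = getTickCount()>
<cfhttp url="http://api.example.org/endpoint" method="get"
        result="variables.result" throwonerror="no" timeout="30"></cfhttp>
<cfset variables.secondsTaken = (getTickCount() - variables.tickStart) / 1000>
<!--- expectation: secondsTaken should never exceed ~30 seconds (the cfhttp timeout),
      and the 50-second request timeout should be a hard upper bound; in the
      zombie state described above, neither limit is honoured --->
<cfif variables.secondsTaken gt 35>
    <cflog file="cfhttp-timeouts" text="cfhttp ran for #variables.secondsTaken# seconds">
</cfif>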

Environment

Linux (4.9.0-7-amd64) 64bit, Apache Tomcat/8.5.14 (Debian), Java 1.8.0_181 (Oracle Corporation) 64bit


Activity


dan.switzer@givainc.com 1 May 2024 at 20:11

While I’m not sure that it matters, I’ve only seen the problem occur with CFHTTP calls using the HEAD method. That might just be because this specific unit test makes a lot of HEAD HTTP calls, but I thought it worth mentioning since HEAD is not a method I’d normally test with.

dan.switzer@givainc.com 1 May 2024 at 19:51

I’ve been seeing an issue with HTTP calls causing blocking as well. In our case, we’ve been testing with both 5.4.5.27-SNAPSHOT and 5.4.3.2.

In our case, the problem appears to occur when CFHTTP is inside a thread, but that might be a false flag. What we see is that the Java thread ends up in a TIMED_WAITING state in the lucee.runtime.tag.Http._doEndTag method:

Here’s the full stacktrace of the JVM when the problem occurs:

When I start debugging what’s going on with FusionReactor, I can see that the HTTP operation has completed, but the thread is left open:

In looking at the code, it seems the synchronized block in _doEndTag() that is meant to set the timeout is causing the deadlock:

My suspicion is that Lucee is trying to re-use connections, which is why the synchronized block is causing the deadlock.

I’ve been trying to come up with a standalone test case to reproduce the problem, but haven’t had any luck. I do have a unit test case which runs into the problem about 70-80% of the time. It doesn’t always fail in the same part of the test case, but once it hits the deadlock the thread just stays in TIMED_WAITING until it eventually times out.
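For context, the failing test boils down to something like the following (a rough, hypothetical sketch only - parallel HEAD requests inside cfthread, with a made-up URL and thread names; the real unit test is larger):

<cfset threadNames = []>
<cfloop from="1" to="10" index="i">
    <cfset arrayAppend(threadNames, "headCheck#i#")>
    <cfthread action="run" name="headCheck#i#" idx="#i#">
        <!--- many parallel HEAD requests; every so often one of these cfhttp
              calls stays parked in TIMED_WAITING inside Http._doEndTag() --->
        <cfhttp url="https://service.example/resource/#attributes.idx#"
                method="head" result="thread.result" timeout="10"></cfhttp>
    </cfthread>
</cfloop>
<cfthread action="join" name="#arrayToList(threadNames)#" timeout="30000"></cfthread>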

Zac Spitzer 11 September 2020 at 21:05

With 5.3.8.49 and that example, in between successful results I randomly got

Premature end of Content-Length delimited message body (expected: 10; received: 9)

and

Illegal packet size: 39879

Illegal packet size: 33566

Pete Freitag 11 September 2020 at 19:45

Assuming it is the same issue, here is the stacktrace I have against a thread that is tied up with this:

Pete Freitag 11 September 2020 at 19:40

I think I have found the issue: when you have a connection that is writing slowly to the response, it can exceed the timeout set on cfhttp. Here is a reproducible test case using httpbin, which has a /drip API that slowly outputs bytes over a given duration. If you set the drip duration to 5 seconds and the cfhttp timeout to 2 seconds, the cfhttp call will not time out.
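A minimal sketch along those lines (assuming the public httpbin.org instance; the exact snippet used may differ):

<!--- /drip streams the response body slowly over the given duration; the
      2-second cfhttp timeout is expected to abort the call, but does not --->
<cfset start = getTickCount()>
<cfhttp url="https://httpbin.org/drip?duration=5&numbytes=10&code=200"
        method="get" result="dripResult" timeout="2" throwonerror="no"></cfhttp>
<cfoutput>Took #(getTickCount() - start) / 1000# seconds</cfoutput>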

The result is that the page takes 5 seconds to execute and is not interrupted by either the cfhttp timeout or the request timeout.

Note that if you switch /drip to /delay/5 the timeout kicks in and works, so the problem only occurs once the server starts writing the HTTP response and takes a long time to finish writing. You could probably reproduce that drip behaviour with a cfflush.

Details


Created 8 August 2018 at 12:18
Updated 1 May 2024 at 20:11