Datasource timeout (isvalid()) checks can completely lock connection pool

Description

The Controler thread that checks every JDBC connection with isValid() does so inside of a synchronized method that locks the entire DCStack object (which represents the connection pool). When one of the isValid() checks hangs for a long time, it completely kills the server since no other threads are allowed to get or return a connection to the pool.

Find a way to check out each connection and test them OUTSIDE of a synchronized method. It's simply too dangerous to have the entire pool tied up while you wait for a network connection to a remote resource, which could be slow, or never return. I realize this means that the pool may be changing while the timeout checks are running, but honestly I think this can be worked around.

  • If a new connection has been added, we know it's good because it's brand new.

  • And if a connection was removed from the pool while we're checking, then we know it's good as well because it's being used!

More information: I've run into this same issue (and brought it up) years ago and it's really biting a current client of mine using an Azure managed database. Azure uses an internal failover mode that can switch the physical machines around on the backend and is documented to have intermittent network drops that your application needs to work around. The isValid() timeout also seems to have documented bugs in MS SQL's JDBC driver where it can hang forever:
https://github.com/microsoft/mssql-jdbc/issues/1128

My client has updated to the very latest 8.4.1 version of MS's JDBC driver and it does not fix their issue. Below is what the stack trace looks like when an isValid() check goes out to lunch and never comes back and it takes down the entire server by keeping the connection pool locked. Ultimately, this is partially a problem with the underlying JDBC driver, but Lucee needs to do more to prevent a reset network connection from taking down the entire server.

My client has over 100 datasources defined, and by default Lucee's controller thread is checking every connection on every datasource once a minute. That's potentially hundreds to thousands of isValid() calls being made to the DB every minute!

"Thread-99468" Id=101598 RUNNABLE (in native) java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1015) at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1005) at sun.security.ssl.InputRecord.readFully(Unknown Source) at sun.security.ssl.InputRecord.read(Unknown Source) at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source) - locked java.lang.Object@254e1ded at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source) at sun.security.ssl.AppInputStream.read(Unknown Source) - locked sun.security.ssl.AppInputStream@69eb7bec at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2054) at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6596) - locked com.microsoft.sqlserver.jdbc.TDSReader@605bbee0 at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7760) at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:870) at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:767) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7375) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3206) - locked java.lang.Object@74765a62 at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeQueryInternal(SQLServerStatement.java:699) at com.microsoft.sqlserver.jdbc.SQLServerConnection.isValid(SQLServerConnection.java:5795) at lucee.runtime.db.DCStack.isValidEL(DCStack.java:197) at lucee.runtime.db.DCStack.clear(DCStack.java:154) at lucee.runtime.db.DCStack.clear(DCStack.java:172) at lucee.runtime.db.DCStack.clear(DCStack.java:139) - locked lucee.runtime.db.DCStack@156d0c4e at lucee.runtime.db.DatasourceConnectionPool.clear(DatasourceConnectionPool.java:189) at lucee.runtime.engine.Controler.control(Controler.java:325) at lucee.runtime.engine.Controler.control(Controler.java:236) at lucee.runtime.engine.Controler.access$000(Controler.java:60) at lucee.runtime.engine.Controler$ControlerThread.run(Controler.java:113)

Environment

None

Activity

Show:

Brad Wood 30 October 2020 at 20:54

Thanks for the work on this. I had followed the network timeout before and I thought the

private static final int NETWORK_TIMEOUT_IN_SECONDS = 10;

was being used from the top of the DataSourceSupport class which is hard-coded to 10, but perhaps that's not the case? If the network timeout is coming from somewhere else, is there a way to change it on previous versions of Lucee?

I am ok with this solution so long as the isValid() will properly timeout and not hang forever. We'll have to do some testing and see if the timeout stops the hanging.

Michael Offner 30 October 2020 at 17:56

There was a problem and a bug with isValid(timeout), it is called with the value coming from “getNetworkTimeout()“, problem is that this value is in milliseconds and can be 0.

So i have made sure the value is translated to seconds (always round to the next second) and in case it is 0 it is set to a default timeout what is 5 seconds by default and can be set with the env variable “lucee.datasource.timeout.validation“ to whatever you like.

At this time i would prefer not to change the sync logic with the pool, because that needs a lot of testing, i would instead invest that time to move to apache commons pool2 as a datasource pool.

That change should avoid that isValid hangs forever.

ONe possible problem could be that we now have to short timeout for the call and because of that get a lot of reconnects.

Is that ticket fixed for you with that change?

Fixed

Details

Assignee

Reporter

Priority

Fix versions

New Issue warning screen

Before you create a new Issue, please post to the mailing list first https://dev.lucee.org

Once the issue has been verified, one of the Lucee team will ask you to file an issue

Created 28 October 2020 at 17:46
Updated 2 days ago
Resolved 30 October 2020 at 20:55

Flag notifications