The Controler thread that checks every JDBC connection with isValid() does so inside of a synchronized method that locks the entire DCStack object (which represents the connection pool). When one of the isValid() checks hangs for a long time, it completely kills the server since no other threads are allowed to get or return a connection to the pool.
Find a way to check out each connection and test them OUTSIDE of a synchronized method. It's simply too dangerous to have the entire pool tied up while you wait for a network connection to a remote resource, which could be slow, or never return. I realize this means that the pool may be changing while the timeout checks are running, but honestly I think this can be worked around.
If a new connection has been added, we know it's good because it's brand new.
And if a connection was removed from the pool while we're checking, then we know it's good as well because it's being used!
More information: I've run into this same issue (and brought it up) years ago and it's really biting a current client of mine using an Azure managed database. Azure uses an internal failover mode that can switch the physical machines around on the backend and is documented to have intermittent network drops that your application needs to work around. The isValid() timeout also seems to have documented bugs in MS SQL's JDBC driver where it can hang forever:
My client has updated to the very latest 8.4.1 version of MS's JDBC driver and it does not fix their issue. Below is what the stack trace looks like when an isValid() check goes out to lunch and never comes back and it takes down the entire server by keeping the connection pool locked. Ultimately, this is partially a problem with the underlying JDBC driver, but Lucee needs to do more to prevent a reset network connection from taking down the entire server.
My client has over 100 datasources defined, and by default Lucee's controller thread is checking every connection on every datasource once a minute. That's potentially hundreds to thousands of isValid() calls being made to the DB every minute!
There was a problem and a bug with isValid(timeout), it is called with the value coming from “getNetworkTimeout()“, problem is that this value is in milliseconds and can be 0.
So i have made sure the value is translated to seconds (always round to the next second) and in case it is 0 it is set to a default timeout what is 5 seconds by default and can be set with the env variable “lucee.datasource.timeout.validation“ to whatever you like.
At this time i would prefer not to change the sync logic with the pool, because that needs a lot of testing, i would instead invest that time to move to apache commons pool2 as a datasource pool.
That change should avoid that isValid hangs forever.
ONe possible problem could be that we now have to short timeout for the call and because of that get a lot of reconnects.
Is that ticket fixed for you with that change?
Thanks for the work on this. I had followed the network timeout before and I thought the
was being used from the top of the DataSourceSupport class which is hard-coded to 10, but perhaps that's not the case? If the network timeout is coming from somewhere else, is there a way to change it on previous versions of Lucee?
I am ok with this solution so long as the isValid() will properly timeout and not hang forever. We'll have to do some testing and see if the timeout stops the hanging.