Java heap memory exhaustion, pegged CPU, and unresponsive server

Description

Since upgrading to 5.2.2.71, my production Lucee server would become unresponsive within about 24 hours. Lucee was still running, but it was using 95%+ of the CPU, and all HTTP requests to the server hung. This is a serious issue that prevents some users from upgrading beyond version 5.2.1.9.

With some difficulty, I have been able to reproduce this locally in 5.2.2.71 and later releases. Version 5.2.1.9 does not have this problem, and is stable in production.

Repeatedly capturing the output of getMemoryUsage() shows the following pattern of memory usage in the problematic Lucee versions:

  1. Tenured generation space increases fairly rapidly, with only small portions being reclaimed by garbage collection runs.

  2. Tenured generation space becomes completely full.

  3. Used Eden space starts increasing.

  4. Eden space becomes completely full.

A little while after this, Lucee starts using a high percentage of CPU (presumably attempting repeated GC runs) and the server becomes unresponsive. By contrast, in version 5.2.1.9, tenured generation memory usage increases very slowly, and never maxes out even after weeks of uptime in production. See the attached charts for memory and CPU usage in the last good version, the first bad version, and the current version.

5.2.5.20 (bad)

5.2.2.71 (bad)

5.2.1.9 (good)
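
The description above presumes that the high CPU usage comes from repeated GC runs. One way that presumption could be checked is to measure what share of wall-clock time the JVM reports as spent in garbage collection. The following is a minimal sketch using the standard java.lang.management API; the class name GcOverhead and its method names are hypothetical, and it measures the JVM it runs in, so it would need to be loaded into the Lucee process (or adapted to remote JMX) to say anything about Lucee itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

public class GcOverhead {

    /** Sums the accumulated collection time (ms) reported by all collectors of this JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Share of an interval attributed to GC, clamped to the range [0, 1]. */
    public static double gcShare(long gcMillisBefore, long gcMillisAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        double share = (double) (gcMillisAfter - gcMillisBefore) / elapsedMillis;
        return Math.max(0.0, Math.min(1.0, share));
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional first argument: measurement interval in milliseconds (assumed default: 1000).
        long intervalMs = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        Thread.sleep(intervalMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        double share = gcShare(gcBefore, totalGcMillis(), elapsedMs);
        System.out.printf(Locale.ROOT, "GC share over %d ms: %.1f%%%n", elapsedMs, 100 * share);
    }
}
```

A share that stays close to 100% while the tenured pool is full would be consistent with the GC-thrashing explanation. The reported collection time is approximate and is not a direct measure of CPU use, so treat the number as an indicator only.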

At this time, I unfortunately do not have a reproducible test case that I can share. Reproducing the problem involves crawling my site while capturing memory stats to a log file. Each test takes about an hour, and currently requires my application's custom code, which I am not at liberty to share. However, from the discussion boards, this seems to be a common complaint from other Lucee users, and affects applications other than mine (such as Mura CMS), so I wanted to create an authoritative place to collect information and discuss the issue.
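
For anyone who wants to collect comparable numbers from their own application, here is a minimal sketch of sampling per-pool heap usage to a tab-separated log. It is not the script used for the charts above. It uses the standard java.lang.management API rather than getMemoryUsage(), the class name HeapPoolSampler and the log layout are hypothetical, and it reports on the JVM it runs in, so it would have to run inside the Lucee process to be useful here. Pool names depend on the collector in use (for example "PS Eden Space" and "PS Old Gen" with the Java 8 parallel collector).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class HeapPoolSampler {

    /** Formats one log line: timestamp, pool name, used bytes, max bytes, percent used. */
    public static String formatLine(Instant when, String poolName, long usedBytes, long maxBytes) {
        // MemoryUsage.getMax() returns -1 when the pool has no defined maximum.
        String pct = maxBytes > 0
                ? String.format(Locale.ROOT, "%.1f%%", 100.0 * usedBytes / maxBytes)
                : "n/a";
        return when + "\t" + poolName + "\t" + usedBytes + "\t" + maxBytes + "\t" + pct;
    }

    /** Takes one sample of every heap memory pool in the current JVM. */
    public static List<String> sample(Instant when) {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as Metaspace and the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(formatLine(when, pool.getName(), usage.getUsed(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional arguments: number of samples and interval in ms (assumed defaults: 1 and 10000).
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        long intervalMs = args.length > 1 ? Long.parseLong(args[1]) : 10_000L;
        for (int i = 0; i < samples; i++) {
            for (String line : sample(Instant.now())) {
                System.out.println(line);
            }
            if (i < samples - 1) {
                Thread.sleep(intervalMs);
            }
        }
    }
}
```

Redirecting the output to a file during a crawl gives one row per pool per sample, which can then be charted to see whether the old generation fills first and Eden second.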

References to other reports that I suspect may be experiencing this issue:

Environment

Ubuntu 16.04
Java 1.8.0_151
1.5GB RAM
MSSQL Datasource
JAVA_OPTS="-Xms256m -Xmx512m -XX:MaxPermSize=128m"

Activity

Leon Miller-Out
January 4, 2018, 8:11 PM

Test code for the cachedwithin="0" fix:

<!--- Start with an empty query; each QueryAddRow() below adds one row to the live data. --->
<cfset qry = QueryNew('blerg')>
<!--- Runs against 0 rows and caches that result for one minute. --->
<cfquery name="populateCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- Same SQL within the cache window: should return the cached 0-row result. --->
<cfquery name="readCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- cachedwithin="0" should bypass and clear the cache, so the 2 live rows are returned. --->
<cfquery name="clearCache" dbtype="query" cachedwithin="0">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- The cache was cleared, so this runs against the 3 live rows and caches them again. --->
<cfquery name="repopulateCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>
<cfscript>
WriteOutput("populateCache should find 0 rows. It found #populateCache.recordcount#<br>");
WriteOutput("readCache should find 0 rows. It found #readCache.recordcount#<br>");
WriteOutput("clearCache should find 2 rows. It found #clearCache.recordcount#<br>");
WriteOutput("repopulateCache should find 3 rows. It found #repopulateCache.recordcount#<br>");
</cfscript>

Note: on trycf.com, the only engines that run that test code correctly are Lucee 4.5 and Railo 4.5. I think trycf's configurations for ACF 10 and 11 must have query caching totally disabled, and ACF 2016 also has a bug with cachedwithin that has been fixed recently but isn't on trycf.com yet.

Leon Miller-Out
February 27, 2018, 5:40 PM

The rapid exhaustion of memory seems to be fixed in Lucee 5.2.7.21 (due to the fix for https://luceeserver.atlassian.net/browse/LDEV-1480). I still think that a cap on the query size is necessary to prevent eventual memory exhaustion, but I think this can now be closed in favor of https://luceeserver.atlassian.net/browse/LDEV-1643.

Igal Sapir
February 27, 2018, 5:57 PM

Great to hear!

Igal Sapir
February 27, 2018, 5:59 PM

Fixed with according to OP

Igal Sapir
February 27, 2018, 6:01 PM

Actually, since is still open (QA), I'd rather mark this one fixed than rejected.

Fixed

Assignee: Igal Sapir
Reporter: Leon Miller-Out
Priority: New
Sprint: None
