Java heap memory exhaustion, pegged CPU, and unresponsive server

Fixed

Description

Since upgrading to 5.2.2.71, my production Lucee server would become unresponsive within about 24 hours. Lucee was still running, but using 95%+ of the CPU and all HTTP requests to the server just hung. This is a serious issue that prevents some users from upgrading beyond version 5.2.1.9.

With some difficulty, I have been able to reproduce this locally in releases after and including 5.2.2.71. Version 5.2.1.9 does not have this problem, and is stable in production.

Capturing the output of getMemoryUsage() repeatedly shows the following pattern of memory usage in the problematic Lucee versions:

  1. Tenured generation space increases fairly rapidly, with only small portions being reclaimed by garbage collection runs.

  2. Tenured generation space becomes completely full.

  3. Used Eden space starts increasing.

  4. Eden space becomes completely full.

A little while after this, Lucee starts using a high percentage of CPU (presumably attempting repeated GC runs) and the server becomes unresponsive. By contrast, in version 5.2.1.9, tenured generation memory usage increases very slowly, and never maxes out even after weeks of uptime in production. See the attached charts for memory and CPU usage in the last good version, the first bad version, and the current version.

5.2.5.20 (bad)

5.2.2.71 (bad)

5.2.1.9 (good)

At this time, I unfortunately do not have a reproducible test case that I can share. Reproducing the problem involves crawling my site while capturing memory stats to a log file. Each test takes about an hour, and currently requires my application's custom code, which I am not at liberty to share. However, from the discussion boards, this seems to be a common complaint from other Lucee users, and affects applications other than mine (such as Mura CMS), so I wanted to create an authoritative place to collect information and discuss the issue.
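
For anyone who wants to gather comparable data, the memory logging described above can be approximated with a small template that is requested on a schedule, or by the crawler itself. This is only a sketch, not the code used to produce the attached charts: the log file name is hypothetical, and the whole query returned by getMemoryUsage() is serialized as JSON so that no particular column names are assumed.

```cfml
<!--- Hypothetical log location; adjust to a directory the server can write to. --->
<cfset logFile = expandPath("./memory-usage.log")>
<!--- getMemoryUsage() returns a query describing JVM memory usage. --->
<cfset mem = getMemoryUsage()>
<cfset stamp = dateFormat(now(), "yyyy-mm-dd") & " " & timeFormat(now(), "HH:mm:ss")>
<!--- Append one timestamped line per sample. --->
<cffile action="append" file="#logFile#" output="#stamp# #serializeJSON(mem)#" addnewline="true">
```

Sampling every minute or so during a crawl should be enough to see whether the tenured generation keeps growing between garbage collection runs.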

References to other reports that I suspect may be experiencing this issue:

Environment

Ubuntu 16.04
Java 1.8.0_151
1.5GB RAM
MSSQL Datasource
JAVA_OPTS="-Xms256m -Xmx512m -XX:MaxPermSize=128m"

Attachments

5

Details

Created 3 January 2018 at 21:19
Updated 8 March 2018 at 02:15
Resolved 27 February 2018 at 18:01

Activity

Igal Sapir, 27 February 2018 at 18:01

Actually, since the linked issue is still open (QA), I would rather mark this one fixed than rejected.

Igal Sapir, 27 February 2018 at 17:59

Fixed, according to the OP.

Igal Sapir, 27 February 2018 at 17:57

Great to hear!

Leon Miller-Out, 27 February 2018 at 17:40

The rapid exhaustion of memory seems to be fixed in Lucee 5.2.7.21 (due to the fix for https://luceeserver.atlassian.net/browse/LDEV-1480). I still think that a cap on the query size is necessary to prevent eventual memory exhaustion, but I think this can now be closed in favor of https://luceeserver.atlassian.net/browse/LDEV-1643.

Leon Miller-Out, 4 January 2018 at 20:11

Test code for the cachedwithin="0" fix:

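<!--- Start with an empty query; a row is added between each of the reads below. --->
<cfset qry = QueryNew('blerg')>
<!--- Cache the empty result for one minute. --->
<cfquery name="populateCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- Should be served from the cache, so the new row is not visible. --->
<cfquery name="readCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- cachedwithin="0" should bypass the cache and clear the cached result. --->
<cfquery name="clearCache" dbtype="query" cachedwithin="0">
select * from qry
</cfquery>
<cfset QueryAddRow(qry)>
<!--- Should run the query again and cache the fresh result. --->
<cfquery name="repopulateCache" dbtype="query" cachedwithin="#createTimeSpan(0,0,1,0)#">
select * from qry
</cfquery>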
<cfscript>
WriteOutput("populateCache should find 0 rows. It found #populateCache.recordcount#<br>");
WriteOutput("readCache should find 0 rows. It found #readCache.recordcount#<br>");
WriteOutput("clearCache should find 2 rows. It found #clearCache.recordcount#<br>");
WriteOutput("repopulateCache should find 3 rows. It found #repopulateCache.recordcount#<br>");
</cfscript>

Note: on trycf.com, the only engines that run that test code correctly are Lucee 4.5 and Railo 4.5. I think trycf's configurations for ACF 10 and 11 must have query caching totally disabled, and ACF 2016 also has a bug with cachedwithin that has been fixed recently but isn't on trycf.com yet.
