QueryColumn's internal data is grown beyond required size
Description
Environment
Attachments
- 13 Mar 2025, 02:21 pm
- 16 Jan 2025, 12:29 pm
relates to
Activity
Zac Spitzer 13 March 2025 at 14:21
it’s a good change! re-enabling it
https://github.com/lucee/lucee-testlab/actions/runs/13836434713/attempts/1#summary-38714045416
Zac Spitzer 13 March 2025 at 13:50Edited
I’m temp reverting this change in 6.2.1.64 to compare against 6.2.1.63 to do some benchmarking, to rule out a perf regression due to this on testlab
https://github.com/lucee/lucee-testlab/actions/runs/13788879542/attempts/2#summary-38711068483
Zac Spitzer 11 February 2025 at 14:25
Brad Wood 16 January 2025 at 19:25
@Zac Spitzer There’s nothing wrong with it, per se. This is common in the internals of many built-in Java data structures such as a string builder or arraylist. They allocate additional size in their internal data to prevent the overhead of having to grow more often. When doing my last round of performance improvements for Lucee, I found that building a large query object (thousands of rows up to a million rows) spent a very large amount of time growing all the arrays (each column must be grown separately), and worse yet-- the “grow” code is synchronized so it grinds all threads to a halt every time it grows. What this does is sacrifice a small amount of unused memory for the benefit of not having to grow as often. The idea is, if the query makes it to 4 rows, it’s likely to grow to 8 rows. And it it makes it to 200 rows, it’s likely to grow to 400 rows, etc. 2x was the sweet spot I had found in my testing that really optimized large queries with hundreds of thousands of rows that needed added. The larger the query gets, the longer the array copy takes, so the more important it is to avoid it! It may seem excessive, but the overhead of null items in a native Java array is very small. It’s just a null pointer on the heap, which is 4KB (for a compression pointer) on a 64 bit machine.
For a “normal” sized query of, say, 1000 rows, we’ll assume the actual array size is 2,000 items. So the extra 1000 null pointers is 1000 rows * 10 columns * 4 KB null pointer / 8 / 1024 = 5K of extra heap per column. 5B of heap is basically nothing, especially given the size of the actual 1000 rows with populated data. Let’s assume the query has an average number of 10 columns, which are a mix of integers and 250 character strings. The actual data in the query would be around 2.8 MB, so the 40KB of “extra” space (for all 10 columns) in the arrays adds an overhead of 1.4%. The ratio of overhead stays the same if the query grows, and diminishes, the larger the data in the cells is. For your 500k row example in the ticket, if it had 10 columns again with a mix of ints and 250 character strings, that’s 1.4 GB of data and 20 MB of overhead for the “empty” rows. Again, only a 1.4% overhead.
So basically, you can reduce it if you want, but I really think the overhead is reasonable given the huge performance increase it provides at large sizes and the relatively small memory trade off.
wouldn’t
System.arraycopy
be more efficient
Great question. I don’t see a reason not to switch to using it!
I’ll also note, in BoxLang I implemented several of my additional ideas that I tried to present to Lucee a few years back. I could never get traction here for larger architectural changes, but I implemented them all in BL. My query object uses an un-syncronized auto-growing ArrayLIst (I trust the authors of the JDK to have the best algorithm) and I store the data by row as a native array, which is more convenient for most looping operations and only has one “grow” point. I do a similar style of growing the List optimistically that doesn’t require me to do ANY manual copying of data, plus I am able to use NO synchronization at all when adding rows so it’s crazy fast under high concurrency. If you haven’t seen, the proof of my improved approach is documented in this recent blog post
Zac Spitzer 16 January 2025 at 15:42
I did play around with all this on java 21, with a few different approaches, doesn’t seem to make much a difference at all
Details
Assignee
UnassignedUnassignedReporter
Zac SpitzerZac SpitzerPriority
NewFix versions
New Issue warning screen
Before you create a new Issue, please post to the mailing list first https://dev.lucee.org
Once the issue has been verified, one of the Lucee team will ask you to file an issue
Details
Details
Assignee
Reporter
Priority
Fix versions
New Issue warning screen
Before you create a new Issue, please post to the mailing list first https://dev.lucee.org
Once the issue has been verified, one of the Lucee team will ask you to file an issue
@Brad Wood is all this doubling necessary? with this code, the column data object has now 126 objects (which is repeated for every column in a query)
Lucee/core/src/main/java/lucee/runtime/type/QueryColumnImpl.java at 6.2 · lucee/Lucee
<cfscript> q = queryNew("id"); loop times=40 { queryAddRow( q ); } </cfscript>