QueryColumn's internal data is grown beyond required size

Description

is all this doubling necessary? with this code, the column data object has now 126 objects (which is repeated for every column in a query)

Lucee/core/src/main/java/lucee/runtime/type/QueryColumnImpl.java at 6.2 · lucee/Lucee

 

<cfscript> q = queryNew("id"); loop times=40 { queryAddRow( q ); } </cfscript>

 

image-20250116-120556.png

Environment

None

Attachments

2
  • 13 Mar 2025, 02:21 pm
  • 16 Jan 2025, 12:29 pm

Activity

Show:

Zac Spitzer 13 March 2025 at 14:21

Zac Spitzer 13 March 2025 at 13:50
Edited

I’m temp reverting this change in 6.2.1.64 to compare against 6.2.1.63 to do some benchmarking, to rule out a perf regression due to this on testlab

https://github.com/lucee/lucee-testlab/actions/runs/13788879542/attempts/2#summary-38711068483

Brad Wood 16 January 2025 at 19:25

There’s nothing wrong with it, per se. This is common in the internals of many built-in Java data structures such as a string builder or arraylist. They allocate additional size in their internal data to prevent the overhead of having to grow more often. When doing my last round of performance improvements for Lucee, I found that building a large query object (thousands of rows up to a million rows) spent a very large amount of time growing all the arrays (each column must be grown separately), and worse yet-- the “grow” code is synchronized so it grinds all threads to a halt every time it grows. What this does is sacrifice a small amount of unused memory for the benefit of not having to grow as often. The idea is, if the query makes it to 4 rows, it’s likely to grow to 8 rows. And it it makes it to 200 rows, it’s likely to grow to 400 rows, etc. 2x was the sweet spot I had found in my testing that really optimized large queries with hundreds of thousands of rows that needed added. The larger the query gets, the longer the array copy takes, so the more important it is to avoid it! It may seem excessive, but the overhead of null items in a native Java array is very small. It’s just a null pointer on the heap, which is 4KB (for a compression pointer) on a 64 bit machine.

For a “normal” sized query of, say, 1000 rows, we’ll assume the actual array size is 2,000 items. So the extra 1000 null pointers is 1000 rows * 10 columns * 4 KB null pointer / 8 / 1024 = 5K of extra heap per column. 5B of heap is basically nothing, especially given the size of the actual 1000 rows with populated data. Let’s assume the query has an average number of 10 columns, which are a mix of integers and 250 character strings. The actual data in the query would be around 2.8 MB, so the 40KB of “extra” space (for all 10 columns) in the arrays adds an overhead of 1.4%. The ratio of overhead stays the same if the query grows, and diminishes, the larger the data in the cells is. For your 500k row example in the ticket, if it had 10 columns again with a mix of ints and 250 character strings, that’s 1.4 GB of data and 20 MB of overhead for the “empty” rows. Again, only a 1.4% overhead.

So basically, you can reduce it if you want, but I really think the overhead is reasonable given the huge performance increase it provides at large sizes and the relatively small memory trade off.

wouldn’t System.arraycopy be more efficient

Great question. I don’t see a reason not to switch to using it!

I’ll also note, in BoxLang I implemented several of my additional ideas that I tried to present to Lucee a few years back. I could never get traction here for larger architectural changes, but I implemented them all in BL. My query object uses an un-syncronized auto-growing ArrayLIst (I trust the authors of the JDK to have the best algorithm) and I store the data by row as a native array, which is more convenient for most looping operations and only has one “grow” point. I do a similar style of growing the List optimistically that doesn’t require me to do ANY manual copying of data, plus I am able to use NO synchronization at all when adding rows so it’s crazy fast under high concurrency. If you haven’t seen, the proof of my improved approach is documented in this recent blog post slightly smiling face

Zac Spitzer 16 January 2025 at 15:42

I did play around with all this on java 21, with a few different approaches, doesn’t seem to make much a difference at all

Fixed

Details

Assignee

Reporter

Priority

Fix versions

New Issue warning screen

Before you create a new Issue, please post to the mailing list first https://dev.lucee.org

Once the issue has been verified, one of the Lucee team will ask you to file an issue

Created 16 January 2025 at 12:27
Updated 13 March 2025 at 14:21
Resolved 16 January 2025 at 15:42

Flag notifications