perf: another ExternalSorter refactor #21688
mbutrovich wants to merge 12 commits into apache:main
Draft while I benchmark.
Which issue does this PR close?
Partially addresses #21543.
Rationale for this change
ExternalSorter sorts each incoming batch individually (typically 8192 rows), then k-way merges all of them. At scale (TPC-H SF10, ~60M rows in lineitem), this produces ~7300 individually-sorted batches feeding the k-way merge with high fan-in.
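To illustrate the fan-in cost: each output row of a k-way merge pays O(log k) heap comparisons, so ~7300 runs cost ~13 comparisons per row versus ~11 at ~1800 runs. A minimal sketch with a binary heap (illustrative only, not DataFusion's streaming merge):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge many individually sorted runs through a min-heap.
/// Heap depth, and thus comparisons per output row, grows with
/// the number of runs (the merge fan-in).
fn kway_merge(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, run_idx, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

fn main() {
    // Three pre-sorted "batches" standing in for 8192-row sorted runs.
    let runs = vec![vec![1, 4, 7], vec![2, 5, 8], vec![3, 6, 9]];
    assert_eq!(kway_merge(runs), (1..=9).collect::<Vec<i64>>());
}
```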
An earlier PR (#21629) explored coalescing all columns via `BatchCoalescer` before sorting. That reduced fan-in and gave 1.2-1.5x TPC-H speedups, but caused 2.5x regressions on multi-column StringArray/DictionaryArray sorts, because `take` scatter-gathers across large string heaps that exceed the L2 cache.

This PR takes a different approach: extract only the sort-key columns, use a cache-friendly sort kernel, and reconstruct full rows afterward.
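The extract/sort/reconstruct flow can be sketched with plain Vecs; `sort_and_reconstruct` is a hypothetical stand-in for arrow's `take` and interleave kernels, not the actual implementation:

```rust
/// Gather sorted keys from a concatenated key column, then rebuild the
/// value column from the ORIGINAL batches via (batch, row) indices.
/// Simplified toy model: Vecs instead of Arrow arrays.
fn sort_and_reconstruct(
    keys: &[Vec<i64>],
    vals: &[Vec<&'static str>],
) -> (Vec<i64>, Vec<&'static str>) {
    // Concatenate only the key columns; value columns stay in their batches.
    let concat_keys: Vec<i64> = keys.iter().flatten().copied().collect();

    // Sort indices into the concatenated key column ("take" indices).
    let mut idx: Vec<usize> = (0..concat_keys.len()).collect();
    idx.sort_by_key(|&i| concat_keys[i]);

    // Map a global row index back to its (batch, row) origin.
    let to_batch_row = |mut i: usize| {
        for (b, batch) in keys.iter().enumerate() {
            if i < batch.len() {
                return (b, i);
            }
            i -= batch.len();
        }
        unreachable!()
    };

    // Keys: gathered from the concatenated column (like a `take`).
    let sorted_keys = idx.iter().map(|&i| concat_keys[i]).collect();
    // Values: interleaved from the original batches (like an interleave).
    let sorted_vals = idx
        .iter()
        .map(|&i| {
            let (b, r) = to_batch_row(i);
            vals[b][r]
        })
        .collect();
    (sorted_keys, sorted_vals)
}

fn main() {
    // Two original input batches: a key column and a value column each.
    let keys = vec![vec![30, 10], vec![20]];
    let vals = vec![vec!["c", "a"], vec!["b"]];
    let (k, v) = sort_and_reconstruct(&keys, &vals);
    assert_eq!(k, vec![10, 20, 30]);
    assert_eq!(v, vec!["a", "b", "c"]);
}
```

The point of the two-step gather is that the (potentially wide) value columns are never concatenated or re-sorted; only the key column is touched during the sort itself.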
What changes are included in this PR?
Key-only coalescing with adaptive sort kernel
Replaces ExternalSorter's per-batch sort architecture:
- Incoming batches are buffered until `sort_coalesce_target_rows` (default 32768) is reached.
- Multi-column keys are sorted via row encoding, because `lexsort_to_indices` jumps between separate column arrays.
- Single-column keys are sorted with `lexsort_to_indices` (SIMD-optimized, no encoding overhead).
- Key columns are `take`d from the concatenated key batch; value columns are interleaved from the original input batches.
- Sorted runs feed the `StreamingMergeBuilder`.

This reduces merge fan-in from ~7300 to ~1800 runs at SF10.
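A simplified sketch of the buffer-then-sort idea, assuming the 32768-row default. The types and the byte encoding are illustrative toys, not DataFusion's or arrow-rs's actual APIs; the encoding only mimics the idea of a contiguous, order-preserving row format:

```rust
/// Toy order-preserving encoding for a (String, u32) key row: the row's
/// byte string compares the same way the row sorts logically. Assumes
/// strings contain no NUL bytes; big-endian preserves integer order.
fn encode_key(s: &str, n: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(s.len() + 5);
    buf.extend_from_slice(s.as_bytes());
    buf.push(0); // terminator so "ab" sorts before "abc"
    buf.extend_from_slice(&n.to_be_bytes());
    buf
}

/// Buffer incoming key rows until a target count, then sort the whole
/// coalesced run once by its contiguous encoded key.
struct KeyCoalescer {
    target_rows: usize,
    buffered: Vec<(String, u32)>,
    runs: Vec<Vec<(String, u32)>>, // each run is sorted on flush
}

impl KeyCoalescer {
    fn new(target_rows: usize) -> Self {
        Self { target_rows, buffered: Vec::new(), runs: Vec::new() }
    }

    fn push_batch(&mut self, batch: &[(String, u32)]) {
        self.buffered.extend_from_slice(batch);
        if self.buffered.len() >= self.target_rows {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.buffered.is_empty() {
            return;
        }
        let mut run = std::mem::take(&mut self.buffered);
        // One sort per coalesced run instead of one per small batch;
        // each comparison touches a single contiguous byte key.
        run.sort_by_key(|(s, n)| encode_key(s, *n));
        self.runs.push(run);
    }
}

fn main() {
    let mut c = KeyCoalescer::new(4);
    c.push_batch(&[("b".into(), 1), ("a".into(), 9)]); // buffered: 2 rows
    c.push_batch(&[("a".into(), 2), ("c".into(), 0)]); // hits target, flushes
    c.flush(); // flush any remainder
    assert_eq!(c.runs.len(), 1);
    assert_eq!(c.runs[0][0], ("a".to_string(), 2));
}
```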
Why RowConverter for multi-column sort?
Arrow-rs benchmarks (arrow-rs#9683) show RowConverter-based sorting (`lexsort_rows`) is 1.3-2.5x faster than `lexsort_to_indices` for multi-column string schemas at 32K rows. The key insight: `lexsort_to_indices` with multiple columns does cascading comparisons across separate column arrays. At 32K rows with 3 string columns, the sort's random access pattern crosses cache boundaries on every tie-breaking comparison. RowConverter encodes all key columns into one contiguous buffer, so comparisons stay within a single memory region.

This is the same row encoding the streaming merge already uses (via `RowCursorStream`), and the same encoding that the future radix sort kernel (arrow-rs#9683) will sort on. When radix sort lands, it's a drop-in replacement on the same encoded data.

Why key-only extraction?
Concatenating and reordering non-key columns is wasted work: they don't participate in sort comparisons. For wide schemas (e.g., small key + large value columns), this saves significant memory bandwidth. The value columns are reconstructed via `interleave_record_batch` from the original input batches using the same sorted indices.

Spill strategy
- With merge memory reserved (`sort_spill_reservation_bytes > 0`): merge all runs into a single sorted stream before spilling to one file. Fewer files = lower fan-in for the final `MultiLevelMerge`.
- Without a merge reservation (`sort_spill_reservation_bytes == 0`): spill each run as its own file. The multi-level merge handles low merge memory by reducing fan-in.

Dead code removal
Removes `in_mem_sort_stream`, `sort_batch_stream`, `consume_and_spill_append`, `spill_finish`, `organize_stringview_arrays`, and `in_progress_spill_file`.

Config changes
- Adds `sort_coalesce_target_rows` (default 32768).
- Deprecates `sort_in_place_threshold_bytes` (no longer read; `warn` attribute per the API health policy).

Are these changes tested?
- `information_schema.slt` updated for the new config.

Are there any user-facing changes?
- `sort_coalesce_target_rows` (default 32768) controls the coalesce target before sorting.
- `sort_in_place_threshold_bytes` is deprecated.
- Sorts with very low merge memory (`sort_spill_reservation_bytes` near zero) that previously failed now succeed via multi-level merge with reduced fan-in.