perf: another ExternalSorter refactor#21688

Draft
mbutrovich wants to merge 12 commits into apache:main from mbutrovich:externalsorter4


@mbutrovich
Contributor

@mbutrovich mbutrovich commented Apr 17, 2026

Draft while I benchmark.

Which issue does this PR close?

Partially addresses #21543.

Rationale for this change

ExternalSorter sorts each incoming batch individually (typically 8192 rows), then k-way merges all of them. At scale (TPC-H SF10, ~60M rows in lineitem), this produces ~7300 individually-sorted batches feeding the k-way merge with high fan-in.

An earlier PR (#21629) explored coalescing all columns via BatchCoalescer before sorting. That reduced fan-in and gave 1.2-1.5x TPC-H speedups, but caused 2.5x regressions on multi-column StringArray/DictionaryArray sorts, because take had to scatter-gather across string heaps too large to fit in L2 cache.

This PR takes a different approach: extract only sort-key columns, use a cache-friendly sort kernel, and reconstruct full rows afterward.

What changes are included in this PR?

Key-only coalescing with adaptive sort kernel

Replaces ExternalSorter's per-batch sort architecture:

  • Incoming batches accumulate in an input buffer until sort_coalesce_target_rows (default 32768) is reached
  • Only sort-key columns are extracted and concatenated — non-key value columns stay in the original batches untouched
  • Sort kernel is chosen based on schema:
    • Multi-column variable-length keys (Utf8, Binary, StringView, etc.): a RowConverter-based sort encodes the keys into a contiguous, binary-comparable row format, so each comparison is a cache-friendly byte comparison. This avoids the cache misses of cascading multi-column comparisons, where lexsort_to_indices jumps between separate column arrays.
    • Single-column or fixed-width keys: lexsort_to_indices (SIMD-optimized, no encoding overhead)
  • Output reconstruction is hybrid: key columns are gathered via take from the concatenated key batch, and value columns are gathered via interleave from the original input batches
  • On memory pressure, sorted runs spill to disk (merged into one file when headroom is available, one file per run otherwise)
  • At query completion, runs are k-way merged via the existing StreamingMergeBuilder
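The kernel choice described in the list above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `KeyType`, `SortKernel`, and `choose_kernel` are hypothetical names, and it assumes that a multi-column key with at least one variable-length column takes the row-encoded path.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical); only the
/// fixed-width vs. variable-length distinction matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyType {
    Int64,
    Float64,
    Utf8,
    Binary,
    Utf8View,
}

/// The two sort strategies named in the description (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKernel {
    /// Encode keys into a contiguous row format and compare bytes.
    RowEncoded,
    /// Column-at-a-time lexicographic sort (lexsort_to_indices).
    Lexsort,
}

impl KeyType {
    fn is_variable_length(self) -> bool {
        matches!(self, KeyType::Utf8 | KeyType::Binary | KeyType::Utf8View)
    }
}

/// Pick a kernel from the sort-key schema.
/// Assumption: "multi-column variable-length" means more than one key
/// column, at least one of which is variable-length.
pub fn choose_kernel(keys: &[KeyType]) -> SortKernel {
    let multi_column = keys.len() > 1;
    let has_var_len = keys.iter().any(|k| k.is_variable_length());
    if multi_column && has_var_len {
        SortKernel::RowEncoded
    } else {
        SortKernel::Lexsort
    }
}
```

Under these assumptions a single Utf8 key still goes through lexsort, and only the combination of several columns with variable-length data pays the encoding cost.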

This reduces merge fan-in from ~7300 to ~1800 runs at SF10.
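The fan-in figures follow from a ceiling division of the row count by the rows per sorted run. A back-of-the-envelope check, using the rounded 60M row count from the rationale (`sorted_runs` is a hypothetical helper, not part of the PR):

```rust
/// Number of sorted runs produced when `total_rows` rows are sorted in
/// chunks of `rows_per_run` rows (ceiling division).
pub fn sorted_runs(total_rows: u64, rows_per_run: u64) -> u64 {
    assert!(rows_per_run > 0, "rows_per_run must be non-zero");
    (total_rows + rows_per_run - 1) / rows_per_run
}
```

With 60,000,000 rows this gives 7325 runs at 8192 rows per batch and 1832 runs at the 32768-row coalesce target, matching the ~7300 and ~1800 quoted above.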

Why RowConverter for multi-column sort?

Arrow-rs benchmarks (arrow-rs#9683) show RowConverter-based sorting (lexsort_rows) is 1.3-2.5x faster than lexsort_to_indices for multi-column string schemas at 32K rows. The key insight: lexsort_to_indices with multiple columns does cascading comparisons across separate column arrays. At 32K rows with 3 string columns, the sort's random access pattern crosses cache boundaries on every tie-breaking comparison. RowConverter encodes all key columns into one contiguous buffer, so comparisons stay within a single memory region.

This is the same row encoding the streaming merge already uses (via RowCursorStream), and the same encoding that the future radix sort kernel (arrow-rs#9683) will sort on. When radix sort lands, it's a drop-in replacement on the same encoded data.
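To illustrate why a row encoding keeps comparisons inside one memory region, here is a minimal sketch for a two-column key (a string, then an i32). It is not arrow-rs's actual row format: the `Rows` type and the encoding are simplified stand-ins, and the string encoding assumes the text contains no 0x00 byte.

```rust
/// All encoded keys live in one contiguous buffer; `offsets[i]..offsets[i + 1]`
/// delimits row `i`. Hypothetical type, loosely modelled on the idea of a
/// row format, not on arrow-rs internals.
pub struct Rows {
    buffer: Vec<u8>,
    offsets: Vec<usize>,
}

impl Rows {
    pub fn encode(keys: &[(&str, i32)]) -> Self {
        let mut buffer = Vec::new();
        let mut offsets = vec![0];
        for &(text, num) in keys {
            // String column: raw bytes plus a 0x00 terminator, so a prefix
            // sorts before any longer string (assumes no embedded 0x00).
            buffer.extend_from_slice(text.as_bytes());
            buffer.push(0);
            // i32 column: flip the sign bit and store big-endian, so unsigned
            // byte order matches signed integer order.
            buffer.extend_from_slice(&((num as u32) ^ 0x8000_0000).to_be_bytes());
            offsets.push(buffer.len());
        }
        Rows { buffer, offsets }
    }

    pub fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Return the permutation that sorts `keys`, comparing only encoded bytes.
pub fn sort_to_indices(keys: &[(&str, i32)]) -> Vec<usize> {
    let rows = Rows::encode(keys);
    let mut indices: Vec<usize> = (0..rows.len()).collect();
    indices.sort_unstable_by(|&a, &b| rows.row(a).cmp(rows.row(b)));
    indices
}
```

Every comparison is a single byte-slice comparison over the shared buffer, including tie-breaks on the second column, which is the property the paragraph above attributes to RowConverter.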

Why key-only extraction?

Concatenating and reordering non-key columns is wasted work — they don't participate in sort comparisons. For wide schemas (e.g., small key + large value columns), this saves significant memory bandwidth. The value columns are reconstructed via interleave_record_batch from the original input batches using the same sorted indices.
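The key-only flow can be sketched with plain vectors standing in for Arrow arrays. `Batch` and `sort_key_only` are hypothetical names, and the real code works on RecordBatches with the take and interleave kernels rather than on `Vec`s:

```rust
/// A toy batch with one key column and one (potentially wide) value column.
pub struct Batch {
    pub key: Vec<i64>,
    pub value: Vec<String>,
}

/// Sort rows across batches while concatenating only the key column.
pub fn sort_key_only(batches: &[Batch]) -> (Vec<i64>, Vec<String>) {
    // 1. Concatenate the key column only, remembering each row's origin
    //    as (batch index, row index). Value columns are not copied here.
    let mut keys: Vec<i64> = Vec::new();
    let mut origin: Vec<(usize, usize)> = Vec::new();
    for (b, batch) in batches.iter().enumerate() {
        for (r, &k) in batch.key.iter().enumerate() {
            keys.push(k);
            origin.push((b, r));
        }
    }

    // 2. Sort indices over the concatenated keys.
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by_key(|&i| keys[i]);

    // 3. Key output: a "take" from the concatenated key column.
    let sorted_keys: Vec<i64> = indices.iter().map(|&i| keys[i]).collect();

    // 4. Value output: an "interleave" that reads each row straight from
    //    its original batch using the same sorted indices.
    let sorted_values: Vec<String> = indices
        .iter()
        .map(|&i| {
            let (b, r) = origin[i];
            batches[b].value[r].clone()
        })
        .collect();

    (sorted_keys, sorted_values)
}
```

Each value is touched once, when the output is built, instead of once for concatenation and again for reordering.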

Spill strategy

  • With merge headroom (sort_spill_reservation_bytes > 0): merge all runs into a single sorted stream before spilling to one file. Fewer files = lower fan-in for the final MultiLevelMerge.
  • Without headroom (sort_spill_reservation_bytes == 0): spill each run as its own file. The multi-level merge handles low merge memory by reducing fan-in.
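The two cases reduce to a single branch on the reservation setting. A minimal sketch of that decision, where `SpillPlan` and `plan_spill` are hypothetical names and the actual merging and file writing are left out:

```rust
/// How the in-memory sorted runs are written out on memory pressure
/// (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum SpillPlan {
    /// Merge all runs into one sorted stream and write a single file.
    MergedSingleFile,
    /// Write each run to its own file; the final merge handles the fan-in.
    FilePerRun { files: usize },
}

/// Choose a spill plan from the configured merge headroom.
pub fn plan_spill(sort_spill_reservation_bytes: usize, num_runs: usize) -> SpillPlan {
    if sort_spill_reservation_bytes > 0 {
        SpillPlan::MergedSingleFile
    } else {
        SpillPlan::FilePerRun { files: num_runs }
    }
}
```

For example, with 10 buffered runs a reservation of 0 yields 10 spill files, while any non-zero reservation yields one merged file.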

Dead code removal

Removes in_mem_sort_stream, sort_batch_stream, consume_and_spill_append, spill_finish, organize_stringview_arrays, and in_progress_spill_file.

Config changes

  • New: sort_coalesce_target_rows (default 32768)
  • Deprecated: sort_in_place_threshold_bytes (no longer read, warn attribute per API health policy)

Are these changes tested?

  • 6 new unit tests (coalescing, partial flush, per-run spill, merged spill, StringArray key sort, wide schema sort)
  • All 32 sort unit tests pass
  • All sort fuzz, sort query fuzz, and spilling fuzz tests pass
  • information_schema.slt updated for new config

Are there any user-facing changes?

  • New config sort_coalesce_target_rows (default 32768) controls the coalesce target before sorting
  • sort_in_place_threshold_bytes is deprecated
  • Sorts under tight memory budgets (sort_spill_reservation_bytes near zero) that previously failed now succeed via multi-level merge with reduced fan-in

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate physical-plan Changes to the physical-plan crate labels Apr 17, 2026
@mbutrovich
Contributor Author

run benchmark tpch10

env:
   PREFER_HASH_JOIN: false

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4264796101-1401-x4d22 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing externalsorter4 (e13736a) to 4e015db (merge-base) diff using: tpch10
Results will be posted here when complete



@mbutrovich mbutrovich added the performance Make DataFusion faster label Apr 17, 2026
@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Comparing HEAD and externalsorter4
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                       externalsorter4 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      371.31 / 373.70 ±1.75 / 376.12 ms │     371.17 / 373.88 ±2.36 / 377.45 ms │     no change │
│ QQuery 2  │     478.56 / 497.62 ±14.99 / 516.73 ms │     449.65 / 452.84 ±3.60 / 458.23 ms │ +1.10x faster │
│ QQuery 3  │     594.38 / 654.33 ±37.00 / 690.26 ms │     513.52 / 516.79 ±2.79 / 521.36 ms │ +1.27x faster │
│ QQuery 4  │      500.67 / 509.34 ±5.45 / 516.55 ms │     334.41 / 338.17 ±2.17 / 340.48 ms │ +1.51x faster │
│ QQuery 5  │  1057.93 / 1125.63 ±51.77 / 1189.47 ms │ 1060.91 / 1093.75 ±33.38 / 1155.59 ms │     no change │
│ QQuery 6  │      136.33 / 137.81 ±1.74 / 140.81 ms │     137.18 / 138.87 ±2.62 / 144.09 ms │     no change │
│ QQuery 7  │  1533.97 / 1558.86 ±16.62 / 1578.60 ms │ 1411.22 / 1433.35 ±18.94 / 1466.60 ms │ +1.09x faster │
│ QQuery 8  │ 1450.91 / 1745.37 ±340.21 / 2193.19 ms │ 1200.93 / 1216.57 ±12.56 / 1236.07 ms │ +1.43x faster │
│ QQuery 9  │ 2108.58 / 2306.95 ±103.05 / 2400.05 ms │ 1833.74 / 1916.67 ±50.53 / 1963.68 ms │ +1.20x faster │
│ QQuery 10 │     522.49 / 542.53 ±17.72 / 569.04 ms │     518.55 / 527.77 ±7.85 / 539.17 ms │     no change │
│ QQuery 11 │      448.01 / 460.84 ±6.97 / 468.01 ms │     434.45 / 440.04 ±3.86 / 445.56 ms │     no change │
│ QQuery 12 │      292.76 / 297.73 ±6.71 / 310.72 ms │     275.55 / 279.05 ±3.65 / 285.74 ms │ +1.07x faster │
│ QQuery 13 │      369.58 / 373.94 ±4.39 / 381.31 ms │     353.82 / 355.96 ±2.01 / 358.90 ms │     no change │
│ QQuery 14 │      195.09 / 197.38 ±1.94 / 199.66 ms │     192.09 / 195.77 ±2.60 / 199.02 ms │     no change │
│ QQuery 15 │      328.29 / 329.95 ±2.26 / 334.44 ms │     318.89 / 328.39 ±7.11 / 339.31 ms │     no change │
│ QQuery 16 │      118.89 / 122.13 ±2.46 / 125.42 ms │     122.22 / 124.91 ±2.38 / 129.16 ms │     no change │
│ QQuery 17 │ 1564.00 / 1702.80 ±148.38 / 1892.91 ms │  1380.95 / 1396.46 ±9.68 / 1411.33 ms │ +1.22x faster │
│ QQuery 18 │  1551.81 / 1578.92 ±20.49 / 1602.74 ms │ 1438.95 / 1455.37 ±14.13 / 1475.66 ms │ +1.08x faster │
│ QQuery 19 │     276.80 / 288.00 ±16.54 / 320.80 ms │    276.42 / 291.01 ±24.64 / 340.10 ms │     no change │
│ QQuery 20 │      448.94 / 451.07 ±1.88 / 453.34 ms │     440.85 / 447.96 ±4.43 / 453.68 ms │     no change │
│ QQuery 21 │ 2967.01 / 3271.30 ±162.92 / 3459.39 ms │ 2679.88 / 2699.79 ±22.84 / 2744.08 ms │ +1.21x faster │
│ QQuery 22 │      185.03 / 189.67 ±3.78 / 196.16 ms │     153.19 / 157.74 ±4.92 / 167.01 ms │ +1.20x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 18715.88ms │
│ Total Time (externalsorter4)   │ 16181.11ms │
│ Average Time (HEAD)            │   850.72ms │
│ Average Time (externalsorter4) │   735.50ms │
│ Queries Faster                 │         11 │
│ Queries Slower                 │          0 │
│ Queries with No Change         │         11 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

| Metric      | Value    |
|-------------|----------|
| Wall time   | 94.0s    |
| Peak memory | 11.3 GiB |
| Avg memory  | 8.6 GiB  |
| CPU user    | 869.6s   |
| CPU sys     | 72.6s    |
| Peak spill  | 0 B      |

tpch10 — branch

| Metric      | Value    |
|-------------|----------|
| Wall time   | 81.2s    |
| Peak memory | 12.0 GiB |
| Avg memory  | 8.0 GiB  |
| CPU user    | 807.0s   |
| CPU sys     | 62.0s    |
| Peak spill  | 0 B      |

