perf: another ExternalSorter refactor by mbutrovich · Pull Request #21688 · apache/datafusion

mbutrovich · 2026-04-17T02:14:09Z

Draft while I benchmark.

Which issue does this PR close?

Partially addresses #21543.

Rationale for this change

ExternalSorter sorts each incoming batch individually (typically 8192 rows), then k-way merges all of them. At scale (TPC-H SF10, ~60M rows in lineitem), this produces ~7300 individually-sorted batches feeding the k-way merge with high fan-in.

An earlier PR (#21629) explored coalescing all columns via BatchCoalescer before sorting. That reduced fan-in and gave 1.2-1.5x TPC-H speedups, but caused 2.5x regressions on multi-column StringArray/DictionaryArray sorts due to take scatter-gathering across large string heaps exceeding L2 cache.

This PR takes a different approach: extract only sort-key columns, use a cache-friendly sort kernel, and reconstruct full rows afterward.

What changes are included in this PR?

Key-only coalescing with adaptive sort kernel

Replaces ExternalSorter's per-batch sort architecture:

Incoming batches accumulate in an input buffer until sort_coalesce_target_rows (default 32768) is reached
Only sort-key columns are extracted and concatenated — non-key value columns stay in the original batches untouched
Sort kernel is chosen based on schema:
- Multi-column variable-length keys (Utf8, Binary, StringView, etc.): RowConverter-based sort encodes keys into contiguous binary-comparable row format for cache-friendly byte comparisons. This avoids the multi-column cascading comparison cache problem where lexsort_to_indices jumps between separate column arrays.
- Single-column or fixed-width keys: lexsort_to_indices (SIMD-optimized, no encoding overhead)
Output reconstruction is hybrid: key columns are taked from the concatenated key batch, value columns are interleaved from the original input batches
On memory pressure, sorted runs spill to disk (merged into one file when headroom is available, one file per run otherwise)
At query completion, runs are k-way merged via the existing StreamingMergeBuilder

This reduces merge fan-in from ~7300 to ~1800 runs at SF10.

Why RowConverter for multi-column sort?

Arrow-rs benchmarks (arrow-rs#9683) show RowConverter-based sorting (lexsort_rows) is 1.3-2.5x faster than lexsort_to_indices for multi-column string schemas at 32K rows. The key insight: lexsort_to_indices with multiple columns does cascading comparisons across separate column arrays. At 32K rows with 3 string columns, the sort's random access pattern crosses cache boundaries on every tie-breaking comparison. RowConverter encodes all key columns into one contiguous buffer, so comparisons stay within a single memory region.

This is the same row encoding the streaming merge already uses (via RowCursorStream), and the same encoding that the future radix sort kernel (arrow-rs#9683) will sort on. When radix sort lands, it's a drop-in replacement on the same encoded data.

Why key-only extraction?

Concatenating and reordering non-key columns is wasted work — they don't participate in sort comparisons. For wide schemas (e.g., small key + large value columns), this saves significant memory bandwidth. The value columns are reconstructed via interleave_record_batch from the original input batches using the same sorted indices.

Spill strategy

With merge headroom (sort_spill_reservation_bytes > 0): merge all runs into a single sorted stream before spilling to one file. Fewer files = lower fan-in for the final MultiLevelMerge.
Without headroom (sort_spill_reservation_bytes == 0): spill each run as its own file. The multi-level merge handles low merge memory by reducing fan-in.

Dead code removal

Removes in_mem_sort_stream, sort_batch_stream, consume_and_spill_append, spill_finish, organize_stringview_arrays, and in_progress_spill_file.

Config changes

New: sort_coalesce_target_rows (default 32768)
Deprecated: sort_in_place_threshold_bytes (no longer read, warn attribute per API health policy)

Are these changes tested?

6 new unit tests (coalescing, partial flush, per-run spill, merged spill, StringArray key sort, wide schema sort)
All 32 sort unit tests pass
All sort fuzz, sort query fuzz, and spilling fuzz tests pass
information_schema.slt updated for new config

Are there any user-facing changes?

New config sort_coalesce_target_rows (default 32768) controls the coalesce target before sorting
sort_in_place_threshold_bytes is deprecated
Sorts under tight memory budgets (sort_spill_reservation_bytes near zero) that previously failed now succeed via multi-level merge with reduced fan-in

…erge fan-in. Subset of changes from apache#21600.

mbutrovich · 2026-04-17T02:15:10Z

run benchmark tpch10

env:
   PREFER_HASH_JOIN: false

adriangbot · 2026-04-17T02:17:39Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4264796101-1401-x4d22 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing externalsorter4 (e13736a) to 4e015db (merge-base) diff using: tpch10
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-04-17T02:33:50Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

Comparing HEAD and externalsorter4
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                       externalsorter4 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      371.31 / 373.70 ±1.75 / 376.12 ms │     371.17 / 373.88 ±2.36 / 377.45 ms │     no change │
│ QQuery 2  │     478.56 / 497.62 ±14.99 / 516.73 ms │     449.65 / 452.84 ±3.60 / 458.23 ms │ +1.10x faster │
│ QQuery 3  │     594.38 / 654.33 ±37.00 / 690.26 ms │     513.52 / 516.79 ±2.79 / 521.36 ms │ +1.27x faster │
│ QQuery 4  │      500.67 / 509.34 ±5.45 / 516.55 ms │     334.41 / 338.17 ±2.17 / 340.48 ms │ +1.51x faster │
│ QQuery 5  │  1057.93 / 1125.63 ±51.77 / 1189.47 ms │ 1060.91 / 1093.75 ±33.38 / 1155.59 ms │     no change │
│ QQuery 6  │      136.33 / 137.81 ±1.74 / 140.81 ms │     137.18 / 138.87 ±2.62 / 144.09 ms │     no change │
│ QQuery 7  │  1533.97 / 1558.86 ±16.62 / 1578.60 ms │ 1411.22 / 1433.35 ±18.94 / 1466.60 ms │ +1.09x faster │
│ QQuery 8  │ 1450.91 / 1745.37 ±340.21 / 2193.19 ms │ 1200.93 / 1216.57 ±12.56 / 1236.07 ms │ +1.43x faster │
│ QQuery 9  │ 2108.58 / 2306.95 ±103.05 / 2400.05 ms │ 1833.74 / 1916.67 ±50.53 / 1963.68 ms │ +1.20x faster │
│ QQuery 10 │     522.49 / 542.53 ±17.72 / 569.04 ms │     518.55 / 527.77 ±7.85 / 539.17 ms │     no change │
│ QQuery 11 │      448.01 / 460.84 ±6.97 / 468.01 ms │     434.45 / 440.04 ±3.86 / 445.56 ms │     no change │
│ QQuery 12 │      292.76 / 297.73 ±6.71 / 310.72 ms │     275.55 / 279.05 ±3.65 / 285.74 ms │ +1.07x faster │
│ QQuery 13 │      369.58 / 373.94 ±4.39 / 381.31 ms │     353.82 / 355.96 ±2.01 / 358.90 ms │     no change │
│ QQuery 14 │      195.09 / 197.38 ±1.94 / 199.66 ms │     192.09 / 195.77 ±2.60 / 199.02 ms │     no change │
│ QQuery 15 │      328.29 / 329.95 ±2.26 / 334.44 ms │     318.89 / 328.39 ±7.11 / 339.31 ms │     no change │
│ QQuery 16 │      118.89 / 122.13 ±2.46 / 125.42 ms │     122.22 / 124.91 ±2.38 / 129.16 ms │     no change │
│ QQuery 17 │ 1564.00 / 1702.80 ±148.38 / 1892.91 ms │  1380.95 / 1396.46 ±9.68 / 1411.33 ms │ +1.22x faster │
│ QQuery 18 │  1551.81 / 1578.92 ±20.49 / 1602.74 ms │ 1438.95 / 1455.37 ±14.13 / 1475.66 ms │ +1.08x faster │
│ QQuery 19 │     276.80 / 288.00 ±16.54 / 320.80 ms │    276.42 / 291.01 ±24.64 / 340.10 ms │     no change │
│ QQuery 20 │      448.94 / 451.07 ±1.88 / 453.34 ms │     440.85 / 447.96 ±4.43 / 453.68 ms │     no change │
│ QQuery 21 │ 2967.01 / 3271.30 ±162.92 / 3459.39 ms │ 2679.88 / 2699.79 ±22.84 / 2744.08 ms │ +1.21x faster │
│ QQuery 22 │      185.03 / 189.67 ±3.78 / 196.16 ms │     153.19 / 157.74 ±4.92 / 167.01 ms │ +1.20x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 18715.88ms │
│ Total Time (externalsorter4)   │ 16181.11ms │
│ Average Time (HEAD)            │   850.72ms │
│ Average Time (externalsorter4) │   735.50ms │
│ Queries Faster                 │         11 │
│ Queries Slower                 │          0 │
│ Queries with No Change         │         11 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

Metric	Value
Wall time	94.0s
Peak memory	11.3 GiB
Avg memory	8.6 GiB
CPU user	869.6s
CPU sys	72.6s
Peak spill	0 B

tpch10 — branch

Metric	Value
Wall time	81.2s
Peak memory	12.0 GiB
Avg memory	8.0 GiB
CPU user	807.0s
CPU sys	62.0s
Peak spill	0 B

File an issue against this benchmark runner

mbutrovich and others added 12 commits April 14, 2026 14:48

Rewrite ExternalSorter to coalesce batches before sorting, reducing m…

db0ee41

…erge fan-in. Subset of changes from apache#21600.

Simplify confusing logic.

32a6882

Merge branch 'main' into externalsorter

689dfd2

Merge branch 'main' into externalsorter

0198302

Merge branch 'main' into externalsorter

f5655e2

Merge branch 'main' into externalsorter

c119d1c

Checkpoint after refactor.

cbddca1

Checkpoint after refactor.

2ed5912

Merge branch 'main' into externalsorter3

7618b69

Checkpoint after refactor.

b6f3d32

Checkpoint after refactor.

d5b518c

Merge branch 'main' into externalsorter4

e13736a

github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate physical-plan Changes to the physical-plan crate labels Apr 17, 2026

mbutrovich added the performance Make DataFusion faster label Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: another ExternalSorter refactor#21688

perf: another ExternalSorter refactor#21688
mbutrovich wants to merge 12 commits intoapache:mainfrom
mbutrovich:externalsorter4

mbutrovich commented Apr 17, 2026 •

edited

Loading

Uh oh!

mbutrovich commented Apr 17, 2026

Uh oh!

adriangbot commented Apr 17, 2026

Uh oh!

adriangbot commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mbutrovich commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Key-only coalescing with adaptive sort kernel

Why RowConverter for multi-column sort?

Why key-only extraction?

Spill strategy

Dead code removal

Config changes

Are these changes tested?

Are there any user-facing changes?

Uh oh!

mbutrovich commented Apr 17, 2026

Uh oh!

adriangbot commented Apr 17, 2026

Uh oh!

adriangbot commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mbutrovich commented Apr 17, 2026 •

edited

Loading