Port filter_pushdown.rs async tests to sqllogictest #21620
Open
adriangb wants to merge 1 commit into apache:main
Port 24 end-to-end filter-pushdown tests out of
`datafusion/core/tests/physical_optimizer/filter_pushdown.rs` into the
sqllogictest suite. The new `datafusion.explain.analyze_categories`
session config lets `EXPLAIN ANALYZE` emit only deterministic metric
categories ('rows'), so these tests can assert directly on the
`predicate=DynamicFilter [ ... ]` text without `<slt:ignore>` scrubbing
around timing/bytes.
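
As a rough illustration, a ported case might look something like the following `.slt` fragment. This is a minimal sketch, not one of the PR's actual tests: the table name `topk_t`, the scratch path, and the data are hypothetical, and the expected-plan section of the final record is deliberately left out rather than guessed.

```sql
# Hypothetical setup: write a small parquet file and register it as a table.
query I
COPY (SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c')) AS v(a, b))
TO 'test_files/scratch/push_down_filter_parquet/topk_t.parquet'
STORED AS PARQUET;
----
3

statement ok
CREATE EXTERNAL TABLE topk_t
STORED AS PARQUET
LOCATION 'test_files/scratch/push_down_filter_parquet/topk_t.parquet';

# Keep only the deterministic 'rows' metric category in EXPLAIN ANALYZE output.
statement ok
set datafusion.explain.analyze_categories = 'rows';

# A TopK query; the parquet scan line of the plan is where the
# `predicate=DynamicFilter [ ... ]` text is expected to appear.
# The `----` results section is omitted here on purpose; running the
# suite with `--complete` would generate it.
query TT
EXPLAIN ANALYZE SELECT a, b FROM topk_t ORDER BY a LIMIT 1;
```

Because only the `'rows'` category is rendered, the generated plan text can be committed as-is instead of being masked with `<slt:ignore>`.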
## What moved
New/extended tests in
`datafusion/sqllogictest/test_files/push_down_filter_parquet.slt`:
- TopK dynamic filter pushdown (single-col, multi-col sort,
integration with max_row_group_size=128 and pushdown_rows_matched /
pushdown_rows_pruned counters)
- HashJoin CollectLeft dynamic filter with `struct(a, b) IN (SET)` shape
- Nested hash joins (filter propagates to both inner scans)
- Parent filter split across the two sides of a HashJoin
- TopK above HashJoin (both dynamic filters ANDed on the probe scan)
- Dynamic filter through a GROUP BY between HashJoin and probe scan
- TopK projection rewrite (reorder, prune, expression, alias shadowing)
- NULL-bearing build-side join keys
- LEFT JOIN and LEFT SEMI JOIN dynamic filter pushdown
- HashTable strategy (`hash_lookup`) via
`hash_join_inlist_pushdown_max_size = 1` on both string and integer
multi-column keys
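
A sketch of how the `hash_lookup` cases could be set up follows. The full config key `datafusion.optimizer.hash_join_inlist_pushdown_max_size` is an assumption (the source gives only the short name), and the tables `build_t` and `probe_t` are hypothetical placeholders that would need to be created first.

```sql
# Assumed full key for the option named in the PR. With the threshold set to 1,
# the build side is expected to exceed it, so the dynamic filter should fall
# back from an IN list to the hash-table (`hash_lookup`) strategy.
statement ok
set datafusion.optimizer.hash_join_inlist_pushdown_max_size = 1;

statement ok
set datafusion.explain.analyze_categories = 'rows';

# Hypothetical tables joined on a multi-column key; the expected plan
# (the `----` section) is intentionally omitted.
query TT
EXPLAIN ANALYZE
SELECT p.a, p.b
FROM build_t AS b
JOIN probe_t AS p ON b.a = p.a AND b.b = p.b;
```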
New tests in
`datafusion/sqllogictest/test_files/push_down_filter_regression.slt`:
- Aggregate dynamic filter baseline: MIN(a), MAX(a), MIN(a) + MAX(a),
MIN(a) + MAX(b), mixed MIN/MAX with unsupported expression input,
all-NULL input (filter stays `true`), MIN(a+1) (no filter emitted)
- Filter on grouping column pushes through AggregateExec
- Filter on aggregate result (HAVING count > 5) stays above the aggregate
- End-to-end aggregate dynamic filter pruning a multi-file parquet scan
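
The aggregate baseline cases can be sketched in the same way. The PR description states that they run with `analyze_level = summary` and `analyze_categories = 'none'`; the fully qualified `datafusion.explain.analyze_level` key and the table `agg_t` below are assumptions made for illustration.

```sql
# Render no metrics at all, so only the plan text (including any
# `predicate=DynamicFilter [ ... ]` entry) remains in the output.
statement ok
set datafusion.explain.analyze_level = summary;

statement ok
set datafusion.explain.analyze_categories = 'none';

# Hypothetical parquet-backed table `agg_t`; the expected plan section is
# left out here and would be generated with `--complete`.
query TT
EXPLAIN ANALYZE SELECT MIN(a) FROM agg_t;
```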
## What stayed in Rust
Ten async tests were marked non-portable with a short comment explaining
why. In short: they either hand-wire `PartitionMode::Partitioned` /
`RepartitionExec` structures SQL never constructs, assert via debug APIs
(`dynamic_filter_for_test()`, `apply_expressions` +
`downcast_ref::<DynamicFilterPhysicalExpr>`) that are not observable
from SQL, or target the specific stacked-`FilterExec` shape that the
logical optimizer collapses before physical planning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Which issue does this PR close?

## Rationale for this change

#21160 added `datafusion.explain.analyze_categories`, which lets `EXPLAIN ANALYZE` emit only deterministic metric categories (e.g. `'rows'`). That unlocked a long-standing blocker on porting tests out of `datafusion/core/tests/physical_optimizer/filter_pushdown.rs`: previously these tests had to assert on execution state via `insta` snapshots over hand-wired `ExecutionPlan` trees and mock `TestSource` data, which kept them expensive to read, expensive to update, and impossible to test from the user-facing SQL path.

With `analyze_categories = 'rows'`, the `predicate=DynamicFilter [ ... ]` text on a parquet scan is stable across runs, so the same invariants can now be expressed as plain `EXPLAIN ANALYZE` SQL in sqllogictest, where they are easier to read, easier to update, and exercise the full SQL → logical optimizer → physical optimizer → execution pipeline rather than a single optimizer rule in isolation.

## What changes are included in this PR?
24 end-to-end filter-pushdown tests are ported out of `filter_pushdown.rs` and deleted. The helpers `run_aggregate_dyn_filter_case` and `run_projection_dyn_filter_case` (and their supporting structs) are deleted along with the tests that used them. The 24 synchronous `#[test]` optimizer-rule-in-isolation tests are untouched; they stay in Rust because they specifically exercise `FilterPushdown::new()` / `OptimizationTest` over a hand-built plan.

### `datafusion/sqllogictest/test_files/push_down_filter_parquet.slt`

New tests covering:
- TopK dynamic filter pushdown (single-col, multi-col sort, integration with `max_row_group_size = 128`, asserting on `pushdown_rows_matched = 128` / `pushdown_rows_pruned = 99.87 K`)
- HashJoin CollectLeft dynamic filter with `struct(a, b) IN (SET) ([...])` content
- Parent `WHERE` filter splitting across the two sides of a HashJoin
- Dynamic filter through a `GROUP BY` sitting between a HashJoin and the probe scan
- `LEFT JOIN` and `LEFT SEMI JOIN` dynamic filter pushdown
- HashTable strategy (`hash_lookup`) via `hash_join_inlist_pushdown_max_size = 1`, on both string and integer multi-column keys

### `datafusion/sqllogictest/test_files/push_down_filter_regression.slt`

New tests covering:
- Aggregate dynamic filter baseline: `MIN(a)`, `MAX(a)`, `MIN(a), MAX(a)`, `MIN(a), MAX(b)`, mixed `MIN`/`MAX` with an unsupported expression input, all-NULL input (filter stays `true`), `MIN(a+1)` (no filter emitted)
- `WHERE` filter on a grouping column pushes through `AggregateExec`
- `HAVING count(b) > 5` filter stays above the aggregate

The aggregate baseline tests run under `analyze_level = summary` + `analyze_categories = 'none'` so that metrics render empty and only the `predicate=DynamicFilter [ ... ]` content remains: the filter text is deterministic even though the pruning counts are subject to parallel-execution scheduling.

### What stayed in Rust
Ten async tests now carry a short `// Not portable to sqllogictest: …` header explaining why. In short, they either:
- hand-wire `PartitionMode::Partitioned` or a `RepartitionExec` boundary that SQL never constructs for the sizes of data these tests use
- assert via debug APIs (`HashJoinExec::dynamic_filter_for_test().is_used()`, `ExecutionPlan::apply_expressions()` + `downcast_ref::<DynamicFilterPhysicalExpr>`) that are not observable from SQL
- target the specific stacked-`FilterExec` shape (FilterPushdown do not generate correct column index when merge FilterExec #20109 regression) that the logical optimizer collapses before physical planning

## Are these changes tested?
Yes: the ported tests are the tests. Each ported slt case was generated with `cargo test -p datafusion-sqllogictest --test sqllogictests -- <file> --complete`, then re-run twice back-to-back without `--complete` to confirm determinism. The remaining Rust `filter_pushdown` tests continue to pass (`cargo test -p datafusion --test core_integration filter_pushdown` → 47 passed, 0 failed). `cargo clippy --tests -D warnings` and `cargo fmt --all` are clean.

### Test plan
- `cargo test -p datafusion-sqllogictest --test sqllogictests -- push_down_filter`
- `cargo test -p datafusion --test core_integration filter_pushdown`
- `cargo clippy -p datafusion --tests -- -D warnings`
- `cargo fmt --all`

## Are there any user-facing changes?
No. This is a test-only refactor.
🤖 Generated with Claude Code