Port filter_pushdown.rs async tests to sqllogictest #21620

Open
adriangb wants to merge 1 commit into apache:main from adriangb:port-filter-pushdown-tests-to-slt
@adriangb
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

#21160 added datafusion.explain.analyze_categories, which lets EXPLAIN ANALYZE emit only deterministic metric categories (e.g. 'rows'). That unlocked a long-standing blocker on porting tests out of datafusion/core/tests/physical_optimizer/filter_pushdown.rs: previously these tests had to assert on execution state via insta snapshots over hand-wired ExecutionPlan trees and mock TestSource data, which kept them expensive to read, expensive to update, and impossible to test from the user-facing SQL path.

With analyze_categories = 'rows', the predicate=DynamicFilter [ ... ] text on a parquet scan is stable across runs, so the same invariants can now be expressed as plain EXPLAIN ANALYZE SQL in sqllogictest, where they are easier to read, easier to update, and exercise the full SQL → logical optimizer → physical optimizer → execution pipeline rather than a single optimizer rule in isolation.
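
As a rough illustration of why restricting the output to one metric category makes snapshots stable, here is a toy model in Rust. The `Category`, `Metric`, `render`, and `sample_run` names and the metric values are hypothetical and are not DataFusion's API; the sketch only shows that two runs with different timings render identically once timing metrics are filtered out.

```rust
// Toy model only: these types are hypothetical and are not DataFusion's API.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Category {
    Rows,
    Bytes,
    Timing,
}

struct Metric {
    name: &'static str,
    category: Category,
    value: u64,
}

/// Render only the metrics whose category is in `keep`, preserving input order.
fn render(metrics: &[Metric], keep: &[Category]) -> String {
    metrics
        .iter()
        .filter(|m| keep.contains(&m.category))
        .map(|m| format!("{}={}", m.name, m.value))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Build the metrics of one simulated run; only the timing value varies.
fn sample_run(elapsed_ns: u64) -> Vec<Metric> {
    vec![
        Metric { name: "output_rows", category: Category::Rows, value: 128 },
        Metric { name: "bytes_scanned", category: Category::Bytes, value: 4096 },
        Metric { name: "elapsed_compute", category: Category::Timing, value: elapsed_ns },
    ]
}

fn main() {
    // Two runs with different timings produce the same text
    // when only the row-count category is kept.
    let first = render(&sample_run(1_200), &[Category::Rows]);
    let second = render(&sample_run(9_800), &[Category::Rows]);
    assert_eq!(first, second);
    println!("{first}");
}
```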

What changes are included in this PR?

24 end-to-end filter-pushdown tests are ported out of filter_pushdown.rs and deleted. The helpers run_aggregate_dyn_filter_case and run_projection_dyn_filter_case (and their supporting structs) are deleted along with the tests that used them. The 24 synchronous #[test] optimizer-rule-in-isolation tests are untouched — they stay in Rust because they specifically exercise FilterPushdown::new() / OptimizationTest over a hand-built plan.

datafusion/sqllogictest/test_files/push_down_filter_parquet.slt

New tests covering:

  • TopK dynamic filter pushdown integration (100k-row parquet, max_row_group_size = 128, asserting on pushdown_rows_matched = 128 / pushdown_rows_pruned = 99.87 K)
  • TopK single-column and multi-column (compound-sort) dynamic filter shapes
  • HashJoin CollectLeft dynamic filter with struct(a, b) IN (SET) ([...]) content
  • Nested hash joins propagating filters to both inner scans
  • Parent WHERE filter splitting across the two sides of a HashJoin
  • TopK above HashJoin, with both dynamic filters ANDed on the probe scan
  • Dynamic filter flowing through a GROUP BY sitting between a HashJoin and the probe scan
  • TopK projection rewrite — reorder, prune, expression, alias shadowing
  • NULL-bearing build-side join keys
  • LEFT JOIN and LEFT SEMI JOIN dynamic filter pushdown
  • HashTable strategy (hash_lookup) via hash_join_inlist_pushdown_max_size = 1, on both string and integer multi-column keys
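
The TopK integration numbers above can be sanity-checked with a little arithmetic. The sketch below assumes the two counters together account for all 100,000 rows and that the abbreviated rendering uses two decimals with a `K` suffix; the helper names are hypothetical and `abbreviate` is not DataFusion's formatter.

```rust
/// Rows pruned, assuming matched + pruned accounts for every row
/// and that `matched <= total`.
fn pruned_rows(total: u64, matched: u64) -> u64 {
    total - matched
}

/// Hypothetical formatter mimicking an abbreviated "99.87 K" rendering.
/// It only handles the thousands range needed for this example.
fn abbreviate(n: u64) -> String {
    if n >= 1_000 {
        format!("{:.2} K", n as f64 / 1_000.0)
    } else {
        n.to_string()
    }
}

/// Minimum number of row groups needed to hold `total` rows
/// when each group holds at most `max_row_group_size` rows.
fn row_groups(total: u64, max_row_group_size: u64) -> u64 {
    (total + max_row_group_size - 1) / max_row_group_size
}

fn main() {
    let total = 100_000;
    let matched = 128;
    let pruned = pruned_rows(total, matched);
    println!("matched = {}", abbreviate(matched));
    println!("pruned  = {}", abbreviate(pruned));
    println!("row groups >= {}", row_groups(total, 128));
}
```

Under these assumptions, 100,000 − 128 = 99,872 rows are pruned, which abbreviates to the 99.87 K quoted above. The matched count equals max_row_group_size, which is consistent with a single row group's worth of rows surviving the filter, though the description does not state that directly.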

datafusion/sqllogictest/test_files/push_down_filter_regression.slt

New tests covering:

  • Aggregate dynamic filter baseline: MIN(a), MAX(a), MIN(a) + MAX(a), MIN(a) + MAX(b), mixed MIN/MAX with an unsupported expression input, all-NULL input (filter stays true), MIN(a+1) (no filter emitted)
  • WHERE filter on a grouping column pushes through AggregateExec
  • HAVING count(b) > 5 filter stays above the aggregate
  • End-to-end aggregate dynamic filter actually pruning a multi-file parquet scan

The aggregate baseline tests run with analyze_level = summary and analyze_categories = 'none', so that metrics render empty and only the predicate=DynamicFilter [ ... ] content remains. The filter text is deterministic even though the pruning counts depend on parallel-execution scheduling.
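
To make the "filter stays true" case concrete, here is a toy model of a MIN/MAX dynamic filter over a nullable integer column. It is a sketch under assumptions, not DataFusion's implementation: `MinMaxState` is a hypothetical type, and the predicate text it produces is illustrative and may differ from what the real `DynamicFilter [ ... ]` renders.

```rust
// Toy model, not DataFusion's implementation.

/// Running state for MIN(a) and MAX(a) over a nullable i64 column.
#[derive(Default)]
struct MinMaxState {
    min: Option<i64>,
    max: Option<i64>,
}

impl MinMaxState {
    /// Fold one batch into the running bounds, skipping NULLs.
    fn update(&mut self, batch: &[Option<i64>]) {
        for v in batch.iter().flatten() {
            self.min = Some(self.min.map_or(*v, |m| m.min(*v)));
            self.max = Some(self.max.map_or(*v, |m| m.max(*v)));
        }
    }

    /// Predicate a scan could use to skip rows that cannot change the result.
    /// With no non-NULL value seen yet there is no bound, so it stays `true`.
    fn predicate(&self) -> String {
        match (self.min, self.max) {
            (Some(lo), Some(hi)) => format!("a < {lo} OR a > {hi}"),
            _ => "true".to_string(),
        }
    }
}

fn main() {
    let mut state = MinMaxState::default();
    state.update(&[None, None]); // all-NULL batch: no bound is learned
    println!("{}", state.predicate());
    state.update(&[Some(5), None, Some(2)]);
    println!("{}", state.predicate());
}
```

In this model a row with a value between the current bounds can be skipped because it cannot lower the minimum or raise the maximum, while an all-NULL input never establishes a bound and so nothing can be pruned.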

What stayed in Rust

Ten async tests now carry a short // Not portable to sqllogictest: … header explaining why. In short, they either:

  • Hand-wire PartitionMode::Partitioned or a RepartitionExec boundary that SQL never constructs for the sizes of data these tests use
  • Assert via debug-only APIs (HashJoinExec::dynamic_filter_for_test().is_used(), ExecutionPlan::apply_expressions() + downcast_ref::<DynamicFilterPhysicalExpr>) that are not observable from SQL
  • Target the specific stacked-FilterExec shape (the regression from #20109, "FilterPushdown do not generate correct column index when merge FilterExec") that the logical optimizer collapses before physical planning

Are these changes tested?

Yes, the ported tests are the tests. Each ported slt case was generated with cargo test -p datafusion-sqllogictest --test sqllogictests -- <file> --complete, then re-run twice back-to-back without --complete to confirm determinism. The remaining Rust filter_pushdown tests continue to pass (cargo test -p datafusion --test core_integration filter_pushdown → 47 passed, 0 failed). cargo clippy -p datafusion --tests -- -D warnings and cargo fmt --all are clean.

Test plan

  • cargo test -p datafusion-sqllogictest --test sqllogictests -- push_down_filter
  • cargo test -p datafusion --test core_integration filter_pushdown
  • cargo clippy -p datafusion --tests -- -D warnings
  • cargo fmt --all

Are there any user-facing changes?

No. This is a test-only refactor.

🤖 Generated with Claude Code

Port 24 end-to-end filter-pushdown tests out of
`datafusion/core/tests/physical_optimizer/filter_pushdown.rs` into the
sqllogictest suite. The new `datafusion.explain.analyze_categories`
session config lets `EXPLAIN ANALYZE` emit only deterministic metric
categories ('rows'), so these tests can assert directly on the
`predicate=DynamicFilter [ ... ]` text without `<slt:ignore>` scrubbing
around timing/bytes.

## What moved

New/extended tests in
`datafusion/sqllogictest/test_files/push_down_filter_parquet.slt`:

- TopK dynamic filter pushdown (single-col, multi-col sort,
  integration with max_row_group_size=128 and pushdown_rows_matched /
  pushdown_rows_pruned counters)
- HashJoin CollectLeft dynamic filter with `struct(a, b) IN (SET)` shape
- Nested hash joins (filter propagates to both inner scans)
- Parent filter split across the two sides of a HashJoin
- TopK above HashJoin (both dynamic filters ANDed on the probe scan)
- Dynamic filter through a GROUP BY between HashJoin and probe scan
- TopK projection rewrite (reorder, prune, expression, alias shadowing)
- NULL-bearing build-side join keys
- LEFT JOIN and LEFT SEMI JOIN dynamic filter pushdown
- HashTable strategy (`hash_lookup`) via
  `hash_join_inlist_pushdown_max_size = 1` on both string and integer
  multi-column keys

New tests in
`datafusion/sqllogictest/test_files/push_down_filter_regression.slt`:

- Aggregate dynamic filter baseline: MIN(a), MAX(a), MIN(a) + MAX(a),
  MIN(a) + MAX(b), mixed MIN/MAX with unsupported expression input,
  all-NULL input (filter stays `true`), MIN(a+1) (no filter emitted)
- Filter on grouping column pushes through AggregateExec
- Filter on aggregate result (HAVING count > 5) stays above the aggregate
- End-to-end aggregate dynamic filter pruning a multi-file parquet scan

## What stayed in Rust

Ten async tests were marked non-portable with a short comment explaining
why. In short: they either hand-wire `PartitionMode::Partitioned` /
`RepartitionExec` structures SQL never constructs, assert via debug APIs
(`dynamic_filter_for_test()`, `apply_expressions` +
`downcast_ref::<DynamicFilterPhysicalExpr>`) that are not observable
from SQL, or target the specific stacked-`FilterExec` shape that the
logical optimizer collapses before physical planning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Apr 14, 2026
@adriangb requested a review from alamb April 14, 2026 14:40