Skip to content

feat: change approx percentile/median UDFs to return floats#21074

Open
theirix wants to merge 25 commits intoapache:mainfrom
theirix:rm-aggregates-integers
Open

feat: change approx percentile/median UDFs to return floats#21074
theirix wants to merge 25 commits intoapache:mainfrom
theirix:rm-aggregates-integers

Conversation

@theirix
Copy link
Copy Markdown
Contributor

@theirix theirix commented Mar 20, 2026

Which issue does this PR close?

Rationale for this change

  1. Migrating to the modern TypeSignature API: 264030c/datafusion/expr-common/src/signature.rs

  2. Coercing types of approx_percentile_cont, approx_percentile_cont_with_weight, approx_median to floats. It matches PostgreSQL, DuckDB, and ClickHouse behaviour, except for Spark.

What changes are included in this PR?

  • Port remaining UDFs (approx_percentile_cont, approx_percentile_cont_with_weight, approx_median, stub functions) to signature APIs
  • Deprecate INTEGERS and NUMERICS arrays in favour of using the TypeSignature API
  • They are not removed yet, but marked as deprecated to avoid breaking downstream
  • Fix up a SLT for approx_percentile_cont, approx_median to make sure it returns a float

Are these changes tested?

  • Tests are passing
  • Updated tests to expect floats in return types

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 20, 2026
@github-actions github-actions bot added the core Core DataFusion crate label Mar 20, 2026
@github-actions github-actions bot added the optimizer Optimizer rules label Mar 20, 2026
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this up. Some of my concerns which held me off on this effort was the impact on error messaging (though I recall there was a recent PR to clean this up #20605) and if this deals with null type inputs correctly now (see #19458)

Also it seems some of the return types are changing now, I think we should call this out?

Comment thread datafusion/expr-common/src/type_coercion/aggregates.rs
Comment thread datafusion/sqllogictest/test_files/aggregate.slt
Comment thread datafusion/expr-common/src/signature.rs
@theirix
Copy link
Copy Markdown
Contributor Author

theirix commented Mar 23, 2026

Thanks for taking this up. Some of my concerns which held me off on this effort was the impact on error messaging (though I recall there was a recent PR to clean this up #20605) and if this deals with null type inputs correctly now (see #19458)

Also it seems some of the return types are changing now, I think we should call this out?

Thank you for the review!

Added more tests for null - seems like it's consistent. The return value of these UDFs was float before (it is seen in UDF doc and also in other SQL engines), so it's reasonable to expect floats in tests too. Do you mean this change of types?

@theirix theirix requested a review from Jefffrey March 31, 2026 18:20
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more tests for null - seems like it's consistent. The return value of these UDFs was float before (it is seen in UDF doc and also in other SQL engines), so it's reasonable to expect floats in tests too. Do you mean this change of types?

Yes, just referring to how it seems we used to return the input type but now we'll always coerce to float. We can see the return type changing in the SLTs, so would be good to highlight this in the title + PR body (since now we're going beyond a simple refactor)

signature: Signature::one_of(variants, Volatility::Immutable),
}
// Additionally accept an integer number of centroids for T-Digest
let signature = Signature::one_of(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're now coercing to floats, we can remove some of the implementation code that handles integer types

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, removed some code. It also affected approx_median, piggybacking on approx_percentile_cont - updated signature and tests as well.

Clarified the scope of this PR in description - it changes more than expected

@theirix theirix changed the title Refactor percentile UDFs to use TypeSignature API feat: refactor percentiles to TypeSignature, coerce to floats Apr 16, 2026
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to change these to return floats since they are meant to be continuous instead of discrete 👍

@theirix
Copy link
Copy Markdown
Contributor Author

theirix commented Apr 17, 2026

Yes, it makes sense. Thank you for the review!

@Jefffrey Jefffrey changed the title feat: refactor percentiles to TypeSignature, coerce to floats feat: change approx percentile/median UDFs to return floats Apr 17, 2026
@Jefffrey
Copy link
Copy Markdown
Contributor

With the output type changing, I wonder if that is a significant enough change to warrant an entry in the upgrade guide 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor away usage of NUMERICS/INTEGERS in datafusion/expr-common/src/type_coercion/aggregates.rs

2 participants