Skip to content

[metrics] Support DDSketch in the parquet pipeline#6257

Open
mattmkim wants to merge 5 commits intomainfrom
matthew.kim/parquet-sketches
Open

[metrics] Support DDSketch in the parquet pipeline#6257
mattmkim wants to merge 5 commits intomainfrom
matthew.kim/parquet-sketches

Conversation

@mattmkim
Copy link
Copy Markdown
Contributor

@mattmkim mattmkim commented Mar 31, 2026

Description

This PR can be reviewed commit by commit.

This PR updates the parquet pipeline to process DDSketches. See https://datadoghq.atlassian.net/wiki/spaces/QKHS/pages/6291357728/DDSketch+in+Parquet for more information about the DDSketch spec.

How was this PR tested?

Describe how you tested this PR.

@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from db0e0db to 86c034b Compare March 31, 2026 21:14
@mattmkim mattmkim changed the title [draft] parquet ddsketch engine [metrics] Support DDSketch in the parquet pipeline Mar 31, 2026
@mattmkim mattmkim marked this pull request as ready for review March 31, 2026 21:30
@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from c3fc790 to 2261237 Compare March 31, 2026 21:38
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 727f085864

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-indexing/src/actors/publisher.rs
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated
@fulmicoton-dd
Copy link
Copy Markdown
Collaborator

i cannot review this PR due to lack of context. I know what DDSketch are, but I do not know what they are used for in the context the metrics ingestion pipeline, why they are stored in different files, etc.

Base automatically changed from matthew.kim/metrics-wide-schema to main April 6, 2026 19:48
@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from 2261237 to 6e7d6a9 Compare April 7, 2026 18:50
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e7d6a90ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated

let keys_inner = keys_builder.values();
for &k in &dp.keys {
keys_inner.append_value(k);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we store the length of keys/counts? I.e. how do we recover the number of elements per data_point from _inner arrays?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are stored implicitly by Arrow - it stores an array of offsets, so you can get the length of row i with offsets[i+1] - offsets[i]

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37f4298f66

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/storage/split_writer.rs Outdated
Comment thread quickwit/quickwit-parquet-engine/src/split/metadata.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a38c47b7c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/storage/writer.rs Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dff8e08fa0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40cfe5a158

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 1949 to +1951
for metadata in &splits_metadata {
let insertable =
InsertableMetricsSplit::from_metadata(metadata, MetricsSplitState::Staged)
.map_err(|e| MetastoreError::JsonSerializeError {
struct_name: "MetricsSplitMetadata".to_string(),
message: e.to_string(),
})?;
let insertable = InsertableParquetSplit::from_metadata(metadata, SplitState::Staged)
.map_err(|err| MetastoreError::JsonSerializeError {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject split-kind mismatches during parquet staging

stage_parquet_splits_impl chooses the destination table from the RPC (kind) but never verifies that each ParquetSplitMetadata.kind matches that table. A mixed or misrouted request can therefore persist sketch metadata into metrics_splits (or vice versa), violating table invariants and causing downstream routing/listing logic that relies on split kind to query the wrong API/table. Add an explicit metadata.kind == kind check and fail fast on mismatch before building insert rows.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a35fdb8600

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from a35fdb8 to 3bbfb7f Compare April 14, 2026 23:27
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bbfb7ff26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/parquet_model.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bbfb7ff26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d430881b0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1422 to +1424
let index_uid = request.index_uid().clone();
let splits_metadata = request.deserialize_splits_metadata()?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce request index UID for staged sketch metadata

stage_sketch_splits deserializes and forwards splits_metadata_json without checking that each ParquetSplitMetadata.index_uid matches request.index_uid, so a caller can submit StageSketchSplitsRequest.index_uid = A while embedding metadata for B. In file-backed mode this stores rows under index A whose serialized metadata claims B, which breaks request-scoped invariants and can cause later list/publish responses to carry inconsistent index identities; this path should validate equality (or overwrite metadata index UID from the request) before staging.

Useful? React with 👍 / 👎.

@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from 1d43088 to ad84838 Compare April 15, 2026 05:27
@mattmkim mattmkim force-pushed the matthew.kim/parquet-sketches branch from ad84838 to 0655391 Compare April 15, 2026 05:37

/// Returns whether the given index ID uses the Parquet/DataFusion pipeline.
pub fn is_parquet_pipeline_index(index_id: &str) -> bool {
is_metrics_index(index_id) || is_sketches_index(index_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for now, but we'd be adding every parquet pipeline here every time we add one.

Also, every index that starts with sketches- doesn't necessarily have to be a Parquet pipeline.

The "fix" seems to be creating something like a ParquetMapping, and that's like a whole thing, so we can get back to it later.

update_timestamp TIMESTAMP NOT NULL DEFAULT (CURRENT_TIMESTAMP AT TIME ZONE 'UTC'),
publish_timestamp TIMESTAMP,

FOREIGN KEY(index_uid) REFERENCES indexes(index_uid)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a ON DELETE CASCADE here? If we ever need to delete this, we'd have to clean up the table ourselves first.

The metrics table also doesn't include this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants