feat: add sorted_series column for DataFusion streaming aggregation by g-talbot · Pull Request #6290 · quickwit-oss/quickwit

@g-talbot changed the base branch from main to gtt/sorted-series-column

April 10, 2026 14:12

@g-talbot changed the base branch from gtt/sorted-series-column to main

April 10, 2026 14:12

@g-talbot changed the base branch from main to gtt/sorted-series-column

April 10, 2026 14:14

alanfgates

chatgpt-codex-connector[bot]

Base automatically changed from gtt/sorted-series-column to main

April 15, 2026 22:27
Compute a composite, lexicographically sortable binary column
(sorted_series) at Parquet write time using the storekey
order-preserving encoding. For each row, the key encodes:

  1. Non-null sort schema tag columns as (ordinal: u8, value: str)
  2. timeseries_id (i64) as final discriminator

Identical timeseries always produce identical byte keys regardless of
timestamp or value. On input sorted by this key, DataFusion's streaming
AggregateExec and BoundedWindowAggExec need only O(1) group state
instead of an O(N) hash table.
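
To illustrate why sorted keys allow constant-memory grouping, here is a
minimal sketch in plain Rust. It is not DataFusion code: streaming_sum is
a hypothetical helper, and it collects results into a Vec only so the
output can be inspected. The point is that the loop keeps a single
(key, accumulator) pair alive at a time.

```rust
/// Sums values per key over rows that are already sorted by key.
/// Only the current group's key and accumulator are held as state;
/// a group is emitted as soon as the key changes.
/// Hypothetical helper for illustration, not a DataFusion API.
fn streaming_sum<'a, I>(rows: I) -> Vec<(Vec<u8>, f64)>
where
    I: IntoIterator<Item = (&'a [u8], f64)>,
{
    let mut out = Vec::new();
    let mut current: Option<(Vec<u8>, f64)> = None;
    for (key, value) in rows {
        // Does this row belong to the group we are currently accumulating?
        let same = matches!(&current, Some((k, _)) if k.as_slice() == key);
        if same {
            if let Some((_, acc)) = current.as_mut() {
                *acc += value;
            }
        } else {
            // Key changed: flush the finished group and start a new one.
            if let Some(done) = current.take() {
                out.push(done);
            }
            current = Some((key.to_vec(), value));
        }
    }
    // Flush the last group, if any.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

If the input were not sorted by key, the same series could show up in
several separate output groups, which is why contiguity of equal keys
matters.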

Also fixes create_nullable_dict_array, which used the original array
index as the dictionary key instead of the position in the unique
values array, causing out-of-bounds panics for mixed null/non-null
inputs.
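
The indexing rule behind that fix can be sketched without Arrow. The
function below is a hypothetical stand-in (dictionary_encode, its
signature, and the deduplication via HashMap are all assumptions, not the
PR's code). It shows keys pointing at positions in the unique values
list, so they stay in bounds even when nulls push the row index past the
number of values.

```rust
use std::collections::HashMap;

/// Dictionary-encodes nullable strings.
/// Returns (unique values, per-row keys); a None key marks a null row.
/// Hypothetical sketch: each key is the position of the value in
/// `values`, never the row index.
fn dictionary_encode(rows: &[Option<&str>]) -> (Vec<String>, Vec<Option<i32>>) {
    let mut values: Vec<String> = Vec::new();
    let mut index_of: HashMap<&str, i32> = HashMap::new();
    let mut keys = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            None => keys.push(None),
            Some(s) => {
                // Position the value would get if it is new.
                let next = values.len() as i32;
                let key = *index_of.entry(*s).or_insert_with(|| {
                    values.push((*s).to_string());
                    next
                });
                keys.push(Some(key));
            }
        }
    }
    (values, keys)
}
```

For the input [null, "a", null, "b", "a"], using the row index as the key
would yield 1, 3, 4 against a two-element values array; the sketch yields
0, 1, 0 instead.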

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without the ordinal, the timeseries_id bytes could collide with a
subsequent tag column's ordinal+string encoding. Every component in
the key now consistently gets an ordinal prefix from its sort schema
position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tests that assert:
- timeseries_id gets ordinal 6 prefix (its sort schema position)
- key length is exact: ordinal(1) + str(2) + ordinal(1) + i64(8) = 12
- when timeseries_id is absent, no trailing ordinal appears
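
The byte layout those tests describe can be sketched as follows. This is
a hand-rolled approximation, not the storekey crate: it assumes strings
are written as UTF-8 bytes plus a 0x00 terminator and i64 values as
big-endian with the sign bit flipped. build_key, push_str, and push_i64
are hypothetical names.

```rust
/// Appends a string as UTF-8 bytes followed by a 0x00 terminator.
/// Assumes the string has no interior NUL bytes.
fn push_str(key: &mut Vec<u8>, s: &str) {
    key.extend_from_slice(s.as_bytes());
    key.push(0x00);
}

/// Appends an i64 as 8 big-endian bytes with the sign bit flipped,
/// so that byte-wise order matches numeric order.
fn push_i64(key: &mut Vec<u8>, v: i64) {
    key.extend_from_slice(&((v as u64) ^ (1u64 << 63)).to_be_bytes());
}

/// Builds a composite key from (ordinal, value) tag pairs and an
/// optional (ordinal, timeseries_id) pair. Null tags are skipped
/// entirely, including their ordinal byte.
fn build_key(tags: &[(u8, Option<&str>)], timeseries_id: Option<(u8, i64)>) -> Vec<u8> {
    let mut key = Vec::new();
    for (ordinal, value) in tags {
        if let Some(s) = value {
            key.push(*ordinal);
            push_str(&mut key, s);
        }
    }
    if let Some((ordinal, id)) = timeseries_id {
        key.push(ordinal);
        push_i64(&mut key, id);
    }
    key
}
```

Under these assumptions, one single-character tag at ordinal 0 plus a
timeseries_id at ordinal 6 gives 1 + 2 + 1 + 8 = 12 bytes, and omitting
the timeseries_id leaves just the 3 tag bytes.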

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Writes a 6-row batch with 4 distinct series (including null tags)
through the ParquetWriter pipeline, reads back, and verifies:

- 4 distinct keys produced (series identity)
- series with 3 rows produces 3 identical keys
- null host differs from present host (ordinal skipping)
- all-null tags differ from partial-null tags
- ordinal bytes are correct (0x00 for metric_name, 0x01 for service,
  0x06 for timeseries_id) even when intermediate tags are null
- equal keys are contiguous after sort (streaming aggregation ready)
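
The contiguity property in the last bullet can be checked with a small
helper. This is a sketch only: equal_keys_contiguous is a hypothetical
name, not the test code from this PR.

```rust
use std::collections::HashSet;

/// Returns true if every distinct key occupies exactly one contiguous
/// run, i.e. no key reappears after a different key has been seen.
fn equal_keys_contiguous(keys: &[Vec<u8>]) -> bool {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut prev: Option<&[u8]> = None;
    for key in keys {
        let key = key.as_slice();
        if prev != Some(key) {
            // A new run starts here; the key must not belong to an earlier run.
            if !seen.insert(key) {
                return false;
            }
            prev = Some(key);
        }
    }
    true
}
```

Any byte-wise sort of the keys satisfies this check, which is the
precondition for the streaming operators to see each series exactly
once.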

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Regenerate storekey entry via dd-rust-license-tool (correct authors)
- Fix 4 rustfmt nightly formatting diffs in sorted_series tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>