feat: compute deterministic timeseries_id column at ingest by g-talbot · Pull Request #6286 · quickwit-oss/quickwit

Base automatically changed from gtt/parquet-column-ordering to gtt/docs-claude-md

April 10, 2026 10:57

@g-talbot g-talbot changed the base branch from gtt/docs-claude-md to gtt/parquet-column-ordering

April 10, 2026 10:59

@g-talbot g-talbot changed the base branch from gtt/parquet-column-ordering to main

April 10, 2026 11:00

@g-talbot g-talbot changed the base branch from main to gtt/parquet-column-ordering-v2

April 10, 2026 11:11

Base automatically changed from gtt/parquet-column-ordering-v2 to main

April 13, 2026 23:41

Add a timeseries_id column (Int64) to the metrics Arrow batch,
computed as a SipHash-2-4 hash of the series identity columns
(metric_name, metric_type, and all tags excluding the temporal and
value columns). The hash uses fixed keys for cross-process
determinism.

The column is already declared in the metrics default sort schema
(between host and timestamp_secs), so the Parquet writer now
automatically sorts by it and places it in the correct physical
position.
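As a rough illustration of the scheme described above, the hash can be
sketched in Rust with the standard library's (deprecated but still
available) SipHash-2-4 hasher. The key constants, the EXCLUDED_TAGS
list, and the `timeseries_id` helper name below are placeholders for
illustration, not Quickwit's actual implementation:

```rust
use std::collections::BTreeMap;
#[allow(deprecated)]
use std::hash::{Hasher, SipHasher};

// Fixed keys for cross-process determinism. These values are
// placeholders, not Quickwit's actual constants.
const K0: u64 = 0x0123_4567_89ab_cdef;
const K1: u64 = 0xfedc_ba98_7654_3210;

// Illustrative subset of the temporal/value columns excluded from
// series identity.
const EXCLUDED_TAGS: &[&str] = &["timestamp_secs", "value"];

/// Deterministic series hash over metric_name, metric_type, and the
/// identity tags. BTreeMap iteration is key-sorted, so tag insertion
/// order cannot change the result.
#[allow(deprecated)]
fn timeseries_id(
    metric_name: &str,
    metric_type: &str,
    tags: &BTreeMap<String, String>,
) -> i64 {
    let mut hasher = SipHasher::new_with_keys(K0, K1); // SipHash-2-4
    let mut feed = |s: &str| {
        // Length-prefix every component so ("ab", "c") and
        // ("a", "bc") produce different byte streams.
        hasher.write_u64(s.len() as u64);
        hasher.write(s.as_bytes());
    };
    feed(metric_name);
    feed(metric_type);
    for (key, value) in tags {
        if EXCLUDED_TAGS.contains(&key.as_str()) {
            continue; // temporal/value columns never affect identity
        }
        feed(key);
        feed(value);
    }
    hasher.finish() as i64
}
```

The length prefix is what makes the boundary-ambiguity case
({"ab":"c"} vs {"a":"bc"}) hash differently, and the sorted map is
what makes the result independent of tag order.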

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The timeseries_id hash is persisted to Parquet files, so any change
to it silently corrupts compaction and queries. Add:

- 3 pinned stability tests with hardcoded expected hash values
- 3 proptest properties (order independence, excluded tag immunity,
  extra-tag discrimination) each running 256 random cases
- Boundary ambiguity test ({"ab":"c"} vs {"a":"bc"})
- Same-series-different-timestamp invariant test
- All-excluded-tags coverage (every EXCLUDED_TAGS entry verified)
- Edge cases: empty strings, unicode, 100-tag cardinality
- Module-level doc explaining the stability contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mirror the CTE + FOR UPDATE pattern from delete_splits to prevent
stale-state races. Without row locking, a concurrent
mark_metrics_splits_for_deletion can commit between the state read
and the DELETE, causing spurious FailedPrecondition errors and retry
churn.

The new query locks the target rows before reading their state,
reports not-deletable (Staged/Published) and not-found splits
separately, and only deletes when all requested splits are in
MarkedForDeletion state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
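The locking flow described above can be sketched in PostgreSQL-style
SQL. The table and column names here (metrics_splits, split_id,
split_state) are guesses for illustration and do not match Quickwit's
metastore schema exactly:

```sql
BEGIN;

-- Lock the target rows before reading their state, mirroring the
-- delete_splits pattern. A concurrent mark_metrics_splits_for_deletion
-- now blocks until this transaction finishes instead of racing it.
WITH locked AS (
    SELECT split_id, split_state
    FROM metrics_splits
    WHERE split_id = ANY($1)
    FOR UPDATE
)
SELECT
    -- Splits that exist but are not deletable (Staged/Published).
    (SELECT array_agg(split_id) FROM locked
     WHERE split_state IN ('Staged', 'Published')) AS not_deletable,
    -- Requested split ids that do not exist at all.
    (SELECT array_agg(id) FROM unnest($1::text[]) AS id
     WHERE id NOT IN (SELECT split_id FROM locked)) AS not_found;

-- Issued only when both arrays above came back empty, i.e. every
-- requested split exists and is in MarkedForDeletion state.
DELETE FROM metrics_splits
WHERE split_id = ANY($1)
  AND split_state = 'MarkedForDeletion';

COMMIT;
```

Because the FOR UPDATE locks are held until commit, the state read and
the DELETE see the same rows, which is what eliminates the spurious
FailedPrecondition errors.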

@mattmkim mattmkim deleted the gtt/sorted-series-column branch

April 15, 2026 22:27