feat: port zonemap package and wire into Parquet pipeline by g-talbot · Pull Request #6295

feat: port zonemap package and wire into Parquet pipeline by g-talbot · Pull Request #6295 · quickwit-oss/quickwit

and others added 11 commits

April 13, 2026 07:24

Port the Go zonemap package (DFA-based prefix-preserving superset regex
builder) to Rust and integrate it into the Parquet write pipeline. For
each string-valued sort schema column, a compact regex is generated that
accepts all column values (and possibly more). These regexes enable
query-time split pruning: if a predicate cannot match any string
accepted by the regex, the split can be skipped entirely.

Zonemap module (quickwit-parquet-engine/src/zonemap/):
- automaton.rs: DFA with weighted pruning and deterministic regex
  generation, including character class collapsing and suffix factoring
- regex_builder.rs: PrefixPreservingRegexBuilder with progressive
  pruning during registration and final prune at build time
- minmax.rs: FNV-1a hash-based MinMax builder for future range pruning
- mod.rs: extract_zonemap_regexes() public API for Arrow RecordBatches

Pipeline integration:
- writer.rs: extract zonemap regexes in prepare_write() after row_keys,
  store as JSON in qh.zonemap_regexes Parquet KV metadata
- split_writer.rs: capture zonemap_regexes in MetricsSplitMetadata
- WriteMetadata type alias carries both row_keys and zonemap through
  write_to_bytes() and write_to_file_with_metadata()

34 tests covering automaton (regex, pruning, escaping, character classes,
disjunctive clauses, long strings, post-prune behavior), regex builder
(basic, pruning, progressive), MinMax (string/int hashing, reset, empty),
Arrow extraction (basic, nulls, missing columns, special chars, disabled),
and full pipeline integration (KV metadata, split writer, JSON roundtrip,
write path consistency).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The pub(crate) re-export was unused in non-test builds, causing
-D unused-imports to fail in CI. The constant is only needed by
zonemap pipeline_tests, so gate it behind #[cfg(test)].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reorder pub use re-exports and use crate:: imports to satisfy
cargo +nightly fmt with group_imports = StdExternalCrate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

serde_json::to_string on HashMap<String, String> cannot fail — silently
swallowing the error with let Ok() would hide a real bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Port remaining Go builder_test.go cases:
- TestBuildFragmentZoneMap: exact regex string verification
- TestNilSortSchema: empty sort fields → no regexes
- TestNonMutatedResult: builder reuse independence
- TestInvalidUTF8: Unicode BMP handling (Rust variant)
- TestZoneMapForIntColumns: int columns produce no regex
- TestResetWithLsmComparisonCutoff: LSM cutoff truncation

Also fixes extract_zonemap_regexes to respect lsm_comparison_cutoff
from the sort schema — only columns before the cutoff get zonemaps,
matching Go FragmentZoneMapBuilder.Reset() behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Include the complete benchmark_data/integrations file from the Go
zonemap package and use it via include_str!() in the benchmark test.
Verifies that 584 real integration names with max 64 transitions
produces a valid pruned regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add exact regex string verification for long service name (matches Go
TestBuildFragmentZoneMap: "^a_very_very_very_very_long_long_.+$") and
metric_name single-value case ("^cpu\.usage$"). Also verifies service
and env dict columns produce identical regexes for identical values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mark completed items in GAP-002 (sort schema parser, configurable
directions, timeseries_id, schema-driven sort, metadata storage) and
GAP-004 (MetricsSplitMetadata fields, RowKeys, zonemap regexes,
sorting_columns, KV metadata). Update ADR-002 Implementation Status
to reflect the full PR stack (#6287-#6295).

Remaining open items: per-index metastore storage (Phase 32), null
ordering fix, Parquet column/offset index enabling, PostgreSQL
migration for row_keys + zonemap columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change nulls_first from true to false in sort_batch(), sorting_columns()
metadata, and the SS-1 verification sort. Nulls now sort after all
non-null values for both ascending and descending columns.

This simplifies compaction: when a sort column is absent from a split,
all rows are treated as null. With nulls-last, these rows cluster at the
end and don't interfere with key-range comparisons between splits.

Adds test_nulls_sort_last_ascending_and_descending which writes through
the full pipeline with ascending and descending service columns,
verifying null rows appear last in both cases.

Updates ADR-002 and GAP-002 to mark the null ordering item resolved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fixes clippy::field_reassign_with_default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>