feat: port zonemap package and wire into Parquet pipeline by g-talbot · Pull Request #6295 · quickwit-oss/quickwit
and others added 11 commits April 13, 2026 07:24

Port the Go zonemap package (DFA-based prefix-preserving superset regex builder) to Rust and integrate it into the Parquet write pipeline. For each string-valued sort schema column, a compact regex is generated that accepts all column values (and possibly more). These regexes enable query-time split pruning: if a predicate cannot match any string accepted by the regex, the split can be skipped entirely.

Zonemap module (quickwit-parquet-engine/src/zonemap/):
- automaton.rs: DFA with weighted pruning and deterministic regex generation, including character class collapsing and suffix factoring
- regex_builder.rs: PrefixPreservingRegexBuilder with progressive pruning during registration and final prune at build time
- minmax.rs: FNV-1a hash-based MinMax builder for future range pruning
- mod.rs: extract_zonemap_regexes() public API for Arrow RecordBatches

Pipeline integration:
- writer.rs: extract zonemap regexes in prepare_write() after row_keys, store as JSON in qh.zonemap_regexes Parquet KV metadata
- split_writer.rs: capture zonemap_regexes in MetricsSplitMetadata
- WriteMetadata type alias carries both row_keys and zonemap through write_to_bytes() and write_to_file_with_metadata()

34 tests covering automaton (regex, pruning, escaping, character classes, disjunctive clauses, long strings, post-prune behavior), regex builder (basic, pruning, progressive), MinMax (string/int hashing, reset, empty), Arrow extraction (basic, nulls, missing columns, special chars, disabled), and full pipeline integration (KV metadata, split writer, JSON roundtrip, write path consistency).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
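The pruning idea in this commit can be illustrated with a much simpler stand-in than the DFA-based builder. The sketch below is not the ported implementation: `PrefixZonemap`, `may_contain`, and `can_skip_split` are hypothetical names, and a set of truncated prefixes replaces the generated regex. It only shows why a superset summary is safe for pruning: false positives are allowed, false negatives are not.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified zonemap: a set of truncated prefixes.
/// Every registered value starts with one of the stored prefixes, so the
/// summary accepts a superset of the column's values.
struct PrefixZonemap {
    max_prefix_chars: usize,
    prefixes: BTreeSet<String>,
}

impl PrefixZonemap {
    fn new(max_prefix_chars: usize) -> Self {
        Self {
            max_prefix_chars,
            prefixes: BTreeSet::new(),
        }
    }

    /// Record a column value by keeping only its leading characters.
    fn register(&mut self, value: &str) {
        let prefix: String = value.chars().take(self.max_prefix_chars).collect();
        self.prefixes.insert(prefix);
    }

    /// Returns false only when no registered value can equal `candidate`.
    fn may_contain(&self, candidate: &str) -> bool {
        self.prefixes
            .iter()
            .any(|prefix| candidate.starts_with(prefix.as_str()))
    }
}

/// An equality predicate can skip a split when the summary rejects its value.
fn can_skip_split(zonemap: &PrefixZonemap, equality_value: &str) -> bool {
    !zonemap.may_contain(equality_value)
}
```

With prefixes of length 4 built from `checkout`, `cheetah`, and `billing`, a lookup for `payments` can skip the split, while `check-in` is accepted even though it was never registered. That false positive costs a wasted read but never a wrong result.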
The pub(crate) re-export was unused in non-test builds, causing -D unused-imports to fail in CI. The constant is only needed by zonemap pipeline_tests, so gate it behind #[cfg(test)]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
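A minimal sketch of the gating pattern described here, assuming a hypothetical constant name `ZONEMAP_KV_KEY` and a stand-in `writer` module (the real constant and module layout are not shown in this commit message). The re-export exists only in test builds, so non-test builds never see an unused import.

```rust
mod writer {
    // Hypothetical name; the key string itself comes from the PR description.
    pub(crate) const ZONEMAP_KV_KEY: &str = "qh.zonemap_regexes";

    // Stand-in for production code that uses the constant directly.
    pub(crate) fn kv_key_len() -> usize {
        ZONEMAP_KV_KEY.len()
    }
}

// Compiled only under `cargo test`; absent from regular builds.
#[cfg(test)]
pub(crate) use writer::ZONEMAP_KV_KEY;

#[cfg(test)]
mod pipeline_tests {
    use super::ZONEMAP_KV_KEY;

    #[test]
    fn key_is_namespaced() {
        assert!(ZONEMAP_KV_KEY.starts_with("qh."));
    }
}
```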
Reorder pub use re-exports and use crate:: imports to satisfy cargo +nightly fmt with group_imports = StdExternalCrate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serde_json::to_string on HashMap<String, String> cannot fail — silently swallowing the error with let Ok() would hide a real bug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port remaining Go builder_test.go cases:
- TestBuildFragmentZoneMap: exact regex string verification
- TestNilSortSchema: empty sort fields → no regexes
- TestNonMutatedResult: builder reuse independence
- TestInvalidUTF8: Unicode BMP handling (Rust variant)
- TestZoneMapForIntColumns: int columns produce no regex
- TestResetWithLsmComparisonCutoff: LSM cutoff truncation

Also fixes extract_zonemap_regexes to respect lsm_comparison_cutoff from the sort schema: only columns before the cutoff get zonemaps, matching Go FragmentZoneMapBuilder.Reset() behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
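The column selection rules these tests exercise can be sketched as follows. This is an illustration under assumptions: `SortColumn`, `ColumnType`, and `zonemap_columns` are hypothetical names, and the cutoff is modeled as an `Option<usize>` index into the sort columns, which may differ from how the sort schema actually stores it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Str,
    Int,
}

/// Hypothetical stand-in for one sort schema column.
struct SortColumn {
    name: &'static str,
    column_type: ColumnType,
}

/// Returns the names of the columns that would get a zonemap regex:
/// string-valued columns located before the cutoff (assumed to be an index).
fn zonemap_columns(
    sort_columns: &[SortColumn],
    lsm_comparison_cutoff: Option<usize>,
) -> Vec<&'static str> {
    // No cutoff means every sort column is eligible.
    let limit = lsm_comparison_cutoff
        .unwrap_or(sort_columns.len())
        .min(sort_columns.len());
    sort_columns[..limit]
        .iter()
        .filter(|column| column.column_type == ColumnType::Str)
        .map(|column| column.name)
        .collect()
}
```

An empty sort schema yields no columns, integer columns are filtered out, and a cutoff of 2 drops every column from index 2 onward.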
Include the complete benchmark_data/integrations file from the Go zonemap package and use it via include_str!() in the benchmark test. Verifies that 584 real integration names with max 64 transitions produces a valid pruned regex. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add exact regex string verification for long service name (matches Go
TestBuildFragmentZoneMap: "^a_very_very_very_very_long_long_.+$") and
metric_name single-value case ("^cpu\.usage$"). Also verifies service
and env dict columns produce identical regexes for identical values.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
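The two regex shapes verified here (an escaped exact match and a truncated prefix followed by `.+`) can be reproduced for the single-value case with a small sketch. It is not the automaton from this PR: `escape_regex_literal` and `single_value_regex` are hypothetical helpers, and the `max_chars` limit is an assumption chosen so that the example output matches the strings quoted above.

```rust
/// Escape regex metacharacters so the text matches only itself.
fn escape_regex_literal(text: &str) -> String {
    let mut escaped = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(
            ch,
            '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
        ) {
            escaped.push('\\');
        }
        escaped.push(ch);
    }
    escaped
}

/// Build an anchored regex for a column holding a single distinct value.
/// Values longer than `max_chars` keep only their prefix and accept any
/// non-empty remainder, which makes the regex a superset of the value.
fn single_value_regex(value: &str, max_chars: usize) -> String {
    if value.chars().count() <= max_chars {
        format!("^{}$", escape_regex_literal(value))
    } else {
        let prefix: String = value.chars().take(max_chars).collect();
        format!("^{}.+$", escape_regex_literal(&prefix))
    }
}
```

Calling `single_value_regex("cpu.usage", 32)` yields `^cpu\.usage$`. A made-up 44-character name such as `a_very_very_very_very_long_long_service_name` with the same limit yields `^a_very_very_very_very_long_long_.+$`.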
Mark completed items in GAP-002 (sort schema parser, configurable directions, timeseries_id, schema-driven sort, metadata storage) and GAP-004 (MetricsSplitMetadata fields, RowKeys, zonemap regexes, sorting_columns, KV metadata). Update ADR-002 Implementation Status to reflect the full PR stack (#6287-#6295). Remaining open items: per-index metastore storage (Phase 32), null ordering fix, Parquet column/offset index enabling, PostgreSQL migration for row_keys + zonemap columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change nulls_first from true to false in sort_batch(), sorting_columns() metadata, and the SS-1 verification sort. Nulls now sort after all non-null values for both ascending and descending columns. This simplifies compaction: when a sort column is absent from a split, all rows are treated as null. With nulls-last, these rows cluster at the end and don't interfere with key-range comparisons between splits. Adds test_nulls_sort_last_ascending_and_descending which writes through the full pipeline with ascending and descending service columns, verifying null rows appear last in both cases. Updates ADR-002 and GAP-002 to mark the null ordering item resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
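The ordering this commit describes can be sketched with plain `Option` values standing in for nullable Arrow columns. The helper names are hypothetical; the point is that the sort direction flips only the comparison between non-null values, while nulls always land at the end.

```rust
use std::cmp::Ordering;

/// Compare two nullable values so that nulls sort after every non-null
/// value, regardless of direction.
fn compare_nulls_last(a: Option<&str>, b: Option<&str>, descending: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => Ordering::Greater,
        (Some(_), None) => Ordering::Less,
        (Some(x), Some(y)) => {
            if descending {
                y.cmp(x)
            } else {
                x.cmp(y)
            }
        }
    }
}

/// Sort a nullable column with nulls last.
fn sort_nulls_last<'a>(
    mut values: Vec<Option<&'a str>>,
    descending: bool,
) -> Vec<Option<&'a str>> {
    values.sort_by(|a, b| compare_nulls_last(*a, *b, descending));
    values
}
```

Sorting `[Some("web"), None, Some("api")]` gives `[Some("api"), Some("web"), None]` ascending and `[Some("web"), Some("api"), None]` descending, so rows from a split that lacks the column cluster at the end in both cases.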