feat(bigquery,cubestore): Parquet pre-aggregation export with WIF-native URLs by KrishnaRMaddikara · Pull Request #10499 · cube-js/cube
feat(bigquery,cubestore): Parquet pre-aggregation export with WIF-native URLs

Problem 1 — CSV.gz export is slow and expensive: the BigQuery driver exports pre-aggregation data as CSV.gz. For large tables this means gigabytes of intermediate files. Parquet is 3-5x smaller and is CubeStore's native internal format.

Problem 2 — getSignedUrl() requires SA key bytes (broken on GKE WIF): getSignedUrl() requires service account key bytes to sign URLs, and WIF tokens from the metadata server cannot sign them. The pre-aggregation pipeline fails silently: BigQuery exports fine, but CubeStore gets a 403.

Problem 3 — CubeStore cannot import Parquet (issue cube-js#3051): CubeStore only accepted CSV in its external import path. It already uses parquet/arrow internally for .chunk.parquet files, but CREATE TABLE ... WITH (input_format) lacked Parquet support.

Fix:

packages/cubejs-bigquery-driver/src/BigQueryDriver.ts:
- Export format: CSV.gz -> PARQUET
- URL generation: getSignedUrl() -> gs://bucket/object (IAM-authenticated)
- Return key: csvFile -> parquetFile

packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts:
- Add importParquetFile() method
- Add parquetFile branch in uploadTableWithIndexes()
- Sends: CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'

rust/cubestore/cubestore/src/metastore/mod.rs:
- Add ImportFormat::Parquet variant to the enum

rust/cubestore/cubestore/src/sql/mod.rs:
- Parse input_format = 'parquet' in the WITH clause

rust/cubestore/cubestore/src/import/mod.rs:
- Dispatch ImportFormat::Parquet to do_import_parquet()
- Add do_import_parquet() using DataFusion's ParquetRecordBatchReaderBuilder
- Add arrow_array_value_to_string() helper for Arrow-to-TableValue conversion
- Fix resolve_location() to handle gs:// URLs via the GCS API with a WIF token
- Fix estimate_location_row_count() to skip fs::metadata() for remote URLs

Works with Workload Identity when combined with the GCS WIF fix in gcs.rs.
Backward compatible: Postgres/Snowflake/Redshift pre-aggregations are unaffected.

Closes cube-js#3051
Closes cube-js#9837
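The CubeStoreDriver change boils down to emitting a `CREATE TABLE ... WITH (input_format = 'parquet') LOCATION ...` statement when the upstream driver hands back Parquet URLs. A minimal sketch of that statement builder, assuming a hypothetical `buildParquetImportSql` helper and `Column` shape (the actual driver code differs):

```typescript
// Hypothetical helper: not the real CubeStoreDriver implementation, just
// an illustration of the SQL shape described in the commit message above.
interface Column {
  name: string;
  type: string; // CubeStore column type, e.g. 'int', 'varchar(255)'
}

// Builds the CREATE TABLE statement CubeStore would receive when importing
// Parquet objects addressed by IAM-authenticated gs:// URLs.
function buildParquetImportSql(
  tableName: string,
  columns: Column[],
  locations: string[],
): string {
  const columnDefs = columns.map((c) => `\`${c.name}\` ${c.type}`).join(', ');
  const locationList = locations.map((l) => `'${l}'`).join(', ');
  return (
    `CREATE TABLE ${tableName} (${columnDefs}) ` +
    `WITH (input_format = 'parquet') LOCATION ${locationList}`
  );
}

// Example: a single exported Parquet object, no signed URL involved.
const exampleSql = buildParquetImportSql(
  'dev_pre_aggregations.orders_main',
  [{ name: 'id', type: 'int' }, { name: 'status', type: 'varchar(255)' }],
  ['gs://my-bucket/export/orders_main.parquet'],
);
console.log(exampleSql);
```

Because the URL is a plain `gs://` location rather than a signed HTTPS URL, CubeStore authenticates the fetch itself with its WIF token instead of relying on a signature baked into the URL.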
…uet path

- Make csvFile optional in TableCSVData — BigQuery now returns parquetFile only
- Add parquetFile?: string[] to the TableCSVData interface
- Update isDownloadTableCSVData() to recognise parquetFile
- Delete stale export files before the BQ extract to prevent prefix collisions
- Remove exportBucketCsvEscapeSymbol from the Parquet return (a CSV-specific field)
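The interface widening above can be sketched as follows. The `csvFile` / `parquetFile` field names come from the commit; the remaining fields and the guard body are illustrative assumptions, not the actual Cube source:

```typescript
// Sketch of the widened download-result shape: csvFile becomes optional
// and parquetFile is added alongside it (other fields are illustrative).
interface TableCSVData {
  csvFile?: string[];     // now optional — BigQuery returns parquetFile only
  parquetFile?: string[]; // new: gs:// URLs of exported Parquet objects
  exportBucketCsvEscapeSymbol?: string; // CSV-specific, omitted for Parquet
}

// A type guard in the spirit of isDownloadTableCSVData(): accept a result
// carrying either the legacy csvFile key or the new parquetFile key.
function isDownloadTableCSVData(tableData: unknown): tableData is TableCSVData {
  const data = tableData as TableCSVData;
  return Boolean(
    data &&
    (Array.isArray(data.csvFile) || Array.isArray(data.parquetFile)),
  );
}
```

With this shape, a BigQuery result of `{ parquetFile: ['gs://bucket/obj.parquet'] }` passes the guard even though it carries no `csvFile` key, which is what routes it into the driver's Parquet import branch.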