Querying NDJSON Files in Stage
In Databend, you can directly query NDJSON files stored in stages without first loading the data into tables. This approach is particularly useful for data exploration, ETL processing, and ad-hoc analysis scenarios.
NDJSON (Newline Delimited JSON) is a JSON-based file format in which each line is a complete, valid JSON object. Because every line can be parsed independently of the others, the format is well suited to streaming data processing and big data analytics. For example:
{"id": 1, "title": "Database Fundamentals", "author": "John Doe", "price": 45.50, "category": "Technology"}
{"id": 2, "title": "Machine Learning in Practice", "author": "Jane Smith", "price": 68.00, "category": "AI"}
{"id": 3, "title": "Web Development Guide", "author": "Mike Johnson", "price": 52.30, "category": "Frontend"}
Create an external stage that points to the S3 bucket where your NDJSON files are stored, using your own credentials.
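A minimal sketch of such a stage definition is shown here. The stage name ndjson_query_stage, the bucket path s3://load/ndjson/, and the credential placeholders are illustrative assumptions, not values from your environment; substitute your own.

```sql
-- Hypothetical stage name and bucket path; replace with your own values.
CREATE STAGE ndjson_query_stage
URL = 's3://load/ndjson/'
CONNECTION = (
    ACCESS_KEY_ID = '<your-access-key-id>'
    SECRET_ACCESS_KEY = '<your-secret-access-key>'
);

-- Optional check: list the files visible through the stage.
LIST @ndjson_query_stage;
```

Running LIST first is a quick way to confirm that the URL and credentials are correct before writing any queries against the stage.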
Now you can query the NDJSON files directly from the stage. This example extracts the title and author fields from each JSON object:
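One way to write such a query is sketched below. It assumes a stage named ndjson_query_stage already exists and defines a named file format, ndjson_query_format; both names are illustrative. With NDJSON input, each line is exposed as a single variant column, $1, and individual fields are read with the colon path syntax.

```sql
-- Named file format (hypothetical name); COMPRESSION = AUTO lets Databend
-- detect compressed input on its own.
CREATE OR REPLACE FILE FORMAT ndjson_query_format
    TYPE = NDJSON,
    COMPRESSION = AUTO;

-- $1 is the whole JSON object on each line; $1:title extracts one field.
SELECT $1:title, $1:author
FROM @ndjson_query_stage
(
    FILE_FORMAT => 'ndjson_query_format',
    PATTERN => '.*[.]ndjson'
);
```

For the sample records shown earlier, this would return one row per line with the title and author values.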
If the NDJSON files are compressed with gzip, modify the pattern to match compressed files:
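A sketch of the adjusted query follows. It assumes the same illustrative names as before: a stage called ndjson_query_stage and a named file format called ndjson_query_format created with TYPE = NDJSON and COMPRESSION = AUTO.

```sql
-- Same query shape as for uncompressed files; only the PATTERN changes.
-- Assumes ndjson_query_format (TYPE = NDJSON, COMPRESSION = AUTO) exists.
SELECT $1:title, $1:author
FROM @ndjson_query_stage
(
    FILE_FORMAT => 'ndjson_query_format',
    PATTERN => '.*[.]ndjson[.]gz'
);
```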
Key difference: The pattern .*[.]ndjson[.]gz matches files ending with .ndjson.gz. Databend automatically decompresses gzip files during query execution thanks to the COMPRESSION = AUTO setting in the file format.