[Bug]: BigtableSource "Desired bundle size 0 bytes must be greater than 0"
In short: if targetParallelism > BigtableSource#getEstimatedSizeBytes, then desiredBundleSizeBytes is set to 0, which makes BigtableSource#splitKeyRangeIntoBundleSizedSubranges angry.
What happened?
Imagine a case where, in:
    long estimatedBytes = source.getEstimatedSizeBytes(options);
    long bytesPerBundle = estimatedBytes / targetParallelism;
    List<? extends BoundedSource<T>> bundles = source.split(bytesPerBundle, options);
targetParallelism is 32 and source.getEstimatedSizeBytes() is 10. Then bytesPerBundle will be 0, because the integer division 10 / 32 truncates to 0, so
    List<? extends BoundedSource<T>> bundles = source.split(bytesPerBundle, options);
will be called with the values: source.split(0L, options)
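The truncation can be reproduced in isolation. This is a minimal sketch, not Beam code: the class and method names are made up for illustration, and both operands are assumed to be long.

```java
// Illustrative only: mirrors the arithmetic quoted above, outside of Beam.
final class BundleSizeDemo {
    // Hypothetical helper; both parameters are assumed to be long here.
    static long bytesPerBundle(long estimatedBytes, long targetParallelism) {
        // Integer division truncates toward zero, so any estimate
        // smaller than the parallelism yields 0.
        return estimatedBytes / targetParallelism;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerBundle(10L, 32L)); // prints 0
        System.out.println(bytesPerBundle(64L, 32L)); // prints 2
    }
}
```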
In OffsetBasedSource#split, this desired-0-sized split is handled:
    long desiredBundleSizeOffsetUnits =
        Math.max(Math.max(1, desiredBundleSizeBytes / getBytesPerOffset()), minBundleSize);
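To see why that expression tolerates a desired size of 0, the same clamp can be run standalone. This is a sketch under assumptions: the helper name is hypothetical, and bytesPerOffset and minBundleSize are passed in as plain parameters rather than read from a source object.

```java
// Illustrative only: the clamp from the snippet above, extracted into a
// hypothetical static helper so it can be run without Beam.
final class OffsetClampDemo {
    static long desiredBundleSizeOffsetUnits(
            long desiredBundleSizeBytes, long bytesPerOffset, long minBundleSize) {
        // The inner max forces at least 1 offset unit; the outer max
        // then applies the configured minimum bundle size.
        return Math.max(Math.max(1, desiredBundleSizeBytes / bytesPerOffset), minBundleSize);
    }

    public static void main(String[] args) {
        System.out.println(desiredBundleSizeOffsetUnits(0L, 1L, 0L));  // prints 1
        System.out.println(desiredBundleSizeOffsetUnits(0L, 8L, 16L)); // prints 16
    }
}
```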
But BigtableSource#split does not seem to handle the desired-0-sized split:
    desiredBundleSizeBytes =
        Math.max(sizeEstimate / maximumNumberOfSplits, desiredBundleSizeBytes);
    // Delegate to testable helper.
    List<BigtableSource> splits =
        splitBasedOnSamples(desiredBundleSizeBytes, getSampleRowKeys(options));
So a few frames down from BigtableSource#split, you end up violating this checkArgument in BigtableSource#splitKeyRangeIntoBundleSizedSubranges:
    checkArgument(
        desiredBundleSizeBytes > 0,
        "Desired bundle size %s bytes must be greater than 0.",
        desiredBundleSizeBytes);
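One possible guard, mirroring the Math.max(1, ...) used in OffsetBasedSource#split, would be to clamp the value before it reaches the helper. The sketch below is an assumption about what a fix could look like, not the actual patch: the class and method are hypothetical, and the split cap of 4000 in the usage lines is only an example value. It also shows why the existing Math.max does not help: when sizeEstimate is smaller than maximumNumberOfSplits, that quotient truncates to 0 as well.

```java
// Hypothetical sketch of a guard; not part of BigtableIO.
final class BigtableSplitGuardSketch {
    static long effectiveBundleSizeBytes(
            long desiredBundleSizeBytes, long sizeEstimate, long maximumNumberOfSplits) {
        // Existing logic from the quoted snippet: never go below the size
        // implied by the cap on the number of splits. This can still be 0.
        long fromSplitCap = Math.max(sizeEstimate / maximumNumberOfSplits, desiredBundleSizeBytes);
        // Assumed fix: force at least 1 byte so the checkArgument holds.
        return Math.max(1L, fromSplitCap);
    }

    public static void main(String[] args) {
        // 4000 is an example cap chosen for illustration.
        System.out.println(effectiveBundleSizeBytes(0L, 10L, 4000L));   // prints 1
        System.out.println(effectiveBundleSizeBytes(0L, 8000L, 4000L)); // prints 2
    }
}
```

With a guard like this, a tiny table would simply be read as bundles of at least one byte rather than failing the precondition.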
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)