[Bug]: BigtableSource "Desired bundle size 0 bytes must be greater than 0"

In short,

  • if targetParallelism > BigtableSource#getEstimatedSizeBytes(); then
  • desiredBundleSizeBytes is set to 0; which
  • makes BigtableSource#splitKeyRangeIntoBundleSizedSubranges angry.

What happened?

Imagine a case where, in:

long estimatedBytes = source.getEstimatedSizeBytes(options);
long bytesPerBundle = estimatedBytes / targetParallelism;
List<? extends BoundedSource<T>> bundles = source.split(bytesPerBundle, options);
  • targetParallelism is 32; and
  • source.getEstimatedSizeBytes(options) is 10

then

  • bytesPerBundle will be 0

so

List<? extends BoundedSource<T>> bundles = source.split(bytesPerBundle, options);

will be called as source.split(0L, options)
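
To spell out the arithmetic, here is a minimal standalone sketch (plain Java; the 10-byte estimate and the target parallelism of 32 are just the illustrative numbers from above, not values read from a real table):

public class BundleSizeArithmetic {
  public static void main(String[] args) {
    // Integer division truncates toward zero, so a small size estimate
    // divided by a larger target parallelism yields 0, not 1.
    long estimatedBytes = 10L;      // what source.getEstimatedSizeBytes(options) returned
    long targetParallelism = 32L;   // parallelism the runner is aiming for
    long bytesPerBundle = estimatedBytes / targetParallelism;
    System.out.println(bytesPerBundle); // prints 0 -> source.split(0L, options)
  }
}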

In OffsetBasedSource#split, a desired bundle size of 0 is handled:

long desiredBundleSizeOffsetUnits =
    Math.max(Math.max(1, desiredBundleSizeBytes / getBytesPerOffset()), minBundleSize);
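
Plugging the zero in makes the clamp visible (getBytesPerOffset() and minBundleSize are replaced here by illustrative stand-in values of 1 and 0):

long desiredBundleSizeBytes = 0L;  // the zero that came out of the division above
long bytesPerOffset = 1L;          // stand-in for getBytesPerOffset()
long minBundleSize = 0L;           // stand-in for minBundleSize
long desiredBundleSizeOffsetUnits =
    Math.max(Math.max(1, desiredBundleSizeBytes / bytesPerOffset), minBundleSize);
// inner Math.max(1, 0) == 1, so the result is at least 1 even when the input is 0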

But BigtableSource#split does not seem to handle a desired bundle size of 0:

desiredBundleSizeBytes =
    Math.max(sizeEstimate / maximumNumberOfSplits, desiredBundleSizeBytes);
// Delegate to testable helper.
List<BigtableSource> splits =
    splitBasedOnSamples(desiredBundleSizeBytes, getSampleRowKeys(options));
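
One way BigtableSource#split could guard against this, sketched below purely as an illustration of the idea (not the actual fix in BigtableIO), is to apply the same kind of clamp before delegating:

desiredBundleSizeBytes =
    Math.max(sizeEstimate / maximumNumberOfSplits, desiredBundleSizeBytes);
// Hypothetical guard, mirroring OffsetBasedSource#split: never let the
// desired bundle size drop below 1 byte.
desiredBundleSizeBytes = Math.max(1, desiredBundleSizeBytes);
// Delegate to testable helper.
List<BigtableSource> splits =
    splitBasedOnSamples(desiredBundleSizeBytes, getSampleRowKeys(options));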

so a few frames down the road from BigtableSource#split you'll end up violating this checkArgument in BigtableSource#splitKeyRangeIntoBundleSizedSubranges:

checkArgument(
    desiredBundleSizeBytes > 0,
    "Desired bundle size %s bytes must be greater than 0.",
    desiredBundleSizeBytes);
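
For completeness, here is a tiny self-contained reproduction of just that failure mode, using plain Guava's Preconditions.checkArgument (Beam uses a vendored copy, but the behavior of this check is the same): passing 0 trips the precondition and throws an IllegalArgumentException carrying exactly that message.

import static com.google.common.base.Preconditions.checkArgument;

public class CheckArgumentDemo {
  public static void main(String[] args) {
    long desiredBundleSizeBytes = 0L; // the value that reaches splitKeyRangeIntoBundleSizedSubranges
    // Throws IllegalArgumentException: Desired bundle size 0 bytes must be greater than 0.
    checkArgument(
        desiredBundleSizeBytes > 0,
        "Desired bundle size %s bytes must be greater than 0.",
        desiredBundleSizeBytes);
  }
}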

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Java SDK
  • Component: IO connector