[Task]: Spark runner flatMap output should not be required to fit in the memory

What needs to happen?

Currently on Spark runner, if single processElement call produces multiple output elements, they all needs to fit in the memory [1]. This is problematic e.g. for ParquetIO, which instead of Source<> based reads uses DoFn and let reader from inside DoFn push all elements to the output. Similar happens with JdbcIO and was discussed here [2].

The goal is to overcome this constraint and allow to produce large output from DoFn on Spark runner.

[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125

[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html

Issue Priority

Priority: 2

Issue Component

Component: runner-spark