Apache Beam SDK for Python — Apache Beam 2.71.0 documentation
Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Overview
The key concepts in this programming model are:
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
- Read: read from an external source.
- Write: write to an external data sink.
Typical usage
At the top of your source file:
import apache_beam as beam
After this import statement:
- Transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
- The Pipeline class is available as beam.Pipeline.
- Text read/write transforms are available as beam.io.ReadFromText, beam.io.WriteToText.
Examples
The examples subdirectory has some examples.