Apache Beam SDK for Python — Apache Beam 2.71.0 documentation

Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.

The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.

Overview

The key concepts in this programming model are:

PCollection: represents a collection of data, which could be bounded or unbounded in size.
PTransform: represents a computation that transforms input PCollections into output PCollections.
Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
PipelineRunner: specifies where and how the pipeline should execute.
Read: reads data from an external source.
Write: writes data to an external data sink.

At the top of your source file:

import apache_beam as beam

After this import statement:

Transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
The Pipeline class is available as beam.Pipeline.
Text read/write transforms are available as beam.io.ReadFromText and beam.io.WriteToText.

Examples

The examples subdirectory of the apache_beam package contains sample pipelines.