text.WhitespaceTokenizer

Tokenizes a tensor of UTF-8 strings on whitespaces.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.WhitespaceTokenizer()

Methods

split

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

tokenize(
    input
)

Tokenizes a tensor of UTF-8 strings on whitespaces.

The strings are split on ICU-defined whitespace characters, and those whitespace characters are dropped from the output.

Example:

WhitespaceTokenizer().tokenize("small medium large")
<tf.Tensor: shape=(3,), dtype=string,
 numpy=array([b'small', b'medium', b'large'], dtype=object)>

Args
  input: A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
  A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string.

tokenize_with_offsets

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 strings on whitespaces.

The strings are split on ICU-defined whitespace characters, and those whitespace characters are dropped from the output.

Example:

splitter = WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
  input: A RaggedTensor or Tensor of UTF-8 strings with any shape.

Returns
  A tuple (tokens, start_offsets, end_offsets) where:
  • tokens: A RaggedTensor of tokenized text.
  • start_offsets: A RaggedTensor of the tokens' starting byte offsets.
  • end_offsets: A RaggedTensor of the tokens' ending byte offsets (exclusive, i.e., one past the last byte of each token).

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-01-30 UTC.