text.SplitterWithOffsets

An abstract base class for splitters that return offsets.

Inherits From: Splitter

text.SplitterWithOffsets(
    name=None
)

Each SplitterWithOffsets subclass must implement the split_with_offsets method, which returns a tuple containing the pieces together with the start and end offsets at which those pieces occur in the input string. For example:

import tensorflow as tf
import tensorflow_text as tf_text

class CharSplitter(tf_text.SplitterWithOffsets):
  def split_with_offsets(self, input):
    # Split into Unicode characters and get each character's starting byte offset.
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    # Each piece ends where the next one starts; the last ends at the byte length.
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends
  def split(self, input):
    return self.split_with_offsets(input)[0]

pieces, starts, ends = CharSplitter().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
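
The offsets in the output above are byte positions in the UTF-8 encoding rather than character indices, which is why the emoji occupies bytes 1 through 4 and the final piece starts at 5. The plain-Python sketch below reproduces the same numbers without TensorFlow. The helper name char_byte_offsets is hypothetical, and the code only illustrates the offset convention; it is not how the library computes offsets.

```python
def char_byte_offsets(text):
    """Return (pieces, starts, ends) for each character of `text`.

    Offsets are byte positions in the UTF-8 encoding, and ends are exclusive.
    Hypothetical helper, for illustration only.
    """
    pieces, starts, ends = [], [], []
    position = 0
    for char in text:
        encoded = char.encode("utf-8")
        pieces.append(encoded)
        starts.append(position)
        position += len(encoded)  # advance by the character's byte width
        ends.append(position)
    return pieces, starts, ends


pieces, starts, ends = char_byte_offsets("a😊c")
print(pieces, starts, ends)
```

Here starts is [0, 1, 5] and ends is [1, 5, 6], matching the CharSplitter example.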

Methods

split

@abc.abstractmethod
split(
    input
)

Splits the input tensor into pieces.

Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.
Returns
  An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into.
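
To make the N to N+1 relationship concrete, the following plain-Python sketch uses nested lists as a stand-in for Tensor and RaggedTensor values. The helper split_nested is hypothetical and splits on whitespace only; it is meant to show the shape of the result, not to mirror any particular splitter's behavior.

```python
def split_nested(value):
    """Split every string in an arbitrarily nested list on whitespace.

    A scalar string (0 dimensions) becomes a flat list (1 dimension), a list
    of strings (1 dimension) becomes a list of lists (2 dimensions), and so on.
    Hypothetical helper; nested lists stand in for Tensor/RaggedTensor.
    """
    if isinstance(value, str):
        return value.split()
    return [split_nested(item) for item in value]


scalar_result = split_nested("small medium large")
batch_result = split_nested(["small medium large", "a bb"])
print(scalar_result)
print(batch_result)
```

In the batched case the inner lists have different lengths (3 and 2), which is the kind of result a RaggedTensor is designed to represent.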

split_with_offsets

@abc.abstractmethod
split_with_offsets(
    input
)

Splits the input tensor, and returns the resulting pieces with offsets.

Example:

splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
  input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.
Returns
  A tuple (pieces, start_offsets, end_offsets) where:
  • pieces is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor.
  • start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each piece (byte indices for input strings).
  • end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each piece (byte indices for input strings).
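
Because the start offsets are inclusive, the end offsets are exclusive, and both count bytes, each piece can be recovered by slicing the UTF-8 encoded input with [start:end]. The sketch below shows this in plain Python for a single string; pieces_from_offsets is a hypothetical helper, not part of the library.

```python
def pieces_from_offsets(text, starts, ends):
    """Recover pieces by slicing the UTF-8 bytes of `text` with [start:end).

    Hypothetical helper, for illustration only. Slicing must be done on the
    encoded bytes, because the offsets are byte indices, not character indices.
    """
    encoded = text.encode("utf-8")
    return [encoded[start:end] for start, end in zip(starts, ends)]


# Offsets taken from the WhitespaceTokenizer example above.
ascii_pieces = pieces_from_offsets("a bb ccc", [0, 2, 5], [1, 4, 8])
print(ascii_pieces)

# With non-ASCII input, byte slicing and character slicing disagree.
emoji_pieces = pieces_from_offsets("a😊c", [0, 1, 5], [1, 5, 6])
wrong_piece = "a😊c"[1:5]  # character slicing picks up the trailing "c" too
print(emoji_pieces, repr(wrong_piece))
```

For ASCII-only input the byte and character offsets coincide, so the distinction only matters once multi-byte characters appear.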

Last updated 2026-01-30 UTC.