text.SplitterWithOffsets

An abstract base class for splitters that return offsets.

Inherits From: Splitter

text.SplitterWithOffsets(
    name=None
)

Each SplitterWithOffsets subclass must implement the split_with_offsets method, which returns a tuple containing the pieces together with the start and end offsets at which those pieces occur in the input string. For example:

import tensorflow as tf
import tensorflow_text as tf_text

class CharSplitter(tf_text.SplitterWithOffsets):
  def split_with_offsets(self, input):
    # Split into Unicode characters; starts are byte offsets into the input.
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    # Each piece ends where the next one begins; the last ends at the byte length.
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends

  def split(self, input):
    return self.split_with_offsets(input)[0]

pieces, starts, ends = CharSplitter().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]

Methods

`split`

@abc.abstractmethod
split(
    input
)

Splits the input tensor into pieces.

Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().split("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
`input`	An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`.

Returns
An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into.
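
The extra dimension can be pictured with plain Python lists. The following is only a sketch of the shape convention, not the library's implementation: `split_batch` is a hypothetical helper, and nested lists stand in for a `RaggedTensor`. A 1-dimensional input of two strings yields a 2-dimensional result whose rows may differ in length.

```python
def split_batch(strings):
    """Hypothetical helper: split each string in a list on whitespace."""
    # One inner list per input string; inner lengths may differ (ragged).
    return [s.split() for s in strings]

batch = ["small medium large", "tiny"]
pieces = split_batch(batch)
print(pieces)
```

The outer list keeps the input's shape, while the innermost lists hold the pieces of each string.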

`split_with_offsets`

@abc.abstractmethod
split_with_offsets(
    input
)

Splits the input tensor, and returns the resulting pieces with offsets.

Example:

splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
`input`	An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`.

Returns
A tuple `(pieces, start_offsets, end_offsets)` where:

`pieces` is an N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`.
`start_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the starting indices of each piece (byte indices for input strings).
`end_offsets` is an N+1-dimensional integer `Tensor` or `RaggedTensor` containing the exclusive ending indices of each piece (byte indices for input strings).
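
The byte-index, exclusive-end convention can be checked without TensorFlow. The sketch below is a plain-Python illustration under stated assumptions: `whitespace_split_with_offsets` is a hypothetical helper written for this example, not part of the library. It shows that slicing the UTF-8 encoded input with `[start:end]` recovers each piece, and that a multi-byte character advances the offsets by its byte length rather than by one.

```python
def whitespace_split_with_offsets(text):
    """Hypothetical helper: whitespace split with UTF-8 byte offsets."""
    data = text.encode("utf-8")
    pieces, starts, ends = [], [], []
    pos = 0
    for piece in data.split():
        start = data.index(piece, pos)  # byte index where the piece begins
        end = start + len(piece)        # exclusive byte index where it ends
        pieces.append(piece)
        starts.append(start)
        ends.append(end)
        pos = end
    return pieces, starts, ends

text = "a 😊 ccc"
pieces, starts, ends = whitespace_split_with_offsets(text)
print(pieces, starts, ends)

# Slicing the encoded input with each (start, end) pair recovers the piece.
data = text.encode("utf-8")
recovered = [data[s:e] for s, e in zip(starts, ends)]
print(recovered == pieces)
```

Because the offsets count bytes, the emoji occupies the range 2 to 6 even though it is a single character.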