tfds.deprecated.text.TokenTextEncoder

TextEncoder backed by a list of tokens.

Inherits From: TextEncoder

tfds.deprecated.text.TokenTextEncoder(
    vocab_list,
    oov_buckets=1,
    oov_token='UNK',
    lowercase=False,
    tokenizer=None,
    strip_vocab=True,
    decode_token_separator=' '
)

Tokenization splits on (and drops) non-alphanumeric characters with regex "\W+".
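
The splitting rule above can be illustrated with the standard-library `re` module. This is a minimal sketch of the described behavior, not the library's own tokenizer; the helper name `split_on_non_alphanumeric` is hypothetical.

```python
import re


def split_on_non_alphanumeric(text):
    """Split text on runs of non-word characters and drop the separators."""
    # re.split leaves empty strings when the text starts or ends with a
    # separator, so filter those out.
    return [tok for tok in re.split(r"\W+", text) if tok]


print(split_on_non_alphanumeric("Hello, world! It's 2024."))
# ['Hello', 'world', 'It', 's', '2024']
```

Note that in Python's `re`, `\W` does not match the underscore, so `snake_case` stays a single token, while the apostrophe in "It's" acts as a separator.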

Args

vocab_list list<str>, list of tokens.
oov_buckets int, the number of ints to reserve for OOV hash buckets. Tokens that are OOV will be hash-modded into an OOV bucket in encode.
oov_token str, the string to use for OOV ids in decode.
lowercase bool, whether to make all text and tokens lowercase.
tokenizer Tokenizer, responsible for converting incoming text into a list of tokens.
strip_vocab bool, whether to strip whitespace from the beginning and end of elements of vocab_list.
decode_token_separator str, the string used to separate tokens when decoding.
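
To make the interaction between vocab_list, oov_buckets, and lowercase concrete, here is a small pure-Python sketch of how tokens might be mapped to ids. It is an illustration under stated assumptions, not the library's implementation: the function name `encode_tokens` is hypothetical, the choice of MD5 as the hash is an assumption, and ids are assumed to start at 1 with the OOV ids placed directly after the vocabulary.

```python
import hashlib


def encode_tokens(tokens, vocab_list, oov_buckets=1, lowercase=False):
    """Map tokens to 1-based ids; OOV tokens go to hash buckets after the vocab."""
    if lowercase:
        vocab_list = [t.lower() for t in vocab_list]
        tokens = [t.lower() for t in tokens]
    # Assumption: ids start at 1 (0 is left unused, e.g. for padding).
    token_to_id = {tok: i + 1 for i, tok in enumerate(vocab_list)}
    ids = []
    for tok in tokens:
        if tok in token_to_id:
            ids.append(token_to_id[tok])
        elif oov_buckets > 0:
            # Hash-mod the token into one of the reserved OOV ids.
            # Assumption: MD5 is used only as a stable, deterministic hash.
            digest = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
            ids.append(len(vocab_list) + 1 + digest % oov_buckets)
        else:
            raise ValueError(f"Out-of-vocabulary token: {tok!r}")
    return ids


vocab = ["hello", "world"]
print(encode_tokens(["hello", "foo", "world"], vocab))  # [1, 3, 2]
```

With a single bucket every OOV token shares one id. With more buckets, distinct OOV tokens may land on different ids, and the same token always lands on the same id.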

Attributes

lowercase
oov_token
tokenizer
tokens
vocab_size Size of the vocabulary. Encode produces ints in the range [1, vocab_size).

Methods

decode

decode(
    ids
)

Decodes a list of integers into text.
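
As a rough picture of what decoding involves, the sketch below maps 1-based ids back to tokens, renders anything outside the vocabulary as the OOV string, and joins the result with the separator. It is a pure-Python illustration under those assumptions, not the library's code; `decode_ids` is a hypothetical name.

```python
def decode_ids(ids, vocab_list, oov_token="UNK", decode_token_separator=" "):
    """Turn 1-based ids back into text; ids past the vocab render as oov_token."""
    tokens = []
    for i in ids:
        if 1 <= i <= len(vocab_list):
            tokens.append(vocab_list[i - 1])
        else:
            # Assumption: any id outside the vocab range is an OOV bucket.
            tokens.append(oov_token)
    return decode_token_separator.join(tokens)


print(decode_ids([1, 3, 2], ["hello", "world"]))  # hello UNK world
```

Decoding of this kind is lossy: punctuation dropped during tokenization and the original spelling of OOV tokens cannot be recovered from the ids.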

encode

encode(
    s
)

Encodes text into a list of integers.

load_from_file

@classmethod
load_from_file(
    filename_prefix
)

Load from file. Inverse of save_to_file.

save_to_file

save_to_file(
    filename_prefix
)

Store to file. Inverse of load_from_file.

Last updated 2024-04-26 UTC.