
What Exactly Does target_vocab_size Mean In The Method tfds.features.text.SubwordTextEncoder.build_from_corpus?

According to this link, target_vocab_size: int, approximate size of the vocabulary to create. That statement is pretty ambiguous to me. As far as I can understand, the encoder will …

Solution 1:

The documentation says:

Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded

This means that unknown word pieces are broken down into known smaller pieces, falling back to individual characters (and ultimately bytes) when necessary. It's best understood through an example. Suppose you build a SubwordTextEncoder from a very large corpus of English text, so that most common words end up in the vocabulary.

import tensorflow_datasets as tfds

# corpus_sentences: an iterable of strings (the training corpus)
vocab_size = 10000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_sentences, target_vocab_size=vocab_size)
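
Note that target_vocab_size is only a target: the vocabulary that gets built may be somewhat larger or smaller, which is why the documentation calls it "approximate". A minimal sketch (assuming the tokenizer built above) for inspecting what you actually got:

# Inspect the vocabulary that was actually built; its size is only
# approximately equal to the requested target of 10000.
print(tokenizer.vocab_size)       # final vocabulary size, close to but not exactly 10000
print(tokenizer.subwords[:20])    # some of the learned word pieces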

Let's say you try to tokenize the following sentence.

tokenizer.encode("good badwords badxyz")

It will be tokenized as:

  1. good
  2. bad
  3. words
  4. bad
  5. x
  6. y
  7. z

As you can see, since the word piece "xyz" is not in the vocabulary, it is tokenized as individual characters.
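
Because every out-of-vocabulary piece falls back to characters (and ultimately bytes), the encoding can always be reversed. Here is a small sketch of that round trip, assuming the tokenizer built above:

ids = tokenizer.encode("good badwords badxyz")

# Print each id together with the word piece it stands for.
for token_id in ids:
    print(token_id, tokenizer.decode([token_id]))

# Decoding the full sequence recovers the original string exactly.
assert tokenizer.decode(ids) == "good badwords badxyz"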
