I believe I found the reason for the bug.
In the implementation of class StringLookup(IndexLookup), we find:
    super().__init__(
        max_tokens=max_tokens,
        num_oov_indices=num_oov_indices,
        mask_token=mask_token,
        oov_token=oov_token,
        vocabulary=vocabulary,
        idf_weights=idf_weights,
        invert=invert,
        output_mode=output_mode,
        pad_to_max_tokens=pad_to_max_tokens,
        sparse=sparse,
        name=name,
        vocabulary_dtype="string",
        **kwargs,
    )
    self.encoding = encoding
    self._convert_input_args = False
    self._allow_non_tensor_positional_args = True
    self.supports_jit = False
Note that it invokes the superclass (IndexLookup) constructor before setting self.encoding. Then, in the implementation of IndexLookup.__init__, we find:
    if vocabulary is not None:
        self.set_vocabulary(vocabulary, idf_weights)
But set_vocabulary invokes _tensor_vocab_to_numpy when the vocabulary is a tensor:
    if tf.is_tensor(vocabulary):
        vocabulary = self._tensor_vocab_to_numpy(vocabulary)
That method, which StringLookup overrides, tries to access self.encoding:
    # Overridden methods from IndexLookup.
    def _tensor_vocab_to_numpy(self, vocabulary):
        vocabulary = vocabulary.numpy()
        return np.array(
            [tf.compat.as_text(x, self.encoding) for x in vocabulary]
        )
Since self.encoding has not been set yet at that point in the constructor, an error occurs.
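The call chain above can be reproduced without Keras. The following is only a minimal sketch: IndexLookupSketch, StringLookupSketch, _decode_vocab and try_build are hypothetical stand-ins, not Keras code, and the sketch decodes every vocabulary rather than only tensors. It shows how a base-class constructor that dispatches to an overridden method can run before the subclass has set the attribute that method needs.

```python
class IndexLookupSketch:
    """Hypothetical stand-in for IndexLookup: consumes the vocabulary in __init__."""

    def __init__(self, vocabulary=None):
        self.vocabulary = None
        if vocabulary is not None:
            self.set_vocabulary(vocabulary)

    def set_vocabulary(self, vocabulary):
        # Dispatches to the subclass override while __init__ is still running.
        self.vocabulary = self._decode_vocab(vocabulary)

    def _decode_vocab(self, vocabulary):
        return list(vocabulary)


class StringLookupSketch(IndexLookupSketch):
    """Hypothetical stand-in for StringLookup, using the 3.0.0 ordering."""

    def __init__(self, vocabulary=None, encoding="utf-8"):
        super().__init__(vocabulary=vocabulary)  # may call _decode_vocab
        self.encoding = encoding  # assigned only after the base constructor returns

    def _decode_vocab(self, vocabulary):
        # Reads self.encoding, which does not exist yet during super().__init__.
        return [raw.decode(self.encoding) for raw in vocabulary]


def try_build(vocabulary):
    """Return the decoded vocabulary, or a description of the failure."""
    try:
        return StringLookupSketch(vocabulary=vocabulary).vocabulary
    except AttributeError as exc:
        return f"AttributeError: {exc}"


print(try_build([b"a", b"b"]))
```

In this sketch, constructing the object without a vocabulary succeeds, because the overridden method is never reached during construction.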
It seems version 3.0.0 of Keras introduced this bug. In version 2.15.0, the StringLookup constructor initializes self.encoding before calling the superclass constructor:
    self.encoding = encoding
    super().__init__(
        max_tokens=max_tokens,
        num_oov_indices=num_oov_indices,
        mask_token=mask_token,
        oov_token=oov_token,
        vocabulary=vocabulary,
        vocabulary_dtype=tf.string,
        idf_weights=idf_weights,
        invert=invert,
        output_mode=output_mode,
        sparse=sparse,
        pad_to_max_tokens=pad_to_max_tokens,
        **kwargs
    )
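To illustrate why the 2.15.0 ordering avoids the problem, here is a small sketch using plain Python classes. BaseLookup, EarlyEncodingLookup, LateEncodingLookup and _decode_vocab are hypothetical names used only for illustration, not Keras classes. It compares assigning the attribute before the superclass constructor with the other way around, where supplying the vocabulary has to wait until construction has finished.

```python
class BaseLookup:
    """Hypothetical base class that consumes the vocabulary in __init__."""

    def __init__(self, vocabulary=None):
        self.vocabulary = None
        if vocabulary is not None:
            self.set_vocabulary(vocabulary)

    def set_vocabulary(self, vocabulary):
        self.vocabulary = self._decode_vocab(vocabulary)

    def _decode_vocab(self, vocabulary):
        return list(vocabulary)


class EarlyEncodingLookup(BaseLookup):
    """Sets encoding first, mirroring the 2.15.0 ordering."""

    def __init__(self, vocabulary=None, encoding="utf-8"):
        self.encoding = encoding  # available before the base constructor runs
        super().__init__(vocabulary=vocabulary)

    def _decode_vocab(self, vocabulary):
        return [raw.decode(self.encoding) for raw in vocabulary]


class LateEncodingLookup(BaseLookup):
    """Sets encoding last, mirroring the 3.0.0 ordering."""

    def __init__(self, vocabulary=None, encoding="utf-8"):
        super().__init__(vocabulary=vocabulary)
        self.encoding = encoding

    def _decode_vocab(self, vocabulary):
        return [raw.decode(self.encoding) for raw in vocabulary]


raw_vocab = [b"caf\xc3\xa9", b"tea"]

# Ordering fix: the vocabulary can be passed straight to the constructor.
early = EarlyEncodingLookup(vocabulary=raw_vocab)

# Deferral: build without a vocabulary, then set it once encoding exists.
late = LateEncodingLookup()
late.set_vocabulary(raw_vocab)

print(early.vocabulary, late.vocabulary)
```

On the real layer, the analogous workaround would presumably be to construct StringLookup without a vocabulary and call set_vocabulary afterwards. That is an assumption and has not been verified here against Keras 3.0.0.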
I have reported this bug here.