Question about input_dim and mask_zero in embedding layers

From the documentation on tf.keras.layers.Embedding:

input_dim:

Integer. Size of the vocabulary, i.e. maximum integer index + 1.

mask_zero:

Boolean, whether or not the input value 0 is a special “padding” value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

  1. If my vocabulary size is n but they are encoded with index values from 1 to n (0 is left for padding), is input_dim equal to n or n+1? The maximum integer index + 1 part of the documentation is confusing me.

  2. If the inputs are padded with zeroes, what are the consequences of leaving mask_zero = False?

  3. If mask_zero = True, based on the documentation, I would have to increment the answer from my first question by one? What is the expected behaviour if this was not done?

Hi @Lu_Bin_Liu

To answer your questions about tf.keras.layers.Embedding (the documentation quoted above covers most of this):

  1. If your vocabulary size is n and indices range from 1 to n (0 is left for padding), then use input_dim=n+1 because input_dim specifies the total number of possible indices, including padding.

  2. If you leave mask_zero=False while the inputs are padded with zeros, the model treats index 0 as an ordinary token: it gets its own learned embedding vector, and every downstream layer processes the padded timesteps as if they were real input. This can lead to inaccurate results, especially with recurrent layers, where masking the padding is often crucial for handling variable-length sequences.

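  3. Yes, use input_dim=n+1, which is the same value as in the first answer. mask_zero=True only tells the layer to generate a mask for index 0; it does not change the size of the embedding table, and as far as I know there is no dedicated "vocabulary size mismatch" check. If you leave input_dim=n while your indices run from 1 to n, index n falls outside the table (valid indices are 0 to input_dim-1). This typically surfaces as an out-of-range index error at lookup time, and on some devices (GPU, for example) the lookup may return zeros instead of raising, so the failure can be silent.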
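
To make the index arithmetic in the three answers concrete, here is a minimal plain-Python sketch. It is a stand-in for the lookup table, not the Keras implementation, and the helper names (make_table, lookup, padding_mask) are hypothetical ones made up for this illustration.

```python
# Plain-Python stand-in for an embedding table; NOT the Keras implementation.
# All helper names here are hypothetical and exist only for illustration.

def make_table(input_dim, output_dim):
    """Build a table with one row per valid index 0 .. input_dim - 1."""
    return [[float(i)] * output_dim for i in range(input_dim)]

def lookup(table, indices):
    """Return one row per index; an out-of-range index raises IndexError."""
    return [table[i] for i in indices]

def padding_mask(indices):
    """True for real tokens, False for padding (index 0)."""
    return [i != 0 for i in indices]

n = 5                     # vocabulary size; tokens are encoded as 1..n
padded = [3, 5, 1, 0, 0]  # one sequence, zero-padded to length 5

# input_dim = n + 1 gives rows 0..5, so both padding (0) and token n (5) fit.
table_ok = make_table(input_dim=n + 1, output_dim=2)
vectors = lookup(table_ok, padded)
mask = padding_mask(padded)

# input_dim = n gives rows 0..4 only, so token index 5 is out of range.
table_small = make_table(input_dim=n, output_dim=2)
try:
    lookup(table_small, padded)
    too_small_fails = False
except IndexError:
    too_small_fails = True
```

In Keras the corresponding layer would be created with tf.keras.layers.Embedding(input_dim=n + 1, output_dim=..., mask_zero=True). Note that the sketch always raises on the out-of-range index, whereas the real layer's behaviour can depend on the device.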