How to apply a hierarchical mask in TensorFlow 2.0 (tf.keras)

I am trying to build a hierarchical sequence model for time series classification (see the paper "Hierarchical Attention Networks for Document Classification"), but I am confused about how to mask the hierarchical sequences.

My data is a hierarchical time series. Specifically, each sample is composed of multiple sub-sequences, and each sub-sequence is a multivariate time series (just like word -> sentence -> document in NLP). So I need to pad and mask it twice. This matters because samples generally do not have the same number of sub-sequences, nor do sub-sequences have the same number of time steps. Finally, I get data as follows:

array([[[[0.21799476, 0.26063576],
         [0.2170655 , 0.53772384],
         [0.18505535, 0.30702454],
         [0.22714901, 0.17020395],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.2160176 , 0.23789616],
         [0.2675753 , 0.21807681],
         [0.26932836, 0.21914595],
         [0.26932836, 0.21914595],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]]],

       [[[0.03941338, 0.3380829 ],
         [0.04766269, 0.3031088 ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]]]], dtype=float32)
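For reference, the padding itself is done in two passes, roughly like the sketch below (pad_hierarchical, samples, and n_features are illustrative names, not my actual pipeline): pad each sub-sequence to maxlen_seq steps, then pad each sample to maxlen_event sub-sequences.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad_hierarchical(samples, maxlen_event, maxlen_seq, n_features=2):
    # samples[i][j] is the j-th sub-sequence of the i-th sample,
    # a list of n_features-dimensional feature vectors.
    padded = np.zeros((len(samples), maxlen_event, maxlen_seq, n_features),
                      dtype='float32')
    for i, sample in enumerate(samples):
        # Inner padding: pad every sub-sequence to maxlen_seq steps.
        inner = pad_sequences(sample, maxlen=maxlen_seq,
                              dtype='float32', padding='post')
        # Outer padding: rows of `padded` we never write stay all-zero,
        # which is exactly what the Masking layers treat as padding.
        n = min(len(inner), maxlen_event)
        padded[i, :n] = inner[:n]
    return padded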

Then I build a hierarchical model as follows:

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Input, LSTM, Masking, TimeDistributed

inputs = Input(shape=(maxlen_event, maxlen_seq, 2))
# Inner model: mask padded time steps, then encode each sub-sequence.
x = TimeDistributed(
        Sequential([
            Masking(),
            LSTM(units=8, return_sequences=False)
        ])
    )(inputs)
# Outer LSTM over the sequence of sub-sequence encodings.
x = LSTM(units=32, return_sequences=False)(x)
x = Dense(16, activation='relu')(x)
output = Dense(16, activation='sigmoid')(x)
model = Model(inputs, output)

As my data is padded on both dimensions, I don't know how to mask it correctly. I have two questions:
Q1: Inside TimeDistributed, am I using the Masking layer correctly to mask the first (time-step) padding?
Q2: How do I mask the second (sub-sequence) padding?

Thank you.

Hi @Wwwwei,

Apologies for the delay in response.

Yes, you are using the Masking layer correctly within the TimeDistributed wrapper; that step masks out the padded time steps inside each sub-sequence (the maxlen_seq dimension).

As far as I'm aware, the way to handle the second padding is to apply another Masking layer before the TimeDistributed layer, which masks the fully padded events along the maxlen_event dimension:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Masking, TimeDistributed

inputs = Input(shape=(maxlen_event, maxlen_seq, 2))

# Mask events that are all zeros (padding along the maxlen_event dimension)
masked_inputs = Masking()(inputs)

# Now apply TimeDistributed to process each event separately
x = TimeDistributed(
    Sequential([
        Masking(),
        LSTM(units=8, return_sequences=False)
    ])
)(masked_inputs)
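As a quick sanity check (just a sketch on toy data, not from your pipeline), you can inspect the mask that Masking computes: it marks a step as valid whenever any feature differs from mask_value (0.0 by default), so fully zero-padded rows come out as False.

import numpy as np
import tensorflow as tf

toy = np.zeros((1, 2, 8, 2), dtype='float32')  # (batch, events, steps, features)
toy[0, 0, :4] = 0.5                            # one real event with 4 real steps

mask = tf.keras.layers.Masking().compute_mask(tf.constant(toy))
print(mask.shape)             # (1, 2, 8): True for real steps, False for padding
print(mask.numpy()[0, :, 0])  # [ True False]: the second event is fully padded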

Thank you.
