Hi,
I’m implementing yet another Tacotron 2 (text-to-speech) model using TF 2.8.0.
It is an encoder/decoder architecture with an attention module that provides an alignment between the input text and the frames of the (mel-)spectrogram to be produced. Learning a good alignment between the mel-spectrogram and the text is the key to this model.
I face a discrepancy between eager and graph mode when training it.
Basically, eager mode succeeds at learning the alignment and the result sounds great, while graph mode fails to produce such an alignment, even though it compiles and runs without any error. Graph mode does seem to learn something, but the output is far from intelligible since it never learns a proper alignment. In both cases the learning curves look like those of any typical converging model.
The main problem is that eager mode is roughly 8-10 times slower.
Is there anything I’m missing that I should watch out for when switching from eager to graph mode? The LSTMCell? The for loop?
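For what it’s worth, switching between the two modes on my side is just a matter of toggling graph tracing on the training step, along the lines of this minimal sketch (train_step and the dataset below are placeholders, not my real code):

import tensorflow as tf

# Set to True to force everything to run eagerly even though the step
# is wrapped in tf.function: this is the "slow but working" configuration.
tf.config.run_functions_eagerly(True)

@tf.function
def train_step(batch):  # placeholder for my real training step
    return tf.reduce_sum(batch)

for batch in tf.data.Dataset.from_tensor_slices(tf.ones((4, 3))).batch(2):
    train_step(batch)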
I can’t share the whole code since it wouldn’t fit in a post, and I have no idea which part is most relevant anyway, since the code compiles without any warning or error. Here is the model’s call method:
def call(self, batch, training=False):
    phon, mels, mels_len = batch
    mels = tf.transpose(mels, perm=[0, 2, 1])
    x = self.tokenizer(phon)                    # TextVectorization module
    y = self.char_embedding(x)                  # Embedding module
    mask = self.char_embedding.compute_mask(x)
    y = self.encoder(y, mask)                   # convolution layers with batch norm + biLSTM
    mels, gates, alignments = self.decoder(y, mels, mask)   # teacher forcing: LSTMCell and attention
    residual = self.decoder.postnet(tf.squeeze(mels, -1))   # convolution layers with batch norm
    mels_post = mels + tf.expand_dims(residual, -1)
    return (mels, mels_post, mels_len), gates, alignments
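As a side note, the tokenizer/embedding pair at the top of call is the stock Keras setup; this stand-alone snippet shows the kind of padding mask compute_mask gives me (the layer arguments are illustrative, not my real hyper-parameters):

import tensorflow as tf

tokenizer = tf.keras.layers.TextVectorization(output_sequence_length=8)
tokenizer.adapt(["a b c", "a b"])
char_embedding = tf.keras.layers.Embedding(
    input_dim=tokenizer.vocabulary_size(), output_dim=4, mask_zero=True)

x = tokenizer(tf.constant(["a b c", "a"]))
mask = char_embedding.compute_mask(x)  # True where the token is not padding
print(mask.numpy())                    # shape (2, 8), False over the padded positions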
The teacher-forcing scheme is implemented as a for loop iterating over the frames of the ground-truth mel-spectrogram:
mels_size = tf.shape(mel_gt)[0]
mels_out = tf.TensorArray(tf.float32, size=mels_size)
gates_out = tf.TensorArray(tf.float32, size=mels_size)
alignments = tf.TensorArray(tf.float32, size=mels_size)
self.prepare_decoder(enc_out)
for i in tf.range(mels_size):
    mel_in = mel_gt[i]
    mel_out, gate_out, alignment = self.decode(mel_in, enc_out, self.W_enc_out, enc_out_mask)  # LSTMCells and attention module
    mels_out = mels_out.write(i, mel_out)
    gates_out = gates_out.write(i, gate_out)
    alignments = alignments.write(i, alignment)
mels_out = mels_out.stack()
gates_out = gates_out.stack()
alignments = alignments.stack()
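To isolate the "LSTMCell inside a for loop" part of the question, here is a self-contained toy version of the same pattern (a tf.range loop stepping an LSTMCell and writing to a TensorArray), run once eagerly and once through tf.function; the layer sizes are arbitrary and the two runs should agree up to float precision, so this toy case at least doesn’t reproduce the problem:

import tensorflow as tf

cell = tf.keras.layers.LSTMCell(8)

def unroll(x):  # x: (time, batch, features)
    steps = tf.shape(x)[0]
    state = cell.get_initial_state(batch_size=tf.shape(x)[1], dtype=tf.float32)
    outputs = tf.TensorArray(tf.float32, size=steps)
    for i in tf.range(steps):
        y, state = cell(x[i], state)
        outputs = outputs.write(i, y)
    return outputs.stack()

x = tf.random.normal((20, 4, 8))
eager_out = unroll(x)               # plain eager execution
graph_out = tf.function(unroll)(x)  # same function traced into a graph
print(tf.reduce_max(tf.abs(eager_out - graph_out)))  # should be ~0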
In eager mode it learns a nice-looking alignment from the first epoch.
In graph mode I can wait for days without getting a usable alignment (it does become roughly diagonal, so it still learns a little something, but never enough for a proper alignment).