Tensorflow.js converter is changing my input parameter sizes?

Hi! I’m trying to convert two models from TensorSpeech/TensorflowTTS into tensorflow.js, but the expected input dimensions seem to be set to a specific number instead of being allowed to vary. E.g. I expect the input shape for my mel spectrogram generator to be [-1, -1], but tensorflowjs_wizard converts it to [-1, 10], which only lets me input exactly 10 phonemes. The output of this spectrogram generator is a different size than the conversion wizard makes the vocoder model accept. Is there a setting in the wizard or in tensorflow.js I am overlooking?

TensorflowTTS: GitHub - TensorSpeech/TensorFlowTTS (Real-Time State-of-the-art Speech Synthesis for TensorFlow 2; supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
I am using this notebook: Google Colab, simply running fastspeech2.save and mb_melgan.save to export the spectrogram and vocoder models respectively, then using tensorflowjs_wizard to convert them to tensorflow.js models. The exact JS error I am getting is this (when trying to pass 14 phonemes to the spectrogram model instead of 10):

Error: The shape of dict['input_1'] provided in model.execute(dict) must be [-1,10], but was [1,14]

My code: https://hastebin.com/fuyoyebije.js
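
For reference, the export step in that notebook boils down to roughly the following (the output directory names here are just placeholders I'm using for illustration):

# Export the two TensorFlowTTS models as SavedModels (rough sketch of the notebook step)
fastspeech2.save('./text2mel')   # mel spectrogram generator
mb_melgan.save('./vocoder')      # vocoder

Those two SavedModel directories are then what I point tensorflowjs_wizard at.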

Welcome to the forum and thanks for posting - sounds like a pretty cool project you are attempting there. Let me check in with folks and see what may be causing this.

Are you able to share the input SavedModel with us so we can check what may be happening here? If you can share it via Google Drive or similar, I can ask our team to take a look. If you cannot share it publicly via a link here, let me know and I can drop you a message on LinkedIn/Twitter with my work email so you can share it privately for debugging purposes. Thanks!

Sure! Thanks a ton for helping :slightly_smiling_face:
This folder contains the original SavedModel and the tensorflow.js model converted with tensorflowjs_wizard. The SavedModel carries the architecture information, but for reference it is an instance of FastSpeech2 from TensorflowTTS.

https://drive.google.com/drive/folders/1MzFQL_Z2HYgan_TGSL3xgOuOsCi7BgdG?usp=sharing

Analysing this file leads to the following signature definitions:

The given SavedModel SignatureDef contains the following input(s):
  inputs['input_1'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 10)
      name: serving_default_input_1:0
  inputs['input_2'] tensor_info:
      dtype: DT_INT32
      shape: (-1)
      name: serving_default_input_2:0
  inputs['input_3'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: serving_default_input_3:0
  inputs['input_4'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: serving_default_input_4:0
  inputs['input_5'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: serving_default_input_5:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 80)
      name: StatefulPartitionedCall:0
  outputs['output_2'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 80)
      name: StatefulPartitionedCall:1
  outputs['output_3'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 10)
      name: StatefulPartitionedCall:2
  outputs['output_4'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 10)
      name: StatefulPartitionedCall:3
  outputs['output_5'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 10)
      name: StatefulPartitionedCall:4

It seems the SavedModel you are using has the shape (-1, 10) as the default input there, so the TFJS conversion also adheres to the SavedModel definition shown above.
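
If you want to double-check that yourself before converting, here is a minimal sketch that prints the serving signature (assuming the SavedModel sits in a local directory, called text2mel here purely as a placeholder; saved_model_cli show reports the same information):

import tensorflow as tf

# Load the SavedModel and print its serving signature specs.
# 'text2mel' is a placeholder path to the exported model directory.
loaded = tf.saved_model.load('text2mel')
serving_fn = loaded.signatures['serving_default']
print(serving_fn.structured_input_signature)   # shows the (-1, 10) int32 input
print(serving_fn.structured_outputs)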

PS on a side note if you get this working I would love to see the end demo :slight_smile:

I got it!
I had to change the parameters of the save function to use the same signature as inference.

fastspeech2.save('text2mel', signatures=fastspeech2.inference)

For whatever reason, the signature that save was picking up before was traced from dummy data used elsewhere in the code.
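
For anyone hitting the same thing with another model, the general pattern is to export through a tf.function whose input signature uses dynamic (None) dims. This is only a sketch: model and the single int32 input_ids tensor are stand-ins for your own model and its real arguments, not the actual FastSpeech2 interface:

import tensorflow as tf

# Sketch only: 'model' and 'input_ids' are placeholders, not TensorFlowTTS code.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None], dtype=tf.int32, name='input_ids')])
def serve(input_ids):
    return model(input_ids)

# An explicit signature keeps the exported dims dynamic (-1, -1) instead of
# freezing whatever shapes save() happened to trace last (e.g. a dummy batch
# of 10 phonemes).
model.save('text2mel', signatures=serve)

Passing fastspeech2.inference as the signature works for the same reason: it is exposed with dynamic dims like this.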

Hey, I got it working so here’s that demo :slight_smile: TensorflowTTS in Tensorflow.js
Unfortunately it seems to run extremely slowly, even after quantizing it to a much smaller size. I’ve already tried all the different backends (it is not compatible with WASM), and had no luck making it faster than WebGL. I don’t know very much about FastSpeech2, MB-MelGAN, or TensorFlow, so I’m not sure there’s anything I can do to make it faster. I guess my dream of turning this into a useful browser extension won’t work :frowning:

Latency aside, this is so cool! I have some questions: can you customize the voice easily for this, e.g. can I make it sound like me? I managed to clone my voice once with a 3rd-party service and am curious if it would now be possible via your conversion too?

Even though it takes some time - about 8 seconds on my old laptop - it is still a really interesting demo. I will try it later on my desktop, which is more modern with a dedicated GPU, to see if that improves things. I will also share it with our team to see if they have any ideas on optimization here.

For the WASM compatibility, is it a missing op issue?

Also, would you be interested in talking about your work on a future TensorFlow.js show and tell? This is the first time I have seen someone port a model like this to JS. If you are new to the show, you can find some of my prior interviews here.

@Taylor I have tried your demo; it is pretty amazing.
The performance can be better: since your model takes variable-length input, it needs to compile the op shaders on each inference where the shapes change.
You can see that for the exact same input string, the first inference can take up to 6 seconds, while the following inferences only take about 600ms.
We also have a flag you can try where we parameterize the shapes for the op shaders; it should significantly improve the first inference. Can you try adding the following flag to your demo?

tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);

Wow! As a newbie to TensorFlow (and to ML in general) this is a reaction I did not expect! :sweat_smile:
TF.JS show and tell seems really cool. I’d be honored to be on! But now I must ask: you know this is a port of someone else’s model, right? I feel like I did only one tenth of the work they did to get this model working. That said, and motivated by the slowness of the previous model, I have been studying up on ML with plans to create a native TensorFlow.js model of FastSpeech2! Perhaps we could talk about that. :smile:

In terms of cloning your voice with this, while this architecture is not designed to facilitate that, I wouldn’t rule it out completely. As I understand it, FastSpeech2 trains significantly faster than auto-regressive models such as Tacotron 2 (and I believe with less data, but I’m not sure). A noisy but somewhat acceptable voice is possible from Tacotron 2 using only 10 minutes of training data, so I’m sure FastSpeech2 could do just as well if not better! I wouldn’t bet too much on it being fast enough in the browser, though, since the whole model would need to be trained at once.

The WASM compatibility error seems to be a missing implementation of Softplus, an activation function:

Error: Kernel 'Softplus' not registered for backend 'wasm'

Wow, if that’s true then my extension idea might really work with this model! Unfortunately, I tried it on both a lower-end (512MB VRAM) and a higher-end (6GB VRAM) machine, and the flag made no difference. Perhaps I am using it wrong, but I could see with tf.getBackend() that the backend was set to 'webgl', and tf.env().get('WEBGL_USE_SHAPES_UNIFORMS') returned true. I have tried with and without tf.enableProdMode().

Sure, we have had people talk about their conversion experiences in the past too, especially if they have built something with the resulting model or done some optimizations, which I figured you had done given the discussion above. Not to mention, this is the first time I have seen a successful TTS conversion. It is always exciting to see how others may take such a conversion and use it in their work.

Maybe we can wait until you have done some optimization, re-made a TFJS-native version, or used it for something novel in the browser - e.g. a text chat demo that reads messages aloud, or some accessibility use case that reads highlighted text aloud as part of a Chrome extension. Ideally it can solve a problem for many people, and that is always good content to talk about.

No rush here so just let me know when you feel you have made some progress and we can have a chat to see where you are at.

Feel free to drop me a direct message on the forum, or via my other social media if you are following me there, when you feel it is in a good enough state that you can talk about your learnings.

Knowledge sharing is the key part of this show, and it inspires others to convert more models so the JS community can take them and use them in ways others have never dreamt of! JS engineers are a very creative bunch of people :slight_smile:

Ping is out of the office today, but he will hopefully get back to you next week regarding your usage of his suggestion.
