How do I build a custom voice recognition model for multiple people?

Is there a quick and easy import tool for custom voice data?
Is there a free, locally trainable speech-to-text tool (including the ability to export the model for tf.js) for custom raw voice data?
Can tf.js automatically learn unknown speech sounds and integrate them into an existing example model?

PS: I don’t want to train my custom data through a cloud-based paid service.

Welcome to the community. If you just need sound recognition, you can try Teachable Machine, which makes it easy to recognize short-form sounds, e.g. 1 second in length. I have not seen a full voice-recognition conversion yet, as those models tend to be quite large in file size, but sound recognition is most certainly possible. Check:

And then select the audio project. If you like what it trains in the browser, you can click Download at the top right and save the generated model files to your computer. All training is done in the browser using TensorFlow.js, so no server is used here other than to deliver the initial webpage; your sounds are never sent to a server.

If you want to do voice recognition in JavaScript, it actually exists via the Web Speech API:

You do not need TensorFlow.js to use that. It is part of the browser implementation and will use whatever OS-level voice recognition exists.

Good luck!


Because I have a hearing impairment: the recognition rate of such products in real life is very low, and there is no self-learning enhanced-training feature.

So I want to find out whether tf.js has a self-learning (unsupervised) capability that could improve the recognition rate.

Short voice commands alone are not helpful for hearing-impaired people.

I see! Thank you for the context.

So our short-form audio detection would be good for informing you of sounds like a fire alarm, a gunshot, a doorbell, etc. - things that repeat or are distinct. In that sense it could be useful for triggering a push alert on your phone to notify you that something needs attention which might otherwise be missed if one cannot hear it.

In terms of voice recognition, right now the API above is the best bet for JavaScript, as the on-device voice models are, to the best of my knowledge, gigabytes in size. Maybe @lgusm knows more about voice recognition models, or knows someone who does?

It is a sort of Google Project Euphonia, but with TF.js.

See also Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech

Thank you for your reply.
The fact is that I need to communicate with hearing people.
I can’t use short speech commands to understand what they are saying.

I would like tf.js to provide a voice-training version that handles long sentences.

Thank you for the information.
Unfortunately, they don’t accept non-English.

Hi Flash,

We’ve just published this community tutorial: Fine-tuning Wav2Vec2 with an LM head  |  TensorFlow Hub

It’s not going to help you directly as it’s an English model, but it could give you some kind of start.

I’ll keep looking for better options and let you know if I find something.

Hi, thanks for your community tutorial link.

My PC has an i5-3470 CPU and no GPU.
OS: Windows 10 Pro
Env: Miniconda
I wrote the code according to the instructions (GitHub - flashlin/deep_learning),

but it shows this error message:

tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError:  Fused conv implementation does not support grouped convolutions for now.
         [[{{node StatefulPartitionedCall/wav2vec2/encoder/pos_conv_embed/conv/Conv1DWithWeightNorm}}]] [Op:__inference_restored_function_body_39909]

Function call stack:
restored_function_body
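For context, this error usually means the saved wav2vec2 graph uses a grouped 1-D convolution (the `pos_conv_embed` layer in the stack trace) that the fused CPU conv kernel in some TensorFlow builds does not implement. As a rough illustration of what a grouped convolution is, not a fix for the saved model itself, it can be emulated by splitting the channel axis, convolving each slice with a plain `Conv1D`, and concatenating. A minimal sketch (function name is mine):

```python
import tensorflow as tf

def grouped_conv1d(x, filters, kernel_size, groups):
    """Emulate a grouped Conv1D using ungrouped convolutions.

    Splits the channel axis into `groups` slices, applies a plain
    Conv1D to each slice, and concatenates the results; this avoids
    the fused grouped-conv kernel the error complains about.
    """
    splits = tf.split(x, groups, axis=-1)
    outputs = [
        tf.keras.layers.Conv1D(filters // groups, kernel_size, padding="same")(s)
        for s in splits
    ]
    return tf.concat(outputs, axis=-1)

x = tf.random.normal([1, 100, 64])  # (batch, time steps, channels)
y = grouped_conv1d(x, filters=64, kernel_size=3, groups=16)
print(y.shape)  # (1, 100, 64)
```

The shape is preserved because each of the 16 groups maps 4 input channels to 4 output channels, and the concatenation restores the full 64.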

Did you install the dependencies?

!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main
!sudo apt-get install -y libsndfile1-dev
!pip3 install -q SoundFile

Can you try that on Google Colab, please? It’s easier to get the environment working there, and it’s free.

But if I use Google Colab,
how do I automatically collect unrecognized sounds on the client side,
perform training automatically,
and merge the enhanced learning into the trained model?

Google Colab is typically for trying Python code out via the browser - lgusm’s suggestion above is Python based, not JavaScript - and it actually fires up a server to execute the code, so it may be trickier than using JS to gather sensor data from the device, as it is not running front end on the device.

If you want to do the data collection on the client side, you would need to make your own custom version of Teachable Machine so that it could generate data in the right form to retrain the model @lgusm suggested, which you could then maybe convert to TensorFlow.js format via our converter. Do you know if that one is compatible for conversion, @lgusm, or has a JS implementation?


I like how easy Teachable Machine is to use.
However, Teachable Machine has no place to upload previously trained Teachable Machine models so that I can enhance them.

How do I view the Teachable Machine Audio Project Source Code?
Or can I customize a project?

Teachable Machine’s repo is here: GitHub - googlecreativelab/teachablemachine-community: Example code snippets and machine learning code for Teachable Machine

It can give you some insights, but I think it’s focused on short commands (like this tutorial: Simple audio recognition: Recognizing keywords  |  TensorFlow Core).
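For reference, the core step in that keyword-spotting tutorial is turning each short clip into a magnitude spectrogram, which is then classified. A minimal sketch of that transform (the frame parameters here are illustrative choices, and the waveform is a synthetic tone standing in for a recorded clip):

```python
import tensorflow as tf

# One second of a 440 Hz tone at 16 kHz, standing in for a recorded clip.
t = tf.range(16000, dtype=tf.float32) / 16000.0
waveform = tf.sin(2.0 * 3.14159265 * 440.0 * t)

# Short-time Fourier transform -> magnitude spectrogram.
stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
spectrogram = tf.abs(stft)
print(spectrogram.shape)  # (124, 129): (time frames, frequency bins)
```

With these parameters, 16000 samples yield 124 overlapping frames, and the default FFT length (256, the next power of two above 255) yields 129 frequency bins.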

That model I shared has just been published; there’s no TFJS version yet, and it’s a little big (200+ MB). I shared it because it’s the state of the art for Automatic Speech Recognition and can give you some ideas.

Is it available in the XLSR version? It would probably be easier to fine-tune that one in a low-resource regime.

I don’t think it is, but that just gave me some good insights!

They organized a nice fine-tuning community week a few months ago:

It could be nice to also involve our community in initiatives like these, e.g. with TF Hub /cc @thea @Joana @yarri-oss


So, in terms of uploading previously saved training data to Teachable Machine, I believe it does allow you to open arbitrary data saved from other TM-produced models etc., if you have access to them. You just need to click on the 3 lines at the top left to access the file menu. E.g. on this page: Teachable Machine


Check out @lgusm’s suggestions for accessing the raw code of TM, though. There is also a fun codelab on how to make your own Teachable Machine for images here; since audio classification is treated as an image problem, it may also help you out:
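On the "audio classification is an image problem" point: once a clip has been converted to a spectrogram, standard 2-D image layers apply to it directly. A small sketch with illustrative shapes:

```python
import tensorflow as tf

# A batch of magnitude spectrograms: (batch, time frames, frequency bins).
specs = tf.random.normal([8, 124, 129])

# Add a channel axis so each spectrogram looks like a grayscale image,
# then apply an ordinary image-style convolution.
images = specs[..., tf.newaxis]                        # (8, 124, 129, 1)
features = tf.keras.layers.Conv2D(16, 3, padding="same")(images)
print(features.shape)  # (8, 124, 129, 16)
```

This is why image-oriented tutorials and codelabs carry over: the only audio-specific part is the waveform-to-spectrogram preprocessing.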


I spent a lot of time setting up Windows 10 to run TensorFlow environments.
Just now I finally managed to run the tutorial you provided (like this tutorial: Simple audio recognition: Recognizing keywords | TensorFlow Core).

If I lengthen the commands,
is it possible to train on variable-length sentences?
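On the variable-length question: the keyword tutorial fixes a 1-second window, but tf.data can batch utterances of different lengths by padding each batch up to its longest clip, which is how sentence-length training data is typically handled. A minimal sketch (the clip lengths are made up for illustration):

```python
import tensorflow as tf

# Three fake utterances of different lengths (0.5 s, 0.75 s, 1 s at 16 kHz).
clips = [tf.random.normal([n]) for n in (8000, 12000, 16000)]

ds = tf.data.Dataset.from_generator(
    lambda: clips,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.float32),
)

# Pad every clip in the batch to the length of the longest one.
batched = ds.padded_batch(3, padded_shapes=[None])
batch = next(iter(batched))
print(batch.shape)  # (3, 16000)
```

A model trained this way also needs to ignore the padded samples, e.g. via masking or a sequence-length input, which full ASR models like wav2vec2 build in.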