Thanks for your work on MedASR. I’ve done some preliminary work with it and it looks very promising. I’m curious about the KenLM model provided for beam search: can you share any details on its sources, training set, and licensing?
That’s great to hear! The license for the LM is the same as for MedASR: the HAI-DEF Terms of Use. I believe it’s been trained on the same sources as MedASR listed in the model card, but I will confirm with the engineering team.
I’ve been able to convert the PyTorch model to an ONNX model, and run both it and the KenLM token model via C++ bindings in an existing project. Running alongside Whisper (which I was using previously), I can confirm a significant WER improvement compared to Whisper, and for my (streaming radiology dictation) use case only a 3.5% WER relative to baseline (offline) transcription with MedASR.
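For anyone wanting to reproduce the comparison: this is a minimal sketch of how I compute WER between a reference and a hypothesis transcript (my own helper, not part of MedASR), using a word-level Levenshtein distance divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by
    the number of reference words (standard WER definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Rolling dynamic-programming rows of the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Example: one dropped word out of six reference words -> WER of 1/6.
ref = "comparison colon newline 10 january 2026"
hyp = "comparison colon newline january 2026"
print(round(word_error_rate(ref, hyp), 3))
```

Note that transcripts should be normalized (case, punctuation) consistently before scoring, otherwise formatting differences inflate the number.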
I’m now working on a hybrid setup that uses Whisper and MedASR to transcribe the same dictation concurrently, and I’m looking at moving both ONNX Runtime inference and the CTC beam search to the GPU for faster performance.
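As a reference point for the GPU port, here is a minimal CPU-side CTC prefix beam search (no language model) of the kind the GPU implementation would need to reproduce. This is an illustrative sketch, not MedASR’s decoder: `probs` is a `[T][V]` matrix of per-frame symbol probabilities with the blank at index 0.

```python
from collections import defaultdict

def ctc_beam_search(probs, beam_width=4, blank=0):
    """Prefix beam search over CTC frame posteriors.
    Each beam entry maps a prefix (tuple of symbol ids) to a pair
    (prob ending in blank, prob ending in non-blank)."""
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for v, p in enumerate(frame):
                if v == blank:
                    # Blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and v == prefix[-1]:
                    # Repeated symbol without a blank collapses into the prefix...
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b, nb_nb + p_nb * p)
                    # ...repeated symbol after a blank starts a new occurrence.
                    ext = prefix + (v,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + p_b * p)
                else:
                    ext = prefix + (v,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # Prune to the most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    return max(beams.items(), key=lambda kv: sum(kv[1]))[0]

# Toy example: 3 frames, vocabulary {blank, a=1, b=2}.
probs = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
print(ctc_beam_search(probs))
```

A production decoder would work in log space and fold in the KenLM score when a prefix is extended, but the prefix bookkeeping above is the part that is awkward to parallelize on the GPU.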
One issue I have noticed is that performance on dictated dates is relatively poor (e.g. “Comparison COLON NEWLINE 10 January 2026” is transcribed as [COMPARISON] {colon} \n {unintelligible}). Do you think this is because dates are removed from the anonymized training sets?
If so, it may be worth fine-tuning with some dates; I’ll try this when I get a chance.
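For the fine-tuning experiment, my plan is to synthesize date examples rather than source real (PHI-bearing) ones. A rough sketch of the generator, where the spoken/written conventions are my own assumptions rather than anything MedASR defines:

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def synthetic_date(rng: random.Random) -> tuple[str, str]:
    """Return a (spoken_form, written_form) pair for a random date."""
    day = rng.randint(1, 28)           # cap at 28 so every month is valid
    month = rng.choice(MONTHS)
    year = rng.randint(2015, 2030)
    written = f"{day} {month} {year}"  # e.g. "10 January 2026"
    # Assumed convention: the spoken target is simply the lower-cased form.
    spoken = written.lower()
    return spoken, written

# Generate a few reproducible examples for a fine-tuning set.
rng = random.Random(0)
for _ in range(3):
    spoken, written = synthetic_date(rng)
    print(spoken, "->", written)
```

Pairing these with TTS audio (or re-recorded dictation) should be enough to test whether the date gap really comes from anonymization.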