Provenance of KenLM model included with MedASR

Hi,

Thanks for your work on MedASR — I’ve done some preliminary work with it and it looks very promising. I’m curious about the KenLM model provided for beam search: can you share any details on its source, training data, and licensing?

Thanks,

Ryan

That’s great to hear! The license for the LM is the same as for MedASR: the HAI-DEF Terms of Use. I believe it’s been trained on the same sources as MedASR listed in the model card, but I’ll confirm with the engineering team.

The LM is trained on the transcripts of “proprietary dataset 1” in our model card. Notably, it’s a SentencePiece token LM rather than a word LM.
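For anyone else wiring this up: a token LM means the KenLM n-grams are over SentencePiece pieces, not words, so candidate text has to be split into pieces before scoring. A minimal sketch of the greedy longest-match tokenization — the piece vocabulary below is invented for illustration; the real piece table ships with the SentencePiece model file:

```python
def sp_tokenize(text, vocab):
    """Greedy longest-match split into SentencePiece-style pieces.

    Word-initial pieces carry the '▁' marker, mirroring how SentencePiece
    encodes word boundaries. Real SentencePiece uses a learned unigram or
    BPE model; greedy longest-match is just an approximation for the sketch.
    """
    s = "▁" + text.lower().replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest possible piece first, down to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            pieces.append(s[i])  # unknown character: fall back to itself
            i += 1
    return pieces

# Toy vocabulary, invented for illustration only.
vocab = {"▁comp", "arison", "▁10", "▁january", "▁20", "26"}
pieces = sp_tokenize("comparison 10 january 2026", vocab)
print(pieces)  # ['▁comp', 'arison', '▁10', '▁january', '▁20', '26']

# With the kenlm Python bindings, the pieces (not words) would be scored:
#   import kenlm
#   lm = kenlm.Model("medasr_token_lm.arpa")  # hypothetical filename
#   logprob = lm.score(" ".join(pieces))
```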

That’s fantastic, thanks for clarifying.

I’ve been able to convert the PyTorch model to an ONNX model, and run both it and the KenLM token model via C++ bindings in an existing project. Running alongside Whisper (which I was using previously), I can confirm a significant WER improvement over Whisper; for my use case (streaming radiology dictation), I see only a 3.5% WER compared to baseline (offline) transcription with MedASR.
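For reference, WER figures like the one above can be reproduced with the standard word-level edit distance; a self-contained sketch (not the evaluation script I used, just the usual Levenshtein formulation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("comparison colon 10 january 2026",
          "comparison colon 10 january twenty six"))  # 0.4 (2 errors / 5 words)
```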

I’m now working on a hybrid setup using Whisper and MedASR to transcribe the same dictation concurrently, and potentially moving both ONNX Runtime inference and the CTC beam search to the GPU for even faster performance.
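For anyone comparing decoder implementations before porting to GPU: the core of CTC prefix beam search (without the LM term, which would add a score on each prefix extension) boils down to the following. A minimal sketch in probability domain with made-up numbers, not the decoder I’m actually running:

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, alphabet, beam_width=8):
    """Minimal CTC prefix beam search (no LM term, probability domain).

    probs: list of per-frame distributions [p_blank, p_sym0, p_sym1, ...];
    index 0 is the blank. Returns (best_string, probability).
    """
    # Each prefix tracks (p_blank, p_non_blank): total probability of paths
    # collapsing to it that end in blank vs. end in its last symbol.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (pb, pnb) in beams.items():
            # Extend with blank: prefix unchanged, path now ends in blank.
            nb_pb, nb_pnb = next_beams[prefix]
            next_beams[prefix] = (nb_pb + (pb + pnb) * frame[0], nb_pnb)
            for k, sym in enumerate(alphabet, start=1):
                p = frame[k]
                if prefix and prefix[-1] == sym:
                    # Repeat without an intervening blank collapses into
                    # the same prefix...
                    nb_pb, nb_pnb = next_beams[prefix]
                    next_beams[prefix] = (nb_pb, nb_pnb + pnb * p)
                    # ...while a repeat after a blank starts a new symbol.
                    ext = prefix + (sym,)
                    e_pb, e_pnb = next_beams[ext]
                    next_beams[ext] = (e_pb, e_pnb + pb * p)
                else:
                    ext = prefix + (sym,)
                    e_pb, e_pnb = next_beams[ext]
                    next_beams[ext] = (e_pb, e_pnb + (pb + pnb) * p)
        # Keep only the best beam_width prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    best, (pb, pnb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(best), pb + pnb

# Two frames over {blank, 'a'}: greedy best-path picks blank-blank ("", 0.36),
# but summing all collapsing paths gives "a" probability 0.64.
best, p = ctc_prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], ["a"])
print(best, round(p, 2))  # a 0.64
```

This is also why beam search beats greedy decoding here: it sums probability mass over all alignments that collapse to the same prefix.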

One issue I have noticed is that performance on dictated dates is relatively poor (e.g. “Comparison COLON NEWLINE 10 January 2026” is transcribed as “[COMPARISON] {colon} \n {unintelligible}”). Do you think this is because dates were removed from the anonymized training sets?

If so, it may be worth fine-tuning with some dates; I’ll try this when I get a chance.
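If the anonymization hypothesis holds, a small synthetic set should be easy to produce; a sketch of generating dictation-style date lines for fine-tuning (the formats and the “Comparison:” framing are my assumptions about radiology dictation, not anything from the MedASR training data):

```python
import random
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def spoken_date_variants(d: date) -> list[str]:
    """A few dictation-style renderings of one date (assumed formats)."""
    month = MONTHS[d.month - 1]
    return [
        f"{d.day} {month} {d.year}",            # 10 January 2026
        f"{month} {d.day} {d.year}",            # January 10 2026
        f"{d.day:02d}/{d.month:02d}/{d.year}",  # 10/01/2026
    ]

def synthetic_date_lines(n: int, seed: int = 0) -> list[str]:
    """n random dates within ~20 years of 2010, one rendering each."""
    rng = random.Random(seed)  # seeded for reproducible fine-tuning sets
    start = date(2010, 1, 1)
    lines = []
    for _ in range(n):
        d = start + timedelta(days=rng.randrange(20 * 365))
        lines.append(rng.choice(spoken_date_variants(d)))
    return lines

for line in synthetic_date_lines(3):
    print(f"Comparison: {line}")
```

Pairing lines like these with TTS audio (or re-recorded dictation) would give a small supervised set targeted at the failure mode above.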