Provenance of KenLM model included with MedASR

Hi,

Thanks for your work on MedASR — I’ve done some preliminary work with it and it looks very promising. I’m curious about the KenLM model provided for beam search: can you share any details on its source, training data, and licensing?

Thanks,

Ryan

That’s great to hear! The license for the LM is the same as for MedASR: the HAI-DEF Terms of Use. I believe it’s been trained on the same sources as MedASR listed in the model card, but I’ll confirm with the engineering team.

The LM is trained on the transcripts of “proprietary dataset 1” in our model card. Notably, it’s a SentencePiece token LM rather than a word LM.
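For anyone else wiring this up: a token LM means the KenLM n-grams are over SentencePiece pieces, not words, so candidate text has to be split into pieces before scoring. A minimal sketch of the greedy longest-match tokenization — the piece vocabulary below is invented for illustration; the real piece table ships with the SentencePiece model file:

```python
def sp_tokenize(text, vocab):
    """Greedy longest-match split into SentencePiece-style pieces.

    Word-initial pieces carry the '▁' marker, mirroring how SentencePiece
    encodes word boundaries. Real SentencePiece uses a learned unigram or
    BPE model; greedy longest-match is just an approximation for the sketch.
    """
    s = "▁" + text.lower().replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest possible piece first, down to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            pieces.append(s[i])  # unknown character: fall back to itself
            i += 1
    return pieces

# Toy vocabulary, invented for illustration only.
vocab = {"▁comp", "arison", "▁10", "▁january", "▁20", "26"}
pieces = sp_tokenize("comparison 10 january 2026", vocab)
print(pieces)  # ['▁comp', 'arison', '▁10', '▁january', '▁20', '26']

# With the kenlm Python bindings, the pieces (not words) would be scored:
#   import kenlm
#   lm = kenlm.Model("medasr_token_lm.arpa")  # hypothetical filename
#   logprob = lm.score(" ".join(pieces))
```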

That’s fantastic, thanks for clarifying.

I’ve been able to convert the PyTorch model to an ONNX model, and run both it and the KenLM token model via C++ bindings in an existing project. Running alongside Whisper (which I was using previously), I can confirm a significant WER improvement over Whisper; for my use case (streaming radiology dictation), I see only a 3.5% WER compared to baseline (offline) transcription with MedASR.
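For reference, WER figures like the one above can be reproduced with the standard word-level edit distance; a self-contained sketch (not the evaluation script I used, just the usual Levenshtein formulation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("comparison colon 10 january 2026",
          "comparison colon 10 january twenty six"))  # 0.4 (2 errors / 5 words)
```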

I’m now working on a hybrid setup using Whisper and MedASR to transcribe the same dictation concurrently, and potentially moving both ONNX Runtime inference and the CTC beam search to the GPU for even faster performance.
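For anyone comparing decoder implementations before porting to GPU: the core of CTC prefix beam search (without the LM term, which would add a score on each prefix extension) boils down to the following. A minimal sketch in probability domain with made-up numbers, not the decoder I’m actually running:

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, alphabet, beam_width=8):
    """Minimal CTC prefix beam search (no LM term, probability domain).

    probs: list of per-frame distributions [p_blank, p_sym0, p_sym1, ...];
    index 0 is the blank. Returns (best_string, probability).
    """
    # Each prefix tracks (p_blank, p_non_blank): total probability of paths
    # collapsing to it that end in blank vs. end in its last symbol.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (pb, pnb) in beams.items():
            # Extend with blank: prefix unchanged, path now ends in blank.
            nb_pb, nb_pnb = next_beams[prefix]
            next_beams[prefix] = (nb_pb + (pb + pnb) * frame[0], nb_pnb)
            for k, sym in enumerate(alphabet, start=1):
                p = frame[k]
                if prefix and prefix[-1] == sym:
                    # Repeat without an intervening blank collapses into
                    # the same prefix...
                    nb_pb, nb_pnb = next_beams[prefix]
                    next_beams[prefix] = (nb_pb, nb_pnb + pnb * p)
                    # ...while a repeat after a blank starts a new symbol.
                    ext = prefix + (sym,)
                    e_pb, e_pnb = next_beams[ext]
                    next_beams[ext] = (e_pb, e_pnb + pb * p)
                else:
                    ext = prefix + (sym,)
                    e_pb, e_pnb = next_beams[ext]
                    next_beams[ext] = (e_pb, e_pnb + (pb + pnb) * p)
        # Keep only the best beam_width prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    best, (pb, pnb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(best), pb + pnb

# Two frames over {blank, 'a'}: greedy best-path picks blank-blank ("", 0.36),
# but summing all collapsing paths gives "a" probability 0.64.
best, p = ctc_prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], ["a"])
print(best, round(p, 2))  # a 0.64
```

This is also why beam search beats greedy decoding here: it sums probability mass over all alignments that collapse to the same prefix.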

One issue I have noticed is that performance on dictated dates is relatively poor (e.g. “Comparison COLON NEWLINE 10 January 2026” is transcribed as “[COMPARISON] {colon} \n {unintelligible}”). Do you think this is because dates were removed from the anonymized training sets?

If so, it may be worth fine-tuning with some dates; I’ll try this when I get a chance.
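If the anonymization hypothesis holds, a small synthetic set should be easy to produce; a sketch of generating dictation-style date lines for fine-tuning (the formats and the “Comparison:” framing are my assumptions about radiology dictation, not anything from the MedASR training data):

```python
import random
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def spoken_date_variants(d: date) -> list[str]:
    """A few dictation-style renderings of one date (assumed formats)."""
    month = MONTHS[d.month - 1]
    return [
        f"{d.day} {month} {d.year}",            # 10 January 2026
        f"{month} {d.day} {d.year}",            # January 10 2026
        f"{d.day:02d}/{d.month:02d}/{d.year}",  # 10/01/2026
    ]

def synthetic_date_lines(n: int, seed: int = 0) -> list[str]:
    """n random dates within ~20 years of 2010, one rendering each."""
    rng = random.Random(seed)  # seeded for reproducible fine-tuning sets
    start = date(2010, 1, 1)
    lines = []
    for _ in range(n):
        d = start + timedelta(days=rng.randrange(20 * 365))
        lines.append(rng.choice(spoken_date_variants(d)))
    return lines

for line in synthetic_date_lines(3):
    print(f"Comparison: {line}")
```

Pairing lines like these with TTS audio (or re-recorded dictation) would give a small supervised set targeted at the failure mode above.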