MedASR: Clarification Needed on Handling of Brace Tokens and Preprocessing Rules for Fine-Tuning & Decoding

csetanmayjain · January 19, 2026, 5:52am

Hi Team,

I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:

{period}, {comma}, {slash}, {next line}, {question mark}, etc.

These tokens appear to represent spoken punctuation or formatting markers.

From the current preprocessing and evaluation logic, I have a few questions that I would like to clarify:

1. Expected List of Supported Brace Tokens

Is there an official or recommended list of brace-based tokens (e.g., {period}, {comma}, {slash}, etc.) that the model is trained to recognize? It would be very helpful if you could provide:

A predefined list of all such tokens supported by the tokenizer
Guidance on whether this list is fixed or customizable

2. Preprocessing Rules for Fine-Tuning

For fine-tuning purposes:

Should these brace tokens be preserved as-is in training text?
Should they be removed before tokenization?
Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?

Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:

How to handle {...} tokens in ground truth text
Whether they should be normalized, removed, or kept intact
Any tokenizer-specific considerations

Pannaga_J · February 23, 2026, 10:04am

Hi @csetanmayjain
To clarify your two specific questions about brace tokens:
Predefined list of supported brace tokens
There is no official list published as part of the model documentation. If these tokens appear in the training corpus, they are implicitly supported to the extent that they are represented in the tokenizer vocabulary.
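One rough way to check this for your own data is to collect the brace tokens that occur in your transcripts and look at how the tokenizer splits each of them. The sketch below is only an illustration: `find_brace_tokens` and `probe` are hypothetical helper names, the `<unk>` marker is an assumed default, and `tokenize` stands for whatever callable your tokenizer exposes (for example the `tokenize` method of a Hugging Face tokenizer, assuming the checkpoint's tokenizer loads that way).

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Matches {period}, {next line}, {question mark}, ... (letters and spaces only).
BRACE_TOKEN = re.compile(r"\{[A-Za-z][A-Za-z ]*\}")


def find_brace_tokens(transcripts: Iterable[str]) -> Counter:
    """Count every brace token that occurs in the given transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(BRACE_TOKEN.findall(text))
    return counts


def probe(tokens: Iterable[str],
          tokenize: Callable[[str], list],
          unk_token: str = "<unk>") -> dict:
    """Report how `tokenize` splits each brace token.

    `unk_token` is an assumption; use the unknown-token string of your tokenizer.
    """
    report = {}
    for tok in tokens:
        pieces = list(tokenize(tok))
        report[tok] = {"pieces": pieces, "has_unk": unk_token in pieces}
    return report


if __name__ == "__main__":
    sample = [
        "no fever {comma} no chills {period}",
        "bp 120 {slash} 80 {period}",
    ]
    counts = find_brace_tokens(sample)
    # Stand-in tokenizer for the demo: one piece per character.
    for token, info in probe(counts, tokenize=list).items():
        print(counts[token], token, info)
```

A brace token that comes back as many small pieces, or with unknown pieces, is probably not something the released vocabulary represents directly, although this does not tell you what the training transcripts looked like.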
Fixed vs customizable
The tokenizer vocabulary itself is fixed for the released checkpoint. However, preprocessing and normalization steps in your evaluation pipeline are customizable. If you modify or strip brace tokens, you should ensure that your reference and hypothesis texts are normalized consistently before computing metrics such as WER.
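As an illustration of consistent normalization, the sketch below passes the reference and the hypothesis through the same `normalize` function before computing a word-level edit distance. It is a simplified stand-in and not the provided evaluation pipeline: `normalize`, `wer`, and the `keep_brace_tokens` flag are hypothetical names, and the only normalization applied is lowercasing and optional removal of brace tokens.

```python
import re

# Anything enclosed in a single pair of braces, e.g. {period} or {next line}.
BRACE_TOKEN = re.compile(r"\{[^{}]*\}")


def normalize(text: str, keep_brace_tokens: bool = False) -> list:
    """Lowercase, optionally drop brace tokens, and split into words.

    A multi-word brace token such as {next line} is kept as a single word.
    """
    if not keep_brace_tokens:
        text = BRACE_TOKEN.sub(" ", text)
    # Either a whole brace token or a run of non-space characters.
    return re.findall(r"\{[^{}]*\}|\S+", text.lower())


def wer(reference: str, hypothesis: str, keep_brace_tokens: bool = False) -> float:
    """Word error rate with identical normalization on both sides."""
    ref = normalize(reference, keep_brace_tokens)
    hyp = normalize(hypothesis, keep_brace_tokens)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "no fever {comma} no chills {period}"
    hyp = "no fever no chills"
    print(wer(ref, hyp))                          # brace tokens stripped
    print(wer(ref, hyp, keep_brace_tokens=True))  # brace tokens scored as words
```

Running both settings on the same pair shows how much of the measured error comes from spoken-punctuation tokens rather than from the words themselves.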

With respect to fine-tuning, you should preserve {...} tokens if your downstream use case requires explicit spoken punctuation or your training data consistently includes those tokens.
The key is that the fine-tuning target format should match your desired inference-time output format. Also, please keep preprocessing consistent across train/dev/test.

Please check out this notebook: medasr/notebooks/fine_tune_with_hugging_face.ipynb at main · Google-Health/medasr · GitHub
Please let me know if you have any more questions.

csetanmayjain · February 24, 2026, 9:39am

Hi Pannaga,

Thank you for the clarification, that helps.

Since the training data used for the released checkpoint is not publicly available, it’s a bit difficult to understand the exact conventions and scope used for spoken punctuation during training.

For example:

{comma} vs ,
{period} vs .
{new line} vs actual line breaks

Because of this, fine-tuning may become slightly ambiguous; the model may already have learned some punctuation in symbolic form and some in brace-token form. We need to normalize our data accordingly.
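A minimal sketch of such a normalization, assuming a hand-written mapping: the table below is hypothetical and is not the convention used to train the released checkpoint, which is unknown. `BRACE_TO_SYMBOL` and `braces_to_symbols` are illustrative names only.

```python
import re

# Hypothetical mapping; extend or change it to match your own transcripts.
BRACE_TO_SYMBOL = {
    "{period}": ".",
    "{comma}": ",",
    "{slash}": "/",
    "{question mark}": "?",
    "{next line}": "\n",
    "{new line}": "\n",
}

_BRACE_PATTERN = re.compile("|".join(re.escape(k) for k in BRACE_TO_SYMBOL))


def braces_to_symbols(text: str) -> str:
    """Rewrite known brace tokens as standard characters.

    Brace tokens missing from the mapping are left untouched.
    """
    text = _BRACE_PATTERN.sub(lambda m: BRACE_TO_SYMBOL[m.group(0)], text)
    # Attach punctuation to the preceding word: "fever ." -> "fever."
    text = re.sub(r"[ \t]+([.,?])", r"\1", text)
    # Drop spaces around slashes and line breaks: "120 / 80" -> "120/80".
    # Note: this also tightens slashes that were already in the text.
    text = re.sub(r"[ \t]*([/\n])[ \t]*", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    print(braces_to_symbols("no fever {comma} no chills {period}"))
    print(braces_to_symbols("bp 120 {slash} 80 {period} {next line} plan"))
```

Whichever direction is chosen, the same conversion would need to be applied to every split so that the fine-tuning targets use a single format.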

Would it be possible to share one of the following to help guide fine-tuning?

A list (even approximate) of brace tokens that appeared in the training corpus
Any normalization rules used internally before training for the spoken punctuation and special symbols.
Whether punctuation in the training data was primarily represented as {token} or standard characters

Even partial guidance would help ensure that downstream fine-tuning stays aligned with the model’s learned format.

Thanks again for the support.
