MedASR: Clarification Needed on Handling of Brace Tokens and Preprocessing Rules for Fine-Tuning & Decoding

Hi Team,

I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:

{period}, {comma}, {slash}, {next line}, {question mark}, etc.

These tokens appear to represent spoken punctuation or formatting markers.

From the current preprocessing and evaluation logic, I have a few questions that I would like to clarify:

1. Expected List of Supported Brace Tokens

Is there an official or recommended list of brace-based tokens (e.g., {period}, {comma}, {slash}, etc.) that the model is trained to recognize? It would be very helpful if you could provide:

  • A predefined list of all such tokens supported by the tokenizer

  • Guidance on whether this list is fixed or customizable


2. Preprocessing Rules for Fine-Tuning

For fine-tuning purposes:

  • Should these brace tokens be preserved as-is in training text?

  • Should they be removed before tokenization?

  • Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?

Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:

  • How to handle {...} tokens in ground truth text

  • Whether they should be normalized, removed, or kept intact

  • Any tokenizer-specific considerations

Hi @csetanmayjain
To clarify your two specific questions related to brace tokens:

Predefined list of supported brace tokens
There is no official list published as part of the model documentation. If these tokens appear in the training corpus, they are implicitly supported to the extent that they are represented in the tokenizer vocabulary.
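Since no list is published, one practical option is to build your own inventory by scanning the transcripts you have. A minimal sketch (the regex and function name are illustrative, not part of the MedASR tooling):

```python
import re
from collections import Counter

# Matches any {...} span that does not itself contain braces.
BRACE_TOKEN_RE = re.compile(r"\{[^{}]+\}")

def brace_token_inventory(transcripts):
    """Count every {...} token that appears in an iterable of transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(BRACE_TOKEN_RE.findall(text))
    return counts

transcripts = [
    "patient denies chest pain {period}",
    "bp one twenty over eighty {comma} pulse seventy {period}",
    "plan {next line} follow up in two weeks {period}",
]
inventory = brace_token_inventory(transcripts)
```

Running this over your full training/evaluation corpus gives you the de facto token set to work with, even without an official list.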
Fixed vs. customizable
The tokenizer vocabulary itself is fixed for the released checkpoint. However, the preprocessing and normalization steps in your evaluation pipeline are customizable. If you modify or strip brace tokens, make sure your reference and hypothesis texts are normalized consistently before computing metrics such as WER.
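For example, if you choose to strip brace tokens before scoring, apply the exact same transformation to both reference and hypothesis. A minimal sketch (the function name is my own, not part of the released evaluation pipeline):

```python
import re

BRACE_TOKEN_RE = re.compile(r"\{[^{}]+\}")

def normalize_for_wer(text, strip_brace_tokens=True):
    """Lowercase, optionally drop {...} tokens, and collapse whitespace."""
    if strip_brace_tokens:
        text = BRACE_TOKEN_RE.sub(" ", text)
    return " ".join(text.lower().split())

# Both sides get identical normalization before WER is computed.
ref = normalize_for_wer("Patient is stable {period}")
hyp = normalize_for_wer("patient is stable {comma}")
```

Whatever convention you pick, the only hard requirement is symmetry: normalizing only one side will silently inflate or deflate WER.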

With respect to fine-tuning, you should preserve {...} tokens if your downstream use case requires explicit spoken punctuation or your training data consistently includes those tokens.
The key point is that the fine-tuning target format should match your desired inference-time output format. Also, please keep preprocessing consistent across train/dev/test.
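One way to enforce a single convention is to pick a mode once and apply it to every split. A rough sketch; the mapping below is an illustrative assumption, not the token set actually used in training:

```python
# Hypothetical mapping for illustration; the real training-time
# token list for the released checkpoint is not published.
BRACE_TO_SYMBOL = {
    "{period}": ".",
    "{comma}": ",",
    "{question mark}": "?",
    "{slash}": "/",
    "{next line}": "\n",
}

def prepare_target(text, mode="keep"):
    """Apply one punctuation convention to a fine-tuning target.

    mode="keep":  leave {...} tokens as-is (model emits spoken punctuation)
    mode="strip": remove them entirely
    mode="map":   replace them with their symbolic form
    """
    if mode == "keep":
        return text
    for token, symbol in BRACE_TO_SYMBOL.items():
        text = text.replace(token, "" if mode == "strip" else symbol)
    # Collapse the whitespace left behind when tokens are removed.
    return " ".join(text.split()) if mode == "strip" else text
```

Choosing `mode` once and reusing `prepare_target` for train, dev, and test is what keeps the fine-tuning target format aligned with the output format you expect at inference time.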

Please check out this notebook: medasr/notebooks/fine_tune_with_hugging_face.ipynb in the Google-Health/medasr GitHub repository.
Please let me know if you have any more questions.

Hi Pannaga,

Thank you for the clarification, that helps.

Since the training data used for the released checkpoint is not publicly available, it’s a bit difficult to understand the exact conventions and scope used for spoken punctuation during training.

For example:

  • {comma} vs ,

  • {period} vs .

  • {new line} vs actual line breaks

Because of this, fine-tuning may become slightly ambiguous; the model may already have learned some punctuation in symbolic form and some in brace-token form. We need to normalize our data accordingly.
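One way to resolve this ambiguity empirically is to decode a held-out set with the released checkpoint and count which convention dominates in its outputs. A rough sketch (regex and function name are illustrative):

```python
import re

BRACE_RE = re.compile(r"\{[^{}]+\}")
SYMBOL_RE = re.compile(r"[.,?/]")

def punctuation_convention(texts):
    """Count brace-style vs. symbolic punctuation to see which form dominates."""
    brace = sum(len(BRACE_RE.findall(t)) for t in texts)
    symbolic = sum(len(SYMBOL_RE.findall(t)) for t in texts)
    return {"brace": brace, "symbolic": symbolic}
```

If the model's own transcriptions are overwhelmingly one form, normalizing fine-tuning data toward that form is the safer default.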

Would it be possible to share one of the following to help guide fine-tuning?

  1. A list (even approximate) of brace tokens that appeared in the training corpus

  2. Any normalization rules applied internally to spoken punctuation and special symbols before training

  3. Whether punctuation in the training data was primarily represented as {token} or standard characters

Even partial guidance would help ensure that downstream fine-tuning stays aligned with the model’s learned format.

Thanks again for the support.
