MedASR: Clarification Needed on Handling of Brace Tokens and Preprocessing Rules for Fine-Tuning & Decoding

Hi Team,

I am benchmarking the google/medasr model with the provided evaluation pipeline. While doing so, I noticed that many medical ASR transcripts contain special tokens enclosed in braces, such as:

{period}, {comma}, {slash}, {next line}, {question mark}, etc.

These tokens appear to represent spoken punctuation or formatting markers.
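
To make the question concrete, here is a minimal sketch of how the brace tokens in a set of transcripts could be inventoried. It assumes transcripts are plain strings and that a token is letters and spaces inside a single pair of braces; the function name and the sample lines are hypothetical and not part of the MedASR pipeline.

```python
import re
from collections import Counter

# Assumed token shape: letters and spaces inside one pair of braces,
# e.g. {period} or {next line}.
BRACE_TOKEN = re.compile(r"\{([a-z][a-z ]*)\}", re.IGNORECASE)


def count_brace_tokens(transcripts):
    """Count each distinct brace token across an iterable of transcript strings."""
    counts = Counter()
    for text in transcripts:
        for name in BRACE_TOKEN.findall(text):
            counts[name.lower()] += 1
    return counts


# Hypothetical sample lines, not taken from any real dataset.
samples = [
    "no acute distress {period} lungs clear {comma} heart regular {period}",
    "follow up in two weeks {period} {next line} any chest pain {question mark}",
]

for name, n in count_brace_tokens(samples).most_common():
    print(f"{{{name}}}: {n}")
```

An inventory like this, run over my own data, is what I would like to compare against the official list if one exists.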

Based on the current preprocessing and evaluation logic, I have a few questions:

1. Expected List of Supported Brace Tokens

Is there an official or recommended list of brace tokens (e.g., {period}, {comma}, {slash}) that the model was trained to recognize? It would be very helpful if you could provide:

  • A predefined list of all such tokens supported by the tokenizer

  • Guidance on whether this list is fixed or customizable


2. Preprocessing Rules for Fine-Tuning

For fine-tuning purposes:

  • Should these brace tokens be preserved as-is in training text?

  • Should they be removed before tokenization?

  • Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?

Could you please share the recommended preprocessing pipeline for fine-tuning, in particular:

  • How to handle {...} tokens in ground truth text

  • Whether they should be normalized, removed, or kept intact

  • Any tokenizer-specific considerations
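
For clarity, the three options I am asking about (keep, remove, normalize) can be sketched as below. This only illustrates the question and is not the MedASR preprocessing: the function name, the `mode` parameter, and the token-to-symbol mapping are all my own assumptions.

```python
import re

# Assumed token shape: any text inside one pair of braces.
BRACE_TOKEN = re.compile(r"\{([^{}]+)\}")

# Illustrative mapping only; the real set of tokens is what I am asking about.
SYMBOLS = {"period": ".", "comma": ",", "slash": "/", "question mark": "?"}


def handle_brace_tokens(text, mode="keep"):
    """Apply one of three hypothetical policies to brace tokens in a transcript."""
    if mode == "keep":
        # Leave the ground truth untouched.
        return text
    if mode == "remove":
        # Drop every brace token, then collapse the leftover whitespace.
        return " ".join(BRACE_TOKEN.sub(" ", text).split())
    if mode == "normalize":
        # Replace known tokens with written symbols; unknown tokens stay as-is.
        def to_symbol(match):
            name = match.group(1).strip().lower()
            return SYMBOLS.get(name, match.group(0))

        replaced = BRACE_TOKEN.sub(to_symbol, text)
        # Attach sentence punctuation to the preceding word.
        return re.sub(r"[ \t]+([.,?])", r"\1", replaced)
    raise ValueError(f"unknown mode: {mode!r}")


line = "lungs clear {comma} heart regular {period}"
for mode in ("keep", "remove", "normalize"):
    print(mode, "->", handle_brace_tokens(line, mode))
```

Knowing which of these behaviours (if any) matches the training data would tell me how to prepare ground truth for both fine-tuning and WER evaluation.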