Hi Team,
I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:
{period}, {comma}, {slash}, {next line}, {question mark}, etc.
These tokens appear to represent spoken punctuation or formatting markers.
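To make this concrete, below is a minimal sketch of how such tokens could be inventoried across a set of transcripts. The regex, the function name `count_brace_tokens`, and the sample strings are illustrative assumptions on my side, not part of the provided evaluation pipeline.

```python
import re
from collections import Counter

# Assumption: a brace token is any non-empty text between "{" and "}"
# with no nested braces, e.g. "{period}" or "{next line}".
BRACE_TOKEN = re.compile(r"\{[^{}]+\}")


def count_brace_tokens(transcripts):
    """Count each distinct {token} across an iterable of transcript strings."""
    counts = Counter()
    for text in transcripts:
        # Lowercase so "{Period}" and "{period}" are counted together.
        counts.update(match.lower() for match in BRACE_TOKEN.findall(text))
    return counts


# Made-up sample transcripts, for illustration only.
sample = [
    "patient denies chest pain {period} {next line}",
    "any fever {comma} chills {question mark}",
    "blood pressure 120 {slash} 80 {period}",
]

for token, n in count_brace_tokens(sample).most_common():
    print(token, n)
```

Running something like this over the training and evaluation transcripts would give the observed token inventory, which could then be compared against whatever list the model officially supports.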
Based on the current preprocessing and evaluation logic, I have a few questions I would like to clarify:
1. Expected List of Supported Brace Tokens
Is there an official or recommended list of brace-based tokens (e.g., {period}, {comma}, {slash}, etc.) that the model is trained to recognize? It would be very helpful if you could provide:
- A predefined list of all such tokens supported by the tokenizer
- Guidance on whether this list is fixed or customizable
2. Preprocessing Rules for Fine-Tuning
For fine-tuning purposes:
- Should these brace tokens be preserved as-is in training text?
- Should they be removed before tokenization?
- Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?
Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:
- How to handle {...} tokens in ground truth text
- Whether they should be normalized, removed, or kept intact
- Any tokenizer-specific considerations
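For clarity, here is a rough sketch of the three options I am asking about (keep, remove, normalize). The function `handle_brace_tokens` and the `SPOKEN_TO_SYMBOL` mapping are hypothetical and only cover the example tokens listed above; I do not know which behavior, if any, matches what the model expects.

```python
import re

# Captures the name inside the braces, e.g. "period" from "{period}".
BRACE_TOKEN = re.compile(r"\{([^{}]+)\}")

# Hypothetical mapping, for illustration only; not an official token list.
SPOKEN_TO_SYMBOL = {
    "period": ".",
    "comma": ",",
    "slash": "/",
    "question mark": "?",
    "next line": "\n",
}


def handle_brace_tokens(text, mode="keep"):
    """Apply one of three candidate treatments to {…} tokens in a transcript."""
    if mode == "keep":
        # Leave the ground truth untouched.
        return text
    if mode == "remove":
        # Drop the tokens, then collapse the whitespace they leave behind.
        stripped = BRACE_TOKEN.sub(" ", text)
        return re.sub(r"\s+", " ", stripped).strip()
    if mode == "normalize":
        # Replace known tokens with their symbol; unknown tokens stay as-is.
        def to_symbol(match):
            name = match.group(1).strip().lower()
            return SPOKEN_TO_SYMBOL.get(name, match.group(0))

        return BRACE_TOKEN.sub(to_symbol, text)
    raise ValueError(f"unknown mode: {mode!r}")


line = "any fever {comma} chills {question mark}"
for mode in ("keep", "remove", "normalize"):
    print(mode, "->", repr(handle_brace_tokens(line, mode)))
```

Note that this naive normalization leaves a space before each inserted symbol (for example "fever , chills ?"), so spacing rules would also need to be defined if normalization is the recommended route.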