MedTok Tokenizer

patrickdamaso · August 23, 2025, 7:29pm

Could MedGemma be potentially fine-tuned using the MedTok tokenizer by Harvard?

MEDTOK is a multimodal tokenizer of medical codes that combines text descriptions of codes with graph-based representations of dependencies between codes derived from clinical ontologies and standard medical terminologies. MEDTOK is a general-purpose tokenizer that can be integrated into any transformer-based model or system that requires tokenization.

This could enhance the model’s ability to embed nuanced distinctions between closely related ICD codes. For example, differentiating E11 (Type 2 diabetes mellitus) from E11.1 (Type 2 diabetes mellitus with ketoacidosis) or E11.10 (Type 2 diabetes mellitus with ketoacidosis without coma).

fmahvar · August 27, 2025, 8:57pm

I’m afraid, it won’t be feasible to fine-tune MedGemma with a different tokenizer. In the fine-tuning notebooks we show how to tune using complete prompts. You may find this demo beneficial; it’s showing how to use MedGemma to extract structured diagnosis codes (ICD-10) from unstructured clinical notes.

May I ask in what scenario you are finding it necessary to employ MedTok tokens+MedGemma?

Fereshteh on behalf of HAI-DEF engineering

Topic		Replies	Views
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'to_tensor' General Discussion models , keras , help_request	3	18232	October 15, 2021
How do I build a custom voice recognition model for multiple people? TF.js tfjs , datasets , help_request	24	7164	September 18, 2021
Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).` General Discussion models , transformers	3	3226	January 9, 2023
Transformer model for language understanding with another Dataset General Discussion nlp , datasets , help_request	1	1259	September 16, 2022
Model overfitting very fast General Discussion models , nlp , help_request	1	1199	August 21, 2021

MedTok Tokenizer

Related topics