MedTok Tokenizer

Could MedGemma be fine-tuned using Harvard's MedTok tokenizer?

MEDTOK is a multimodal tokenizer of medical codes that combines text descriptions of codes with graph-based representations of dependencies between codes derived from clinical ontologies and standard medical terminologies. MEDTOK is a general-purpose tokenizer that can be integrated into any transformer-based model or system that requires tokenization.

This could enhance the model’s ability to embed nuanced distinctions between closely related ICD codes. For example, it could help differentiate E11 (Type 2 diabetes mellitus) from E11.1 (Type 2 diabetes mellitus with ketoacidosis) or E11.10 (Type 2 diabetes mellitus with ketoacidosis without coma).
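
To make "closely related" concrete, here is a minimal sketch of how the three codes above nest inside one another. It only uses the string-prefix structure of the codes; this is a simplification for illustration and is not how MedTok derives its graph representations. The helper name `ancestors` is hypothetical.

```python
# Illustrative only: the three codes from the example above, with their descriptions.
ICD10_EXAMPLES = {
    "E11": "Type 2 diabetes mellitus",
    "E11.1": "Type 2 diabetes mellitus with ketoacidosis",
    "E11.10": "Type 2 diabetes mellitus with ketoacidosis without coma",
}


def ancestors(code: str) -> list[str]:
    """Return the progressively shorter prefixes of an ICD-10 code, nearest first.

    Hypothetical helper: it treats every shorter prefix (down to the
    three-character category) as a parent of the code.
    """
    compact = code.replace(".", "")
    result = []
    for length in range(len(compact) - 1, 2, -1):
        prefix = compact[:length]
        # Re-insert the dot after the three-character category when needed.
        result.append(prefix if length == 3 else prefix[:3] + "." + prefix[3:])
    return result


for code in ICD10_EXAMPLES:
    parents = ancestors(code)
    print(code, "->", parents if parents else "(top-level category)")
```

A generic subword tokenizer may split such codes into pieces that do not line up with this hierarchy, which is the kind of distinction the question is asking about.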

I’m afraid it won’t be feasible to fine-tune MedGemma with a different tokenizer. In the fine-tuning notebooks, we show how to tune using complete prompts. You may find this demo helpful: it shows how to use MedGemma to extract structured diagnosis codes (ICD-10) from unstructured clinical notes.
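
One general reason a tokenizer swap is hard, offered here as an assumption rather than something stated above: a pretrained model's embedding rows are indexed by the token ids of the tokenizer it was trained with. The toy sketch below uses made-up vocabularies and ids (nothing is taken from MedGemma or MedTok) to show how a second tokenizer's ids would point at the wrong rows, or at no row at all.

```python
# Toy illustration; vocabularies and ids are made up, not taken from MedGemma or MedTok.
original_vocab = {"type": 0, "2": 1, "diabetes": 2, "E11": 3}
other_vocab = {"E11": 0, "E11.1": 1, "diabetes": 2, "type": 3, "2": 4}

# One learned vector per id in the ORIGINAL vocabulary
# (a label stands in for the real numbers).
embedding_rows = ["vec(type)", "vec(2)", "vec(diabetes)", "vec(E11)"]


def lookup(tokens, vocab):
    """Map tokens to ids with `vocab`, then read the model's embedding rows.

    Returns None for any id that has no row in the embedding table.
    """
    ids = [vocab[t] for t in tokens]
    return [embedding_rows[i] if i < len(embedding_rows) else None for i in ids]


tokens = ["type", "2", "diabetes"]
print(lookup(tokens, original_vocab))  # ['vec(type)', 'vec(2)', 'vec(diabetes)']
print(lookup(tokens, other_vocab))     # ['vec(E11)', None, 'vec(diabetes)']
```

With the second vocabulary, "type" lands on the row learned for "E11" and "2" has no row, so the pretrained weights would no longer mean what they did during pretraining.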

May I ask in what scenario you need to combine MedTok tokens with MedGemma?

Fereshteh on behalf of HAI-DEF engineering