Hello everyone,
We’re building an on-device AI triage tool for frontline health workers in Uganda. The app runs entirely on the user’s Android device using MedGemma 4B. We’ve taken time to make sure the questions below aren’t already covered on the forum, though we’ve noticed existing discussions tend to lean toward the 27B model.
We’d appreciate the team’s guidance on the following:
Hallucination guardrails for clinical triage
In our use case, a hallucinated diagnosis or a missed danger sign could have serious consequences. These health workers often have limited clinical training, so the model’s outputs carry real weight in decision-making. What guardrail strategies does the team recommend for the 4B model in safety-critical workflows?
Confidence calibration at 4B scale
Is there existing guidance on calibrating model confidence, or techniques the team has found effective for getting the model to surface uncertainty rather than producing confident-sounding but incorrect assessments? At 4B, thinking mode isn’t available as it is on 27B, so we’re looking for alternative approaches.
Prompting and output formats
Are there specific prompting strategies or output formats the team has seen work well to improve helpfulness? We’d appreciate any general advice on getting the most out of the 4B model, particularly patterns the team has found useful in practice.
On-device vision model performance
We’re running the vision pipeline on-device and finding that image processing takes a considerable amount of time on mobile hardware. Are there recommended approaches for improving inference speed for the vision component on resource-constrained Android devices? Any optimizations the team has tested for mobile deployment would be very helpful.
For context, the app runs fully on-device with no cloud fallback, so performance and safety at the edge are both critical for us.
Looking forward to any guidance the community can share.
Hi @Bronson_Bakunga
This is a great use case. I’ll go point-by-point with recommendations based on what we’ve seen work for smaller models in safety-critical settings.
Hallucination Guardrails
Since health workers may not have extensive clinical training to catch subtle AI errors, your system architecture must prioritize safety over raw generation. One of the most effective methods: avoid open-ended diagnosis. Instead, force classification into a small, predefined set like Emergency referral, Urgent clinic visit, Routine care, or Home care. This significantly reduces hallucinations.
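Here’s a minimal Kotlin sketch of that pattern. The category names, prompt wording, and parseTriage helper are illustrative assumptions for your app, not an official MedGemma API:

```kotlin
// Constrained-classification guardrail: the model may only answer with one
// of four fixed triage labels, and anything unparseable escalates to the
// safest category instead of reaching the health worker as free text.
val TRIAGE_CATEGORIES = setOf(
    "EMERGENCY_REFERRAL", "URGENT_CLINIC_VISIT", "ROUTINE_CARE", "HOME_CARE"
)

val TRIAGE_SYSTEM_PROMPT = """
    You are a triage assistant for community health workers.
    Respond with exactly one of: EMERGENCY_REFERRAL, URGENT_CLINIC_VISIT,
    ROUTINE_CARE, HOME_CARE. Do not output anything else.
""".trimIndent()

fun parseTriage(rawModelOutput: String): String {
    val label = rawModelOutput.trim().uppercase().replace(' ', '_')
    // Fail safe: an answer outside the allowed set is treated as the most
    // conservative outcome rather than being shown as-is.
    return if (label in TRIAGE_CATEGORIES) label else "EMERGENCY_REFERRAL"
}
```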
Confidence calibration at 4B scale
Since MedGemma 4B lacks the native “thinking mode” of the 27B model, confidence works best as a combination of signals rather than something you fully trust from the model itself. You can have the model output a simple confidence label (low/medium/high). Also, it’s important to explicitly allow the model to abstain (e.g., “refer—uncertain”); otherwise it will tend to guess even in ambiguous cases. Over time, this leads to a system that surfaces uncertainty more honestly.
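As a sketch of how that can look in code (the enum, labels, and escalation rule below are our assumptions, not part of MedGemma):

```kotlin
// Combine the model's self-reported confidence label with a hard escalation
// rule, so low confidence is always surfaced as an explicit "refer -
// uncertain" outcome rather than acted on directly.
enum class Confidence { LOW, MEDIUM, HIGH }

data class TriageAssessment(val category: String, val confidence: Confidence)

fun applyConfidencePolicy(assessment: TriageAssessment): String =
    when (assessment.confidence) {
        Confidence.LOW -> "REFER_UNCERTAIN"   // explicit abstention path
        else -> assessment.category
    }
```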
Prompting and output formats
For a 4B model, zero-shot open-ended prompts often lead to chatty, unformatted, or hallucinated text. Forcing the model to output JSON is impactful because it lets your Android app parse the data and render a safe, standardized UI. Also, use role-playing and few-shot examples in the prompt: giving the model a strict persona and 2-3 examples in the system prompt will improve adherence and output quality.
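A sketch of a JSON-only prompt plus defensive parsing; the schema, field names, and example case are ours for illustration, not an official MedGemma output format:

```kotlin
import org.json.JSONObject

// System prompt that pins the persona, the JSON schema, and one worked
// example (add 2-3 in practice for better adherence).
val JSON_SYSTEM_PROMPT = """
    You are a clinical triage assistant. Answer with JSON only, no prose:
    {"category": "EMERGENCY_REFERRAL|URGENT_CLINIC_VISIT|ROUTINE_CARE|HOME_CARE",
     "confidence": "low|medium|high",
     "danger_signs": ["..."]}

    Example
    Input: 2-year-old, fever 39.5 C, convulsion this morning.
    Output: {"category": "EMERGENCY_REFERRAL", "confidence": "high",
             "danger_signs": ["convulsion", "high fever"]}
""".trimIndent()

fun parseModelJson(raw: String): JSONObject? = try {
    // Small models often wrap JSON in markdown fences; strip them first.
    val cleaned = raw.trim()
        .removePrefix("```json").removePrefix("```")
        .removeSuffix("```").trim()
    JSONObject(cleaned)
} catch (e: Exception) {
    null  // Unparseable output: re-prompt or escalate, never render it raw.
}
```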
On-device vision model performance
Image processing is notoriously heavy on edge devices because the Vision Transformer encoder requires massive matrix multiplications. Some methods you can use (a combined sketch follows the list):
a) Quantizing the model (e.g., 4-bit or 8-bit) and running it through something like TensorFlow Lite with NNAPI or GPU acceleration typically gives the biggest improvements.
b) You can also try reducing image resolution and cropping to the relevant region before inference so you’re not processing unnecessary pixels. A lightweight pre-check for image quality (blur, lighting, framing) can also prevent wasted computation.
c) Since a 4-7 second inference delay is often unavoidable on lower-end Android phones, you can design your app’s UX so the vision processing happens asynchronously.
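Here’s a rough Kotlin sketch combining (a)-(c) using TensorFlow Lite’s Interpreter with a GPU delegate. The 896x896 input size, output dimension, and preprocessing are assumptions you’d replace with your exported encoder’s actual signature:

```kotlin
import android.graphics.Bitmap
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.MappedByteBuffer

private const val INPUT_SIZE = 896    // assumed encoder input size; check your model
private const val OUTPUT_DIM = 1024   // assumed output size; check your model

class VisionPipeline(modelBuffer: MappedByteBuffer) {
    // (a) Quantized model + GPU delegate; TFLite falls back to CPU threads
    // if the delegate cannot initialize on a given device.
    private val interpreter = Interpreter(
        modelBuffer,
        Interpreter.Options().apply {
            addDelegate(GpuDelegate())
            setNumThreads(4)
        }
    )

    // (c) Suspend function so inference never blocks the UI thread.
    suspend fun encode(bitmap: Bitmap): FloatArray = withContext(Dispatchers.Default) {
        // (b) Downscale to the model's input size before inference.
        val resized = Bitmap.createScaledBitmap(bitmap, INPUT_SIZE, INPUT_SIZE, true)
        val output = Array(1) { FloatArray(OUTPUT_DIM) }
        interpreter.run(toInputBuffer(resized), output)
        output[0]
    }

    private fun toInputBuffer(bitmap: Bitmap): ByteBuffer {
        val buffer = ByteBuffer.allocateDirect(4 * INPUT_SIZE * INPUT_SIZE * 3)
            .order(ByteOrder.nativeOrder())
        val pixels = IntArray(INPUT_SIZE * INPUT_SIZE)
        bitmap.getPixels(pixels, 0, INPUT_SIZE, 0, 0, INPUT_SIZE, INPUT_SIZE)
        for (p in pixels) {
            // Normalize RGB to [0, 1]; match your model's actual preprocessing.
            buffer.putFloat((p shr 16 and 0xFF) / 255f)
            buffer.putFloat((p shr 8 and 0xFF) / 255f)
            buffer.putFloat((p and 0xFF) / 255f)
        }
        return buffer
    }
}
```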
Please give these a try and let me know if you need any more help with it.
Thanks
@Pannaga_J thank you for getting back to us so soon. These are great recommendations.
We’ll implement them and get back to you as soon as possible with what we observe.