Posting findings from a sustained testing effort against `litert-community/gemma-4-E2B-it-litert-lm` running on a Pixel 10a (Tensor G4 / Mali-G715, Android 14) via LiteRT-LM’s Kotlin `Conversation` API with native function calling. Built a structured eval rig (357 single-turn cases × 5 phrasing styles + 29 multi-turn scenarios + fixture-based pre/post-execution verification) and ran it 10+ times to characterize behavior. Sharing what surfaced that I don’t see widely discussed yet.
A longer technical write-up with all 10 findings + asks lives on the LiteRT-LM GitHub (google-ai-edge/LiteRT-LM, issue #2202).
## The three findings worth your time
### 1. The GPU precision bug from HF discussion #10 is real and dominant
Discussion #10 on google/gemma-4-E2B-it (“Allucination parsing JSON responses containing numerical data such as 1234.00”) reports number corruption on GPU + Conversation API. We’re hitting it from a totally different stack and seeing the same symptoms, plus a related contraction-token failure mode. Specifically:
- **Numbers in tool-call args get mangled.** `1029.3` emitted as `1.0,293`; `2026` as `202026`. Reproducible across multiple tool categories that take numeric strings.
- **Contractions tokenize incorrectly.** The model emits `I’X` patterns, where `X` is a noun, instead of `I’ve X`: the second half of the contraction (`'ve`, `'d`, `'m`, and likewise the `'t` in `haven't`) is dropped between the apostrophe and the next word. Visible both in user-facing prose and in tool-call arguments.
- **6–9% of all our cases bucket as `bad_json`** — silent tool-call drops where the parser couldn’t recover.
CPU backend doesn’t show these. Discussion #10 has been open ~30 days with a Google engineer ack but no fix released. If anyone else is also hitting this on a different stack, would be valuable to compare notes — a second public confirmation may help prioritization.
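Until a fix lands, one harness-side mitigation is to flag numeric arguments that cannot be traced back to the user's prompt. The sketch below is a heuristic of our own framing, not part of LiteRT-LM: the function name `suspicious_numeric_args` is hypothetical, and it assumes tool-call arguments arrive as a flat dict of strings. It is written in Python for brevity; the same logic ports directly to Kotlin.

```python
import re

# Plain decimal numbers such as "2026" or "1029.3" (no signs, no separators).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def suspicious_numeric_args(prompt: str, args: dict[str, str]) -> list[str]:
    """Return the names of digit-bearing args that look corrupted.

    An arg is flagged when it contains digits but is either not a
    well-formed number or is a number that never appears in the prompt.
    """
    numbers_in_prompt = set(NUMBER.findall(prompt))
    flagged = []
    for name, value in args.items():
        if not any(ch.isdigit() for ch in value):
            continue  # purely textual arg, nothing to check
        malformed = NUMBER.fullmatch(value) is None
        ungrounded = value not in numbers_in_prompt
        if malformed or ungrounded:
            flagged.append(name)
    return flagged
```

With the two corruptions reported above, `1.0,293` is caught as malformed and `202026` as ungrounded. The check will also flag legitimate derived values (for example a number the model computed itself), so it suits a warning or retry path better than a hard reject.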
### 2. Tool-call success inversely tracks argument count, very cleanly
This was the most actionable thing we found. Across 14 stub tools we built specifically to sweep parameter arity:
| Args | Honest pass rate |
|---:|---|
| 0–1 | 87–100% |
| 2 | 78–88% |
| 3 string params | 43–67% |
| 4 string params | **0%** |
We split one 4-arg test tool into four 2-arg per-operation tools. Pass rate for that category went from **0% to 47%** in a single change. Each step up the arity ladder costs roughly 20-30 percentage points.
Failure mechanism appears to be the model running out of generation budget mid-tool-call and emitting truncated JSON-shaped output that the PEG parser can’t repair. The implication for anyone shipping is: keep your tool signatures narrow. Three string args is the practical ceiling for reliable behavior on E2B + GPU. If you need more args, split into per-operation tools.
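A minimal sketch of that split, written in Python for brevity. Every name here (`edit_note`, `append_note`, `add_tag`, and so on) is hypothetical and chosen for illustration; these are not the stub tools from our rig, and registration with LiteRT-LM is left out on purpose. The idea is that the wide function stays as the implementation while the model only ever sees the narrow wrappers.

```python
def edit_note(note_id: str, operation: str, text: str = "", tag: str = "") -> dict:
    """Wide 4-argument tool (hypothetical). Kept as the internal implementation."""
    return {"note_id": note_id, "operation": operation, "text": text, "tag": tag}


# Narrow per-operation tools: two string args each, operation fixed by the name.
def append_note(note_id: str, text: str) -> dict:
    return edit_note(note_id, "append", text=text)


def replace_note(note_id: str, text: str) -> dict:
    return edit_note(note_id, "replace", text=text)


def add_tag(note_id: str, tag: str) -> dict:
    return edit_note(note_id, "add_tag", tag=tag)


def remove_tag(note_id: str, tag: str) -> dict:
    return edit_note(note_id, "remove_tag", tag=tag)


# Only these names would be exposed to the model.
NARROW_TOOLS = {f.__name__: f for f in (append_note, replace_note, add_tag, remove_tag)}


def dispatch(name: str, args: dict) -> dict:
    """Route a parsed tool call to its narrow wrapper."""
    if name not in NARROW_TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return NARROW_TOOLS[name](**args)
```

The trade-off is a longer tool list in the prompt in exchange for shorter argument payloads per call, which is the direction the arity sweep favors.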
If you’ve seen a different breakpoint (lower or higher) in your own work, would love to hear it.
### 3. Routing-correctness ≠ execution-correctness — and the gap is huge
When we first measured tool-call accuracy by checking “did the right tool fire”, `notes-edit` (a category with 4 string params, our worst case) reported **87% pass**. After adding post-execution state assertions (“did the right note actually get modified with the right text”), the same cases dropped to **47% pass**. The 40-point gap was cases where the tool fired but operated on the wrong target or with mangled args.
If you’re evaluating any tool-using model and only checking “did the function get called”, you may be measuring a vanity metric. Worth threading post-execution state checks into your harness if you don’t already.
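To make the distinction concrete, here is a minimal sketch of the two checks side by side, in Python for brevity. It assumes a toy in-memory notes store and a hypothetical `notes_edit` executor; `ToolCall`, `routing_pass`, and `execution_pass` are illustrative names, not our rig's actual code.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def routing_pass(call: ToolCall, expected_tool: str) -> bool:
    """Routing-only check: did the right tool fire?"""
    return call.name == expected_tool


def apply_call(notes: dict, call: ToolCall) -> dict:
    """Hypothetical executor: run the call against a copy of the fixture."""
    after = dict(notes)
    if call.name == "notes_edit":
        after[call.args["note_id"]] = call.args["text"]
    return after


def execution_pass(notes: dict, call: ToolCall,
                   expected_tool: str, expected_after: dict) -> bool:
    """Honest check: right tool AND the resulting state matches the fixture."""
    return (routing_pass(call, expected_tool)
            and apply_call(notes, call) == expected_after)
```

A call such as `ToolCall("notes_edit", {"note_id": "n2", "text": "buy oat milk"})`, when the user asked to edit `n1`, passes `routing_pass` but fails `execution_pass`. That is the kind of case hiding in the 40-point gap.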
## What would meaningfully help
Briefly, in case any of these prompt useful conversation:
- **Constrained decoding for tool-call output.** Even a brace-balance check would eliminate most `bad_json`.
- **A way to inspect or hook the raw token stream BEFORE LiteRT-LM’s PEG parser.** We can detect malformed `<|tool_call>` blocks but can’t repair + re-feed.
- **Sampler-order verification** for Gemma 4 (Unsloth docs say it expects `temperature → top_p → top_k`; we can’t confirm what LiteRT-LM does).
- **Chat template inspection / override** for `.litertlm` packages.
- **FunctionGemma (270M) in `.litertlm` format**, which would let on-device integrators run a router/worker split (FunctionGemma for tool routing, Gemma 4 E2B for the natural-language reply).
- **TurboQuant integration roadmap** for LiteRT-LM, particularly for KV cache.
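For reference, the brace-balance check mentioned in the first ask is small. The sketch below is a post-hoc validator in Python, not constrained decoding, and it assumes the tool-call payload is JSON-shaped text; `is_balanced` is a hypothetical helper name. It only shows how cheaply a truncated payload can be detected.

```python
def is_balanced(payload: str) -> bool:
    """True if all {} and [] outside string literals close in order,
    and no string literal is left open (the typical truncation signature)."""
    closers = {"}": "{", "]": "["}
    stack = []
    in_string = False
    escaped = False
    for ch in payload:
        if in_string:
            if escaped:
                escaped = False      # this char was escaped, skip it
            elif ch == "\\":
                escaped = True       # next char is escaped
            elif ch == '"':
                in_string = False    # string literal closed
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False         # closer with no matching opener
    return not stack and not in_string
```

A balanced payload can still be invalid JSON or carry wrong values, so this catches truncation only; it would sit in front of, not instead of, a real parse and the post-execution checks.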
## Numbers from a recent run
```
Cases captured: 325 / 357
Overall pass (honest): 71.4%
bad_json: 8.9% (parse failures + silent drops)
wrong_args: 4.9% (right tool, wrong values)
wrong_target: 0.6% (right tool, wrong target)
Latency p50: 9.3s
Latency p95: 26.4s
```
The honest pass rate is ~6 points lower than the routing-only rate, roughly in line with the combined `wrong_args` and `wrong_target` buckets (4.9% + 0.6% = 5.5%). That is the gap proper post-execution verification surfaces.
## Closing
Posting partly to surface the bugs but mostly to compare notes. If you’re shipping anything tool-using on Gemma 4 E2B + LiteRT-LM, what failure modes are you seeing? Are there mitigations beyond the obvious ones (smaller arg counts, GPU shader warmup, parse-error → conversation rebuild) that have helped you?
Full technical write-up with all 10 findings + workarounds we’ve ended up using: “Field report: Gemma 4 E2B + LiteRT-LM 0.10.x on Mali-G715 — 10 findings from sustained Android testing” (issue #2202 on google-ai-edge/LiteRT-LM, GitHub).
Thanks.