Gemma 4 E2B + LiteRT-LM 0.10.x on Mali GPU — patterns from a structured tool-calling eval

Posting findings from a sustained testing effort against `litert-community/gemma-4-E2B-it-litert-lm` running on a Pixel 10a (Tensor G4 / Mali-G715, Android 14) via LiteRT-LM’s Kotlin `Conversation` API with native function calling. Built a structured eval rig (357 single-turn cases × 5 phrasing styles + 29 multi-turn scenarios + fixture-based pre/post-execution verification) and ran it 10+ times to characterize behavior. Sharing what surfaced that I don’t see widely discussed yet.

A longer technical write-up with all 10 findings + asks lives on the LiteRT-LM GitHub: google-ai-edge/LiteRT-LM issue #2202.

## The three findings worth your time

### 1. The GPU precision bug from HF discussion #10 is real and dominant

google/gemma-4-E2B-it discussion #10 (“Allucination parsing JSON responses containing numerical data such as 1234.00”) reports number corruption on GPU + Conversation API. We’re hitting it from a totally different stack and seeing the same symptoms, plus a related contraction-token failure mode. Specifically:

- **Numbers in tool-call args get mangled.** `1029.3` emitted as `1.0,293`; `2026` as `202026`. Reproducible across multiple tool categories that take numeric strings.

- **Contractions tokenize wrong.** The model emits `I’X` patterns where `X` is a noun, instead of `I’ve X`: the `'ve` / `'d` / `'m` / `haven't` second half is dropped between `I’` and the next word. Visible both in user-facing prose and in tool-call arguments.

- **6–9% of all our cases bucket as `bad_json`** — silent tool-call drops where the parser couldn’t recover.

CPU backend doesn’t show these. Discussion #10 has been open ~30 days with a Google engineer ack but no fix released. If anyone else is also hitting this on a different stack, would be valuable to compare notes — a second public confirmation may help prioritization.
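For anyone wanting to screen their own transcripts for the two symptoms above, here is a minimal heuristic sketch. The regexes and function names are illustrative assumptions, not part of our rig or of LiteRT-LM. It only catches malformed shapes such as `1.0,293`; a corruption like `2026` → `202026` still looks like a valid number and needs a comparison against the expected value.

```python
import re

# Plain decimal number: optional sign, digits, optional fractional part.
_NUMERIC = re.compile(r"-?\d+(?:\.\d+)?")

# "I" + apostrophe followed directly by something other than a normal
# contraction suffix ('ve, 'd, 'm, 'll), for example "I’noted".
_BROKEN_CONTRACTION = re.compile(r"\bI['’](?!(?:ve|d|m|ll)\b)\w+")


def looks_like_number(value: str) -> bool:
    """True if the string is a well-formed plain decimal number."""
    return _NUMERIC.fullmatch(value.strip()) is not None


def has_broken_contraction(text: str) -> bool:
    """True if the text contains an I’X pattern with the suffix missing."""
    return _BROKEN_CONTRACTION.search(text) is not None
```

Running `looks_like_number` over every numeric-string argument and `has_broken_contraction` over replies gives a rough per-run count to compare between GPU and CPU backends.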

### 2. Tool-call success inversely tracks argument count, very cleanly

This was the most actionable thing we found. Across 14 stub tools we built specifically to sweep parameter arity:

| Args | Honest pass rate |
|---:|---:|
| 0–1 | 87–100% |
| 2 | 78–88% |
| 3 string params | 43–67% |
| 4 string params | **0%** |

We collapsed one 4-arg test tool into four 2-arg per-operation tools. Pass rate for that category went from **0% to 47%** in a single change. Each additional argument lops off roughly 20–30 percentage points.

Failure mechanism appears to be the model running out of generation budget mid-tool-call and emitting truncated JSON-shaped output that the PEG parser can’t repair. The implication for anyone shipping is: keep your tool signatures narrow. Three string args is the practical ceiling for reliable behavior on E2B + GPU. If you need more args, split into per-operation tools.
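A minimal sketch of the split, to make the shape concrete. The tool names, parameters, and in-memory store here are hypothetical and are not the stubs from our rig. The idea is that the operation and position move out of the arguments and into the tool name, so the model only has to emit two strings per call.

```python
from typing import Dict

NOTES: Dict[str, str] = {}  # hypothetical in-memory note store: title -> body


def notes_edit(title: str, operation: str, text: str, position: str) -> str:
    """Wide 4-arg tool (hypothetical): the kind of signature that failed."""
    body = NOTES.get(title, "")
    if operation == "insert":
        body = text + body if position == "start" else body + text
    elif operation == "replace":
        body = text
    elif operation == "remove":
        body = body.replace(text, "")
    else:
        raise ValueError(f"unknown operation: {operation}")
    NOTES[title] = body
    return body


# Narrow 2-arg tools: operation and position are fixed by the tool name.
def notes_append(title: str, text: str) -> str:
    return notes_edit(title, "insert", text, "end")


def notes_prepend(title: str, text: str) -> str:
    return notes_edit(title, "insert", text, "start")


def notes_replace(title: str, text: str) -> str:
    return notes_edit(title, "replace", text, "")


def notes_remove(title: str, text: str) -> str:
    return notes_edit(title, "remove", text, "")
```

Only the four narrow functions would be registered with the model; the wide function stays as an internal implementation detail.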

If you’ve seen a different breakpoint (lower or higher) in your own work, would love to hear it.

### 3. Routing-correctness ≠ execution-correctness — and the gap is huge

When we first measured tool-call accuracy by checking “did the right tool fire”, `notes-edit` (a category with 4 string params, our worst case) reported **87% pass**. After adding post-execution state assertions (“did the right note actually get modified with the right text”), the same cases dropped to **47% pass**. The 40-point gap was cases where the tool fired but operated on the wrong target or with mangled args.

If you’re evaluating any tool-using model and only checking “did the function get called”, you may be measuring a vanity metric. Worth threading post-execution state checks into your harness if you don’t already.
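A minimal sketch of the difference between the two checks, assuming a fixture that is just a dict of note titles to bodies. The `ToolCall` type, the `notes_append` tool, and the helper names are hypothetical stand-ins, not our actual harness.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str]


def apply_call(state: Dict[str, str], call: ToolCall) -> Dict[str, str]:
    """Execute a (hypothetical) notes_append call against a copy of the fixture."""
    new_state = dict(state)
    if call.name == "notes_append":
        title = call.args["title"]
        new_state[title] = new_state.get(title, "") + call.args["text"]
    return new_state


def routing_pass(call: ToolCall, expected_tool: str) -> bool:
    """Routing-only check: did the right tool fire?"""
    return call.name == expected_tool


def state_pass(state: Dict[str, str], call: ToolCall,
               expected_state: Dict[str, str]) -> bool:
    """Post-execution check: does the resulting state match the expectation?"""
    return apply_call(state, call) == expected_state
```

A call that appends to the wrong note passes `routing_pass` and fails `state_pass`, which is the kind of case that made up the 40-point gap.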

## What would meaningfully help

Briefly, in case any of these prompt useful conversation:

- **Constrained decoding for tool-call output.** Even a brace-balance check would eliminate most `bad_json`.

- **A way to inspect or hook the raw token stream BEFORE LiteRT-LM’s PEG parser.** We can detect malformed `<|tool_call>` blocks but can’t repair + re-feed.

- **Sampler-order verification** for Gemma 4 (Unsloth docs say it expects `temperature → top_p → top_k`; we can’t confirm what LiteRT-LM does).

- **Chat template inspection / override** for `.litertlm` packages.

- **FunctionGemma (270M) in `.litertlm` format**, which would let on-device integrators run a router/worker split (FunctionGemma for tool routing, Gemma 4 E2B for the natural-language reply).

- **TurboQuant integration roadmap** for LiteRT-LM, particularly for KV cache.
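On the first ask above: the brace-balance check itself is small. Below is a string-aware sketch that operates on already-generated text. It is not constrained decoding and not a LiteRT-LM hook (we are not aware of one); it only shows what the check would need to track. The function name is an assumption for illustration.

```python
def braces_balanced(text: str) -> bool:
    """String-aware check that {} and [] are balanced and properly nested."""
    closers = {"}": "{", "]": "["}
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Inside a JSON string, braces are data; only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    # Truncated output shows up as leftover openers or an unterminated string.
    return not stack and not in_string
```

Applied after the fact, this only tells you a tool call was truncated; applied during decoding, the same state (open-bracket stack plus string flag) is what a sampler would need in order to refuse an end-of-sequence token while the structure is still open.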

## Numbers from a recent run

```
Cases captured: 325 / 357
Overall pass (honest): 71.4%
bad_json: 8.9% (parse failures + silent drops)
wrong_args: 4.9% (right tool, wrong values)
wrong_target: 0.6% (right tool, wrong target)
Latency p50: 9.3s
Latency p95: 26.4s
```

The honest pass rate is ~6 points lower than the routing-only rate (4.9% `wrong_args` + 0.6% `wrong_target` ≈ 5.5 points), which is exactly the gap that proper post-execution verification surfaces.

## Closing

Posting partly to surface the bugs but mostly to compare notes. If you’re shipping anything tool-using on Gemma 4 E2B + LiteRT-LM, what failure modes are you seeing? Are there mitigations beyond the obvious ones (smaller arg counts, GPU shader warmup, parse-error → conversation rebuild) that have helped you?

Full technical write-up with all 10 findings + workarounds we’ve ended up using: “Field report: Gemma 4 E2B + LiteRT-LM 0.10.x on Mali-G715 — 10 findings from sustained Android testing” (google-ai-edge/LiteRT-LM, issue #2202).

Thanks.

---

# Follow-up: ruling out the MTP rollout as a dominant cause

A reader of an adjacent forum post pointed at the recent MTP (Multi-Token Prediction) rollout in the `gemma-4-E2B-it-litert-lm` package as a possible regression source. We re-ran the full eval against the pre-MTP build to test the hypothesis. Posting the result because a negative finding is also useful.

## Setup

- **Pre-MTP build:** `gemma-4-E2B-it.litertlm` from HF commit `7fa1d78` (April 1), pulled via `resolve//`.

- **MTP build (control):** current `main` `.litertlm` from the same repo.

- Same eval rig, same 357 single-turn cases + 29 scenarios, same device (Pixel 10a / Mali-G715 / Android 14), same sampler config (`topK=64, topP=0.95, temperature=1.0`), same fixture+assertion verification, same backend (GPU).

- Each ran clean to completion (357/357), no stalls.

## Headline numbers

| Metric | MTP | Pre-MTP | Δ |
|---|---:|---:|---:|
| Overall honest pass | 71.4% | 72.3% | +0.9 pp |
| `bad_json` rate | 8.9% | 8.4% | −0.5 pp |
| `wrong_args` rate | 4.9% | 4.5% | −0.4 pp |
| Latency p50 | 9.3s | 8.8s | −0.5s |
| Latency p95 | 26.4s | 19.7s | **−6.7s** |

## High-arity categories (the predicted recovery zone)

| Category (arity) | MTP | Pre-MTP | Δ |
|---|---:|---:|---:|
| notes-edit (4 → 2 split) | 47.4% | 58.3% | +10.9 pp |
| notes-create (3-arg) | 43.5% | 48.1% | +4.6 pp |
| contacts (3-arg) | 52.0% | 52.0% | **0** |
| multi-intent | 62.5% | 62.5% | **0** |

n is 24–27 per category, so anything under ~10 pp is comfortably inside run-to-run noise on this rig.
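As a rough cross-check on that noise floor, here is a sketch of the binomial standard error for the difference between two pass rates. It assumes independent cases and n = 24 on both builds for notes-edit, which is an assumption for illustration. It models sampling noise only, not the measured run-to-run variance on this rig.

```python
import math


def diff_standard_error(p1: float, n1: int, p2: float, n2: int) -> float:
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def diff_in_standard_errors(p1: float, n1: int, p2: float, n2: int) -> float:
    """How many standard errors apart the two observed rates are."""
    return abs(p2 - p1) / diff_standard_error(p1, n1, p2, n2)


# notes-edit from the table above: 47.4% (MTP) vs 58.3% (pre-MTP), n assumed 24 each.
se = diff_standard_error(0.474, 24, 0.583, 24)
ratio = diff_in_standard_errors(0.474, 24, 0.583, 24)
print(f"SE of difference: {se * 100:.1f} pp")
print(f"Observed gap in SE units: {ratio:.2f}")
```

Under those assumptions the standard error of the difference comes out at roughly 14 pp, so even the +10.9 pp on notes-edit sits under one standard error.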

## Reading

1. **MTP is not the dominant root cause** of the failure modes documented in the original issue. The categories most expected to recover (contacts and multi-intent) moved exactly zero. notes-edit / notes-create show modest gains but within plausible noise.

2. **`bad_json` is structural, not MTP-introduced.** 8.4% vs 8.9% is not a meaningful difference at this sample size. The within-category distribution is also stable: notes-create still produces ~44% `bad_json` (12 of 27) on both builds, and the same edge / multi-intent failures recur. This is consistent with finding #1 (GPU numerical precision corruption per HF discussion #10) and finding #2 (argument-count → generation-budget exhaustion) being the actual structural causes.

3. **Latency is genuinely better pre-MTP.** p95 drops 6.7s, plausibly because there is no drafter-verify overhead. That’s a real but separate observation; it doesn’t change the correctness picture.

4. **Reducing argument arity remains the highest-leverage change.** Splitting one 4-arg tool into four 2-arg per-operation tools moved its category from 0% → 47%, roughly 10× the impact of swapping the model build.

So the original issue’s findings stand as written. The MTP rollout is exonerated as the cause of the tool-calling failures we see.

Happy to share the per-category JSON if useful.