Exploring a fail-closed, pre-semantic safety kernel for LLMs (research prototype)

Hi everyone,

I’m currently working on a small, focused research project around AI safety, and I’d love early technical feedback from people building with Gemini and related systems.

The project explores a simple question:
Can we introduce a deterministic, pre-semantic safety layer that halts model operation based purely on internal signals — without interpreting meaning, intent, or content?

What I’ve built so far:

  • a minimal no_std Rust safety kernel
  • deterministic, fail-closed behavior (once halted, always halted)
  • no allocation, no side effects, FFI-safe
  • designed to be auditable and suitable for formal specification
  • accompanied by a TLA+ model of the monotonic halt property
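For context, the monotonic halt property the TLA+ model checks amounts to a single temporal invariant — sketched below in TLA-style temporal logic; the actual spec’s state names may differ:

```latex
% Once the kernel enters the halted state, it never leaves it:
\Box \, (\mathit{Halted} \Rightarrow \Box \, \mathit{Halted})
```

In other words, halting is a latch, not a transient state: no subsequent input can restore CONTINUE.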

The kernel itself does not “understand” language.
It only evaluates numerical deviation signals provided by a host system and makes a binary decision: CONTINUE or ATOMIC HALT.
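To make the shape of that decision concrete, here is a minimal sketch of what such a latched, fail-closed core might look like. This is illustrative only — the names, the single-threshold comparison, and the `u32` signal type are my assumptions for the example, not the actual API:

```rust
/// Binary decision returned to the host. `#[repr(C)]` keeps the
/// type FFI-safe so a non-Rust host can consume it directly.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(C)]
pub enum Decision {
    Continue,
    AtomicHalt,
}

/// Fail-closed kernel state: no allocation, no side effects,
/// nothing but a latch bit and a fixed threshold.
pub struct SafetyKernel {
    halted: bool,
    threshold: u32,
}

impl SafetyKernel {
    pub const fn new(threshold: u32) -> Self {
        Self { halted: false, threshold }
    }

    /// Evaluate one deviation signal from the host.
    /// Monotonic: once `halted` is set, it is never cleared,
    /// so every later call returns `AtomicHalt` regardless of input.
    pub fn evaluate(&mut self, deviation: u32) -> Decision {
        if self.halted || deviation > self.threshold {
            self.halted = true; // latch: fail-closed, never reset
            Decision::AtomicHalt
        } else {
            Decision::Continue
        }
    }
}
```

Usage-wise, the host feeds one signal per step and must treat `AtomicHalt` as terminal — the sketch deliberately offers no reset path, which is what makes the halt property trivially monotonic.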

The motivation is not content filtering, but exploring whether internal model dynamics could act as an early safety signal — especially for catastrophic or runaway behaviors.

This is very much a research prototype, not a production claim.

What I’m looking for:

  • thoughts on whether this kind of pre-semantic, fail-closed component could be useful in real systems
  • feedback from people integrating safety mechanisms around Gemini or other frontier models
  • pointers to similar work I should study or compare against

If this resonates with your work, I’d be happy to share the repo and discuss details.

Thanks for reading.