Transformer for asynchronous multi-stream image time-series with online prediction?

I have two streams of images, each stream corresponding to a different “channel” (e.g. different sensor modality). The streams are not synchronized — at any given moment, a new image arrives from one stream or the other, each with a real-valued timestamp. I want to classify the sequence online, i.e. produce an updated prediction after every new incoming image.

Key constraints:

  • Spatial features within each image matter (not just a scalar summary)
  • Timestamps are irregular and not aligned across streams
  • Prediction must improve causally as more observations arrive

The natural design seems to be: a per-image ViT encoder → a causal transformer over the merged token stream, with real-valued timestamp embeddings (e.g. Time2Vec) in place of integer positional indices, plus a stream/channel-ID embedding so the model knows which sensor each token came from.
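To make the design concrete, here is a minimal PyTorch sketch of what I have in mind. It is illustrative only: the per-image encoder is a tiny patch-embedding + mean-pool stand-in for a real ViT, `Time2Vec` follows the usual one-linear-term-plus-sinusoids formulation, and all names (`AsyncStreamClassifier`, the hyperparameters) are made up for this example.

```python
import torch
import torch.nn as nn


class Time2Vec(nn.Module):
    """Time2Vec: one linear term plus (d_model - 1) sinusoidal terms
    over a real-valued timestamp, replacing integer positional indices."""

    def __init__(self, d_model):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))

    def forward(self, t):                     # t: (B, T) real-valued timestamps
        x = t.unsqueeze(-1) * self.w + self.b  # (B, T, d_model)
        return torch.cat([x[..., :1], torch.sin(x[..., 1:])], dim=-1)


class AsyncStreamClassifier(nn.Module):
    """Hypothetical sketch: per-image encoder -> causal transformer over the
    merged (asynchronous) token stream, one prediction per arrival."""

    def __init__(self, patch=8, d_model=64, n_streams=2, n_classes=3):
        super().__init__()
        # Stand-in per-image encoder: patch embedding + mean pool.
        # A full ViT (returning its CLS token) would slot in here.
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        self.time2vec = Time2Vec(d_model)
        self.stream_embed = nn.Embedding(n_streams, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, images, timestamps, stream_ids):
        # images: (B, T, 1, H, W); timestamps: (B, T); stream_ids: (B, T) int
        B, T = images.shape[:2]
        tok = self.patch_embed(images.flatten(0, 1)).flatten(2).mean(-1)
        tok = tok.view(B, T, -1)
        # Timestamp and stream-ID embeddings are added to the image tokens.
        tok = tok + self.time2vec(timestamps) + self.stream_embed(stream_ids)
        # Causal mask: each token attends only to itself and earlier arrivals.
        causal = torch.triu(
            torch.full((T, T), float('-inf'), device=tok.device), diagonal=1)
        h = self.backbone(tok, mask=causal)
        # (B, T, n_classes): an updated class prediction after every image.
        return self.head(h)
```

Because the attention mask is causal, the prediction at step *t* depends only on images 0..*t*, so at inference time you can emit `head(h)[:, -1]` after each new arrival (or cache keys/values for efficiency).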

Is there an existing architecture or paper that handles this exact setup? Or is this a known gap?