Dear TensorFlow Community,
We’ve developed and released a diagnostic SDK called CollapseCleaner to address a class of runtime issues that are commonly reported but hard to reproduce:
- Retained background threads (e.g., stuck DataLoader or training workers)
- Ambiguous tensor shapes that interfere with graph freezing and export
- CUDA memory leaks that persist even after session/epoch termination
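The first symptom in this list can be observed with nothing beyond the standard library. The sketch below is illustrative only (it is not part of CollapseCleaner): it lists non-daemon threads, other than the main thread, that are still alive after a job has nominally finished, which is the kind of residue a stuck worker pool leaves behind.

```python
import threading

def find_retained_threads():
    """Return live non-daemon threads other than the main thread --
    the kind of residue a worker pool can leave behind."""
    main = threading.main_thread()
    return [t for t in threading.enumerate()
            if t is not main and t.is_alive() and not t.daemon]

# Simulate a worker that outlives its "training job".
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="stuck-worker")
worker.start()

leaked = find_retained_threads()
print([t.name for t in leaked])

stop.set()      # Release the worker so the process can exit cleanly.
worker.join()
```

A non-daemon thread like this keeps the interpreter alive at shutdown, which is why such residues surface as hung jobs rather than errors.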
Our investigation, originally rooted in system-wide introspection through the WaveMind AI architecture, identified a deeper structural phenomenon we termed:
Eclipse Leaks: hidden residuals in memory/graph space that survive standard cleanup mechanisms.
These artifacts can degrade runtime performance over time, especially in long-running training or serving jobs. The phenomenon aligns with findings from arXiv:2502.12115, which reported:
“10–25% of GPU inefficiencies arise from retained memory artifacts invisible to the user.”
Paper reference: arXiv:2502.12115
Core capabilities:
```python
from collapsecleaner import (
    clean_orphaned_threads,
    freeze_tensor_shape,
    detect_unreleased_cuda_contexts,
)

clean_orphaned_threads()           # Clears silent thread residues
freeze_tensor_shape(model)         # Locks dynamic tensor dimensions
detect_unreleased_cuda_contexts()  # (Beta) Flags unreleased CUDA allocators
```
CollapseCleaner is designed to run as a pre-cleaner/post-cleaner in training loops, model-conversion workflows, and CI/CD steps.
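As a sketch of that pre-/post-cleaner placement, assuming the `collapsecleaner` API listed above (everything else here, including `train_one_epoch` and the epoch count, is illustrative; the import guard makes the snippet degrade to no-ops where the package is not installed):

```python
# Illustrative integration sketch, not a definitive implementation.
try:
    from collapsecleaner import (
        clean_orphaned_threads,
        detect_unreleased_cuda_contexts,
    )
except ImportError:
    # Package not installed: fall back to no-ops so the loop still runs.
    clean_orphaned_threads = lambda: None
    detect_unreleased_cuda_contexts = lambda: None

completed = []

def train_one_epoch(epoch):
    """Placeholder for a real training step."""
    completed.append(epoch)

for epoch in range(2):
    clean_orphaned_threads()            # Pre-cleaner: drop residue from the previous epoch.
    train_one_epoch(epoch)
    detect_unreleased_cuda_contexts()   # Post-cleaner (beta): flag leaked allocators.
```

The same wrapping pattern would apply around a model-conversion step or a CI/CD job stage.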
Origin and Motivation
This work emerged from a collapse-analysis framework in WaveMind, where we observed long-term accumulation of untracked execution residues.
Full background and technical breakdown:
CollapseCleaner – The Invisible Leak Draining Billions from AI (LinkedIn Post)
We’d love to hear your feedback. We’re particularly interested in integration suggestions, and in any reproducible cases from your pipelines where this tool could help.
Thank you.