How to get the label distribution of a `tf.data.Dataset` efficiently?

The naive option is to use something like this:

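import collections

import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
data_np = np.random.choice(num_classes, num_samples)
# Build a dataset of (label, features) pairs; the labels double as dummy features here.
dataset = tf.data.Dataset.from_tensor_slices((data_np, data_np))

# Count labels one element at a time in Python (slow for large datasets).
y = collections.defaultdict(int)
for cls, _ in dataset:
  y[int(cls.numpy())] += 1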

If you are looking for a non-NumPy solution, there was an API request at:

https://github.com/tensorflow/datasets/issues/2902

With NumPy there are many possible solutions, such as:
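As a minimal sketch, assuming the labels are already available as a NumPy array of non-negative integers (the names `labels`, `dist_unique`, and `dist_bincount` are illustrative), two common options are `np.unique` with `return_counts=True` and `np.bincount`:

```python
import numpy as np

num_classes = 2
num_samples = 10000
# Stand-in labels; in practice these would be extracted from the dataset.
labels = np.random.choice(num_classes, num_samples)

# Option 1: np.unique returns the distinct labels and how often each occurs.
classes, counts = np.unique(labels, return_counts=True)
dist_unique = dict(zip(classes.tolist(), counts.tolist()))

# Option 2: np.bincount counts non-negative integer labels by index;
# minlength keeps a slot for classes that never appear.
dist_bincount = np.bincount(labels, minlength=num_classes)
```

Both approaches need all labels in memory at once, which may be a problem for datasets that do not fit in RAM.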

I am not sure how these methods would scale to large datasets.

I would give this one a try:
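One candidate, sketched here under the assumption that `Dataset.reduce` suits the use case, accumulates a per-class count vector inside the tf.data pipeline. The helper `add_label` and the dummy dataset are hypothetical and only for illustration:

```python
import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
labels_np = np.random.choice(num_classes, num_samples).astype(np.int64)
# Hypothetical dataset of (label, features) pairs; labels reused as dummy features.
dataset = tf.data.Dataset.from_tensor_slices((labels_np, labels_np))

def add_label(counts, element):
    # Add a one-hot vector for this element's label to the running counts.
    cls, _ = element
    return counts + tf.one_hot(cls, num_classes, dtype=tf.int64)

# The state is a vector with one counter per class, starting at zero.
initial = tf.zeros([num_classes], dtype=tf.int64)
counts = dataset.reduce(initial, add_label)
```

Here `counts.numpy()` gives the number of samples per class. The reduction function is traced once and the loop over elements runs inside the TensorFlow runtime, so it avoids the per-element Python overhead of the naive loop, although it still visits every element.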

It does something similar, iterating over the full dataset, but in C++: