How to get the label distribution of a `tf.data.Dataset` efficiently?

Sayak_Paul · March 8, 2022, 2:43am

The naive option is to use something like this:

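import collections

import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
data_np = np.random.choice(num_classes, num_samples)

# Assumption: the original post never defines `dataset`. Here it yields
# (label, feature) pairs, with the labels reused as placeholder features.
dataset = tf.data.Dataset.from_tensor_slices((data_np, data_np))

# Count labels by iterating over every element in Python.
y = collections.defaultdict(int)
for cls, _ in dataset:
  y[int(cls.numpy())] += 1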

Bhack · March 8, 2022, 2:44pm

If you are looking for a non-NumPy solution, there was an API request at:

https://github.com/tensorflow/datasets/issues/2902

Bhack · March 8, 2022, 2:51pm

With numpy you could use many solutions like:
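
For instance, `np.unique(..., return_counts=True)` and `np.bincount` both count labels once they sit in a NumPy array. A minimal sketch, assuming integer labels that fit in memory; the `labels` array and the fixed seed are illustrative choices:

```python
import numpy as np

num_classes = 2
num_samples = 10000

# Illustrative labels; in practice these would be collected from the dataset.
rng = np.random.default_rng(0)
labels = rng.integers(0, num_classes, size=num_samples)

# Option 1: unique values with their counts (works for any hashable dtype).
classes, counts = np.unique(labels, return_counts=True)
label_distribution = dict(zip(classes.tolist(), counts.tolist()))

# Option 2: dense count vector indexed by class ID (non-negative ints only).
bincounts = np.bincount(labels, minlength=num_classes)

print(label_distribution)
print(bincounts)
```

Both options need the full label array in memory, which is the scaling concern for large datasets.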

Sayak_Paul · March 8, 2022, 3:10pm

Not sure how these methods would scale.

I would give this one a try:

Bhack · March 8, 2022, 3:21pm

It does something similar, iterating over the full dataset, but in C++:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/reduce_dataset_op.cc#L89-L102

          // Iterate through the input dataset.
          while (true) {
            if (ctx->cancellation_manager()->IsCancelled()) {
              return errors::Cancelled("Operation was cancelled");
            }
            std::vector<Tensor> next_input_element;
            bool end_of_input;
            TF_RETURN_IF_ERROR(
                iterator->GetNext(&iter_ctx, &next_input_element, &end_of_input));
            if (end_of_input) {
              break;
            }
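
The kernel above is the op behind `tf.data.Dataset.reduce`, so the counting can stay inside the tf.data runtime instead of a Python loop. A minimal sketch, assuming integer labels in `[0, num_classes)` and a dataset of `(label, feature)` pairs; `features_np`, `label_ds`, and the batch size of 1024 are illustrative choices:

```python
import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
labels_np = np.random.choice(num_classes, num_samples).astype(np.int64)
features_np = np.zeros((num_samples, 1), dtype=np.float32)  # placeholder features
dataset = tf.data.Dataset.from_tensor_slices((labels_np, features_np))

# Keep only the labels and batch them so each reduce step handles many at once.
label_ds = dataset.map(lambda cls, _: cls).batch(1024)

def add_batch_counts(counts, batch_labels):
  # One-hot encode the batch and sum over it to get per-class counts.
  one_hot = tf.one_hot(batch_labels, depth=num_classes, dtype=tf.int64)
  return counts + tf.reduce_sum(one_hot, axis=0)

# The state is a length-`num_classes` vector of running counts.
counts = label_ds.reduce(tf.zeros([num_classes], dtype=tf.int64), add_batch_counts)
print(counts.numpy())
```

The result is a dense vector indexed by class ID. This still reads every element once, but the loop runs in C++ rather than in Python.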
