Hi folks.
I have a requirement that each batch of data contain an equal number of samples from each of the given classes.
I am currently implementing it the naive way for CIFAR-10:
import numpy as np
import tensorflow as tf

AUTO = tf.data.AUTOTUNE  # BATCH_SIZE, sampled_train, sampled_labels, agumentation defined earlier

def support_sampler():
    idx_dict = dict()
    for class_id in np.arange(0, 10):
        # Global indices of the samples belonging to this class.
        subset_idx = np.where(sampled_labels == class_id)[0]
        # Draw 16 of them at random, without replacement.
        idx_dict[class_id] = np.random.choice(subset_idx, 16, replace=False)
    return np.concatenate(list(idx_dict.values()))
def get_support_ds():
    # Draw a fresh class-balanced set of indices (16 per class, 160 total).
    random_balanced_idx = support_sampler()
    temp_train, temp_labels = (
        sampled_train[random_balanced_idx],
        sampled_labels[random_balanced_idx],
    )
    support_ds = tf.data.Dataset.from_tensor_slices((temp_train, temp_labels))
    support_ds = (
        support_ds
        .shuffle(BATCH_SIZE * 1000)
        .map(agumentation, num_parallel_calls=AUTO)
        .batch(BATCH_SIZE)
    )
    return support_ds
Is there a better way, particularly one using pure TF ops with tf.data?
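For context, the closest thing I have found is tf.data.experimental.sample_from_datasets. Here is a minimal, untested sketch of what I mean, reusing the names from above. Note that this only balances batches approximately, since classes are drawn stochastically rather than exactly 16 per batch:

# One repeating, shuffled dataset per class, interleaved with uniform
# weights so every batch is (approximately) class-balanced, all in tf.data.
per_class = [
    tf.data.Dataset.from_tensor_slices(
        (sampled_train[sampled_labels == class_id],
         sampled_labels[sampled_labels == class_id])
    ).repeat().shuffle(1024)
    for class_id in range(10)
]
balanced_ds = (
    tf.data.experimental.sample_from_datasets(per_class, weights=[0.1] * 10)
    .map(agumentation, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
)

Would something like this be the recommended direction, or is there a cleaner way to get exactly balanced batches?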