Randomly sampling points while ensuring an equal number per class

Hi folks.

Currently, I have a requirement for a batch of data that should have an equal number of samples from each of the given classes.

I am implementing it using the naive way for CIFAR10:

def support_sampler():
    idx_dict = dict()
    for class_id in np.arange(0, 10):
        # Positions in sampled_labels that belong to this class.
        class_idx = np.where(sampled_labels == class_id)[0]
        # Draw 16 of those positions (np.random.choice samples with replacement by default).
        idx_dict[class_id] = np.random.choice(class_idx, 16)
    return np.concatenate(list(idx_dict.values()))

def get_support_ds():
    random_balanced_idx = support_sampler()
    temp_train, temp_labels = sampled_train[random_balanced_idx],\
        sampled_labels[random_balanced_idx]
    support_ds = tf.data.Dataset.from_tensor_slices((temp_train, temp_labels))
    support_ds = (
        support_ds
        .shuffle(BATCH_SIZE * 1000)
        .map(agumentation, num_parallel_calls=AUTO)
        .batch(BATCH_SIZE)
    )
    return support_ds

Is there a better way? Particularly using pure TF ops with tf.data?

Here the approach I used was to make a dataset for each class, and then merge them.

I used sample_from_datasets, so it’s only approximately equal. But you could also zip the datasets and then .map a function to stack all the zipped tensors.

Thanks Mark. I later revisited that tutorial and found out about that neat method. It solved my purpose.

I think having a separate sampler utility for tf.data pipelines might be better from a usability standpoint.

There is also this “rejection resample” function (tf.data.experimental.rejection_resample).

Oh my. This is really neat. Thanks for sharing.

I need to extend the example for my use case.

@markdaoust here’s what I tried:

from collections import Counter

import numpy as np
import tensorflow as tf

def class_func(image, label):
    return label

SUPPORT_BATCH_SIZE = 640

(x_train, y_train), (_, _) = tf.keras.datasets.cifar10.load_data()

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()
sampled_labels = sampled_labels.astype("int32")
support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))

distribution = Counter(sampled_labels)
counts = np.array(list(distribution.values()))
fractions = counts/counts.sum().astype("float64")

target_distribution = np.array([0.1] * 10).astype("float64")
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=target_distribution, initial_dist=fractions)
support_ds = support_ds.apply(resampler).batch(SUPPORT_BATCH_SIZE)

Here’s the root error:

TypeError: Input 'y' of 'Less' Op has type float64 that does not match type float32 of argument 'x'.

Any idea what I might have missed out on? Here’s the Colab if you wanna give it a shot.

To get your code to work, replace:

fractions = counts/counts.sum().astype("float64")

target_distribution = np.array([0.1] * 10).astype("float64")

with:

fractions = counts/counts.sum()
fractions = fractions.astype("float32")

target_distribution = np.array([0.1] * 10).astype("float32")

The implementation is just being a bit careless with the dtypes.

Here it does random_ops.random_uniform([], seed=seed) < p.

That uniform random returns a float32, so p needs to be float32, or the call should say random_ops.random_uniform([], seed=seed, dtype=p.dtype).

Or it should assert that all those arguments are float32, or cast them to float32.
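
For future readers, here is a minimal NumPy-only sketch of building both distributions as float32 up front. It is not the code from the thread: the random labels are a hypothetical stand-in for sampled_labels, and swapping Counter for np.bincount is my own assumption. The reason for the swap is that Counter orders its values by first appearance rather than by class id, while the distributions passed to rejection_resample are indexed by class id.

```python
import numpy as np

# Hypothetical stand-in for the thread's `sampled_labels` (class ids 0..9).
rng = np.random.default_rng(0)
sampled_labels = rng.integers(0, 10, size=4000).astype("int32")

num_classes = 10

# np.bincount indexes counts by class id, so entry i is the count of class i.
counts = np.bincount(sampled_labels, minlength=num_classes)

# Cast both distributions to float32, the dtype produced by the uniform
# random draw they are compared against.
initial_dist = (counts / counts.sum()).astype("float32")
target_dist = np.full(num_classes, 1.0 / num_classes, dtype="float32")

print(initial_dist.dtype, target_dist.dtype)
```

These two arrays would then be passed as initial_dist and target_dist in place of fractions and target_distribution.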

That worked. Thank you.

Indeed, the dtype part was confusing.

Although the code is working fine, the distribution is not what I would expect (the expectation here is to have a uniform distribution across the labels). Here’s a batch-wise summary:

Counter({6: 73, 1: 72, 7: 71, 5: 67, 0: 65, 8: 64, 9: 63, 4: 57, 3: 55, 2: 53})
Counter({9: 74, 0: 70, 4: 70, 2: 69, 3: 68, 1: 66, 7: 62, 6: 56, 5: 53, 8: 52})
Counter({0: 75, 3: 71, 6: 70, 1: 69, 8: 64, 9: 63, 4: 63, 2: 60, 7: 55, 5: 50})
Counter({4: 74, 0: 72, 7: 72, 1: 67, 5: 66, 6: 65, 3: 63, 9: 59, 2: 52, 8: 50})
Counter({2: 78, 7: 78, 6: 75, 1: 68, 4: 62, 5: 62, 9: 56, 0: 56, 3: 55, 8: 50})

With 640 samples in each batch, I would expect 64 per class.

I tried another approach:

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()
sampled_labels = sampled_labels.astype("int32")
support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))

ds = []

for i in np.arange(0, 10):
    ds_label =  (
        support_ds
        .filter(lambda image, label: label==i)
        .repeat())
    ds.append(ds_label)

balanced_ds = tf.data.experimental.sample_from_datasets(
    ds, [0.1] * 10).batch(SUPPORT_BATCH_SIZE)

But here also when I do:

for samples, labels in balanced_ds.take(10):
    print(Counter(labels.numpy()))

the distribution does not come out as expected:

Counter({9: 74, 0: 73, 3: 71, 8: 70, 1: 70, 5: 67, 7: 64, 2: 55, 6: 51, 4: 45})
Counter({2: 76, 3: 70, 4: 68, 1: 67, 6: 64, 0: 62, 7: 62, 8: 60, 9: 56, 5: 55})
Counter({1: 78, 2: 75, 7: 74, 0: 68, 9: 67, 3: 61, 5: 58, 8: 55, 4: 54, 6: 50})
Counter({6: 82, 9: 69, 5: 68, 4: 64, 1: 63, 3: 62, 7: 62, 8: 61, 2: 56, 0: 53})
Counter({6: 76, 2: 69, 5: 69, 8: 68, 4: 67, 0: 66, 1: 59, 3: 59, 9: 55, 7: 52})
Counter({8: 77, 9: 71, 4: 68, 0: 66, 2: 66, 6: 66, 7: 64, 5: 62, 1: 60, 3: 40})
Counter({8: 86, 9: 66, 4: 65, 1: 64, 2: 62, 5: 61, 0: 60, 6: 60, 3: 58, 7: 58})
Counter({7: 75, 8: 73, 6: 70, 5: 70, 3: 68, 9: 64, 4: 61, 0: 55, 2: 53, 1: 51})
Counter({6: 78, 1: 70, 5: 67, 0: 66, 2: 66, 4: 64, 8: 60, 3: 58, 9: 56, 7: 55})
Counter({9: 75, 7: 70, 8: 69, 3: 67, 4: 65, 5: 63, 2: 62, 1: 57, 0: 57, 6: 55})

@markdaoust

Don’t trust a person’s ability to evaluate a probability distribution at a glance.

Here’s an independent implementation that gets equivalent results:

import numpy as np

for _ in range(10):
  d = np.zeros(10)
  for n in range(640):
    d[np.random.randint(10)] += 1
  print(sorted(d, reverse=True))
[79.0, 71.0, 69.0, 68.0, 65.0, 61.0, 60.0, 59.0, 58.0, 50.0]
[78.0, 70.0, 70.0, 68.0, 67.0, 64.0, 62.0, 57.0, 56.0, 48.0]
[78.0, 73.0, 70.0, 69.0, 67.0, 62.0, 59.0, 57.0, 53.0, 52.0]
[74.0, 71.0, 70.0, 68.0, 66.0, 61.0, 61.0, 60.0, 56.0, 53.0]
[77.0, 70.0, 67.0, 65.0, 65.0, 63.0, 62.0, 60.0, 57.0, 54.0]
[76.0, 73.0, 68.0, 67.0, 66.0, 61.0, 59.0, 58.0, 56.0, 56.0]
[74.0, 74.0, 70.0, 69.0, 68.0, 67.0, 65.0, 59.0, 48.0, 46.0]
[85.0, 69.0, 68.0, 66.0, 62.0, 61.0, 61.0, 59.0, 56.0, 53.0]
[73.0, 71.0, 67.0, 67.0, 65.0, 63.0, 61.0, 58.0, 58.0, 57.0]
[72.0, 70.0, 68.0, 67.0, 65.0, 63.0, 62.0, 60.0, 59.0, 54.0]

I’m not sure what the right statistical test is (something Dirichlet), but use a bigger sample size and you’ll see that it’s converging. With 1e6 samples, everything is within 1%:

d = np.random.randint(10, size=int(1e6))
counts, _ = np.histogram(d, bins=range(11))
counts
array([100254,  99351, 100098, 100162,  99747, 100369,  99793, 100247,
       100039,  99940])
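
On the question of a test: one common choice (my suggestion, not something named in the thread) is Pearson's chi-square goodness-of-fit test. The sketch below applies it to the first batch summary posted earlier in this thread, using only the standard library. The critical value is hard-coded from a standard chi-square table (9 degrees of freedom, 5% level) rather than computed.

```python
import math

# Per-class counts for classes 0..9, copied from the first batch summary
# posted earlier in this thread.
observed = [65, 72, 53, 55, 57, 67, 73, 71, 64, 63]

n = sum(observed)   # 640 samples in the batch
k = len(observed)   # 10 classes
expected = n / k    # 64 per class under a uniform distribution

# Each class count is Binomial(n, 1/k), so its standard deviation is
# sqrt(n * p * (1 - p)).
std = math.sqrt(n * (1 / k) * (1 - 1 / k))

# Pearson chi-square statistic against the uniform expectation.
chi2 = sum((o - expected) ** 2 / expected for o in observed)

# Critical value for k - 1 = 9 degrees of freedom at the 5% level,
# taken from a standard chi-square table.
CRITICAL_9DF_5PCT = 16.919

print(f"std per class ~ {std:.2f}")                              # ~ 7.59
print(f"chi2 = {chi2:.3f}, reject uniform: {chi2 > CRITICAL_9DF_5PCT}")  # 7.125, False
```

So a spread of roughly ±7.6 around 64 is what independent uniform sampling produces, and this batch gives no evidence against a uniform distribution.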

If you want to force exact balance, then with one dataset per class you can:

import tensorflow as tf

datasets = tuple(tf.data.Dataset.from_tensors(n).repeat() for n in range(10))
zipped = tf.data.Dataset.zip(datasets)
stacked = zipped.map(lambda *args: tf.stack(args, axis=0))

stacked.element_spec
TensorSpec(shape=(10,), dtype=tf.int32, name=None)
tf.data.experimental.get_single_element(stacked.take(1))
<tf.Tensor: shape=(10,), dtype=int32, 
  numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>
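
A rough sketch of how the same zip idea might carry over to labeled (feature, label) data. The toy arrays and the names per_class and merge are hypothetical and only there for illustration; each per-class dataset contributes per_class elements, so every merged batch is exactly balanced.

```python
import numpy as np
import tensorflow as tf

num_classes, per_class = 3, 2

# Toy data (hypothetical): 12 examples, 4 per class.
features = np.arange(12, dtype="float32").reshape(12, 1)
labels = np.array([0, 1, 2] * 4, dtype="int32")
ds = tf.data.Dataset.from_tensor_slices((features, labels))

def dataset_for_class(class_id):
    # Keep one class, loop forever, and emit `per_class` elements at a time.
    return (
        ds.filter(lambda x, y: tf.equal(y, class_id))
        .repeat()
        .batch(per_class)
    )

zipped = tf.data.Dataset.zip(
    tuple(dataset_for_class(c) for c in range(num_classes))
)

def merge(*pairs):
    # `pairs` holds one (features, labels) tuple per class; concatenate them
    # into a single batch of num_classes * per_class examples.
    xs = tf.concat([x for x, _ in pairs], axis=0)
    ys = tf.concat([y for _, y in pairs], axis=0)
    return xs, ys

balanced = zipped.map(merge)

for xs, ys in balanced.take(2):
    print(ys.numpy())
```

The labels inside each batch come out grouped by class, so add a shuffle inside the batch if the order matters.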

Thanks for the pointers.

@markdaoust it just keeps getting interesting:

Exactly what I wanted:

Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})

Crux of the code:

def dataset_for_class(i):
    i = tf.cast(i, tf.uint8)
    return support_ds.filter(lambda image, label: label == i).repeat()

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()

support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))
stratified_ds = tf.data.Dataset.range(10).interleave(dataset_for_class, cycle_length=10) 
stratified_ds = stratified_ds.batch(640)

Notes:

  • Dataset is CIFAR10.
  • I made sure that the images getting batched are different, as you would notice in the notebook provided above.

Yeah, that interleave is basically equivalent to the zip.

def dataset_for_class(i):
    i = tf.cast(i, tf.uint8)
    return support_ds.filter(lambda image, label: label == i).repeat()

Just remember that if you’re splitting a dataset like that, the dataset for each class loads the whole dataset and throws out all but 1/n of it. So if you have a larger dataset with a larger number of classes, you’ll probably want to cache each of the class datasets (but there might also be a way to fix it with queues).
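
A hedged sketch of what caching each class dataset could look like. The toy data and the name cached_dataset_for_class are hypothetical. To combine the classes it uses choose_from_datasets with a round-robin choice dataset, which is my own substitution for the interleave shown earlier, and it assumes each filtered class fits in memory.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in (hypothetical) for the thread's `support_ds`: 12 examples, 3 classes.
features = np.arange(12, dtype="float32")
labels = np.array([0, 1, 2] * 4, dtype="int32")
support_ds = tf.data.Dataset.from_tensor_slices((features, labels))
num_classes = 3

def cached_dataset_for_class(class_id):
    # filter() scans the full dataset on the first pass; cache() keeps the
    # surviving elements in memory so later repeats skip that scan.
    return (
        support_ds.filter(lambda x, y: tf.equal(y, class_id))
        .cache()
        .repeat()
    )

class_datasets = [cached_dataset_for_class(c) for c in range(num_classes)]

# Round-robin choice 0, 1, 2, 0, 1, 2, ... (int64 scalars).
choice_ds = tf.data.Dataset.range(num_classes).repeat()

# Newer TF releases expose this as a Dataset static method; older ones
# provide it under tf.data.experimental.
choose = getattr(tf.data.Dataset, "choose_from_datasets", None)
if choose is None:
    choose = tf.data.experimental.choose_from_datasets

balanced_ds = choose(class_datasets, choice_ds).batch(2 * num_classes)

for _, y in balanced_ds.take(2):
    print(y.numpy())
```

The batch size here is a multiple of the number of classes, which is what keeps every batch exactly balanced under round-robin selection.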

True that. Let’s just continue putting together our hacks and benchmark them. Who knows, future readers may find these incredibly useful.

On a slightly related note, as you may already know, this kind of stratified sampling is pretty common for few-shot classification tasks (particularly for models like Prototypical Networks). It might be a good idea to work on a tutorial on this topic.

@Sayak_Paul @markdaoust
Thanks for this insightful discussion and working workaround mentioned here. It’s really helpful.


I’m trying to get similar output from the tf.data API, especially while working with the tf-similarity data sampler. For the TFRecord format, it also uses similar functions (interleave) from tf.data. But those samplers additionally require the samples of each class to appear contiguously. For example, in a batch with 4 repeated samples per class:

[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, ...]

Is there a convenient function in the tf.data API to achieve this ordering within each batch of training pairs? The following approach might work, but since it runs after interleave, I’m hoping for a more optimized approach. Any tips?

batch_size = 240
num_classes = 10

dataset_encode = tf.data.Dataset.range(num_classes)
dataset_encode = dataset_encode.interleave(dataset_for_class, 
                                           cycle_length=10)
dataset_encode = dataset_encode.batch(batch_size) 


dataset0 = tuple(
    dataset_encode.filter(lambda x, y: tf.equal(y[n], n))
    for n in range(num_classes)
)
...
zipped = tf.data.Dataset.zip(dataset0)
...

Update

One possible solution.

dataset = tf.data.Dataset.range(1, 6)  
dataset = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(5),
    cycle_length=1, 
    block_length=3,
)
list(dataset.as_numpy_iterator())

[1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5]
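
For the grouped layout asked about above (several consecutive samples of each class inside one batch), a minimal sketch is to set cycle_length to the number of classes and block_length to the number of samples wanted per class. The constant per-class datasets below are hypothetical stand-ins for real filtered class datasets.

```python
import tensorflow as tf

num_classes = 4
samples_per_class = 4

# Each inner dataset endlessly yields its own class id; with real data this
# would be the filtered, repeated dataset for that class.
ds = tf.data.Dataset.range(num_classes).interleave(
    lambda c: tf.data.Dataset.from_tensors(c).repeat(),
    cycle_length=num_classes,        # visit every class in each cycle
    block_length=samples_per_class,  # take this many in a row per class
)

batch = next(iter(ds.batch(num_classes * samples_per_class)))
print(batch.numpy())
# [0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3]
```

The batch size should be a multiple of num_classes * samples_per_class so that batch boundaries line up with complete cycles.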

I opened a ticket about how slow this process is in the tf.data API:

https://github.com/tensorflow/tensorflow/issues/56934

It is also hard to create equal splits in TF datasets:

https://github.com/tensorflow/datasets/issues/3502