Can you bit-pack and then unpack binary inputs?

gwiesenekker · August 30, 2023, 7:44am

Hi,

My machine has 128GB RAM. I have a (large) CSV file with binary inputs (0/1) and floating-point labels that I want to load completely into RAM, as loading and parsing the CSV file in batches is not very fast. Even when the inputs are encoded as bytes, it would take more than 128GB of RAM to store them. Is there a way to bit-pack these binary inputs in RAM (which would reduce the size by a factor of 8) and unpack them for training? To be clear, I do not want to train on the packed integers themselves.

Thanks,
GW
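
For reference, NumPy already provides this as np.packbits and np.unpackbits. The following is only a minimal sketch of the idea, assuming the inputs are available as a uint8 array of 0/1 values; the array shape and variable names are made up for illustration:

```python
import numpy as np

# Hypothetical data: 1000 rows of 192 binary inputs, one byte per input.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 192), dtype=np.uint8)

# Pack 8 inputs into each byte along the feature axis.
packed = np.packbits(bits, axis=1)

# Unpack only the rows needed for one batch.
batch = np.unpackbits(packed[:32], axis=1)

print(bits.nbytes, packed.nbytes)  # 192000 24000
```

Only the packed array has to stay in RAM; the unpacked form exists for one batch at a time.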

gwiesenekker · August 30, 2023, 8:00am

ChatGPT 3.5 provides the following answer:

Certainly! Bit-packing is a technique to efficiently store binary data by packing multiple binary values into a single byte. This can be useful when dealing with large datasets of binary inputs, as it can reduce memory usage. In this example, I’ll show you how to bit-pack binary inputs and then unpack them in batches using TensorFlow for training.

Let’s assume you have a dataset of binary inputs represented as arrays of 0s and 1s. Here’s how you can perform bit-packing and unpacking using TensorFlow:

import numpy as np
import tensorflow as tf

# Generate some random binary data
num_samples = 1000
input_size = 32
binary_data = np.random.randint(0, 2, size=(num_samples, input_size), dtype=np.uint8)

# Function to bit-pack binary data (first bit goes into the lowest bit of each byte)
def bit_pack(data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    # Flush a trailing partial byte, if any
    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

# Function to unpack bit-packed data
def bit_unpack(packed_data, original_size):
    unpacked_data = []

    for byte in packed_data:
        for _ in range(8):
            unpacked_data.append(byte & 1)
            byte >>= 1

    return np.array(unpacked_data[:original_size], dtype=np.uint8)

# Bit-pack the binary data
packed_binary_data = np.array([bit_pack(sample) for sample in binary_data])

# Create a TensorFlow dataset
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((packed_binary_data, binary_data))
dataset = dataset.batch(batch_size)

# Example usage in a training loop
for packed_batch, original_batch in dataset:
    unpacked_batch = np.array([bit_unpack(packed_sample, input_size) for packed_sample in packed_batch.numpy()])

    # Now you can use the unpacked_batch for training
    # ...
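
The Python loops above process one bit at a time, which is slow for large arrays. As a rough sketch (not part of the ChatGPT answer), the same packing can be done with np.packbits using bitorder="little", which matches the loop's convention of putting the first bit into the lowest bit of each byte. The helper name bit_pack_loop is only a stand-in for the loop version above:

```python
import numpy as np

def bit_pack_loop(data):
    """Loop-based packing: the first bit goes into the lowest bit of each byte."""
    packed, current_byte, bit_count = [], 0, 0
    for bit in data:
        current_byte |= int(bit) << bit_count
        bit_count += 1
        if bit_count == 8:
            packed.append(current_byte)
            current_byte, bit_count = 0, 0
    if bit_count > 0:
        packed.append(current_byte)
    return np.array(packed, dtype=np.uint8)

rng = np.random.default_rng(0)
binary_data = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)

# Vectorized packing of all rows at once, LSB first.
packed_fast = np.packbits(binary_data, axis=1, bitorder="little")
packed_slow = np.array([bit_pack_loop(row) for row in binary_data])

# Vectorized unpacking; count trims any padding bits in the last byte.
unpacked = np.unpackbits(packed_fast, axis=1, count=32, bitorder="little")
```

The bitorder and count arguments need a reasonably recent NumPy (1.17 or later, as far as I know).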

gwiesenekker · August 30, 2023, 7:12pm

So I am trying the ChatGPT suggestion as follows:

def bit_pack(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0
    for bit in data:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

dataset = tf.data.experimental.CsvDataset(
    csv, column_types, header=True
)

dataset = dataset.map(bit_pack)

But then I get the error:

File “”, line 8, in bit_pack *
current_byte |= (bit << bit_count)
TypeError: unsupported operand type(s) for <<: 'Tensor' and 'int'

How do I convert the Tensor to an int? bit.numpy() gives "AttributeError: 'Tensor' object has no attribute 'numpy'".

Thanks,
GW
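
A note on the AttributeError: functions passed to Dataset.map are traced into a graph, so the tensors they receive are symbolic and have no .numpy() method. One documented escape hatch is tf.py_function, which runs a Python function eagerly inside the pipeline. A minimal sketch, using a small in-memory dataset instead of the CSV file; the names pack_row and pack_row_eager are hypothetical:

```python
import numpy as np
import tensorflow as tf

def pack_row_eager(bits):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    return np.packbits(bits.numpy().astype(np.uint8), bitorder="little")

def pack_row(bits):
    packed = tf.py_function(pack_row_eager, inp=[bits], Tout=tf.uint8)
    packed.set_shape([2])  # 16 bits -> 2 bytes; py_function drops the static shape
    return packed

rows = np.array([[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
                 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=np.int32)
dataset = tf.data.Dataset.from_tensor_slices(rows).map(pack_row)
result = list(dataset.as_numpy_iterator())  # rows pack to [1, 255] and [2, 0]
```

This works, but tf.py_function is bound by the Python interpreter, so it is usually slower than expressing the same logic with TensorFlow ops.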

gwiesenekker · August 30, 2023, 7:57pm

I am making some progress. If I change the bit_pack function to:

def bit_pack(data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        current_byte |= (bit.numpy() << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

and loop over the dataset as follows:

for element in dataset:
    print(bit_pack(element))

I get the desired output:

[ 0 0 48 216 47 16 0 0 0 0 0 0 52 74 164 0 0 0
0 0 0 0 0 0]
[ 0 0 64 246 255 31 0 0 0 0 0 128 255 255 7 0 0 0
0 0 0 0 0 0]

But how do I now apply the bit_pack function to the dataset using map?

Thanks,
GW

gwiesenekker · August 30, 2023, 9:03pm

By gradually modifying the map example from the tf.data.Dataset documentation I have managed to narrow down the issue to:

The following map function does work:

def bit_pack_arg2(*data):
    sum = 0
    for bit in data[:-1]:
        sum = sum + bit
    return sum

dataset = tf.data.experimental.CsvDataset(
    csv, column_types, header=True
)
result = dataset.map(bit_pack_arg2)
list(result.as_numpy_iterator())

It returns:

[21, 40, 33, 15, 38, 20, 16, 24, 12]

But the bit_pack_arg function

def bit_pack_arg(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

Gives the error:

TypeError: unsupported operand type(s) for <<: ‘Tensor’ and ‘int’

On the bit << bit_count line. But why does sum = sum + bit work, but current_byte |= (bit << bit_count) not?

Regards,
GW
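
On the question of why + works but << does not: tf.Tensor overloads a fixed set of Python operators (+, |, & and others), and at least in the TensorFlow version used in this thread << is not among them. The shift has to go through tf.bitwise.left_shift instead. A small sketch under that assumption, packing a single byte inside a traced function; pack_byte is a hypothetical name:

```python
import tensorflow as tf

@tf.function  # traced into a graph, like a function passed to Dataset.map
def pack_byte(bits):
    byte = tf.constant(0, dtype=tf.int32)
    for i in range(8):
        # bits[i] << i would raise TypeError: the tensor type does not define <<.
        # The | operator is overloaded and maps to a bitwise OR for integer tensors.
        byte = byte | tf.bitwise.left_shift(bits[i], i)
    return byte

packed = pack_byte(tf.constant([0, 0, 0, 0, 1, 1, 0, 0], dtype=tf.int32))
print(int(packed))  # 48
```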

gwiesenekker · August 31, 2023, 5:18am

So after a lot of trial-and-error the following bit_pack_arg function starts to work:

def bit_pack_arg(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        #current_byte |= (bit << bit_count)
        current_byte |= tf.bitwise.left_shift(bit, bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    #return np.array(packed_data, dtype=np.uint8)
    return(packed_data)

After replacing the '<<' operator with tf.bitwise.left_shift I got the error:

NotImplementedError: Cannot convert a symbolic tf.Tensor (or_7/BitwiseOr:0) to a numpy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported.

I commented out the numpy.array call, and when bit_pack_arg is mapped to the dataset I now get the output:

[(0, 0, 48, 216, 47, 16, 0, 0, 0, 0, 0, 0, 52, 74, 164, 0, 0, 0, 0, 0, 0, 0, 0, 0),

Close, but it is not a nested array of numpy.arrays yet. So I guess what happens is that the dataset creates a symbolic tensor, and the '+' operator works on symbolic tensors but '<<' and numpy.array do not. So, are these bugs, or are there rules for what you can use within tf.data.Dataset map functions? The np.array call converted the packed bytes to np.uint8. How can I convert packed_data to uint8 without the np.array call?

Thanks,
GW
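
One possible answer to the uint8 question is to stay with TensorFlow ops throughout: tf.stack turns the per-column scalars into one tensor and tf.cast converts the result to uint8. The sketch below is an assumption-laden illustration, not a tested drop-in for the CSV pipeline: it uses a small in-memory dataset in place of CsvDataset, assumes the number of bit columns is a multiple of 8, and the name bit_pack_tf is made up.

```python
import tensorflow as tf

def bit_pack_tf(*columns):
    """Pack all columns except the last (the label) into uint8 bytes, LSB first."""
    bits = tf.stack(columns[:-1])        # shape [num_bits], int32
    bits = tf.reshape(bits, [-1, 8])     # one row per output byte
    weights = tf.constant([1, 2, 4, 8, 16, 32, 64, 128], dtype=bits.dtype)
    packed = tf.reduce_sum(bits * weights, axis=1)
    return tf.cast(packed, tf.uint8), columns[-1]

# Hypothetical stand-in for the CsvDataset: 16 int32 bit columns plus a float label.
columns = tuple(tf.constant([v], dtype=tf.int32)
                for v in [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
columns += (tf.constant([0.5], dtype=tf.float32),)

dataset = tf.data.Dataset.from_tensor_slices(columns).map(bit_pack_tf)
packed, label = next(iter(dataset.as_numpy_iterator()))
print(packed, label)  # [ 48 255] 0.5
```

Multiplying by powers of two and summing gives the same bytes as the shift-and-OR loop, but as a single vectorized op per row.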

Mog · September 5, 2023, 8:44am

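Oh, the CSV loading functions don't support uint8: CsvDataset only accepts float32, float64, int32, int64 and string columns, so the smallest integer type you can get is int32.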

Have you tried converting to Parquet and then using TFIO (tfio.experimental.IODataset in TensorFlow I/O)?

gwiesenekker · September 5, 2023, 3:59pm

Thank you, I will give that a try. Meanwhile ChatGPT 3.5 has answered my ‘<<’ question:

I apologize for any confusion earlier. You are correct; the << operator does not work on symbolic tensors within TensorFlow datasets. The << operator is intended for use with regular Python integers and cannot be directly applied to symbolic tensors. If you want to left-shift elements within a TensorFlow dataset, you would need to use TensorFlow operations that work on tensors.

Regards,
GW

gwiesenekker · December 29, 2023, 4:06am

Eureka after mixing Google, ChatGPT, Trial And Error like an alchemist.
The following code reads a CSV file with the header 'PACK0, PACK1, …, PACK23, RESULT'. Each row contains 192 binary (0/1) inputs encoded as 24 8-bit ints (so 1,1,1,1,1,1,1,1 is encoded as 255) plus one float output. The inputs are unpacked in batches before training:

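import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers

def unpackbits_tf(features, labels):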
    mask = tf.constant([128, 64, 32, 16, 8, 4, 2, 1], dtype=features.dtype)
    expanded_features = tf.expand_dims(features, -1)
    unpacked = tf.cast(tf.bitwise.bitwise_and(expanded_features, mask) > 0, tf.int32)
    return tf.reshape(unpacked, [-1, features.shape[1] * 8]), labels

features = pd.read_csv('demo.csv', dtype = 'uint8', converters = {'RESULT': float})
labels = features.pop('RESULT')

x_train, x_test, y_train, y_test = train_test_split(features.to_numpy(), labels.to_numpy(), test_size=0.2)

BATCH_SIZE=128
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(4 * BATCH_SIZE)
    validate = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)

train = train.map(unpackbits_tf)
validate = validate.map(unpackbits_tf)

model = tf.keras.Sequential([layers.Dense(192,activation="relu"),
                               layers.Dense(16,activation="relu"),
                               layers.Dense(16,activation="relu"),
                               layers.Dense(1,activation="sigmoid")])

model.compile(optimizer=tf.keras.optimizers.Adam(),loss = tf.keras.losses.MeanSquaredError())

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto', patience=10)

history = model.fit(train,
                      epochs=1000,
                      callbacks=[callback],
                      validation_data=validate)

Regards,
GW
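
The mask [128, 64, …, 1] in unpackbits_tf means the first input of each group of 8 sits in the highest bit, which is the default bit order of np.packbits. As a hedged sketch of how the PACK0 to PACK23 values could be produced from raw bits and checked, here is a NumPy-only version of the same mask trick; the raw data is randomly generated for illustration:

```python
import numpy as np

# Hypothetical raw data: 4 rows of 192 binary inputs.
rng = np.random.default_rng(0)
raw_bits = rng.integers(0, 2, size=(4, 192), dtype=np.uint8)
raw_bits[0, :8] = 1  # first 8 inputs all set, so the first byte packs to 255

# np.packbits defaults to bitorder="big": the first input lands in bit 7.
packed = np.packbits(raw_bits, axis=1)  # shape (4, 24)

# NumPy version of the mask trick used in unpackbits_tf.
mask = np.array([128, 64, 32, 16, 8, 4, 2, 1], dtype=np.uint8)
unpacked = ((packed[:, :, None] & mask) > 0).astype(np.int32)
unpacked = unpacked.reshape(-1, packed.shape[1] * 8)

print(packed[0, 0])  # 255
```

If the round trip reproduces raw_bits, the packing order agrees with the mask used for unpacking.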
