I am trying to use `tf.data.Dataset.from_generator` with a `tf.keras.Sequential` model to do logistic regression, but I can't get the input shape right. The code is from GitHub Copilot:
import tensorflow as tf
import pandas as pd

# Define a generator function to read the CSV file in chunks
def csv_generator(file_path, chunksize=1000):
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        for row in chunk.itertuples(index=False):
            yield row

# Define the feature and label extraction function
def parse_csv_row(*row):
    features = row[:-1]  # Assuming the last column is the label
    label = row[-1]
    return tf.convert_to_tensor(features, dtype=tf.float32), tf.convert_to_tensor(label, dtype=tf.float32)

# Create a TensorFlow Dataset from the generator
file_path = 'test.csv'
dataset = tf.data.Dataset.from_generator(
    lambda: csv_generator(file_path),
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),  # Adjust shape to match the number of features
        tf.TensorSpec(shape=(), dtype=tf.float32)  # Adjust shape to match the label
    )
)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),  # Adjust shape to match the number of features
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Hi @minix, I ran the code above and got the shape mismatch error. After modifying the defined functions, I was able to train the model without any error. I also created a dataset using `tf.data.Dataset.from_tensor_slices` and trained the model without any error. Please refer to this gist for a working code example. Thank you.
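The gist is not reproduced in this thread, so the following is only a minimal sketch of one way to make the shapes line up, not necessarily what the gist does. It assumes the mismatch arises because the generator yields a whole 4-field row while `output_signature` expects a `(features, label)` pair, so the generator here splits each row itself. The temporary CSV, its column names, and the name `csv_row_generator` are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical data: 3 feature columns plus a label in the last column,
# written to a temporary file so the sketch is self-contained.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((20, 3)), columns=["f1", "f2", "f3"])
frame["label"] = (frame["f1"] > 0.5).astype("float32")
file_path = os.path.join(tempfile.mkdtemp(), "test.csv")
frame.to_csv(file_path, index=False)

def csv_row_generator(file_path, chunksize=1000):
    """Yield one (features, label) pair per row, matching output_signature."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        values = chunk.to_numpy(dtype="float32")
        for row in values:
            # Assumes the last column is the label.
            yield row[:-1], row[-1]

dataset = tf.data.Dataset.from_generator(
    lambda: csv_row_generator(file_path),
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),  # one row of features
        tf.TensorSpec(shape=(), dtype=tf.float32),    # scalar label
    ),
).batch(8)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
history = model.fit(dataset, epochs=1, verbose=0)
```

After `.batch(8)`, each element is a `(8, 3)` feature tensor and an `(8,)` label tensor, which is what the `Input(shape=(3,))` layer expects.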
Thanks a lot! Your code seems to suggest that yielding by chunk does not work with `tf.data.Dataset.batch`, so I have to yield row by row. But that is very inefficient when the file has hundreds of millions of rows. The row processor needs to do more than just read the row: it has to work on the whole chunk to generate more features. This is a very common situation. I'm not sure what I'm missing among TensorFlow's capabilities.
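One possible approach to the chunk-level case, offered as a sketch rather than a confirmed answer from this thread: let the generator yield whole chunks, declare the chunk dimension as `None` in `output_signature`, and then use `unbatch()` followed by `batch()` to get fixed-size training batches. The derived feature (`f1 * f2`), the temporary CSV, and the name `csv_chunk_generator` are hypothetical stand-ins for the real chunk processing.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical data: 2 raw feature columns plus a label column.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((25, 2)), columns=["f1", "f2"])
frame["label"] = (frame["f1"] > 0.5).astype("float32")
file_path = os.path.join(tempfile.mkdtemp(), "test.csv")
frame.to_csv(file_path, index=False)

def csv_chunk_generator(file_path, chunksize=10):
    """Yield one (features, labels) pair per chunk, not per row."""
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        # Vectorized, chunk-level feature engineering (hypothetical feature).
        product = chunk["f1"] * chunk["f2"]
        features = np.column_stack(
            [chunk["f1"], chunk["f2"], product]
        ).astype("float32")
        labels = chunk["label"].to_numpy(dtype="float32")
        yield features, labels

chunked = tf.data.Dataset.from_generator(
    lambda: csv_chunk_generator(file_path),
    output_signature=(
        # None leaves the number of rows per chunk unspecified,
        # so the shorter final chunk is still accepted.
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)

# Split chunks back into rows, then regroup into training batches.
dataset = chunked.unbatch().batch(8)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
history = model.fit(dataset, epochs=1, verbose=0)
```

If the chunk size is already a suitable batch size, the `unbatch().batch(8)` step can be dropped and `chunked` passed to `model.fit` directly, since each element already carries a leading batch dimension.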
Thank you very much, I greatly appreciate your help!