Can't get this sample code to work: Sequential() with a dataset from a generator, matrix shapes don't match

I'm trying to use a dataset built with `from_generator` in a `tf.keras.Sequential` model to do logistic regression, but I can't get the input shape right. The code is from GitHub Copilot:
```python
import tensorflow as tf
import pandas as pd
```

test.csv:

```csv
x,y,z,target
0.7042368571200515,0.7007563770917052,0.9923080382725273,0
0.11426662913904129,0.8766193882299516,0.7525734407311002,0
0.3990706480213546,0.7509927893745748,0.08310617899165762,0
0.6541820169804226,0.8728205248135915,0.009735857788017,0
0.9824663638939366,0.8929642732916638,0.42717481372577704,0
0.05016847688897663,0.8640977215284672,0.00015648051633876392,1
0.9746878269143237,0.4675049082702597,0.8701887846733452,0
0.6351460617236527,0.41715847867753963,0.7187310540710531,0
0.48783818394438205,0.7666787407271458,0.27013849266804313,0
0.32031329934121744,0.17932919492349586,0.6898206330541312,0
0.9543041190067443,0.591335844278636,0.6588428533365475,0
0.19313525739712412,0.9852300738375201,0.6948888181361819,0
0.19820267930950564,0.933211841142697,0.903656352513663,0
0.8150172410867663,0.8880582276213321,0.5061326797194212,0
0.596347151996661,0.7352080480185654,0.7880475513257801,0
0.6134560868023209,0.3485123047276638,0.22781550361885472,0
0.8044922456384954,0.45120831616370516,0.5767554455960054,0
0.6715578431234355,0.7646054358158448,0.9451860531031546,0
0.7686609033200247,0.6114036496260894,0.7650105537257866,0
0.05197577933003528,0.28496109714833917,0.41306518543162885,0
0.344460901937348,0.766332305545744,0.5144459764257473,0
0.6599678166641048,0.5292402310805339,0.5094529642981013,0
0.10673926773965803,0.5238891179909103,0.9817150442751443,0
0.7036732515429891,0.23654285159967436,0.8762269476492692,0
0.8781094838240854,0.506176060331502,0.9067167580705571,0
0.3374843921398276,0.8600866154828248,0.2973216448787409,0
0.943770089167269,0.0686808858245227,0.48951596198556235,1
0.6765152574791524,0.1375712100211337,0.1737266892058592,0
0.7273752026856982,0.9533380344200385,0.4924386036510685,1
0.4658204645836098,0.2500965060050161,0.48105252784504837,0
0.2880634095162119,0.6276728155035326,0.19165303472399087,0
0.11083669998863499,0.21704265767720732,0.6676057357044906,0
0.12851954218455197,0.20802495693235157,0.667663085267044,0
0.8727789507757944,0.3265873016685742,0.1886650498978053,0
0.8461403050364225,0.43490654451648725,0.31975559963755273,0
0.5077604100733044,0.4655673281242404,0.2802123251669665,0
0.2233222755592028,0.04915222505078809,0.8972617683363415,0
0.2770966381433091,0.6911062101812422,0.35029445120157965,0
0.06505403740430493,0.5549924736882712,0.1512830697345361,0
0.633287065526996,0.6726877553668914,0.7480470622224006,0
0.15276758287427195,0.09551409131836819,0.7330651843012955,0
0.8177575478572151,0.3118379196659643,0.7115535780280724,0
0.4034709361948867,0.5915301572051304,0.8315961740558816,0
0.2521911664746448,0.48834451689396763,0.7968736310010842,0
0.17204367637440232,0.9044065209258801,0.46848650028550876,1
0.2730952384015819,0.6654793002791546,0.6148138694973475,0
0.8689420382367301,0.8348391041594503,0.05993433393586789,0
0.5976464192216739,0.4190036279235926,0.07710971075225881,0
0.9703383518752555,0.7117974134043004,0.984298292889044,0
```

```python
# Define a generator function to read the CSV file in chunks
def csv_generator(file_path, chunksize=1000):
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        for row in chunk.itertuples(index=False):
            yield row

# Define the feature and label extraction function
def parse_csv_row(*row):
    features = row[:-1]  # Assuming the last column is the label
    label = row[-1]
    return tf.convert_to_tensor(features, dtype=tf.float32), tf.convert_to_tensor(label, dtype=tf.float32)

# Create a TensorFlow Dataset from the generator
file_path = 'test.csv'
dataset = tf.data.Dataset.from_generator(
    lambda: csv_generator(file_path),
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),  # Adjust shape to match the number of features
        tf.TensorSpec(shape=(), dtype=tf.float32)  # Adjust shape to match the label
    )
)

# Map the parsing function to the dataset
dataset = dataset.map(parse_csv_row)

# Batch the dataset
batch_size = 32
dataset = dataset.batch(batch_size)

# Define a simple logistic regression model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),  # Adjust shape to match the number of features
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(dataset, epochs=10)
```

Hi @minix, I tried to execute the code above and got the same shape mismatch error, so I modified the defined functions and was able to train the model without any error. I also tried creating the dataset with `tf.data.Dataset.from_tensor_slices` and trained the model without any error. Please refer to this gist for a working code example. Thank you.
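The gist itself isn't inlined here, but the shape mismatch comes from the generator yielding a single 4-element row while the `output_signature` declares a `(features, label)` pair. A minimal sketch of one way to fix it, assuming the last column is the label (the inline sample file is only there to make the snippet self-contained):

```python
import tensorflow as tf
import pandas as pd

# A few rows in the thread's format, written to disk so the example runs standalone.
with open('test.csv', 'w') as f:
    f.write('x,y,z,target\n'
            '0.70,0.70,0.99,0\n'
            '0.11,0.87,0.75,0\n'
            '0.05,0.86,0.01,1\n'
            '0.39,0.75,0.08,0\n')

def csv_generator(file_path, chunksize=1000):
    # Yield (features, label) pairs so each element already matches
    # the output_signature declared below; no .map() step is needed.
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        for row in chunk.itertuples(index=False):
            yield tuple(row)[:-1], row[-1]

dataset = tf.data.Dataset.from_generator(
    lambda: csv_generator('test.csv'),
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),  # 3 feature columns
        tf.TensorSpec(shape=(), dtype=tf.float32),    # scalar label
    ),
).batch(32)

features, labels = next(iter(dataset))
print(features.shape, labels.shape)  # (4, 3) (4,)
```

With the elements already shaped as `(features, label)`, the rest of the original script (the `Sequential` model and `model.fit(dataset, epochs=10)`) runs unchanged.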

Thanks a lot! It seems your code suggests that yielding by chunk does not work with `tf.data.Dataset.batch`; you have to yield row by row. But that is very inefficient when the file has hundreds of millions of rows, and the row processor needs to do more than just read the row: it has to work on the whole chunk to generate more features. This is a very common situation, so I'm not sure what I missed in TensorFlow's capabilities.
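Yielding by chunk can work if the `output_signature` describes a whole chunk (a variable-length batch of rows) rather than a single row; `Dataset.unbatch()` then flattens the chunks so `batch()` can regroup them at the training batch size. A sketch under that assumption — `x_centered` is a made-up chunk-level feature, and the tiny inline file just makes the snippet runnable on its own:

```python
import tensorflow as tf
import pandas as pd

with open('test.csv', 'w') as f:  # tiny stand-in file so the sketch runs standalone
    f.write('x,y,z,target\n'
            '0.70,0.70,0.99,0\n'
            '0.11,0.87,0.75,0\n'
            '0.05,0.86,0.01,1\n'
            '0.39,0.75,0.08,0\n')

def chunk_generator(file_path, chunksize=2):
    # Yield one (features, labels) pair per chunk, so chunk-level
    # feature engineering can run on the whole DataFrame at once.
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        # Hypothetical chunk-level feature: x centred on the chunk mean.
        chunk['x_centered'] = chunk['x'] - chunk['x'].mean()
        labels = chunk.pop('target').to_numpy('float32')
        yield chunk.to_numpy('float32'), labels

dataset = tf.data.Dataset.from_generator(
    lambda: chunk_generator('test.csv'),
    output_signature=(
        tf.TensorSpec(shape=(None, 4), dtype=tf.float32),  # variable-length chunk, 4 features
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
).unbatch().batch(32)  # flatten the chunks back into rows, then re-batch

features, labels = next(iter(dataset))
print(features.shape, labels.shape)  # (4, 4) (4,)
```

The last chunk of a huge file is usually shorter than `chunksize`, which is why the first dimension is declared `None`; a `.prefetch(tf.data.AUTOTUNE)` after the `.batch()` would let the pandas chunk reading overlap with training.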

Thank you very much, I greatly appreciate your help!