I have a similar error:
ValueError: 'logits' and 'labels' must have the same shape, received ((None, 2) vs (None, 1)).
My dataframe has two columns: Links (of string type, containing titles of news articles) and Shortlisted (of numeric type, with value either 0 (not shortlisted) or 1 (shortlisted)).
I am working on a binary classification (NLP) project using BERT (by Google).
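For reference, this is roughly what the dataframe looks like (the rows below are made up just to show the structure; the real data is much larger, and 'Sheet' and 'Length' are the other columns my code refers to):

import pandas as pd

# Illustrative rows only -- not my real data
df = pd.DataFrame({
    'Links': ["Govt announces new solar subsidy scheme",
              "Local football club wins regional trophy"],
    'Shortlisted': [1, 0],
    'Sheet': ['March2024', 'May2024'],
})
df['Length'] = df['Links'].str.len()  # character length of each title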
My code is throwing the error above, and I am not understanding what it means or how to fix it.
- Splitting into training and testing data.
df['Length'].max()
train_df = df[~(df['Sheet'].str.contains('April2024', regex=True) | df['Sheet'].str.contains('May2024', regex=True))]
test_df = df[len(train_df):]
- Importing libraries and classes
from transformers import BertTokenizer, create_optimizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
- Splitting the data into training, validation, and test sets
train_texts, train_labels = train_df['Links'].tolist(), train_df['Shortlisted'].tolist()
val_texts, test_texts, val_labels, test_labels = train_test_split(
    test_df['Links'].tolist(), test_df['Shortlisted'].tolist(), test_size=0.5, random_state=42)
- BERT Tokenizer (the max_length is 163 as the longest title I have in the 'Links' column is 163 characters long; a quick check for this is sketched after the tokenizer code below)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
def tokenize_function(texts):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=163, return_tensors="tf")
train_encodings = tokenize_function(train_texts)
val_encodings = tokenize_function(val_texts)
test_encodings = tokenize_function(test_texts)
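For what it's worth, this is roughly how the 163 figure can be checked; the character length and the tokenized length are not the same thing, so I print both:

# longest title by character count (this matches the df['Length'].max() call above)
print(df['Links'].str.len().max())

# the token count after BERT tokenization can differ from the character count,
# so this checks the longest title in tokens as well
print(max(len(tokenizer.encode(t)) for t in df['Links'].tolist()))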
- Creating a Tensorflow dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).shuffle(len(train_texts)).batch(8)
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(16)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(16)
- Fetching pre-trained model
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)
- The error is here! (When training and validating the model, I have set epochs=1 as I only have around 9,700 records to train the model.) A quick check of the shapes involved is sketched right after this block.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=tf.keras.metrics.CategoricalAccuracy())
history = model.fit(train_dataset, epochs=1, validation_data=val_dataset)
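To show where the shapes come from, this is a small check I can run on one batch from train_dataset (batch_inputs and batch_labels are just names I made up for the two parts of a batch):

# inspect one batch to compare the model's logits shape with the labels shape
for batch_inputs, batch_labels in train_dataset.take(1):
    outputs = model(batch_inputs)
    print("logits shape:", outputs.logits.shape)  # (batch_size, 2) because num_labels=2
    print("labels shape:", batch_labels.shape)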
- Testing model
results = model.evaluate(test_dataset)
print(results)
- Saving model
model.save_pretrained('./fine-tuned-bert')
tokenizer.save_pretrained('./fine-tuned-bert')
- Testing model on a new article
model = TFBertForSequenceClassification.from_pretrained('./fine-tuned-bert')
tokenizer = BertTokenizer.from_pretrained('./fine-tuned-bert')
def predict(text):
    inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=128)
    outputs = model(**inputs)
    predictions = tf.argmax(outputs.logits, axis=-1)
    return predictions
new_text = "Green energy to drive power sector investment, coal to remain significant: Moody's"
predicted_label = predict(new_text)
print(predicted_label)
Also, this is my first time working on an NLP project. Any suggestions for improving this code would be highly appreciated!
P.S.: The snippet below did not throw an error and gave an accuracy above 95%; on the test data the accuracy was 100%.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=1, validation_data=val_dataset)