How to convert BertLM logits to readable text?

Neesham · April 1, 2023, 9:10am

Hi,
I want to convert BertMaskedLM logits to human-readable text. Can anyone suggest how I can post-process the output? Here is the code:

features = ["The quick brown fox jumped.", "I forgot my homework."]
# Pretrained language model.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
)
masked_lm.fit(x=features, batch_size=2)

output = masked_lm.predict(["It is a good [MASK]."])

The format of the output is numpy.ndarray.

Thanks in advance for your help!

Laxma_Reddy_Patlolla · June 14, 2024, 6:50pm

Hi @Neesham ,

Here is a sample Colab gist that converts BertLM logits to readable text.

Thanks

Tim_Wolfe · June 17, 2024, 5:10am

TensorFlow 2.16 + Python 3.12 – JARaaS Hybrid RAG - 6/16/2024

To convert the output logits from your BertMaskedLM model to human-readable text, you will need to perform the following steps:

1. Apply softmax: convert the logits to probabilities using the softmax function.
2. Find the token with the highest probability: take the most probable token ID at each masked position.
3. Map token IDs to words: convert the token IDs to the corresponding words using the tokenizer.
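
As a minimal sketch of these three steps in isolation, the snippet below runs them on made-up logits using NumPy only. The five-entry vocabulary and the logit values are hypothetical stand-ins for the real tokenizer vocabulary and the real model output.

```python
import numpy as np

# Hypothetical toy vocabulary; the real BERT vocabulary is far larger.
vocab = ["[PAD]", "day", "idea", "dog", "."]

# Hypothetical logits for a single masked position, shape (vocab_size,).
logits = np.array([-2.0, 3.1, 2.4, 0.5, -1.0])


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Step 1: logits -> probabilities.
probs = softmax(logits)

# Step 2: pick the most probable token ID.
predicted_id = int(np.argmax(probs))

# Step 3: map the ID back to a word.
predicted_word = vocab[predicted_id]

print(predicted_word)
```

Because softmax preserves the ordering of its inputs, taking argmax of the raw logits selects the same token. The softmax step only matters if you also want the probabilities themselves.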

Below is an example of how you might achieve this in your code:

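import numpy as np
import keras_nlp

# Load the model without its built-in preprocessor. The default preprocessor
# masks tokens at random, so the mask position is supplied by hand here.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
    preprocessor=None,
)

# Tokenizer for BERT
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")

# Input sentence with a mask token
masked_input = "It is a good [MASK]."

# Tokenize the text on each side of [MASK] separately, since the tokenizer
# may split a literal "[MASK]" string into ordinary word pieces.
left_text, right_text = masked_input.split("[MASK]")
left_ids = np.asarray(tokenizer(left_text)).tolist()
right_ids = np.asarray(tokenizer(right_text)).tolist()

# Assemble [CLS] left [MASK] right [SEP]
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
mask_id = tokenizer.token_to_id("[MASK]")
token_ids = [cls_id] + left_ids + [mask_id] + right_ids + [sep_id]

# Define the index of the masked token in the input
mask_index = token_ids.index(mask_id)

# Predict logits; the expected output shape is (batch, num_masks, vocab_size)
inputs = {
    "token_ids": np.array([token_ids]),
    "segment_ids": np.zeros((1, len(token_ids)), dtype="int32"),
    "padding_mask": np.ones((1, len(token_ids)), dtype="int32"),
    "mask_positions": np.array([[mask_index]]),
}
logits = masked_lm.predict(inputs)

# Apply softmax to get probabilities for the single masked position
mask_logits = logits[0, 0]
probabilities = np.exp(mask_logits - mask_logits.max())
probabilities /= probabilities.sum()

# Get the token with the highest probability
predicted_token_id = int(np.argmax(probabilities))

# Convert token ID to word
predicted_token = tokenizer.id_to_token(predicted_token_id)

# Replace [MASK] with the predicted token
output_text = masked_input.replace("[MASK]", predicted_token)

print(f"Input: {masked_input}")
print(f"Output: {output_text}")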

This code snippet does the following:

It loads the BertMaskedLM model and corresponding tokenizer.
It tokenizes the input sentence that contains a masked token.
It predicts the logits using the masked language model.
It finds the index of the [MASK] token in the tokenized input.
It applies the softmax function to the logits to obtain probabilities.
It identifies the token ID with the highest probability at the masked token position.
It converts this token ID back to a human-readable word using the tokenizer.
Finally, it replaces [MASK] in the original sentence with the predicted word and prints the result.
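
If the single best token is not enough, a small variation is to list several candidates. The sketch below is only an illustration: `top_k_fill` is a hypothetical helper, and the toy vocabulary and probabilities are made up rather than taken from a real model run.

```python
import numpy as np


def top_k_fill(sentence, probs, vocab, k=3, mask="[MASK]"):
    """Return the k most probable completions with their probabilities."""
    # Sort token IDs by probability, highest first, and keep the first k.
    top_ids = np.argsort(probs)[::-1][:k]
    return [
        (sentence.replace(mask, vocab[i], 1), float(probs[i]))
        for i in top_ids
    ]


# Hypothetical vocabulary and probabilities for one masked position.
vocab = ["[PAD]", "day", "idea", "dog", "."]
probs = np.array([0.01, 0.55, 0.30, 0.12, 0.02])

for text, p in top_k_fill("It is a good [MASK].", probs, vocab):
    print(f"{p:.2f}  {text}")
```

With a real WordPiece vocabulary, some candidates may be subword pieces such as tokens starting with "##", so you may want to filter or merge those before showing them to a user.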

Sources:

For tokenizing text and converting tokens back to text you can refer to KerasNLP and TensorFlow documentation:
TensorFlow Text Tutorials
TensorFlow Guide
The preceding code example is based on resources available in the document:
tensorflow text teaching material: text_generation.ipynb (internal document)
