Hi,
I want to convert BertMaskedLM logits to human-readable text. Can anyone suggest how I can post-process the output? Here is the code:
import keras_nlp
features = ["The quick brown fox jumped.", "I forgot my homework."]
# Pretrained language model.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
)
masked_lm.fit(x=features, batch_size=2)
output = masked_lm.predict(["It is a good [MASK]."])
The format of the output is numpy.ndarray.
Thanks in advance for your help!
Hi @Neesham ,
Here is the sample colab gist created to convert BertLM logits to readable text.
Thanks
TensorFlow 2.16 + Python 3.12 – JARaaS Hybrid RAG - 6/16/2024
To convert the output logits from your BertMaskedLM model to human-readable text, you will need to perform the following steps:
- Apply Softmax: Convert the logits to probabilities using the softmax function.
- Find the Token with the Highest Probability: Identify the token with the highest probability for each masked token.
- Map Token IDs to Words: Convert the token IDs to corresponding words using the tokenizer.
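The three steps can be tried in isolation before involving the model. The following is a minimal sketch using a made-up five-word vocabulary and made-up logits for a single masked position; `vocab`, `logits`, and the `softmax` helper are illustrative assumptions, not part of KerasNLP.

```python
import numpy as np

# Hypothetical vocabulary and made-up logits for one masked position.
vocab = ["day", "idea", "movie", "book", "the"]
logits = np.array([3.2, 2.9, 1.5, 1.1, -0.7])


def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


probs = softmax(logits)          # Step 1: logits -> probabilities
best_id = int(np.argmax(probs))  # Step 2: ID of the most likely token
best_word = vocab[best_id]       # Step 3: token ID -> word

print(best_word, round(float(probs[best_id]), 3))
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits selects the same token; the softmax step matters only when the probability itself is needed.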
Below is an example of how you might achieve this in your code:
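import numpy as np
import keras_nlp
from tensorflow import keras
# Pretrained language model, loaded without its preprocessor: the default
# preprocessor masks random tokens rather than using the [MASK] we supply.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
    preprocessor=None,
)
# Tokenizer for BERT
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")
# Input sentence with a mask token
masked_input = "It is a good [MASK]."
# Tokenize the text on each side of [MASK] separately (by default the tokenizer
# may split a literal "[MASK]" into word pieces), then add the special tokens.
before, after = masked_input.split("[MASK]")
token_ids = (
    [tokenizer.cls_token_id]
    + np.asarray(tokenizer(before)).tolist()
    + [tokenizer.mask_token_id]
    + np.asarray(tokenizer(after)).tolist()
    + [tokenizer.sep_token_id]
)
# Define the index of the masked token in the input
mask_index = token_ids.index(tokenizer.mask_token_id)
# Build the model inputs for a batch containing one sequence
inputs = {
    "token_ids": np.array([token_ids]),
    "segment_ids": np.zeros((1, len(token_ids)), dtype="int32"),
    "padding_mask": np.ones((1, len(token_ids)), dtype="int32"),
    "mask_positions": np.array([[mask_index]]),
}
# Predict logits; the output has shape (batch, number of masks, vocabulary size)
logits = masked_lm.predict(inputs)
# Apply softmax to get probabilities
softmax_logits = keras.layers.Softmax()(logits)
# Get the token with the highest probability at the first (only) mask position
predicted_token_id = int(np.argmax(softmax_logits[0][0]))
# Convert token ID to word
predicted_token = tokenizer.id_to_token(predicted_token_id)
# Replace [MASK] with the predicted token
output_text = masked_input.replace("[MASK]", predicted_token)
print(f"Input: {masked_input}")
print(f"Output: {output_text}")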
This code snippet does the following:
- It loads the BertMaskedLM model and corresponding tokenizer.
- It tokenizes the input sentence that contains a masked token.
- It predicts the logits using the masked language model.
- It finds the index of the [MASK] token in the tokenized input.
- It applies the softmax function to the logits to obtain probabilities.
- It identifies the token ID with the highest probability at the masked token position.
- It converts this token ID back to a human-readable word using the tokenizer.
- Finally, it replaces [MASK] in the original sentence with the predicted word and prints the result.
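Taking only the single most likely token hides how close the alternatives were. As a hedged sketch, the same softmax output can be ranked to list the top few candidates; `top_k_tokens`, `toy_vocab`, and `toy_logits` below are hypothetical names and values chosen for illustration.

```python
import numpy as np


def top_k_tokens(logits, id_to_word, k=3):
    """Return the k most likely (word, probability) pairs for one position."""
    logits = np.asarray(logits, dtype="float64")
    # Numerically stable softmax over the vocabulary axis.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Sort token IDs by descending probability and keep the first k.
    top_ids = np.argsort(probs)[::-1][:k]
    return [(id_to_word(int(i)), float(probs[i])) for i in top_ids]


# Hypothetical vocabulary and made-up logits for one masked position.
toy_vocab = ["day", "idea", "movie", "book", "the"]
toy_logits = [3.2, 2.9, 1.5, 1.1, -0.7]

candidates = top_k_tokens(toy_logits, lambda i: toy_vocab[i], k=3)
for word, prob in candidates:
    print(f"{word}: {prob:.3f}")
```

With a real model, the first argument would be the row of logits for the masked position and the second would be the tokenizer's ID-to-token lookup (`id_to_token` on KerasNLP tokenizers).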
Sources:
- For tokenizing text and converting tokens back to text, you can refer to the KerasNLP and TensorFlow documentation: TensorFlow Text Tutorials, TensorFlow Guide
- The preceding code example is based on resources available in the document: tensorflow text teaching material: text_generation.ipynb (internal document)