How to convert BertLM logits to readable text?

Hi,
I want to convert BertMaskedLM logits to human-readable text. Can anyone suggest how I can post-process the output? Here is the code:

import keras_nlp

features = ["The quick brown fox jumped.", "I forgot my homework."]
# Pretrained language model.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
)
masked_lm.fit(x=features, batch_size=2)

output = masked_lm.predict(["It is a good [MASK]."])

The format of the output is numpy.ndarray.

Thanks in advance for your help!

Hi @Neesham ,

Here is a sample Colab gist that converts BertLM logits to readable text.

Thanks

TensorFlow 2.16 + Python 3.12 – JARaaS Hybrid RAG - 6/16/2024

To convert the output logits from your BertMaskedLM model to human-readable text, you will need to perform the following steps:

  1. Apply Softmax: Convert the logits to probabilities using the softmax function.
  2. Find the Token with the Highest Probability: For each masked position, take the vocabulary index (token ID) with the highest probability.
  3. Map Token IDs to Words: Convert the token IDs to corresponding words using the tokenizer.

Below is an example of how you might achieve this in your code:

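import numpy as np
import keras_nlp

# Load the model without its preprocessor. The default preprocessor picks
# mask positions at random, so here the inputs are built by hand instead.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset(
    "bert_base_en_uncased",
    preprocessor=None,
)

# Tokenizer for BERT
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")

# Input sentence with one mask token
masked_input = "It is a good [MASK]."

# Tokenize the text on either side of [MASK] and put the mask token ID between
# the two parts, so "[MASK]" itself is never split into word pieces.
before, after = masked_input.split("[MASK]")
token_ids = (
    [tokenizer.cls_token_id]
    + np.asarray(tokenizer(before)).tolist()
    + [tokenizer.mask_token_id]
    + np.asarray(tokenizer(after)).tolist()
    + [tokenizer.sep_token_id]
)

# Define the index of the masked token in the input
mask_index = token_ids.index(tokenizer.mask_token_id)

# Model inputs for a single, unpadded sentence
inputs = {
    "token_ids": np.array([token_ids]),
    "segment_ids": np.zeros((1, len(token_ids)), dtype="int32"),
    "padding_mask": np.ones((1, len(token_ids)), dtype="int32"),
    "mask_positions": np.array([[mask_index]]),
}

# Predict logits, shape (batch, number of masked positions, vocabulary size)
logits = masked_lm.predict(inputs)

# Apply softmax over the vocabulary axis to get probabilities
shifted = logits - logits.max(axis=-1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Get the token with the highest probability at the first (only) masked position
predicted_token_id = int(np.argmax(probabilities[0, 0]))

# Convert token ID to word
predicted_token = tokenizer.id_to_token(predicted_token_id)

# Replace [MASK] with the predicted token
output_text = masked_input.replace("[MASK]", predicted_token)

print(f"Input: {masked_input}")
print(f"Output: {output_text}")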

This code snippet does the following:

  1. It loads the BertMaskedLM model and corresponding tokenizer.
  2. It tokenizes the input sentence that contains a masked token.
  3. It predicts the logits using the masked language model.
  4. It finds the index of the [MASK] token in the tokenized input.
  5. It applies the softmax function to the logits to obtain probabilities.
  6. It identifies the token ID with the highest probability at the masked token position.
  7. It converts this token ID back to a human-readable word using the tokenizer.
  8. Finally, it replaces [MASK] in the original sentence with the predicted word and prints the result.
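
To try the post-processing steps (softmax, picking the best token, mapping the ID back to a word) without downloading a model, here is a minimal sketch in plain Python. The vocabulary TOY_VOCAB, the logit values, and the helper names softmax and top_k_tokens are made up for illustration and are not part of KerasNLP; a real BERT vocabulary is far larger and the lookup would go through the tokenizer.

```python
import math

# Hypothetical toy vocabulary, standing in for the tokenizer's real vocabulary.
TOY_VOCAB = ["[PAD]", "[MASK]", "day", "idea", "dog"]


def softmax(logits):
    """Convert one row of logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs for one masked position."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]


# Made-up logits for a single masked position, one score per vocabulary entry.
toy_logits = [-2.0, -3.0, 4.0, 3.0, 1.0]

candidates = top_k_tokens(toy_logits, TOY_VOCAB, k=2)
best_token = candidates[0][0]

# Fill the mask with the most probable token.
print("It is a good [MASK].".replace("[MASK]", best_token))

# Show the runner-up candidates as well, with their probabilities.
for token, prob in candidates:
    print(f"{token}: {prob:.3f}")
```

Listing the top few candidates instead of only the best one is often useful when inspecting a masked language model, since the second or third choice can be just as plausible as the first.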

Sources:

  • For tokenizing text and converting tokens back to text you can refer to KerasNLP and TensorFlow documentation:
    TensorFlow Text Tutorials
    TensorFlow Guide

  • The preceding code example is based on resources available in the document:
    tensorflow text teaching material: text_generation.ipynb (internal document)