We have reimplemented the Video Swin Transformer model in #Keras, with multi-backend framework support planned for the future. The pretrained weights are available in both SavedModel and H5 formats.
#VideoSwin is a pure transformer-based video modeling architecture that attains top accuracy on the major video recognition benchmarks.
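Since the weights ship in both formats, the SavedModel variant can be loaded with the standard Keras API as well. A minimal sketch, assuming the export directory is named after the checkpoint (the directory name here is an assumption):

import tensorflow as tf

# Hypothetical path to the exported SavedModel directory.
model = tf.keras.models.load_model(
    'TFVideoSwinT_K400_IN1K_P244_W877_32x224'
)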
An inference highlight:
>>> import numpy as np
>>> import tensorflow as tf
>>> from videoswin import VideoSwinT

>>> # Build the model and load the Kinetics-400 pretrained weights.
>>> model = VideoSwinT(num_classes=400)
>>> model.load_weights(
...     'TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5'
... )

>>> # read_video and frame_sampling are sketched after this example.
>>> container = read_video('sample.mp4')
>>> frames = frame_sampling(container, num_frames=32)
>>> y = model(frames)
>>> y.shape
TensorShape([1, 400])

>>> # Softmax over the logits, then sort classes by descending probability.
>>> probabilities = tf.nn.softmax(y).numpy().squeeze(0)
>>> confidences = {
...     label_map_inv[i]: float(probabilities[i])
...     for i in np.argsort(probabilities)[::-1]
... }
>>> confidences  # classification result on a sample from Kinetics-400 (top-5 shown)
{
'playing_cello': 0.9941741824150085,
'playing_violin': 0.0016851733671501279,
'playing_recorder': 0.0011555481469258666,
'playing_clarinet': 0.0009695519111119211,
'playing_harp': 0.0007713600643910468
}
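The transcript above assumes three helpers that are not shown: read_video, frame_sampling, and label_map_inv. A minimal sketch of what they might look like, using OpenCV for decoding and uniform temporal sampling; all names and details below are assumptions, not the repository's actual implementation.

import cv2  # assumed dependency for video decoding
import numpy as np

def read_video(path):
    # Decode every frame into a (T, H, W, 3) RGB array.
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

def frame_sampling(container, num_frames=32, size=224):
    # Uniformly sample num_frames frames, resize to size x size,
    # scale to [0, 1], and add a batch axis:
    # output shape (1, num_frames, size, size, 3).
    idx = np.linspace(0, len(container) - 1, num_frames).astype(int)
    clip = np.stack(
        [cv2.resize(container[i], (size, size)) for i in idx]
    )
    return clip[None].astype('float32') / 255.0

# label_map_inv maps a class index back to its Kinetics-400 name,
# e.g. built from a list of the 400 class names whose order is
# assumed to match the model's output head:
# label_map_inv = {i: name for i, name in enumerate(class_names)}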