Hi,
I came across a weird problem when reading TFRecord files from S3 through tf.data and caching them to a local path. Here is my reading code:
import tensorflow as tf

filenames = ['s3://path1', 's3://path2']  # placeholder S3 URIs
dataset = tf.data.TFRecordDataset(filenames, compression_type="GZIP")
parsed_dataset = (
    dataset.batch(batch_size, num_parallel_calls=tf.data.AUTOTUNE)
    .map(decode, num_parallel_calls=tf.data.AUTOTUNE)
    .cache(cache_file_path)   # cache to a local file path, not to memory
    .prefetch(tf.data.AUTOTUNE)
)
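My understanding is that cache() with no argument keeps elements in RAM, while passing a filename should spill them to disk under that path. A minimal sketch of what I expect, reusing dataset, decode, batch_size, and cache_file_path from the snippet above:

# In-memory cache: elements are held in RAM after the first pass over the data.
mem_cached = dataset.batch(batch_size).map(decode).cache()

# File-based cache: elements should be written to disk under cache_file_path
# rather than kept in RAM.
disk_cached = dataset.batch(batch_size).map(decode).cache(cache_file_path)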
It is very strange that cache() with a file path still consumes host memory, which eventually results in an OOM. Here is the memory usage I printed via a callback during training:
2022-03-08T22:19:40.154191003Z ...Training: end of batch 15700; got log keys: ['loss', 'copc', 'auc']
2022-03-08T22:19:40.159188560Z total memory: 59.958843GB
2022-03-08T22:19:40.159223737Z available memory: 8.418320GB
2022-03-08T22:19:40.159250296Z used memory: 50.959393GB
2022-03-08T22:19:40.159257814Z percent of used memory: 86.000000
2022-03-08T22:19:40.159263710Z free memory: 1.072124GB
2022-03-08T22:19:47.752077011Z Tue Mar 8 22:19:47 UTC 2022 job-submitter: job run error: signal: killed
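For context, the numbers above come from a Keras callback roughly like the following sketch (psutil-based; the class and names here are my reconstruction, not the exact code):

import psutil
import tensorflow as tf

GB = 1024 ** 3

class MemoryLogger(tf.keras.callbacks.Callback):
    # Rough reconstruction of the callback that printed the numbers above.
    def on_train_batch_end(self, batch, logs=None):
        if batch % 100 != 0:
            return
        print(f"...Training: end of batch {batch}; got log keys: {list((logs or {}).keys())}")
        mem = psutil.virtual_memory()
        print(f"total memory: {mem.total / GB:.6f}GB")
        print(f"available memory: {mem.available / GB:.6f}GB")
        print(f"used memory: {mem.used / GB:.6f}GB")
        print(f"percent of used memory: {mem.percent:.6f}")
        print(f"free memory: {mem.free / GB:.6f}GB")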
I have tested the same code on TF 2.3, which does not have this issue, but TF 2.5 and later versions run out of memory as shown above. I am not sure whether this is a bug or a configuration problem. Could anyone help answer this or give some clues about the problem?