I am using the code below to read two CSV files into TensorFlow datasets:
import tensorflow as tf

# User/song interactions, read in batches of 5 rows.
ratings_ds = tf.data.experimental.make_csv_dataset(
    "./train_recom_transformed.csv",
    batch_size=5,
    select_columns=['user_id', 'song_id', 'listen_count', 'ratings',
                    'title', 'release', 'artist_name', 'year', 'count'],
    header=True,
    num_epochs=1,
    ignore_errors=False,
)

# Song metadata, read in batches of 128 rows.
songs_ds = tf.data.experimental.make_csv_dataset(
    "./songs_details.csv",
    batch_size=128,
    select_columns=['song_id', 'title', 'release', 'artist_name', 'year'],
    num_epochs=1,
    ignore_errors=True,
)

# Flatten the batches into single rows and keep only the needed features.
ratings = ratings_ds.unbatch().map(lambda x: {
    "song_id": x["song_id"],
    "user_id": x["user_id"],
    "ratings": x["ratings"],
    "release": x["release"],
    "artist_name": x["artist_name"],
    "title": x["title"],
    "year": x["year"],
    "listencount": x["listen_count"],
    "count": x["count"],
})

songs = songs_ds.unbatch().map(lambda x: {
    "song_id": x["song_id"],
    "release": x["release"],
    "artist_name": x["artist_name"],
    "title": x["title"],
    "year": x["year"],
})

# Positional split: first 12000 rows for training, next 4000 for testing.
train = ratings.take(12000)
test = ratings.skip(12000).take(4000)

With this code, how can I ensure that the same user_id does not appear in both the train and test datasets? In other words, how can I avoid data leakage from train to test?
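
To make the goal concrete, here is a minimal sketch of the kind of split I am after. It is only an illustration under assumptions: the toy in-memory dataset stands in for `ratings`, `user_id` is assumed to be a string column (an integer column would first need `tf.strings.as_string`), and `NUM_BUCKETS`, `TEST_BUCKETS`, and `is_test_user` are hypothetical names. Each user ID is hashed into a bucket, and the bucket alone decides whether all of that user's rows go to train or to test.

```python
import tensorflow as tf

# Toy stand-in for the unbatched `ratings` dataset (hypothetical values).
ratings = tf.data.Dataset.from_tensor_slices({
    "user_id": ["u1", "u2", "u3", "u1", "u4", "u2", "u5", "u3"],
    "song_id": ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"],
    "ratings": [5, 3, 4, 2, 1, 5, 4, 3],
})

NUM_BUCKETS = 10  # assumption: number of hash buckets
TEST_BUCKETS = 2  # assumption: buckets 0 and 1 (about 20% of users) form the test set


def is_test_user(x):
    # The hash is deterministic, so a given user always lands in the same bucket.
    bucket = tf.strings.to_hash_bucket_fast(x["user_id"], NUM_BUCKETS)
    return bucket < TEST_BUCKETS


# Every row of a user follows that user's bucket, so the two sets share no users.
test = ratings.filter(is_test_user)
train = ratings.filter(lambda x: tf.logical_not(is_test_user(x)))

train_users = {row["user_id"] for row in train.as_numpy_iterator()}
test_users = {row["user_id"] for row in test.as_numpy_iterator()}
print("train users:", train_users)
print("test users:", test_users)
```

The split ratio is only approximate, since it depends on how the user IDs hash and on how many rows each user has, so the row counts will not be exactly 12000 and 4000.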
I tried sorting the CSV file, but the sort order is lost when the file is read into a TensorFlow dataset.
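
A likely cause of the lost order is that `make_csv_dataset` shuffles rows by default (`shuffle=True`). The sketch below, which writes a small hypothetical CSV to a temporary directory, suggests that passing `shuffle=False` keeps the rows in file order. The file name and its contents are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write a tiny CSV that is already sorted by user_id (hypothetical data).
csv_path = os.path.join(tempfile.mkdtemp(), "sorted_ratings.csv")
with open(csv_path, "w") as f:
    f.write("user_id,ratings\n")
    for uid in ["a", "a", "b", "b", "c", "c", "d", "d"]:
        f.write(f"{uid},1\n")

# shuffle=False disables the default row shuffling, so rows come out in file order.
ds = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=4,
    num_epochs=1,
    shuffle=False,
)

order = [row["user_id"].decode() for row in ds.unbatch().as_numpy_iterator()]
print(order)
```

Even with the order preserved, a positional `take`/`skip` split can still cut through the middle of one user's rows, so sorting alone would not fully rule out an overlap at the boundary.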