Avoid data leakage from train to test in a TensorFlow dataset / Splitting a TensorFlow dataset into train and test without data leakage

mksakeesh · August 9, 2022, 1:06am

I am using the code below to read CSV files into TensorFlow datasets:

import tensorflow as tf

ratings_ds = tf.data.experimental.make_csv_dataset(
    "./train_recom_transformed.csv",
    batch_size=5,
    select_columns=['user_id', 'song_id', 'listen_count', 'ratings','title','release','artist_name','year','count'],
    header=True,
    num_epochs=1,
    ignore_errors=False,)
songs_ds = tf.data.experimental.make_csv_dataset(
    "./songs_details.csv",
    batch_size=128,
    select_columns=['song_id','title','release','artist_name','year'],
    num_epochs=1,
    ignore_errors=True,)


ratings = ratings_ds.unbatch().map(lambda x: {
    "song_id": x["song_id"],
    "user_id": x["user_id"],
    "ratings": x["ratings"],
    "release":x["release"],
    "artist_name":x["artist_name"],
    "title":x["title"],
    "year":x["year"],
    "listencount":x["listen_count"],
    "count":x["count"],
})
songs = songs_ds.unbatch().map(lambda x: {
    "song_id":x["song_id"],
    "release":x["release"],
    "artist_name":x["artist_name"],
    "title":x["title"],
    "year":x["year"],
})

train = ratings.take(12000)
test = ratings.skip(12000).take(4000)

In this code, how can I ensure that the same user ID does not appear in both the train and test datasets? How can I avoid data leakage from train to test?

I tried sorting the CSV file, but the sort order is lost when the file is read into a TensorFlow dataset.

rcauvin · August 13, 2022, 9:35pm

Are you preparing to train and test a recommender model?

lgusm · August 15, 2022, 9:21pm

Off the top of my head:

I’d preprocess this data (in Pandas) and save separate files for train and test, split by whatever group you want. In your case, that would be by user_id.

This makes it easier to keep experimenting with the model later, as the data is already properly split into files.
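A minimal sketch of that idea, assuming the ratings fit in memory as a Pandas DataFrame. The helper name `split_by_user`, the 25% test fraction, the seed, and the sample rows are all made up for illustration; only the `user_id`, `song_id`, and `ratings` column names come from the question.

```python
import random

import pandas as pd


def split_by_user(df, test_fraction=0.25, seed=42, user_col="user_id"):
    """Put all rows of a given user on exactly one side of the split."""
    # Unique user IDs as a plain Python list, shuffled reproducibly.
    users = df[user_col].drop_duplicates().tolist()
    random.Random(seed).shuffle(users)

    # Reserve a fraction of the users (at least one) for the test side.
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])

    is_test = df[user_col].isin(test_users)
    return df[~is_test], df[is_test]


# Tiny in-memory stand-in for the real ratings file (made-up rows).
ratings_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u4"],
    "song_id": ["s1", "s2", "s1", "s3", "s4", "s2"],
    "ratings": [5, 3, 4, 2, 5, 1],
})

train_df, test_df = split_by_user(ratings_df)

# No user ID appears on both sides of the split.
assert set(train_df["user_id"]).isdisjoint(test_df["user_id"])

# With the real data, each side could then be saved to its own file
# (hypothetical file names) and read back separately:
# train_df.to_csv("train_split.csv", index=False)
# test_df.to_csv("test_split.csv", index=False)
```

Because the split is made on user IDs rather than on rows, the row counts of the two sides will only roughly match the requested fraction when users have different numbers of ratings.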

mksakeesh · August 18, 2022, 1:28am

Yes, I am trying to build a recommendation model.

rcauvin · August 18, 2022, 4:20pm

I’m not quite sure why you don’t want the same user ID to appear in the train and test datasets. The ratings dataset represents ratings on user-item pairs. Thus the same user may appear multiple times in the dataset, rating different items. Typically, you want the model to learn the user’s item preferences in the train dataset and predict whether the same user will like a different item in the test dataset.
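As a rough sketch of that kind of split, assuming the ratings are loaded into a Pandas DataFrame first: hold out a few ratings per user for testing and keep the rest for training, so every test user is also seen in training but on different items. The helper name `holdout_per_user`, the holdout size, the seed, and the sample rows are hypothetical.

```python
import pandas as pd


def holdout_per_user(df, n_holdout=1, seed=42, user_col="user_id"):
    """Hold out n_holdout rows per user; users with too few rows stay in train."""
    # Shuffle rows so the held-out ratings are chosen at random.
    shuffled = df.sample(frac=1.0, random_state=seed)

    # Position of each row within its user's group: 0, 1, 2, ...
    rank = shuffled.groupby(user_col).cumcount()
    # Total number of rows for the user that each row belongs to.
    size = shuffled[user_col].map(shuffled[user_col].value_counts())

    # Only hold out rows for users who keep at least one row in train.
    is_test = (rank < n_holdout) & (size > n_holdout)
    return shuffled[~is_test], shuffled[is_test]


# Made-up sample: u1 and u3 have two ratings each, u2 and u4 have one.
sample_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u4"],
    "song_id": ["s1", "s2", "s1", "s3", "s4", "s2"],
    "ratings": [5, 3, 4, 2, 5, 1],
})

train_part, test_part = holdout_per_user(sample_df)

# Every user in the test part also has ratings in the train part.
assert set(test_part["user_id"]) <= set(train_part["user_id"])
```

With this layout, the leakage to watch for is the same (user_id, song_id) pair landing on both sides, which cannot happen here as long as the input has no duplicate pairs.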
