Why does the TensorFlow documentation use the np.random.rand()
function to create a train/test split of the dataset? For example:
import numpy as np

def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas DataFrame in two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]
The above code snippet is copied from Automated hyper-parameter tuning. We can't guarantee that this code will split exactly 30% of the dataset into test and the remaining 70% into training, since the process depends on random number generation without any initial seed. So why does the official documentation use NumPy random number generation to split the dataset?
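For contrast, here is a minimal sketch of a split that is both reproducible (seeded) and exact in size. The function name `split_dataset_exact` and the seed value are my own choices for illustration, not something from the TensorFlow docs:

```python
import numpy as np
import pandas as pd

def split_dataset_exact(dataset, test_ratio=0.30, seed=42):
    """Reproducible split with an exact test-set size.

    Unlike the docs' version, this seeds the generator and takes
    exactly int(len(dataset) * test_ratio) rows for the test set.
    """
    rng = np.random.default_rng(seed)        # seeded generator -> same split every run
    indices = rng.permutation(len(dataset))  # shuffled row positions
    n_test = int(len(dataset) * test_ratio)  # exact number of test rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return dataset.iloc[train_idx], dataset.iloc[test_idx]

# Usage: 10 rows with test_ratio=0.30 always gives 7 train / 3 test rows
df = pd.DataFrame({"x": range(10)})
train, test = split_dataset_exact(df)
print(len(train), len(test))  # 7 3
```

With the docs' version, each row lands in the test set independently with probability `test_ratio`, so the actual test fraction only *approximates* 30% and varies between runs; the sketch above trades that simplicity for determinism.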