Data loading into TensorFlow

Hi, I am very new to TF. I usually work with large datasets. I wanted to know the best/optimized ways to load data into TensorFlow Datasets, or into TensorFlow in general, and then perform operations such as sorting, standardization, and clustering.

I use local VS Code as well as Azure Databricks, so I would like to know loading methods both for CSV on a local system and for Parquet and Delta/UC tables on Databricks.

Also, are there any methods to convert data to and from a PySpark DataFrame?

@Sudh_Kumar Welcome to the TensorFlow Forum!

When working with TensorFlow datasets (the tf.data.Dataset API) to perform operations like sorting, standardization, and clustering, there are several optimized approaches and techniques you can use. I’ll provide some general guidelines for each operation:

  1. Sorting: Sorting in tf.data can be approached with the map() function, which applies a transformation to each element. Note that map() works element by element, so this reorders values within each element (for example, within a batch) rather than globally sorting a streamed dataset. For instance, to sort each batch based on its labels:
import tensorflow as tf

def sort_fn(features, label):
    # Permutation that sorts the labels of this element/batch
    sorted_indices = tf.argsort(label)
    # Reorder features and labels with the same permutation
    return tf.gather(features, sorted_indices), tf.gather(label, sorted_indices)

# Assuming 'dataset' yields batched (features, label) pairs
sorted_dataset = dataset.map(sort_fn)
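If the data fits in memory, a global sort is possible by materializing it first and building the dataset from the sorted tensors. A minimal sketch with made-up values, where keys stands in for the column to sort by:

import tensorflow as tf

features = tf.constant([[5., 6.], [1., 2.], [3., 4.]])
keys = tf.constant([3, 1, 2])
order = tf.argsort(keys)  # permutation that sorts the keys
sorted_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.gather(features, order), tf.gather(keys, order)))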
  2. Standardization: Standardizing the data is a common preprocessing step in machine learning. You can use TensorFlow’s tf.image.per_image_standardization for image datasets or the tf.keras.layers.Normalization preprocessing layer for non-image (tabular) datasets.
import tensorflow as tf

# Assuming you have a dataset 'dataset' with features and labels
def standardize_fn(features, label):
    # Scale each image to zero mean and unit variance
    features = tf.image.per_image_standardization(features)
    return features, label

standardized_dataset = dataset.map(standardize_fn)
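For tabular features, a minimal sketch using the Normalization layer (the values are made up):

import tensorflow as tf

data = tf.constant([[1., 2.], [3., 4.], [5., 6.]])
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(data)           # learn per-feature mean and variance from the data
standardized = norm(data)  # (x - mean) / sqrt(var)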
  3. Clustering: TensorFlow 2.x has no built-in clustering API (there is no tf.cluster module; the old KMeansClustering estimator lived in TF 1.x contrib), so you either implement k-means with plain TF ops or use a library such as scikit-learn. For example, the following sketch computes one k-means assignment step over a small in-memory dataset:
import tensorflow as tf

points = tf.constant([[1., 2.], [3., 4.], [5., 6.]])
centroids = points[:3]  # 3 initial centroids, chosen arbitrarily here
# Pairwise distances between every point and every centroid
dists = tf.norm(points[:, None] - centroids[None], axis=-1)
assignments = tf.argmin(dists, axis=1)  # cluster index for each point

The best approach for optimizing these operations in TensorFlow Datasets will depend on the specific nature of your dataset and the desired results.

These are just a few of the ways to perform operations such as sorting, standardization, and clustering on TensorFlow datasets. For more information, please refer to the TensorFlow Datasets documentation: https://www.tensorflow.org/datasets.

In addition to the above, here are some other tips for optimizing the performance of TensorFlow Datasets:

  • Use the prefetch() method to overlap data preprocessing with model execution, so the next batch is prepared while the current one is consumed. This can significantly improve pipeline throughput.
  • Use the cache() method to keep the dataset in memory (or in a local file) after the first pass. This is useful if you repeatedly iterate over the same dataset.
  • Use the repeat() method to iterate over the dataset a specified number of times, which is useful when training a model for several epochs on a small dataset; a combined sketch follows this list.
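A minimal sketch of how these fit together, assuming a placeholder file path and batch size:

import tensorflow as tf

dataset = tf.data.TextLineDataset("data.csv")   # hypothetical file
dataset = (dataset
           .cache()                             # keep lines in memory after the first pass
           .repeat(3)                           # three passes over the data
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))         # overlap with model execution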

TensorFlow itself has no built-in Parquet or Delta reader. For Parquet files on Databricks (or locally), one option is the tensorflow-io package; a minimal sketch, where the file path is a placeholder:

import tensorflow as tf
import tensorflow_io as tfio  # pip install tensorflow-io

file_path = "dbfs:/path/to/file.parquet"
# Read the Parquet columns into a tf.data-compatible dataset
dataset = tfio.IODataset.from_parquet(file_path)
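For Delta or Unity Catalog tables, a common pattern on Databricks is to read the table into a Spark DataFrame and bridge it to tf.data with Petastorm’s Spark converter. A hedged sketch; the cache directory and table name are placeholders:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame to a cache dir that TF can read
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")
df = spark.read.table("catalog.schema.my_table")  # Delta / UC table
converter = make_spark_converter(df)
with converter.make_tf_dataset() as tf_dataset:
    pass  # tf_dataset is a tf.data.Dataset here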

Also, there is no direct tf.data constructor for PySpark DataFrames, but you can convert data to and from them via pandas (or via Petastorm, as above). Here are a few examples:

Converting a PySpark DataFrame to a TensorFlow dataset (this collects to the driver, so it is only suitable for data that fits in memory):

import tensorflow as tf

spark_df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ["a", "b"])
pandas_df = spark_df.toPandas()  # collect to the driver as pandas
tf_dataset = tf.data.Dataset.from_tensor_slices(dict(pandas_df))

To convert a TensorFlow dataset to a PySpark DataFrame, materialize it to Python objects first (again, only practical for in-memory data):

import tensorflow as tf

tf_dataset = tf.data.Dataset.from_tensor_slices([(1, 2), (3, 4), (5, 6)])
# Eagerly iterate the dataset and rebuild plain Python rows
rows = [tuple(int(v) for v in row.numpy()) for row in tf_dataset]
spark_df = spark.createDataFrame(rows, ["a", "b"])

Let us know if this helps!

Hi, thanks a lot for this detailed reply.
I’ll try to explain in more detail what my use case is and what exactly I am trying to do.

I am not following the standard features-and-label model.
I have a large CSV dataset with 5 string columns, 2 integer columns, and 1 unique integer identifier column.

What I have been trying to do is read this data directly into a TensorFlow dataset and then sort the data based on the 2 integer columns.

If you have a very large dataset then you might also want to look at TensorFlow I/O.

But a tf.data.Dataset is not intended to be sorted; it is intended to be streamed. For a large CSV I would look at Polars. But I’m sure Databricks has a tool for that too.
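A hedged sketch of the Polars route (the file path and column names are made up); the sorted frame can then be handed to tf.data:

import polars as pl
import tensorflow as tf

df = (pl.scan_csv("data.csv")              # lazy scan of the CSV
        .sort(["int_col_1", "int_col_2"])  # sort on the two integer columns
        .collect())
# Build a tf.data dataset from the sorted columns
dataset = tf.data.Dataset.from_tensor_slices(
    {col: df[col].to_numpy() for col in df.columns})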
