I am constructing a tf.data.Dataset object from this list with tf.data.Dataset.from_tensor_slices(). Now, is it possible to filter out URLs whose status code is 404 (requests.get(url).status_code) within the data pipeline itself?
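For reference, a minimal sketch of what I currently have (url_list is just a placeholder for my actual list):

import tensorflow as tf

# url_list stands in for the real list of image URLs
url_list = ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
ds = tf.data.Dataset.from_tensor_slices({"URL": url_list})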
When requesting external URLs automatically, my concern is often whether the robots.txt allows the script to fetch that URL (quite apart from the HTTP status code). Also, robots.txt rules may change over time, so a URL that is accessible today may be disallowed tomorrow inside my data pipeline.
For example, I would also like to test the ../robots.txt of each URL against my user-agent in a first step, before I send a request:

IF https://findingblanche.files.wordpress.com/robots.txt allows it:
GET https://findingblanche.files.wordpress.com/2013/07/photo4-1.jpg...

to ensure that my data pipeline doesn't request excluded URLs over time.
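A rough sketch of how I imagine that check, using Python's standard urllib.robotparser (the user-agent string here is just a placeholder):

from urllib import robotparser
from urllib.parse import urlsplit

def is_allowed(url, user_agent="my-data-pipeline"):
    # Build the site-root robots.txt URL from the target URL
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)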
But as you can imagine, raw = tf.io.read_file(features["URL"]) would fail if the URL itself returns a 404. So I thought maybe we can add an on-the-fly check to filter out such URLs.
UnimplementedError: File system scheme 'https' not implemented ...
With the filesystem plugins that might be possible, but otherwise I think you do need tf.py_function here. ignore_errors() or returning an empty result still sounds like the solution.
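A minimal sketch of that idea, assuming the dataset elements are dicts with a "URL" key and that a requests HEAD call is acceptable for the check:

import requests
import tensorflow as tf

def _url_ok(url):
    # A HEAD request is usually enough to detect a 404 without downloading the body
    try:
        status = requests.head(url.numpy().decode("utf-8"),
                               allow_redirects=True, timeout=5).status_code
        return status != 404
    except requests.RequestException:
        return False

# Keep only elements whose URL does not return a 404
ds = ds.filter(lambda features: tf.py_function(_url_ok, [features["URL"]], tf.bool))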
Checking URLs one by one like that won’t scale when you have millions of URLs, though.
At least waiting for IO is something Python threads can do in parallel:
from concurrent import futures

with futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(_maybe_fetch, urls))
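_maybe_fetch isn't defined above; one possible shape for it, assuming a requests HEAD check with a short timeout, would be:

import requests

def _maybe_fetch(url):
    # Return the URL if it is reachable (anything but 404), otherwise None
    try:
        if requests.head(url, allow_redirects=True, timeout=5).status_code != 404:
            return url
    except requests.RequestException:
        pass
    return None

The surviving URLs (results that are not None) could then be passed to from_tensor_slices as before.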
Either way this sounds like something you’ll want to do once and cache the result, like with Dataset.snapshot, no?
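Something like this, with a placeholder path:

# Persist the (expensive) filtered dataset on disk so later epochs/runs
# read from the snapshot instead of re-checking every URL
ds = ds.snapshot("/tmp/url_filtered_snapshot")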
It might be even better to first download the images and store them as TFRecords; when the number of images is high, that is likely more beneficial.
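A rough sketch of that approach, assuming the images have already been downloaded as raw bytes (downloaded_images and the output filename are placeholders):

import tensorflow as tf

# Placeholder: (url, raw image bytes) pairs produced by a prior download step
downloaded_images = [("https://example.com/a.jpg", b"...")]

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

with tf.io.TFRecordWriter("images.tfrecord") as writer:
    for url, image_bytes in downloaded_images:
        example = tf.train.Example(features=tf.train.Features(feature={
            "url": _bytes_feature(url.encode("utf-8")),
            "image": _bytes_feature(image_bytes),
        }))
        writer.write(example.SerializeToString())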