Checking if a URL exists or not during dataset construction

Say I have the following URLs:

[
	'http://media.rightmove.co.uk/148k/147518/58718876/147518_SWO160154_EPCGRAPH_01_0000_max_135x100.png',
	'https://thumbs.ebaystatic.com/images/g/DYEAAOSwMHdXR0Vh/s-l225.jpg',
	'https://farm1.staticflickr.com/784/40182677504_27d67600f3_o.jpg',
	'https://t2.ftcdn.net/jpg/00/58/35/35/240_F_58353522_3plS29kylx1KZQ0lU6pYHuCAhUINvCSp.jpg',
	'https://findingblanche.files.wordpress.com/2013/07/photo4-1.jpg?w=764&',
]

I am constructing a tf.data.Dataset object from this list with tf.data.Dataset.from_tensor_slices(). Is it possible to filter out the URLs whose status code is 404 (requests.get(url).status_code) within the data pipeline itself?

May I additionally ask whether it is also possible to filter out those of the above URLs
where the robots.txt explicitly disallows my user-agent?

Could you elaborate on this a bit? Didn’t get it.

@Sayak_Paul thanks for asking.

When requesting external URLs automatically, my concern is often whether the robots.txt allows the script to fetch that URL at all (even just to read the HTTP status code). The robots.txt rules may also change over time, so a URL that is accessible today may be disallowed tomorrow while it is still inside my data pipeline.

For example, as a first step I would also like to check the robots.txt of each URL's host
against my user-agent, before I send a request:

IF https://findingblanche.files.wordpress.com/robots.txt
  GET https://findingblanche.files.wordpress.com/2013/07/photo4-1.jpg...

to ensure that my data pipeline doesn’t request excluded URLs over time.

I’ve only found urllib.robotparser.RobotFileParser.can_fetch(useragent, url) so far, as a pre-condition before sending
HTTP requests (e.g. requests.get(url).status_code) in Python.
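
A minimal sketch of that pre-condition, using only the standard library. The helper names (robots_url_for, is_allowed), the user-agent string "my-dataset-bot" and the sample rules are made up for illustration. Passing robots_lines keeps the demo offline; leaving it out makes the parser download the host's real robots.txt.

```python
from urllib import robotparser
from urllib.parse import urlsplit


def robots_url_for(url):
    """Build the robots.txt URL for the host that serves `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(url, user_agent, robots_lines=None):
    """Return True if robots.txt permits `user_agent` to fetch `url`.

    robots_lines is a hypothetical hook for offline use: a list of
    robots.txt lines to parse instead of downloading the file.
    """
    parser = robotparser.RobotFileParser()
    if robots_lines is None:
        parser.set_url(robots_url_for(url))
        parser.read()  # performs a network request
    else:
        parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Offline demo with invented rules (not the real robots.txt of this host).
sample_rules = [
    "User-agent: my-dataset-bot",
    "Disallow: /2013/",
    "",
    "User-agent: *",
    "Disallow:",
]
url = "https://findingblanche.files.wordpress.com/2013/07/photo4-1.jpg?w=764&"
blocked = not is_allowed(url, "my-dataset-bot", sample_rules)
open_to_others = is_allowed(url, "other-bot", sample_rules)
```

Since the rules can change over time, a check like this would probably need to be re-run (or the parsed robots.txt refreshed) each time the pipeline fetches, rather than once up front.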

I’m not sure I see the advantage of doing this inside tf.data. Why not just filter it first?

AFAIK you’ll need to use a tf.py_function for the actual check.

Then your options are either to return a zero-length tensor for the result if a URL fails, or to use tf.data.experimental.ignore_errors.

But it likely won’t scale when you have millions of URLs.
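
As a rough sketch of the tf.py_function route, one variant is to wrap a plain Python check and use it as a Dataset.filter predicate. The function name filter_urls_in_pipeline and the check argument are hypothetical; the stand-in check below does no network I/O, and a real one would issue an HTTP request per URL.

```python
import tensorflow as tf


def filter_urls_in_pipeline(dataset, check):
    """Keep only the elements for which check(url: str) returns True."""

    def _check(url_tensor):
        # Inside tf.py_function the argument is an eager tensor holding bytes.
        return bool(check(url_tensor.numpy().decode("utf-8")))

    def _predicate(url):
        ok = tf.py_function(_check, [url], tf.bool)
        ok.set_shape([])  # py_function drops the static shape; declare a scalar
        return ok

    return dataset.filter(_predicate)


# Offline demo: the lambda is a stand-in for a real status-code check.
urls = ["https://example.com/a.jpg", "https://example.com/missing.jpg"]
ds = tf.data.Dataset.from_tensor_slices(urls)
kept = filter_urls_in_pipeline(ds, check=lambda u: "missing" not in u)
```

Each element calls back into Python one at a time, which is part of why this is unlikely to scale well.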

Let me provide a better context.

I am working with parquet files. A sample can be downloaded from here:

!wget -q http://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

Then I am reading it with tensorflow_io:

import tensorflow as tf
import tensorflow_io as tfio

column_spec = {
    "URL": tf.TensorSpec(tf.TensorShape([]), tf.string),
}

files = tf.io.gfile.glob("*.parquet")

dataset = tfio.IODataset.from_parquet(files[0], columns=column_spec)

Then my idea was to read the images on the fly:

img_sz = (260, 260)

def read_image(features):
    raw = tf.io.read_file(features["URL"])
    image = tf.image.decode_jpeg(raw, channels=3)
    image = tf.image.resize(image, img_sz, antialias=True)
    return image

But as you can imagine, raw = tf.io.read_file(features["URL"]) would fail if the URL returns a 404. So I thought maybe we could also add a check on the fly to filter out such URLs.

TensorFlow doesn’t understand HTTP URLs:

raw = tf.io.read_file(features["URL"])
UnimplementedError: File system scheme 'https' not implemented ...

With the filesystem-plugins, that might be possible. So I think you do need the py_function here. “ignore_errors” or returning an empty result still sounds like the solution.
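
A hedged sketch of the ignore_errors option: the fetch runs inside tf.py_function and raises on failure, and the dataset silently drops the elements that raised. fetch_or_raise is a hypothetical stand-in that does no network I/O; a real version would download and return the image bytes. Newer TensorFlow releases expose ignore_errors as a Dataset method, so the sketch falls back to the experimental transformation only when that method is missing.

```python
import tensorflow as tf


def fetch_or_raise(url_tensor):
    """Hypothetical stand-in for a download; raises for a 'broken' URL."""
    url = url_tensor.numpy().decode("utf-8")
    if "missing" in url:
        raise ValueError(f"could not fetch {url}")
    return url.upper()  # a real version would return the image bytes


def drop_failures(dataset):
    """Map the fetch over the dataset and skip elements whose fetch raised."""
    dataset = dataset.map(
        lambda u: tf.py_function(fetch_or_raise, [u], tf.string)
    )
    if hasattr(dataset, "ignore_errors"):  # Dataset method in newer releases
        return dataset.ignore_errors()
    return dataset.apply(tf.data.experimental.ignore_errors())


urls = ["a", "missing", "c"]
fetched = drop_failures(tf.data.Dataset.from_tensor_slices(urls))
```

One caveat with this approach is that every error is dropped, not just a 404, so unrelated bugs in the fetch function can go unnoticed.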

won’t scale when you have millions of URLs.

At least waiting for IO is something Python threads can do in parallel.

from concurrent import futures

with futures.ThreadPoolExecutor() as executor:
    executor.map(_maybe_fetch, urls)
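
To make the thread-pool idea a bit more concrete, here is a sketch of pre-filtering a URL list before any tf.data pipeline is built. url_exists and filter_urls are hypothetical names, and the check uses a HEAD request from the standard library's urllib rather than requests, which is an assumption on my part; some servers reject HEAD, so a GET may be needed in practice.

```python
from concurrent import futures
import urllib.request


def url_exists(url, timeout=5.0):
    """Return True if a HEAD request to `url` ends in a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # HTTPError (e.g. 404), URLError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def filter_urls(urls, check=url_exists, max_workers=16):
    """Run `check` on every URL in parallel threads and keep the passing ones."""
    urls = list(urls)
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(check, urls))  # map preserves input order
    return [url for url, ok in zip(urls, results) if ok]


# Offline demo with a stand-in check instead of real HTTP requests.
demo = filter_urls(["a", "b", "c"], check=lambda u: u != "b")
```

The surviving list could then be handed to tf.data.Dataset.from_tensor_slices() as in the original question.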

Either way this sounds like something you’ll want to do once and cache the result, like with Dataset.snapshot, no?

Okay. So, what do you suggest doing? Download the images first and then create the pipeline?

Download the images first and then create the pipeline?

If you’re running more than one epoch that might be a good idea.

If you build it as a tf.data.Dataset then ds.snapshot might have the same effect.

It might be even better to first download the images and store them as TFRecords. When the number of images is high, this is likely to be more beneficial.
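
A minimal sketch of what that could look like, assuming the images have already been downloaded as raw bytes. The feature names "url" and "image" and the helpers write_tfrecord and read_tfrecord are made up for illustration.

```python
import tensorflow as tf


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def write_tfrecord(records, path):
    """Write an iterable of (url, image_bytes) pairs to one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for url, image_bytes in records:
            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        "url": _bytes_feature(url.encode("utf-8")),
                        "image": _bytes_feature(image_bytes),
                    }
                )
            )
            writer.write(example.SerializeToString())


FEATURE_SPEC = {
    "url": tf.io.FixedLenFeature([], tf.string),
    "image": tf.io.FixedLenFeature([], tf.string),
}


def read_tfrecord(path):
    """Return a dataset of {"url": ..., "image": ...} dicts of raw bytes."""
    dataset = tf.data.TFRecordDataset(path)
    return dataset.map(lambda rec: tf.io.parse_single_example(rec, FEATURE_SPEC))
```

The "image" entry still holds encoded bytes, so decoding and resizing (as in read_image above) would happen in a later map step.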