I am constructing a tf.data.Dataset object from this list with tf.data.Dataset.from_tensor_slices(). Now, is it possible to filter out URLs whose status code is 404 (requests.get(url).status_code) within the data pipeline itself?
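For reference, a minimal sketch of what I currently have (url_list is just a placeholder for my actual list):

import tensorflow as tf

# url_list stands in for the real list of image URLs
url_list = ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
ds = tf.data.Dataset.from_tensor_slices({"URL": url_list})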
When requesting external URLs automatically, my concern is often whether the robots.txt allows the script to fetch that URL (quite apart from the HTTP status code). Also, robots.txt rules may change over time, so a URL that is accessible today may be disallowed tomorrow inside my data pipeline.
For example, I would also like to test the ../robots.txt of each URL against my user-agent in a first step, before I send a request:

IF https://findingblanche.files.wordpress.com/robots.txt allows it:
GET https://findingblanche.files.wordpress.com/2013/07/photo4-1.jpg...

to ensure that my data pipeline doesn't request excluded URLs over time.
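A rough sketch of how I imagine that check, using Python's standard urllib.robotparser (the user-agent string here is just a placeholder):

from urllib import robotparser
from urllib.parse import urlsplit

def is_allowed(url, user_agent="my-data-pipeline"):
    # Build the site-root robots.txt URL from the target URL
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)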
But as you can imagine, raw = tf.io.read_file(features["URL"]) would fail if the URL itself returns a 404. So I thought maybe we can add an on-the-fly check to filter out such URLs.
UnimplementedError: File system scheme 'https' not implemented ...
With the filesystem plugins that might be possible, but otherwise I think you do need tf.py_function here. ignore_errors() or returning an empty result still sounds like the solution.
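A minimal sketch of that idea, assuming the dataset elements are dicts with a "URL" key and that a requests HEAD call is acceptable for the check:

import requests
import tensorflow as tf

def _url_ok(url):
    # A HEAD request is usually enough to detect a 404 without downloading the body
    try:
        status = requests.head(url.numpy().decode("utf-8"),
                               allow_redirects=True, timeout=5).status_code
        return status != 404
    except requests.RequestException:
        return False

# Keep only elements whose URL does not return a 404
ds = ds.filter(lambda features: tf.py_function(_url_ok, [features["URL"]], tf.bool))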
Checking URLs one by one like that won’t scale when you have millions of URLs, though.
At least waiting for IO is something Python threads can do in parallel:
from concurrent import futures

with futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(_maybe_fetch, urls))
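_maybe_fetch isn't defined above; one possible shape for it, assuming a requests HEAD check with a short timeout, would be:

import requests

def _maybe_fetch(url):
    # Return the URL if it is reachable (anything but 404), otherwise None
    try:
        if requests.head(url, allow_redirects=True, timeout=5).status_code != 404:
            return url
    except requests.RequestException:
        pass
    return None

The surviving URLs (results that are not None) could then be passed to from_tensor_slices as before.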
Either way this sounds like something you’ll want to do once and cache the result, like with Dataset.snapshot, no?
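Something like this, with a placeholder path:

# Persist the (expensive) filtered dataset on disk so later epochs/runs
# read from the snapshot instead of re-checking every URL
ds = ds.snapshot("/tmp/url_filtered_snapshot")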
It might be even better to first download the images and store them as TFRecords; when the number of images is high, that is likely more beneficial.
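A rough sketch of that approach, assuming the images have already been downloaded as raw bytes (downloaded_images and the output filename are placeholders):

import tensorflow as tf

# Placeholder: (url, raw image bytes) pairs produced by a prior download step
downloaded_images = [("https://example.com/a.jpg", b"...")]

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

with tf.io.TFRecordWriter("images.tfrecord") as writer:
    for url, image_bytes in downloaded_images:
        example = tf.train.Example(features=tf.train.Features(feature={
            "url": _bytes_feature(url.encode("utf-8")),
            "image": _bytes_feature(image_bytes),
        }))
        writer.write(example.SerializeToString())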