I am trying to load and preprocess text in Google Colab using TensorFlow, and I seem to be having problems importing my text files.
Based on Example 2 of the tutorial, I ran:
import pathlib
from tensorflow.keras import utils

DIRECTORY_URL = 'Github URL'
FILE_NAMES = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']

# Download each file into the Keras cache directory
for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())
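For context, the (Sentence, Label) pairs below come out of the labeling step that follows in the tutorial; condensed, my version of that step looks roughly like this (reconstructed from the tutorial, so treat the details as approximate):

import tensorflow as tf

# One tf.data pipeline per file: each line becomes a (text, label) pair,
# where the label is the file's index in FILE_NAMES.
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir / file_name))
    labeled_data_sets.append(lines_dataset.map(lambda ex, i=i: labeler(ex, i)))

# Concatenate the per-file datasets and print a few examples
all_labeled_data = labeled_data_sets[0]
for ds in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(ds)

for text, label in all_labeled_data.take(3):
    print('Sentence:', text.numpy())
    print('Label:', label.numpy())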
However, I am getting these odd (Sentence, Label) pairs when I print a few examples:
Sentence: b' name-with-owner="YWtlNzAwL1B5dGhvbg=="'
Label: 3
Sentence: b' </tr>'
Label: 0
Sentence: b' </tr>'
Label: 0
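Those fragments look like pieces of GitHub's page markup rather than my text. For reference, my understanding is that github.com/&lt;user&gt;/&lt;repo&gt;/blob/... URLs return the HTML viewer page, while the raw bytes live under raw.githubusercontent.com; a sketch with made-up user, repo, and branch names:

# raw.githubusercontent.com/<user>/<repo>/<branch>/<path> serves the file
# contents directly, unlike the github.com page URLs.
DIRECTORY_URL = 'https://raw.githubusercontent.com/someUser/someRepo/main/'
for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)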
Am I correct in assuming that TensorFlow is downloading GitHub's HTML pages rather than the raw text files? I also went and tried the first example in the same tutorial by uploading a zipped folder of the text files and unzipping it via TensorFlow in Colab, then reading the files:
data_url = 'GithubUrlWithZip.7z'

dataset_dir = utils.get_file(
    origin=data_url,
    extract=True,
    cache_dir=None,
    cache_subdir='')
dataset_dir = pathlib.Path(dataset_dir).parent
text_dir = dataset_dir / "datasets"
list(text_dir.iterdir())
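From the get_file docs, I believe extract=True only handles tar and zip archives, so my .7z was probably downloaded but never unpacked. A minimal sketch of what I think the zip variant should look like (the raw URL, repo names, and the 'datasets' folder inside the archive are all placeholders I made up):

import pathlib
from tensorflow.keras import utils

# Hypothetical raw URL to a .zip archive; extract=True can unpack zip and tar
# archives, but not 7z, as far as I can tell.
data_url = 'https://raw.githubusercontent.com/someUser/someRepo/main/datasets.zip'

dataset_dir = utils.get_file(
    'datasets.zip',
    origin=data_url,
    extract=True,
    cache_dir=None,
    cache_subdir='')
dataset_dir = pathlib.Path(dataset_dir).parent
text_dir = dataset_dir / "datasets"  # assumes the zip contains a 'datasets' folder
list(text_dir.iterdir())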
but the text files still did not read correctly when I checked:
sample_file = text_dir / "file1.txt"
with open(sample_file) as f:
    print(f.read())
The file reads as follows:
<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
<head>
<meta charset="utf-8">
<link rel="dns-prefetch" href="https://github.githubassets.com">
<link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
<link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
<link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
<link rel="preconnect" href="https://github.githubassets.com" crossorigin>
<link rel="preconnect" href="https://avatars.githubusercontent.com">
<link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQ..." rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" />
(...followed by more of GitHub's stylesheet links and page markup.)
I also tried putting the files into my Google Drive (mounted as sketched below), but for some reason I had issues reading the text files from there. In the future, if I am working with larger and more numerous text datasets, I presume I would not be uploading them to Google Drive and then reading them locally from Colab, so I want to learn the proper way to import text files from GitHub or another source.
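For reference, this is how I understand Drive access from Colab; the folder path is a placeholder for my own:

from google.colab import drive
import pathlib

# Mount Google Drive into the Colab filesystem (prompts for authorization)
drive.mount('/content/drive')

# Hypothetical folder of text files inside my Drive
text_dir = pathlib.Path('/content/drive/MyDrive/text_data')
list(text_dir.iterdir())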
The tutorials seem to host their files on Google Cloud Storage, but when I tried uploading my own files there, it was extremely slow.
Is there another method generally used for ML models like this, or for text-based work in general?