Soumik and I are pleased to share a new NLP dataset for multi-label text classification. The dataset consists of paper titles, abstracts, and term categories scraped from arXiv. Find the dataset on Kaggle: arXiv Paper Abstracts | Kaggle.
We are also releasing our data collection pipeline. It is based on Apache Beam and can run at scale on Cloud Dataflow (GCP), so it can be used to collect an even bigger dataset with ease: multi-label-text-classification/beam_arxiv_scrape.ipynb at master · soumik12345/multi-label-text-classification · GitHub.
To help the community get started quickly, we have authored this blog post on keras.io that shows how to build a simple baseline model for a smaller version of the dataset: Large-scale multi-label text classification. Thanks to @fchollet and @mattdangerw for all the help.
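For anyone new to the multi-label setup, here is a minimal sketch of how a paper's term categories can be turned into a multi-hot target vector, which is the kind of target a multi-label classifier trains against. The sample rows and the `title` / `terms` field names are made up for illustration and are not taken from the dataset, and this is plain Python rather than the preprocessing used in the keras.io post.

```python
# Hypothetical rows for illustration only; field names are assumptions.
samples = [
    {"title": "Example paper A", "terms": ["cs.CV", "cs.LG"]},
    {"title": "Example paper B", "terms": ["cs.LG", "stat.ML"]},
    {"title": "Example paper C", "terms": ["cs.CV"]},
]


def build_vocabulary(rows):
    """Collect every distinct term across all rows, in sorted order."""
    return sorted({term for row in rows for term in row["terms"]})


def multi_hot(terms, vocabulary):
    """Return a 0/1 vector with a 1 at each position whose label is in `terms`."""
    present = set(terms)
    return [1 if label in present else 0 for label in vocabulary]


vocabulary = build_vocabulary(samples)
targets = [multi_hot(row["terms"], vocabulary) for row in samples]

for row, target in zip(samples, targets):
    print(row["title"], target)
```

Because a paper can carry several terms at once, each target can have more than one position set to 1, which is why such models typically use a per-label sigmoid output rather than a softmax.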
It would be great to see this dataset included in tfds as well.