A new dataset for multi-label text classification

Sayak_Paul · September 30, 2021, 9:01am

Soumik and I are pleased to share a new NLP dataset for multi-label text classification. The dataset consists of paper titles, abstracts, and term categories scraped from arXiv. Find the dataset on Kaggle: arXiv Paper Abstracts | Kaggle.

We are also releasing our data collection pipeline. It is built on Apache Beam, can run at scale on Cloud Dataflow (GCP), and can be used to collect an even bigger dataset with ease: multi-label-text-classification/beam_arxiv_scrape.ipynb at master · soumik12345/multi-label-text-classification · GitHub.

To help the community get started quickly, we have written a blog post on keras.io that shows how to build a simple baseline model on a smaller version of the dataset: Large-scale multi-label text classification. Thanks to @fchollet and @mattdangerw for all the help.
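
For anyone new to the multi-label setup: each paper can carry several term categories at once, so the targets are usually encoded as multi-hot vectors rather than as a single class index. Below is a minimal sketch of that encoding in plain Python. It is not the code from the blog post, and the records and field names (`title`, `terms`) are made up for illustration rather than taken from the dataset.

```python
# Hypothetical records: field names and values are illustrative only,
# not the actual columns or rows of the dataset.
records = [
    {"title": "Paper A", "terms": ["cs.CV", "cs.LG"]},
    {"title": "Paper B", "terms": ["stat.ML", "cs.LG"]},
    {"title": "Paper C", "terms": ["cs.CV"]},
]


def build_vocabulary(records):
    """Collect the sorted list of all distinct labels seen in the records."""
    return sorted({term for record in records for term in record["terms"]})


def multi_hot(terms, vocabulary):
    """Return a 0/1 list with a 1 at the position of every label present."""
    present = set(terms)
    return [1 if label in present else 0 for label in vocabulary]


vocabulary = build_vocabulary(records)
encoded = [multi_hot(record["terms"], vocabulary) for record in records]

for record, vector in zip(records, encoded):
    print(record["title"], vector)
```

With vectors like these as targets, a baseline classifier would typically use one sigmoid output per label with a binary cross-entropy loss, so that each label is predicted independently.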

It would be great to see this dataset included in TensorFlow Datasets (tfds) as well.

Steven_Nall · December 8, 2021, 6:20pm

Thank you so much for sharing this one with us…
