Add Reuters-21578 dataset

Achyudh Keshav Ram requested to merge arkeshav/Castor-data:master into master

A pre-processor that converts the Reuters-21578 dataset from SGM format to TSV according to the ApteMod train/test splits. It keeps only the documents that belong to at least one category that has at least one document in both the training and the test sets. The resulting dataset has 90 categories, with a training set of 7769 documents and a test set of 3019 documents.
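
The category filter described above can be sketched roughly as follows. This is only an illustration, not the code in this merge request: the `Doc` tuple layout and the function name `filter_by_shared_categories` are hypothetical, and the sketch assumes the SGM files have already been parsed into memory with each document tagged `"train"` or `"test"`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory representation: (split, categories, text),
# where split is "train" or "test". Not the format used by the MR.
Doc = Tuple[str, List[str], str]


def filter_by_shared_categories(docs: List[Doc]) -> Tuple[List[str], List[Doc]]:
    """Keep categories seen in both splits, and documents that retain one."""
    seen: Dict[str, Set[str]] = {"train": set(), "test": set()}
    for split, cats, _ in docs:
        seen[split].update(cats)

    # A category qualifies only if it has a document in both splits.
    shared = sorted(seen["train"] & seen["test"])
    shared_set = set(shared)

    kept: List[Doc] = []
    for split, cats, text in docs:
        cats_kept = [c for c in cats if c in shared_set]
        if cats_kept:  # drop documents left without any qualifying category
            kept.append((split, cats_kept, text))
    return shared, kept
```

The sorted `shared` list gives a stable column order for the label vector, so the same category always maps to the same position.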

Note: The pre-processor currently outputs one-hot label vectors, and I am not sure whether that agrees with the default Castor data format. Since a document in the Reuters dataset can belong to multiple categories, the options are either this or the base-10 conversion of the one-hot vector.
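
To make the two label options concrete, here is a minimal sketch of both encodings. The helper names (`multi_hot`, `to_base10`, `from_base10`) are hypothetical and the three-category list is a toy example; none of this is taken from the pre-processor itself.

```python
from typing import List


def multi_hot(doc_categories: List[str], categories: List[str]) -> str:
    """Encode a document's categories as a bit string, one bit per category."""
    present = set(doc_categories)
    return "".join("1" if c in present else "0" for c in categories)


def to_base10(bits: str) -> int:
    """Interpret the bit string as a binary number and return it as an int."""
    return int(bits, 2)


def from_base10(value: int, width: int) -> str:
    """Recover the bit string; width restores any leading zeros."""
    return format(value, "0{}b".format(width))
```

With the toy category list `["acq", "earn", "grain"]`, a document tagged `earn` and `grain` encodes to `011`, which converts to the integer 3. One consideration when choosing between the two: with 90 categories the base-10 value can need up to 90 bits, which does not fit in a 64-bit integer, so any consumer would have to handle arbitrary-precision integers or keep the label as a string.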
