gobbli.dataset.newsgroups module

class gobbli.dataset.newsgroups.NewsgroupsDataset(*args, **kwargs)

Bases: gobbli.dataset.nested_file.NestedFileDataset

gobbli Dataset for the 20 Newsgroups problem.

http://qwone.com/~jason/20Newsgroups/

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'

TEST_X_FILE = 'test_X.csv'

TEST_Y_FILE = 'test_Y.csv'

TRAIN_X_FILE = 'train_X.csv'

TRAIN_Y_FILE = 'train_y.csv'

X_test()

X_train()

classmethod data_dir()

Return type: Path

download(data_dir)

Download and extract the dataset archive into the given data directory. Return the resulting path.

Return type: Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput

folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type: Tuple[Path, Path]

labels()

Return the set of folder names that should be considered labels in each directory.

Return type: Set[str]
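The folder and label conventions described above suggest a layout of the form <split folder>/<label>/<file>. The following is a minimal sketch of reading such a layout into parallel text and label lists, using only the standard library. The helper name collect_examples, the demo folder names, and the latin-1 encoding are assumptions made for illustration; this is not gobbli's implementation.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def collect_examples(split_dir: Path, labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: walk split_dir/<label>/<file> and return texts and labels."""
    texts: List[str] = []
    targets: List[str] = []
    for label_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        if label_dir.name not in labels:
            continue  # skip folders that are not declared as labels
        for file_path in sorted(p for p in label_dir.iterdir() if p.is_file()):
            # The encoding is an assumption for this sketch.
            texts.append(file_path.read_text(encoding="latin-1"))
            targets.append(label_dir.name)
    return texts, targets


def demo() -> Tuple[List[str], List[str]]:
    # Build a tiny throwaway layout with two label folders and one unrelated folder.
    with tempfile.TemporaryDirectory() as tmp:
        train_dir = Path(tmp) / "train"
        samples = [("sci.space", "orbit"), ("rec.autos", "engine"), ("notes", "skip me")]
        for label, body in samples:
            (train_dir / label).mkdir(parents=True)
            (train_dir / label / "0001").write_text(body, encoding="latin-1")
        return collect_examples(train_dir, {"sci.space", "rec.autos"})


if __name__ == "__main__":
    print(demo())
```

Restricting the walk to a declared label set keeps stray folders in the archive from being treated as classes.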

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput

read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type: str
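To illustrate why a per-dataset read hook can matter, here is a small sketch that decodes a file containing a byte that is not valid UTF-8. The helper name read_text_with_encoding and the choice of latin-1 are assumptions for this example only; the encoding actually used by read_source_file is not stated here.

```python
import tempfile
from pathlib import Path


def read_text_with_encoding(file_path: Path, encoding: str = "latin-1") -> str:
    """Hypothetical helper: decode a source file with a dataset-specific encoding."""
    with open(file_path, "r", encoding=encoding) as f:
        return f.read()


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "message"
        # 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence on its own,
        # so decoding these bytes as UTF-8 would raise UnicodeDecodeError.
        path.write_bytes(b"caf\xe9 prices")
        return read_text_with_encoding(path)


if __name__ == "__main__":
    print(demo())
```

Keeping the decoding step in one overridable method means the rest of the loading logic does not need to know about each dataset's file format.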

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput
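The valid_proportion and split_seed parameters suggest a seeded train/validation split. The sketch below shows one plausible reading of those two parameters in plain Python. The function split_train_valid is hypothetical and stands in for whatever splitting routine train_input actually uses.

```python
import random
from typing import List, Sequence, Tuple


def split_train_valid(
    items: Sequence[str], valid_proportion: float = 0.2, split_seed: int = 1234
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: hold out a seeded random fraction of items for validation."""
    indices = list(range(len(items)))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(split_seed).shuffle(indices)
    n_valid = int(round(len(items) * valid_proportion))
    valid_idx = set(indices[:n_valid])
    train = [x for i, x in enumerate(items) if i not in valid_idx]
    valid = [x for i, x in enumerate(items) if i in valid_idx]
    return train, valid


docs = [f"doc-{i}" for i in range(10)]
train, valid = split_train_valid(docs)
print(len(train), len(valid))
```

Under this reading, calling the function again with the same seed yields the same partition, which is what makes a fixed default such as split_seed=1234 useful for reproducible experiments.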

y_test()

y_train()
