gobbli.dataset.nested_file module

class gobbli.dataset.nested_file.NestedFileDataset(*args, **kwargs)

Bases: gobbli.dataset.base.BaseDataset

A dataset downloaded as an archive from a URL. Once extracted, the archive has the following directory structure:

<train_folder>/
    <label1>/
        data1
        data2
    <label2>/
        data1
        data2

<test_folder>/
    <label1>/
        data1
        data2
    <label2>/
        data1
        data2
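As a rough illustration of how a layout like this maps to texts and labels, the following standalone sketch walks one such folder using only the standard library. The function name `read_labeled_folder` and its parameters are hypothetical and not part of gobbli; the class itself may read the files differently.

```python
from pathlib import Path
from typing import List, Set, Tuple


def read_labeled_folder(
    folder: Path, labels: Set[str], encoding: str = "utf-8"
) -> Tuple[List[str], List[str]]:
    """Collect (texts, labels) from a folder laid out as <label>/<file>.

    Hypothetical helper for illustration; not gobbli's implementation.
    """
    texts: List[str] = []
    targets: List[str] = []
    # Sort directories and files so the output order is deterministic.
    for label_dir in sorted(p for p in folder.iterdir() if p.is_dir()):
        if label_dir.name not in labels:
            continue  # skip folders that are not declared as labels
        for file_path in sorted(p for p in label_dir.iterdir() if p.is_file()):
            texts.append(file_path.read_text(encoding=encoding))
            targets.append(label_dir.name)
    return texts, targets
```

Calling it once on the train folder and once on the test folder yields two parallel pairs of lists, one document per entry, with the folder name as the label.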

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'

TEST_X_FILE = 'test_X.csv'

TEST_Y_FILE = 'test_Y.csv'

TRAIN_X_FILE = 'train_X.csv'

TRAIN_Y_FILE = 'train_y.csv'

X_test()

X_train()

classmethod data_dir()

Return type: Path

abstract download(data_dir)

Download and extract the dataset archive into the given data dir. Return the resulting path.

Return type: Path
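A subclass has to supply the download step itself. The sketch below shows one way such a step could be written with the standard library, assuming the archive is a tar file (optionally compressed). The function name `download_and_extract`, the `url` parameter, and the local file name `archive.tar` are assumptions for illustration; the real method receives only `data_dir`.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, data_dir: Path) -> Path:
    """Fetch a tar archive from `url`, unpack it into `data_dir`, return the path.

    Hypothetical helper; a real subclass would hard-code or look up its URL.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    archive_path = data_dir / "archive.tar"  # assumed local file name

    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # "r:*" lets tarfile detect gzip/bz2/xz compression transparently.
    with tarfile.open(archive_path, "r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (e.g. path traversal).
            archive.extractall(data_dir, filter="data")
        else:
            archive.extractall(data_dir)
    return data_dir
```

Returning the directory that contains the extracted train and test folders matches the documented contract of returning the resulting path.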

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput

abstract folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type: Tuple[Path, Path]

abstract labels()

Return the set of folder names that should be considered labels in each directory.

Return type: Set[str]

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput

abstract read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type: str
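To make the encoding concern concrete, here is a minimal sketch of a reader for a dataset whose files are not UTF-8. It is written as a free function; the `encoding` parameter and the `latin-1` default are assumptions for illustration, since the real method takes only `file_path` and each subclass would fix its own encoding.

```python
from pathlib import Path


def read_source_file(file_path: Path, encoding: str = "latin-1") -> str:
    """Read one source document as text using a dataset-specific encoding.

    Hypothetical sketch; the encoding default is an assumption.
    """
    # errors="replace" substitutes undecodable bytes instead of raising,
    # so a single malformed file does not abort loading the whole dataset.
    text = file_path.read_text(encoding=encoding, errors="replace")
    # Trim leading/trailing whitespace such as a trailing newline.
    return text.strip()
```

Other per-dataset differences, such as headers that need stripping, could be handled in the same place.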

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput

y_test()

y_train()
