gobbli.dataset package

Submodules

Module contents

class gobbli.dataset.TrivialDataset(*args, **kwargs)

Bases: gobbli.dataset.base.BaseDataset

gobbli Dataset containing only a few observations. Useful for verifying a model runs without waiting for an actual dataset to process.

Blank constructor needed to satisfy mypy

DATASET = ['This is positive.', 'This, although, is negative.']

LABELS = ['1', '0']

X_test()

X_train()

classmethod data_dir()

Return type: Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput
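The default `pooling` value, `EmbedPooling.MEAN`, suggests that per-token embeddings are averaged into one vector per document. The following is a minimal sketch of mean pooling in plain Python, independent of gobbli; the function name `mean_pool` and the toy vectors are illustrative assumptions, not part of the library.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a document's token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty document")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    # Average each coordinate across all tokens.
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Three hypothetical 2-dimensional token embeddings for one document.
doc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(doc)
print(pooled)  # [3.0, 4.0]
```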

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput
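The parameter names `predict_batch_size` and `limit` read as a batch size and a cap on the number of observations. The sketch below shows that interpretation with a standalone helper; `batches` is a hypothetical name, and how gobbli applies these parameters internally is not stated on this page.

```python
from typing import Iterator, List, Optional, Sequence


def batches(
    texts: Sequence[str], batch_size: int = 32, limit: Optional[int] = None
) -> Iterator[List[str]]:
    """Yield consecutive batches, optionally using only the first `limit` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = list(texts) if limit is None else list(texts[:limit])
    for start in range(0, len(selected), batch_size):
        yield selected[start : start + batch_size]


texts = [f"document {i}" for i in range(100)]
sizes = [len(batch) for batch in batches(texts, batch_size=32, limit=70)]
print(sizes)  # [32, 32, 6]
```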

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput
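The defaults suggest that `valid_proportion=0.2` of the training data is held out for validation and that `split_seed` makes the split repeatable. The following is a minimal sketch of a seeded split using only the standard library; `train_valid_split` is a hypothetical helper, and gobbli's own splitting logic may differ.

```python
import random
from typing import List, Sequence, Tuple


def train_valid_split(
    X: Sequence[str],
    y: Sequence[str],
    valid_proportion: float = 0.2,
    split_seed: int = 1234,
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Return (X_train, X_valid, y_train, y_valid) from a seeded shuffle."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    if not 0.0 <= valid_proportion <= 1.0:
        raise ValueError("valid_proportion must be between 0 and 1")
    indices = list(range(len(X)))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(split_seed).shuffle(indices)
    n_valid = int(round(len(X) * valid_proportion))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return (
        [X[i] for i in train_idx],
        [X[i] for i in valid_idx],
        [y[i] for i in train_idx],
        [y[i] for i in valid_idx],
    )


X = [f"text {i}" for i in range(10)]
y = [str(i % 2) for i in range(10)]
X_train, X_valid, y_train, y_valid = train_valid_split(X, y)
print(len(X_train), len(X_valid))  # 8 2
```

Calling the helper again with the same seed returns the same partition, which is the property a fixed `split_seed` is meant to provide.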

y_test()

y_train()

class gobbli.dataset.NewsgroupsDataset(*args, **kwargs)

Bases: gobbli.dataset.nested_file.NestedFileDataset

gobbli Dataset for the 20 Newsgroups problem.

http://qwone.com/~jason/20Newsgroups/

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'

TEST_X_FILE = 'test_X.csv'

TEST_Y_FILE = 'test_Y.csv'

TRAIN_X_FILE = 'train_X.csv'

TRAIN_Y_FILE = 'train_y.csv'

X_test()

X_train()

classmethod data_dir()

Return type: Path

download(data_dir)

Download and extract the dataset archive into the given data dir. Return the resulting path.

Return type: Path
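The extract-and-return-path half of this contract can be sketched with the standard library alone. The example below builds a small local archive instead of downloading one, so it runs offline; `extract_archive` and the file names are hypothetical and do not reflect how gobbli fetches or unpacks its archives.

```python
import shutil
import tempfile
from pathlib import Path


def extract_archive(archive_path: Path, data_dir: Path) -> Path:
    """Extract `archive_path` into `data_dir` and return `data_dir`."""
    data_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(data_dir))
    return data_dir


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Create a tiny stand-in for a downloaded dataset archive.
    source = tmp_path / "source"
    (source / "train").mkdir(parents=True)
    (source / "train" / "0.txt").write_text("hello", encoding="utf-8")
    archive = Path(
        shutil.make_archive(str(tmp_path / "dataset"), "gztar", root_dir=str(source))
    )
    out = extract_archive(archive, tmp_path / "data")
    extracted = (out / "train" / "0.txt").read_text(encoding="utf-8")

print(extracted)  # hello
```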

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput

folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type: Tuple[Path, Path]

labels()

Return the set of folder names that should be considered labels in each directory.

Return type: Set[str]
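Taken together, `folders()` and `labels()` describe a layout in which each label is a subdirectory of the train or test folder and each file inside it is one observation. The following sketch reads such a layout under that assumption; `read_labeled_folder` and the demo directory names are hypothetical, and directories whose names are not in the label set are skipped.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def read_labeled_folder(folder: Path, labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Collect (texts, labels) from subdirectories whose names are in `labels`."""
    X: List[str] = []
    y: List[str] = []
    for label_dir in sorted(p for p in folder.iterdir() if p.is_dir()):
        if label_dir.name not in labels:
            continue  # not a label folder; ignore it
        for file_path in sorted(p for p in label_dir.iterdir() if p.is_file()):
            X.append(file_path.read_text(encoding="utf-8"))
            y.append(label_dir.name)
    return X, y


# Build a tiny demo tree: two label folders plus one folder that is not a label.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    for name, text in [("pos", "great"), ("neg", "awful"), ("extra", "ignored")]:
        (train / name).mkdir(parents=True)
        (train / name / "0.txt").write_text(text, encoding="utf-8")
    X, y = read_labeled_folder(train, {"pos", "neg"})

print(X, y)  # ['awful', 'great'] ['neg', 'pos']
```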

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput

read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type: str
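Per-dataset encodings matter because a source file may not decode as UTF-8. One possible shape for such a reader, sketched as a standalone function, is to try UTF-8 first and fall back to Latin-1, which can decode any byte sequence. The fallback choice and the name `read_text_with_fallback` are assumptions for illustration and are not taken from gobbli's implementation.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(file_path: Path) -> str:
    """Decode a file as UTF-8, falling back to Latin-1 on failure."""
    raw = file_path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "post.txt"
    # The single byte 0xE9 is valid Latin-1 but not valid UTF-8.
    path.write_bytes("café".encode("latin-1"))
    text = read_text_with_fallback(path)

print(text)  # café
```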

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput

y_test()

y_train()

class gobbli.dataset.IMDBDataset(*args, **kwargs)

Bases: gobbli.dataset.nested_file.NestedFileDataset

gobbli Dataset for the IMDB sentiment analysis problem.

https://ai.stanford.edu/~amaas/data/sentiment/

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'

TEST_X_FILE = 'test_X.csv'

TEST_Y_FILE = 'test_Y.csv'

TRAIN_X_FILE = 'train_X.csv'

TRAIN_Y_FILE = 'train_y.csv'

X_test()

X_train()

classmethod data_dir()

Return type: Path

download(data_dir)

Download and extract the dataset archive into the given data dir. Return the resulting path.

Return type: Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput

folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type: Tuple[Path, Path]

labels()

Return the set of folder names that should be considered labels in each directory.

Return type: Set[str]

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput

read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type: str

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput

y_test()

y_train()

class gobbli.dataset.MovieSummaryDataset(*args, **kwargs)

Bases: gobbli.dataset.base.BaseDataset

gobbli Dataset for the CMU Movie Summary dataset, framed as a multilabel classification problem predicting movie genres from plot summaries.

http://www.cs.cmu.edu/~ark/personas/

Blank constructor needed to satisfy mypy

METADATA_FILE = 'MovieSummaries/movie.metadata.tsv'

PLOT_SUMMARIES_FILE = 'MovieSummaries/plot_summaries.txt'

TRAIN_PCT = 0.8
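In a multilabel problem, each observation carries a list of labels rather than a single one, and `TRAIN_PCT = 0.8` suggests that 80% of the observations go to the training set. The sketch below illustrates both ideas on made-up data; the sample summaries, the genre names, the helper `split_by_pct`, and the cut-by-position split are all assumptions, not the dataset's contents or gobbli's actual split.

```python
from typing import List, Sequence, Tuple

TRAIN_PCT = 0.8  # same value as the class attribute documented above


def split_by_pct(
    X: Sequence[str], y: Sequence[List[str]], train_pct: float = TRAIN_PCT
) -> Tuple[List[str], List[List[str]], List[str], List[List[str]]]:
    """Return (X_train, y_train, X_test, y_test), cutting at `train_pct`."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    cut = int(len(X) * train_pct)
    return list(X[:cut]), list(y[:cut]), list(X[cut:]), list(y[cut:])


# Hypothetical plot summaries, each paired with a list of genres.
summaries = ["plot a", "plot b", "plot c", "plot d", "plot e"]
genres = [["Drama"], ["Comedy", "Romance"], ["Action"], ["Drama", "Thriller"], ["Comedy"]]

X_train, y_train, X_test, y_test = split_by_pct(summaries, genres)
print(len(X_train), len(X_test))  # 4 1
print(y_train[1])  # ['Comedy', 'Romance']
```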

X_test()

X_train()

classmethod data_dir()

Return type: Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)

Return type: EmbedInput

classmethod load(*args, **kwargs)

Return type: BaseDataset

predict_input(predict_batch_size=32, limit=None)

Return type: PredictInput

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)

Return type: TrainInput

y_test()

y_train()
