gobbli.dataset package

Module contents

class gobbli.dataset.TrivialDataset(*args, **kwargs)

Bases: gobbli.dataset.base.BaseDataset

gobbli Dataset containing only a few observations. Useful for verifying that a model runs without waiting for a real dataset to process.

Blank constructor needed to satisfy mypy

DATASET = ['This is positive.', 'This, although, is negative.']
LABELS = ['1', '0']
X_test()
X_train()
classmethod data_dir()
Return type

Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
Return type

EmbedInput
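
The default pooling value, EmbedPooling.MEAN, suggests that token-level embeddings are averaged into one vector per document. The sketch below illustrates that idea in plain Python. It is not gobbli code: mean_pool is a hypothetical helper written for this example, and the real models may pool differently.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a document's token vectors into a single document vector.

    Hypothetical helper for illustration only; not part of gobbli.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three tokens, each with a 2-dimensional embedding.
doc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(doc)
```

Whatever the number of tokens, the result has the same dimensionality as a single token vector, which makes documents of different lengths comparable.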

classmethod load(*args, **kwargs)
Return type

BaseDataset

predict_input(predict_batch_size=32, limit=None)
Return type

PredictInput
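
The predict_batch_size and limit parameters control how many observations are used and how they are grouped. As a rough sketch, assuming limit keeps the first N observations and batches are contiguous slices (the source does not confirm either detail), the behavior might look like this. The batched function is a hypothetical helper, not a gobbli API.

```python
from typing import Iterator, List, Optional


def batched(
    texts: List[str], batch_size: int, limit: Optional[int] = None
) -> Iterator[List[str]]:
    """Yield contiguous batches of at most batch_size texts.

    Hypothetical helper; assumes limit truncates to the first N items.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = texts if limit is None else texts[:limit]
    for start in range(0, len(selected), batch_size):
        yield selected[start:start + batch_size]


texts = [f"document {i}" for i in range(10)]
batches = list(batched(texts, batch_size=4, limit=9))
```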

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
Return type

TrainInput
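
The valid_proportion and split_seed parameters imply a reproducible train/validation split. A minimal sketch of such a split using only the standard library follows. It assumes a shuffle-then-slice strategy; split_train_valid is a hypothetical helper, and gobbli may split its data by other means.

```python
import random
from typing import List, Sequence, Tuple


def split_train_valid(
    X: Sequence[str],
    y: Sequence[str],
    valid_proportion: float = 0.2,
    split_seed: int = 1234,
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Return (X_train, y_train, X_valid, y_valid) using a seeded shuffle.

    Hypothetical helper for illustration; not part of gobbli.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(split_seed).shuffle(indices)
    n_valid = int(round(len(indices) * valid_proportion))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return (
        [X[i] for i in train_idx],
        [y[i] for i in train_idx],
        [X[i] for i in valid_idx],
        [y[i] for i in valid_idx],
    )


X = [f"text {i}" for i in range(10)]
y = [str(i % 2) for i in range(10)]
X_tr, y_tr, X_va, y_va = split_train_valid(X, y)
```

Calling the function again with the same seed yields the same partition, which is the point of exposing a seed as a parameter.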

y_test()
y_train()
class gobbli.dataset.NewsgroupsDataset(*args, **kwargs)

Bases: gobbli.dataset.nested_file.NestedFileDataset

gobbli Dataset for the 20 Newsgroups problem.

http://qwone.com/~jason/20Newsgroups/

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'
TEST_X_FILE = 'test_X.csv'
TEST_Y_FILE = 'test_Y.csv'
TRAIN_X_FILE = 'train_X.csv'
TRAIN_Y_FILE = 'train_y.csv'
X_test()
X_train()
classmethod data_dir()
Return type

Path

download(data_dir)

Download and extract the dataset archive into the given data dir. Return the resulting path.

Return type

Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
Return type

EmbedInput

folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type

Tuple[Path, Path]

labels()

Return the set of folder names that should be considered labels in each directory.

Return type

Set[str]

classmethod load(*args, **kwargs)
Return type

BaseDataset

predict_input(predict_batch_size=32, limit=None)
Return type

PredictInput

read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type

str
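
Taken together, folders(), labels(), and read_source_file() describe a layout in which each label is a folder of text files. The sketch below shows one way such a layout could be read into parallel lists of texts and labels. The directory names, the read_nested helper, and the UTF-8 encoding are assumptions made for this example; they are not taken from gobbli's implementation.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def read_source_file(file_path: Path) -> str:
    """Read one document. The encoding here is an assumption for the sketch."""
    return file_path.read_text(encoding="utf-8")


def read_nested(folder: Path, labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Collect (texts, labels) from folder/<label>/<file>.

    Hypothetical helper; only folders named in labels are visited.
    """
    X: List[str] = []
    y: List[str] = []
    for label in sorted(labels):
        for file_path in sorted((folder / label).iterdir()):
            X.append(read_source_file(file_path))
            y.append(label)
    return X, y


# Build a tiny throwaway directory tree to exercise the reader.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    for label, text in [("pos", "Great film."), ("neg", "Dull film.")]:
        (train / label).mkdir(parents=True)
        (train / label / "0.txt").write_text(text, encoding="utf-8")
    X_train, y_train = read_nested(train, {"pos", "neg"})
```

Keeping the per-file read in its own function mirrors the stated purpose of read_source_file: a subclass can change the encoding or strip format-specific headers without touching the directory walk.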

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
Return type

TrainInput

y_test()
y_train()
class gobbli.dataset.IMDBDataset(*args, **kwargs)

Bases: gobbli.dataset.nested_file.NestedFileDataset

gobbli Dataset for the IMDB sentiment analysis problem.

https://ai.stanford.edu/~amaas/data/sentiment/

Blank constructor needed to satisfy mypy

DELIMITER = '\x00'
TEST_X_FILE = 'test_X.csv'
TEST_Y_FILE = 'test_Y.csv'
TRAIN_X_FILE = 'train_X.csv'
TRAIN_Y_FILE = 'train_y.csv'
X_test()
X_train()
classmethod data_dir()
Return type

Path

download(data_dir)

Download and extract the dataset archive into the given data dir. Return the resulting path.

Return type

Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
Return type

EmbedInput

folders()

Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.

Return type

Tuple[Path, Path]

labels()

Return the set of folder names that should be considered labels in each directory.

Return type

Set[str]

classmethod load(*args, **kwargs)
Return type

BaseDataset

predict_input(predict_batch_size=32, limit=None)
Return type

PredictInput

read_source_file(file_path)

Read the text from a source file. Used to account for per-dataset encodings and other format differences.

Return type

str

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
Return type

TrainInput

y_test()
y_train()
class gobbli.dataset.MovieSummaryDataset(*args, **kwargs)

Bases: gobbli.dataset.base.BaseDataset

gobbli Dataset for the CMU Movie Summary dataset, framed as a multilabel classification problem predicting movie genres from plot summaries.

http://www.cs.cmu.edu/~ark/personas/

Blank constructor needed to satisfy mypy

METADATA_FILE = 'MovieSummaries/movie.metadata.tsv'
PLOT_SUMMARIES_FILE = 'MovieSummaries/plot_summaries.txt'
TRAIN_PCT = 0.8
X_test()
X_train()
classmethod data_dir()
Return type

Path

embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
Return type

EmbedInput

classmethod load(*args, **kwargs)
Return type

BaseDataset

predict_input(predict_batch_size=32, limit=None)
Return type

PredictInput

train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
Return type

TrainInput

y_test()
y_train()
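
In a multilabel problem such as MovieSummaryDataset, each observation carries a list of labels rather than a single one, and TRAIN_PCT = 0.8 indicates that 80% of the observations are used for training. The sketch below illustrates both points under the assumption of a simple positional split. The split_by_pct helper and the sample data are hypothetical, and gobbli's actual split logic may differ.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

TRAIN_PCT = 0.8  # same value as the class constant documented above


def split_by_pct(
    items: Sequence[T], train_pct: float = TRAIN_PCT
) -> Tuple[List[T], List[T]]:
    """Split items positionally: the first train_pct go to train, the rest to test.

    Hypothetical helper for illustration; not part of gobbli.
    """
    cutoff = math.floor(len(items) * train_pct)
    return list(items[:cutoff]), list(items[cutoff:])


# Made-up sample data: one list of genres per plot summary.
summaries = [f"plot summary {i}" for i in range(5)]
genres = [["Drama"], ["Comedy", "Romance"], ["Thriller"], ["Drama", "War"], ["Horror"]]

X_train, X_test = split_by_pct(summaries)
y_train, y_test = split_by_pct(genres)
```

Applying the same cutoff to both sequences keeps each summary aligned with its genre list.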