gobbli.dataset package
Submodules
Module contents
-
class gobbli.dataset.TrivialDataset(*args, **kwargs)
Bases: gobbli.dataset.base.BaseDataset
gobbli Dataset containing only a few observations. Useful for verifying a model runs without waiting for an actual dataset to process.
Blank constructor needed to satisfy mypy
-
DATASET = ['This is positive.', 'This, although, is negative.']
-
LABELS = ['1', '0']
-
classmethod data_dir()
- Return type
Path
-
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
- Return type
-
classmethod load(*args, **kwargs)
- Return type
-
predict_input(predict_batch_size=32, limit=None)
- Return type
-
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
- Return type
-
-
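The train_input signature above takes valid_proportion, split_seed and limit. As a rough illustration of how such parameters could interact, the following standalone sketch holds out a validation share with a seeded shuffle. The helper split_train_valid is hypothetical and is not part of gobbli; the order of truncation and shuffling is an assumption, and only the DATASET and LABELS values come from the listing above.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Same values as the DATASET and LABELS attributes documented above.
DATASET = ["This is positive.", "This, although, is negative."]
LABELS = ["1", "0"]


def split_train_valid(
    texts: Sequence[str],
    labels: Sequence[str],
    valid_proportion: float = 0.2,
    split_seed: int = 1234,
    limit: Optional[int] = None,
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Hypothetical helper: seeded shuffle, then hold out a validation share."""
    pairs = list(zip(texts, labels))
    if limit is not None:
        pairs = pairs[:limit]  # assumption: limit keeps the first observations
    rng = random.Random(split_seed)  # a fixed seed makes the split reproducible
    rng.shuffle(pairs)
    n_valid = int(len(pairs) * valid_proportion)
    valid, train = pairs[:n_valid], pairs[n_valid:]
    X_train = [text for text, _ in train]
    y_train = [label for _, label in train]
    X_valid = [text for text, _ in valid]
    y_valid = [label for _, label in valid]
    return X_train, y_train, X_valid, y_valid


# Repeat the two observations so that a 20% hold-out is not empty.
X_train, y_train, X_valid, y_valid = split_train_valid(DATASET * 5, LABELS * 5)
```

Calling the helper twice with the same split_seed yields the same partition, which is the usual reason for exposing a seed parameter.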
class gobbli.dataset.NewsgroupsDataset(*args, **kwargs)
Bases: gobbli.dataset.nested_file.NestedFileDataset
gobbli Dataset for the 20 Newsgroups problem.
http://qwone.com/~jason/20Newsgroups/
Blank constructor needed to satisfy mypy
-
DELIMITER = '\x00'
-
TEST_X_FILE = 'test_X.csv'
-
TEST_Y_FILE = 'test_Y.csv'
-
TRAIN_X_FILE = 'train_X.csv'
-
TRAIN_Y_FILE = 'train_y.csv'
-
X_test()
-
X_train()
-
classmethod data_dir()
- Return type
Path
-
download(data_dir)
Download and extract the dataset archive into the given data dir. Return the resulting path.
- Return type
Path
-
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
- Return type
-
folders()
Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.
- Return type
Tuple[Path, Path]
-
labels()
Return the set of folder names that should be considered labels in each directory.
- Return type
Set[str]
-
classmethod load(*args, **kwargs)
- Return type
-
predict_input(predict_batch_size=32, limit=None)
- Return type
-
read_source_file(file_path)
Read the text from a source file. Used to account for per-dataset encodings and other format differences.
- Return type
str
-
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
- Return type
-
y_test()
-
y_train()
-
-
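The folders(), labels() and read_source_file() descriptions above suggest a layout in which each label is a subfolder holding one file per document. The sketch below walks such a tree using only the standard library. It is an illustration under assumptions, not gobbli's implementation: read_nested_folder and build_demo_tree are hypothetical helpers, the label names are made up, and the UTF-8 encoding is a guess that a real dataset may need to override.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def read_source_file(file_path: Path) -> str:
    """Read one document. UTF-8 with replacement is an assumption of this sketch."""
    return file_path.read_text(encoding="utf-8", errors="replace")


def read_nested_folder(folder: Path, labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Collect texts and labels from folder/<label>/<file>, ignoring unlisted folders."""
    texts: List[str] = []
    targets: List[str] = []
    for label_dir in sorted(p for p in folder.iterdir() if p.is_dir()):
        if label_dir.name not in labels:
            continue  # folder name is not in the label set, so skip it
        for file_path in sorted(p for p in label_dir.iterdir() if p.is_file()):
            texts.append(read_source_file(file_path))
            targets.append(label_dir.name)
    return texts, targets


def build_demo_tree(root: Path) -> Path:
    """Write a tiny, made-up train folder so the sketch can run anywhere."""
    train = root / "train"
    docs = {
        "label_a": ["first doc"],
        "label_b": ["second doc", "third doc"],
        "ignored": ["not a labeled document"],
    }
    for label, contents in docs.items():
        (train / label).mkdir(parents=True)
        for i, content in enumerate(contents):
            (train / label / f"{i}.txt").write_text(content, encoding="utf-8")
    return train


with tempfile.TemporaryDirectory() as tmp:
    train_dir = build_demo_tree(Path(tmp))
    X_train, y_train = read_nested_folder(train_dir, {"label_a", "label_b"})
```

Keeping the file reading in its own function mirrors the documented read_source_file hook: a dataset with a different encoding only has to replace that one step.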
class gobbli.dataset.IMDBDataset(*args, **kwargs)
Bases: gobbli.dataset.nested_file.NestedFileDataset
gobbli Dataset for the IMDB sentiment analysis problem.
https://ai.stanford.edu/~amaas/data/sentiment/
Blank constructor needed to satisfy mypy
-
DELIMITER = '\x00'
-
TEST_X_FILE = 'test_X.csv'
-
TEST_Y_FILE = 'test_Y.csv'
-
TRAIN_X_FILE = 'train_X.csv'
-
TRAIN_Y_FILE = 'train_y.csv'
-
X_test()
-
X_train()
-
classmethod data_dir()
- Return type
Path
-
download(data_dir)
Download and extract the dataset archive into the given data dir. Return the resulting path.
- Return type
Path
-
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
- Return type
-
folders()
Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.
- Return type
Tuple[Path, Path]
-
labels()
Return the set of folder names that should be considered labels in each directory.
- Return type
Set[str]
-
classmethod load(*args, **kwargs)
- Return type
-
predict_input(predict_batch_size=32, limit=None)
- Return type
-
read_source_file(file_path)
Read the text from a source file. Used to account for per-dataset encodings and other format differences.
- Return type
str
-
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
- Return type
-
y_test()
-
y_train()
-
-
class gobbli.dataset.MovieSummaryDataset(*args, **kwargs)
Bases: gobbli.dataset.base.BaseDataset
gobbli Dataset for the CMU Movie Summary dataset, framed as a multilabel classification problem predicting movie genres from plot summaries.
http://www.cs.cmu.edu/~ark/personas/
Blank constructor needed to satisfy mypy
-
METADATA_FILE = 'MovieSummaries/movie.metadata.tsv'
-
PLOT_SUMMARIES_FILE = 'MovieSummaries/plot_summaries.txt'
-
TRAIN_PCT = 0.8
-
classmethod data_dir()
- Return type
Path
-
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
- Return type
-
classmethod load(*args, **kwargs)
- Return type
-
predict_input(predict_batch_size=32, limit=None)
- Return type
-
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
- Return type
-
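MovieSummaryDataset is described above as a multilabel problem with TRAIN_PCT = 0.8. To make that framing concrete, the sketch below shows one common way to represent multilabel targets as indicator rows and to carve off a training share. The summaries and genres are invented, binarize and split_by_pct are hypothetical helpers, and a sequential (unshuffled) split is assumed purely for simplicity. How gobbli itself partitions the data is not stated in this listing.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

TRAIN_PCT = 0.8  # same value as the documented class attribute

# Made-up plot summaries, each with one or more genres (multilabel targets).
summaries = ["plot one", "plot two", "plot three", "plot four", "plot five"]
genres = [
    ["Drama"],
    ["Comedy", "Romance"],
    ["Drama", "Thriller"],
    ["Comedy"],
    ["Thriller"],
]


def binarize(label_lists: Sequence[Sequence[str]], classes: Sequence[str]) -> List[List[int]]:
    """Turn each list of labels into a 0/1 indicator row over `classes`."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]


def split_by_pct(items: Sequence[T], train_pct: float = TRAIN_PCT) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: the first train_pct share is train, the rest is test."""
    n_train = int(len(items) * train_pct)
    return list(items[:n_train]), list(items[n_train:])


classes = sorted({g for label_list in genres for g in label_list})
y = binarize(genres, classes)
X_train, X_test = split_by_pct(summaries)
y_train, y_test = split_by_pct(y)
```

Unlike single-label classification, a row here may contain several ones, so a model is expected to score each genre independently.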