gobbli.dataset package
Submodules
Module contents
class gobbli.dataset.TrivialDataset(*args, **kwargs)
Bases: gobbli.dataset.base.BaseDataset
gobbli Dataset containing only a few observations. Useful for verifying a model runs without waiting for an actual dataset to process.
Blank constructor needed to satisfy mypy
DATASET = ['This is positive.', 'This, although, is negative.']
LABELS = ['1', '0']
classmethod data_dir()
Return type: Path
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
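The default `pooling=<EmbedPooling.MEAN: 'mean'>` suggests that per-token embeddings are averaged into one vector per document. The following is a minimal sketch of that idea in plain Python, not gobbli's implementation; the function name `mean_pool` and the sample vectors are assumptions made for illustration.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors element-wise into a single document vector.

    Hypothetical helper for illustration; not part of gobbli.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must share one dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three tokens, each embedded in two dimensions (made-up numbers).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
print(pooled)
```

Mean pooling yields a fixed-size vector regardless of document length, which is convenient when documents feed a downstream classifier or a similarity search.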
classmethod load(*args, **kwargs)
predict_input(predict_batch_size=32, limit=None)
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
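The `valid_proportion`, `split_seed`, and `limit` parameters of `train_input` point to a seeded train/validation split over an optionally truncated dataset. The sketch below shows one way such a split could behave, assuming `limit` truncates before splitting; it is a standard-library stand-in, and `split_train_valid` is a hypothetical name rather than a gobbli function.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_train_valid(
    items: Sequence[str],
    valid_proportion: float = 0.2,
    split_seed: int = 1234,
    limit: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Return (train, valid) lists from a seeded shuffle of `items`.

    Hypothetical helper; assumes `limit` keeps only the first `limit` rows.
    """
    rows = list(items if limit is None else items[:limit])
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(split_seed).shuffle(rows)
    n_valid = int(round(len(rows) * valid_proportion))
    return rows[n_valid:], rows[:n_valid]


texts = [f"doc {i}" for i in range(10)]
train, valid = split_train_valid(texts)
train_again, valid_again = split_train_valid(texts)
small_train, small_valid = split_train_valid(texts, limit=5)
print(len(train), len(valid))
```

Fixing the seed means repeated calls with the same arguments produce the same partition, which keeps experiments comparable across runs.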
class gobbli.dataset.NewsgroupsDataset(*args, **kwargs)
Bases: gobbli.dataset.nested_file.NestedFileDataset
gobbli Dataset for the 20 Newsgroups problem.
http://qwone.com/~jason/20Newsgroups/
Blank constructor needed to satisfy mypy
DELIMITER = '\x00'
TEST_X_FILE = 'test_X.csv'
TEST_Y_FILE = 'test_Y.csv'
TRAIN_X_FILE = 'train_X.csv'
TRAIN_Y_FILE = 'train_y.csv'
X_test()
X_train()
classmethod data_dir()
Return type: Path
download(data_dir)
Download and extract the dataset archive into the given data dir. Return the resulting path.
Return type: Path
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
folders()
Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.
Return type: Tuple[Path, Path]
labels()
Return the set of folder names that should be considered labels in each directory.
Return type: Set[str]
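`folders()` and `labels()` together describe a nested layout in which each label is a subfolder of the train or test folder and each document is a file inside it. The sketch below walks such a layout using only the standard library. The helper `read_labeled_folder` and the demo folder names (`pos`, `neg`, `unsup`) are assumptions for illustration and do not reproduce gobbli's code.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def read_labeled_folder(folder: Path, labels: Set[str]) -> Tuple[List[str], List[str]]:
    """Collect (texts, targets) from subfolders whose names appear in `labels`.

    Hypothetical helper; not part of gobbli.
    """
    texts: List[str] = []
    targets: List[str] = []
    for label_dir in sorted(folder.iterdir()):
        # Skip stray files and any folder that is not a recognized label.
        if not label_dir.is_dir() or label_dir.name not in labels:
            continue
        for file_path in sorted(label_dir.iterdir()):
            if file_path.is_file():
                texts.append(file_path.read_text(encoding="utf-8"))
                targets.append(label_dir.name)
    return texts, targets


# Build a throwaway layout: train/{pos,neg,unsup}/0.txt
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for label, text in [("pos", "good"), ("neg", "bad"), ("unsup", "unknown")]:
        (train_dir / label).mkdir(parents=True)
        (train_dir / label / "0.txt").write_text(text, encoding="utf-8")
    demo_texts, demo_targets = read_labeled_folder(train_dir, {"pos", "neg"})

print(demo_texts, demo_targets)
```

Passing the label set explicitly lets the reader ignore folders that sit beside the labeled ones but should not become classes.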
classmethod load(*args, **kwargs)
predict_input(predict_batch_size=32, limit=None)
read_source_file(file_path)
Read the text from a source file. Used to account for per-dataset encodings and other format differences.
Return type: str
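Because `read_source_file` exists to absorb per-dataset encodings, a subclass might try a preferred encoding and fall back to another one. The following is a minimal sketch under that assumption; the helper name `read_text_with_fallback` and the encoding order are illustrative choices, not what gobbli does for any particular dataset.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def read_text_with_fallback(
    file_path: Path, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> str:
    """Decode a file with the first encoding in `encodings` that succeeds.

    Hypothetical helper; the default encoding order is an assumption.
    """
    raw = file_path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {file_path} with any of {list(encodings)}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    # b"caf\xe9" is "café" in Latin-1 but is not valid UTF-8.
    sample.write_bytes(b"caf\xe9")
    decoded = read_text_with_fallback(sample)

print(decoded)
```

Reading the bytes once and decoding in memory avoids reopening the file for each candidate encoding.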
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
y_test()
y_train()
class gobbli.dataset.IMDBDataset(*args, **kwargs)
Bases: gobbli.dataset.nested_file.NestedFileDataset
gobbli Dataset for the IMDB sentiment analysis problem.
https://ai.stanford.edu/~amaas/data/sentiment/
Blank constructor needed to satisfy mypy
DELIMITER = '\x00'
TEST_X_FILE = 'test_X.csv'
TEST_Y_FILE = 'test_Y.csv'
TRAIN_X_FILE = 'train_X.csv'
TRAIN_Y_FILE = 'train_y.csv'
X_test()
X_train()
classmethod data_dir()
Return type: Path
download(data_dir)
Download and extract the dataset archive into the given data dir. Return the resulting path.
Return type: Path
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
folders()
Return relative paths to the train and test folders, respectively, from the top level of the extracted archive.
Return type: Tuple[Path, Path]
labels()
Return the set of folder names that should be considered labels in each directory.
Return type: Set[str]
classmethod load(*args, **kwargs)
predict_input(predict_batch_size=32, limit=None)
read_source_file(file_path)
Read the text from a source file. Used to account for per-dataset encodings and other format differences.
Return type: str
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)
y_test()
y_train()
class gobbli.dataset.MovieSummaryDataset(*args, **kwargs)
Bases: gobbli.dataset.base.BaseDataset
gobbli Dataset for the CMU Movie Summary dataset, framed as a multilabel classification problem predicting movie genres from plot summaries.
http://www.cs.cmu.edu/~ark/personas/
Blank constructor needed to satisfy mypy
METADATA_FILE = 'MovieSummaries/movie.metadata.tsv'
PLOT_SUMMARIES_FILE = 'MovieSummaries/plot_summaries.txt'
TRAIN_PCT = 0.8
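In a multilabel framing each plot summary can carry several genres at once, and `TRAIN_PCT = 0.8` indicates that 80% of the rows go to training. The sketch below illustrates both ideas with made-up genre lists. The helpers `binarize_labels` and `split_by_pct`, the seed, and the shuffle-then-cut strategy are assumptions, since this page does not show how gobbli applies `TRAIN_PCT`.

```python
import random
from typing import List, Sequence, Tuple


def binarize_labels(label_lists: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[int]]]:
    """Turn per-row label lists into 0/1 indicator rows over sorted classes.

    Hypothetical helper; not part of gobbli.
    """
    classes = sorted({label for labels in label_lists for label in labels})
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows


def split_by_pct(n_rows: int, train_pct: float = 0.8, seed: int = 1234) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and cut at `train_pct` (assumed strategy)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_pct)
    return indices[:cut], indices[cut:]


# Made-up genre lists; a row may have several genres or none.
genres = [["Drama"], ["Comedy", "Drama"], ["Thriller"], [], ["Comedy"]]
classes, indicator_rows = binarize_labels(genres)
train_idx, test_idx = split_by_pct(len(genres))
print(classes, indicator_rows)
```

The indicator representation allows a row with no genres (all zeros) and a row with several (multiple ones), which a single-label encoding cannot express.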
classmethod data_dir()
Return type: Path
embed_input(embed_batch_size=32, pooling=<EmbedPooling.MEAN: 'mean'>, limit=None)
classmethod load(*args, **kwargs)
predict_input(predict_batch_size=32, limit=None)
train_input(train_batch_size=32, valid_batch_size=8, num_train_epochs=3, valid_proportion=0.2, split_seed=1234, shuffle_seed=1234, limit=None)