gobbli.model package

Module contents

class gobbli.model.BERT(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin, gobbli.model.mixin.EmbedMixin

Classifier/embedding wrapper for Google Research’s BERT: https://github.com/google-research/bert

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.
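As a small illustration of the nvidia_visible_devices format described above, the following sketch builds the string from a list of GPU indices. The helper name format_visible_devices is hypothetical and is not part of gobbli; it only shows the expected shape of the value.

```python
from typing import Optional, Sequence


def format_visible_devices(gpu_ids: Optional[Sequence[int]] = None) -> str:
    """Hypothetical helper (not part of gobbli).

    Returns "all" when no specific GPUs are requested, otherwise a
    comma-separated string such as "1,2".
    """
    if not gpu_ids:
        return "all"
    return ",".join(str(gpu_id) for gpu_id in gpu_ids)
```

The result could then be passed as nvidia_visible_devices together with use_gpu=True.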

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.
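The download-and-clean-up behavior described for this directory can be sketched roughly as follows. This is not gobbli's implementation; ensure_weights and the download callable are hypothetical names used only to illustrate removing the directory when a download fails partway.

```python
import shutil
from pathlib import Path
from typing import Callable


def ensure_weights(weights_dir: Path, download: Callable[[Path], None]) -> Path:
    """Hypothetical sketch: populate weights_dir, removing it on failure."""
    if weights_dir.exists():
        # Assume an existing directory holds a complete earlier download.
        return weights_dir
    weights_dir.mkdir(parents=True)
    try:
        download(weights_dir)
    except BaseException:
        # Do not leave a half-populated directory behind.
        shutil.rmtree(weights_dir, ignore_errors=True)
        raise
    return weights_dir
```

Because the directory is removed on failure, its existence can serve as a simple marker that the weights were fully downloaded.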

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property do_lower_case
Return type

bool

Returns

Whether the BERT tokenizer should lowercase its input.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.
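The naming rule for embed_dir_name (and likewise train_dir_name and predict_dir_name) can be illustrated with a short sketch. The function resolve_run_dir is hypothetical and only mirrors the documented behavior: a UUID-based name when none is given, and an error when a named directory already exists.

```python
import uuid
from pathlib import Path
from typing import Optional


def resolve_run_dir(parent: Path, name: Optional[str] = None) -> Path:
    """Hypothetical sketch of the documented directory naming rule."""
    # Fall back to a unique UUID-based name when the caller gives none.
    run_name = name if name is not None else uuid.uuid4().hex
    run_dir = parent / run_name
    # exist_ok=False raises FileExistsError if the directory already exists.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```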

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc)

Return type

Path

Returns

Path to the embedding data directory.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the BERT container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

BERT parameters:

  • max_seq_length (int): The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default: 128

  • bert_model (str): Name of a pretrained BERT model to use. See BERT_MODEL_ARCHIVES for a listing of available BERT models.
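To make the effect of max_seq_length concrete, here is a simplified sketch of truncating and padding a list of token IDs. It is an assumption-laden illustration: the function name pad_or_truncate and the pad ID of 0 are hypothetical, and the real BERT pipeline also reserves positions for special tokens, which this sketch ignores.

```python
from typing import List


def pad_or_truncate(
    token_ids: List[int], max_seq_length: int = 128, pad_id: int = 0
) -> List[int]:
    """Hypothetical sketch: force a token ID list to exactly max_seq_length."""
    # Longer sequences are cut off at max_seq_length.
    truncated = token_ids[:max_seq_length]
    # Shorter sequences are filled up with the (assumed) padding ID.
    return truncated + [pad_id] * (max_seq_length - len(truncated))
```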

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using the params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc)

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.
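The general steps listed above can be sketched with a toy example. This is not how any gobbli model trains (real models run inside Docker containers); it is a minimal, self-contained loop that fits y = w * x by gradient descent, and the name train_loop is hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]


def train_loop(
    train: Sequence[Pair],
    valid: Sequence[Pair],
    epochs: int = 20,
    lr: float = 0.1,
) -> Tuple[float, List[float]]:
    """Toy sketch: fit y = w * x and track validation loss per epoch."""
    w = 0.0
    valid_losses: List[float] = []
    # Repeat for a number of epochs.
    for _ in range(epochs):
        # Update the weight using each training example.
        for x, y in train:
            w -= lr * 2.0 * (w * x - y) * x
        # Evaluate mean squared error on the validation set.
        loss = sum((w * x - y) ** 2 for x, y in valid) / len(valid)
        valid_losses.append(loss)
    # Report the losses and return the trained weight.
    return w, valid_losses
```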

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir
Return type

Path

Returns

Directory containing pretrained weights for this instance.

class gobbli.model.FastText(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin, gobbli.model.mixin.EmbedMixin

Wrapper for Facebook’s fastText model: https://github.com/facebookresearch/fastText

Note: fastText benefits from some preprocessing steps: https://fasttext.cc/docs/en/supervised-tutorial.html#preprocessing-the-data

gobbli will only lowercase and escape newlines in your input by default. If you want more sophisticated preprocessing for punctuation, stemming, etc, consider performing some preprocessing on your own beforehand.
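The default preprocessing described above might look roughly like the following sketch. The function name basic_preprocess is hypothetical, and replacing each newline with a literal backslash-n is an assumption; the exact escape gobbli applies is not stated here.

```python
def basic_preprocess(text: str) -> str:
    """Hypothetical sketch: lowercase and keep each document on one line."""
    # Assumption: newlines are escaped as the two characters "\" and "n".
    return text.lower().replace("\n", "\\n")
```

More elaborate steps such as punctuation handling or stemming would be applied to the documents before passing them to the model.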

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc)

Return type

Path

Returns

Path to the embedding data directory.

property image_tag
Return type

str

Returns

The tag to use for the fastText image.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

For more info on fastText parameter semantics, see the docs. The fastText supervised tutorial has some more detailed explanation.

fastText parameters:

  • word_ngrams (int): Max length of word n-grams.

  • lr (float): Learning rate.

  • dim (int): Dimension of learned vectors.

  • ws (int): Context window size.

  • autotune_duration (int): Duration in seconds to spend autotuning parameters. Any of the above parameters that are manually specified will not be autotuned.

  • autotune_modelsize (str): Maximum size of the autotuned model (e.g. “2M” for 2 megabytes). Any of the above parameters that are manually specified will not be autotuned.

  • fasttext_model (str): Name of a pretrained fastText model to use. See FASTTEXT_VECTOR_ARCHIVES for a listing of available pretrained models.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using the params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc)

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir
Return type

Path

Returns

The directory containing pretrained weights for this instance.

class gobbli.model.MajorityClassifier(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin

Simple classifier that returns the majority class from the training set.

Useful for ensuring user code works with the gobbli input/output format without having to build a time-consuming model.
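The idea behind a majority-class baseline can be sketched in a few lines. This is an illustration of the technique, not gobbli's code; fit_majority and predict_majority are hypothetical names.

```python
from collections import Counter
from typing import List, Sequence


def fit_majority(y_train: Sequence[str]) -> str:
    """Return the most frequent label in the training set."""
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(majority: str, X: Sequence[str]) -> List[str]:
    """Predict the same majority label for every input document."""
    return [majority for _ in X]
```

Such a baseline ignores the document text entirely, which is why it is cheap enough to use for checking input/output plumbing.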

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

Initialize a derived model using parameters specific to that model.

Parameters

params (Dict[str, Any]) – A dictionary where keys are parameter names and values are parameter values.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using the params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc)

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.

class gobbli.model.MTDNN(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin

Classifier wrapper for Microsoft’s MT-DNN: https://github.com/namisan/mt-dnn

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the MT-DNN container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

MT-DNN parameters:

  • max_seq_length (int): The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default: 128

  • mtdnn_model (str): Name of a pretrained MT-DNN model to use. See MTDNN_MODEL_FILES for a listing of available MT-DNN models.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using the params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc)

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir
Return type

Path

Returns

The directory containing pretrained weights for this instance.

class gobbli.model.RandomEmbedder(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.EmbedMixin

Dummy embeddings generator that returns random numbers as embeddings and has a stub training method to create a uniform API with other embedding models.

Useful for ensuring user code works with the gobbli input/output format without having to build a time-consuming model.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

DIMENSIONALITY = 32
SEED = 1234
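A seeded random embedder with the constants above could be sketched as follows. This uses only the standard library and is an assumption about the general approach, not gobbli's implementation; random_embeddings is a hypothetical name.

```python
import random
from typing import List, Sequence

DIMENSIONALITY = 32
SEED = 1234


def random_embeddings(
    docs: Sequence[str], dim: int = DIMENSIONALITY, seed: int = SEED
) -> List[List[float]]:
    """Hypothetical sketch: one random vector of length dim per document."""
    # A fixed seed makes the generated vectors reproducible across calls.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in docs]
```

The document contents are never inspected, so the output depends only on the seed and the number of documents.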
build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc)

Return type

Path

Returns

Path to the embedding data directory.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

Initialize a derived model using parameters specific to that model.

Parameters

params (Dict[str, Any]) – A dictionary where keys are parameter names and values are parameter values.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

tokenize(X)[source]

Return a tokenized list of documents.

Return type

List[List[str]]

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.

class gobbli.model.Transformer(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin, gobbli.model.mixin.EmbedMixin

Classifier/embedding wrapper for any of the Transformers from transformers.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc)

Return type

Path

Returns

Path to the embedding data directory.

property host_cache_dir

Directory to be used for downloaded transformers files. Should be the same across all instances of the class, since these are generally static model weights/config files that can be reused.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the transformer container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

Transformer parameters:

  • transformer_model (str): Name of a transformer model architecture to use. For training/prediction, the value must be one for which from transformers import <value>ForSequenceClassification is a valid import. For example, the value “Bert” corresponds to from transformers import BertForSequenceClassification. This means only a subset of the transformers models is supported for these tasks – search the transformers docs to see which ones you can use. For embedding generation, the import is <value>Model, so any transformer model is supported.

  • transformer_weights (str): Name of the pretrained weights to use. See the transformers docs for supported values. These depend on the transformer_model chosen.

  • config_overrides (dict): Dictionary of keys and values that will override config for the model.

  • max_seq_length: Truncate all sequences to this length after tokenization. Used to save memory.

  • lr: Learning rate for the AdamW optimizer.

  • adam_eps: Epsilon value for the AdamW optimizer.

  • gradient_accumulation_steps: Number of iterations to accumulate gradients before updating the model. Used to allow larger effective batch sizes for models too big to fit a large batch on the GPU. The “effective batch size” is gradient_accumulation_steps * TrainInput.params.train_batch_size. If you encounter memory errors while training, try decreasing the batch size and increasing gradient_accumulation_steps. For example, if a training batch size of 32 causes memory errors, try decreasing the batch size to 16 and increasing gradient_accumulation_steps to 2. If memory errors persist, reduce the batch size to 8 and increase gradient_accumulation_steps to 4, and so on.

Note that gobbli relies on transformers to perform validation on these parameters, so initialization errors may not be caught until model runtime.
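The naming rule for transformer_model and the batch-size arithmetic for gradient_accumulation_steps can be sketched in plain Python. The helper names below are hypothetical and exist only for illustration; they are not part of gobbli or transformers.

```python
def classifier_class_name(transformer_model: str) -> str:
    """Name of the class imported from transformers for training/prediction."""
    return f"{transformer_model}ForSequenceClassification"


def embedding_class_name(transformer_model: str) -> str:
    """Name of the class imported from transformers for embedding generation."""
    return f"{transformer_model}Model"


def effective_batch_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps


def halve_batch(train_batch_size: int, gradient_accumulation_steps: int):
    """Halve the batch size and double the accumulation steps.

    The effective batch size stays the same while per-step memory use drops.
    """
    if train_batch_size % 2 != 0:
        raise ValueError("train_batch_size must be even to halve it")
    return train_batch_size // 2, gradient_accumulation_steps * 2


# Walk through the sequence described above: 32x1, then 16x2, then 8x4.
start = (32, 1)
step1 = halve_batch(*start)
step2 = halve_batch(*step1)
```

Each step keeps `effective_batch_size` at 32, so the optimizer sees the same number of examples per update even though fewer are held in memory at once.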

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of this model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc.).

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.
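The directory rule shared by train_dir_name, predict_dir_name, and embed_dir_name (a generated name when none is given, an error when the given name already exists) can be sketched as follows. `make_run_dir` is a hypothetical helper, not a gobbli function, and using `uuid4().hex` is an assumption about how the unique name is formed.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def make_run_dir(parent: Path, dir_name: Optional[str] = None) -> Path:
    """Create a fresh run directory under parent; fail if it already exists."""
    name = dir_name if dir_name is not None else uuid.uuid4().hex
    run_dir = parent / name
    # exist_ok=False raises FileExistsError when the name is already taken.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


data_dir = Path(tempfile.mkdtemp())  # stand-in for the model's data_dir
named = make_run_dir(data_dir, "baseline")
generated = make_run_dir(data_dir)

try:
    make_run_dir(data_dir, "baseline")
    reused = True
except FileExistsError:
    reused = False
```

Refusing to reuse a name keeps the input and output of separate runs from overwriting each other.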

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.
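The relationship between class_weights_dir and weights_dir can be illustrated with a small sketch. The two classes and the `weights_name` attribute below are hypothetical stand-ins, not gobbli classes; they only show the default and the kind of override a subclass might define.

```python
from pathlib import Path


class BaseSketch:
    """Hypothetical stand-in for a base model class."""

    def __init__(self, root: Path):
        self._root = root

    @property
    def class_weights_dir(self) -> Path:
        # Root directory for all pretrained weights of this class.
        return self._root / "weights"

    @property
    def weights_dir(self) -> Path:
        # Default: the instance uses the class-wide directory directly.
        return self.class_weights_dir


class PretrainedSketch(BaseSketch):
    """Hypothetical subclass with several sets of pretrained weights."""

    def __init__(self, root: Path, weights_name: str):
        super().__init__(root)
        self.weights_name = weights_name

    @property
    def weights_dir(self) -> Path:
        # Override: each set of weights gets its own subdirectory.
        return self.class_weights_dir / self.weights_name


root = Path("model_root")
base_dir = BaseSketch(root).weights_dir
cased_dir = PretrainedSketch(root, "cased").weights_dir
uncased_dir = PretrainedSketch(root, "uncased").weights_dir
```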

class gobbli.model.SKLearnClassifier(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin

Classifier wrapper for scikit-learn classifiers. Wraps a sklearn.base.BaseEstimator which accepts text input and outputs predictions.

Creating an estimator that meets those conditions will generally require some use of sklearn.pipeline.Pipeline to compose a transform (e.g. a vectorizer to vectorize text) and an estimator (e.g. logistic regression). See the helper functions in this module for some examples. You may also consider wrapping the pipeline with sklearn.model_selection.GridSearchCV to tune hyperparameters.

For multilabel classification, the passed estimator will be automatically wrapped in a sklearn.multiclass.OneVsRestClassifier.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string: ex. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

SKLearnClassifier parameters:

  • estimator_path (str): Path to an estimator pickled by joblib. The pickle will be loaded, and the resulting object will be used as the estimator. If not provided, a default pipeline composed of a TF-IDF vectorizer and a logistic regression will be used.
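A minimal sketch of preparing a file for estimator_path, assuming scikit-learn and joblib are installed. The pipeline mirrors the default described above (TF-IDF vectorizer plus logistic regression); the file name, the toy data, and the choice to pickle an unfitted pipeline are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A pipeline that accepts raw text and outputs class predictions.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Pickle the estimator with joblib; the string form of this path is the
# kind of value the estimator_path parameter describes.
estimator_path = Path(tempfile.mkdtemp()) / "estimator.joblib"
joblib.dump(pipeline, estimator_path)

# Round-trip check: the loaded object trains and predicts on text directly.
loaded = joblib.load(estimator_path)
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = ["pos", "pos", "neg", "neg"]
loaded.fit(texts, labels)
predictions = loaded.predict(["great movie"])
```

The same approach would apply to a pipeline wrapped in sklearn.model_selection.GridSearchCV, since that object is also an estimator that joblib can pickle.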

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of this model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc.).

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.

class gobbli.model.SpaCyModel(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin, gobbli.model.mixin.EmbedMixin

gobbli interface for spaCy language models which allows for training and prediction via the TextCategorizer pipeline component and static embeddings via Vectors.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string: ex. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.
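The accepted shape of nvidia_visible_devices (either 'all' or a comma-separated string such as 1,2) can be checked with a small helper. `parse_visible_devices` is hypothetical and not part of gobbli; treating every entry as an integer GPU index is an assumption, and how gobbli itself validates the value is not shown here.

```python
from typing import List, Union


def parse_visible_devices(value: str) -> Union[str, List[int]]:
    """Return 'all', or the GPU indices parsed from a comma-separated string."""
    if value == "all":
        return "all"
    parts = [part.strip() for part in value.split(",")]
    # Assumption: every entry is a non-negative integer index.
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"expected 'all' or comma-separated indices, got {value!r}")
    return [int(part) for part in parts]


everything = parse_visible_devices("all")
two_gpus = parse_visible_devices("1,2")

try:
    parse_visible_devices("1,,2")
    malformed_accepted = True
except ValueError:
    malformed_accepted = False
```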

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc.).

Return type

Path

Returns

Path to the embedding data directory.

property host_cache_dir

Directory to be used for downloaded spaCy files. Should be the same across all instances of the class, since these are generally static model weights that can be reused.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the spaCy container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

spaCy parameters:

  • model (str): Name of a spaCy model to use. Available values are in the spaCy model docs and the spacy-transformers docs.

  • architecture (str): Model architecture to use. Available values are in the spaCy API docs. This is ignored if using a spacy-transformers model.

  • dropout (float): Dropout proportion for training.

  • full_pipeline (bool): If True, enable the full spaCy language pipeline (including tagging, parsing, and named entity recognition) for the TextCategorizer model used in training and prediction. This makes training/prediction much slower but theoretically provides more information to the model. This is ignored if using a spacy-transformers model.

Note that gobbli relies on spaCy to perform validation on these parameters, so initialization errors may not be caught until model runtime.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of this model class.

predict(predict_input, predict_dir_name=None)

Runs prediction on new data using params contained in the given gobbli.io.PredictInput.

Parameters
  • predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.

  • predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

predict_dir()

The directory to be used for data related to prediction (weights, predictions, etc.).

Return type

Path

Returns

Path to the prediction data directory.

train(train_input, train_dir_name=None)

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

  • Update weights using the training dataset.

  • Evaluate performance on the validation dataset.

  • Repeat for a number of epochs.

  • When finished, report loss/accuracy and return the trained weights.

Parameters
  • train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.

  • train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.
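The general training steps listed for train() can be illustrated with a toy loop in plain Python. This is not how any gobbli model trains; it fits a single weight by gradient descent purely to show the update/evaluate/repeat/report structure, and every name in it is hypothetical.

```python
def train_sketch(train_data, valid_data, num_epochs=20, lr=0.05):
    """Fit y = w * x by gradient descent, evaluating after every epoch."""
    w = 0.0
    history = []
    for _ in range(num_epochs):
        # 1. Update weights using the training dataset.
        for x, y in train_data:
            grad = 2 * (w * x - y) * x  # derivative of the squared error
            w -= lr * grad
        # 2. Evaluate performance on the validation dataset.
        valid_loss = sum((w * x - y) ** 2 for x, y in valid_data) / len(valid_data)
        history.append(valid_loss)
        # 3. The loop repeats for the configured number of epochs.
    # 4. Report the validation losses and return the trained weight.
    return w, history


train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
valid_data = [(4.0, 8.0)]
weight, losses = train_sketch(train_data, valid_data)
```

Keeping the validation data separate from the data used for updates is what makes the per-epoch loss a meaningful progress signal.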

train_dir()

The directory to be used for data related to training (data files, etc).

Return type

Path

Returns

Path to the training data directory.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.

class gobbli.model.USE(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.EmbedMixin

Wrapper for Universal Sentence Encoder embeddings: https://tfhub.dev/google/universal-sentence-encoder/4

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string: ex. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.
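The interaction between data_dir and load_existing described in the parameter list can be sketched as a precondition check. `check_data_dir` is a hypothetical helper written for illustration; the exact checks and error types gobbli uses are not shown in this reference.

```python
import tempfile
from pathlib import Path


def check_data_dir(data_dir: Path, load_existing: bool) -> None:
    """Raise ValueError if data_dir is unsuitable for the requested mode."""
    if load_existing:
        # Loading requires a directory left behind by a previous model.
        if not data_dir.is_dir():
            raise ValueError(f"{data_dir} does not exist, so nothing can be loaded")
        return
    # A new model needs a directory that is either absent or empty.
    if data_dir.is_dir() and any(data_dir.iterdir()):
        raise ValueError(f"{data_dir} must be empty when load_existing is False")


root = Path(tempfile.mkdtemp())
empty_dir = root / "fresh"
empty_dir.mkdir()
used_dir = root / "used"
used_dir.mkdir()
(used_dir / "gobbli-info.json").write_text("{}")  # hypothetical file name

check_data_dir(empty_dir, load_existing=False)  # passes: directory is empty
check_data_dir(used_dir, load_existing=True)    # passes: directory exists

try:
    check_data_dir(used_dir, load_existing=False)
    overwrote = True
except ValueError:
    overwrote = False
```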

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc.).

Return type

Path

Returns

Path to the embedding data directory.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the USE container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

USE parameters:

  • use_model (str): Name of a USE model to use. See USE_MODEL_ARCHIVES for a listing of available USE models.

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of this model class.

property weights_dir
Return type

Path

Returns

Directory containing pretrained weights for this instance.

class gobbli.model.TfidfEmbedder(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.EmbedMixin

Embedding wrapper for scikit-learn’s sklearn.feature_extraction.text.TfidfVectorizer. Generates “embeddings” composed of TF-IDF vectors.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string: ex. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters
  • embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.

  • embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.

embed_dir()

The directory to be used for data related to embedding (weights, embeddings, etc.).

Return type

Path

Returns

Path to the embedding data directory.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]

See gobbli.model.base.BaseModel.init().

TfidfEmbedder parameters will be passed directly to the sklearn.feature_extraction.text.TfidfVectorizer constructor, which will perform its own validation.
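Since the parameters go straight to the TfidfVectorizer constructor, it can help to try them on the vectorizer directly first. The sketch below assumes scikit-learn is installed; the parameter values and sample texts are arbitrary examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Any keyword accepted by TfidfVectorizer is a valid model parameter.
params = {"ngram_range": (1, 2), "min_df": 1, "sublinear_tf": True}
vectorizer = TfidfVectorizer(**params)
matrix = vectorizer.fit_transform(["the cat sat", "the dog sat", "the cat ran"])

# An unknown keyword is rejected by the constructor itself.
try:
    TfidfVectorizer(not_a_real_option=True)
    rejected = False
except TypeError:
    rejected = True
```

Each row of `matrix` is the TF-IDF vector for one input text, which is the kind of “embedding” this class produces.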

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of this model class.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.