gobbli.model.sklearn package
Submodules
Module contents
- class gobbli.model.sklearn.SKLearnClassifier(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)
Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin
Classifier wrapper for scikit-learn classifiers. Wraps a sklearn.base.BaseEstimator which accepts text input and outputs predictions.
Creating an estimator that meets those conditions will generally require sklearn.pipeline.Pipeline to compose a transform (e.g. a vectorizer to vectorize text) and an estimator (e.g. logistic regression). See the helper functions in this module for some examples. You may also consider wrapping the pipeline with sklearn.model_selection.GridSearchCV to tune hyperparameters.
For multilabel classification, the passed estimator will be automatically wrapped in a sklearn.multiclass.OneVsRestClassifier.
Create a model.
- Parameters
data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.
load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.
use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.
nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.
logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.
**kwargs – Additional model-specific parameters to be passed to the model’s init() method.
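The class description calls for an estimator that accepts raw text and outputs predictions. A minimal sketch of such an estimator, using only scikit-learn, might look like the following. The step names `tfidf` and `logreg`, the toy data, and the two label columns are assumptions for illustration; gobbli's own default pipeline may be configured differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline


def make_text_pipeline() -> Pipeline:
    """Compose a vectorizer and a classifier (step names are illustrative)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("logreg", LogisticRegression()),
    ])


texts = [
    "great movie, loved it",
    "wonderful acting and plot",
    "terrible film, hated it",
    "awful script and boring plot",
]

# Multiclass case: raw strings go in, one label per document comes out.
text_clf = make_text_pipeline()
text_clf.fit(texts, ["pos", "pos", "neg", "neg"])
predictions = text_clf.predict(["loved the acting", "boring and awful"])

# Multilabel case: wrap the same kind of pipeline so that one binary
# classifier is trained per column of the indicator matrix.
# The two columns stand for two hypothetical labels.
indicator = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
multilabel_clf = OneVsRestClassifier(make_text_pipeline())
multilabel_clf.fit(texts, indicator)
multilabel_predictions = multilabel_clf.predict(["wonderful plot"])
```

The multilabel prediction is an indicator matrix with one row per input document and one column per label.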
- build()
Perform any pre-setup that needs to be done before running the model (building Docker images, etc.).
- property class_weights_dir
The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.
Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.
- Return type
Path
- Returns
The path to the class-wide weights directory.
- data_dir()
- Return type
Path
- Returns
The main data directory unique to this instance of the model.
- property info_path
- Return type
Path
- Returns
The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.
- init(params)
See gobbli.model.base.BaseModel.init().
SKLearnClassifier parameters:
estimator_path (str): Path to an estimator pickled by joblib. The pickle will be loaded, and the resulting object will be used as the estimator. If not provided, a default pipeline composed of a TF-IDF vectorizer and a logistic regression will be used.
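To illustrate what an `estimator_path` could point at, here is a hedged sketch that pickles a scikit-learn pipeline with joblib and loads it back. The pipeline settings, the step names, and the temporary file location are assumptions; the commented-out constructor call follows the signature documented on this page and is not executed here.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# An unfitted text pipeline; step names and settings are illustrative only.
estimator = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(C=0.5)),
])

# Pickle the estimator with joblib. A temporary directory stands in for
# wherever you would actually keep the file.
estimator_path = Path(tempfile.mkdtemp()) / "estimator.joblib"
joblib.dump(estimator, estimator_path)

# Loading the pickle returns an equivalent estimator object.
restored = joblib.load(estimator_path)

# Assumed usage (requires gobbli to be installed):
# clf = SKLearnClassifier(estimator_path=str(estimator_path))
```

If you would rather not manage the file yourself, persist_estimator() in this module saves an estimator to a gobbli-managed path and returns that path.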
- property logger
- Return type
Logger
- Returns
A logger for derived models to use.
- property metadata_path
- Return type
Path
- Returns
The path to the model’s metadata file containing model-specific parameters.
- classmethod model_class_dir()
- Return type
Path
- Returns
A directory shared among all instances of the model class.
- predict(predict_input, predict_dir_name=None)
Runs prediction on new data using params contained in the given gobbli.io.PredictInput.
- Parameters
predict_input (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.
predict_dir_name (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.
- Return type
- predict_dir()
The directory to be used for data related to prediction (weights, predictions, etc.).
- Return type
Path
- Returns
Path to the prediction data directory.
- train(train_input, train_dir_name=None)
Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:
Update weights using the training dataset.
Evaluate performance on the validation dataset.
Repeat for a number of epochs.
When finished, report loss/accuracy and return the trained weights.
- Parameters
train_input (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.
train_dir_name (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.
- Return type
- Returns
Output of training.
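The directory-naming rule shared by train() and predict() (a UUID-based name when none is given, and an error when a given name already exists) can be sketched with the standard library alone. The helper `resolve_run_dir` below is hypothetical and only mirrors the documented behavior; it is not gobbli's implementation, and the exception gobbli raises may differ.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def resolve_run_dir(parent: Path, dir_name: Optional[str] = None) -> Path:
    """Hypothetical helper: pick and create a run directory under `parent`."""
    if dir_name is None:
        # No name given: generate a unique one from a UUID.
        dir_name = uuid.uuid4().hex
    run_dir = parent / dir_name
    # exist_ok=False raises FileExistsError if the directory is already there.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# A temporary directory stands in for the model's data_dir.
data_dir = Path(tempfile.mkdtemp())
unnamed = resolve_run_dir(data_dir)
named = resolve_run_dir(data_dir, "experiment-1")

# Reusing an explicit name is rejected rather than silently overwritten.
try:
    resolve_run_dir(data_dir, "experiment-1")
    reused = True
except FileExistsError:
    reused = False
```

Refusing to reuse a name keeps the inputs and outputs of separate runs from being mixed in one directory.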
- train_dir()
The directory to be used for data related to training (data files, etc.).
- Return type
Path
- Returns
Path to the training data directory.
- property weights_dir
The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.
- Return type
Path
- Returns
The instance-specific weights directory.
- class gobbli.model.sklearn.TfidfEmbedder(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)
Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.EmbedMixin
Embedding wrapper for scikit-learn’s sklearn.feature_extraction.text.TfidfVectorizer. Generates “embeddings” composed of TF-IDF vectors.
Create a model.
- Parameters
data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.
load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.
use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.
nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.
logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.
**kwargs – Additional model-specific parameters to be passed to the model’s init() method.
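The interaction between data_dir and load_existing is easy to trip over, so a small sketch of the documented rule may help. The function `check_data_dir` is hypothetical and the choice of ValueError is an assumption; it only restates the rule from the parameter descriptions and is not gobbli's own code.

```python
import tempfile
from pathlib import Path


def check_data_dir(data_dir: Path, load_existing: bool) -> None:
    """Hypothetical check mirroring the documented rule; not gobbli's code."""
    if load_existing:
        # An existing model is expected, so the directory must be there.
        if not data_dir.is_dir():
            raise ValueError(f"No existing model directory at {data_dir}")
    elif data_dir.exists() and any(data_dir.iterdir()):
        # A new model must not be created on top of old contents.
        raise ValueError(f"{data_dir} exists and is not empty")


model_dir = Path(tempfile.mkdtemp())

# An existing but empty directory is acceptable for a new model.
check_data_dir(model_dir, load_existing=False)

# Once the directory has contents, creating a new model there is rejected.
(model_dir / "leftover.txt").write_text("old run")
try:
    check_data_dir(model_dir, load_existing=False)
    rejected = False
except ValueError:
    rejected = True
```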
- build()
Perform any pre-setup that needs to be done before running the model (building Docker images, etc.).
- property class_weights_dir
The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.
Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.
- Return type
Path
- Returns
The path to the class-wide weights directory.
- data_dir()
- Return type
Path
- Returns
The main data directory unique to this instance of the model.
- embed(embed_input, embed_dir_name=None)
Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.
- Parameters
embed_input (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.
embed_dir_name (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.
- Return type
- Returns
Output of generating embeddings.
- embed_dir()
The directory to be used for data related to embedding (weights, embeddings, etc.).
- Return type
Path
- Returns
Path to the embedding data directory.
- property info_path
- Return type
Path
- Returns
The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.
- init(params)
See gobbli.model.base.BaseModel.init().
TfidfEmbedder parameters will be passed directly to the sklearn.feature_extraction.text.TfidfVectorizer constructor, which will perform its own validation.
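Since the parameters are handed straight to TfidfVectorizer, it can help to see what that constructor does with them. The sketch below uses scikit-learn directly; the parameter values and sample documents are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keyword arguments of the kind that could be supplied for the embedder.
params = {"lowercase": True, "ngram_range": (1, 1), "max_features": 50}
vectorizer = TfidfVectorizer(**params)

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]

# Each row is one document's TF-IDF vector; with the default norm="l2",
# every non-empty row has unit length.
embeddings = vectorizer.fit_transform(docs).toarray()
row_norms = (embeddings ** 2).sum(axis=1) ** 0.5

# The constructor rejects keyword arguments it does not know about.
try:
    TfidfVectorizer(not_a_real_option=1)
    unknown_accepted = True
except TypeError:
    unknown_accepted = False
```

The width of each vector equals the size of the fitted vocabulary, so "embeddings" from different fits are generally not comparable.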
- property logger
- Return type
Logger
- Returns
A logger for derived models to use.
- property metadata_path
- Return type
Path
- Returns
The path to the model’s metadata file containing model-specific parameters.
- classmethod model_class_dir()
- Return type
Path
- Returns
A directory shared among all instances of the model class.
- property weights_dir
The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.
- Return type
Path
- Returns
The instance-specific weights directory.
- gobbli.model.sklearn.persist_estimator(estimator)
Saves the given estimator to a gobbli-managed filepath, where it can be loaded from disk by the SKLearnClassifier. This is useful if you want to use an estimator but don’t want to bother with saving it to disk on your own.
- Parameters
estimator (BaseEstimator) – The estimator to save.
- Return type
Path
- Returns
The path where the estimator was saved.
- gobbli.model.sklearn.make_cv_tfidf_logistic_regression(grid_params=None)
- Parameters
grid_params (Optional[Dict[str, Any]]) – Grid search parameters for the pipeline. Passed directly to sklearn.model_selection.GridSearchCV. See make_default_tfidf_logistic_regression() for the names of the pipeline components. If not given, will use a somewhat reasonable default.
- Return type
- Returns
A cross-validated pipeline combining a TF-IDF vectorizer and logistic regression model with the specified grid parameters.
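A grid for a pipeline is keyed by step name and parameter name joined with a double underscore. The sketch below builds a comparable search directly with scikit-learn. The step names `tfidf` and `logreg` are assumptions (the real names are defined in make_default_tfidf_logistic_regression()), and the toy data and `cv=2` are chosen only so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical step names; check the default pipeline for the real ones.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])

# Keys follow scikit-learn's "<step name>__<parameter>" convention.
grid_params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, grid_params, cv=2)

texts = [
    "great movie, loved it",
    "wonderful acting and plot",
    "a delightful and moving film",
    "loved the soundtrack and the cast",
    "terrible film, hated it",
    "awful script and boring plot",
    "a dull and tedious movie",
    "hated the ending and the cast",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Fitting tries every parameter combination with 2-fold cross-validation,
# then refits the best combination on all of the data.
search.fit(texts, labels)
best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])
```

Because the fitted search object is itself an estimator that accepts text, it can be pickled and used like any other pipeline.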