gobbli.model.fasttext.model module¶

gobbli.model.fasttext.model.FASTTEXT_VECTOR_ARCHIVES = {
    'crawl-300d': 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip',
    'crawl-300d-subword': 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip',
    'wiki-aligned-300d': 'https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec',
    'wiki-crawl-300d': 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz',
    'wiki-news-300d': 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip',
    'wiki-news-300d-subword': 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip'
}¶

A mapping from pretrained vector names to archive URLs. See the fastText docs for information about each set of vectors. Note that some sets of vectors are very large, so downloads can take a long time and use substantial disk space.
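As a rough illustration of how this mapping can be consumed, the sketch below copies the mapping and derives the archive file name and compression format for a named vector set. The helpers archive_filename and archive_format are hypothetical names used only for this example; they are not part of gobbli.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Copied from the FASTTEXT_VECTOR_ARCHIVES mapping documented in this module.
FASTTEXT_VECTOR_ARCHIVES = {
    "crawl-300d": "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip",
    "crawl-300d-subword": "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip",
    "wiki-aligned-300d": "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec",
    "wiki-crawl-300d": "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz",
    "wiki-news-300d": "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip",
    "wiki-news-300d-subword": "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip",
}


def archive_filename(name: str) -> str:
    """Return the file name at the end of the archive URL (hypothetical helper)."""
    url = FASTTEXT_VECTOR_ARCHIVES[name]
    return PurePosixPath(urlparse(url).path).name


def archive_format(name: str) -> str:
    """Classify the archive by its final extension: 'zip', 'gz', or 'plain'."""
    suffix = PurePosixPath(archive_filename(name)).suffix
    return {".zip": "zip", ".gz": "gz"}.get(suffix, "plain")


if __name__ == "__main__":
    for vector_name in FASTTEXT_VECTOR_ARCHIVES:
        print(vector_name, archive_filename(vector_name), archive_format(vector_name))
```

The archives are not all packaged the same way (zip, gzip, or an uncompressed .vec file), which is worth keeping in mind if you download them manually.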

class gobbli.model.fasttext.model.FastText(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)[source]¶

Bases: gobbli.model.base.BaseModel, gobbli.model.mixin.TrainMixin, gobbli.model.mixin.PredictMixin, gobbli.model.mixin.EmbedMixin

Wrapper for Facebook’s fastText model: https://github.com/facebookresearch/fastText

Note: fastText benefits from some preprocessing steps: https://fasttext.cc/docs/en/supervised-tutorial.html#preprocessing-the-data

By default, gobbli only lowercases your input and escapes newlines. If you want more sophisticated preprocessing (punctuation handling, stemming, etc.), consider preprocessing your data yourself beforehand.
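A minimal sketch of the kind of preprocessing you might apply yourself before handing text to the model, loosely following the fastText supervised tutorial's approach of separating punctuation from words and lowercasing. The function name preprocess and the exact punctuation set are assumptions for this example, not something gobbli provides.

```python
import re

# Punctuation characters to split off from neighboring words (an assumed set,
# modeled on the fastText supervised tutorial).
_PUNCTUATION = re.compile(r"([.!?,'/()])")


def preprocess(text: str) -> str:
    """Pad punctuation with spaces, lowercase, and collapse all whitespace."""
    padded = _PUNCTUATION.sub(r" \1 ", text)
    # str.split() with no arguments also removes newlines and repeated spaces.
    return " ".join(padded.lower().split())


if __name__ == "__main__":
    print(preprocess("Hello, World!\nHow's it going?"))
    # hello , world ! how ' s it going ?
```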

Create a model.

Parameters

data_dir¶ (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.
load_existing¶ (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False, the data_dir must be empty if it already exists.
use_gpu¶ (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. Will cause an error if the computer you’re running on doesn’t have an NVIDIA GPU and/or doesn’t have the nvidia-docker runtime installed.
nvidia_visible_devices¶ (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string of device indices, e.g. 1,2.
logger¶ (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.
**kwargs¶ – Additional model-specific parameters to be passed to the model’s init() method.
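A hedged usage sketch of constructing the model with the parameters above. It assumes gobbli is installed and Docker is available; the import is done inside the function so the snippet can be loaded without gobbli. The function name make_fasttext and the specific parameter values are illustrative only.

```python
from pathlib import Path
from typing import Any, Dict, Optional

# Model-specific parameters forwarded through **kwargs to init().
# The values are arbitrary examples, not recommended settings.
MODEL_PARAMS: Dict[str, Any] = {"word_ngrams": 2, "lr": 0.5, "dim": 100}


def make_fasttext(data_dir: Optional[Path] = None):
    """Create and build a FastText model (requires gobbli and Docker)."""
    # Imported lazily: gobbli is an external dependency.
    from gobbli.model.fasttext.model import FastText

    model = FastText(
        data_dir=data_dir,      # None lets gobbli create a unique directory
        load_existing=False,    # data_dir must be empty if it already exists
        use_gpu=False,          # True needs an NVIDIA GPU and nvidia-docker
        **MODEL_PARAMS,
    )
    model.build()  # pre-setup such as building the Docker image
    return model
```

With load_existing=True, the model-specific parameters would be ignored in favor of those stored in data_dir, so pass them only when creating a new model.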

build()¶: Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir¶

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type: Path
Returns: The path to the class-wide weights directory.

data_dir()¶

Return type: Path
Returns: The main data directory unique to this instance of the model.

embed(embed_input, embed_dir_name=None)¶

Generates embeddings using a model and the params in the given gobbli.io.EmbedInput.

Parameters

embed_input¶ (EmbedInput) – Contains various parameters needed to determine how to generate embeddings and what data to generate embeddings for.
embed_dir_name¶ (Optional[str]) – Optional name to store embedding input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

EmbedOutput

Returns

Output of embedding generation.
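The naming rule for embed_dir_name (a UUID-based name when none is given, and an error if a given name already exists) can be sketched as follows. This is an illustration of the documented behavior, not gobbli's implementation; resolve_run_dir is a hypothetical helper, and raising FileExistsError is an assumption about how the conflict would surface. The same rule is documented for predict_dir_name and train_dir_name.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def resolve_run_dir(data_dir: Path, dir_name: Optional[str] = None) -> Path:
    """Create and return a fresh run directory under data_dir."""
    # No name given: fall back to a unique UUID-based name.
    name = uuid.uuid4().hex if dir_name is None else dir_name
    run_dir = data_dir / name
    # exist_ok=False enforces "that directory must not already exist".
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(resolve_run_dir(Path(tmp), "my-embeddings"))
        print(resolve_run_dir(Path(tmp)))
```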

embed_dir()¶

The directory to be used for data related to embedding (weights, embeddings, etc.).

Return type: Path
Returns: Path to the embedding data directory.

property image_tag¶

Return type: str
Returns: The tag to use for the fastText image.

property info_path¶

Return type: Path
Returns: The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)[source]¶

See gobbli.model.base.BaseModel.init().

For more info on fastText parameter semantics, see the docs. The fastText supervised tutorial has some more detailed explanation.

fastText parameters:

word_ngrams (int): Max length of word n-grams.
lr (float): Learning rate.
dim (int): Dimension of learned vectors.
ws (int): Context window size.
autotune_duration (int): Duration in seconds to spend autotuning parameters. Any parameter listed above that is specified manually will not be autotuned.
autotune_modelsize (str): Maximum size of the autotuned model (e.g. “2M” for 2 megabytes). Any parameter listed above that is specified manually will not be autotuned.
fasttext_model (str): Name of a pretrained fastText model to use. See FASTTEXT_VECTOR_ARCHIVES for a listing of available pretrained models.
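The interaction between manually specified parameters and autotuning can be sketched as below. This is only an illustration of the rule stated above, under two assumptions: that the tunable parameters are word_ngrams, lr, dim, and ws, and that autotuning is active only when autotune_duration is set. The helper split_autotune is hypothetical and not part of gobbli.

```python
from typing import Any, Dict, List, Tuple

# Assumed set of parameters eligible for autotuning (those listed above the
# autotune options in this docstring).
TUNABLE_PARAMS = ("word_ngrams", "lr", "dim", "ws")


def split_autotune(params: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (manually fixed params, names left to the autotuner)."""
    fixed = {name: params[name] for name in TUNABLE_PARAMS if name in params}
    if "autotune_duration" not in params:
        # Assumption: without a duration, nothing is autotuned.
        return fixed, []
    tuned = [name for name in TUNABLE_PARAMS if name not in params]
    return fixed, tuned


if __name__ == "__main__":
    print(split_autotune({"lr": 0.5, "autotune_duration": 60}))
    # ({'lr': 0.5}, ['word_ngrams', 'dim', 'ws'])
```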

property logger¶

Return type: Logger
Returns: A logger for derived models to use.

property metadata_path¶

Return type: Path
Returns: The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()¶

Return type: Path
Returns: A directory shared among all instances of the model class.

predict(predict_input, predict_dir_name=None)¶

Runs prediction on new data using the params contained in the given gobbli.io.PredictInput.

Parameters

predict_input¶ (PredictInput) – Contains various parameters needed to determine how to run prediction and what data to predict for.
predict_dir_name¶ (Optional[str]) – Optional name to store prediction input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

PredictOutput

Returns

Output of prediction.

predict_dir()¶

The directory to be used for data related to prediction (weights, predictions, etc.).

Return type: Path
Returns: Path to the prediction data directory.

train(train_input, train_dir_name=None)¶

Trains a model using params in the given gobbli.io.TrainInput. The training process varies depending on the model, but in general, it includes the following steps:

Update weights using the training dataset.
Evaluate performance on the validation dataset.
Repeat for a number of epochs.
When finished, report loss/accuracy and return the trained weights.
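The general steps above can be sketched with a toy model. This is a generic illustration of the update/evaluate/repeat loop using a one-variable linear regressor trained by stochastic gradient descent; it is not how fastText or gobbli trains internally, and all names here are made up for the example.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]


def train_toy_model(
    train_data: Dataset, valid_data: Dataset, epochs: int = 200, lr: float = 0.05
) -> Tuple[Tuple[float, float], List[float]]:
    """Fit y = w * x + b and return the weights plus per-epoch validation loss."""
    w, b = 0.0, 0.0
    valid_losses: List[float] = []
    for _ in range(epochs):  # repeat for a number of epochs
        # Update weights using the training dataset.
        for x, y in train_data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
        # Evaluate performance on the validation dataset (mean squared error).
        loss = sum(((w * x + b) - y) ** 2 for x, y in valid_data) / len(valid_data)
        valid_losses.append(loss)
    # When finished, return the trained weights and the recorded losses.
    return (w, b), valid_losses


if __name__ == "__main__":
    train = [(float(x), 2.0 * x + 1.0) for x in range(5)]
    valid = [(1.5, 4.0), (2.5, 6.0)]
    (w, b), losses = train_toy_model(train, valid)
    print(f"w={w:.3f} b={b:.3f} final validation loss={losses[-1]:.6f}")
```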

Parameters

train_input¶ (TrainInput) – Contains various parameters needed to determine how to train and what data to train on.
train_dir_name¶ (Optional[str]) – Optional name to store training input and output under. The directory is always created under the model’s data_dir. If a name is not given, a unique name is generated via a UUID. If a name is given, that directory must not already exist.

Return type

TrainOutput

Returns

Output of training.

train_dir()¶

The directory to be used for data related to training (data files, etc).

Return type: Path
Returns: Path to the training data directory.

property weights_dir¶

Return type: Path
Returns: The directory containing pretrained weights for this instance.

class gobbli.model.fasttext.model.FastTextCheckpoint(path)[source]¶

Bases: object

property model¶

From a checkpoint, return the path to the binary model.

Return type: Path

property vectors¶

From a checkpoint, return the path to the text vectors.

Return type: Path
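A small sketch of what a checkpoint wrapper like this might look like. It assumes the checkpoint path is a file prefix and that the binary model and text vectors sit next to it with .bin and .vec extensions, which matches how the fastText command-line tool names its outputs; whether gobbli's class works exactly this way is not confirmed here. CheckpointSketch is a hypothetical name.

```python
from pathlib import Path


class CheckpointSketch:
    """Illustrative stand-in for a fastText checkpoint (not gobbli's class)."""

    def __init__(self, path: Path):
        # Assumed to be a prefix such as /some/dir/model, without extension.
        self.path = Path(path)

    @property
    def model(self) -> Path:
        """Path to the binary model, assumed to be <prefix>.bin."""
        return self.path.with_name(self.path.name + ".bin")

    @property
    def vectors(self) -> Path:
        """Path to the text vectors, assumed to be <prefix>.vec."""
        return self.path.with_name(self.path.name + ".vec")


if __name__ == "__main__":
    checkpoint = CheckpointSketch(Path("/tmp/run/model"))
    print(checkpoint.model, checkpoint.vectors)
```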
