gobbli.augment package

Module contents

class gobbli.augment.BERTMaskedLM(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)

Bases: gobbli.model.base.BaseModel, gobbli.augment.base.BaseAugment

BERT-based data augmenter. Applies masked language modeling to generate predictions for missing tokens using a trained BERT model.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False and data_dir already exists, it must be empty.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. This causes an error if the host machine doesn’t have an NVIDIA GPU or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.
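The data_dir and load_existing rules above can be sketched as a small validation step. This is an illustrative helper only, not gobbli's implementation: the name check_data_dir and the use of ValueError are assumptions made for the example.

```python
from pathlib import Path


def check_data_dir(data_dir: Path, load_existing: bool) -> None:
    """Hypothetical helper mirroring the documented data_dir rules."""
    if load_existing:
        # An existing model can only be loaded from a directory that exists.
        if not data_dir.is_dir():
            raise ValueError(f"{data_dir} does not exist; nothing to load")
        return
    # For a new model, a directory that already exists must be empty.
    if data_dir.exists() and any(data_dir.iterdir()):
        raise ValueError(f"{data_dir} exists and is not empty")
```

Under these assumptions, a fresh or empty directory passes with load_existing=False, while a directory that already holds files is rejected.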

augment(X, times=5, p=0.1)

Return additional texts for each text in the passed array.

Parameters
  • X (List[str]) – Input texts.

  • times (int) – How many texts to generate per text in the input.

  • p (float) – Probability of considering each token in the input for replacement. Note that some tokens aren’t able to be replaced by a given augmentation method and will be ignored, so the actual proportion of replaced tokens in your input may be much lower than this number.

Return type

List[str]

Returns

Generated texts (length = times * len(X)).
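How times and p interact can be illustrated with a toy augmenter. This is a minimal sketch, not gobbli's algorithm: toy_augment, its whitespace tokenization, and the default uppercase "replacement" are all hypothetical stand-ins for a real model.

```python
import random
from typing import Callable, List, Optional


def toy_augment(
    X: List[str],
    times: int = 5,
    p: float = 0.1,
    replace: Optional[Callable[[str], Optional[str]]] = None,
    seed: int = 0,
) -> List[str]:
    """Toy augmenter: each token is considered for replacement with probability p."""
    rng = random.Random(seed)
    # Hypothetical replacement function; returning None means "cannot replace".
    replace = replace or (lambda tok: tok.upper() if tok.isalpha() else None)
    out = []
    for text in X:
        for _ in range(times):
            tokens = []
            for tok in text.split():
                # A token is only a candidate with probability p ...
                new_tok = replace(tok) if rng.random() < p else None
                # ... and stays unchanged if it cannot be replaced.
                tokens.append(tok if new_tok is None else new_tok)
            out.append(" ".join(tokens))
    return out
```

The result always has times * len(X) entries. Because some candidate tokens cannot be replaced (here, anything non-alphabetic), the fraction of tokens actually changed can fall below p.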

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.
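The download-then-clean-up behavior described for this directory might look like the following. It is a sketch under stated assumptions: ensure_weights and the download callable are hypothetical names, and an existing directory is assumed to hold a complete download.

```python
import shutil
from pathlib import Path
from typing import Callable


def ensure_weights(weights_dir: Path, download: Callable[[Path], None]) -> Path:
    """Download weights into weights_dir, removing it if the download fails."""
    if weights_dir.exists():
        # Assumption: an existing directory holds a complete download.
        return weights_dir
    weights_dir.mkdir(parents=True)
    try:
        download(weights_dir)
    except BaseException:
        # Remove the partial download so a later call retries from scratch.
        shutil.rmtree(weights_dir, ignore_errors=True)
        raise
    return weights_dir
```

Removing the directory on failure is what makes the "directory exists" check a usable signal that the weights are complete.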

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property host_cache_dir

Directory to be used for downloaded transformers files. Should be the same across all instances of the class, since these are generally static model weights/config files that can be reused.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the BERT container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)

See gobbli.model.base.BaseModel.init().

BERTMaskedLM parameters:

  • bert_model (str): Name of a pretrained BERT model to use. See the transformers docs for supported values.

  • diversity: 0 < diversity <= 1; determines the likelihood of selecting replacement words based on their predicted probability. At 1, the most probable words are most likely to be selected as replacements. As diversity decreases, likelihood of selection becomes less dependent on predicted probability.

  • n_probable: The number of probable tokens to consider for replacement.

  • batch_size: Number of documents to run through the BERT model at once.
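One way to obtain the behavior described for diversity and n_probable is to raise each predicted probability to the power diversity before sampling. The formula and the function names below are assumptions chosen to match the description, not necessarily what gobbli computes.

```python
import random
from typing import List, Sequence


def diversity_weights(probs: Sequence[float], diversity: float) -> List[float]:
    """Turn positive predicted probabilities into normalized sampling weights."""
    if not 0 < diversity <= 1:
        raise ValueError("diversity must satisfy 0 < diversity <= 1")
    # Exponent 1 keeps the probabilities as-is; exponents near 0 flatten them.
    raw = [p ** diversity for p in probs]
    total = sum(raw)
    return [w / total for w in raw]


def pick_replacement(
    candidates: Sequence[str],
    probs: Sequence[float],
    diversity: float,
    n_probable: int,
    rng: random.Random,
) -> str:
    """Sample one replacement from the n_probable most probable candidates."""
    ranked = sorted(zip(candidates, probs), key=lambda cp: cp[1], reverse=True)
    top = ranked[:n_probable]
    weights = diversity_weights([p for _, p in top], diversity)
    return rng.choices([c for c, _ in top], weights=weights, k=1)[0]
```

With diversity = 1 the weights equal the normalized probabilities, so the most probable token is favored. As diversity approaches 0 the weights approach uniform, so selection depends less on predicted probability.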

property logger
Return type

Logger

Returns

A logger for derived models to use.

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.

class gobbli.augment.Word2Vec(model, tokenizer=<TokenizeMethod.SPLIT: 'split'>, n_similar=10, diversity=0.8)

Bases: gobbli.augment.base.BaseAugment

Data augmentation method based on word2vec. Replaces words with similar words according to vector similarity.

Parameters
  • model (Any) – Pretrained word2vec model to use. If a string, it should correspond to one of the keys in WORD2VEC_MODELS. The corresponding pretrained vectors will be downloaded and used. If a Path, it’s assumed to be a file containing pretrained vectors, which will be loaded into a gensim word2vec model. If a gensim Word2Vec model, it will be used directly.

  • tokenizer (Union[str, TokenizeMethod, Callable[[List[str]], List[List[str]]]]) – Function used for tokenizing texts to do word replacement. If an instance of gobbli.util.TokenizeMethod, the corresponding preset tokenization method will be used. If a callable, it should accept a list of strings and return a list of tokenized strings.

  • n_similar (int) – Number of similar words to be considered for replacement.

  • diversity (float) – 0 < diversity <= 1; determines the likelihood of selecting replacement words based on their similarity to the original word. At 1, the most similar words are most likely to be selected as replacements. As diversity decreases, likelihood of selection becomes less dependent on similarity.
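A custom tokenizer callable must accept a list of strings and return one token list per input string. The function below is a minimal sketch of that shape; regex_tokenizer and its lowercasing rule are illustrative choices, not part of gobbli.

```python
import re
from typing import List


def regex_tokenizer(texts: List[str]) -> List[List[str]]:
    """Tokenize each text into lowercase word tokens, dropping punctuation."""
    return [re.findall(r"[a-z0-9']+", text.lower()) for text in texts]
```

A function with this signature could be passed as the tokenizer argument in place of a TokenizeMethod value.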

augment(X, times=5, p=0.1)

Return additional texts for each text in the passed array.

Parameters
  • X (List[str]) – Input texts.

  • times (int) – How many texts to generate per text in the input.

  • p (float) – Probability of considering each token in the input for replacement. Note that some tokens aren’t able to be replaced by a given augmentation method and will be ignored, so the actual proportion of replaced tokens in your input may be much lower than this number.

Return type

List[str]

Returns

Generated texts (length = times * len(X)).

classmethod data_dir()
Return type

Path

Returns

The data directory used for this class of augmentation model.

class gobbli.augment.WordNet(skip_download_check=False, spacy_model='en_core_web_sm')

Bases: gobbli.augment.base.BaseAugment

Data augmentation method based on WordNet. Replaces words with similar words according to the WordNet ontology. Texts are part-of-speech tagged using spaCy to help ensure that only sensible replacements (i.e., within the same part of speech) are considered.

Parameters
  • skip_download_check (bool) – If True, don’t try to download the WordNet corpus; assume it’s already been downloaded.

  • spacy_model (str) – The spaCy language model to be used for part-of-speech tagging. The model must already be installed.
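Why the part of speech matters can be seen with a toy lookup. This sketch uses a hand-written synonym table instead of WordNet and spaCy; SYNONYMS, candidates_for, and the tag names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table keyed by (word, part of speech).
SYNONYMS: Dict[Tuple[str, str], List[str]] = {
    ("run", "VERB"): ["sprint", "jog"],
    ("run", "NOUN"): ["streak", "sequence"],
}


def candidates_for(token: str, pos: str) -> List[str]:
    """Return only replacements that share the token's part of speech."""
    return SYNONYMS.get((token.lower(), pos), [])
```

Keying on the tag keeps "run" as a verb from being replaced by a noun sense such as "streak". A token with no entry for its tag is left alone.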

augment(X, times=5, p=0.1)

Return additional texts for each text in the passed array.

Parameters
  • X (List[str]) – Input texts.

  • times (int) – How many texts to generate per text in the input.

  • p (float) – Probability of considering each token in the input for replacement. Note that some tokens aren’t able to be replaced by a given augmentation method and will be ignored, so the actual proportion of replaced tokens in your input may be much lower than this number.

Return type

List[str]

Returns

Generated texts (length = times * len(X)).

classmethod data_dir()
Return type

Path

Returns

The data directory used for this class of augmentation model.

class gobbli.augment.MarianMT(data_dir=None, load_existing=False, use_gpu=False, nvidia_visible_devices='all', logger=None, **kwargs)

Bases: gobbli.model.base.BaseModel, gobbli.augment.base.BaseAugment

Backtranslation-based data augmenter using the Marian Neural Machine Translation model. Translates English text to one of several languages and back to obtain similar texts for training.

Create a model.

Parameters
  • data_dir (Optional[Path]) – Optional path to a directory used to store model data. If not given, a unique directory under GOBBLI_DIR will be created and used.

  • load_existing (bool) – If True, data_dir should be a directory that was previously used to create a model. Parameters will be loaded to match the original model, and user-specified model parameters will be ignored. If False and data_dir already exists, it must be empty.

  • use_gpu (bool) – If True, use the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) to expose NVIDIA GPU(s) to the container. This causes an error if the host machine doesn’t have an NVIDIA GPU or doesn’t have the nvidia-docker runtime installed.

  • nvidia_visible_devices (str) – Which GPUs to make available to the container; ignored if use_gpu is False. If not ‘all’, should be a comma-separated string, e.g. 1,2.

  • logger (Optional[Logger]) – If passed, use this logger for logging instead of the default module-level logger.

  • **kwargs – Additional model-specific parameters to be passed to the model’s init() method.

LANGUAGE_CODE_MAPPING = {'afrikaans': 'af', 'albanian': 'sq', 'basque': 'eu', 'bemba': 'bem', 'berber': 'ber', 'bislama': 'bi', 'bulgarian': 'bg', 'catalan': 'ca', 'cebuano': 'ceb', 'central-bikol': 'bcl', 'chichewa': 'ny', 'chitonga': 'toi', 'chuukese': 'chk', 'czech': 'cs', 'danish': 'da', 'dutch': 'nl', 'efik': 'efi', 'esperanto': 'eo', 'estonian': 'et', 'ewe': 'ee', 'fijian': 'fj', 'finnish': 'fi', 'french': 'fr', 'ga': 'gaa', 'galician': 'gl', 'german': 'de', 'gilbertese': 'gil', 'gun': 'guw', 'haitian': 'ht', 'hausa': 'ha', 'hiligaynon': 'hil', 'hiri-motu': 'ho', 'hungarian': 'hu', 'icelandic': 'is', 'igbo': 'ig', 'iloko': 'ilo', 'indonesian': 'id', 'irish': 'ga', 'isoko': 'iso', 'italian': 'it', 'japanese': 'jap', 'kikaonde': 'kqn', 'kikongo': 'kwy', 'kiluba': 'lu', 'kinyarwanda': 'rw', 'kirundi': 'run', 'kongo': 'kg', 'kuanyama': 'kj', 'kwangali': 'kwn', 'lingala': 'ln', 'luganda': 'lg', 'lunda': 'lun', 'luo': 'luo', 'luvale': 'lue', 'macedonian': 'mk', 'malagasy': 'mg', 'malayalam': 'ml', 'maltese': 'mt', 'manx': 'gv', 'marathi': 'mr', 'marshallese': 'mh', 'mauritian-creole': 'mfe', 'mizo': 'lus', 'moore': 'mos', 'ndonga': 'ng', 'niuean': 'niu', 'nyaneka': 'nyk', 'oromo': 'om', 'otetela': 'tll', 'pangasinan': 'pag', 'papiamento': 'pap', 'ponapean': 'pon', 'portugese': 'bzs', 'russian': 'ru', 'samoan': 'sm', 'sango': 'sg', 'sepedi': 'nso', 'sesotho-lesotho': 'st', 'setswana': 'tn', 'seychelles-creole': 'crs', 'shona': 'sn', 'silozi': 'loz', 'slovak': 'sk', 'solomon-islands-pidgin': 'pis', 'swahili-congo': 'swc', 'swati': 'ss', 'swedish': 'sv', 'tagalog': 'tl', 'tigrinya': 'ti', 'tiv': 'tiv', 'tok-pisin': 'tpi', 'tongan': 'to', 'tshiluba': 'lua', 'tsonga': 'ts', 'tuvaluan': 'tvl', 'ukrainian': 'uk', 'umbundu': 'umb', 'uruund': 'rnd', 'welsh': 'cy', 'xhosa': 'xh'}

augment(X, times=None, p=None)

Return additional texts for each text in the passed array.

Parameters
  • X (List[str]) – Input texts.

  • times (Optional[int]) – How many texts to generate per text in the input.

  • p (Optional[float]) – Probability of considering each token in the input for replacement. Note that some tokens aren’t able to be replaced by a given augmentation method and will be ignored, so the actual proportion of replaced tokens in your input may be much lower than this number.

Return type

List[str]

Returns

Generated texts (length = times * len(X)).

build()

Perform any pre-setup that needs to be done before running the model (building Docker images, etc).

property class_weights_dir

The root directory used to store initial model weights (before fine-tuning). These should generally be some pretrained weights made available by model developers. This directory will NOT be created by default; models should download their weights and remove the weights directory if the download doesn’t finish properly.

Most models making use of this directory will have multiple sets of weights and will need to store those in subdirectories under this directory.

Return type

Path

Returns

The path to the class-wide weights directory.

data_dir()
Return type

Path

Returns

The main data directory unique to this instance of the model.

property host_cache_dir

Directory to be used for downloaded transformers files. Should be the same across all instances of the class, since these are generally static model weights/config files that can be reused.

property image_tag
Return type

str

Returns

The Docker image tag to be used for the Marian container.

property info_path
Return type

Path

Returns

The path to the model’s info file, containing information about the model including the type of model, gobbli version it was trained using, etc.

init(params)

See gobbli.model.base.BaseModel.init().

MarianMT parameters:

  • batch_size: Number of documents to run through the Marian model at once.

  • target_languages: List of target languages to translate texts to and back. See MarianMT.ALL_TARGET_LANGUAGES for a full list of possible values. Each language is used at most once, so the number of augmentations per text cannot exceed the number of languages specified. For example, to augment 5 times, specify at least 5 languages when initializing the model.
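The constraint between the number of augmentations and target_languages can be sketched with stand-in translators. Everything here is hypothetical: backtranslate, the Translator type, and the forward/backward functions stand in for real Marian models, and raising ValueError is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# (English -> language, language -> English) functions; hypothetical stand-ins.
Translator = Tuple[Callable[[str], str], Callable[[str], str]]


def backtranslate(
    X: List[str], times: int, translators: Dict[str, Translator]
) -> List[str]:
    """Produce `times` backtranslations per text, using each language at most once."""
    if times > len(translators):
        raise ValueError(
            f"Cannot augment {times} times with only {len(translators)} languages"
        )
    languages = list(translators)[:times]
    out = []
    for text in X:
        for lang in languages:
            forward, backward = translators[lang]
            # Translate away from English, then back again.
            out.append(backward(forward(text)))
    return out
```

Since each language contributes one round trip per text, asking for more augmentations than configured languages cannot be satisfied.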

property logger
Return type

Logger

Returns

A logger for derived models to use.

classmethod marian_inverse_model(language)
Return type

str

Returns

Name of the Marian MT model to use to translate the passed language back to English.

classmethod marian_model(language)
Return type

str

Returns

Name of the Marian MT model to use to translate English to the passed language.
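As a rough illustration of how a language name might map to a pair of model names, the sketch below combines an excerpt of LANGUAGE_CODE_MAPPING with the Helsinki-NLP/opus-mt-{source}-{target} naming used for Marian models on the Hugging Face hub. The two functions are assumptions; the actual classmethods may build names differently.

```python
# Small excerpt of MarianMT.LANGUAGE_CODE_MAPPING as listed on this page.
LANGUAGE_CODE_MAPPING = {"french": "fr", "german": "de", "russian": "ru"}


def marian_model_name(language: str) -> str:
    """Assumed name of the English -> language model."""
    return f"Helsinki-NLP/opus-mt-en-{LANGUAGE_CODE_MAPPING[language]}"


def marian_inverse_model_name(language: str) -> str:
    """Assumed name of the language -> English model."""
    return f"Helsinki-NLP/opus-mt-{LANGUAGE_CODE_MAPPING[language]}-en"
```

A language missing from the mapping raises KeyError in this sketch.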

property metadata_path
Return type

Path

Returns

The path to the model’s metadata file containing model-specific parameters.

classmethod model_class_dir()
Return type

Path

Returns

A directory shared among all instances of the model class.

property weights_dir

The directory containing weights for a specific instance of the model. This is the class weights directory by default, but subclasses might define this property to return a subdirectory based on a set of pretrained model weights.

Return type

Path

Returns

The instance-specific weights directory.