gobbli.augment.word2vec module

gobbli.augment.word2vec.LOGGER = <Logger gobbli.augment.word2vec (WARNING)>
gobbli.augment.word2vec.WORD2VEC_MODELS = {
    'charngram': ('charngram.100d', 'charNgram.txt'),
    'fasttext-en': ('fasttext.en.300d', 'wiki.en.vec'),
    'fasttext-simple': ('fasttext.simple.300d', 'wiki.simple.vec'),
    'glove.42B.300d': ('glove.42B.300d', 'glove.42B.300d.txt'),
    'glove.6B.100d': ('glove.6B', 'glove.6B.100d.txt'),
    'glove.6B.200d': ('glove.6B', 'glove.6B.200d.txt'),
    'glove.6B.300d': ('glove.6B', 'glove.6B.300d.txt'),
    'glove.6B.50d': ('glove.6B', 'glove.6B.50d.txt'),
    'glove.840B.300d': ('glove.840B.300d', 'glove.840B.300d.txt'),
    'glove.twitter.27B.100d': ('glove.twitter.27B', 'glove.twitter.27B.100d'),
    'glove.twitter.27B.200d': ('glove.twitter.27B', 'glove.twitter.27B.200d'),
    'glove.twitter.27B.25d': ('glove.twitter.27B', 'glove.twitter.27B.25d'),
    'glove.twitter.27B.50d': ('glove.twitter.27B', 'glove.twitter.27B.50d'),
}

A mapping from word2vec model names to (archive name, filename) pairs. The filename is needed because some archives contain multiple files. Pass one of these keys to Word2Vec to use the corresponding pretrained vectors.
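The pair layout is easiest to see by grouping filenames by archive. The sketch below copies a few entries from the mapping into a local dict so that it runs without gobbli installed; the names MODELS_SUBSET and files_by_archive are hypothetical and not part of the module.

```python
from collections import defaultdict

# A few entries copied from WORD2VEC_MODELS: key -> (archive name, filename).
MODELS_SUBSET = {
    "fasttext-simple": ("fasttext.simple.300d", "wiki.simple.vec"),
    "glove.6B.50d": ("glove.6B", "glove.6B.50d.txt"),
    "glove.6B.100d": ("glove.6B", "glove.6B.100d.txt"),
    "glove.6B.300d": ("glove.6B", "glove.6B.300d.txt"),
}


def files_by_archive(models):
    """Group vector filenames by the archive that holds them (hypothetical helper)."""
    grouped = defaultdict(list)
    for archive, filename in models.values():
        grouped[archive].append(filename)
    return dict(grouped)


# The single glove.6B archive backs several model keys, one file per dimension.
archives = files_by_archive(MODELS_SUBSET)
```

Several keys resolve to the same archive, which is why each value carries a filename in addition to the archive name.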

class gobbli.augment.word2vec.Word2Vec(model, tokenizer=<TokenizeMethod.SPLIT: 'split'>, n_similar=10, diversity=0.8)

Bases: gobbli.augment.base.BaseAugment

Data augmentation method based on word2vec. Replaces words with similar words according to vector similarity.

Parameters

  • model (Any) – Pretrained word2vec model to use. If a string, it should correspond to one of the keys in WORD2VEC_MODELS. The corresponding pretrained vectors will be downloaded and used. If a Path, it’s assumed to be a file containing pretrained vectors, which will be loaded into a gensim word2vec model. If a gensim Word2Vec model, it will be used directly.

  • tokenizer (Union[str, TokenizeMethod, Callable[[List[str]], List[List[str]]]]) – Function used to tokenize texts for word replacement. If an instance of gobbli.util.TokenizeMethod, the corresponding preset tokenization method will be used. If a callable, it should accept a list of strings and return a list of token lists, one per input string.

  • n_similar (int) – Number of similar words to be considered for replacement.

  • diversity (float) – 0 < diversity <= 1; determines the likelihood of selecting replacement words based on their similarity to the original word. At 1, the most similar words are most likely to be selected as replacements. As diversity decreases, likelihood of selection becomes less dependent on similarity.
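To illustrate how n_similar and diversity could interact, here is a minimal sketch. It assumes a weighting in which each similarity score is raised to the power diversity; this formula is an assumption chosen to match the behaviour described above and is not taken from gobbli's source. The function names and the toy candidates are hypothetical.

```python
import random


def replacement_weights(similarities, diversity):
    """Turn similarity scores into selection probabilities.

    ASSUMPTION: weight = similarity ** diversity. At diversity=1 the weights
    are proportional to similarity; as diversity approaches 0 they flatten
    toward uniform. gobbli's real weighting may differ.
    """
    if not 0 < diversity <= 1:
        raise ValueError("diversity must satisfy 0 < diversity <= 1")
    # Clamp negative similarities to 0 so the power is always defined.
    raw = [max(s, 0.0) ** diversity for s in similarities]
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one similarity must be positive")
    return [w / total for w in raw]


def pick_replacement(candidates, similarities, diversity, rng):
    """Sample one replacement word according to the weights above."""
    weights = replacement_weights(similarities, diversity)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy stand-in for the n_similar nearest neighbours of one token (n_similar = 3).
candidates = ["movie", "cinema", "picture"]
similarities = [0.9, 0.6, 0.3]

sharp = replacement_weights(similarities, diversity=1.0)  # follows similarity
flat = replacement_weights(similarities, diversity=0.1)   # close to uniform
choice = pick_replacement(candidates, similarities, 0.8, random.Random(0))
```

Under this assumed weighting, the gap between the most and least likely candidate shrinks as diversity decreases, which is the qualitative behaviour the parameter description gives.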

augment(X, times=5, p=0.1)

Return additional texts for each text in the passed array.

Parameters

  • X (List[str]) – Input texts.

  • times (int) – How many texts to generate per text in the input.

  • p (float) – Probability of considering each token in the input for replacement. Note that a given augmentation method cannot replace some tokens and ignores them, so the actual proportion of replaced tokens in your input may be much lower than p.

Return type

List[str]

Returns

Generated texts (length = times * len(X)).
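The roles of times and p can be sketched with a toy stand-in that swaps the word2vec lookup for a plain dict of candidates. The function toy_augment, the similar table, and the seed argument are hypothetical; the ordering of the output (all variants of the first text, then the second, and so on) is also an assumption, since only the total length is documented.

```python
import random


def toy_augment(X, similar, times=5, p=0.1, tokenizer=None, seed=None):
    """Illustrative stand-in for augment(); not gobbli's implementation."""
    rng = random.Random(seed)
    # Tokenizer contract from the class parameters: a list of strings in,
    # a list of token lists out. The default mimics whitespace splitting.
    tokenize = tokenizer or (lambda texts: [t.split() for t in texts])
    generated = []
    for tokens in tokenize(X):
        for _ in range(times):
            new_tokens = []
            for tok in tokens:
                # Each token is considered with probability p, but it is only
                # replaced when the lookup table has candidates for it.
                if rng.random() < p and tok in similar:
                    new_tokens.append(rng.choice(similar[tok]))
                else:
                    new_tokens.append(tok)
            generated.append(" ".join(new_tokens))
    return generated


similar = {"good": ["great", "fine"], "film": ["movie"]}
X = ["a good film", "the plot was thin"]
out = toy_augment(X, similar, times=3, p=0.5, seed=0)
```

The second input has no tokens in the lookup table, so its variants come back unchanged, which shows why the realised replacement rate can fall well below p. With gobbli installed, the equivalent call according to the signatures above would be Word2Vec('glove.6B.50d').augment(X, times=3, p=0.5), which downloads the pretrained vectors on first use.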

classmethod data_dir()

Return type

Path

Returns

The data directory used for this class of augmentation model.