gobbli.augment.word2vec module
- gobbli.augment.word2vec.LOGGER = <Logger gobbli.augment.word2vec (WARNING)>
- gobbli.augment.word2vec.WORD2VEC_MODELS = {'charngram': ('charngram.100d', 'charNgram.txt'), 'fasttext-en': ('fasttext.en.300d', 'wiki.en.vec'), 'fasttext-simple': ('fasttext.simple.300d', 'wiki.simple.vec'), 'glove.42B.300d': ('glove.42B.300d', 'glove.42B.300d.txt'), 'glove.6B.100d': ('glove.6B', 'glove.6B.100d.txt'), 'glove.6B.200d': ('glove.6B', 'glove.6B.200d.txt'), 'glove.6B.300d': ('glove.6B', 'glove.6B.300d.txt'), 'glove.6B.50d': ('glove.6B', 'glove.6B.50d.txt'), 'glove.840B.300d': ('glove.840B.300d', 'glove.840B.300d.txt'), 'glove.twitter.27B.100d': ('glove.twitter.27B', 'glove.twitter.27B.100d'), 'glove.twitter.27B.200d': ('glove.twitter.27B', 'glove.twitter.27B.200d'), 'glove.twitter.27B.25d': ('glove.twitter.27B', 'glove.twitter.27B.25d'), 'glove.twitter.27B.50d': ('glove.twitter.27B', 'glove.twitter.27B.50d')}
A mapping from word2vec model names to (archive name, filename) pairs. The filename is needed because some archives contain multiple files. Pass one of these key names to Word2Vec to use pretrained model weights.
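Because several keys point into the same archive, it can help to group the keys by archive before picking one. The sketch below copies a few entries from the mapping into a local dict so that it runs without gobbli installed; `MODELS_SUBSET` and `group_by_archive` are hypothetical names used only for illustration and are not part of gobbli.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A few entries copied from WORD2VEC_MODELS (local stand-in, not an import).
MODELS_SUBSET: Dict[str, Tuple[str, str]] = {
    "glove.6B.50d": ("glove.6B", "glove.6B.50d.txt"),
    "glove.6B.100d": ("glove.6B", "glove.6B.100d.txt"),
    "fasttext-simple": ("fasttext.simple.300d", "wiki.simple.vec"),
}


def group_by_archive(models: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each archive to the model keys it provides."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, (archive, _filename) in models.items():
        grouped[archive].append(key)
    # Sort the keys so the result does not depend on dict insertion order.
    return {archive: sorted(keys) for archive, keys in grouped.items()}


archives = group_by_archive(MODELS_SUBSET)
```

Any of these keys is a valid `model` argument, for example `Word2Vec("glove.6B.50d")`.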
- class gobbli.augment.word2vec.Word2Vec(model, tokenizer=<TokenizeMethod.SPLIT: 'split'>, n_similar=10, diversity=0.8)
Bases: gobbli.augment.base.BaseAugment
Data augmentation method based on word2vec. Replaces words with similar words according to vector similarity.
- Parameters
  - model (Any) – Pretrained word2vec model to use. If a string, it should correspond to one of the keys in WORD2VEC_MODELS; the corresponding pretrained vectors will be downloaded and used. If a Path, it’s assumed to be a file containing pretrained vectors, which will be loaded into a gensim word2vec model. If a gensim Word2Vec model, it will be used directly.
  - tokenizer (Union[str, TokenizeMethod, Callable[[List[str]], List[List[str]]]]) – Function used for tokenizing texts to do word replacement. If an instance of gobbli.util.TokenizeMethod, the corresponding preset tokenization method will be used. If a callable, it should accept a list of strings and return a list of tokenized strings (one list of tokens per input string).
  - n_similar (int) – Number of similar words to be considered for replacement.
  - diversity (float) – 0 < diversity <= 1; determines the likelihood of selecting replacement words based on their similarity to the original word. At 1, the most similar words are most likely to be selected as replacements. As diversity decreases, likelihood of selection becomes less dependent on similarity.
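To make the effect of `diversity` concrete, here is a minimal sketch of one weighting scheme that behaves as described: at 1 the selection probability is proportional to similarity, and as the value approaches 0 the candidates become nearly equally likely. The formula (similarity raised to the power `diversity`) is an assumption chosen for illustration and may not match what gobbli does internally; `selection_probabilities` is a hypothetical helper.

```python
from typing import List, Sequence


def selection_probabilities(similarities: Sequence[float], diversity: float) -> List[float]:
    """Illustrative weighting only; gobbli's actual formula may differ.

    Assumes every similarity is strictly positive.
    """
    if not 0 < diversity <= 1:
        raise ValueError("diversity must satisfy 0 < diversity <= 1")
    # Raising to a power below 1 flattens the weights toward uniform.
    weights = [sim ** diversity for sim in similarities]
    total = sum(weights)
    return [w / total for w in weights]


sims = [0.9, 0.6, 0.3]  # made-up similarities for three candidate words
probs_high = selection_probabilities(sims, diversity=1.0)
probs_low = selection_probabilities(sims, diversity=0.01)
```

With `diversity=1.0` the first candidate is three times as likely as the last; with `diversity=0.01` all three sit close to one third, though the most similar word stays slightly ahead.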
- augment(X, times=5, p=0.1)
Return additional texts for each text in the passed array.
- Parameters
  - X (List[str]) – Input texts.
  - times (int) – How many texts to generate per text in the input.
  - p (float) – Probability of considering each token in the input for replacement. Note that some tokens can’t be replaced by a given augmentation method and will be ignored, so the actual proportion of replaced tokens in your input may be much lower than this number.
- Return type
  List[str]
- Returns
  Generated texts (length = times * len(X)).
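The contract of `augment` (one call returns `times` generated texts per input, with each token considered for replacement with probability `p`) can be illustrated with a toy stand-in that needs no word vectors. Everything here is hypothetical: `toy_augment`, the `similar` lookup dict that replaces the vector-similarity search, the `seed` argument, and the output ordering (grouped by input text) are assumptions for the sketch, not gobbli's implementation. The `whitespace_tokenizer` function shows the shape a callable tokenizer is documented to have.

```python
import random
from typing import Callable, Dict, List


def whitespace_tokenizer(texts: List[str]) -> List[List[str]]:
    """Callable tokenizer: list of strings in, list of token lists out."""
    return [text.split() for text in texts]


def toy_augment(
    X: List[str],
    similar: Dict[str, List[str]],
    tokenizer: Callable[[List[str]], List[List[str]]] = whitespace_tokenizer,
    times: int = 5,
    p: float = 0.1,
    seed: int = 0,
) -> List[str]:
    """Hypothetical stand-in for augment(); not gobbli's implementation."""
    rng = random.Random(seed)
    generated: List[str] = []
    for tokens in tokenizer(X):
        for _ in range(times):
            new_tokens = []
            for token in tokens:
                # Tokens without a known replacement are left alone;
                # the rest are replaced with probability p.
                if token in similar and rng.random() < p:
                    new_tokens.append(rng.choice(similar[token]))
                else:
                    new_tokens.append(token)
            generated.append(" ".join(new_tokens))
    return generated


texts = ["the quick fox", "a lazy dog"]
lookup = {"quick": ["fast", "speedy"], "lazy": ["idle"]}  # made-up neighbours
out = toy_augment(texts, lookup, times=3, p=0.5)
unchanged = toy_augment(texts, lookup, times=2, p=0.0)
```

Because only tokens present in the lookup can change, the share of replaced tokens is lower than `p`, which mirrors the caveat in the description of `p`.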
- classmethod data_dir()
- Return type
  Path
- Returns
  The data directory used for this class of augmentation model.