gobbli package

Module contents

class gobbli.TokenizeMethod

Bases: enum.Enum

Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.
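As a minimal sketch of that recommended pattern, the hypothetical preprocess function below (not part of gobbli) accepts either a canned TokenizeMethod or a user-supplied tokenization callable:

from typing import Callable, List, Union

from gobbli import TokenizeMethod

def preprocess(
    texts: List[str],
    tokenize: Union[TokenizeMethod, Callable[[str], List[str]]] = TokenizeMethod.SPLIT,
) -> List[List[str]]:
    if isinstance(tokenize, TokenizeMethod):
        if tokenize == TokenizeMethod.SPLIT:
            # Documented SPLIT behavior: lowercase, split on whitespace.
            return [t.lower().split() for t in texts]
        raise NotImplementedError(f"Method not handled in this sketch: {tokenize}")
    # Otherwise, fall back to the user's custom tokenization function.
    return [tokenize(t) for t in texts]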

SPLIT

Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.
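The documented behavior amounts to lowercasing and splitting on runs of whitespace; a plain-Python equivalent:

def split_tokenize(text: str) -> list:
    # Lowercase, then split on whitespace.
    return text.lower().split()

split_tokenize("The quick  brown Fox")
# ['the', 'quick', 'brown', 'fox']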

SPACY

Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.
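A rough standalone equivalent using the spacy library directly; the specific model name (en_core_web_sm) and disabled pipeline components are assumptions, since the docstring does not say which English model gobbli loads:

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def spacy_tokenize(text: str) -> list:
    # Keep only alphabetic tokens, lowercased.
    return [tok.lower_ for tok in nlp(text) if tok.is_alpha]

spacy_tokenize("The 2 dogs barked!")
# ['the', 'dogs', 'barked']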

SENTENCEPIECE

Subword tokenization using SentencePiece. Unlike the other methods, SentencePiece learns its vocabulary from data rather than relying on whitespace or language-specific rules.
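For context, a standalone sketch using the sentencepiece library directly; how gobbli trains and stores the underlying model is not specified here, so the file paths and vocab_size below are illustrative assumptions:

import sentencepiece as spm

# Train a small subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",   # assumed path to training text
    model_prefix="sp",    # writes sp.model and sp.vocab
    vocab_size=1000,
)

sp = spm.SentencePieceProcessor(model_file="sp.model")
sp.encode("an example sentence", out_type=str)
# e.g. ['▁an', '▁example', '▁sent', 'ence']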

SENTENCEPIECE = 'sentencepiece'
SPACY = 'spacy'
SPLIT = 'split'
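Since TokenizeMethod is a standard enum.Enum, members can also be looked up by their string value, which is convenient when the method name comes from a config file:

from gobbli import TokenizeMethod

method = TokenizeMethod("spacy")
assert method is TokenizeMethod.SPACY
assert method.value == "spacy"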