gobbli package


Module contents

class gobbli.TokenizeMethod

Bases: enum.Enum

Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.


SPLIT = 'split'

    Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.

SPACY = 'spacy'

    Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.

SENTENCEPIECE = 'sentencepiece'

    SentencePiece-based tokenization.
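
The enum members above can be selected and dispatched on like any standard-library enum. The following is a minimal sketch, not gobbli's implementation: it mirrors the `TokenizeMethod` enum with a local copy (so the snippet is self-contained without gobbli installed) and implements only the documented behavior of `SPLIT` (whitespace splitting plus lowercasing) in a hypothetical helper, `split_tokenize`.

```python
from enum import Enum
from typing import List


class TokenizeMethod(Enum):
    # Local mirror of gobbli.TokenizeMethod, for illustration only.
    SPLIT = "split"
    SPACY = "spacy"
    SENTENCEPIECE = "sentencepiece"


def split_tokenize(text: str) -> List[str]:
    # Naive whitespace tokenization with lowercasing, matching the
    # documented behavior of TokenizeMethod.SPLIT.
    return text.lower().split()


method = TokenizeMethod.SPLIT
if method is TokenizeMethod.SPLIT:
    tokens = split_tokenize("The Quick Brown Fox")
    print(tokens)  # ['the', 'quick', 'brown', 'fox']
```

Because the member values are plain strings, a user-facing API can also accept `"split"` and recover the member via `TokenizeMethod("split")`, which is a common pattern for string-valued enums.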